Using memory-mapped files can get you tremendous performance gains, but memory-mapped files are very delicate: any change is immediately reflected in the filesystem. This article describes a checkpointing system that lets you make atomic changes to a memory-mapped file, using techniques borrowed from the world’s commercial relational databases.
This allows you to benefit from the speed and flexibility of memory-mapped files and at the same time feel confident that your application is robust, even in the event of a catastrophic failure.
Data is very delicate. We rely on sophisticated techniques to keep it safely on our disk drives, and most of the time we don’t have to think about it. But when you’re the one writing the software that manipulates the data, then you do have to think about it.
Let’s consider a scenario using a common business application: a word processor. When a user opens a document and makes some changes to it, there are, at that moment, two different versions of the document: one in memory and one on disk (see Figure 1).
Let’s imagine that our word processor, and the underlying operating system, are written naively. You might imagine that when the user hits “Save,” such an application would immediately begin to write the new file over the old one (see Figure 2).
If the save process completes, this works fine. However, what if power fails in the middle of the process? The user winds up with a file that comprises half of the original version and half of the new version. Not only does he lack a single complete version, but the file itself might be too corrupt for the word processor to read, in which case, the user will have nothing at all.
The goal is for the process of saving a file to be atomic. An atomic process is one that is either completely carried out or isn’t carried out at all. In other words, partially finished jobs are not tolerated.
Furthermore, when a programmer defines an atomic process, it’s imperative to include not only the fate of the new file, but the fate of the original file as well. The entire process needs to be atomic: either the entire new file is written to disk or the entire old file is left intact.
This is something that database engines do all the time. On a number of different levels, relational databases allow for certain activities to be done atomically. In particular, they allow transactions, which are clusters of data updates that are performed atomically. A transaction can be committed, which means that its updates are permanently stored; or a transaction can be rolled back, which means that none of its updates are stored.
Filesystems, on the other hand, don’t do this as a matter of course. Files are written to the filesystem using commands such as open(), write(), and close() (or something similar). There’s no command to roll back the changes made since the file was opened. With a few exceptions, filesystems aren’t databases.
In practice, however, this isn’t really a problem because of the way that programmers generally use the filesystem. For example, a program might write each new chunk of data to a different file. Or, you might take the trouble to write a document to a temporary file, and then replace the original file with the temporary one. Because replacing a file is generally atomic, this trick provides an easy way to get atomic writes using a filesystem and is commonly used by application developers.
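In Java, the temp-file trick looks roughly like this. This is a minimal sketch, not the article’s code; the class and method names are hypothetical, and note that File.renameTo() is atomic on most POSIX filesystems but that guarantee is platform-dependent.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class AtomicSave {
    // Write the document to a temporary file, sync it, then replace the
    // original. A crash leaves either the complete old file or the
    // complete new file on disk -- never a half-written mixture.
    public static void save(File target, byte[] contents) throws IOException {
        File tmp = new File(target.getParentFile(), target.getName() + ".tmp");
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write(contents);
            out.getFD().sync();   // force the data to disk before renaming
        } finally {
            out.close();
        }
        if (!tmp.renameTo(target)) {
            // On some platforms renameTo fails if the target exists.
            target.delete();
            if (!tmp.renameTo(target)) {
                throw new IOException("could not replace " + target);
            }
        }
    }
}
```

The sync() call before the rename matters: without it, the rename could land on disk before the temporary file’s contents do.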
Writing out an entire file often lacks efficiency. For a word processor it might be fine, because the files aren’t too large and they aren’t written to disk too often. But let’s say you were implementing some kind of high-throughput store. If you’re trying to make atomic changes to a 100MB file, and you are doing it 100 times per second, then you aren’t going to want to write the entire file out each time you make an atomic change.
The most common approach in this scenario is to design the system to cache its data in memory, which is a lot faster than disk. However, data can’t remain in memory forever; you have to write it out to disk at some point, and when you do, it needs to be atomic. This is called checkpointing.
But in the case of the 100MB file, it’s simply too large to write out in its entirety. You have to make changes directly to the file. And because of that our data has become delicate again.
We can’t use any of the brute-force solutions described above because they are designed for low-throughput situations. On the other hand, we are reluctant to make changes directly to our file because we are scared of catastrophic failures.
There’s a solution to this problem, of course, and it lies in making the changes themselves atomic. There’s no reason to accept that atomic processes have to be slow: much of the world’s business runs on efficient database systems that deal safely with enormous amounts of data.
The problem is that these systems are necessarily enormous, expensive pieces of software, based on decades of experience and testing. They use special algorithms and protocols (generally based on the order in which one makes file modifications) to preserve the integrity of their data. But I don’t want to spend forty years working on my I/O routines, and neither do you; we just want some safety.
In the next section, I’ll explain a method that centers around the CKPTFile class. This method gives a pretty reasonable tradeoff: It provides very safe atomic data writing, it’s fast, and it’s not too complicated.
The CKPTFile Class
Before we can design our algorithm, we have to decide what part of the Java I/O subsystem we are going to use. Ideally, you would implement the checkpointing facilities as part of the New I/O (NIO) packages that come with JDK 1.4. You could, for example, try to create a class called CKPTMappedByteBuffer, which would be just like a MappedByteBuffer, except that it could process its changes atomically.
What You Need
- JDK 1.4 from Sun Microsystems, or a compatible Java development kit
- A Java IDE or a suitable command shell
NIO, however, doesn’t make it easy to create subclasses of its buffer classes, and the implementation of such a class would be a distraction from the topic at hand. Instead, we’ll consider the basic functionality (safe, atomic data writing) in a form that could be adapted later to be a buffer.
The main class of our implementation (see Listing 1) is called CKPTFile.
CKPTFile isn’t a buffer, but it’s similar. It allows you to read and write bytes, either one-at-a-time or in bulk. When doing bulk transfers, you can do them to or from byte arrays, or you can do them to or from actual ByteBuffers. Unlike a buffer, a CKPTFile doesn’t keep track of a read/write cursor; you have to specify where you want to read or write each time you do it.
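That surface can be summarized as an interface. The names follow the article, but the exact signatures of the real CKPTFile (see Listing 1) may differ; this is only a sketch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical summary of the CKPTFile surface: every operation takes an
// explicit position, since there is no read/write cursor.
public interface CheckpointedFile {
    byte get(int position) throws IOException;              // single-byte read
    void get(int position, byte[] dst) throws IOException;  // bulk read into an array
    void get(int position, ByteBuffer dst) throws IOException; // bulk read into a buffer
    void put(int position, byte b) throws IOException;      // single-byte write
    void put(int position, byte[] src) throws IOException;  // bulk write from an array
    void put(int position, ByteBuffer src) throws IOException; // bulk write from a buffer
    void checkpoint() throws IOException;                   // make all writes durable
}
```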
(If you want to jump ahead, take a peek at the source for Pound.java, a test program that uses CKPTFile. In its main method, you can see the CKPTFile in action.)
Memory-Mapped File I/O
Even though CKPTFile isn’t a Buffer, it uses one: a MappedByteBuffer, to be precise. To read and write the data, CKPTFile uses memory-mapped file I/O, which maps the contents of the file directly into an in-memory array (see Figure 3).
The array contains the bytes of the file: to read the file, all you have to do is read the array. This might sound like a terrible waste of RAM, but in fact this is invariably implemented by a low-level operating system service called demand paging. Demand paging allows the file to be loaded into memory only as needed. Thus, the system only loads data that you actually access.
Likewise, changes made to the array go directly into the file on disk. As with loading, the disk access doesn’t necessarily happen right away. In this case, the operating system writes the data out after a delay, according to a scheme that maximizes efficient use of the disk and minimizes delay.
Despite this operating-system trickery, the array really is the file. There is no semantic difference between reading and writing the array and reading and writing the file. And this is dangerous, because any change you make can take effect immediately, and if you leave the file in an inconsistent state (by completing only some of a larger set of changes) you can corrupt the data.
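Setting up such a mapping in Java takes only a few lines. This hypothetical helper (the file length here is arbitrary) shows a write and a read going straight through the mapped array:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapDemo {
    // Map a file into memory and access it through the mapping.
    // Pages are loaded lazily (demand paging) and written back by the OS.
    public static byte writeAndReadBack(String path) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        try {
            raf.setLength(1024);
            FileChannel channel = raf.getChannel();
            MappedByteBuffer map =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, channel.size());
            map.put(0, (byte) 42);   // this write goes straight to the file
            return map.get(0);       // and this read comes straight from it
        } finally {
            raf.close();
        }
    }
}
```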
In fact, this kind of I/O is often even more dangerous than reading and writing files in the regular way, because file I/O often involves writing a file from scratch, whereas memory-mapped I/O is usually used for modifying a file directly.
An Abstraction Layer
In order to make file access safer, the CKPTFile controls all access to the array. Instead of accessing the underlying array directly, you must go through the CKPTFile interface.
The underlying array is itself behind another abstraction barrier; in this case, it’s hidden inside a MappedByteBuffer object. The CKPTFile uses the MappedByteBuffer to read and write the file (see Figure 4).
The get() methods read from the CKPTFile and the put() methods write to it. The get() methods have a simple implementation: they just call the get() methods of the MappedByteBuffer. The put() methods are a little more complicated, and this is where we get to the heart of our implementation.
The Atomicity Architecture
The whole point of the CKPTFile class is to make it possible to checkpoint the file. This has the effect of turning the set of writes between two checkpoints into a kind of de facto transaction (see Figure 5). If the writes between two checkpoints are a transaction, then we want to carry out this transaction in its entirety or not at all. This means that in the event only some of the writes are completed, the file should be restored to its original state.
Figure 6 shows a hypothetical example in which a file is written and checkpointed. The first checkpoint occurs after a set of writes has been performed and completes successfully. The second set of writes begins, but before the second checkpoint can happen, the system crashes, leaving the file in an inconsistent state. The only thing to do in this scenario is repair the file so that it reverts to the state it was in just after the last checkpoint. How do you do that? The answer is surprisingly simple: get it from another file.
The Physical Log
To restore the unmodified document in this scenario you have to copy the data that is already there to another file before you overwrite any data in the main file. Then, if there is a failure before any given checkpoint can be completed, the old data can be retrieved from this other file.
The other file is called the physical log, and its use is shown in Figure 7.
The physical log acts as a copy of your original data. This is conceptually similar to keeping the original file around while writing the new data to a temporary file: at a certain point, you have both the old and new data on disk. But the physical log is different, and better, because instead of having a copy of the entire file, you have a copy of only the data that has changed, saving a lot of space.
Each of the put() methods in CKPTFile calls a method called markDirty(). This method tells the CKPTFile that a particular region of the file is about to be changed (i.e. made dirty). The CKPTFile first makes a copy of this region of data to the physical log and then carries out the write.
It’s important to write to the physical log before writing to the datafile. If you write to the datafile first and the system crashes before writing to the physical log, you’ll have no copy of the original data.
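A sketch of that write path, using the method names from the article but an assumed log-entry format (position, length, saved bytes). The real CKPTFile in Listing 1 differs in detail:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.MappedByteBuffer;

public class WriteAhead {
    private final MappedByteBuffer map;
    private final DataOutputStream log;

    public WriteAhead(MappedByteBuffer map, OutputStream logStream) {
        this.map = map;
        this.log = new DataOutputStream(logStream);
    }

    // Copy the original bytes of [start, start+length) to the physical log.
    private void markDirty(int start, int length) throws IOException {
        byte[] original = new byte[length];
        for (int i = 0; i < length; i++) {
            original[i] = map.get(start + i);
        }
        log.writeInt(start);   // entry header: position and length...
        log.writeInt(length);
        log.write(original);   // ...followed by the saved bytes
        log.flush();           // the log entry must reach the OS first
    }

    // Every put() logs the old data *before* touching the mapped file.
    public synchronized void put(int position, byte[] data) throws IOException {
        markDirty(position, data.length);
        for (int i = 0; i < data.length; i++) {
            map.put(position + i, data[i]);
        }
    }
}
```

The flush() inside markDirty() is what preserves the write-ahead ordering: the saved copy is handed to the operating system before the mapped file is modified.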
The physical log is just another file, wrapped in a class called PhysicalLog. It is opened by CKPTFile the first time a put() method is called. It contains a series of entries, implemented by a class called PhysicalLogEntry (see Listing 2). Each entry contains a chunk of data, along with the start and end positions within the data file.
To checkpoint a data file, you have to flush all written data to the file. Remember that most operating systems cache filesystem data in memory, making reads and writes faster. When you write data to a file, the data isn’t necessarily written to disk immediately. Rather, it stays in the buffer for a short time, while other writes happen. When either enough writes have accumulated or enough time has passed, the operating system writes the modified data to disk.
This greatly increases filesystem speed, but it also makes the system less reliable: you can write data to a file and think it is safely there when it isn’t. If the system crashes before the real write actually happens, the data won’t be on disk.
This is where checkpointing comes in. To checkpoint a data file, simply tell the system to really force the data out to disk. This can be done in NIO using the MappedByteBuffer.force() method. This method forces any changes out to disk. When this method returns, you can safely assume that the data really is on disk.
You can checkpoint a CKPTFile at any time by calling its checkpoint() method. However, note that this method is synchronized, as are all the get() and put() methods. This assures that checkpointing won’t happen at the same time that a write is in progress. This is good; we only want checkpoints to happen in the safe zone between writes.
Once checkpointing completes, you can delete the physical log; you won’t need it anymore.
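The checkpoint step itself is small. A sketch, under the same assumptions as the earlier snippets: force the mapped pages to disk, then remove the log, whose saved copies are now obsolete.

```java
import java.io.File;
import java.nio.MappedByteBuffer;

public class CheckpointStep {
    // force() blocks until every modified page of the mapping is on disk.
    // Once it returns, the old data saved in the physical log is obsolete,
    // so deleting the log marks the file as cleanly checkpointed.
    public static void checkpoint(MappedByteBuffer map, File physicalLog) {
        map.force();
        physicalLog.delete();
    }
}
```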
The Repair Class
The physical log contains copies of only those sections of the data file that have changed. To restore data from the physical log, in the case of a failure, you open the log and apply each entry to the data file. The PhysicalLog class constructor opens the physical log, and its readEntry() method lets you read PhysicalLogEntry objects.
All of this is carried out by a class called Repair (see Listing 3). Repair also has a main() method, which means you can run it from the command line to repair a data file after a crash. In Listing 3, look especially at the first method, repair(), which does the actual repair.
It’s safe to call Repair.repair() on a file every time you start your application. It will first check to see if the file actually needs to be repaired. If it does, it repairs it. It can tell whether or not a file needs to be repaired simply by checking to see if there is a physical log for that file lying around. If there is, then the file didn’t checkpoint successfully the last time the program was run, and is therefore corrupt. If there is no physical log, then the file was checkpointed properly, and no repair is necessary.
(Author’s Note: Actually, this explanation isn’t precisely correct. If the checkpointing process completes, but the program crashes before it deletes the physical log, then repair will think it’s damaged even though it isn’t. This scenario is very unlikely but it can happen. Fortunately, the only cost of repairing an already-repaired file is a little bit of wasted time.)
Crashes can happen during regular writing, but they can also occur during checkpointing or the repair process. Luckily, this doesn’t complicate things for us much: we can just repair the file in the normal way.
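A sketch of that repair loop, using the same assumed entry format as earlier (position, length, saved bytes); the article’s Repair class (Listing 3) is the real implementation:

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RepairSketch {
    // Read each physical-log entry and copy its saved region back into
    // the data file, undoing the interrupted writes.
    public static void repair(File dataFile, File logFile) throws IOException {
        if (!logFile.exists()) return;   // clean checkpoint: nothing to do
        DataInputStream log = new DataInputStream(
            new BufferedInputStream(new FileInputStream(logFile)));
        RandomAccessFile data = new RandomAccessFile(dataFile, "rw");
        try {
            while (true) {
                int position, length;
                try {
                    position = log.readInt();
                    length = log.readInt();
                } catch (EOFException done) {
                    break;               // all entries applied
                }
                byte[] saved = new byte[length];
                log.readFully(saved);
                data.seek(position);
                data.write(saved);       // restore the old bytes
            }
            data.getFD().sync();         // make the repair itself durable
        } finally {
            log.close();
            data.close();
        }
        logFile.delete();                // the file is consistent again
    }
}
```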
You’ll also notice that there’s a flag called repairing in CKPTFile. When Repair is repairing a datafile, it does this by writing to it, which of course generates yet another physical log. If repairing is set to true, then this file is given a different name, so that it doesn’t conflict with the real physical log, which is needed to actually repair the file.
Putting it All Together
We’ve now seen all of the parts of our system, but it’s important to see the big picture. Figure 8 shows the entire process of writing to the datafile, both for successful checkpoints and for crashes.
There are a couple of options you can set in the code:
PhysicalLog.justArchive. This option tells the system what to do with physical logs once they are no longer needed after a checkpoint. If it is set to false, they are deleted. If it is set to true, the files are renamed and kept. This can be useful because a full set of physical logs contains the entire history of the changes made to the datafile.
PhysicalLog.safely. This option determines whether or not PhysicalLog flushes its data after each write. Setting this to true turns the flushing on, which is safer but much slower. Note that this doesn’t control the flushing of the datafile (which happens during checkpointing) but of the physical log. If you go the unsafe route (false), your data will survive a program crash but may not survive a power failure. If you go the safe route (true), your data will survive either situation.
Testing the System
For systems like this, testing is critical. Without extensive testing, it’s easy to miss some unusual set of circumstances that you couldn’t predict when you were writing the code. In this section, you’ll learn how to try to destroy your own data.
The implementation included with this article comes complete with a class/program called Pound (see Listing 4), which pounds on a datafile as fast as it can. More precisely, it fills the datafile with a single timestamp value; when it fills up the file, it calls checkpoint(). Thus, after each checkpoint, the file should contain nothing but the repetition of a single 8-byte timestamp. If, upon examination, the file has different timestamps in different parts of the file, then it has been corrupted. Just before commencing the next round of writing, Pound checks the file to see if it has been corrupted.
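Pound’s corruption check boils down to verifying that one 8-byte value repeats throughout the file. A hypothetical version of that check (the real one is in Listing 4):

```java
import java.nio.ByteBuffer;

public class CheckRepeats {
    // Returns true if the buffer consists of a single 8-byte value
    // repeated throughout -- the invariant Pound expects to hold
    // after every successful checkpoint. Assumes capacity >= 8.
    public static boolean uncorrupted(ByteBuffer buf) {
        long first = buf.getLong(0);
        for (int pos = 8; pos + 8 <= buf.capacity(); pos += 8) {
            if (buf.getLong(pos) != first) return false;
        }
        return true;
    }
}
```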
Run Pound like this:
If you kill the program (or even pull the plug on your machine, assuming you have the PhysicalLog.safely flag turned on), you can just restart Pound and it will do a repair (if necessary) and then resume pounding.
Even harsher than Pound is pummel, which is a Perl script. pummel starts up a child process running Pound, kills it at a random moment, and then starts it up again, over and over. This isn’t as good as pulling the plug, but it’s still pretty good.
Run pummel like this:
Finally, just for good measure, there’s a program called Corrupt, which can be used to corrupt a file by changing random bytes in it. Run this while pummel is running and you’ll see that the repair process reports a corrupted file. Corrupt is a great utility for testing programs (see Listing 5).
Run Corrupt like this (and make sure pummel is already running):
java Corrupt 8192
(The numerical argument to Corrupt is just the number of bytes to randomly change.)
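The damage Corrupt inflicts amounts to overwriting randomly chosen bytes with random values. A hypothetical equivalent (the real utility is in Listing 5):

```java
import java.io.RandomAccessFile;
import java.util.Random;

public class CorruptSketch {
    // Overwrite `count` randomly chosen bytes of the file with random
    // values, simulating on-disk corruption for testing.
    public static void corrupt(String path, int count) throws Exception {
        RandomAccessFile file = new RandomAccessFile(path, "rw");
        Random random = new Random();
        try {
            for (int i = 0; i < count; i++) {
                long position = (long) (random.nextDouble() * file.length());
                file.seek(position);
                file.write(random.nextInt(256));
            }
        } finally {
            file.close();
        }
    }
}
```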
Practicing Fault Tolerance
Designing the CKPTFile program is really an exercise in fault tolerance. Programming for fault tolerance requires a different way of thinking about code. You have to clarify what assumptions you make at each point in your code, and then consider situations where those assumptions are violated.
The CKPTFile program demonstrates that if you are aware of the assumptions you are making in your programming, and carefully consider what happens if those assumptions are violated, it is possible to write programs that operate correctly under terrible conditions.
CKPTFile is a prototype implementation of a fault-tolerant I/O system. It would serve well as the low-level I/O facility for a database engine or persistence system and would work well as a component in a long-running server.