Testing the System
For systems like this, testing is critical. Without extensive testing, it's easy to miss an unusual combination of circumstances that you didn't anticipate when you were writing the code. This section shows how to deliberately try to destroy your own data.
The implementation included with this article comes with a program called Pound (see Listing 4), which pounds on a datafile as fast as it can. More precisely, it fills the datafile with repetitions of a single timestamp value, and when the file is full, it calls checkpoint(). Thus, after each checkpoint, the file should contain nothing but the repetition of a single 8-byte timestamp. If different parts of the file hold different timestamps, the file has been corrupted. Just before starting each new round of writing, Pound checks the file for this kind of corruption.
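The fill-and-verify cycle can be sketched roughly as follows. This is not the code from Listing 4: it uses a plain `RandomAccessFile` instead of CKPTFile, the class and method names (`PoundSketch`, `fill`, `isConsistent`) are hypothetical, and `getFD().sync()` is only a crude stand-in for `checkpoint()`, since it offers none of CKPTFile's repair guarantees.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of Pound's cycle using a plain RandomAccessFile. */
public class PoundSketch {
    static final int STAMP_BYTES = 8; // one timestamp is a Java long

    /** Fills the whole file with repetitions of a single 8-byte timestamp. */
    static void fill(RandomAccessFile file, long stamp) throws IOException {
        long slots = file.length() / STAMP_BYTES;
        file.seek(0);
        for (long i = 0; i < slots; i++) {
            file.writeLong(stamp);
        }
        // The real Pound calls checkpoint() here; fsync is only a rough stand-in.
        file.getFD().sync();
    }

    /** Returns true if every 8-byte slot holds the same timestamp. */
    static boolean isConsistent(RandomAccessFile file) throws IOException {
        long slots = file.length() / STAMP_BYTES;
        if (slots == 0) {
            return true;
        }
        file.seek(0);
        long first = file.readLong();
        for (long i = 1; i < slots; i++) {
            if (file.readLong() != first) {
                return false; // two different timestamps: corrupted
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("pound", ".dat");
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "rw")) {
            file.setLength(8192);
            fill(file, System.currentTimeMillis()); // establish a known starting state
            for (int round = 1; round <= 3; round++) {
                // Check before each new round of writing, as Pound does.
                if (!isConsistent(file)) {
                    System.out.println("corrupted before round " + round);
                    return;
                }
                fill(file, System.currentTimeMillis());
            }
            System.out.println("consistent: " + isConsistent(file));
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Checking before each round, rather than after, means that a crash partway through a fill is caught the next time the program starts.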
Run Pound like this:
If you kill the program (or even pull the plug on your machine, assuming you have the PhysicalLog.safely flag turned on), you can just restart Pound and it will do a repair (if necessary) and then resume pounding.
Even harsher than Pound is pummel, a Perl script. pummel starts a child process running Pound, kills it at a random moment, and then starts it again, over and over. This isn't as thorough a test as pulling the plug, because the operating system still gets to flush its buffers, but it's still pretty good.
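pummel itself is written in Perl, but the same kill-and-restart loop can be sketched in Java with `ProcessBuilder`. The class name, the `pummel` method, and its parameters are hypothetical; the sketch assumes the command to run (for example, the Java launcher plus `Pound` and its arguments) is passed on the command line.

```java
import java.io.IOException;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

/** Hypothetical Java sketch of pummel's kill-and-restart loop. */
public class PummelSketch {
    /**
     * Starts the command, lets it run for a random delay of up to maxMillis,
     * kills it forcibly if it is still running, and repeats.
     * Returns the number of rounds completed.
     */
    static int pummel(List<String> command, int rounds, long maxMillis, Random random)
            throws IOException, InterruptedException {
        int completed = 0;
        for (int i = 0; i < rounds; i++) {
            Process child = new ProcessBuilder(command).inheritIO().start();
            long delay = (long) (random.nextDouble() * maxMillis);
            // waitFor returns true early if the child exits on its own.
            if (!child.waitFor(delay, TimeUnit.MILLISECONDS)) {
                child.destroyForcibly(); // forced termination, like kill -9
                child.waitFor();         // reap the child before restarting
            }
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java PummelSketch <command> [args...]");
            return;
        }
        // Effectively endless, like the original script.
        pummel(List.of(args), Integer.MAX_VALUE, 5000, new Random());
    }
}
```

Waiting for the killed child before starting the next one ensures that two writers never touch the datafile at the same time.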
Run pummel like this:
Finally, for good measure, there's a program called Corrupt, which corrupts a file by changing random bytes in it. Run it while pummel is running and you'll see the repair process report a corrupted file. Corrupt is a handy utility for testing programs like this one (see Listing 5).
Run Corrupt like this (and make sure pummel is already running):
java Corrupt 8192
(The numerical argument to Corrupt is just the number of bytes to randomly change.)
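The core of a Corrupt-style tool fits in a few lines. This is a hypothetical sketch rather than Listing 5: unlike the `java Corrupt 8192` invocation, it assumes the target file path is passed explicitly, and it XORs each chosen byte with a nonzero mask so that every write really changes the byte.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

/** Hypothetical sketch of a Corrupt-style tool. */
public class CorruptSketch {
    /**
     * Makes n random single-byte changes to the file at path.
     * The same offset may be chosen more than once.
     */
    static void corrupt(String path, int n, Random random) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            long length = file.length();
            if (length == 0) {
                return; // nothing to corrupt
            }
            for (int i = 0; i < n; i++) {
                long offset = (long) (random.nextDouble() * length);
                file.seek(offset);
                int old = file.read();
                // XOR with a mask in 1..255 guarantees this write changes the byte.
                int mask = 1 + random.nextInt(255);
                file.seek(offset);
                file.write(old ^ mask);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java CorruptSketch <file> <count>");
            return;
        }
        corrupt(args[0], Integer.parseInt(args[1]), new Random());
    }
}
```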
Practicing Fault Tolerance
Designing the CKPTFile program is essentially an exercise in fault tolerance. Programming for fault tolerance requires a different way of thinking about code. You have to identify the assumptions you make at each point in your code, and then consider what happens when those assumptions are violated.
The CKPTFile program demonstrates that if you know which assumptions your code relies on, and carefully consider what happens when they are violated, you can write programs that operate correctly under terrible conditions.
CKPTFile is a prototype implementation of a fault-tolerant I/O system. It could serve as the low-level I/O facility for a database engine or persistence system, or as a component of a long-running server.