The previous article in this series
on trees showed how to build a sorted binary tree. Inserting and locating items in a sorted tree is relatively quick and easy, so it's natural to wonder if you could use a sorted tree as an index for a database. To find a record, you could look up its key in the tree and then the node containing that key would contain a reference to the complete record's location. This method is simple and effective, but a normal sorted tree has some drawbacks that prevent it from behaving as you would expect.
This article discusses the reasons why ordinary sorted trees don't work well as indexes and explains B-trees (pronounced "bee trees"), the data structures that databases actually use to build index structures. It also offers a B-tree program (in both Visual Basic and C#) for download, but none of that code is included in the article body because it's fairly involved.
But first, you need to learn a bit about tree sizes and behavior so you can understand the performance issues.
To insert an item in a sorted binary tree, you compare the item's value with the value stored in the root node. If the new value is smaller than the root's value, you move to the root's left child. If the new value is greater than the root's value, you move to the root's right child. You continue moving down to the nodes' appropriate left or right children until you reach a leaf node. Then you insert the new item as that node's left or right child as appropriate.
Locating an item in the tree is similar. You start at the root node and move through the tree, following each node's left or right branches depending on the target value, until you either find the target value or you reach a leaf node and can conclude that the value isn't in the tree.
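The insertion and search steps just described can be sketched in a few lines of code. (The downloadable samples are in Visual Basic and C#; this Python sketch is only meant to illustrate the walk-down-the-tree logic.)

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Walk down from the root, going left or right by comparison,
    and attach the new value where a leaf's child slot is empty."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def contains(root, target):
    """Follow left or right branches until the target is found
    or a dead end shows the value isn't in the tree."""
    while root is not None:
        if target == root.value:
            return True
        root = root.left if target < root.value else root.right
    return False
```

In a real index, each node would also carry a reference to the full record's location, so a successful search returns that reference rather than just True.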
When either inserting or locating an item, you start at the root node and move down through the tree. You might need to go all the way to a leaf, so the tree's height gives an upper bound on how far you might need to search. The question, then, is: how tall might the tree be?
You can prove by induction that a full binary tree of height H contains 2^H – 1 nodes. I haven't had an excuse to use induction for a while, so I'm going to jump at the chance! For the base case, note that a tree of height H = 1 is just a single node. That tree contains 2^1 – 1 = 2 – 1 = 1 node, as desired.
For the inductive step, assume I've shown that a full binary tree of height N contains 2^N – 1 nodes. To finish the proof, I must show that a full binary tree of height N + 1 contains 2^(N + 1) – 1 nodes. I can build a full binary tree of height N + 1 by building two trees of height N and connecting each of their root nodes to a new root node. Figure 1 shows a tree of height 3 built from two trees of height 2 (surrounded by dashed lines).
|Figure 1. Growing Trees: You can build a tree of height N + 1 by joining two trees of height N with a new root node.|
The tree of height N + 1 contains all of the nodes in the trees of height N plus one new root node so the total number of nodes is (2^N – 1) + (2^N – 1) + 1 = 2 * 2^N – 2 + 1 = 2^(N + 1) – 1, which is what I need to prove.
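The inductive argument translates directly into a recurrence you can check numerically: a full tree of height H is two full trees of height H – 1 joined by one new root. A quick Python sketch:

```python
def full_tree_nodes(height):
    """Count nodes in a full binary tree using the construction from
    the proof: two subtrees of height - 1 plus one new root node."""
    if height == 1:
        return 1  # base case: a single node
    return 2 * full_tree_nodes(height - 1) + 1

# The recurrence matches the closed form 2^H - 1 for every height tried.
for h in range(1, 11):
    assert full_tree_nodes(h) == 2**h - 1
```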
I can use a similar argument to show that a full K-ary tree of height N contains (K^N – 1)/(K – 1) nodes. For example, a full tree of degree 10 and height 4 contains (10^4 – 1)/(10 – 1) = 1111 nodes.
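You can sanity-check the K-ary formula by counting nodes level by level: one node at the root, K at the next level, K^2 below that, and so on. A quick sketch:

```python
def full_kary_nodes(k, height):
    """Sum the geometric series 1 + k + k^2 + ... + k^(height - 1),
    one term per level of a full k-ary tree."""
    return sum(k**level for level in range(height))

# The level-by-level count matches the closed form (K^N - 1)/(K - 1).
assert full_kary_nodes(10, 4) == (10**4 - 1) // (10 - 1) == 1111
assert full_kary_nodes(2, 5) == 2**5 - 1  # the binary case agrees too
```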
If you ignore constants, a full K-ary tree of height H has on the order of K^H nodes. Conversely, a full K-ary tree containing N nodes has height on the order of LogK(N). This is good news for those wanting to use trees for indexes. It means that the height of the tree, and therefore the maximum length of the search for an item in the tree, grows very slowly as the number of items in the tree grows.
For example, suppose you have a full binary tree containing one million nodes. Its height is roughly log2(1 million), or around 20, so you would need to compare the target value to only around 20 values before you find it or conclude that the value isn't in the tree. That's not much work for searching a database containing a million values!
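That "around 20" figure comes straight from the logarithm:

```python
import math

# Height needed for a full binary tree holding one million items.
height = math.log2(1_000_000)
print(round(height, 2))  # roughly 19.93, so about 20 comparisons
```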
If that were the end of the story, you could easily use a binary tree to store and retrieve values. Unfortunately, this entire performance discussion depends on the tree being relatively full. If the tree is full, it is fairly wide and, more importantly, short. However, if you add items to a sorted tree in different orders, you get trees with different widths and heights. Some orders produce a tree that is very tall and skinny. For example, consider the sorted binary trees shown in Figure 2.
|Figure 2. Laurel and Hardy Trees: Depending on the order in which you add items, the tree may end up short and fat or tall and thin.|
The tree on the left was built by adding items to the tree in the order 4-2-3-1-6-5-7. The tree on the right was built by adding the same items to the tree in the order 7-1-6-5-2-4-3. Searching for an item in the left tree will take at most three steps, but searching for an item in the right tree might take as many as seven.
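You can reproduce both shapes in Figure 2 by inserting the same seven values in those two orders and measuring the resulting heights. A quick self-contained sketch (the insertion routine just repeats the walk-down-the-tree logic described earlier):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    # Standard sorted-tree insertion: smaller values go left, others right.
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def height(root):
    # An empty tree has height 0; otherwise count the tallest branch.
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

def build(order):
    root = None
    for v in order:
        root = insert(root, v)
    return root

print(height(build([4, 2, 3, 1, 6, 5, 7])))  # 3: short and fat
print(height(build([7, 1, 6, 5, 2, 4, 3])))  # 7: tall and thin
```

Same seven values, same sorted-tree rules, wildly different worst-case search lengths.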
Searching these little trees is no big deal, but suppose you had a big customer database containing a few hundred thousand, or even a million, records. In that case, it could take a long time to search a tall, skinny tree.
The problem with tall, thin trees is magnified by the fact that you won't be able to store such a large database all in memory at the same time. Disk accesses are orders of magnitude slower than memory accesses, so hammering the disk 100,000 times to find values to examine could take a very long time indeed.
In practice, if you build a sorted tree and the values are more or less randomly arranged, you'll get a tree that isn't as short as possible but that also isn't too tall and thin. However, a few common scenarios can really wreak havoc with your tree. For example, suppose you pull data out of an existing database and add it to your tree. You may end up adding the values to the tree in sorted order, which creates a tall, thin tree.
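A quick experiment shows just how bad the sorted-order case is: each new value becomes the right child of the previous one, so the "tree" degenerates into a linked list whose height equals the number of items. A self-contained sketch:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    # Standard sorted-tree insertion: smaller values go left, others right.
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def height(root):
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

def build(order):
    root = None
    for v in order:
        root = insert(root, v)
    return root

# 100 values inserted in sorted order: every node has only a right child.
print(height(build(range(1, 101))))  # 100: one level per item
```

Compare that height of 100 with the height of about 7 that a full binary tree would need for the same 100 values; this is exactly the problem that B-trees are designed to prevent.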