This post is very delayed, but things have been busy.
I'm working my way up to finger trees, which are a wonderful
functional data structure. They're based on trees - and like many
tree-based structures, their performance relies heavily on the balance
of the tree. The more balanced the tree is, the better they perform.
In general, my preference when I need a balanced tree is a red-black
tree, which I wrote about before. But there are other kinds of balanced trees,
and for finger trees, many people prefer a kind of tree called a 2/3 tree.
A two-three tree is basically a B-tree with a maximum of two values per node.
If you don't know what a B-tree is, don't worry. I'm going to explain all the details.
A two-three tree is a tree where each node contains either one or two values. If
the node isn't a leaf, then its number of children is one more than its number
of values. The basic value constraints are the same as in a binary search tree:
smaller values to the left, larger values to the right.
The big difference in the 2/3 tree is balance: it's got a much stronger
balance requirement. In a 2/3 tree, every path from the tree root to the leaves
is exactly the same length.
Before getting into the insert algorithm (which is where the beauty and
the trickiness of the structure resides), let's get a bit more formal about what a two-three tree is. The rules of a two-three
- Every node has either one or two values.
- Every node has either 0, 2, or 3 children.
- For a given node n, the 1st child subtree contains only values
less than the 1st value of n; the 2nd child subtree contains values
greater than or equal to the 1st value of n but less than the second;
and the 3rd child subtree contains values greater than or equal to
the second value in n.
- The leaves of the tree are all on the same level.
How can it do that? The concept is pretty simple, but the implementation is
tricky. (B-trees in general are incredibly error-prone to implement. There's a lot
of subtleties to get the implementation correct. They're not difficult, but there
are a lot of tricky cases.)
To insert a node, what you do is a straightforward search for the correct leaf to
insert it into. Then you insert it into the correct position in that leaf. If after
doing that, the leaf has three values, then you take that node, and split it. You take
the middle value, and make a new single-value tree node. Then you take the smallest value
in the node, make it into a single-value node, and make it the left child of the middle node. Similarly, with the largest value, you make a new single value node, and make it the right child of the middle node. That gives you a simple three-node two-three tree where you
used to have a single node.
Now you need to re-attach it to the tree. To do that, you go to the parent of the node where you did the insert. You remove the pointer that used to point to the node where you did the insert, and then you splice in the new node at that position. If that results in
the parent node having more than two values, then you split the parent, and continue.
As usual, this is much clearer with an example. Here's a two-three tree.
Now, suppose we want want to insert the value 15. The insert node is going to be the third child - the rightmost node in the tree, which has two values, 14 and 17. So we go ahead and insert it into that node.
Now we've got a leaf node with three values, 14, 15, and 17. That's no good. So we split
it out. We make a new node with the value 15, and make it a left-child node containing 14, and a right child containing 17. Then we insert that into the parent.
So now we've got a root node with three values: 6, 12, and 15. So we need to split it.
We do the same basic thing: take the middle value (12), and make it a new node. Then we
make 6 its left child, and 15 its right. And now we've got a two-three tree that satisfies all the rules, and is beautifully balanced.
Deleting a node uses similar tricks, but based on merging nodes instead of splitting.
The trick in deletes is making sure that you don't lose levels. For example, suppose
you wanted to remove "6".
Looking at our tree, if you deleted "6", the easiest way would be to merge 4 and 7 into a single node, which was the left child of 12, as in this diagram.
But when you do that, you've removed a level from the leftmost branch of the tree: the left branch of the root node would have a depth of two, and the right branch would have a depth of three. So to do the delete successfully, you need to shuffle things around a bit. Since you're removing a level from the left subtree, you need to also remove one from the right. You do that by merging the root of the right subtree into the tree root:
Finally, if the merge of the right branch had caused the root to contain three values, we would have needed to do a split, exactly the way we did in insert.
It's a beautiful structure. But as I said, implementing it is a pain. Personally,
I'd rather take the relatively small performance penalty of the red-black tree,
which I find a lot easier to get right. In general, two-three trees are beautiful in concept,
but more trouble than they're worth in practice. But at least some functional folks love
them; and I suppose it's possible that in a functional language with strong formal
properties, it's easier to get them right than in a typical imperative language.
I'm not going to show you an implementation of two-three trees - as I said, the implementation gets rather hairy, covering all of the cases. I put one together, but looking at it, I don't think it's got much in the way of explanatory value. In the next data structure post, I'm going to finally talk about finger trees. One of the canonical implementations of them is based on a Haskell implementation of two-three trees - so if you really want to
see that, there'll be a link in the next post.