B-Trees - Balanced Search Trees for Slow Storage

Jul 06 2008 Published by under Data Structures

Another cool, but frequently overlooked, data structure in the tree family is called the B-tree. A B-tree is a search tree, very similar to a BST in concept, but optimized differently.

BSTs provide logarithmic time operations, where the performance
is fundamentally bounded by the number of comparisons. B-trees also
provide logarithmic performance with a logarithmic number of
comparisons - but the performance is worse by a constant factor. The
difference is that B-trees are designed around a different tradeoff. The B-tree is designed to minimize the number of tree nodes that need to be examined, even if that comes at the cost of doing significantly more
comparisons.

Why would you design it that way? It's a different performance tradeoff.
The B-tree is a da
ta structure designed not for use in memory, but instead for
use as a structure in hard storage, like a disk drive. B-trees are the
basic structure underlying most filesystems and databases: they provide
an efficient way of provide rapidly searchable stored structures. But
retrieving nodes from disk is very expensive. In comparison to
retrieving from disk, doing comparisons is very nearly free. So the design
goal for performance in a B-tree tries to minimize disk access; and when disk access is necessary, it tries to localize it as much as possible - to minimize the number of retrievals, and even more importantly, to minimize the number of nodes on disk that need to be updated when something is inserted.

An order-N B-tree is a search tree with the following properties:

  1. each node contains no more than N children.
  2. Every node (except the root and the leaves) contains at least
    N/2 children.
  3. All leaves are empty nodes, and occur in the same level of the tree.
  4. Every node with N children contains N-1 keys.
  5. Given a node N1, with a K'th child node NK, the keys in
    NK are greater than the K-1th key in N1, and less that
    the Kth key in N1.

The brilliance of the B-tree is that it's a structure that minimizes access to disk while maintaining a nearly perfect balance. It's really an amazingly beautiful and elegant structure.

The key to it is the beautiful insert operation. The tree grows from the leaves - but it only adds levels at the root. That makes the balancing work - and work in a way that guarantees that you'll be able to keep the tree in balance doing nothing but modifying nodes in a single linear chain from the leaf up to the root.

To insert a value, you start by picking the correct place to insert. The set of non-empty nodes at the bottom of the tree is guaranteed to be a sorted list of values,
running from the left-most node over to the right, in order.

You find its place in that list - and that gives you the insertion node. If the
insertion node already has N values, that you split the node. You take the
median value in the node, and use it as a pivot. You take the nodes smaller than the
pivot, and turn them into a new node named "L"; and the nodes larger that the pivot,
and make them into a new node named "R". Then you move the pivot into the parent
of the insert node, and make L its left child, and R its right. So now the former leaf node is now two nodes, with one of its values promoted up into its parent.

If the parent node already had N values, then you split it, and so on. You keep bumping up the tree, until you get to the root - if you need to split the root, then you take the split pivot node, and turn it into a new root with only one value. (That's why the second invariant makes an exception for the root node.)

As usual, examples help to understand. Look at the order-3 B-tree in the figure below.

btree.png

In an order-4 B-tree, every node (except the root) must have at least two values. Suppose we wanted to insert the value "20". That would go in the middle child node, between 18 and 20. But that would result in a new node with 4 values. But the maximum
number of values per node is three. So we need to split it. We'll pick 18 as the pivot. That will give us two new nodes - one with only one value (15), and one with two (20, 21). The pivot node, 18, will be inserted into the parent, giving us a parent with the three nodes 10, 18, 30, and four children. The resulting B-tree is shown below.

btree-one-insert.png

Now, suppose we wanted to insert the value 7. That would split the leftmost child. Using 8 as a pivot, we'd get two new nodes: (3, 7) and (9), and we'd insert 8 into the root node. But that would result in the root node having four values. So we split the root node. The root pivot is 18, so we get two new nodes, (8, 10) and (30), and
a new root, containing only (18), as shown below.

btree-second-insert.png

Deletes are very similar, except that they involve contractions. When a node shrinks to be too small because of deletions, it gets merged with a neighbor; after the contraction, if the resulting merged node has too many values (for example, if the neighboring node was full before the merge), then you split the newly merged node. So either you end up removing a node from the tree, or you redistributing the children between the two nodes in a balanced manner.

As usual for my data-structure posts, there'll be an example implementation in another post, to follow as soon as I have time to write it.

10 responses so far

  • llewelly says:

    Anyone who enjoys a bit of data structure coding fun should implement a B-tree, and possibly one of its kin, such as a B+-tree or a B*-tree.

  • Pseudonym says:

    Deletes are very similar, except that they involve contractions.

    This is a piece of fiction that we tell to undergraduates.
    In fact, we never tell people the full B-tree deletion algorithm. The reason is that it's nowhere near similar to the insertion algorithm, and full of special cases, and generally horrid. Try implementing it and you'll see what I mean.
    Remember, the goal is to minimise disk operations and, if possible, to maintain some degree of concurrency. It's therefore important to avoid operations that change the structure of the tree (splitting or merging nodes) where possible. So, for example, on delete, it's almost always best to shuffle keys between leaves if you can get away with it.
    One more thing. In a true concurrency situation, it's almost always better to occasionally violate the structure restrictions on any kind of trees, B-Trees included, in the interests of each operation completing in a reasonable amount of time. It turns out that this has additional benefits, in that both insertion and deletion algorithms become much simpler to implement.

  • Ahruman says:

    This article might be easier to understand if you started with the same example you finished with. ­čÖé

  • Mozglubov says:

    Indeed, I agree with #3. I think something went a little wrong when selecting the initial example image...

  • Andrew says:

    In your first diagram the in middle child node has the value sequence 11,18,21,27, but in your second you suddenly introduce "15", having inserted 20. Am I missing something?

  • Micah says:

    He switched 11 to 15. Understandable, I suppose, as the both begin with 1...
    In any case, pick one or the other in you mind (until he fixes it) and the example makes sense.

  • Mark C. Chu-Carroll says:

    In case any of you noticed it, the "web design" comment posted at 3:29 is actually comment spam. It's a reproduction of a comment on Reddit, from a spammer who's hit this site before.

  • Ian says:

    Mark,
    Are these the same B-trees I used to use several decades ago in lieu of a "database"?

  • There's something about m-ary...
    If I may cut and paste from a Weisstein-Wolfram page:
    B-trees were introduced by Bayer (1972) and McCreight. They are a special m-ary balanced tree used in databases because their structure allows records to be inserted, deleted, and retrieved with guaranteed worst-case performance. An n-node B-tree has height O(lg n), where lg is the logarithm to base 2. The Apple® Macintosh® (Apple, Inc., Cupertino, CA) HFS filing system uses B-trees to store disk directories (Benedict 1995). A B-tree satisfies the following properties:
    1. The root is either a tree leaf or has at least two children.
    2. Each node (except the root and tree leaves has between [m/2] and m children, where [x] is the ceiling function.
    3. Each path from the root to a tree leaf has the same length.
    Every 2-3 tree is a B-tree of order 3. The number of B-trees of order 3 with n=1, 2, ... leaves are 1, 1, 1, 1, 2, 2, 3, 4, 5, 8, 14, 23, 32, 43, 63, ... (Ruskey, Sloane's A014535). The number of order-4 B-trees with n=1, 2, ... leaves are 1, 1, 1, 2, 2, 4, 5, 9, 15, 28, 45, ... (Sloane's A037026).
    SEE ALSO: Red-Black Tree, Tree
    REFERENCES:
    Aho, A. V.; Hopcroft, J. E.; and Ullmann, J. D. Data Structures and Algorithms. Reading, MA: Addison-Wesley, pp. 369-374, 1987.
    Bayer, R. and McCreight, E. "Organization and Maintenance of Large Ordered Indexes." Acta Informatica 1, 173-189, 1972.
    Benedict, B. Using Norton Utilities for the Macintosh. Indianapolis, IN: Que, pp. B-17-B-33, 1995.
    Beyer, R. "Symmetric Binary B-Trees: Data Structures and Maintenance Algorithms." Acta Informat. 1, 290-306, 1972.
    Knuth, D. E. "B-Trees." The Art of Computer Programming, Vol. 3: Sorting and Searching, 2nd ed. Reading, MA: Addison-Wesley, pp. 482-485 and 490-491, 1998.
    Ruskey, F. "Information on B-Trees." [see web page for hotlink]
    Skiena, S. S. The Algorithm Design Manual. New York: Springer-Verlag, p. 178, 1997.
    Sloane, N. J. A. Sequences A014535 and A037026 in "The On-Line Encyclopedia of Integer Sequences."
    CITE THIS AS:
    Weisstein, Eric W. "B-Tree." From MathWorld--A Wolfram Web Resource

  • Stephen says:

    I worked for a company in the early 90's that had a project in which data was stored on CDs. CD drives were very slow, and often had really long seek times. But the idea was that you could distribute the phone bill for a 30,000 person campus on CD and people could look at it cheaper and easier than if it were printed.
    Of course CD drives got faster. And in 1995, putting that kind of information on the web became feasible. And now pocket computers can have 650 MB of RAM, not just desktops and servers. So the idea is dead.
    I recall, it took a high end machine a day or two to create a CD database.

Leave a Reply