B-Trees: An Efficient Structure for Searching Data on External Memory


In this lecture we will examine data structures and algorithms that use more
space than is available in main memory. We will analyze them with respect to
how often they move blocks of data between main and external memory (discussed
in the previous lecture). Typically the cost of such memory transfers dominates
the cost of executing code on the block while it is in memory. That is, a
transfer between main and external memory might take 10 milliseconds, while
processing the data in the block might take microseconds (a few 1/1000ths of a
millisecond). A processor executing 10^9 instructions per second can process
10^7 (10 million) instructions in 10 ms. So, we can virtually ignore the time
taken to process each block retrieved from external memory.


Searching:

We will start with a searching task, using a special kind of N-ary Search Tree
(NST - contrast to BST) called a B-tree.

There are many different kinds of B-trees. We characterize each B-tree in terms
of its order, b: each memory block stores b-1 keys and b subtrees/children. For
b = 5, for example, we can visualize each node in the tree as containing two
arrays: the key array (storing the keys, aka values, in the B-tree) and the
subtree/children array, storing references to child nodes, the roots of
subtrees. We will use the familiar terminology of root, internal node, and leaf
to describe these trees.

For example, here is the data for a B-tree of order 5: 4 keys, 5 children.

            0   1   2   3
          +---+---+---+---+
key       |   |   |   |   |
          +---+---+---+---+
          0   1   2   3   4
        +---+---+---+---+---+
subtree |   |   |   |   |   |
        +---+---+---+---+---+

By the order property (stated below):
all values in subtree[0] are less than key[0],
all values in subtree[1] are greater than key[0] and less than key[1],
all values in subtree[2] are greater than key[1] and less than key[2],
all values in subtree[3] are greater than key[2] and less than key[3],
all values in subtree[4] are greater than key[3].

These arrays don't have to be filled, but all nodes except the root must be
at least half filled (see the structure property below).
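
As a minimal sketch of this two-array layout (not a definitive implementation),
the Python below models a node and checks the order property recursively. The
names BNode, keys, subtrees, and respects_order are hypothetical, and a leaf is
represented by an empty subtree list rather than by an array of null references.

```python
from dataclasses import dataclass, field
from typing import List

ORDER = 5  # b: assumed here to match the order-5 picture above


@dataclass
class BNode:
    """One B-tree node: a key array and a subtree array (hypothetical names)."""
    keys: List[int] = field(default_factory=list)          # sorted, at most ORDER-1
    subtrees: List["BNode"] = field(default_factory=list)  # empty list for a leaf

    def is_leaf(self) -> bool:
        return not self.subtrees


def respects_order(node, lo=None, hi=None):
    """True if every key under node lies strictly between lo and hi (None = no bound)."""
    ks = node.keys
    if any(a >= b for a, b in zip(ks, ks[1:])):      # keys sorted left to right
        return False
    if ks and lo is not None and ks[0] <= lo:
        return False
    if ks and hi is not None and ks[-1] >= hi:
        return False
    if node.is_leaf():
        return True
    bounds = [lo] + ks + [hi]                        # subtree[i] sits between bounds[i], bounds[i+1]
    return all(respects_order(child, bounds[i], bounds[i + 1])
               for i, child in enumerate(node.subtrees))
```

For example, a root with keys [10, 20] and leaves [1, 5], [12, 15], [25, 30]
passes the check, while replacing the middle leaf by [12, 25] fails it.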

Ultimately, b will be a large value, such that storing the arrays in the node
will take up the size of a large block easily readable from external storage 
(typically containing thousands of keys/subtree references). So, if it were
convenient to transfer blocks of 1,000 words of memory, we would choose b to
be 500, having 499 (b-1) keys and 500 (b) subtrees/children. It is not 
unreasonable to transfer blocks of 10K or even 100K or 1M words.

In the examples below, we will choose b to be 5. This b is big enough to be
interesting (and different than a BST) but small enough to cause lots of node
merges and splits when doing insertion and deletion (if b were 500, we would
have to insert 500 values before we required a second block of memory).

We will now characterize the order and structure properties of B-trees.

Order Property
  1) The keys in a node are sorted left to right, key[i]<key[i+1]
  2) A subtree between key[i] and key[i+1] contains only values x such that
      key[i]<x<key[i+1]. The leftmost subtree contains only values x such that
      x < key[0]; the rightmost subtree contains only values x such that
      x > key[k], where key[k] is the last key in the node

Structure Property of B-trees of order b
  1) All leaves are at the same depth and store their information as keys only
     (all the subtree references for leaves, of course, store null, so this
      space gets wasted)
  2) Every non-leaf node with k keys contains k+1 children
  3) Every node has at most b-1 keys
  4) Every node except the root has >= b/2 keys (it is at least 1/2 full);
     for b = 5, b/2 rounds down to 2 keys
  5) The root may be empty and it may also be a leaf
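
The structure property can be checked mechanically. The sketch below is only
an illustration under stated assumptions: the names BNode and leaf_depth are
hypothetical, leaves use an empty subtree list instead of null references, and
the minimum number of keys is taken to be b/2 rounded down (2 for b = 5).

```python
ORDER = 5               # b (assumed, as in the examples)
MIN_KEYS = ORDER // 2   # assumed rounding: 2 keys for b = 5


class BNode:
    def __init__(self, keys, subtrees=None):
        self.keys = keys
        self.subtrees = subtrees or []   # empty list for a leaf


def leaf_depth(node, is_root=True):
    """Return the common depth of all leaves if the structure property holds, else None."""
    k = len(node.keys)
    if k > ORDER - 1:                        # rule 3: at most b-1 keys
        return None
    if not is_root and k < MIN_KEYS:         # rule 4: non-root nodes at least half full
        return None
    if not node.subtrees:                    # a leaf (an empty root is allowed, rule 5)
        return 0
    if len(node.subtrees) != k + 1:          # rule 2: k keys means k+1 children
        return None
    depths = {leaf_depth(child, False) for child in node.subtrees}
    if len(depths) != 1 or None in depths:   # rule 1: all leaves at the same depth
        return None
    return depths.pop() + 1
```

A result of None means some rule was violated somewhere in the tree; any
integer result is the depth shared by every leaf.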
  
Here is the pseudocode for searching for, inserting, and removing values in a
B-tree. Aside: I expect you to know (i.e., memorize) how to search and insert,
but not delete (which is a bit complicated).


Searching for x in a node (start searching at the root):
  If there is no node (we followed a null reference), x is not in the tree
  If there is a node
    Using the order property, do a binary search on the keys in the node
      a) if there is an i such that key[i] = x, x is in the tree at this node
      b) otherwise, if x < key[0] search the subtree at index 0
                    if x > the last key, key[k], search the subtree at index k+1
                    if key[i] < x < key[i+1] search the subtree at index i+1

*Note that we do a binary search on each B-tree node's block, which takes a
trivial amount of time compared to bringing the block into memory.
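
The search pseudocode above can be sketched in Python as follows. This is a
minimal in-memory illustration, not the code used in class: BNode and contains
are hypothetical names, leaves use an empty subtree list, and each loop
iteration stands for one block brought in from external memory.

```python
from bisect import bisect_left


class BNode:
    def __init__(self, keys, subtrees=None):
        self.keys = keys                 # sorted keys in this block
        self.subtrees = subtrees or []   # empty list for a leaf


def contains(node, x):
    """Return True if x is stored somewhere in the B-tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, x)    # binary search inside the block
        if i < len(node.keys) and node.keys[i] == x:
            return True                  # case a: found at this node
        # case b: every key before index i is < x and key[i] (if any) is > x,
        # so x can only be in the subtree at index i
        node = node.subtrees[i] if node.subtrees else None
    return False
```

Because bisect_left returns the number of keys smaller than x, the single
index i covers all three sub-cases of (b): 0 when x < key[0], k+1 when x is
bigger than the last key, and i+1 when key[i] < x < key[i+1].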

Insertion of x (assuming it is not already in the tree; for simplicity assume
unique values, as in sets and maps):

  Search the tree to find in which LEAF x belongs

  If there is room in that leaf, put x in the key array at the correct index
  If there is NO room in the leaf (the node's key array is full)
    Choose the median value among the values in the leaf and the new value
    Using the median as the separator value, split the leaf into two new nodes
      (one with all values<median, one with all values>median)
    Insert the median in the parent's node and adjust the children (now 1 more)
      (this might cause more splitting in the parent, going back to the If above)

  It is possible, if this process goes back to a full root node, that the
  root node will itself need to split into two children of a newly created
  root, which will have just one value (allowed at the root by the structure
  property). So, unlike BSTs, B-trees grow at the root, not the leaves. This 
  ensures the property that all the leaves are at the same depth.

Note that because the median is inserted in the parent, the two children will
be well balanced, each containing half the values in the original leaf node
that was split (no matter in what order they were inserted). Also, each split
node will be only 1/2 full, so it will have room to add more keys later,
without splitting again.
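
Here is one way the insertion steps might look in Python; treat it as a sketch
under assumptions rather than a reference implementation. The names BNode,
insert, _insert, and inorder are hypothetical, values are assumed unique, and
a node is allowed to become overfull for a moment before it is split (which
has the same effect as choosing the median among the old values and the new
one).

```python
from bisect import bisect_left

ORDER = 5  # b (assumed, as in the examples)


class BNode:
    def __init__(self, keys=None, subtrees=None):
        self.keys = keys or []
        self.subtrees = subtrees or []   # empty list for a leaf


def _insert(node, x):
    """Insert x below node; return (median, new_right_node) if node split, else None."""
    i = bisect_left(node.keys, x)
    if node.subtrees:                        # internal node: descend to the leaf
        split = _insert(node.subtrees[i], x)
        if split is None:
            return None
        median, right = split                # child split: adopt its median
        node.keys.insert(i, median)
        node.subtrees.insert(i + 1, right)   # the parent now has 1 more child
    else:
        node.keys.insert(i, x)               # leaf: put x at the correct index
    if len(node.keys) <= ORDER - 1:
        return None                          # there was room
    mid = len(node.keys) // 2                # overfull: split around the median
    median = node.keys[mid]
    right = BNode(node.keys[mid + 1:], node.subtrees[mid + 1:])
    node.keys, node.subtrees = node.keys[:mid], node.subtrees[:mid + 1]
    return median, right


def insert(root, x):
    """Insert x and return the root, which is new if the old root had to split."""
    split = _insert(root, x)
    if split is None:
        return root
    median, right = split
    return BNode([median], [root, right])    # the tree grows at the root


def inorder(node):
    """All keys in sorted order (only used to check the result)."""
    if not node.subtrees:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.subtrees):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out
```

Starting from an empty root and inserting 1, 2, 3, 4, 5 in that order, the
fifth insertion splits the full leaf: the new root holds just 3, with leaves
[1, 2] and [4, 5] below it.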


Deletion of x:

The general idea of deletion here is actually similar to deletion in a BST.
Either the value is deleted from a leaf (which may require merging the leaf
and its ancestors), or from an internal node (which replaces x by the largest
value smaller than x or the smallest value bigger than x, deleting that value
from its leaf - such a value will always be in a leaf- according to the
leaf-deleting rules).

  Search the tree to find in which node x is present (let's assume only 1 node)
  If it is not present, "there is nothing left to do" (TINLTD)

  If the node containing x is a leaf
    remove x
    if the node that contained x is also the root, TINLTD
    if the # of values is still >= b/2, TINLTD
    if the # of values is now < b/2 and not the root, we call the node with
      x removed "deficient" (it doesn't have >= b/2 values); perform the code
      labeled "REBALANCE" below, which will restore a deficient node but
      possibly make its parent deficient (possibly processing nodes all the way
      up to the root).

  If the node containing x is an internal node (not a leaf)
    if x separates two minimal subtrees (each having <= b/2 values), remove x
      and combine these subtrees; REBALANCE the node that contained x if needed
    else, one or both subtrees are non-minimal (have > b/2 values),
      replace x by an "extremal" value (largest value in the left non-minimal
      subtree or smallest value in the right non-minimal subtree) and then
      remove that value from its node (which is a leaf: see above for how to
      delete); REBALANCE the leaf node that contained the extremal value if
      needed


REBALANCE: A deficient node (DN) has < b/2 values

Here is the "rebalance" code when deleting x from a node that becomes
deficient. This code may terminate, or move the deficiency from child to
parent, where it is executed again. Note that it is ok for a root to be
deficient.

Let DN be the deficient node (from which x was removed). Choose one of the 3
possibilities below (note that 1 and 2 are symmetric for right/left siblings):

1) If DN's right sibling has > b/2 values
     (it won't be deficient after moving one value)
  add the parent's separator of these siblings to the DN at the end
    (remove it from the parent)
  update the parent's separator to be the first element in the right sibling
     (remove it from the right sibling, shifting the keys/subtrees left by 1)
  if DN is not a leaf
    append what was the first child of the right sibling as the last child in DN

2) If DN's left sibling has > b/2 values
  add the parent's separator to the DN at the beginning, shifting the
    keys/subtrees right by 1 (remove it from the parent)
  update the parent's separator to be the last element in the left sibling
     (remove it from the left sibling; no shifting is needed there)
  if DN is not a leaf
    prepend what was the last child of the left sibling as the first child in DN

3) If DN's left and right siblings are both too small (<= b/2 values)
  create a new node with DN's values, the values of one sibling, and the
    separator in the parent of DN and its sibling (it will have <= b-1 values)
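  remove the separator from the parent; if the parent is now deficient (but is
    not the root) repeat this rebalancing on the parent; if the parent is the
    root and is now empty, the new node becomes the root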
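
The three REBALANCE cases can be sketched as one Python function acting on the
parent and the index of its deficient child. This is an illustration under
assumptions, not the definitive algorithm: the names BNode, rebalance, and
MIN_KEYS are hypothetical, the minimum is taken as b/2 rounded down, and in
case 3 the merged values are stored in the left-hand node of the pair instead
of in a freshly created node.

```python
ORDER = 5               # b (assumed, as in the examples)
MIN_KEYS = ORDER // 2   # assumed rounding: 2 keys for b = 5


class BNode:
    def __init__(self, keys=None, subtrees=None):
        self.keys = keys or []
        self.subtrees = subtrees or []   # empty list for a leaf


def rebalance(parent, i):
    """Repair the deficient child parent.subtrees[i].

    Returns True if the parent lost a key (so the caller must check the parent).
    """
    dn = parent.subtrees[i]
    right = parent.subtrees[i + 1] if i + 1 < len(parent.subtrees) else None
    left = parent.subtrees[i - 1] if i > 0 else None

    if right is not None and len(right.keys) > MIN_KEYS:     # case 1
        dn.keys.append(parent.keys[i])            # separator moves down to DN's end
        parent.keys[i] = right.keys.pop(0)        # right's first key moves up
        if dn.subtrees:
            dn.subtrees.append(right.subtrees.pop(0))
        return False

    if left is not None and len(left.keys) > MIN_KEYS:       # case 2
        dn.keys.insert(0, parent.keys[i - 1])     # separator moves down to DN's front
        parent.keys[i - 1] = left.keys.pop()      # left's last key moves up
        if dn.subtrees:
            dn.subtrees.insert(0, left.subtrees.pop())
        return False

    # case 3: merge DN, one sibling, and their separator into a single node
    j = i if right is not None else i - 1         # merge subtrees[j] and subtrees[j+1]
    lo_node, hi_node = parent.subtrees[j], parent.subtrees[j + 1]
    lo_node.keys += [parent.keys.pop(j)] + hi_node.keys
    lo_node.subtrees += hi_node.subtrees
    del parent.subtrees[j + 1]
    return True
```

A caller that gets True back would next test whether the parent itself has
fewer than MIN_KEYS keys and, if it is not the root, call rebalance again one
level up.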


Analysis:

Recall that we are choosing a B-tree of order b. In the best case, the height
of an N-value tree will be about Log_b N (with each node filled with values);
in the worst case it will be about Log_(b/2) N (with each node half filled with
values).

The idea is to choose b to be as big as possible (the biggest possible base
for the logarithm) so that as many of the keys and subtree references as
possible fit into one block of memory that can be easily transferred between
main and external memory. Recall that getting/putting information from/to
external memory takes about the same amount of time no matter how much
information is transferred, so let's transfer a lot each time. We make b so big
that it stores thousands of keys and subtree references in each memory block.
Note that for complexity class analysis, all log bases are equivalent, but
Log_2 N is going to be a lot bigger than Log_1,000 N (by a factor of about 10,
since Log_2 1,000 is about 10).

For all the B-tree operations, we must get the block for the root (we might as
well pre-fetch/cache this permanently in main memory), then get the block for
each subtree node we visit on the path to each x we are searching for,
inserting, or deleting. In the worst case for deletion, we must find the
extremal value in one of the leaves and delete it. Thus, in the worst case we do
one access to external memory for each depth in the tree from root to leaf (and
all leaves are at the same depth). In the worst case of insertion/deletion we
must revisit every node on the path back to the root. So, in the worst case we
do 2(Log_(b/2) N) block transfers: from external memory to main memory on the
way down, and back to external memory on the way up.
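
To put numbers on these bounds, here is a small Python sketch (the function
names height_bounds and worst_case_transfers are made up for illustration, and
N = 10^9 with b = 1,000 is just an assumed example). It treats the heights as
real-valued logarithms and does not round them to whole levels.

```python
import math


def height_bounds(n, b):
    """Return (best, worst) heights: log base b of n, and log base b/2 of n."""
    return math.log(n) / math.log(b), math.log(n) / math.log(b / 2)


def worst_case_transfers(n, b):
    """2 * log base (b/2) of n: one read per level going down, one write coming back."""
    return 2 * height_bounds(n, b)[1]


best, worst = height_bounds(10**9, 1000)   # about 3.0 and about 3.33
bst_ratio = math.log2(10**9) / best        # Log2 N / Log1,000 N, about 9.97
```

So for a billion values and b = 1,000, even the worst case needs fewer than 7
block transfers, while a balanced binary tree would be roughly 10 times
deeper.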

Finally, I will pass out a sheet of paper and do examples on the board in class
today, illustrating inserting and removing nodes from a B-tree of order 5: so
every node (but the root) will store between 2-4 values (and 3-5 references to
subtrees). The pictures are available online (see the Weekly Schedule for
today's lecture).



Sorting:

Our sorting analysis will be much shorter, because we are going to use a
variant of Merge Sort, which can be run efficiently in terms of external
storage use.

To be concrete, suppose that we need to sort 10^9 (a billion) ints stored in
a file using only about 10^6 (1 million) words of storage for main memory. So,
we can store only about 1/1,000th of the data in memory at any given time.

1) We start by separating all the input data into blocks of 10^6 (there are
1,000 such blocks) and bringing each into memory, sorting each using a standard
O(N Log2 N) sorting method, and writing each back out to a sorted file.
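
Step 1 can be simulated in a few lines of Python. This is only an in-memory
sketch with a hypothetical function name (make_sorted_runs): a real version
would read each block from the input file and write each sorted block back to
external storage, which is not shown here.

```python
def make_sorted_runs(values, m):
    """Split values into blocks of at most m items and sort each block separately."""
    return [sorted(values[i:i + m]) for i in range(0, len(values), m)]


runs = make_sorted_runs([9, 3, 7, 1, 8, 2, 6, 4, 5], 4)
# runs == [[1, 3, 7, 9], [2, 4, 6, 8], [5]]
```

Each block is sorted on its own, so no block is ever bigger than the m words
of main memory we allow ourselves.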

2) Now we need to bring the early part of each of these 1,000 blocks (the
parts with the smallest numbers) and reserve space in main memory for an
output block as well. Given 10^6 words of storage and 1,001 blocks (let's make
the output block the same size as each input block), we can allocate 10^6/1,001
ints for each block: that is 999 ints, but let's assume for simplicity each
is 1,000 ints (so we really need 1,001,000 words of storage).

Let's call these 1,000 ints a "page". Note that since we are using pages that
are each 1/1,000th the size of each input block, we will have to perform 1,000
page transfers from file to main memory to process all the values in each input
block. We will see that we need each page only once, using all the values in
it before bringing in the next page, and never needing those values again.

3) We set up a tournament-tree (talked about more below) that allows us to
produce a new value for the output page in Log2 1000 operations (~10). We
repeatedly do this until all the input blocks are exhausted (refilling their
pages as necessary).

If after putting a new value in the output page it is filled, we transfer the
page to disk and start with a new empty output page. If we consume all the
values in any input page, we overwrite that page by transferring the next page
from that block into main memory (or mark the block as exhausted, if all 1,000
pages of the block have been transferred).

We visualize this as follows: recall that each block contains 1,000 pages, and
only one page of every block is in main memory at a time.

				   +-+-+...+-+-+
				   | | |   | | |Input Block   0: 1 Page in main
				   +-+-+...+-+-+

				   +-+-+...+-+-+
				   | | |   | | |Input Block   1: 1 Page in main
				   +-+-+...+-+-+
   +-+-+...+-+-+		   
   | | |   | | |  tournament-tree  .....
   +-+-+...+-+-+
Output Block: 1 Page in main	   +-+-+...+-+-+
				   | | |   | | |Input Block 998: 1 Page in main
				   +-+-+...+-+-+

				   +-+-+...+-+-+
				   | | |   | | |Input Block 999: 1 Page in main
				   +-+-+...+-+-+

Overall, we must transfer each input page 3 times: once from external memory
-> main memory for the initial sort (step 1), once from main memory -> external
memory when that sort is finished, and once more from external memory -> main
memory for the tournament. In addition, we create a new output file that
contains all the sorted values, and each output page is transferred once from
main memory -> external memory. Thus, we need a total of 4 memory transfers
per page.

A tournament-tree quickly produces the lowest value from the 1000 input blocks,
over and over again, while transferring pages when necessary. This is called
multi-way merging.

To examine tournament-trees more simply, let's assume we are using 4 input
blocks (easiest if a power of 2) with each having 8 values in it, and 4 values
in each page, so each block consists of 2 pages and each block has 1 page
(4 words) in main memory at a time.

Assume that each input block has already been sorted (see step 1 above). The
tournament-tree is designed the way a single-elimination tournament among M
teams would be designed (e.g., the NCAA Sweet 16 tournaments). Here, the winner
is the smaller value.

                                       Pages in        |    Pages in
                                       main memory     |   external memory
			             +---+---+---+---+ | +---+---+---+---+ 
			B0/B1 Winner |12 |24 |26 | 34| | |41 |49 |50 |57 |  B0
			    +---+    +---+---+---+---+ | +---+---+---+---+ 
                            | 4 |                      | 
                            +---+    +---+---+---+---+ | +---+---+---+---+ 
			             | 4 |25 |27 |40 | | |43 |44 |53 |56 |  B1
				     +---+---+---+---+ | +---+---+---+---+ 
Page in main memory Total Winner                       | 
 +---+---+---+---+   +---+                             | 
 |   |   |   |   |   | 4 |                             | 
 +---+---+---+---+   +---+                             | 
                                                       | 
				     +---+---+---+---+ | +---+---+---+---+ 
			B2/B3 Winner | 8 |11 |30 |39 | | |42 |47 |54 |65 |  B2
			    +---+    +---+---+---+---+ | +---+---+---+---+ 
                            | 8 |                      | 
                            +---+    +---+---+---+---+ | +---+---+---+---+ 
				     |13 |16 |19 |33 | | |37 |46 |52 |58 |  B3
				     +---+---+---+---+ | +---+---+---+---+ 

Now, we transfer 4 to the first spot in the output page. We cross out 4 in B1
and advance to 25 in B1. We fill in B0/B1 Winner with 12 (since 12 < 25) and
fill in Total Winner with 8.

                                       Pages in        |    Pages in
                                       main memory     |   external memory
			             +---+---+---+---+ | +---+---+---+---+ 
			B0/B1 Winner |12 |24 |26 | 34| | |41 |49 |50 |57 |  B0
			    +---+    +---+---+---+---+ | +---+---+---+---+ 
                            |12 |                      | 
                            +---+    +---+---+---+---+ | +---+---+---+---+ 
			             |   |25 |27 |40 | | |43 |44 |53 |56 |  B1
				     +---+---+---+---+ | +---+---+---+---+ 
Page in main memory Total Winner                       | 
 +---+---+---+---+   +---+                             | 
 | 4 |   |   |   |   | 8 |                             |
 +---+---+---+---+   +---+                             | 
                                                       | 
				     +---+---+---+---+ | +---+---+---+---+ 
			B2/B3 Winner | 8 |11 |30 |39 | | |42 |47 |54 |65 |  B2
			    +---+    +---+---+---+---+ | +---+---+---+---+ 
                            | 8 |                      | 
                            +---+    +---+---+---+---+ | +---+---+---+---+ 
				     |13 |16 |19 |33 | | |37 |46 |52 |58 |  B3
				     +---+---+---+---+ | +---+---+---+---+ 


Now, we transfer 8 to the second spot in the output page. We cross out 8 in B2
and advance to 11 in B2. We fill in B2/B3 Winner with 11 (since 11 < 13) and
fill in Total Winner with 11 (since 11 < 12).

                                       Pages in        |    Pages in
                                       main memory     |   external memory
			             +---+---+---+---+ | +---+---+---+---+ 
			B0/B1 Winner |12 |24 |26 | 34| | |41 |49 |50 |57 |  B0
			    +---+    +---+---+---+---+ | +---+---+---+---+ 
                            |12 |                      | 
                            +---+    +---+---+---+---+ | +---+---+---+---+ 
			             |   |25 |27 |40 | | |43 |44 |53 |56 |  B1
				     +---+---+---+---+ | +---+---+---+---+ 
Page in main memory Total Winner                       | 
 +---+---+---+---+   +---+                             | 
 | 4 | 8 |   |   |   |11 |                             |
 +---+---+---+---+   +---+                             | 
                                                       | 
				     +---+---+---+---+ | +---+---+---+---+ 
			B2/B3 Winner |   |11 |30 |39 | | |42 |47 |54 |65 |  B2
			    +---+    +---+---+---+---+ | +---+---+---+---+ 
                            |11 |                      | 
                            +---+    +---+---+---+---+ | +---+---+---+---+ 
				     |13 |16 |19 |33 | | |37 |46 |52 |58 |  B3
				     +---+---+---+---+ | +---+---+---+---+ 


We repeat this process until the output page is filled, then transfer it to
external memory (as the first/next page in the final result, a sorted file) and
clear the output page in memory (contains no values again). If any of the pages
for B0-B3 have all their values crossed out, we replace that page in memory
with the next page from that block (if there is still another). Eventually,
every value in the blocks B0-B3 makes its way to the output page and appears
in the correct spot of the sorted file.
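
The walkthrough above can be reproduced with a small Python sketch of a
tournament (winner) tree. It makes simplifying assumptions: each block is an
ordinary in-memory list (no paging to or from external memory), values are
ints, and the names tournament_merge, head, and play are hypothetical. The
tree stores block indexes, so each winner knows which block it came from.

```python
import math


def tournament_merge(blocks):
    """Merge already-sorted lists by repeatedly taking the tournament winner."""
    k = len(blocks)
    pos = [0] * k                        # index of the next unused value per block

    def head(i):                         # current front value of block i
        return blocks[i][pos[i]] if pos[i] < len(blocks[i]) else math.inf

    # tree[1] is the Total Winner; tree[k + i] is the leaf for block i.
    tree = [0] * k + list(range(k))

    def play(n):                         # the smaller front value wins match n
        a, b = tree[2 * n], tree[2 * n + 1]
        tree[n] = a if head(a) <= head(b) else b

    for n in range(k - 1, 0, -1):        # play every match once, bottom-up
        play(n)

    out = []
    while k > 0 and head(tree[1]) != math.inf:
        w = tree[1]
        out.append(head(w))              # transfer the winner to the output
        pos[w] += 1                      # "cross out" the value, advance in block w
        n = (k + w) // 2
        while n >= 1:                    # replay only the matches on w's path
            play(n)
            n //= 2
    return out


b0 = [12, 24, 26, 34, 41, 49, 50, 57]
b1 = [4, 25, 27, 40, 43, 44, 53, 56]
b2 = [8, 11, 30, 39, 42, 47, 54, 65]
b3 = [13, 16, 19, 33, 37, 46, 52, 58]
merged = tournament_merge([b0, b1, b2, b3])
# merged starts with 4, 8, 11, 12, matching the pictures above
```

After each winner is removed, only the matches between its leaf and the root
are replayed, which is where the Log2 (number of blocks) cost per output value
comes from.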

I will animate this tournament-tree in class, showing how to use it as an
efficient multi-way merge from the pages representing the 4 blocks to a fully
sorted output block. It needs some small amount of extra storage to represent
the tree (and extra space so each winner knows from where it came, so new
values in that block can be used).

More generally, say you have to sort N values and have M words in memory
that you can use to do it. Assume N >> M. Then you should divide the N values
into B = N/M blocks, each of size M. You can transfer each of these blocks from
external memory into main memory, sort it, and then transfer the sorted block
back out to external memory. When doing the multi-way merge, the page size (P)
should be P = M / (B+1) (the amount of memory divided by the number of blocks,
with one extra page for output), which is also M / (N/M+1); if N/M >> 1 then
this is approximately M^2/N.
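
As a quick check of these formulas against the concrete numbers used earlier
(N = 10^9, M = 10^6), here is a small Python sketch. The function names
merge_plan and approx_page are hypothetical, and B is rounded up in case M
does not divide N evenly.

```python
def merge_plan(n, m):
    """Return (B, P): the number of blocks and the page size for the merge."""
    blocks = -(-n // m)          # B = N/M, rounded up
    page = m // (blocks + 1)     # P = M / (B+1): one page per block plus the output page
    return blocks, page


def approx_page(n, m):
    """The approximation P ~ M^2/N, reasonable when N/M >> 1."""
    return m * m / n


plan = merge_plan(10**9, 10**6)      # (1000, 999)
rough = approx_page(10**9, 10**6)    # 1000.0
```

The exact page size is 999 and the approximation gives 1,000, which agrees
with the rounding we did in step 2.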

If there are too many blocks, so the page size is too small for doing effective
transfers between external and main memory, we can apply this algorithm
recursively to sort fewer but longer blocks.