B-Trees: An Efficient Structure for Searching Data on External Memory

In this lecture we will examine data structures and algorithms that use more space than is available in main memory. We will analyze them with respect to how often they move blocks of data between main and external memory (discussed in the previous lecture). Typically the cost of such memory transfers dominates the cost of executing code on the block while it is in memory. That is, a transfer between main and external memory might take 10 milliseconds, while processing the data in the block might take microseconds (a few 1/1000ths of a millisecond). A processor executing 10^9 instructions per second can execute 10^7 (10 million) instructions in 10 ms. So, we can virtually ignore the time taken to process each block retrieved from external memory.

Searching:

We will start with a searching task, using a special kind of N-ary Search Tree (NST - contrast to BST) called a B-tree. There are many different kinds of B-trees. We characterize each B-tree in terms of its order, b: each memory block (node) stores up to b-1 keys and b subtrees/children. For b = 5, for example, we can visualize each node in the tree as containing two arrays: the key array (storing the keys, aka values, in the B-tree) and the subtree/children array (storing references to children nodes, the roots of subtrees). We will use the familiar terminology of root, internal node, and leaf to describe these trees. For example, here is the data for a B-tree of order 5: 4 keys, 5 children.

             0   1   2   3
           +---+---+---+---+
  key      |   |   |   |   |
           +---+---+---+---+

             0   1   2   3   4
           +---+---+---+---+---+
  subtree  |   |   |   |   |   |
           +---+---+---+---+---+

By the order property (stated below)
  all values in subtree[0] are less than key[0],
  all values in subtree[1] are greater than key[0] and less than key[1],
  all values in subtree[2] are greater than key[1] and less than key[2],
  all values in subtree[3] are greater than key[2] and less than key[3],
  all values in subtree[4] are greater than key[3].

These arrays don't have to be filled, but all nodes except the root must be at least half filled (see the structure property below).

Ultimately, b will be a large value, such that storing the arrays in the node will take up the size of a large block easily readable from external storage (typically containing thousands of keys/subtree references). So, if it were convenient to transfer blocks of 1,000 words of memory, we would choose b to be 500, having 499 (b-1) keys and 500 (b) subtrees/children. It is not unreasonable to transfer blocks of 10K or even 100K or 1M words.

In the examples below, we will choose b to be 5. This b is big enough to be interesting (and different from a BST) but small enough to cause lots of node merges and splits when doing insertion and deletion (if b were 500, we would have to insert 500 values before we required a second block of memory).
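To make the two-array picture concrete, here is a minimal Python sketch of a node together with a check of the ordering described above. The names BNode and satisfies_order are illustrative assumptions, not part of the lecture; a leaf is represented here by an empty subtree list rather than by an array of null references.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BNode:
    """One B-tree node (one block): up to b-1 keys and up to b subtrees."""
    keys: List[int] = field(default_factory=list)
    subtrees: List["BNode"] = field(default_factory=list)  # empty for a leaf

    def is_leaf(self) -> bool:
        return not self.subtrees


def satisfies_order(node: BNode,
                    lo: Optional[int] = None,
                    hi: Optional[int] = None) -> bool:
    """True if keys are sorted and each subtree lies strictly between
    the keys that separate it (lo/hi are the bounds inherited from above)."""
    ks = node.keys
    # Keys must be strictly increasing left to right.
    if any(a >= b for a, b in zip(ks, ks[1:])):
        return False
    # Every key must lie strictly inside the inherited bounds.
    if ks and ((lo is not None and ks[0] <= lo) or
               (hi is not None and ks[-1] >= hi)):
        return False
    if node.is_leaf():
        return True
    # A non-leaf with k keys must have k+1 subtrees.
    if len(node.subtrees) != len(ks) + 1:
        return False
    bounds = [lo] + ks + [hi]
    return all(satisfies_order(child, bounds[i], bounds[i + 1])
               for i, child in enumerate(node.subtrees))


root = BNode([10, 20], [BNode([1, 5]), BNode([12, 15]), BNode([25, 30])])
print(satisfies_order(root))
```

The check only covers ordering; the half-full and equal-depth requirements would need a separate pass.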
We will now characterize the order and structure properties of B-trees.

Order Property
  1) The keys in a node are sorted left to right: key[i] < key[i+1]
  2) For each key[k] in a node, all values in subtree[k] are less than key[k] and all values in subtree[k+1] are greater than key[k]

Structure Property of B-trees of order b
  1) All leaves are at the same depth and store their information as keys only (all the subtree references for leaves, of course, store null, so this space gets wasted)
  2) Every non-leaf node with k keys contains k+1 children
  3) Every node has at most b-1 keys
  4) Every node except the root has >= b/2 keys (it is at least 1/2 full; b/2 uses integer division, so 2 keys when b = 5)
  5) The root may be empty and it may also be a leaf

Here is the pseudocode for searching for, inserting, and removing values in a B-tree. Aside: I expect you to know (i.e., memorize) how to search and insert, but not delete (which is a bit complicated).

Searching for x in a node (start searching at the root):
  If there is no node (we followed a null reference), x is not in the tree
  If there is a node
    Using the order property, do a binary search on the keys in the node
      a) if there is an i such that key[i] = x, x is in the tree at this node
      b) otherwise,
           if x < key[0], search the subtree at index 0
           if x > the last key, key[k], search the subtree at index k+1
           if key[i] < x < key[i+1], search the subtree at index i+1

  *Note that we do a binary search on each B-Node block, which takes a trivial amount of time compared to bringing the block into memory.
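The search steps above can be sketched in Python as follows. This is only an illustration under assumptions: Node is a hypothetical in-memory stand-in for a block, and the counter blocks_read models one external-memory transfer per node visited. The standard bisect_left call does the binary search and directly yields the subtree index to follow.

```python
from bisect import bisect_left


class Node:
    """Hypothetical stand-in for one block: sorted keys plus subtrees."""
    def __init__(self, keys, subtrees=None):
        self.keys = keys
        self.subtrees = subtrees or []   # empty list marks a leaf


def search(node, x):
    """Return (found, blocks_read) for value x, starting at node."""
    blocks_read = 0
    while node is not None:
        blocks_read += 1                 # bringing this block into memory
        # First index i with keys[i] >= x (binary search within the block).
        i = bisect_left(node.keys, x)
        if i < len(node.keys) and node.keys[i] == x:
            return True, blocks_read
        # i is 0 if x < keys[0], len(keys) if x > last key,
        # and otherwise the index between the two surrounding keys.
        node = node.subtrees[i] if node.subtrees else None
    return False, blocks_read


root = Node([10, 20], [Node([1, 5]), Node([12, 15]), Node([25, 30])])
print(search(root, 15))
print(search(root, 13))
```

An iterative loop is used here instead of recursion; the number of blocks read is the same either way.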
Insertion of x (assuming it is not already in the tree; for simplicity assume unique values, as in sets and maps):
  Search the tree to find in which LEAF x belongs
  If there is room in that leaf, put x in the key array at the correct index
  If there is NO room in the leaf (the node's key array is full)
    Choose the median value among the values in the leaf and the new value
    Using the median as the separator value, split the leaf into two new nodes (one with all values < median, one with all values > median)
    Insert the median in the parent's node and adjust the children (now 1 more)
      (this might cause more splitting in the parent, going back to the If above)

It is possible, if this process goes back to a full root node, that the root node will itself need to split into two children of a newly created root, which will have just one value (allowed at the root by the structure property). So, unlike BSTs, B-trees grow at the root, not the leaves. This ensures the property that all the leaves are at the same depth.

Note that because the median is inserted in the parent, the two children will be well balanced, each containing half the values in the original leaf node that was split (no matter in what order they were inserted). Also, each split node will be only 1/2 full, so it will have room to add more keys later, without the tree growing.

Deletion of x:

The general idea of deletion here is actually similar to deletion in a BST. Either the value is deleted from a leaf (which may require merging the leaf and its ancestors), or from an internal node (which replaces x by the largest value smaller than x or the smallest value bigger than x, and then deletes that value from its leaf, according to the leaf-deleting rules; such a value will always be in a leaf).
  Search the tree to find in which node x is present (let's assume only 1 node)
  If it is not present, "there is nothing left to do" (TINLTD)
  If the node containing x is a leaf
    remove x
    if the node that contained x is also the root, TINLTD
    if the # of values is still >= b/2, TINLTD
    if the # of values is now < b/2 and the node is not the root, we call the node with x removed "deficient" (it doesn't have >= b/2 values); perform the code labeled "REBALANCE" below, which will restore a deficient node but possibly make its parent deficient (possibly processing nodes all the way up to the root).
  If the node containing x is an internal node (not a leaf)
    if x separates two minimal subtrees (each having <= b/2 values), remove x and combine these subtrees; REBALANCE the node that contained x if needed
    else, one or both subtrees are non-minimal (have > b/2 values): replace x by an "extremal" value (the largest value in the left non-minimal subtree or the smallest value in the right non-minimal subtree) and then remove that value from its leaf (see above for how to delete from a leaf). REBALANCE the leaf node that contained the extremal value if needed

REBALANCE: A deficient node (DN) has < b/2 values

Here is the "rebalance" code when deleting x from a node that becomes deficient. This code may terminate, or move the deficiency from child to parent, where it is executed again. Note that it is OK for a root to be deficient. Let DN be the deficient node (from which x was removed).
Choose one of the 3 possibilities below. Note that 1 and 2 are symmetric for right/left siblings.

  1) If DN's right sibling has > b/2 values (it won't be deficient after moving one value)
       add the parent's separator of these siblings to the DN at the end (remove it from the parent)
       update the parent's separator to be the first element in the right sibling (remove it from the right sibling, shifting the keys/subtrees left by 1)
       if DN is not a leaf, append what was the first child of the right sibling as the last child in DN

  2) If DN's left sibling has > b/2 values
       add the parent's separator to the DN at the beginning, shifting the keys/subtrees right by 1 (remove it from the parent)
       update the parent's separator to be the last element in the left sibling (remove it from the left sibling; no shifting is needed there)
       if DN is not a leaf, insert what was the last child of the left sibling as the first child in DN

  3) If DN's left and right siblings are both too small (<= b/2 values)
       create a new node with DN's values, the values of one sibling, and the separator in the parent of DN and that sibling (it will have <= b-1 values)
       remove the separator from the parent; if the parent is now deficient (but not the root), repeat this rebalancing on the parent

Analysis:

Recall that we are choosing a B-tree of order b. In the best case, the height of an N value tree will be Log_b N (with each node filled with values); in the worst case it will be Log_(b/2) N (with each node half filled with values). The idea is to choose b to be as big as possible (the biggest possible base for the logarithm) so that as many keys and subtree references as possible fit into one block of memory that can be easily transferred between main and external memory. Recall that getting/putting information from/to external memory takes about the same amount of time no matter how much information is transferred, so let's transfer a lot each time. We make b so big that it stores thousands of keys and subtree references in each memory block.
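As a rough numeric illustration of these height bounds, the sketch below counts how many levels are needed before fanout^levels reaches N. The function name levels_needed and the sample values N = 10^9 and b = 1,000 are assumptions chosen for the example; the count treats every node as having exactly `fanout` children, so it only approximates the logarithms above.

```python
def levels_needed(n_values, fanout):
    """Smallest h with fanout**h >= n_values, using exact integer arithmetic."""
    levels, capacity = 0, 1
    while capacity < n_values:
        capacity *= fanout
        levels += 1
    return levels


N = 10**9   # assumed number of values
b = 1000    # assumed order of the B-tree

print("full nodes (base b):       ", levels_needed(N, b))
print("half-full nodes (base b/2):", levels_needed(N, b // 2))
print("binary tree (base 2):      ", levels_needed(N, 2))
```

Under these assumptions the difference between full and half-full nodes is a single level, while a binary tree needs about ten times as many levels as the full B-tree.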
Note that for complexity class analysis, all log bases are equivalent, but Log_2 N is going to be a lot bigger than Log_1,000 N (by a factor of about 10).

For all the B-tree operations, we must get the block for the root (we might as well pre-fetch/cache this permanently in main memory), then get the block for each subtree node we visit on the path to each x we are searching for, inserting, or deleting. In the worst case for deletion, we must find the extremal value in one of the leaves and delete it. Thus, in the worst case we do one access to external memory for each depth in the tree from root to leaf (and all leaves are at the same depth). In the worst case of insertion/deletion we must revisit every node on the path back to the root. So, in the worst case we do 2(Log_(b/2) N) block transfers: one from external memory to main memory for each node on the way down, and one back to external memory for each node on the way up.

Finally, I will pass out a sheet of paper and do examples on the board in class today, illustrating inserting and removing values from a B-tree of order 5: so every node (but the root) will store between 2 and 4 values (and 3 to 5 references to subtrees). The pictures are available online (see the Weekly Schedule for today's lecture).

Sorting:

Our sorting analysis will be much shorter, because we are going to use a variant of Merge Sort, which can be run efficiently in terms of external storage use. To be concrete, suppose that we need to sort 10^9 (a billion) ints stored in a file using only about 10^6 (1 million) words of storage for main memory. So, we can store only about 1/1,000th of the data in memory at any given time.

  1) We start by separating all the input data into blocks of 10^6 ints (there are 1,000 such blocks), bringing each into memory, sorting each using a standard O(N Log_2 N) sorting method, and writing each back out to a sorted file.

  2) Now we need to bring the early part of each of these 1,000 blocks (the parts with the smallest numbers) into main memory, and reserve space in main memory for an output block as well.
     Given 10^6 words of storage and 1,001 blocks (let's make the output block the same size as each input block), we can allocate 10^6/1,001 ints for each block: that is 999 ints, but let's assume for simplicity each is 1,000 ints (so we really need 1,001,000 words of storage). Let's call these 1,000 ints a "page". Note that since we are using pages that are each 1/1,000th the size of each input block, we will have to perform 1,000 page transfers from file to main memory to process all the values in each input block. We will see that we need each page only once, using all the values in it before bringing in the next page, and never needing those values again.

  3) We set up a tournament-tree (talked about more below) that allows us to produce a new value for the output page in Log_2 1000 operations (~10). We repeatedly do this until all the input blocks are exhausted (refilling their pages as necessary). If after putting a new value in the output page it is filled, we transfer the page to disk and start with a new empty output page. If we consume all the values in any input page, we overwrite that page by transferring the next page from that block into main memory (or mark the block as exhausted, if all 1,000 pages of the block have been transferred).

We visualize this as follows: recall that each block contains 1,000 pages, and only the first page of every block is in main memory.

  +-+-+...+-+-+
  | | |   | | |  Input Block 0: 1 Page in main
  +-+-+...+-+-+

  +-+-+...+-+-+
  | | |   | | |  Input Block 1: 1 Page in main
  +-+-+...+-+-+
                                                          +-+-+...+-+-+
      .....                     tournament-tree           | | |   | | |  Output Block: 1 Page in main
                                                          +-+-+...+-+-+
  +-+-+...+-+-+
  | | |   | | |  Input Block 998: 1 Page in main
  +-+-+...+-+-+

  +-+-+...+-+-+
  | | |   | | |  Input Block 999: 1 Page in main
  +-+-+...+-+-+

Overall, we must transfer each input page 3 times: once from external memory -> main memory for the initial sort (step 1), once from main memory -> external memory when that sort is finished, and once more from external memory -> main memory for the tournament. In addition, we create a new output file that contains all the sorted values, and each output page is transferred once from main memory -> external memory. Thus, we need a total of 4 memory transfers per page.

A tournament-tree quickly produces the lowest value from the 1,000 input blocks, over and over again, while transferring pages when necessary. This is called multi-way merging. To examine tournament-trees more simply, let's assume we are using 4 input blocks (easiest if a power of 2) with each having 8 values in it, and 4 values in each page, so each block consists of 2 pages and each block has 1 page (4 words) in main memory at a time. Assume that each input block has already been sorted (see step 1 above).

The tournament-tree is designed the way a single-elimination tournament among M teams would be designed (e.g., the NCAA Sweet 16 tournaments). Here, the winner is the smaller value.

                                                 Pages in           |  Pages in
                                                 main memory        |  external memory
                                                 +---+---+---+---+  |  +---+---+---+---+
                                   B0/B1 Winner  |12 |24 |26 |34 |  |  |41 |49 |50 |57 |  B0
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                      | 4 |                         |
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                                 | 4 |25 |27 |40 |  |  |43 |44 |53 |56 |  B1
                                                 +---+---+---+---+  |  +---+---+---+---+
Page in main memory  Total Winner                                   |
+---+---+---+---+       +---+                                       |
|   |   |   |   |       | 4 |                                       |
+---+---+---+---+       +---+                                       |
                                                                    |
                                                 +---+---+---+---+  |  +---+---+---+---+
                                   B2/B3 Winner  | 8 |11 |30 |39 |  |  |42 |47 |54 |65 |  B2
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                      | 8 |                         |
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                                 |13 |16 |19 |33 |  |  |37 |46 |52 |58 |  B3
                                                 +---+---+---+---+  |  +---+---+---+---+

Now, we transfer 4 to the first spot in the output page. We cross out 4 in B1 and advance to 25 in B1. We fill in B0/B1 Winner with 12 (since 12 < 25) and fill in Total Winner with 8 (since 8 < 12).

                                                 Pages in           |  Pages in
                                                 main memory        |  external memory
                                                 +---+---+---+---+  |  +---+---+---+---+
                                   B0/B1 Winner  |12 |24 |26 |34 |  |  |41 |49 |50 |57 |  B0
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                      |12 |                         |
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                                 |   |25 |27 |40 |  |  |43 |44 |53 |56 |  B1
                                                 +---+---+---+---+  |  +---+---+---+---+
Page in main memory  Total Winner                                   |
+---+---+---+---+       +---+                                       |
| 4 |   |   |   |       | 8 |                                       |
+---+---+---+---+       +---+                                       |
                                                                    |
                                                 +---+---+---+---+  |  +---+---+---+---+
                                   B2/B3 Winner  | 8 |11 |30 |39 |  |  |42 |47 |54 |65 |  B2
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                      | 8 |                         |
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                                 |13 |16 |19 |33 |  |  |37 |46 |52 |58 |  B3
                                                 +---+---+---+---+  |  +---+---+---+---+

Now, we transfer 8 to the second spot in the output page. We cross out 8 in B2 and advance to 11 in B2. We fill in B2/B3 Winner with 11 (since 11 < 13) and fill in Total Winner with 11 (since 11 < 12).

                                                 Pages in           |  Pages in
                                                 main memory        |  external memory
                                                 +---+---+---+---+  |  +---+---+---+---+
                                   B0/B1 Winner  |12 |24 |26 |34 |  |  |41 |49 |50 |57 |  B0
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                      |12 |                         |
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                                 |   |25 |27 |40 |  |  |43 |44 |53 |56 |  B1
                                                 +---+---+---+---+  |  +---+---+---+---+
Page in main memory  Total Winner                                   |
+---+---+---+---+       +---+                                       |
| 4 | 8 |   |   |       |11 |                                       |
+---+---+---+---+       +---+                                       |
                                                                    |
                                                 +---+---+---+---+  |  +---+---+---+---+
                                   B2/B3 Winner  |   |11 |30 |39 |  |  |42 |47 |54 |65 |  B2
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                      |11 |                         |
                                      +---+      +---+---+---+---+  |  +---+---+---+---+
                                                 |13 |16 |19 |33 |  |  |37 |46 |52 |58 |  B3
                                                 +---+---+---+---+  |  +---+---+---+---+

We repeat this process until the output page is filled, then transfer it to external memory (as the first/next page in the final result, a sorted file) and clear the output page in memory (so it contains no values again). If any of the pages for B0-B3 have all their values crossed out, we replace that page in memory with the next page from that block (if there is still another). Eventually, every value in the blocks B0-B3 makes its way to the output page and appears in the correct spot of the sorted file.

I will animate this tournament-tree in class, showing how to use it as an efficient multi-way merge from the pages representing the 4 blocks to a fully sorted output block. It needs some small amount of extra storage to represent the tree (and extra space so each winner knows which block it came from, so the next value in that block can be used).

More generally, say you have to sort N values and have M words in memory that you can use to do it. Assume N >> M. Then you should divide the N values into B = N/M blocks, each of size M. You can transfer each of these blocks from external memory into main memory, sort it, and then transfer the sorted block back out to external memory.
When doing the multi-way merge, the page size (P) should be P = M / (B+1) (the amount of memory divided by the number of blocks, with one extra block for output), which is also M / (N/M + 1); if N/M >> 1, then this is approximately M^2/N. If there are too many blocks, so the page size is too small for doing effective transfers between external and main memory, we can apply this algorithm recursively to sort fewer but longer blocks.
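The page-size formula and the paged multi-way merge can be sketched in Python. This is a simplified illustration under stated assumptions: Python lists stand in for blocks on external memory, a binary heap from the standard heapq module is used in place of the tournament-tree (it also yields the smallest remaining value in about Log_2 B steps), and the names page_size and multiway_merge are hypothetical. The four blocks are the B0-B3 values from the example above.

```python
import heapq


def page_size(m_words, n_values):
    """P = M / (B + 1), where B = ceil(N / M) is the number of sorted blocks."""
    blocks = -(-n_values // m_words)          # ceiling division
    return m_words // (blocks + 1)


def multiway_merge(blocks, page_len):
    """Merge sorted blocks while holding one page per block in memory.
    Returns (merged values, number of input pages read)."""
    pages, next_start, heap, reads = [], [], [], 0
    for b, block in enumerate(blocks):
        pages.append(block[:page_len])        # first page of each block
        next_start.append(page_len)
        if pages[b]:
            reads += 1
            heap.append((pages[b][0], b, 0))  # (value, block, index in page)
    heapq.heapify(heap)

    merged = []
    while heap:
        value, b, i = heapq.heappop(heap)     # current overall winner
        merged.append(value)
        i += 1
        if i == len(pages[b]):                # page used up: fetch the next one
            pages[b] = blocks[b][next_start[b]:next_start[b] + page_len]
            next_start[b] += page_len
            i = 0
            if pages[b]:
                reads += 1
        if pages[b]:                          # block not exhausted yet
            heapq.heappush(heap, (pages[b][i], b, i))
    return merged, reads


B0 = [12, 24, 26, 34, 41, 49, 50, 57]
B1 = [4, 25, 27, 40, 43, 44, 53, 56]
B2 = [8, 11, 30, 39, 42, 47, 54, 65]
B3 = [13, 16, 19, 33, 37, 46, 52, 58]

merged, reads = multiway_merge([B0, B1, B2, B3], page_len=4)
print(merged[:3], reads)
print(page_size(10**6, 10**9))
```

Output pages are not modeled here; writing `merged` out in chunks of page_len values would add one transfer per output page.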