Program 3

Trees Implementing Collections:
Priority Queue via Max-Heap and Map via BST

Fundamental Data Structures
ICS-23


Introduction This programming assignment is designed to ensure that you know how to write code that processes trees, by implementing more efficient versions of the collection classes that we have been studying. First, we will write a priority queue where the add/remove methods are both O(Log N) instead of one being O(1) and one O(N). If we add/remove N values in such a priority queue, the complexity is O(N Log N) instead of O(N^2): this gives us a way to sort in time O(N Log N).

Second, we will write a map where the get/put methods are both O(Log N) (for a well-balanced tree: most random BSTs are well balanced) instead of each being O(N), as in the array implementation (and as in the worst case for pathological trees).
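As a concrete illustration of the O(N Log N) sorting claim above, here is a small sketch that sorts by doing N O(Log N) adds followed by N O(Log N) removes. It uses java.util.PriorityQueue (a min-heap) as a stand-in; the classes you write in this assignment are not used.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class PQSortDemo {
    // Sort values by adding them all to a priority queue, then removing
    // them all: N adds at O(Log N) each plus N removes at O(Log N) each
    // is O(N Log N) overall.
    public static int[] sort(int[] values) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();  // min-heap
        for (int v : values)
            pq.add(v);                       // O(Log N) per add
        int[] sorted = new int[values.length];
        for (int i = 0; i < sorted.length; i++)
            sorted[i] = pq.remove();         // O(Log N) per remove
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[]{3, 1, 2})));
    }
}
```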

When writing your classes, you will run each against my JUnit tests to verify (a bit too strong of a word here) that it is correct. You may also find it useful to test your classes with the DriverForOrderedCollection or DriverForMap, which you can run from collection.jar once you build a path to this library in the project. With this driver, you can individually test any methods in your classes interactively, and see their results (returned values and state changes, mostly using toString).

Download and unzip the following Eclipse project (Start) and use it to start working on this program. For each part of this assignment you will update and turn in a single .java file in the project (see the Checkmate submission for this assignment for more details).

Only one student should submit the assignment, but both students' names should appear in the comments at the top of each submitted program. Please turn in each program as you finish it, so that I can accurately assess the progress of the class as a whole during this assignment.

Print this document and carefully read it, marking any parts that contain important detailed information that you find (for review while you work on this assignment and before you turn in the files).


Problem #1: Priority Queue via Max-Heap In this program you will write a more efficient version of a priority queue, by implementing it as a max-heap (maximum value at the root). You can test this program with TestPriorityQueue, a JUnit test (how we will ultimately grade it), and with a Supermarket shopping simulation. With the DriverForOrderedCollection program, you can individually test any methods in your classes interactively, and see their results (returned values and state changes, mostly using toString).

The HeapPriorityQueue class must implement the PriorityQueue interface using standard priority queue semantics (any-in/highest priority out). You have already implemented this behavior using linked lists. Here, the underlying data structure is a heap, which uses an array to represent a tree compactly (so it has more similarities to the priority queue implemented by an array than the one implemented by a linked list). In this representation, the index of a child node can be computed from the index of its parent node, and vice versa, by the formulas discussed in class and reviewed below. So, we should think "tree" but program "array".

Because the main data structure is an array (even though we think "tree"), you should examine the code in the ArrayPriorityQueue class. Pay special attention to the toString, shallowCopy, ensureCapacity, and trimToSize methods.

Here are some issues to think about when writing your class.

  1. Array Length/Size and Mapping:
    • Recall that with the "simple mapping" of indexes (for parent i, the left child is 2*i and the right is 2*i+1; for child i (either left or right) the parent is i/2) the first node in the heap must be stored in index 1 in the array, not in index 0. This changes how the underlying array must be allocated, because an array that must store n values must be allocated with length n+1 (index 0 is unused; indexes 1 through n store the n values).

    • We can simplify array allocation (using the familiar 0 through n-1 allocation to store n values) if we make the mapping of indexes more complex: for parent i, the left child is 2*i+1 and the right is 2*i+2; for child i (either left or right) the parent is (i-1)/2. Become familiar with this mapping by looking at an example heap with 10-15 values. You may use either this mapping or the one described above. Given that the ensureCapacity and trimToSize methods work on arrays with indexes 0 to N-1, I recommend using the mapping explained in this paragraph.

    • Choose which way you will represent the array and its mapping (see above). Not carefully allocating/using the array can be a big source of bugs, so be careful. I suggest putting these mappings into four simple private static methods (inHeap, leftChild, rightChild, and parent, the first determining whether or not an index is in the heap being stored and the final three each taking an int index and returning another int index) to simplify your code (by hiding the mapping details), no matter which approach you take, at a slight decrease in speed. Or you can put the raw calculations directly into your code. One interesting approach is to use the methods first, to make debugging your code easier, and then, once the code works, remove the methods and "hardwire" in the calculations. If you don't find much speed difference, go back to using the methods, because they make the code clearer.

    • Finally, you should also write private helper methods for the following three operations, and call them wherever necessary.
      • swap: called with two indexes, it swaps the values stored in those indexes in the array representing the heap.
      • percolateUp: called with one index, it percolates the value at that index as high as it should go.
      • percolateDown: called with one index, it percolates the value at that index as low as it should go.

      The "percolate" methods do most of the interesting work in maintaining the heaps. This code is not trivial to write, but it can be written simply: I wrote percolateUp in 2 lines and percolateDown in 10 lines, although some of these lines are dense). I suggest you constantly think about simplifying your code for these methods as an aid for writing them correctly.

      Of course, to do so you must really understand what is going on during the percolation phases (hint: hand simulation). Recall that for the second mapping, once a value reaches index 0, there is no parent index to process; for a value at index i, it has a left child if 2*i+1 < objectCount and it has a right child if 2*i+2 < objectCount. In the alternative mapping the bounds are a bit different (based on both a different mapping and the fact that values are stored in indexes 1 through N instead of 0 through N-1).

      Don't attempt to process non-existent children! This is especially true in percolate down, where to find the biggest child for a possible swap, we may have to check one or both children.
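To make the discussion above concrete, here is a sketch of the 0-based ("second") mapping helpers, swap, and the two percolate methods, on a plain int[] max-heap. This is illustrative only: your HeapPriorityQueue stores objects and compares them with a Comparator, and its helper methods may look different.

```java
public class HeapSketch {
    private int[] heap;        // values stored in indexes 0..objectCount-1
    private int objectCount;   // number of values currently in the heap

    public HeapSketch(int capacity) { heap = new int[capacity]; }

    // The 0-based mapping, hidden inside four simple private methods
    private boolean inHeap(int i)        { return 0 <= i && i < objectCount; }
    private static int leftChild(int i)  { return 2*i + 1; }
    private static int rightChild(int i) { return 2*i + 2; }
    private static int parent(int i)     { return (i - 1) / 2; }

    private void swap(int a, int b) {
        int temp = heap[a]; heap[a] = heap[b]; heap[b] = temp;
    }

    // Move the value at index i up while it is bigger than its parent.
    private void percolateUp(int i) {
        for (; i > 0 && heap[i] > heap[parent(i)]; i = parent(i))
            swap(i, parent(i));
    }

    // Move the value at index i down while it is smaller than its biggest child;
    // note the inHeap checks, so we never examine a non-existent child.
    private void percolateDown(int i) {
        for (;;) {
            int biggest = i;
            if (inHeap(leftChild(i))  && heap[leftChild(i)]  > heap[biggest]) biggest = leftChild(i);
            if (inHeap(rightChild(i)) && heap[rightChild(i)] > heap[biggest]) biggest = rightChild(i);
            if (biggest == i) return;      // heap order restored
            swap(i, biggest);
            i = biggest;
        }
    }

    // add/remove are each O(Log N) via one percolate call
    public void add(int v) { heap[objectCount++] = v; percolateUp(objectCount - 1); }

    public int remove() {                  // highest value out
        int answer = heap[0];
        heap[0] = heap[--objectCount];     // last value goes to the top...
        percolateDown(0);                  // ...and is swapped down into place
        return answer;
    }
}
```

Note that this sketch omits ensureCapacity: add assumes there is room in the array.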

  2. Constructors:
    • The first two constructors create arrays of the appropriate lengths; remember to do this based on which mapping you are using.

    • The array version should do the linear (offline) operation to build a heap discussed in class, rather than adding values to the heap one at a time.

    • The iterable version should put all the values in a simple collection, convert it to an array, and use the array constructor to actually build the heap.
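The linear (offline) build mentioned above can be sketched as follows, again on a plain int[] for illustration: starting at the last index that has a child and working backward to the root, percolate each value down. The helper names here are illustrative, not necessarily the ones in your class.

```java
public class BuildHeapSketch {
    // Turn an arbitrary array into a max-heap in O(N) time (0-based mapping):
    // (heap.length - 2) / 2 is the last index that has a child.
    public static void buildHeap(int[] heap) {
        for (int i = (heap.length - 2) / 2; i >= 0; i--)
            percolateDown(heap, i, heap.length);
    }

    private static void percolateDown(int[] heap, int i, int count) {
        for (;;) {
            int biggest = i, l = 2*i + 1, r = 2*i + 2;
            if (l < count && heap[l] > heap[biggest]) biggest = l;
            if (r < count && heap[r] > heap[biggest]) biggest = r;
            if (biggest == i) return;
            int temp = heap[i]; heap[i] = heap[biggest]; heap[biggest] = temp;
            i = biggest;
        }
    }
}
```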

  3. toString
    • The toString method should return the heap values as an array. Here is an example of what my method returns (I'm using the second mapping) when using the DriverForOrderedCollection:

         toString = HeapPriorityQueue[3/4:[0]=a,[1]=c,[2]=b]

      Here is an example of what my method would return if it used the first mapping, when using the DriverForOrderedCollection:

         toString = HeapPriorityQueue[3/5:[0]=not used,[1]=a,[2]=c,[3]=b]

      Notice that this array is of length 5, because in my code index 0 is not used (so there is room for 4 values in the heap): the array lengths go 2, 3, 5, 9, 17, etc., always 1 bigger than a power of 2.

  4. Iterator: hasNext and Next
    • Recall that the iterator should iterate through the values in priority order (highest down to lowest). But unlike the array and linked implementations, the heap implementation NEVER stores all the values IN ORDER in the array, so we cannot just "traverse" the array, with each call to the next method taking O(1) time. Here are two ways that I can think of to implement an iterator.

      1. When constructing the iterator, actually allocate another array in the iterator, which contains all the values in the heap-array; then sort it. With this approach, creating an iterator, even if it is not used, requires O(N Log N) time (for sorting). Given this sorted array, you can traverse it in the iterator, as one would traverse any array, with each call to next requiring O(1) time.

      2. When constructing the iterator, actually store a SHALLOW COPY of the HeapPriorityQueue in the iterator. With this approach, creating an iterator, even if it is not used, requires O(N) time. Given this shallow copy, each call to next can call remove on the copy of the heap to get the next value, requiring O(Log N) time. shallowCopy must work correctly for this method to work; since it is copying an array, it is easy to write.

    • In both these approaches, the array storing the "real" priority queue is different from the array storing the one being iterated through. Constructing the iterator and iterating through all values is O(N Log N) in each case. Be careful: don't try to use an iterator when constructing the iterator! If you see "infinite recursion" or run out of storage space, this might be your problem.
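The second approach above can be sketched as follows, using java.util.PriorityQueue as a stand-in for your heap (its copy constructor plays the role of shallowCopy); in your class the iterator would copy the HeapPriorityQueue itself and call its remove.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Sketch of approach 2: the iterator holds its OWN copy of the heap and
// repeatedly removes the top value; the "real" heap is never touched.
public class CopyIteratorSketch implements Iterator<Integer> {
    private final PriorityQueue<Integer> copy;   // NOT the real heap

    public CopyIteratorSketch(PriorityQueue<Integer> heap) {
        copy = new PriorityQueue<>(heap);        // O(N) copy, same ordering
    }

    public boolean hasNext() { return !copy.isEmpty(); }

    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        return copy.remove();                    // O(Log N) per call
    }
}
```

Because java.util.PriorityQueue is a min-heap by default, the stand-in must be built with a reversed comparator to iterate highest-first; your max-heap needs no such adjustment.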

  5. Iterator: remove
    • As you might expect, this is going to be complicated to describe (but it turns out, not very hard to implement). Recall that we know how to remove a value from the top of the heap: put the value at the end of the array at the top, and swap it down to its correct position. We can remove a value from the "middle" of a heap almost as simply: put the value at the end of the array in the "middle" (overwriting the value there), and swap it up or down to its correct position (depending on how it compares to its parent and children, if any). You might think that the value swapped to the middle will always be less than its parent, but that is not so. Construct an example of a heap so that the last value is greater than the parent of the value being removed: my example was a heap with 6 nodes (and I think it is minimal).

      So, if we keep track of which value we have "seen", and are asked in the iterator to remove it, we can do a linear search of the REAL heap array (not the one the iterator is using: the value is gone from there once returned by next), find that value (use ==; the total operation might take O(N)), and then use this algorithm to remove it from the REAL heap.

    • In fact, you should discover that remove from the heap itself and remove from the iterator share code that can be made into a parameterized private method, usable in both places (to percolate up/down to the correct position: see percolateUp and percolateDown described in the first section of this document).

    • Implement next and hasNext first, and remove at the very end. Feel free to ask questions to help you understand, which is just one of many things questions are for.
If you have problems getting the iterator correct (or just before you implement it), test the other methods with DriverForOrderedCollection. Recall the JUnit test does most of its testing using the toArray method, which will not work unless the iterator works.
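Removing from the "middle" of a heap, as described above, can be sketched on a plain int[] like this; the method and parameter names are hypothetical. Note that calling both percolates is safe: at most one of them actually moves the value.

```java
public class MiddleRemoveSketch {
    // Remove the value at index i from heap[0..count-1]; returns the new count.
    public static int removeAt(int[] heap, int count, int i) {
        heap[i] = heap[--count];        // move the last value into the hole
        percolateUp(heap, i);           // the moved value may belong higher...
        percolateDown(heap, i, count);  // ...or lower; at most one call moves it
        return count;
    }

    private static void percolateUp(int[] heap, int i) {
        while (i > 0 && heap[i] > heap[(i - 1) / 2]) {
            int p = (i - 1) / 2;
            int temp = heap[i]; heap[i] = heap[p]; heap[p] = temp;
            i = p;
        }
    }

    private static void percolateDown(int[] heap, int i, int count) {
        for (;;) {
            int biggest = i, l = 2*i + 1, r = 2*i + 2;
            if (l < count && heap[l] > heap[biggest]) biggest = l;
            if (r < count && heap[r] > heap[biggest]) biggest = r;
            if (biggest == i) return;
            int temp = heap[i]; heap[i] = heap[biggest]; heap[biggest] = temp;
            i = biggest;
        }
    }
}
```

The 6-node heap [9, 3, 8, 2, 1, 7] shows why percolateUp is needed: removing the value 2 at index 3 moves the last value 7 into index 3, and 7 is greater than its parent 3, so it must percolate up.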

Finally, try running the supermarket simulation program with the ArrayPriorityQueue and your working HeapPriorityQueue. Choose 2 checkout lines: one with the default number of items (effectively infinite) and one with 6 or fewer items. Run the simulation on 5 shoppers with tracing set to true to help you understand what is happening; then run it with 50,000 or 100,000 shoppers. With 50,000 shoppers, my speedup using the HeapPriorityQueue over the ArrayPriorityQueue was from 24 seconds to 1.27 seconds.


Problem #2: Map via BST In this program you will write a more efficient version of a map, by implementing it as a binary search tree. We have seen that for all but pathological trees (and randomly built trees are much closer to being well balanced than pathological), most operations have a complexity of O(Log N); we can say O(tree-height) to play it safe.

You can test this program with TestMap, a JUnit test (how we will ultimately grade it), and with an updated version of the WordGenerator program (included in the download) from Program #1 part 6, on the file huck.txt (also included in the download). With the DriverForMap program, you can individually test any methods in your classes interactively, and see their results (returned values and state changes, mostly using toString). I will supply a toString method that uses recursion to return a String that prints as a tree rotated 90 degrees to the left, so you can more easily see the structure of the tree you are building.

Before running the JUnit test, you must get the iterator's hasNext and next methods working correctly. Otherwise many tests will fail, because many tests use the iterator indirectly: e.g., they construct sets of values by iterating through maps; if the iterator works incorrectly, these sets will not be constructed correctly, and the JUnit test will fail for this reason, not because the operations really being tested are incorrect.

The BSTMap class must implement the Map interface using standard map semantics. You should look at my ArrayMap class, which contains code that implements the same Map interface, some of which (or code like it) will belong in the BST implementation. Here, the underlying data structure is a binary search tree, whose node class is defined completely within the implementation, which also includes some helper methods like locate, add, and remove; these are either completely written or have just their headers specified.

Here are some issues to think about when writing your class.

  1. Multiple Tree "Searches"
    • Because recursive helper methods that mutate the BST, like insert, must return a BST node, I found it convenient to first call the locate helper method, and based on the result, call (or not call) one of the recursive mutator methods; sometimes I can manipulate the located reference directly. For example, in my put method, my code first tries to locate the key in the tree: if successful, my code changes the value associated with that key (not changing the structure of the tree); if it fails, my code inserts a new entry into the tree, with the appropriate key and value.

    • At worst, these two searches slow down some methods (i.e., put, remove in the map, remove in the iterator) by a factor of two. There is a way to squeeze a bit more speed out of the code, by avoiding duplicate searches, but doing so requires carefully changing the recursive methods and using another instance variable. I suggest that you code using the "double search" approach until you have everything working; then, if you have the inclination and some more time, explore writing the more complicated code; wait until then to even start thinking about this improvement. The changes are small in number, but high in technical complexity.
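The "double search" put described above can be sketched as follows. The node class TN and the helper names locate and insert are illustrative; BSTMap's actual names and signatures (and its use of a Comparator rather than compareTo) may differ.

```java
import java.util.AbstractMap;
import java.util.Map;

public class BSTPutSketch {
    static class TN {                          // tree node holding a Map.Entry
        Map.Entry<String, Integer> entry;
        TN left, right;
        TN(String k, Integer v) { entry = new AbstractMap.SimpleEntry<>(k, v); }
    }

    private TN root;

    // First search: returns the node storing key, or null if key is absent.
    private TN locate(TN t, String key) {
        if (t == null) return null;
        int c = key.compareTo(t.entry.getKey());
        if (c == 0) return t;
        return locate(c < 0 ? t.left : t.right, key);
    }

    // Second search: a recursive mutator, so it must return the (possibly
    // new) root of the subtree it is called on.
    private TN insert(TN t, String key, Integer value) {
        if (t == null) return new TN(key, value);
        if (key.compareTo(t.entry.getKey()) < 0)
            t.left = insert(t.left, key, value);
        else
            t.right = insert(t.right, key, value);
        return t;
    }

    public Integer put(String key, Integer value) {
        TN found = locate(root, key);          // search #1
        if (found != null)                     // key present: no structural change
            return found.entry.setValue(value);
        root = insert(root, key, value);       // search #2: add a new entry
        return null;                           // per Map semantics
    }
}
```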

  2. Map.Entry
    • Recall that the Map interface defines the Entry interface, which we refer to as Map.Entry. When we build our tree, we need to use a class that implements Map.Entry at each node. Note that the AbstractMap class (which all classes that implement Map use as a superclass) defines a concrete class called SimpleEntry, which implements Map.Entry. So, our code can construct and use SimpleEntry objects (see its constructor).
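For example, constructing and mutating a SimpleEntry looks like this:

```java
import java.util.AbstractMap;
import java.util.Map;

public class EntryDemo {
    public static void main(String[] args) {
        // SimpleEntry's constructor takes the key and value directly
        Map.Entry<String, Integer> e = new AbstractMap.SimpleEntry<>("a", 1);
        // setValue returns the previous value, per the Map.Entry contract
        Integer old = e.setValue(2);
        System.out.println(e.getKey() + "->" + e.getValue() + " (was " + old + ")");
    }
}
```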

  3. toString
    • I have written a toString method that returns a 2-d "picture" of the BST (rotated left by 90 degrees). It is useful for quickly looking at a small BST. Here is an example of what my method returns when using the DriverForMap, showing 5 keys and their values.

      toString = BSTMap[5:
          x->5
        m->4
            c->3
          b->2
            a->1
        ]
      Here m is the root, with 3 nodes in its left subtree (whose root key is b) and 1 node in its right subtree (whose root key is x).

  4. Iterator: hasNext and Next
    • As in the heap priority queue, the iterator for this class will iterate through another data structure: in this case a Stack; its maximum size will be the height of the tree. We discussed this kind of iterator at the end of the notes covering Binary Search Trees and in class. I'll summarize the algorithm below; see the notes for details.

      To construct the iterator, allocate a Stack (use an ArrayStack data structure) and add to it the root (if the BST isn't empty) and all the left descendants of the root.

      hasNext returns whether the Stack is not empty. next removes and returns the reference at the top of the Stack and adds to the Stack the root of its right subtree (if it isn't empty) and all the left descendants of the root of its right subtree (see the code you wrote in the constructor). It also remembers this reference, in case the iterator's remove is called.

    • Iterator: remove
      • Use the removeAt method to delete from the BST the previous value returned by next. This operation does not affect the Stack.
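The stack-based in-order iterator summarized above can be sketched as follows. ArrayDeque stands in for the course's ArrayStack, and TN is a hypothetical node class; remove is omitted, since it just passes lastReturned to the BST's removeAt.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class InOrderIteratorSketch implements Iterator<String> {
    static class TN {                          // hypothetical BST node
        String key;
        TN left, right;
        TN(String k) { key = k; }
    }

    private final Deque<TN> stack = new ArrayDeque<>();
    private TN lastReturned;                   // remembered for remove

    public InOrderIteratorSketch(TN root) {
        pushAllLeft(root);                     // root and all its left descendants
    }

    private void pushAllLeft(TN t) {
        for (; t != null; t = t.left)
            stack.push(t);
    }

    public boolean hasNext() { return !stack.isEmpty(); }

    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        lastReturned = stack.pop();
        pushAllLeft(lastReturned.right);       // right subtree's leftmost chain
        return lastReturned.key;
    }
}
```

Built over the 5-key tree shown in the toString example (root m, with b, a, c on the left and x on the right), this iterator returns a, b, c, m, x in order.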

You should also run the updated WordGenerator program, with the large text file (huck.txt). The update allows you to "comment" out code to select either the ArrayMap or BSTMap (and its appropriate comparator): it is that simple to change which data structure implements the data type. It also tracks and prints the amount of time needed to build the map: on my machine it took 18 minutes using ArrayMap and 1 second using BSTMap (a speed-up of a factor of over 1,000), when I used an order statistic of 3. Both ultimately produce a map that contains 90,705 keys, with each key mapping to a list of between 1 and 46 words.