ICS H23, Winter 2005

Lab #5

Due: Wednesday February 16, 2005, noon

Instructions

The Task:

Imagine you've been hired as a data jockey by a national chain of retail stores, due to open shortly. Marketing has (what they think) is a brilliant approach to notifying people about these new stores. In addition to standard print advertising, telemarketers are going to telephone all homes within a 10 mile radius of each new store to personally invite consumers to come on in during the grand openings.

The marketing research branch has prepared a list of the ZIP codes which correspond to these invitation areas; it has obtained a disk file with an ASCII file which contains, for every area code and prefix, a corresponding ZIP code. (Since one prefix might be found in several ZIP code areas, there may be several entries in the file which have the same area code and prefix, but have different ZIPs.) They call this the master file. Marketing has also procured a list (on paper) of telephone customers' names, addresses and phone numbers, in phone number order.

Marketing wants a look-up program that, for a given ZIP code, will show the area codes and prefixes that correspond to it; this program will be used by the telemarketing branch. A person there will have the program running; as a telemarketer requests the phone information for a ZIP, this coordinator will use your program to produce the information to the requesting telemarketer. This is not a one-time program. As each ZIP code area is exhausted, another set of numbers will be needed; as new stores open, the program will be needed again. And marketing, always time-pressured, wants a request for a ZIP's area codes and prefixes as quickly as possible.

Marketing also informs you that the firm from which the master file was obtained sends an update file containing from about 25 to 500 additional area code/prefix/ZIP items once every three months. These new codes must be placed into the master file. (Since ZIP codes are rarely removed from service, your program need not handle deletions.)

Your Job

Your job is to write a program that will perform three chores: 1) read in the original data file and stores it in a B-tree 2) display the area codes and prefixes that correspond to a typed-in ZIP code and 3) add items to the master ZIP file. Here's the general flow:

When the applet (or JFrame Program) is run, two text fields appear on the applet window. One has a label which prompts the user for the name of an update file. The other prompts the user for a zip code. When the return button is hit in either text field, the ActionPerformed method of your applet is called. Your applet should determine which textfield was used and perform the corresponding action. You can use your program to build the original data structure by typing in the name of the data file which is provided for you. Your program should then read the data and insert each data item into the tree.

Technical Issues

  1. Remember that there are often several area code/prefix pairs with the same ZIP code, and it's also possible that the same area code/prefix pairs may appear in different ZIP codes.
  2. The update files as well as the data file is a sequential text file, with each line of the file having the form:
    < area code> < one or more spaces> < prefix> < one or more spaces> < ZIP code>
    
    < area code > and < prefix > are three digits in length; < ZIP code > is five digits. Remember to make up several versions of this update file so you can thoroughly test your program. (Test for happenings such as these: the update file is empty, has a few ZIP codes in it, has (at least) tens of ZIP codes, contains only ZIP codes that already exist in the master file, and so on.)
  3. We would like the B-tree to be persistent beyond the lifetime of the execution of your program. For this reason, the BTree data structure will be stored in a random access file. For this you will need to use a class RandomAccessFile from the Java API. At the beginning of your execution, your program should open the RandomAccessFile, create an instance of BTree and hand BTree a pointer to the RandomAcessFile. At the end of execution, the file should be closed.

    You can think of a RandomAccessFile as a long sequence of bytes. Our goal in implementing a BTree is to minimize the number of times we need to write to or read from this file (since we are assuming it is so large that it is stored on disk). You should have a class BTreeNode . The data is stored in the file in binary and is written as a sequence of BTreeNodes. Class BTreeNode should include the following two methods:

  4. The root of the BTree will always be the first BTreeNode of the file. When you need to add a new node, it should be appended to the end of the file. Instead of maintaining traditional Java references to the children, you should maintain the index in the file for the location of the BTreeNode corresponding to a particular child. Note that we are assuming here that each BTreeNode takes up exactly the same amount of space in the file. The amound of data within a BTreeNode can vary a bit, but you should make sure that they are written in a uniform manner so that they all have the same length.
  5. You should make your BTree as generic as possible. (And you should implement it as a generic class). However, you will need to assume a few things about what is being stored. Your BTree will store key, value pairs. The zip codes are the keys and the values consist of a prefix and an area code. You will need to assume of course that class Key implements Comparable. It will also be convenient to assume that value does as well. When inserting a key, value pair into the BTree, it will be easier if all the entries with the same key are ordered according to their value. In addition, you will need to have some means of extracting the data from the keys and values so that they can be written to the file and writing to the date using bytes from the file. You will also need to know how many bytes will be necessary. For this reason, your Key and Value classes should implement the following interface:
    public interface ConvertibleToFromBytes
    {
    	// this method returns the number of bytes required to
    	// record all the data in the class
    	public int getSize();
     
    	// this returns an array of bytes with all the data
    	// in the class written in  the array
    	public byte[] getBytes();
    
    	// this method takes an array of bytes and uses the
    	// values to fill in the values of the member data
            public void writeBytes( byte[] )
    }
    
    

  6. The first time your program is run, it will need to decide on the branching factor of the tree. The "order", to use the term from your book. This should be passed in to BTree when it is created and to every instance of BTreeNode when it is created. A reasonable number to use for the data you will be given is 50. The number should be written at the beginning of the random access file in a short (2 bytes). On future executions of your program, your BTree will read the branching factor from the random access file. A reasonable way for your BTree to determine whether this is the first execution of your program is to test whether the random access file has any data.

  7. It would be very expensive to maintain parent pointers since that would mean updating roughly 25 pointers every time a node is split. However, you need a means of travelling back up the tree when inserting a new item which causes nodes to split. I recommend using a recursive search so that when you return from the recursive call, you have the parent node accessible. Alternatively, you could explicitly store the path from the root to the current node in a stack as you search for the correct location for the new item.

  8. When searching through a tree node to find a key, you should use binary search.
  9. Here is the text file for constructing the original BTree. Of course, you should construct smaller test cases to test your code before you try this one:



    Written 2/3/05. Adapted from a lab due to Norm Jacobson.