Searching through data is a frequent necessity in data processing; finding what you want quickly is typically the goal. Yet, even with a fast machine, the algorithm employed is the major determiner of data retrieval speed. This project will give you practice in using two searching techniques, ProxmapSearch and ChainSearch, and helps give you a feel for the practical meaning of O-notation and “improvement by a constant factor.” You’ll practice code optimization, a commonly used (and abused) programming activity. This lab also demonstrates how programs can be used as test vehicles to answer design questions for larger programs; it also provides practice in preparing reports based upon program-generated data.
Imagine you’ve been hired as a “data jockey” by a national chain of retail stores, due to open shortly. Your boss briefs you on your first assignment:
Marketing has (what it thinks) is a brilliant approach to notifying people about these new stores: In addition to standard print and Web advertising, telemarketers are going to telephone and text homes within a ten-mile radius of each new store to personally invite consumers to “come on in” during the grand openings.
The marketing research branch has prepared a list of the 5-digit ZIP codes which correspond to these invitation areas; it has obtained a file that lists, for every phone group, a corresponding ZIP code. A phone group consists of a three-digit area code followed by a three-digit phone prefix (the first three digits of the phone number). Since one phone group might be found in several ZIP code areas, there may be several entries in the file that have the same phone group, but have different ZIPs. And clearly there could be several different phone groups in the same ZIP code.
Marketing calls this list of ZIP codes with their phone groups the master file. Details of its structure are below.
Marketing has also procured a list of telephone customers' names and phone numbers, in phone number order; this is the phone list.
Marketing wants a look-up program that will print on the screen, in an easy-to-read format, the phone groups that are in a chosen ZIP code. A telemarketing coordinator will use this program to assign a ZIP's phone groups to telemarketers, who will then use the phone list to make calls and texts to potential customers (who are not on the Federal Do Not Call List). Always time-pressured, marketing wants a request for a ZIP codes#146;s phone groups filled as quickly as possible.
This is not a one-time program. As calls and texts to each ZIP code’s list are completed, another set of numbers will be needed; as new stores open, marketing will need additional ZIP codes’ phone groups.
Requirements and constraints known at this time:
Prior study has narrowed down the choices for the searching algorithm to ProxmapSearch or ChainSearch—so those are the ones you will investigate.
ProxmapSearch is O(1) in the best and average cases, which occurs when the map key is chosen so that subarrays are of roughly equal size and short compared to the total number of items to be searched. But ProxmapSearch degrades to O(n) in the worst case, which occurs when most of the keys end up in one subarray—and that happens when the key is poorly chosen. The same holds for inserting an item in a subarray (which is just a search to locate where to insert the item, followed by a constant time opertion to add it into the subarray). A description of the ProxmapSort and ProxmapSearch algorithms has been provided here.
ChainSearch is also O(1), but only if the map key is chosen so the chains are roughly of the same length and reasonably short. Adding a new item to a chain is also typically O(1), but again only if the chains are reasonably short. ChainSearch can degrade to O(n) performance, both when searching and inserting items, if the map key results in one or a few very long chains instead of a bunch of short ones. So, here too, choosing a good map key is crucial. A diagram of the Chain structure has been provided here.
Since both searches have the same general performance, which one is faster will highly depend upon the implementation used and the properties of Java and the operating system; for instance, whether finding something via cell reference in an ArrayList is faster or slower than finding something by following a reference (pointer).
The program will be started up at the start of the day, at which time the master file’s information is placed into the proxmap or chain list; searches are then made using this memory structure. The program remains running all day, and is shut down at the end of the workday. Of course, it is possible, because of a system crash or other malfunction, that the program may need to be shut down or restarted during the day.
The master file will be a sequential text file; each physical line is
3-digit-area-code space(s) 3-digit-prefix space(s) up-to-5-digit-ZIP-code end-of-line-mark
Area codes and prefixes never start with 0, so they are always three digits. ZIP codes can start with leading zeroes. In the master file, leading zeroes are not included in the ZIP code, because some languages, like Java, take a leading zero to mean the integer is octal (base 8), rather than base 10...so reading in a ZIP of, say, 00010, will be interpreted as an 8! Lack of leading zeroes in ZIPs is not a problem; since we know ZIPs are really five digits, the right number of leading zeroes (if any are needed) can be included when printing them.
You can assume the master file always has the correct format and that there are no duplicate records—the file will have been “cleaned” before it is provided for use by this program.
The master file will always have at least a few hundred lines in it, and may have up to several hundreds of thousands of lines. The lines of the file are in no particular order.
Your boss tells you that your task, perhaps surprisingly, is not to write this program; rather, it is to investigate and report on which data structure ought to be used to make the search for a ZIP code, and returning a list of the phone groups associated with it, as fast as feasible. The program requirements team will use your results in determining the requirements for the search algortihm they give to the design and programming teams. The boss also admonishes you that time (as always) is of the essence, so you are to focus your efforts only on comparing ProxmapSearch and ChainSearch; in particular, you are not to go off looking at other search approaches.
You are to report your results in this company-standard format:
Requirements given in this section are stated explicitly (by such phrases as “We require you to...”). The hints and suggestions here are just that; they are not requirements. As long as what you do meets the requirements, you may do and name things differently, even use a completely different approach.
We require you to code up classes for the proxmap and chain list structures. We have provided skeletons for Proxmap and ChainList classes to get you started, in the SearchForABetterWay project zip file; you may change or remove any of the private items, and add in other private items, as you see fit, but be sure to use good style. Don’t change any of the public signatures. In particular, avoid using Java features that aren’t really needed to get the job done cleanly and quickly, and may unnecessarily add to running time or memory usage. (For instance, we did not use the Java LinkedList class for the chains of the chain list; LinkedLists are doubly-linked and we only need singly-linked lists. Using our own singly-linked list saves the space used by the “previous” pointers of a LinkedList, thereby dropping the memory requirements of a chain by about one-third, a significant savings.)
Also in the project file is a sample master file TestZIP.txt, of about 97,000 lines. We encourage you to use it, portions of it, and larger files based on it, as appropriate, when testing and timing your routines. (Of course, use small files for your first tests; when things look like they are working, move on to larger test files.)
With the exception of ArrayList, you are not allowed to use the predefined Java “collection” classes, such as java.util.TreeMap, in your solution. (The collection classes are the ones that store a collection of data, and include such classes as LinkedList, HashMap, Vector, Hashtable, and TreeMap.) Among other things, these classes are overblown for our needs, and so unnecesarily increase memory usage or the time to complete the searches.
You need methods to add the data from the master file to the proxmap or chain list and to search those structures. Write four methods to accomplish this task; call them Proxmap.buildList() and Proxmap.lookup() (to build and search the sorted array) and ChainList.buildList() and ChainList.lookup() (to build and search the chain list). Signatures for these methods are provided in the skeleton classes we’ve provided.
To test ProxmapSearch, write a method called testProxmapSearch that takes as parameters the name of the master file and the name of a text file that has a list of ZIP codes for which to search. testProxmapSearch builds the proxmap using the master file data and reports how long that took. Then testProxmapSearch records the time it takes to search for the passed-in ZIP codes (including the time it takes to return its list of phone groups or throw an exception if the ZIP code is not found) divided by the number of searches done; it reports this average search time and exits.
When testing chain searching, you’ll need to try out several different map key functions. The easiest way to do this is to extend from the provided abstract class AbstractMapKey a concrete class that represents a map key; do this for each map key you want to try. Then, in your test program, you construct the map key(s) you want to use and pass them in turn to the Proxmap constructor. Details on this approach are given in comments in the AbstractMapKey class.
To test ChainSearch, do what you did for proxmap search: write a method called testChainSearch that takes as parameters the name of the master file and the name of a text file that has a list of ZIP codes for which to search. (Make sure this file has the same format as the one used with testProxmapSearch, so the comparison is fair). Then build the chain list using the master file data and report how long that took. Then, as with testProxmapSearch, testChainSearch reports the average time it takes to search for a ZIP code in the list of passed-in ZIPs and exits.
As with proxmap search, you’ll need to try out several different map key functions to see which produces the best results. Note that the map key which produces the best proxmap search may be different than the one that produces the best chain search.
We require that you write a driver program that causes your testing methods to be called with various sizes of master files, with master files containing various distributions of keys, with several map key methods and with various sizes of, and distributions of, ZIP codes in the file that lists the ZIPs for which to search. By having the driver call your test routines repeatedly with well-chosen combinations of these choices, you can have it produce a file containing all of your results—and if you format the file well, its data can be pasted directly into a spreadsheet for further analysis. This approach will save you significant time over entering choices via the keyboard, one set at a time, and then copying the results from the screen by hand into a spreadsheet or your report.
By choosing your test cases so they illuminate distinct behaviors, rather than just repeatedly looking at the same sort of behavior, you can save yourself numerous tests and lots of work. For example, the number of items in the proxmap or chain list, which is determined by the number of lines in the master file, could easily affect the performance of the search. One good set of sizes for the master file is {100, 500, 1000, 10,000, 100,000, 250,000} records. These sizes more than cover the sizes your boss said were possible—they allow a safety margin: if the specs change somewhat, our code should still work. We require that you test your code with master files of the aforelisted sizes. Include whatever additional sizes you think are needed to gauge the relative performance of ProxmapSearch and ChainSearch. If your program runs out of memory for lists beyond a certain size, be sure to report that finding! But we must say that a good implementation of these algortihms should not run out of memory for sizes up to at least 250,000.
As for the searches that will be done during the business day, three scenarios come to mind (you might think of others). The first is that a telemarketer will focus on one store opening, then another and then another, and so one. That would imply that the ZIP codes used in the search would be in groups that are numerically near each other, as ZIP codes are roughly assigned in ascending order as one moves from east to west across the U.S. So run tests where a group of the ZIP codes are numerically close to each other, then followed by a different group are close to each other (but not close to the first group) and then another such group, and so on.
Another scenario is where an unplanned search for one ZIP code is needed, then nothing happens for a while, then another ZIP code is checked, and so on. This would lead to a test file where there are a number of ZIP codes where each one may or may not be numerically close to the one preceding it. Run a test using such a group of codes.
The third situation is a mixture of the first two: the coordinator looks up a bunch of ZIPs for a store, then one ZIP for a query that has come up, then perhaps another ZIP code, then perhaps another store’s group, and so on. Run at least one test with a file of ZIP codes that represents this situation.
For the map key, try at least these two approaches: (1) mapkey(key) = key (that is, use the ZIP code) and (2) mapkey(k) = int(key/100) (that is, the first three digits of the zip code). Again, try others you think appropriate to produce as fast an approach as feasible.
Run the test program with good test data; analyze the results; write the required report. The report should be as clear and well written as you can make it. We encourage you to use the features of Microsoft Office or other productivity tools to help you with your presentation. You must include tables of times used by various approaches, along with other appropriate data and discussion, to support your conclusions; we also require that you present your results graphically— it’s much easier to discern patterns in the data when looking at graphs than at tables. In the appendix, include the names of the files in which your programs and results can be found; of course, turn those files in with your report, as part of your project.
Finally and importantly: Don’t get so wrapped up with the experimentation that you leave no time to write your report!
Java has a few different library classes that can help you time how long it takes to do a particular search. As we suggestedin Assignment 2, using the System.currentTimeMillis( ) provides an easy way. When the task begins, do this:
long startTime = System.currentTimeMillis();
...and when it ends, do this:
long endTime = System.currentTimeMillis();
Subtract the start time from the end time to get the elapsed time.
Remember the following:
Before you do any timings, be very sure your build routines work properly and your search routines actually search correctly! Timings of incorrect algorithms are useless at best, and potentially very damaging—imagine what could occur if your company put into production the worse of the two searches, or a version of a search that didn’t work at all, because of the errors you made in your code or your analysis. In particular, first try out your routines on a very small master file and a small list of searches; the programs run quickly and the structures are small and easy to check compared to a test run with, say, a hundred-thousand-record master file and 100 searches. Once your routines work on these small files, then you can test them using larger ones.
Only count the actual time it takes to build the list and the time to search it—don’t include setup time or the time to write out the timing results. An easy way to remove overhead is to time the actual routine, then compute the time again, but call a fake routine that passes in the same parameters but then immediately returns, without doing any actual work. Subtract the time this routine took from the time for your real routine, and you are left with just the actual build or search time.
For a small number of searches, the time needed to complete a search could be much less than one millisecond! Further, your program shares resources with other programs, and so is not always immediately given access to the system clock when it requests the current time. As a result, the elapsed time can be somewhat greater than the actual time taken: it might take a few milliseconds from when your search ends until your program gets access to the system clock and records the time. For long algorithms, this “choppy” time measurement doesn’t matter, at most adding a few milliseconds to a large time. But for fast algorithms, it can look as if a task is taking much longer than it really is; a millisecond is a lot of time for a modern computer. It can also happen that your algorithm completes so fast that the clock registers that 0 millisecs elapsed, an obviously incorrect result. This happens when the clock is immediately available when your algorithm starts and stops, and the algorithm takes much less than 1 millisecond to complete.
So, to time fast algorithms more accurately, run the algorithm many times (by placing it inside a loop) and then divide the time it took by the number of times the algorithm was executed. The loop has to be repeated many, sometimes very many, times to get accurate measurements— the shorter the time, the more measurements required.
So, to compute an accurate time for a fast search, the code would look something like this:
double searchTime = computeSearchTime(); double overheadTime = computeOverheadTime(); double actualSearchTime = searchTime - overheadTime; // computeSearchTime() long startTime = System.currentTimeMillis(); for (int i = 1; i <= NUMBER_OF_REPETITIONS; ++i) doThisSearch(parameters); long endTime = System.currentTimeMillis(); long elapsedTime = endTime - startTime; return (double) elapsedTime / NUMBER_OF_REPETITIONS; // computeOverheadTime() long startTime = System.currentTimeMillis(); for (int i = 1; i <= NUMBER_OF_REPETITIONS; ++i) doFakeSearch(parameters); long endTime = System.currentTimeMillis(); long elapsedTime = endTime - startTime; return (double) elapsedTime / NUMBER_OF_REPETITIONS;
If your timings do not agree with theory, there is something wrong with your algorithm or your timings—not with the theory! For instance, if you graph the times of ChainSearch for the various sizes of the master file and see an n2 curve, you know something is very, very wrong.
Marketing informs you that the firm from which the master file was obtained sends an update file containing from about 25 to 500 new phone groups once every three months. It has the same format as the master file. This new information must be placed into the master file. Enhance your program so it has a function to read in the master file to the proxmap or chain list, adds in the new items from the update file, and then writes out a new master file. Report on whether the adds are done more quickly when using a proxmap or a chain list, and whether this difference changes your recommendation as to which structure should be used. Remember this update is done only every three months.
Similarly, marketing also gets an update file every three months that contains zip codes that have been retired; that is, they no longer exist. It is a text file, which each retired zip code on its own line. Enhance your program to process these deletions (by removing them from the proxmap or chain list and then writing out a new master file), and again report on whether it’s faster to accomplish on a chain list or a proxmap. Report whether this results changes your recommendation about which structure to use.
Also, marketing gets another update file every three months that contains area code changes. It is a text file with each change on one line; first appears the old area code, then a blank, then the new area code. Enhance your program to update these area codes (by updating the entries in the proxmap or chain list and writing out a new master file); as above, report on the relative speed of the operation on the two kinds of list and whether this changes your overall recommendation on which structure to use.
Turn in your project file, as a ZIP file, to Checkmate. Be sure it includes the following:
A PDF file, or a Microsoft Word ".doc" document, containing the report of your findings and conclusions, in the format and containing the information described above. If you submit the report as a Word file, be sure you submit the report as a ".doc" file, and not ".docx"—the version of Word we can guarantee our graders can access cannot read ".docx" files. (If you are using a newer version of Word, you will need to use the Save As... command to save the docment as a ".doc", rather than ".docx", file.) Put your report in the main project folder, not a subfolder.
The Java programs that you used to create the results presented in your report, and their output, in the src folder inside the project.
Your grade for this assignment is based on your programs’ correctness and style and especially on the report’s quality—its conclusions, how well its recommendations are supported and the quality of its composition and presentation. The higher weight is on the report because the programs are just an investigative tool; what matters most is the report’s conclusions and how well those conclusions are supported. So we allocate 1 point for your ProxmapSearch working and having run the required tests of its performance, 1 point for the ChainSearch working and having run the required tests of its performance, and 2 points for the report itself. (As always, you can earn up to 1 extra point for well-done optional work, and for an amazing job on the assignment, even if optional work is not undertaken.)