Hashing Introduction: Hash tables are a data structure for storing and retrieving unordered information, whose primary operations are in complexity class O(1) - independent of the amount of information stored in the hash table. We saw that digital trees had this same property, but only for special keys (keys that were digital: meaning we could decompose them into a first part of the key, a second part of the key, etc., as we can with digits in a number or characters in a String). Hash tables work with any kind of key. The most commonly used implementations of the Set and Map collection classes in Java (which are unordered) are implemented by hash tables. We will also implement a Map via a hash table in Program #4. Here are some terms that we need to become familiar with to understand (and talk about) hash tables: hash codes, compression function, bins/buckets, overflow-chaining, probing, load factor, and open addressing. We will discuss each below.

Our Approach: We will start by discussing linear searching (using a linked list) of a collection of names. If we instead used an array of 26 indexes and put in index 0 a linked list of all names starting with "a", in index 1 a linked list of all names starting with "b", ... and in index 25 a linked list of all names starting with "z", we could search for a name about 26 times faster by looking just in the right index for any name (according to its first letter). This speed increase assumes each letter is equally likely to start a last name, which is not a realistic assumption. Going further, if we used an array of 26x26 (676) indexes, storing in index 0 a linked list of all names starting with "aa", in index 1 a linked list of all names starting with "ab", ... and in index 675 a linked list of all names starting with "zz", we could search for a name about 676 times faster by looking just in the right index for any name (according to its first two letters). Of course, this speedup isn't achieved unless we have at least 676 names and each box is equally likely to hold a name, which isn't true: few names start with combinations like "bb", etc. And what about looking up information that isn't a String: for example, the WordGenerator uses a List of Strings as the key for its map. So while this approach seems promising, we need to modify it to be truly useful.

Hash Codes: Hashing is that modification. We declare an array with any number of "bins" or "buckets" and use a "hash code" to compute an int value for any piece of data that can go into the hash table. It must always compute the same hash code for the same value, so it cannot use random numbers. We should design such a hash code to generate the widest variety of numbers (over the range of all integers), with as small a probability as possible of two different values hashing to the same number. Of course, in the case of using Strings as values, there are more Strings than int values. There are only about 4 billion different ints -actually, exactly 4,294,967,296- but an unbounded number of Strings, which can be of any length, meaning any number of characters: even if we consider only Strings with lower-case letters, there are 26^N different Strings with N characters; 26^7 is 8,031,810,176, so there are already more 7-letter Strings than ints. Once we have a hash code function, we use a "compression function" to convert the hash code to a legal index in our hash table.
One simple compression function computes the absolute value of the hash code (hash codes can be negative or positive, but array indexes are always non-negative) and then computes the remainder (using the % operator) with the hash table length as the 2nd operand, producing a number between 0 and length-1 of the hash table. Other compression functions use bitwise operations to compute a bit pattern in the correct range. The hashCode method is important: it is one of the few methods declared in the Object class, so every class can override it (it is as fundamental as toString and equals, which are also declared in Object). Here is a slightly simplified hashCode for the actual String class in Java (we will see the exact code later).

  public int hashCode() {
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];   //promotion of char -> int: its ASCII value
    return hash;
  }

"a".hashCode() returns 97 ('a' has an ASCII value of 97; you can actually call .hashCode on any String literal, which is really replaced by a String object storing that value) and "aa".hashCode() returns 3104 (31*97 + 97). Generally, if String.length() is n (the chars array contains n values), then its hashed value is given by the formula

  chars[0]*31^(n-1) + chars[1]*31^(n-2) + ... + chars[n-2]*31^1 + chars[n-1]

So, "ICS23".hashCode() returns 69,494,394, and "Richard Pattis".hashCode() returns -125,886,044! Yes, because of arithmetic overflow and the standard properties of binary numbers, the result might be negative (and overflow of negative numbers can go positive again). Recall that Java does not throw any exceptions when arithmetic operators produce values outside of the range of int: hashing is one of the few places where this behavior produces results that are still useful. Generally, the hashCode for each numeric type is an int built from that value's bit pattern, and characters hash to their ASCII values. Every other type in Java is built from these: for example, Strings are an ordered sequence of chars, and we compute the hash code of a String by looking at every char in it. Note that "ab".hashCode() != "ba".hashCode(), which is fine because those two Strings are not equal: the order of the letters is important. In fact, a good hashCode function for any ordered data type will produce different values for different orders of the same values. Likewise, here is the Java hash code for a List (defined in the AbstractList class). It relies on computing the hash code for every value in the list (and if a null value is in the list, using 0 for its hash code). Computing the hash code for a sequence of values in a list is similar to computing the hash code for a sequence of characters in a String, because the order is important.

  public int hashCode() {
    int hashCode = 0;
    for (E e : this)
      hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
  }

Note that if we store an explicit null in a List (this is legal) we cannot call the hashCode method on it: Java would throw a NullPointerException, so we just use the value 0 for its hash code. Recall that in the WordGenerator program we used a key that was a List of Strings. So we would iterate through the list of String values, computing the hash code for each String in the list, and combining them as shown above. It is critical that the equals and hashCode methods for a class are compatible. The key property is that if a.equals(b) then a.hashCode() == b.hashCode().
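To make the compression step described above concrete, here is a minimal sketch (the method name compress and the parameter binCount are illustrative assumptions, not java.util code):

  //A minimal sketch of hash code compression; names here are illustrative assumptions.
  public static int compress(Object value, int binCount) {
    int hash = value.hashCode();        //may be any int, negative or positive
    return Math.abs(hash % binCount);   //remainder first, then absolute value
  }

Taking the remainder before the absolute value sidesteps the one corner case where Math.abs misbehaves (Math.abs(Integer.MIN_VALUE) is still negative); the result is always in the range 0 to binCount-1.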
Of course the opposite of that key property is not true: many different Strings have equal hashCodes, because there are more Strings than ints. Still, it is unlikely that many different Strings actually used in some problem will have the same hash code (if there are only millions, not billions, of them). This compatibility requirement is very important for unordered collections like Sets. Typically we iterate through the values of a Set to compute the hashCode. But values in a Set can be stored in (and iterated through in) any order. Regardless of the order these values are processed, they must compute the same hash code each time hashCode is called (because set implementations that happen to store their values in different orders are still .equals). So, hash codes are computed differently, but only slightly, for unordered collections (like a Set). Since the hash codes should be the same no matter what order the values are stored, we cannot use the hash code method above, but instead must use something that accumulates the hash code values of its elements without regard to their order. Here we just add together (without the weighting of 31*) all the values.

  public int hashCode() {
    int h = 0;
    for (E e : this)
      if (e != null)
        h += e.hashCode();   //or we could multiply the hash codes (starting h at 1)
    return h;
  }

Thus, if we add or multiply together all the hash codes, it doesn't make a difference what order we do the addition or multiplication: a+b+c, b+a+c, c+b+a, etc. all compute the same value (as do a*b*c, b*a*c, c*b*a, etc.)

Finally, here is something interesting that I came across when reading the String.java class. The real hashCode method in String looks like the following, with cachedHash being an instance variable in every String object, which is initially 0. The first time hashCode is called, cachedHash is 0, so the method computes the hash value and stores it in cachedHash before returning it. Every other time it is called, it immediately returns cachedHash, doing no further computation. Remember that Strings are immutable, so once they are constructed their contents do not change; therefore, once the hashCode is computed for a String object, that String object will always return the same result.

  public int hashCode() {
    if (cachedHash != 0)
      return cachedHash;
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];   //promotion of char -> int
    return cachedHash = hash;
  }

If a String's computed hashCode is 0, it will be recomputed every time hashCode is called, even after it has been computed and cached (because with the != 0 test, Java cannot tell the difference between a hash code that has not been computed and a hash code that has been computed with value 0). Typically, the only String whose hash code is 0 is ""; most other Strings will have a non-0 hashCode. Recomputing the hash code of "" is very quick, because it stores no values (chars.length is 0, so the loop immediately exits). We could include an extra boolean instance variable named hashCached, initialized to false and set to true after caching. So we would have

  public int hashCode() {
    if (hashCached)
      return cachedHash;
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];   //promotion of char -> int
    hashCached = true;
    return cachedHash = hash;
  }

...but that is a bit of overkill. Of course, we also could compute the hashCode of every String WHEN IT IS CONSTRUCTED, storing it in cachedHash, never checking this value, and always returning the cached value.
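That compute-at-construction alternative might look like the following minimal sketch (the class name SimpleString and the fields chars and cachedHash are illustrative, mirroring the simplified code above, not the real String source):

  //A sketch of computing and caching the hash code at construction time.
  public class SimpleString {
    private final char[] chars;
    private final int    cachedHash;

    public SimpleString(char[] source) {
      chars = source.clone();
      int hash = 0;
      for (int i = 0; i < chars.length; i++)
        hash = 31*hash + chars[i];   //promotion of char -> int
      cachedHash = hash;             //computed exactly once, at construction
    }

    @Override
    public int hashCode() {
      return cachedHash;             //no "is it cached yet?" check needed
    }
  }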
The upside is that the hashCode method would always just return this cached value; the downside is that we would have to compute the hash code for every String when it was created, even if we were never going to call hashCode on it. The approach above, actually used in Java, caches a hash code only if asked to compute it at least once.

Hash tables are used frequently in practice, so there have been many studies, both theoretical and empirical, of hashCode methods, which are at the heart of hash tables working efficiently. The best methods are quick to compute and return results scattered all over the range of int. Given such a hash code, the rest of the code to implement a hash table (see below) is straightforward. If you look on the Programs link on the course web site, you will see a download of HashCode, which allows you to test the String's hashCode method statistically. If you want, you can write a static hashCode function and compare it to the one built into Java (for both speed and its ability to generate a wide range of values). There are two drivers there, one testing chaining and one testing open addressing: both are covered in this lecture.

Hash Tables: Now let's look at how to insert a value into a hash table. Recall that the basic structure is an array of bins/buckets, each referring to a linked list of values that hash/compress to that index. The basic picture looks like the following (here with 5 bins/buckets).

 Bin/Bucket          Collisions (handled through separate chaining)
   +---+
   |   |    +----+---+    +----+---+
 0 | ------>| v1 | --+--->| v2 | / |
   |   |    +--------+    +--------+
   +---+
   |   |
 1 | / |
   |   |
   +---+
   |   |    +----+---+
 2 | ------>| v3 | / |
   |   |    +--------+
   +---+
   |   |    +----+---+    +----+---+
 3 | ------>| v4 | --+--->| v5 | / |
   |   |    +--------+    +--------+
   +---+
   |   |    +----+---+
 4 | ------>| v6 | / |
   |   |    +--------+
   +---+

Generally, bins can have zero, one, or many values. We say that values v1 and v2 collided in bin 0, and we have used "separate chaining" (a linked list) to keep track of all the "collisions". With good hash/compression functions, the values stored in a hash table should be approximately equally distributed throughout the bins. Of course, there will typically be some bins with fewer values (maybe even none) and some bins with more, because hash codes aren't perfect.

Hash Table Algorithms: 2 Important Algorithms manipulating Hash Tables

1) add (for Set) / put (for Map):
   Use hashCode/compression to compute a bin index (for the value/key).
   Search all the collisions to see if the information is there (sets have unique values/maps have unique keys).
   If it is there: for a Set don't change anything; for a Map change the value associated with the key.
   If it is not there: add the information anywhere convenient in that bin: in a list node at the front, rear, wherever: there is no ordering of information in the bins for sets and maps.

2) contains (for Set) / get or containsKey (for Map):
   Use hashCode/compression to compute a bin index.
   Search all the collisions to see if the information is there.

------------------------------ 2 more Important Algorithms

The "Load Factor" of a hash table is the ratio of the number of values it contains divided by the number of bins. It is the expected number of values that will hash/compress to each bin. Generally, Java's classes try to keep the load factor below 1 (more bins than values, so each bin contains zero or very few -hopefully just 1, but this is rarely achieved- values).
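To tie algorithms 1 and 2 and the load factor together, here is a minimal sketch of a separate-chaining hash set (every class, field, and method name here is an illustrative assumption, not code from java.util; the rehash step it calls is the subject of algorithm 3, next):

  //A minimal sketch of a separate-chaining hash set; all names are illustrative assumptions.
  public class ChainedHashSet<E> {
    private static class LN<T> {                 //list node for one bin's chain
      T     value;
      LN<T> next;
      LN(T value, LN<T> next) { this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private LN<E>[] bins = (LN<E>[]) new LN[8];
    private int     size = 0;

    private int compress(E value)                //hash code -> legal bin index
      { return Math.abs(value.hashCode() % bins.length); }

    public boolean contains(E value) {           //algorithm 2
      for (LN<E> c = bins[compress(value)]; c != null; c = c.next)
        if (c.value.equals(value))
          return true;
      return false;
    }

    public boolean add(E value) {                //algorithm 1
      if (contains(value))
        return false;                            //sets store unique values: no change
      int bin  = compress(value);
      bins[bin] = new LN<>(value, bins[bin]);    //anywhere convenient: here, the front
      size++;
      if ((double) size / bins.length > 1.0)     //load factor check
        rehash();                                //double the table length (algorithm 3)
      return true;
    }

    private void rehash() { /*described next*/ }
  }

The sketch adds new values at the front of the bin's chain, which is as convenient a spot as any, since bins are unordered.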
When the add/put method is called for a Set/Map, it checks the load factor and increases the length of the hash table to ensure that the load factor is always below the specified threshold, doubling the size of the hash table if need be.

3) Double the Length of the Hash Table:
   Remember the old hash table array and allocate a new one 2 times as big.
   Traverse the old hash table (you can do it directly or use the Iterator if one is available), adding each value to the new hash table, but NOT NECESSARILY IN THE SAME BIN! Instead, add it by applying hashing/compression again (compression will be DIFFERENT, because the length of the table is doubled, so we compute the same hash value but compute the remainder using the DIFFERENT TABLE LENGTH).
   By being clever, we can re-use the entire LN (list node), so we don't have to allocate any new objects; but this makes the code harder to write.

4) Iterator: Uses index and cursor instance variables.
   Constructor: Loop to find the first bin with a list of values (not null):
     Succeed: set index to the bin number, cursor to the first LN in the bin
     Fail   : set cursor to null
   hasNext: return cursor != null
   next: if hasNext is true, return the appropriate value by using only the cursor;
     reset cursor = cursor.next; if it becomes null, loop to advance the index to later bins, stopping at a non-null one
     Succeed: set index to the bin number, cursor to the first LN in the bin
     Fail   : set cursor to null
   remove: Store the previous cursor (in an extra instance variable) to do the removal,
     or store no extra information but use a trailer node in every bin (which makes removal easier; we will do this in Programming Assignment #4)

Note that iterators for data types implemented by hash tables return their values in a "strange" order (based on their hash values and collisions). In fact, adding one value to a hash table can cause it to exceed its load factor, and thus it will double the number of bins (doubling the length of the hash table), which causes rehashing. Now the iterator order might be completely different. We should never assume any special ordering for these iterators, since data types might be implemented by hash table data structures. Instead, if we want to ensure values are processed in a specific order, we should put the values produced by the iterator into a List, sort the list, and iterate through it.

Implementing Overflows: Note that we can store in each bin an array of values, a linked list of values, a binary search tree of values, etc. (possibly using different implementations of the Set data type). The reason that linked lists, and not more exotic data structures, are used is that with a load factor <= 1 we expect to find few values in each bin, so more exotic data structures just make the coding more difficult with little gain in speed for searching a very small number of values.

Why are hash tables O(1) with good hash functions: In the worst case, a hashCode will be the same for every value hashed (not likely, but possible), so no matter how big the hash table is, we would go to the same bin and search all N values there. Such a process would be O(N). But let's assume that we are using a hashCode that does a pretty good job (as most do). For such a good hash code, if the table length is M, we would expect to search about N/M values in each bin. Thus, for any given M the method is O(N/M), or just O(N) since it seems M is "a constant". But we are doing something a bit more subtle.
By keeping the load factor <= 1, we ensure that M >= N (M is always at least N, and sometimes as much as 2*N, right after the load factor exceeds 1 and we double the length of the hash table). So, M is not a constant: it grows with N. In fact, we know that M >= N, so in the "worst case" M = N. Therefore the complexity class of O(N/M) is really O(N/N) or O(1). In fact, for any reasonable load factor the complexity class is O(1). For a load factor of 2, the complexity is O(N/(N/2)) or O(2), which is the same as O(1). Certainly with this bigger load factor we'd expect to spend twice as much time searching, but on average we'd still examine some fixed number of values in each bin.

Security via Hashing as a 1-way function: Given a String, it is very easy to compute its hash code; but it is typically not easy, given a hash code, to determine what String(s) will hash to that value. In mathematics, such functions are called 1-way (or non-invertible) functions. We can use 1-way functions to provide security: let's look at one example, password security. Have you ever wondered how a computer system stores your password? If the computer stored everyone's password in a file (as a list of user-ids and their passwords), then anyone who could steal/read that file could compromise all the accounts. Here is another way to store the information: store a list of user-ids and the hash code of each user's password. When a user tries to log in, the system would hash the password they type in, and see if it matched the hashed entry in the password table. The hashing method would be public, but it would be a 1-way function. So, even knowing the algorithm wouldn't allow you to easily compute a password from its hash code (but would allow you to easily compute a hash code from a password). Now, if someone could read the password file, they would see only the hash codes of the passwords, but not the passwords themselves. Of course, they could write a program that generated all possible Strings, hashed each, and looked for one that had the same hash code. Then they could log into the system with the user-id and that password (which would hash to the one stored in the password file). That is why you are encouraged to have passwords that are long and have upper AND lower case letters (and maybe even symbols in them): it increases the size of the alphabet, making a search over all Strings in that alphabet harder. Assume that a computer could compute 10^9 (a billion) hash codes per second. There are 1.4 x 10^17 different Strings of length 10 (52^10, using upper- and lower-case letters). If we tried to generate all these Strings, hash each, and compare it to the one we are looking for, it would take about 1.4 x 10^8 seconds, or about 4.5 years. This is another reason to change your password frequently (say, every 4.5 years!) Actual password systems are more complicated these days, and use advanced cryptographic methods. But these methods themselves are typically based on the general theory of 1-way functions: the functions are just much more interesting than computing the Java hash codes of Strings.

The Complexity of Doubling Arrays: We have seen that arrays are used to store all kinds of collections: I have supplied array implementations of all the collection classes, and some advanced implementations that we will write, like HashMap, are also based on arrays. In these implementations, the add method determines whether to double the length of the array.
Most adds don't double the length of the array: in all the collections we've seen before, we just put the value in the next unused index in the array (or, in a hash table, chain it in a linked list in its bin). But, as the collection size grows, eventually we double the length, which means we copy all the values currently in the array to a new array. So most adds are O(1) but some adds are O(N). So, if we are talking about upper bounds, we might say at worst each add is O(N) and we do N adds, so the process of doubling and copying to get N values into an array is O(N^2). But we can derive a better (smaller) upper bound. We will talk about "amortized complexity" to analyze this case.

At worst, we allocate a collection to have 1 array cell.
Adding the 1st value stores it in the array and requires no copying.
Adding a 2nd value doubles the size of the array, copying the 1st value (so 1 copy).
Adding a 3rd value doubles the size of the array, copying the 1st-2nd values (so 2 more copies).
Adding a 4th value stores it in the array and requires no copying.
Adding a 5th value doubles the size of the array, copying the 1st-4th values (so 4 more copies).
Adding the 6th-8th values stores them in the array and requires no copying.
Adding a 9th value doubles the size of the array, copying the 1st-8th values (so 8 more copies).
Adding the 10th-16th values stores them in the array and requires no copying.
Etc.

Each time we double the length, we copy twice as many values as before, but we can add twice as many values before having to double again. If we end up with N values in the array, what is the total number of copies that we have to make? We will see below that it is O(N) -actually bounded by 2N. When we double the array size from 1 to 2, we have copied 1 value in total. When we double the array size from 2 to 4, we copy 2 more values for a total of 1+2=3 copies. When we double the array size from 4 to 8, we copy 4 more values for a total of 1+2+4=7 copies. Notice that the sum 2^0 + 2^1 + 2^2 + ... + 2^N = 2^(N+1) - 1. We used this formula before to compute the maximum number of nodes in a binary tree of height h: 2^(h+1) - 1 (1 node at depth 0, 2 nodes at depth 1, 4 nodes at depth 2, etc). Here is a table. On the left, N is the number of values in the array, and on the right is the total number of values we need to copy.

    N     Movements/Copying
  -----------------------------
    1       0
    2       1
   3- 4     3 = 1 (1->2) + 2 (2->4)
   5- 8     7 = 1 (1->2) + 2 (2->4) + 4 (4->8)
   9-16    15 = 1 (1->2) + 2 (2->4) + 4 (4->8) + 8 (8->16)
  17-32    31 = 1 (1->2) + 2 (2->4) + 4 (4->8) + 8 (8->16) + 16 (16->32)
  33-64    63 = ....

Notice that for N a perfect power of 2, there are N-1 copies. When N is 1 bigger than a power of two, there are 2*N-3 copies. So, the total number of copies we must make as an array grows to N values is O(N). Rather than thinking about "add" sometimes doing very little work and sometimes doing lots, we can think about "add" doing a bit of extra work (a constant amount) every time that we call it: really, most of the time we don't have to do the work, but every so often we have to do a lot of work, not just for the new value but for all the ones before it. This is called amortized complexity. Still, the total work done for adding N values is just O(N) -it cannot be less, because N*O(1) is O(N). Another way to think of this is to ask how much extra work there is to double the number of values in an array.
If the original length is N (say it is a power of 2), then when we add N more values, we will have to first copy each of the N values originally in the array into the new array, and then copy each of the new N values into the array. Thus, overall, adding N more values into the array requires copying N values and then adding N values without copying, for a total of 2N operations. The total complexity is still just O(N): every add required the actual add, and the total of N adds also required copying a total of N values originally in the array. Think of every add as counting for itself and for one of the copying operations.

Mutating a Value in a Hash Table: If we are using a tree or hash table to store an object in a Set or a key in a Map, we should not mutate that object inside the Set/Map. This is because the place it is stored depends on the value/key: when we put a value into a tree, we use a comparator to determine which subtrees it belongs in; when we put a value into a hash table, we compute its hashCode to determine which array index to put it in. If we mutate a key in a tree or hash table, it probably will not belong where it is: it would belong in a different location in the tree or a different bin in the hash table. That is, the Java code to locate and store a value in a Set/key in a Map is based on using its state (for comparison or hashing). If we store a value in a Set/key in a Map, and then mutate it (change its state), then we may never be able to locate that value/key again. So, it is a good idea to use immutable classes (like String) for keys, or at least be extra careful not to mutate a value INSIDE a tree/key INSIDE a Map. Instead, we can remove it, change the key, and then add it back. For a hash table, both removal and adding are O(1) operations, so although awkward, changing a value this way requires only a constant amount of work to update it in the hash table.

Hashing with Open Addressing Instead of Chaining: The final big topic on hashing is collision resolution without separate chaining. By far the most useful way to handle different values that hash to the same bin is to store all these values in a linear linked list referred to by the bin. Hash table operations will do a linear search of such a list, which, with a good hash code and a low load factor, will generally not be long. There is an alternative way to deal with collisions. While it takes up no extra space (compared to chaining, which constructs new LN objects), it can cause a big increase in the time needed to search for a value in a hash table unless the load factor is kept well below 1/2. The method is called probing via open addressing. We will discuss 3 different forms of probing: linear, quadratic, and double hashing. In linear probing, we compute the bin for storing a value; if the table already contains a value at that bin, we increment the index by 1 (incrementing past the last array index brings us back to index 0) and keep probing bins until we find an empty bin, and then put the value there. The locate method likewise continues from the original bin, linearly looking through later bins, until we find the value we are looking for or reach an empty bin (meaning the value is not in the hash table). Note that many values can be examined along the way, including values with completely different hash codes that just happen to occupy the bins between our starting bin and the first empty bin. For example, let's use linear probing via open addressing for the following hash table.
    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     |     |     |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Thus, if "a" hashes to bin 4, we put it there because it is empty.

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     | "a" |     |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Likewise, if "b" hashes to bin 5, we put it there because it is empty.

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     | "a" | "b" |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

But now, if "c" hashes to bin 4, we have to probe bins 4 and 5 until we find that bin 6 is the first empty one after 4, and put "c" there.

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     | "a" | "b" | "c" |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

If we are looking to see whether "d" is in the table (say it hashes to bin 4), we start probing at bin 4 for "d", then check bin 5, bin 6, and finally bin 7: it is empty, so we know "d" is not in the hash table.

Another problem with probing via open addressing involves removing values. If we want to remove "b", we first find it by hashing to bin 5 and finding it there. But if we actually removed it (putting a null there), the following problem would occur if we were then looking for "c": we hash "c" (see above) and get bin 4, then we look at the next bin (which is now empty, because we removed "b"), and we would think that "c" is not in the hash table. But at the time "c" was added, there was a value in bin 5, which is why "c" had to go to bin 6. So, in order to know we are done when we reach an empty bin, when we remove a value we find it and mark its bin as "available". Bins marked "available" have previously stored values and can store new values (if we reach them while adding), but unlike empty bins they cannot stop the probing: probing must continue until it reaches the value being located, or has passed through all occupied and "available" bins and reached a bin storing null. We can create a special Object to represent "available".

Instead of linear probing (where the bin number increases by 1 every time), we can do quadratic probing, where the i-th probe is at the original bin plus x*i + y*i^2 (for i=0 the first time, i=1 the second time, etc., once we specify the values for x and y: if both are 1/2, e.g. (i+i^2)/2, we probe hash, hash+1, hash+3, hash+6, hash+10, etc.). Of course, the compression function handles indexes that get bigger than the table length. Likewise, in double hashing we use a second hash method h2 (probing hash, hash+h2(1), hash+h2(2), hash+h2(3), ...) to compute a value that is continually added to the bin index (with the compression function) until the value is located or an empty bin is found. Unlike linear probing, with quadratic or double hashing the probe sequence "after a bin" depends on how many probes it took to reach that bin in the first place. That is, in quadratic probing, if we hash directly to a bin, the next probe is at hash+1; but if we reach that bin on the third probe (getting there as hash+6), the next probe is hash+10. This typically improves performance by spreading out values in the hash table (avoiding clustering). Unless space is critical, it is typically better to use separate chaining than any of the kinds of probing discussed above.
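To make the probing rules above concrete, here is a minimal sketch of linear probing with an "available" marker (every name is an illustrative assumption; for brevity it stores plain Objects and omits load-factor tracking and rehashing):

  //A minimal sketch of linear probing via open addressing; all names are illustrative.
  public class ProbingHashSet {
    private static final Object AVAILABLE = new Object();    //marks removed bins
    private Object[] bins = new Object[10];

    private int compress(Object value)
      { return Math.abs(value.hashCode() % bins.length); }

    public boolean contains(Object value) {
      int start = compress(value);
      for (int i = 0; i < bins.length; i++) {                //probe at most every bin once
        Object stored = bins[(start + i) % bins.length];
        if (stored == null)
          return false;                                      //truly empty bin stops the probe
        if (stored != AVAILABLE && stored.equals(value))
          return true;                                       //AVAILABLE bins never stop the probe
      }
      return false;
    }

    public void add(Object value) {
      if (contains(value))
        return;                                              //sets store unique values
      int bin = compress(value);
      while (bins[bin] != null && bins[bin] != AVAILABLE)
        bin = (bin + 1) % bins.length;                       //linear probe for a usable bin
      bins[bin] = value;                                     //a real version would also rehash when too full
    }

    public void remove(Object value) {
      int start = compress(value);
      for (int i = 0; i < bins.length; i++) {
        int bin = (start + i) % bins.length;
        if (bins[bin] == null)
          return;                                            //not present
        if (bins[bin] != AVAILABLE && bins[bin].equals(value)) {
          bins[bin] = AVAILABLE;                             //mark the bin; don't null it out
          return;
        }
      }
    }
  }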
Hash tables using probing via open addressing get clogged up with values in a nonlinear way: as the load factor approaches 1, the searching time approaches O(N). So, if this method of probing is used, we need to set the load factor threshold lower, say at .5. You can simulate such a hash table and measure the performance degradation by counting the average number of probes at various load factors. If you look on the Programs link on the course web site, you will see a download of HashCode, which allows you to test the String's hashCode method statistically. If you want, you can write a static hashCode function and compare it to the one built into Java (for both speed and its ability to generate a wide range of values). There are two drivers there, one testing chaining and one testing open addressing.
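If you would rather run a quick experiment of your own, here is a minimal sketch of such a simulation (all names are illustrative assumptions): it fills a linear-probing table with random Strings at several load factors and reports the average number of bins examined per successful search.

  //A minimal sketch of a probe-counting experiment; every name is an illustrative assumption.
  import java.util.Random;

  public class ProbeCounter {
    public static void main(String[] args) {
      int tableLength = 100_000;
      for (double loadFactor : new double[]{0.25, 0.5, 0.75, 0.9}) {
        String[] bins   = new String[tableLength];
        int      toStore = (int) (loadFactor * tableLength);
        Random   rand    = new Random(42);

        String[] stored = new String[toStore];
        for (int v = 0; v < toStore; v++) {
          String value = "value-" + rand.nextInt();
          stored[v] = value;
          int bin = Math.abs(value.hashCode() % tableLength);
          while (bins[bin] != null)                     //linear probing to an empty bin
            bin = (bin + 1) % tableLength;
          bins[bin] = value;
        }

        long probes = 0;
        for (String value : stored) {                   //search for every stored value
          int bin = Math.abs(value.hashCode() % tableLength);
          probes++;                                     //count each bin examined
          while (!value.equals(bins[bin])) {
            bin = (bin + 1) % tableLength;
            probes++;
          }
        }
        System.out.printf("load factor %.2f: %.2f probes per search%n",
                          loadFactor, (double) probes / toStore);
      }
    }
  }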