Hashing Introduction: Hash tables are a data structure for storing and retrieving unordered information, whose primary operations are in complexity class O(1) - independent of the amount of information stored in the hash table. We saw that digital trees had this same property, but only for special keys (keys that were digital: meaning we could decompose them into a first part of the key, a second part of the key, etc., as we can with digits in a number or characters in a String). Hash tables work with any kind of key. The most commonly used implementations of the Set and Map collection classes in Java (which are unordered) are implemented by hash tables. We will also implement a Map via a hash table in Program #4. Here are some terms that we need to become familiar with to understand (and talk about) hash tables: hash codes, compression function, bins/buckets, overflow-chaining, probing, load factor, and open addressing. We will discuss each below.

Our Approach: We will start by discussing linear searching (using a linked list) of a collection of names. If we instead used an array of 26 indexes and put in index 0 a linked list of all names starting with "a", in index 1 a linked list of all names starting with "b", ... and in index 25 a linked list of all names starting with "z", we could search for a name about 26 times faster by looking just in the right index for any name (according to its first letter). This speed increase assumes each letter is equally likely to start a last name, which is not a realistic assumption. Going further, if we used an array of 26x26 (676) indexes, storing in index 0 a linked list of all names starting with "aa", in index 1 a linked list of all names starting with "ab", ... and in index 675 a linked list of all names starting with "zz", we could search for a name about 676 times faster by looking just in the right index for any name (according to its first two letters). Of course, this speedup isn't achieved unless we have at least 676 names and each box is equally likely to hold a name, which isn't true: few names start with combinations like "bb", etc. And what about looking up information that isn't a String: for example, the WordGenerator uses a List of Strings as the key for its map. So while this approach seems promising, we need to modify it to be truly useful.

Hash Codes: Hashing is that modification. We declare an array with any number of "bins" or "buckets" and use a "hash code" to compute an int value for any piece of data that can go into the hash table. It must always compute the same hash code for the same value, so it cannot use random numbers. We should design such a hash code to generate the widest variety of numbers (over the range of all integers), with as small a probability as possible of two different values hashing to the same number. Of course, in the case of using Strings as values, there are more Strings than int values. There are only about 4 billion different ints -actually, exactly 4,294,967,296- but an unbounded number of Strings, which can be of any length, meaning any number of characters: even if we consider only Strings with lower-case letters, there are 26^N different Strings with N characters; 26^7 is 8,031,810,176, so there are already more 7-letter Strings than ints. Once we have a hash code function, we use a "compression function" to convert the hash code to a legal index in our hash table.
One simple compression function computes the absolute value of the hash code (hash codes can be negative or positive, but array indexes are always non-negative) and then computes the remainder (using the % operator) with the hash table length as the 2nd operand, producing a number between 0 and length-1 of the hash table. Other compression functions use bitwise operations to compute a bit pattern in the correct range. The hashCode method is important: it is one of the few methods declared in the Object class, so every class can override it (it is as fundamental as toString and equals, which are also declared in Object). Here is a slightly simplified hashCode for the actual String class in Java (we will see the exact code later).

  public int hashCode() {
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];   //promotion of char -> int: its ASCII value
    return hash;
  }

"a".hashCode() returns 97 ('a' has an ASCII value of 97; you can actually call .hashCode on any String literal, which is really replaced by a String object storing that value) and "aa".hashCode() returns 3104 (31*97 + 97). Generally, if String.length() is n (the chars array contains n values), then its hashed value is given by the formula

  chars[0]*31^(n-1) + chars[1]*31^(n-2) + ... + chars[n-2]*31^1 + chars[n-1]

So, "ICS23".hashCode() returns 69,494,394, and "Richard Pattis".hashCode() returns -125,886,044! Yes, because of arithmetic overflow and the standard properties of binary numbers, the result might be negative (and overflow of negative numbers can go positive again). Recall that Java does not throw any exceptions when arithmetic operators produce values outside of the range of int: hashing is one of the few places where this behavior produces results that are still useful. Generally, the hashCode for each numeric type is an int built from that value's bit pattern, and characters hash to their ASCII values. Every other type in Java is built from these: for example, Strings are an ordered sequence of chars, and we compute the hash code of a String by looking at every char in it. Note that "ab".hashCode() != "ba".hashCode(), which is fine because those two Strings are not equal: the order of the letters is important. In fact, a good hashCode function for any ordered data type will produce different values for different orders of the same values. Likewise, here is the Java hash code for a List (defined in the AbstractList class). It relies on computing the hash code for every value in the list (and if a null value is in the list, using 0 for its hash code). Computing the hash code for a sequence of values in a list is similar to computing the hash code for a sequence of characters in a String, because the order is important.

  public int hashCode() {
    int hashCode = 0;
    for (E e : this)
      hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
  }

Note that if we store an explicit null in a List (this is legal) we cannot call the hashCode method on it: Java would throw a NullPointerException, so we just use the value 0 for its hash code. Recall that in the WordGenerator program we used a key that was a List of Strings. So we would iterate through the list of String values, computing the hash code for each String in the list, and combining them as shown above. It is critical that the equals and hashCode methods for a class are compatible. The key property is that if a.equals(b) then a.hashCode() == b.hashCode().
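To make the compression step described above concrete, here is a minimal sketch (the method name compress and the parameter binCount are illustrative assumptions, not java.util code):

  //A minimal sketch of hash code compression; names here are illustrative assumptions.
  public static int compress(Object value, int binCount) {
    int hash = value.hashCode();        //may be any int, negative or positive
    return Math.abs(hash % binCount);   //remainder first, then absolute value
  }

Taking the remainder before the absolute value sidesteps the one corner case where Math.abs misbehaves (Math.abs(Integer.MIN_VALUE) is still negative); the result is always in the range 0 to binCount-1.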
Of course the opposite of that key property is not true: many different Strings have equal hashCodes, because there are more Strings than ints. Still, it is unlikely that many different Strings actually used in some problem will have the same hash code (if there are only millions, not billions, of them). This compatibility requirement is very important for unordered collections like Sets. Typically we iterate through the values of a Set to compute the hashCode. But values in a Set can be stored in (and iterated through in) any order. Regardless of the order these values are processed, they must compute the same hash code each time hashCode is called (because set implementations that happen to store their values in different orders are still .equals). So, hash codes are computed differently, but only slightly, for unordered collections (like a Set). Since the hash codes should be the same no matter what order the values are stored, we cannot use the hash code method above, but instead must use something that accumulates the hash code values of its elements without regard to their order. Here we just add together (without the weighting of 31*) all the values.

  public int hashCode() {
    int h = 0;
    for (E e : this)
      if (e != null)
        h += e.hashCode();   //or we could multiply the hash codes (starting h at 1)
    return h;
  }

Thus, if we add or multiply together all the hash codes, it doesn't make a difference what order we do the addition or multiplication: a+b+c, b+a+c, c+b+a, etc. all compute the same value (as do a*b*c, b*a*c, c*b*a, etc.)

Finally, here is something interesting that I came across when reading the String.java class. The real hashCode method in String looks like the following, with cachedHash being an instance variable in every String object, which is initially 0. The first time hashCode is called, cachedHash is 0, so the method computes the hash value and stores it in cachedHash before returning it. Every other time it is called, it immediately returns cachedHash, doing no further computation. Remember that Strings are immutable, so once they are constructed their contents do not change; therefore, once the hashCode is computed for a String object, that String object will always return the same result.

  public int hashCode() {
    if (cachedHash != 0)
      return cachedHash;
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];   //promotion of char -> int
    return cachedHash = hash;
  }

If a String's computed hashCode is 0, it will be recomputed every time hashCode is called, even after it has been computed and cached (because with the != 0 test, Java cannot tell the difference between a hash code that has not been computed and a hash code that has been computed with value 0). Typically, the only String whose hash code is 0 is ""; most other Strings will have a non-0 hashCode. Recomputing the hash code of "" is very quick, because it stores no values (chars.length is 0, so the loop immediately exits). We could include an extra boolean instance variable named hashCached, initialized to false and set to true after caching. So we would have

  public int hashCode() {
    if (hashCached)
      return cachedHash;
    int hash = 0;
    for (int i = 0; i < chars.length; i++)
      hash = 31*hash + chars[i];   //promotion of char -> int
    hashCached = true;
    return cachedHash = hash;
  }

...but that is a bit of overkill. Of course, we also could compute the hashCode of every String WHEN IT IS CONSTRUCTED, storing it in cachedHash, never checking this value, and always returning the cached value.
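That compute-at-construction alternative might look like the following minimal sketch (the class name SimpleString and the fields chars and cachedHash are illustrative, mirroring the simplified code above, not the real String source):

  //A sketch of computing and caching the hash code at construction time.
  public class SimpleString {
    private final char[] chars;
    private final int    cachedHash;

    public SimpleString(char[] source) {
      chars = source.clone();
      int hash = 0;
      for (int i = 0; i < chars.length; i++)
        hash = 31*hash + chars[i];   //promotion of char -> int
      cachedHash = hash;             //computed exactly once, at construction
    }

    @Override
    public int hashCode() {
      return cachedHash;             //no "is it cached yet?" check needed
    }
  }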
The upside is that the hashCode method would always just return this cached value; the downside is that we would have to compute the hash code for every String when it was created, even if we were never going to call hashCode on it. The approach above, actually used in Java, caches a hash code only if asked to compute it at least once.

Hash tables are used frequently in practice, so there have been many studies, both theoretical and empirical, of hashCode methods, which are at the heart of hash tables working efficiently. The best methods are quick to compute and return results scattered all over the range of int. Given such a hash code, the rest of the code to implement a hash table (see below) is straightforward. If you look on the Programs link on the course web site, you will see a download of HashCode, which allows you to test the String's hashCode method statistically. If you want, you can write a static hashCode function and compare it to the one built into Java (for both speed and its ability to generate a wide range of values). There are two drivers there, one testing chaining and one testing open addressing: both are covered in this lecture.

Hash Tables: Now let's look at how to insert a value into a hash table. Recall that the basic structure is an array of bins/buckets, each referring to a linked list of values that hash/compress to that index. The basic picture looks like the following (here with 5 bins/buckets).

 Bin/Bucket          Collisions (handled through separate chaining)
   +---+
   |   |    +----+---+    +----+---+
 0 | ------>| v1 | --+--->| v2 | / |
   |   |    +--------+    +--------+
   +---+
   |   |
 1 | / |
   |   |
   +---+
   |   |    +----+---+
 2 | ------>| v3 | / |
   |   |    +--------+
   +---+
   |   |    +----+---+    +----+---+
 3 | ------>| v4 | --+--->| v5 | / |
   |   |    +--------+    +--------+
   +---+
   |   |    +----+---+
 4 | ------>| v6 | / |
   |   |    +--------+
   +---+

Generally, bins can have zero, one, or many values. We say that values v1 and v2 collided in bin 0, and we have used "separate chaining" (a linked list) to keep track of all the "collisions". With good hash/compression functions, the values stored in a hash table should be approximately equally distributed throughout the bins. Of course, there will typically be some bins with fewer values (maybe even none) and some bins with more, because hash codes aren't perfect.

Hash Table Algorithms: 2 Important Algorithms manipulating Hash Tables

1) add (for Set) / put (for Map):
   Use hashCode/compression to compute a bin index (for the value/key).
   Search all the collisions to see if the information is there (sets have unique values/maps have unique keys).
   If it is there: for a Set don't change anything; for a Map change the value associated with the key.
   If it is not there: add the information anywhere convenient in that bin: in a list node at the front, rear, wherever: there is no ordering of information in the bins for sets and maps.

2) contains (for Set) / get or containsKey (for Map):
   Use hashCode/compression to compute a bin index.
   Search all the collisions to see if the information is there.

------------------------------ 2 more Important Algorithms

The "Load Factor" of a hash table is the ratio of the number of values it contains divided by the number of bins. It is the expected number of values that will hash/compress to each bin. Generally, Java's classes try to keep the load factor below 1 (more bins than values, so each bin contains zero or very few -hopefully just 1, but this is rarely achieved- values).
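To tie algorithms 1 and 2 and the load factor together, here is a minimal sketch of a separate-chaining hash set (every class, field, and method name here is an illustrative assumption, not code from java.util; the rehash step it calls is the subject of algorithm 3, next):

  //A minimal sketch of a separate-chaining hash set; all names are illustrative assumptions.
  public class ChainedHashSet<E> {
    private static class LN<T> {                 //list node for one bin's chain
      T     value;
      LN<T> next;
      LN(T value, LN<T> next) { this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private LN<E>[] bins = (LN<E>[]) new LN[8];
    private int     size = 0;

    private int compress(E value)                //hash code -> legal bin index
      { return Math.abs(value.hashCode() % bins.length); }

    public boolean contains(E value) {           //algorithm 2
      for (LN<E> c = bins[compress(value)]; c != null; c = c.next)
        if (c.value.equals(value))
          return true;
      return false;
    }

    public boolean add(E value) {                //algorithm 1
      if (contains(value))
        return false;                            //sets store unique values: no change
      int bin  = compress(value);
      bins[bin] = new LN<>(value, bins[bin]);    //anywhere convenient: here, the front
      size++;
      if ((double) size / bins.length > 1.0)     //load factor check
        rehash();                                //double the table length (algorithm 3)
      return true;
    }

    private void rehash() { /*described next*/ }
  }

The sketch adds new values at the front of the bin's chain, which is as convenient a spot as any, since bins are unordered.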
When the add/put method is called for a Set/Map, it checks the load factor and increases the length of the hash table to ensure that the load factor is always below the specified threshold, doubling the size of the hash table if need be.

3) Double the Length of the Hash Table:
   Remember the old hash table array and allocate a new one 2 times as big.
   Traverse the old hash table (you can do it directly or use the Iterator if one is available), adding each value to the new hash table, but NOT NECESSARILY IN THE SAME BIN! Instead, add it by applying hashing/compression again (compression will be DIFFERENT, because the length of the table is doubled, so we compute the same hash value but compute the remainder using the DIFFERENT TABLE LENGTH).
   By being clever, we can re-use the entire LN (list node), so we don't have to allocate any new objects; but this makes the code harder to write.

4) Iterator: Uses index and cursor instance variables.
   Constructor: Loop to find the first bin with a list of values (not null):
     Succeed: set index to the bin number, cursor to the first LN in the bin
     Fail   : set cursor to null
   hasNext: return cursor != null
   next: if hasNext is true, return the appropriate value by using only the cursor;
     reset cursor = cursor.next; if it becomes null, loop to advance the index to later bins, stopping at a non-null one
     Succeed: set index to the bin number, cursor to the first LN in the bin
     Fail   : set cursor to null
   remove: Store the previous cursor (in an extra instance variable) to do the removal,
     or store no extra information but use a trailer node in every bin (which makes removal easier; we will do this in Programming Assignment #4)

Note that iterators for data types implemented by hash tables return their values in a "strange" order (based on their hash values and collisions). In fact, adding one value to a hash table can cause it to exceed its load factor, and thus it will double the number of bins (doubling the length of the hash table), which causes rehashing. Now the iterator order might be completely different. We should never assume any special ordering for these iterators, since data types might be implemented by hash table data structures. Instead, if we want to ensure values are processed in a specific order, we should put the values produced by the iterator into a List, sort the list, and iterate through it.

Implementing Overflows: Note that we can store in each bin an array of values, a linked list of values, a binary search tree of values, etc. (possibly using different implementations of the Set data type). The reason that linked lists, and not more exotic data structures, are used is that with a load factor <= 1 we expect to find few values in each bin, so more exotic data structures just make the coding more difficult with little gain in speed for searching a very small number of values.

Why are hash tables O(1) with good hash functions: In the worst case, a hashCode will be the same for every value hashed (not likely, but possible), so no matter how big the hash table is, we would go to the same bin and search all N values there. Such a process would be O(N). But let's assume that we are using a hashCode that does a pretty good job (as most do). For such a good hash code, if the table length is M, we would expect to search about N/M values in each bin. Thus, for any given M the method is O(N/M), or just O(N) since it seems M is "a constant". But we are doing something a bit more subtle.
By keeping the load factor <= 1, we ensure that M >= N (M is always at least N, and sometimes as much as 2*N, right after the load factor exceeds 1 and we double the length of the hash table). So, M is not a constant: it grows with N. In fact, we know that M >= N, so in the "worst case" M = N. Therefore the complexity class of O(N/M) is really O(N/N) or O(1). In fact, for any reasonable load factor the complexity class is O(1). For a load factor of 2, the complexity is O(N/(N/2)) or O(2), which is the same as O(1). Certainly with this bigger load factor we'd expect to spend twice as much time searching, but on average we'd still examine some fixed number of values in each bin.

Security via Hashing as a 1-way function: Given a String, it is very easy to compute its hash code; but it is typically not easy, given a hash code, to determine what String(s) will hash to that value. In mathematics, such functions are called 1-way (or non-invertible) functions. We can use 1-way functions to provide security: let's look at one example, password security. Have you ever wondered how a computer system stores your password? If the computer stored everyone's password in a file (as a list of user-ids and their passwords), then anyone who could steal/read that file could compromise all the accounts. Here is another way to store the information: store a list of user-ids and the hash code of each user's password. When a user tries to log in, the system would hash the password they type in, and see if it matched the hashed entry in the password table. The hashing method would be public, but it would be a 1-way function. So, even knowing the algorithm wouldn't allow you to easily compute a password from its hash code (but would allow you to easily compute a hash code from a password). Now, if someone could read the password file, they would see only the hash codes of the passwords, but not the passwords themselves. Of course, they could write a program that generated all possible Strings, hashed each, and looked for one that had the same hash code. Then they could log into the system with the user-id and that password (which would hash to the one stored in the password file). That is why you are encouraged to have passwords that are long and have upper AND lower case letters (and maybe even symbols in them): it increases the size of the alphabet, making a search over all Strings in that alphabet harder. Assume that a computer could compute 10^9 (a billion) hash codes per second. There are 1.4 x 10^17 different Strings of length 10 (52^10, using upper- and lower-case letters). If we tried to generate all these Strings, hash each, and compare it to the one we are looking for, it would take about 1.4 x 10^8 seconds, or about 4.5 years. This is another reason to change your password frequently (say, every 4.5 years!) Actual password systems are more complicated these days, and use advanced cryptographic methods. But these methods themselves are typically based on the general theory of 1-way functions: the functions are just much more interesting than computing the Java hash codes of Strings.

The Complexity of Doubling Arrays: We have seen that arrays are used to store all kinds of collections: I have supplied array implementations of all the collection classes, and some advanced implementations that we will write, like HashMap, are also based on arrays. In these implementations, the add method determines whether to double the length of the array.
Most adds don't double the length of the array: in all the collections we've seen before, we just put the value in the next unused index in the array (or, in a hash table, chain it in a linked list in its bin). But, as the collection size grows, eventually we double the length, which means we copy all the values currently in the array to a new array. So most adds are O(1) but some adds are O(N). So, if we are talking about upper bounds, we might say at worst each add is O(N) and we do N adds, so the process of doubling and copying to get N values into an array is O(N^2). But we can derive a better (smaller) upper bound. We will talk about "amortized complexity" to analyze this case.

At worst, we allocate a collection to have 1 array cell.
Adding the 1st value stores it in the array and requires no copying.
Adding a 2nd value doubles the size of the array, copying the 1st value (so 1 copy).
Adding a 3rd value doubles the size of the array, copying the 1st-2nd values (so 2 more copies).
Adding a 4th value stores it in the array and requires no copying.
Adding a 5th value doubles the size of the array, copying the 1st-4th values (so 4 more copies).
Adding the 6th-8th values stores them in the array and requires no copying.
Adding a 9th value doubles the size of the array, copying the 1st-8th values (so 8 more copies).
Adding the 10th-16th values stores them in the array and requires no copying.
Etc.

Each time we double the length, we copy twice as many values as before, but we can add twice as many values before having to double again. If we end up with N values in the array, what is the total number of copies that we have to make? We will see below that it is O(N) -actually bounded by 2N. When we double the array size from 1 to 2, we have copied 1 value in total. When we double the array size from 2 to 4, we copy 2 more values for a total of 1+2=3 copies. When we double the array size from 4 to 8, we copy 4 more values for a total of 1+2+4=7 copies. Notice that the sum 2^0 + 2^1 + 2^2 + ... + 2^N = 2^(N+1) - 1. We used this formula before to compute the maximum number of nodes in a binary tree of height h: 2^(h+1) - 1 (1 node at depth 0, 2 nodes at depth 1, 4 nodes at depth 2, etc). Here is a table. On the left, N is the number of values in the array, and on the right is the total number of values we need to copy.

    N     Movements/Copying
  -----------------------------
    1       0
    2       1
   3- 4     3 = 1 (1->2) + 2 (2->4)
   5- 8     7 = 1 (1->2) + 2 (2->4) + 4 (4->8)
   9-16    15 = 1 (1->2) + 2 (2->4) + 4 (4->8) + 8 (8->16)
  17-32    31 = 1 (1->2) + 2 (2->4) + 4 (4->8) + 8 (8->16) + 16 (16->32)
  33-64    63 = ....

Notice that for N a perfect power of 2, there are N-1 copies. When N is 1 bigger than a power of two, there are 2*N-3 copies. So, the total number of copies we must make as an array grows to N values is O(N). Rather than thinking about "add" sometimes doing very little work and sometimes doing lots, we can think about "add" doing a bit of extra work (a constant amount) every time that we call it: really, most of the time we don't have to do the work, but every so often we have to do a lot of work, not just for the new value but for all the ones before it. This is called amortized complexity. Still, the total work done for adding N values is just O(N) -it cannot be less, because N*O(1) is O(N). Another way to think of this is to ask how much extra work there is to double the number of values in an array.
If the original length is N (say it is a power of 2), then when we add N more values, we will have to first copy each of the N values originally in the array into the new array, and then copy each of the new N values into the array. Thus, overall, adding N more values into the array requires copying N values and then adding N values without copying, for a total of 2N operations. The total complexity is still just O(N): every add required the actual add, and the total of N adds also required copying a total of N values originally in the array. Think of every add as counting for itself and for one of the copying operations.

Mutating a Value in a Hash Table: If we are using a tree or hash table to store an object in a Set or a key in a Map, we should not mutate that object inside the Set/Map. This is because the place it is stored depends on the value/key: when we put a value into a tree, we use a comparator to determine which subtrees it belongs in; when we put a value into a hash table, we compute its hashCode to determine which array index to put it in. If we mutate a key in a tree or hash table, it probably will not belong where it is: it would belong in a different location in the tree or a different bin in the hash table. That is, the Java code to locate and store a value in a Set/key in a Map is based on using its state (for comparison or hashing). If we store a value in a Set/key in a Map, and then mutate it (change its state), then we may never be able to locate that value/key again. So, it is a good idea to use immutable classes (like String) for keys, or at least be extra careful not to mutate a value INSIDE a tree/key INSIDE a Map. Instead, we can remove it, change the key, and then add it back. For a hash table, both removal and adding are O(1) operations, so although awkward, changing a value this way requires only a constant amount of work to update it in the hash table.

Hashing with Open Addressing Instead of Chaining: The final big topic on hashing is collision resolution without separate chaining. By far the most useful way to handle different values that hash to the same bin is to store all these values in a linear linked list referred to by the bin. Hash table operations will do a linear search of such a list, which, with a good hash code and a low load factor, will generally not be long. There is an alternative way to deal with collisions. While it takes up no extra space (compared to chaining, which constructs new LN objects), it can cause a big increase in the time needed to search for a value in a hash table unless the load factor is kept well below 1/2. The method is called probing via open addressing. We will discuss 3 different forms of probing: linear, quadratic, and double hashing. In linear probing, we compute the bin for storing a value; if the table already contains a value at that bin, we increment the index by 1 (incrementing past the last array index brings us back to index 0) and keep probing bins until we find an empty bin, and then put the value there. The locate method likewise continues from the original bin, linearly looking through later bins, until we find the value we are looking for or reach an empty bin (meaning the value is not in the hash table). Note that many values can be examined along the way, including values with completely different hash codes that just happen to occupy the bins between our starting bin and the first empty bin. For example, let's use linear probing via open addressing for the following hash table.
    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     |     |     |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Thus, if "a" hashes to bin 4, we put it there because it is empty.

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     | "a" |     |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Likewise, if "b" hashes to bin 5, we put it there because it is empty.

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     | "a" | "b" |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

But now, if "c" hashes to bin 4, we have to probe bins 4 and 5 until we find that bin 6 is the first empty one after 4, and put "c" there.

    0     1     2     3     4     5     6     7     8     9
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
 |     |     |     |     | "a" | "b" | "c" |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

If we are looking to see whether "d" is in the table (say it hashes to bin 4), we start probing at bin 4 for "d", then check bin 5, bin 6, and finally bin 7: it is empty, so we know "d" is not in the hash table.

Another problem with probing via open addressing involves removing values. If we want to remove "b", we first find it by hashing to bin 5 and finding it there. But if we actually removed it (putting a null there), the following problem would occur if we were then looking for "c": we hash "c" (see above) and get bin 4, then we look at the next bin (which is now empty, because we removed "b"), and we would think that "c" is not in the hash table. But at the time "c" was added, there was a value in bin 5, which is why "c" had to go to bin 6. So, in order to know we are done when we reach an empty bin, when we remove a value we find it and mark its bin as "available". Bins marked "available" have previously stored values and can store new values (if we reach them while adding), but unlike empty bins they cannot stop the probing: probing must continue until it reaches the value being located, or has passed through all occupied and "available" bins and reached a bin storing null. We can create a special Object to represent "available".

Instead of linear probing (where the bin number increases by 1 every time), we can do quadratic probing, where the i-th probe is at the original bin plus x*i + y*i^2 (for i=0 the first time, i=1 the second time, etc., once we specify the values for x and y: if both are 1/2, e.g. (i+i^2)/2, we probe hash, hash+1, hash+3, hash+6, hash+10, etc.). Of course, the compression function handles indexes that get bigger than the table length. Likewise, in double hashing we use a second hash method h2 (probing hash, hash+h2(1), hash+h2(2), hash+h2(3), ...) to compute a value that is continually added to the bin index (with the compression function) until the value is located or an empty bin is found. Unlike linear probing, with quadratic or double hashing the probe sequence "after a bin" depends on how many probes it took to reach that bin in the first place. That is, in quadratic probing, if we hash directly to a bin, the next probe is at hash+1; but if we reach that bin on the third probe (getting there as hash+6), the next probe is hash+10. This typically improves performance by spreading out values in the hash table (avoiding clustering). Unless space is critical, it is typically better to use separate chaining than any of the kinds of probing discussed above.
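To make the probing rules above concrete, here is a minimal sketch of linear probing with an "available" marker (every name is an illustrative assumption; for brevity it stores plain Objects and omits load-factor tracking and rehashing):

  //A minimal sketch of linear probing via open addressing; all names are illustrative.
  public class ProbingHashSet {
    private static final Object AVAILABLE = new Object();    //marks removed bins
    private Object[] bins = new Object[10];

    private int compress(Object value)
      { return Math.abs(value.hashCode() % bins.length); }

    public boolean contains(Object value) {
      int start = compress(value);
      for (int i = 0; i < bins.length; i++) {                //probe at most every bin once
        Object stored = bins[(start + i) % bins.length];
        if (stored == null)
          return false;                                      //truly empty bin stops the probe
        if (stored != AVAILABLE && stored.equals(value))
          return true;                                       //AVAILABLE bins never stop the probe
      }
      return false;
    }

    public void add(Object value) {
      if (contains(value))
        return;                                              //sets store unique values
      int bin = compress(value);
      while (bins[bin] != null && bins[bin] != AVAILABLE)
        bin = (bin + 1) % bins.length;                       //linear probe for a usable bin
      bins[bin] = value;                                     //a real version would also rehash when too full
    }

    public void remove(Object value) {
      int start = compress(value);
      for (int i = 0; i < bins.length; i++) {
        int bin = (start + i) % bins.length;
        if (bins[bin] == null)
          return;                                            //not present
        if (bins[bin] != AVAILABLE && bins[bin].equals(value)) {
          bins[bin] = AVAILABLE;                             //mark the bin; don't null it out
          return;
        }
      }
    }
  }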
Hash tables using probing via open addressing get clogged up with values in a nonlinear way: as the load factor approaches 1, the searching time approaches O(N). So, if this method of probing is used, we need to set the load factor threshold lower, say at .5. You can simulate such a hash table and measure the performance degradation by counting the average number of probes at various load factors. If you look on the Programs link on the course web site, you will see a download of HashCode, which allows you to test the String's hashCode method statistically. If you want, you can write a static hashCode function and compare it to the one built into Java (for both speed and its ability to generate a wide range of values). There are two drivers there, one testing chaining and one testing open addressing.
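If you would rather run a quick experiment of your own, here is a minimal sketch of such a simulation (all names are illustrative assumptions): it fills a linear-probing table with random Strings at several load factors and reports the average number of bins examined per successful search.

  //A minimal sketch of a probe-counting experiment; every name is an illustrative assumption.
  import java.util.Random;

  public class ProbeCounter {
    public static void main(String[] args) {
      int tableLength = 100_000;
      for (double loadFactor : new double[]{0.25, 0.5, 0.75, 0.9}) {
        String[] bins   = new String[tableLength];
        int      toStore = (int) (loadFactor * tableLength);
        Random   rand    = new Random(42);

        String[] stored = new String[toStore];
        for (int v = 0; v < toStore; v++) {
          String value = "value-" + rand.nextInt();
          stored[v] = value;
          int bin = Math.abs(value.hashCode() % tableLength);
          while (bins[bin] != null)                     //linear probing to an empty bin
            bin = (bin + 1) % tableLength;
          bins[bin] = value;
        }

        long probes = 0;
        for (String value : stored) {                   //search for every stored value
          int bin = Math.abs(value.hashCode() % tableLength);
          probes++;                                     //count each bin examined
          while (!value.equals(bins[bin])) {
            bin = (bin + 1) % tableLength;
            probes++;
          }
        }
        System.out.printf("load factor %.2f: %.2f probes per search%n",
                          loadFactor, (double) probes / toStore);
      }
    }
  }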