ICS 46 Spring 2022
Notes and Examples: Hash Tables


Avoiding the cost of searching

So far, we've talked about a number of different search structures, data structures whose goal is to help us find things more efficiently. Given a search key, we can find that key (and any value associated with it), as well as insert or remove keys. We've seen that you can implement this idea using linked lists (either ordered or unordered), arrays or std::vectors, binary search trees, AVL trees, or skip lists, each of which presents different tradeoffs among various goals like ease of implementation, execution speed, and the probability (and cause) of its worst-case behavior.

But even the best of these search structures leaves us wanting something more. That they require us to search at all seems unfortunate. Many of us, in our day-to-day lives, have a system by which we organize our possessions — our clothes are hanging in a closet, our wallets or purses are on a particular table, an alarm clock by the side of our bed, toiletries in a cabinet in a bathroom, and so on. The advantage of this is that we never have to search for anything; as long as we follow the system and put things where they belong, we always know where to find them. Some of us have been doing this since we were young children, so it's not as though this is a complicated idea, which might rightly lead you to wonder why a data structure couldn't do the same thing. If we had a data structure that had this same well-defined notion of where things belonged, it would never be necessary to search. When we wanted something, the data structure would simply look where it belonged. If it was there, we'd be done; if it wasn't there, we'd know for sure it wasn't anywhere else, because we already looked in the only place where it belonged. What's not to like?

Of course, as a practical matter, it's not quite that simple; even our best idea for organizing our stuff in the real world may not work out according to plan. If, for example, we decide that we should keep our socks in a particular drawer, that's a fine plan; whenever we buy a new pair of socks, we'll know right where to put them. But what if we buy too many socks, so that the drawer isn't big enough for all of them, but we really want to keep all of these socks? What do we do then? A data structure would need to recognize this same issue and deal with it; if too many objects belong in a particular place, we'd need it to work around that problem for us.

Still, the idea is generally sound: Storing things where they belong makes them easier to find later, tantalizing us with the idea that we might be able to find elements in Θ(1) time. This concept is embodied by a data structure called a hash table.


What is a hash table?

A hash table is a data structure that stores a collection of unique search keys (and, optionally, a value associated with each), just as binary search trees and skip lists do. However, hash tables differ significantly from those structures in their implementation, eschewing a tree-based structure in favor of a simpler one: an array. In their structurally-simplest version, hash tables store the keys (and associated values) — or, at least, pointers to them — directly in the cells of the array.

First things first, though: In which cell should a key be stored? If our goal is to be able to find it later without having to search for it, then we'll have to have a uniform notion of "Where does this key belong?" that doesn't change over the course of time. This is the job of a hash function, which is used to determine, for any given key, where it belongs. We say that the hash function returns a hash value or a hash. Each time we insert a key, remove a key, or look up a key, we determine its hash, and that will tell us where to insert it, remove it, or find it. The word "function" here is used in the mathematical sense, rather than the C++ sense: for a hash function h, given a key k, h(k) will always return the same hash value. (If it didn't, how would we ever be able to find keys? It's imperative that we look in the same place where we originally inserted it, or we might as well just use an unsorted linked list or array.) Quite often, a hash function will return an unsigned integer value, a value which can then easily be reduced into the range of possible array indices by using the modulo operator (% in C++). This approach allows a single hash function to be independent of the size of the array, which makes these functions more reusable. As we'll see, how you design your hash function is an enormous factor in how successfully you'll be able to use a hash table, but we'll revisit this issue a little later in these notes.
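
For example, here's a minimal sketch in C++ of the separation between hashing and reducing the hash to an array index. The names hash and indexFor are hypothetical, not part of any particular library:

    #include <cstddef>

    // A sketch of a hash function for unsigned integer keys, kept separate
    // from the step that reduces the hash into the range of valid indices.
    unsigned int hash(unsigned int key)
    {
        return key;            // the simplest possible hash function: h(k) = k
    }

    std::size_t indexFor(unsigned int key, std::size_t capacity)
    {
        // The same hash function works for an array of any size, because the
        // modulo operator reduces the hash into the range [0, capacity).
        return hash(key) % capacity;
    }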

By way of example, let's consider a hash table with capacity 7. Let's suppose that our keys are unsigned integers, and that we'll use the hash function h(k) = k. So, technically, when we hash a key k and then determine which index in the array it belongs in, we'll ultimately need to calculate k % 7 (because we'd take the hash and then use % to reduce the hash to the range of available array indices). Initially, our hash table would be empty:

index:  0    1    2    3    4    5    6
key:    -    -    -    -    -    -    -

Suppose we started by inserting the key 10 into this hash table. To do that, we'd hash 10 and determine that it belongs at index 10 % 7 = 3. We could then insert the key in cell 3 of the array.

index:  0    1    2    3    4    5    6
key:    -    -    -    10   -    -    -

We could similarly insert 14 (14 % 7 = 0), 19 (19 % 7 = 5), and 23 (23 % 7 = 2), and everything would be moving along nicely. Here's what our hash table would then look like:

index:  0    1    2    3    4    5    6
key:    14   -    23   10   -    19   -

Note that all of these insertions would take Θ(1) time; the cost of calculating the index and storing a key in the corresponding cell doesn't vary with the number of keys or the capacity of the array, since we can access any cell in an array in Θ(1) time given its index. So far so good.

Subsequent lookups would also be Θ(1), using a similar algorithm: hash the key, figure out the corresponding array index, then check whether the key is where it's supposed to be. For example, we could determine that the key 10 is in this table by looking at index 3 (10 % 7 = 3) and finding the key 10 there. Similarly, we could determine that the key 21 is not in the table by looking at index 0 (21 % 7 = 0) and finding a key other than 21 there, and that the key 32 is not in the table by looking at index 4 (32 % 7 = 4) and finding no key there at all. This is shaping up to be a great strategy!
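
As a sketch of that idea (ignoring collisions entirely for the moment), a lookup only ever examines the one cell where the key would have to be. The use of std::optional to represent empty cells is my own illustrative choice, not a requirement:

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Look up a key by checking only the cell where it belongs.  With
    // h(k) = k and an array of capacity table.size(), that cell is at
    // index k % table.size().
    bool contains(const std::vector<std::optional<unsigned int>>& table,
                  unsigned int key)
    {
        std::size_t index = key % table.size();
        return table[index].has_value() && *table[index] == key;
    }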

Unfortunately, there is a big problem awaiting us as we consider this further. What happens if we wanted to insert the key 26 into this table? 26 belongs at index 5 (26 % 7 = 5), but there is already a key 19 at that same index. And the underlying data structure is an array, after all, so we simply can't store two things in a cell; every cell of an array is the same size and, in C++, has the same type. So what can we do? How can the keys 19 and 26 both be stored in a hash table if the hash function says they both belong in the same place? It's not reasonable to simply fail in a case like this — even if you did, what would the exception say? There's no reasonable justification for not being able to store the keys 19 and 26 in a hash table at the same time.

The issue we're faced with here is called a collision. We say that there is a collision involving the keys k1 and k2 when both k1 and k2 are determined to belong at the same index. The important question here is this: What can we do about collisions?

One way to solve this problem is to eliminate it altogether. As long as every key belongs at a unique index, according to our hash function, then we won't ever have this problem. In practice, though, this idea is very often a nonstarter, because there are too many possible keys. For example, imagine that your keys are UCI student IDs. Student IDs at UCI are eight-digit unsigned integers, which means that there are 10^8 = 100,000,000 possible student IDs. So the only way to eliminate the possibility of a collision in this scenario would be to have an array with a capacity of 100,000,000. There are approximately 30,000 students enrolled at UCI (as of this writing), so if we were only storing information about currently-enrolled students, we'd be using less than 0.03% of the array, with the rest being empty space. Even if we consider all of the students who were ever enrolled at UCI in its history, we're probably not talking about much more than 1,000,000 students (as of this writing), so we still have an array that's 100 times larger than it needs to be. That's a lot of memory to give up to get Θ(1) searches!

So, given that we can't easily avoid the problem of collisions in practice, we'll have to devise a solution that takes them into account and works around them. And until we've done that, there's no sense in doing any kind of serious asymptotic analysis of hash tables; they simply aren't any good to us, in general, until they can handle collisions (except, perhaps, in scenarios where the number of possible keys is very small, such that we could afford to pre-allocate an array with enough space for every possible key).


Handling collisions

Collisions occur when two or more keys belong in the same place, so, broadly speaking, there are two ways we could try to handle them:

  1. Change what's stored in each cell of the array, so that more than one key can be stored at the same index.
  2. Allow a key to be stored at an index other than the one where the hash function says it belongs.

Each of these will involve making some adjustments to our design, and either solution is viable, though, as we'll see, they present slightly different tradeoffs.


Separate chaining

One way to allow hash tables to handle collisions is called separate chaining, which is an approach that allows more than one key to be stored at the same index in the array. Rather than storing keys (and, optionally, their associated values) directly, each cell in the array will instead be a singly-linked list, in which each node contains one key (and, optionally, its associated value) that belongs in that array cell. The nice thing about a linked list is that it has no size limit; it simply grows as more nodes are added and shrinks as they're removed. This makes it possible to store any number of keys at one index in the array, so collisions no longer present any kind of conceptual problem.

For example, after adding the keys 10, 14, 19, and 23 in the previous example, and then adding the key 26, a separately-chained hash table might look something like this:

[Figure: a hash table with separate chaining, in which the linked list at index 5 contains both 19 and 26]

The lookup algorithm would need to be adjusted, as well:

  1. Use the hash function to determine the key's hash, then use % to determine the appropriate index where the key belongs.
  2. Search the linked list at that index, looking for the key. (This would be just like any other linked list search.)

Similarly, the algorithm for insertion would need to be adjusted a bit:

  1. Use the hash function to determine the key's hash, then use % to determine the appropriate index where the key belongs.
  2. Search the linked list at that index, looking for the key, so you're sure that the key is not already present. (The keys need to be unique.)
  3. Add a new node to either end of the linked list at that index, then store the key in the new node. (Why either end? Adding at the front would be inexpensive. Adding at the end would also be inexpensive since we already did a search that terminated at the end. Where you add the node, then, is mainly a matter of taste.)
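
Here's a minimal sketch of those two algorithms for a separately-chained hash table of unsigned integer keys. The class name and the use of std::forward_list for the chains are my own illustrative choices; a real implementation might well build its own linked lists instead.

    #include <algorithm>
    #include <cstddef>
    #include <forward_list>
    #include <vector>

    // A sketch of a separately-chained hash table of unique unsigned keys.
    class ChainedHashSet
    {
    public:
        explicit ChainedHashSet(std::size_t capacity)
            : buckets(capacity)
        {
        }

        bool contains(unsigned int key) const
        {
            // Search the one chain where the key would have to be.
            const auto& chain = buckets[indexFor(key)];
            return std::find(chain.begin(), chain.end(), key) != chain.end();
        }

        void insert(unsigned int key)
        {
            if (!contains(key))                          // keys must be unique
            {
                buckets[indexFor(key)].push_front(key);  // adding at the front is cheap
            }
        }

    private:
        std::size_t indexFor(unsigned int key) const
        {
            return key % buckets.size();                 // h(k) = k, reduced by the capacity
        }

        std::vector<std::forward_list<unsigned int>> buckets;
    };

Adding at the front of the chain keeps the insertion itself Θ(1), which matches the note above about either end of the list being an inexpensive place to add.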

Analysis

Without thinking too hard about it, we're already faced with an uncomfortable reality: We've seemingly lost the main benefit we were trying to achieve, because we now have to search a linked list. But how long will this take to do? If we have a hash table with an array of capacity c and n keys stored in it, what can we conclude?

First of all, we'll always presume that the hash function runs in Θ(1) time, because hash functions generally don't take any other keys into account — so it doesn't matter how many keys are already stored — and they also don't do any fancy footwork that is affected by the array's capacity. Their goal is to be fast but effective, so we'll say that hashing takes Θ(1) time — or, at the very least, that the time isn't affected by c or n.

At worst, it might be the case that every key is stored in the same cell in the array, in which case we have an n-node linked list that would need to be searched. This could be because of extraordinarily bad luck, or it could be because we chose a poor hash function, the most ridiculously bad of which is h(k) = 0. Either way, lookups would now be a matter of searching a linked list containing every key in the hash table, so we would say that this takes O(n) time.

In practice, however, this is an unnecessarily pessimistic result, because it presumes a worst case that is easily avoided if we're careful about how we design our hash function. If we assume that we have a "good" hash function — which we'll define as one for which, given a randomly-chosen key k, it is equally likely that the key will belong at any particular index — then we would expect the length of each linked list to tend toward n/c. We'll define n/c as the load factor of a hash table. So, for example, in a hash table with capacity 100 containing 1,000 keys, we would expect the length of each linked list to be around 1,000/100 = 10; stated differently, we would say that this hash table has a load factor of 10.

But even that, in practice, is unnecessarily pessimistic, because there's one more useful trick we can employ. When the number of keys in the hash table begins to approach the array's capacity, we can do the same thing that std::vector does when it fills up: Make a new array that's larger, then copy everything from the old array into the new one. It's a little bit trickier than that, though, because the change in the array's capacity will also change where keys belong — since the capacity is one factor in determining the right index in which to store a key — so we would actually need to do this:

  1. Allocate a new array that's larger than the original one. (So that this will amortize well, we would want it to be a multiple of the original size, though it wouldn't have to be a large multiple. Something like 1.5 or 2 would be fine.)
  2. Re-hash the keys into the new array, which is to say that we need to iterate through the old array, find all of the keys, and freshly determine where each one belongs in the new array. The hash function will not have changed, but the capacity of the array will have, and that's part of how we decide where keys belong.

If we did that whenever, say, the load factor reached something like 0.8, then n/c would never be more than 1, so we'd expect the vast majority of the linked lists in the array to have no more than a couple of keys in them. Ultimately, then, searches in our hash table would take (more or less) Θ(1) time.
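
As a rough sketch (building on the same vector-of-forward_list representation as the earlier sketch, with 0.8 and a doubling growth factor as illustrative choices, not requirements), re-hashing might look like this:

    #include <cstddef>
    #include <forward_list>
    #include <utility>
    #include <vector>

    // Grow and re-hash when the load factor n/c gets too high.  Every key's
    // index depends on the capacity, so each key must be placed freshly.
    void rehashIfNeeded(std::vector<std::forward_list<unsigned int>>& buckets,
                        std::size_t keyCount)
    {
        double loadFactor = static_cast<double>(keyCount) / buckets.size();

        if (loadFactor >= 0.8)
        {
            std::vector<std::forward_list<unsigned int>> newBuckets(buckets.size() * 2);

            for (const auto& chain : buckets)
            {
                for (unsigned int key : chain)
                {
                    newBuckets[key % newBuckets.size()].push_front(key);
                }
            }

            buckets = std::move(newBuckets);
        }
    }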

Granted, there's no guarantee here. A poorly-chosen hash function will lead to a very bad result. So it's necessary to take seriously the way that we choose our hash function; the performance of a hash table is almost wholly dependent on the quality of the hash function.


Designing a "good" hash function

We'll say, generally, that a "good" hash function has two qualities:

  1. It can be computed quickly — in Θ(1) time, without an expensive constant factor — since every insertion, removal, and lookup begins by hashing a key.
  2. It spreads keys evenly across the array, so that a randomly-chosen key is equally likely to belong at any particular index, rather than some indices being much more popular than others.

So how do we design such a function? The answer depends as much on the set of keys as it does on anything else; in other words, we have to know something about the data we're hashing to do a good job of hashing it.

Why you can't be cavalier in your choice of hash function

Let's imagine that our keys are credit card numbers and that we want to hash them into a hash table of capacity 10,000. Credit card numbers are sequences of digits, so one way to do that would be to use this very straightforward hash function:

h(k) = first four digits of k

So, for example, if we were hashing the credit card number 4012 3456 7890 1234, its hash would be the first four digits: 4012. This would certainly be something we could calculate fast, so it meets the first of our two goals in designing a "good" hash function.

How about the second goal? This is where we need to think more carefully. It turns out that this approach is actually quite bad, but it's not obvious why if you don't know anything about how credit card numbers are actually assigned. There are a couple of details about where credit card numbers come from that turn out to be important. (I'm paraphrasing the reality, partly because I don't know every detail, and partly to keep things simple. But the spirit of it definitely matches the reality.)

  1. The first digit identifies the card network: for example, Visa numbers begin with a 4 and Mastercard numbers begin with a 5.
  2. The next several digits identify the bank that issued the card, so cards issued by the same bank share those digits.

At first, that might seem like more than you ever wanted to know about credit card numbers, but you have no choice but to embrace these details if you want to take credit card numbers and hash them properly. What we learn from those two facts above is that the first four digits of a credit card number are a terrible way to hash credit card numbers! Why is that? Because the first four digits are probably, by far, the most biased digits on the card. Every Visa number begins with a 4, and every Mastercard number begins with a 5. What proportion of credit cards in the United States are Visa and Mastercard? As of this writing, that proportion appears to be between 80% and 90%. So, if you use the first four digits to hash a credit card number, 80-90% of your credit card numbers will hash to 20% of the cells in your array (those whose indices run from 4000-5999). What's worse, there will be more bias in the second through fourth digits, because some of the larger banks will issue many more cards than the much smaller ones, but the second through fourth digits are partly determined by who issued the credit card. Suppose you chose 8,000 active credit cards in the United States at random; think about how many of them would be Bank of America Visa debit cards. And they'll all hash to the same cell: 4217. (Having 10,000 separate cells in our hash table only helps if we're potentially going to use all 10,000 of them.)

The moral of the story is that you have to take hashing seriously enough that you don't get yourself into this kind of trouble. A better approach would be to use digits on the credit card number that aren't biased — if there are any, that is — or to combine all of the digits mathematically in a way that leads to better guarantees about how well their hashes will spread around. (There are universal hash functions that can do this in some circumstances, though that delves into math that's beyond the scope of this course.)
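
For example, one common style of digit-combining hash — a sketch of the general idea only, not a universal hash function and not tailored to credit card numbers in particular — multiplies and adds its way through the whole key, so that every digit influences the result:

    #include <string>

    // Combine every character of the key into one hash value.  The constant
    // 31 is a conventional (and somewhat arbitrary) multiplier; unsigned
    // overflow simply wraps around, which is harmless here.
    unsigned int hashAllDigits(const std::string& cardNumber)
    {
        unsigned int h = 0;

        for (char c : cardNumber)
        {
            h = h * 31 + static_cast<unsigned char>(c);
        }

        return h;
    }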

If you knew, though, that the last ten digits of credit card numbers were generated completely at random — though I don't think this is true, in practice — then this would be a perfectly reasonable hash function for credit card numbers:

h(k) = last four digits of k

because there would be absolutely no bias in its result. By our assumption, given a randomly-chosen credit card, it would have an equal probability of having any particular values in its last four digits.


Open addressing

Another approach to solving the problem of collisions is to allow keys to be stored somewhere other than where the hash function says they belong. This is not to be done willy-nilly — we still need to be able to find them easily, without arbitrary searching — but, if we're careful about it, we can maintain the benefits of hashing while still storing keys (and their associated values, optionally) directly in the array.

This approach, generally, is called open addressing, which revolves around a simple premise: When a key is said to belong at some index, but another key is already stored there, we'll find an alternative index at which to store it. There are different variants of the open addressing approach, and they primarily differ in terms of how they find that alternative index. We'll focus on just one of them, to give you an idea of how this works.

Linear probing is the simplest version of open addressing. The easiest way to understand how it works is to consider its insertion algorithm:

  1. Hash the key k, then reduce the hash to an index using the % operator. We now know the index i at which the key belongs.
  2. Inspect the array cell at index i. If it's empty, store the key in that index, and we're done.
  3. If the array cell is occupied, probe backward, by looking first at the index i − 1, then at the index i − 2, and so on. When we find an empty cell, we store the key in that cell, and we're done. If we reach index 0 and still haven't found an empty cell, we "wrap around" to the last cell in the array and keep going.

To ensure that this algorithm terminates, we can either ensure that the number of keys in the hash table is never more than c − 1 (where c is the capacity of the array), or we can add logic that causes us to stop probing if we ever get back where we started (which is slower, but reclaims the one empty cell).
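
A sketch of that insertion algorithm (again with std::optional marking empty cells, a representational choice of mine, and assuming the table is never allowed to fill completely, so the loop is guaranteed to reach an empty cell) might look like this:

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Insert a key using linear probing: start at the index where the key
    // belongs, then probe backward (wrapping around) until an empty cell
    // is found.
    void insert(std::vector<std::optional<unsigned int>>& table, unsigned int key)
    {
        std::size_t i = key % table.size();

        while (table[i].has_value())
        {
            if (*table[i] == key)
            {
                return;                                  // the key is already present
            }

            i = (i == 0) ? table.size() - 1 : i - 1;     // probe backward, wrapping around
        }

        table[i] = key;
    }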

Suppose that we have an empty hash table with an array of capacity 7, and that we're storing arbitrary unsigned integer keys using the hash function h(k) = k. Let's take a look at how this behaves. Suppose we first insert the keys 10 and 14: 10 % 7 = 3 and 14 % 7 = 0, so we'll store these keys at indices 3 and 0, respectively:

index:  0    1    2    3    4    5    6
key:    14   -    -    10   -    -    -

Next, suppose we insert the key 24. 24 % 7 = 3, so this key belongs in the same cell as the key 10. Since that cell is occupied, however, we'll use linear probing and discover that the previous cell, whose index is 2, is empty. So we'll store the key 24 there.

index:  0    1    2    3    4    5    6
key:    14   -    24   10   -    -    -

Finally, suppose that we insert the key 28. 28 % 7 = 0, so this key belongs in cell 0, where key 14 already resides. Probing backward (and wrapping around), we find that the last cell (at index 6) is empty, so we store 28 there.

index:  0    1    2    3    4    5    6
key:    14   -    24   10   -    -    28

Now that we have these keys stored, how do we find them? The answer is to use the same basic algorithm to look them up that we did to insert them.

  1. Hash the key k, then reduce the hash to an index using the % operator. We now know the index i at which the key belongs.
  2. Inspect the array cell at index i. If it contains the key we're looking for, we found it and we're done. If it's empty, we didn't find it and we're done.
  3. If the array cell at index i is occupied but contains a key other than the one we're looking for, probe backward (and "wrap around" if necessary) until we either find the key we're looking for or we reach an empty cell.

That seems reasonable enough. Using that algorithm, we'd find 24, for example, by determining that 24 % 7 = 3, then probing backward from index 3 to index 2, where we would find 24. We could also determine that 21 is not in the hash table by determining that 21 % 7 = 0, then probing backward (and wrapping around) to index 6, then index 5, which is empty.

There's one more wrinkle to consider: How do we handle removal? Suppose we remove the key 10:

index:  0    1    2    3    4    5    6
key:    14   -    24   -    -    -    28

A subsequent lookup of 24 will now fail, because we'll start by calculating 24 % 7 = 3, then discover that the cell at index 3 is empty, which tells us that we can terminate our search. The way we work around this problem is to use a special process for removing keys: leave a mark in the cell — maybe an extra bit — that specifies that we should still search through it, even though it's technically empty:

index:  0    1    2    3    4    5    6
key:    14   -    24   X    -    -    28

If we ever inserted a key that belonged there, such as 31, we could simply store 31 in that cell; it's available for subsequent insertion. But by leaving it marked as having been removed, we'll still be able to find keys whose earlier insertions probed through that cell.

Over time, though, this would mean that the array would tend to fill up with marked cells. We can periodically compress the table, by re-hashing everything into it from scratch — starting with an empty array and then re-inserting the keys into it, so that there will be no marked cells left.
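
Putting the pieces together, here's a rough sketch of lookup and removal with those marks. A simple empty/occupied distinction isn't enough once removal marks exist, so this sketch uses a three-state cell (one representational choice among several), and, as with insertion, it assumes at least one truly empty cell always remains so that the probing loops terminate:

    #include <cstddef>
    #include <vector>

    // Each cell is empty, occupied, or marked as removed.
    enum class CellState { Empty, Occupied, Removed };

    struct Cell
    {
        CellState state = CellState::Empty;
        unsigned int key = 0;
    };

    bool contains(const std::vector<Cell>& table, unsigned int key)
    {
        std::size_t i = key % table.size();

        // Keep probing through occupied and removed cells; only a truly
        // empty cell tells us the key cannot be anywhere in the table.
        while (table[i].state != CellState::Empty)
        {
            if (table[i].state == CellState::Occupied && table[i].key == key)
            {
                return true;
            }

            i = (i == 0) ? table.size() - 1 : i - 1;     // probe backward, wrapping around
        }

        return false;
    }

    void remove(std::vector<Cell>& table, unsigned int key)
    {
        std::size_t i = key % table.size();

        while (table[i].state != CellState::Empty)
        {
            if (table[i].state == CellState::Occupied && table[i].key == key)
            {
                table[i].state = CellState::Removed;     // mark, rather than truly emptying
                return;
            }

            i = (i == 0) ? table.size() - 1 : i - 1;
        }
    }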

Analysis

While an asymptotic analysis is complicated by a number of factors, we can still understand a few things quickly without getting into too much depth.

  1. Keys that collide end up occupying contiguous runs of cells, which we'll call clusters.
  2. A key that hashes to an index anywhere within a cluster has to be probed past every occupied cell until the cluster's edge is reached, and inserting that key makes the cluster one cell longer.
  3. The larger a cluster becomes, the more likely it is that the next key will hash somewhere into it, so large clusters tend to grow faster than small ones (and, eventually, neighboring clusters merge into even larger ones).

This problem, whereby clusters grow and performance degrades quickly, is known as primary clustering. As the load factor approaches 1.0 (i.e., every cell being in use), the number of cells you'll have to look at in a search will tend to degrade rapidly. I've seen empirical studies that show results like the ones below, which show the average number of cells that might need to be checked in a hash table with capacity 1,000 in both successful and unsuccessful searches (differentiated by whether we're searching for a key that's actually in the hash table).

Load Factor    Cells Checked in        Cells Checked in
               Successful Search       Unsuccessful Search
0.25           1.2                     1.4
0.5            1.5                     2.4
0.75           2.5                     8.3
0.9            5.0                     39
0.99           16                      360

You can see that once you reach load factors around 0.75, things start degrading quite quickly. By the time the load factor reaches 0.99, we'll be looking at over a third of the elements in the hash table on every unsuccessful search! So one thing to think about when deciding whether you might employ this strategy is whether you can afford to keep the load factor low enough — you'll need to give up some memory to be spent on perpetually empty cells.

It should be noted that there are other approaches, such as double hashing, that can be used to partially mitigate the problem of primary clustering — by having different keys that hash to the same cell probe different paths to find a final place to be stored — though these lie beyond the scope of this course.

Comparison with separate chaining

One thing you might be wondering is why you would ever want to use open addressing if separate chaining is a reasonably good solution already. The answer lies in the difference between them. In particular, separate chaining requires us to maintain linked lists, which has two potential downsides:

  1. Every node in a linked list costs extra memory for its pointer, and every insertion or removal requires a dynamic memory allocation or deallocation, which is relatively expensive.
  2. The nodes of a linked list can end up scattered throughout memory, so traversing a chain tends to have poor locality of reference; probing consecutive cells of an array, by contrast, is much friendlier to caches.


Tradeoffs between hash tables, AVL trees, and skip lists

Now that we've discussed three different ways of solving the same problem, it's a good idea for us to compare and contrast them. Why did we learn about three different solutions? What does each one provide that the others do not?

Performance comparison

With a good hash function and a load factor that's kept under control, a hash table offers amortized Θ(1) lookups, insertions, and removals on average, but it can degrade to O(n) if the hash function (or bad luck) works against us. An AVL tree guarantees Θ(log n) for these operations even in the worst case, regardless of which keys arrive or in what order. A skip list offers Θ(log n) with high probability, with its worst case depending on the behavior of a random number generator rather than on the keys themselves.

Functionality comparison

There is one other important way that these data structures differ, which is worth considering when you're selecting one of them.