I'm in the mood for a couple of basics posts. As long-time readers might know, I love writing about data structures.
One of the most important and fundamental structures is a hashtable. In fact, in a lot of modern programming languages have left hashtables behind, for reasons I'll discuss later. But if you want to understand data structures and algorithmic complexity, hashtables are one of the essentials.
A hashtable a structure for keeping a list of (key, value) pairs, where you can look up a value using the key that's associated with it. This kind of structure is frequently called either a map, an associative array, or a dictionary.
For an example, think of a phonebook. You've got a collection of pairs (name, phone-number) that make up the phonebook. When you use the phonebook, what you do is look for a person's name, and then use it to get their phone number.
A hashtable is one specific kind of structure that does this. I like to describe data structures in terms of some sort of schema: what are the basic operations that the structure supports, and what performance characteristics does it have for those operations.
In those schematic terms, a hashtable is very simple. It's a structure that maintains a mapping from keys to values. A hashtable really only needs two operations:
put(key, value): add a mapping from
valueto the table. If there's already a mapping for the key, then replace it.
get(key): get the value associated with the key.
In a hashtable, both of those operations are extremely fast.
Let's think for a moment about the basic idea of a key-value map, and what kind of performance we could get out of a cople of simple naive ways of implementing it.
We've got a list of names and phone numbers. We want to know how long it'll take to find a particular name. How quickly can we do it?
How long does that take, naively? It depends on how many keys and values there are, and what properties the keys have that we can take advantage of.
In the worst case, there's nothing to help us: the only thing we can do is take the key we're looking for, and compare it to every single key. If we have 10 keys, then on average, we'll need to do an average of about 5 steps before we find the key we're looking for. If there are 100 keys, then it'll take, on average, about 50 steps. If there are one million keys, then it'll take an average of half a million steps before we can find the value.
If the keys are ordered - that is, if we can compare one key to another not just for equality, but we can ask which came first using a "less than or equal to" operator, then we can use binary search. With binary search, we can find an entry in a list of 10 elements in 4 steps. We can find an entry in a list of 1000 keys in 10 steps, or one in a list of one million keys in 20 steps.
With a hashtable, if things work right, in a table of 10 keys, it takes one step to find the key. 100 keys? 1 step. 1000 keys? 1 step. 1,000,000,000 keys? Still one step. That's the point of a hashtable. It might be really hard to do something like general a list of all of the keys - but if all you want to do is look things up, it's lightning.
How can it do that? It's a fairly simple trick: the hashtable trades space for time. A hashtable, under normal circumstances, uses a lot more space than most other ways of building a dictionary. It makes itself fast by using extra space in a clever way.
A hashtable creates a bunch of containers for (key, value) pairs called buckets. It creates many more buckets than the number of (key, value) pairs than it expects to store. When you want to insert a value into the table, it uses a special kind of function called a hash function on the key to decide which bucket to put the (key, value) into. When you want to look for the value associated with a key, it again uses the hash function on the key to find out which bucket to look in.
It's easiest to understand by looking at some actual code. Here's a simple, not at all realistic implementation of a hashtable in Python:
class Hashtable(object): def __init__(self, hashfun, size): self._size = size self._hashfun = hashfun self._table = [ for i in range(size)] def hash(self, key): return self._hashfun(key) % self._size def get(self, key, value): self._table[self.hash(key)].append((key, value)) def get(self, key): for k,v in self._table[self.hash(key)]: if k == key: return v return None
If you've got a good hash function, and your hashtable is big enough, then each bucket will end up with no more than one value in it. So if you need to insert a value, you find an (empty) bucket using its hashcode, and dump it in: one step. If you need to find a value given its key, find the bucket using its hashcode, and return the value.
There are two big problems with hashtables.
First, everything is dependent on the quality of your hash function. If you hash function maps a lot of values to the same bucket, then your performance is going to suck. In fact, in the worst case, it's no better than just searching a randomly ordered list. Most of the time, you can come up with a hash function that does pretty good - but it's a surprisingly tricky thing to get right.
Second, the table really needs to be big relative to the number of elements that you expect to have in the list. If you set up a hashtable with 40 buckets, and you end up with 80 values stored in it, your performance isn't going to be very good. (In fact, it'll be slightly worse that just using a binary search tree.)
So what makes a good hash function? There are a bunch of things to consider:
- The hash function must be deterministic: calling the hash on the same key value must always produce the same result. If you're writing a python program like the one I used as an example above, and you use the value of the key objects fields to compute the hash, then changing the key objects fields will change the hashcode!
- The hash function needs to focus on the parts of the key that distinguish between different keys, not on their similarities. To give a simple example, in some versions of Java, the default hash function for objects is based on the address of the object in memory. All objects are stored in locations whose address is divisible by 4 - so the last two bits are always zero. If you did something simple like just take the address modulo the table size, then all of the buckets whose numbers weren't divisible by four would always be empty. That would be bad.
- The hash function needs to be uniform. That means that it needs to map roughly the same number of input values to each possible output value. To give you a sense of how important this is: I ran a test using 3125 randomly generated strings, using one really stupid hash function (xoring together the characters), and one really good one (djb2). I set up a small table, with 31 buckets, and inserted all of the value into it. With the xor hash function, there were several empty buckets, and the worst bucket had 625 values in it. With djb2, there were no empty buckets; the smallest bucket had 98 values, and the biggest one had 106.
So what's a good hash function look like? Djb2, which I used in my test above, is based on integer arithmetic using the string values. It's an interesting case, because no one is really entirely sure of exactly why it works better than similar functions, but be that as it may, we know that in practice, it works really well. It was invented by a guy named Dan Bernstein, who used to be a genius poster in comp.lang.c, when that was a big deal. Here's djb2 in Python:
def djb2(key): hash = 5381 for c in key: hash = (hash * 33) + ord(c) return hash
What the heck is it doing? Why 5381? Because it's prime, and it works pretty well. Why 33? No clue.
Towards the beginning of this post, I alluded to the fact that hashtables have, at least to some degree, fallen out of vogue. (For example, in the Go language standard library, the map type is implemented using a red-black tree.) Why?
In practice, it's rarely any faster to really use a hashtable than to use a balanced binary tree like a red-black tree. Balanced trees have better worst-case bounds, and they're not as sensitive to the properties of the hash function. And they make it really easy to iterate over all of the keys in a collection in a predictable order, which makes them great for debugging purposes.
Of course, hash tables still get used, constantly. The most commonly used data structures in Java code include, without a doubt, the HashMap and HashSet, which are both built on hashtables. They're used constantly. You usually don't have to implement them yourself; and usually system libraries provide a good default hash function for strings, so you're usually safe.
There's also a lot of really fascinating research into designing ideal hash functions for various applications. If you know what your data will look like in advance, you can even build something called a perfect hash function, which guarantees no collisions. But that's a subject for another time.