- In this short post, I will talk about the concept of Associativity in Hashing. This is a fairly well known method so the article might be an old hat for some readers.
When implementing a collision resolution strategy for our hash table design, there are usually two directions that we take:
- Chaining – There is a per bucket overflow chain (a linked list of overflow buckets) containing items that earlier hashed to main bucket in the hash table but couldn’t be stored there because the bucket was already full (collision). Thus each time we hit such a condition on a given bucket B, we create a new overflow bucket (effectively creating a new node on linked list), and insert it at the head/tail of overflow chain of bucket B. The new item is then stored in this overflow bucket. Such hashing schemes are often referred to as closed addressed hashing or out-of-line hashing.
- Probing – Unlike the chaining approach, we can also design an inline strategy to handle collisions. In such design, there is no overflow chain on any bucket. The main hash table buckets are the only buckets we have at our disposal. So whenever there is a collision on bucket B, we simply probe in a linear order (starting from bucket B and peeking at B+1, B+2 …..) for the next vacant bucket B’ where the new item can be inserted. Such hashing schemes are often referred to open addressed hashing or inline hashing or linear probing.
Whatever the collision resolution mechanism is, the primary goal should be to keep the performance of get() and put() operations optimal. Some applications can tolerate a slow write (put operation on hash table) code path, but fast lookups is a requirement that should be supported by data stores that use some sort of hashing techniques (key value stores, hash indexes in databases etc).
Under what conditions can the performance of operations on hash table deteriorate ?
In almost all of the cases, load factor (aka occupancy percentage) has a significant impact on the performance of get() and put() operations. Load factor is the ratio of number of items (or records) in the hash table to the total number of buckets in the hash table array. It is basically an indicator of the degree of fullness of the hash table.
Consider the probing based hash tables. Study [see links below] shows that such hash table can show some performance degradation for both get() and put() operations even at a load factor of 50% (around half the buckets in hash table have items, and half are empty). This is because of “probing” which is required in the presence of collisions. If a key K hashes to bucket B and B is already occupied by some other key K1, the put() algorithm will linearly go over every subsequent bucket looking for an empty bucket which can hold key K. Any subsequent get() on key K will also have to do the same thing.
When linear probing based hashing schemes hit these situations often, the hash table is resized. This is so that performance of hash table operations doesn’t degrade significantly. Although resizing will possibly provide optimal performance (because number of buckets will increase), this approach is not space efficient. The fact that hash table’s performance shows degradation even at low fill factors (hash table is 50% full) and forces a resize is a straight forward evidence that hash table is not memory efficient. Resizing (allocating more buckets or a bigger hash table) isn’t really a good idea when 40%-50% buckets are already empty in the existing hash table.
Similar problem can be understood for chaining based hash tables. On every collision, we create a new overflow bucket thus effectively increasing the size of overflow chain and every subsequent get()/lookup operation has to traverse down the overflow chain on hash bucket to look for the presence of a Key-Value pair. The longer the size of chain, the slower our get() operation will be. Again, the hash table will be resized based on some internal statistics (average length of overflow chain etc etc).
It is possible to resize the hash table _purely_ on the basis of load factor thresholds. Let’s say we keep a threshold of 75%. Internally we need to track how many items have been inserted into the hash table. Once the hash table is 75% full, we trigger a resize operation that allocates a larger hash table and rehashes the existing items. This is fine as long as space efficiency is concerned. For example, if we start with hash table of size 16, with a load factor threshold of 75% we will resize when at least 12 buckets are full. This is reasonable.
However, in most cases this not really a good idea because it is like waiting for the hash table’e performance to degrade significantly and then doing something about it. If the collisions are happening (which implies longer overflow chains for chaining based tables and longer probe lengths for linear probing based tables), the slowdown in get() and put() operations will start becoming apparent. Most implementations of hash table will resize sooner than later and this is where the space inefficiency concern mentioned above is reinforced.
Nowadays, there are many hash table designs that use associativity in hash buckets to make them space efficient — implying performance of hash table operations is optimal (or at least acceptable/reasonable) even at high load factors. The concept of associativity is not really new. Since ages it has been used in the design of CPU caches (full-associative, 4-way set associative etc). In fact, the adoption of this design principle in hashing is also not new.
The fundamental idea is to use a fat bucket. Regular/traditional hash table designs are describes with the view that each bucket stores a single KV item. With associativity, each bucket is capable of storing “N” items — implying the hash table is N way set associative. Here are the advantages:
- For chaining based tables, we create an overflow bucket only when all the “N” slots in current bucket are full and there is “N+1″th collision. Thus we can possibly keep the length of chain shorter because each bucket now stores multiple items. The shorter the chain, the lesser the number of pointer de-references.
- Each read of a hash bucket gives us N items to look at whereas earlier we had to de-reference the pointer on overflow chain to look at every other item.
- Similarly, the probe lengths tend to be shorter for linear probing based hash tables because each bucket now stores “N” KV items.
- The performance of both get() and put() operations is not expected to degrade at low fill factors.
- We should try to align our fat hash bucket as per the size of CPU cache line (64 bytes on x86). This will also increase the performance by reducing cache misses as we move from one KV item to another.
Note that associativity is not an alternative to “resizing”. Hash table will eventually have to be resized. It is a typical Computer Science problem — What do we do when we exhaust the existing space taken by a particular data structure? The data structure has to be resized. There is no doubt in that. The point I am trying to make is that we should try to refrain from resizing the hash table at low load factors. Instead, the slowdown in get() and put() operations should be addressed by changing the design of hash table — which is what associativity does.
So it is perfectly fine to resize the hash table based on load factor thresholds (something like 75%, 80%). But we should try to use the “associative bucket” approach so that hash table’s performance remains reasonable as the load factors begins to increase from 0 to 20%, 30%, 50% and so on and we don’t keep on resizing at low occupancy rates.
The concept is being widely used in hash tables based on Cuckoo Hashing. Here are few useful links: