The paper: "The Power of Simple Tabulation Hashing", by Mihai Pătraşcu and Mikkel Thorup, arXiv:1011.5200.
The problem: hashing algorithms are designed to work well with hash functions that are truly random (chosen uniformly from all possible functions that map the keys to be stored in the hash table into indices). But they are actually used with functions that are very nonrandom. How can we design nonrandom or pseudorandom hash functions that can be evaluated efficiently and that cause hashing algorithms to behave as well as they do with random hash functions?
The known solution (Wegman and Carter 1981, etc): find a constant C such that the hash algorithm in question requires C-tuples of hash function values to be independent (each C-tuple of keys should map to each C-tuple of indices with equal probability) but does not require (C+1)-tuples to be independent. Then find a hash function that achieves the desired level of independence. One simple construction for C-wise independent functions is to use a uniformly random polynomial of degree C-1 over a finite field as the hash function. This has the desired theoretical properties: it requires only a small number of random bits (the coefficients of the polynomial) to hash a much larger number of keys, it is C-wise independent, and (when C is constant) it takes O(1) time per hash table value to evaluate. However, because evaluating polynomials involves multiplication, it is slow, especially when C is large, and better alternative solutions work well only for very small values of C.
The "new" solution: break the key into bytes, and let T[i,b] be a table of random numbers indexed by byte position and byte value. Then simply let the hash value h(x) of a key x whose bytes are x0, x1, x2, ... be h(x) = T[0,x0]+T[1,x1]+T[2,x2]+...
Actually, this isn't a new solution at all: it was already described by Wegman and Carter. But it's only 3-independent. The innovation in the new paper is to show that the same hash function can be proven to work well, even for some hash algorithms that require C-independence for C > 3. Specifically:
- In chaining (that is, resolving collisions by using linked lists of items that all hash to the same index) the longest chain has length O(log n/log log n) with high probability.
- Linear probing (that is, resolving collisions by placing each item into the nearest empty slot, and performing lookups by searching forwards from the hashed index until finding an empty slot) in a table with n items stored and (1+ε)n slots takes expected time O(1/ε2) per operation.
- Cuckoo hashing (a more complex collision resolution strategy with the advantage that search operations are extremely fast, although updates can be slower) has polynomially small failure probability (but not as good as the linearly small failure probability that one would get with a truly random hash function). This is not good enough to use this function for a dynamic cuckoo hash table, but is good enough for using cuckoo hashing in the static case.