After reading lecture 1 of the roughgarden notes for cs168 I thought I would try to implement consistent hashing in python. Enough people have written about consistent hashing that I don’t think I have anything interesting to add, but I will document the problem and my implementation here.
In brief, consistent hashing is a way of assigning resources -> servers such that when a new server is added, the resources mostly remain on their original server. The typical use case is for CDNs: we want to keep a list of resources and which server is hosting them. When we add or remove a server, we want the majority of the resources to remain assigned to that server for caching purposes.
If we were to take a naive solution, we might create an array of size N, where N is the number of servers we have. To figure out which server holds a resource, we would take the hash of the resource and then modulo by N to figure out which server should be assigned that resource. The problem is that if we add or remove a server, every resource’s assigned server will shift over.
Consistent hashing gets around this problem by doing the following: we hash servers and resources into the same hash table. To figure out which server holds a resource, we walk rightwards from the resource until we find a server. When we add a new server, we just place it into a new spot in the hash table. If we get to the end of the hash table and don’t find a server, we continue searching from the front. When we add a new server, it takes over all the resources between the server to its left and the server to its right in the hash table.
Conceptually, the idea is elegant and simple - but there are two main issues: 1) how do we implement this hash table and 2) how do we reduce variance: if we have bad luck, it might be possible to insert two servers next to each other, causing one server to have many more resources cached on it than the other.
The simplest way to implement consistent hashing is the following: for each server, we insert its hash into a sorted list. When we want to find where a resource belongs, we do a binary search on the hash of the resource and find the next element in the list. Done! Insertions: O(N) into an array, Lookups: O(lg N)
Another method would be to use a binary tree: we hash each server and insert it into a binary tree and since we expect the hash function to be random-ish, the binary tree depth should grow logarithmically. Insertions: log(N), Lookups: log(N), deletions: log(N), find successor: log(N)
If we want to get even more complex, we can use van emde boas trees which have anticipated search time of log log M where M is the expected number of items in the the vEB tree.
For my purposes, I went with a binary tree and ended up implementing the following operations:
class ServerTree(): def add_server(self, num_copies) def remove_server(self) def successor(self, resource)
To reduce variance, we add multiple copies of each server to our hash table. The ideal number is to use log(K) replicas of each server, where K is the amount of resources we are expecting to store. If we want to store 1 billion resources, then we should use lg(1e9) replicas of each server to balance their load. By adding multiple copies, we will reduce the probability that one server gets over-saturated compared to its neighbors.