Bloom Filters in Distributed Systems: Speed Through Probability

Have you ever wondered how Instagram almost immediately hits backs with the “Username already exists!” message when you try to create a new account? Lets make a quick estimate -

Suppose the number of usernames stored in Instagram is around 5 billion (active+inactive+suspended+bots). In the worst case scenario there will be 5,000,000,000 comparisons to find if a username exists if searched sequentially. Lets assume 1 comparison takes 100 ns (optimistic). So 5 billion comparisons will take around 8.3 minutes! This is highly impractical for for a website that must respond in milliseconds.

This is where bloom filters shine. Bloom filter is a space efficient probabilistic data structure that can give two possible answers -

Element is definitely not present
Element is probably present

Traditional approaches for lookups use hash sets. This requires massive memory if we are to store millions or billions of records. Bloom filters can store compact representations of the values although they are prone to false positives (will say an element exists when it does not). False negatives are impossible (will say an element does not exist when it does).

How bloom filters work?

A bloom filter is essentially an array of m bits with k independent hash functions. Lets understand with an example -

Suppose we want to store the username “dodo”.

The bit array is initialised to all 0s at first. Assume m = 10 and k = 3

Step1: Compute hash of the username.

Step2: Set the above positions to 1 in the bit array.

Try to insert the username “momo”.

Step1: Compute hash.

Step2: Set the above positions to 1 in the bit array.

Now we query an username - “panda”.

Step1: Compute hash

Step2: Check the corresponding positions in the bit array.

Since position _ is set to 0, “panda” is definitely not present.

Alternatively if all checked positions are set to 1, then the element is probably present. It might produce false positives in case the bit positions are set by other words and not the username we query. Hence, it is important to choose the values of m and k efficiently to reduce the false positive rate. If the size of the filter is too small or number of elements to be inserted is larger then the rate of false positive will increase rapidly. A properly sized Bloom Filter can achieve false positive rates below 1%.

You cannot delete an element from a standard bloom filter as multiple elements can share the same bit and deleting one will impact other elements as well.

There are many special forms of standard bloom filters to mitigate such problems.

Method	Working
Counting Bloom filters	Store counts rather than bits. Supports safe deletion(increments and decrements of the counts) but requires more storage memory.
Scalable Bloom Filters	Create a larger bloom filter if the existing one fills up. Redistribution of values is impossible of the existing filter so one has to check all filters before performing an operation on the new one.
Cuckoo Filters	Most modern systems use cuckoo filters. It provides lower false positive rates, supports deletion and use comparable memory.

Questions worth exploring:

How to mathematically derive the optimal value for m and k?
How does cuckoo filters work and why is it preferred by modern systems?
How to distributed systems synchronise bloom filters across all nodes?

Bloom Filters in Distributed Systems: Speed Through Probability

Comments

Tech Notes

Progressive Disclosure vs. Monolithic Skills: A Real-World Benchmark of How to Structure LLM Instructions

More from this blog

Through her Hair

Notes on The Visual MBA

Progressive Disclosure vs. Monolithic Skills: A Real-World Benchmark of How to Structure LLM Instructions

Adaptive-Intent-Driven Development: Why I Stopped Writing Specs

Command Palette

Comments

Tech Notes

Progressive Disclosure vs. Monolithic Skills: A Real-World Benchmark of How to Structure LLM Instructions

More from this blog