A bloom filter is a probabilistic data structure that can make SQL queries execute orders of magnitude faster. Today I want to tell you how we use them in Floe, and how we made them produce 2x fewer false positives.
Feel free to skip this section if you already know how bloom filters work.
A bloom filter is a probabilistic data structure that answers one question: "Is this element definitely not in the set?" It can give false positives (says yes when the answer is no), but never false negatives (it won't miss elements that are actually there). The main benefit is that a well-designed bloom filter can be really fast - a few CPU cycles per lookup. That's faster than a single function call.
The structure: An array of m bits, all initially set to 0.
Insertion: To add an element, we hash it with k independent hash functions and set the resulting k bit positions (each hash taken modulo m) to 1.
Lookup: To check if an element exists, we compute the same k bit positions and test them: if any bit is 0, the element is definitely not in the set; if all k bits are 1, the element might be in it.
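Here is a minimal textbook sketch in C++ to make those two operations concrete. It's illustration only, not our engine's implementation (that comes later in this post); deriving the k positions from two hash values is the standard Kirsch-Mitzenmacher double-hashing trick.

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal textbook bloom filter - a sketch, not production code.
struct TextbookBloom {
    std::vector<uint64_t> words;  // m bits, packed into 64-bit words
    std::size_t m;                // number of bits
    int k;                        // number of hash functions

    TextbookBloom(std::size_t mBits, int kHashes)
        : words((mBits + 63) / 64), m(mBits), k(kHashes) {}

    std::size_t pos(uint64_t h1, uint64_t h2, int i) const {
        return (h1 + static_cast<uint64_t>(i) * h2) % m;  // i-th derived bit position
    }

    // Insertion: set all k bit positions to 1.
    void put(uint64_t h1, uint64_t h2) {
        for (int i = 0; i < k; ++i) {
            std::size_t p = pos(h1, h2, i);
            words[p / 64] |= 1ull << (p % 64);
        }
    }

    // Lookup: any unset bit means "definitely not in the set".
    bool contains(uint64_t h1, uint64_t h2) const {
        for (int i = 0; i < k; ++i) {
            std::size_t p = pos(h1, h2, i);
            if (!(words[p / 64] & (1ull << (p % 64)))) return false;
        }
        return true;  // all k bits set: possibly in the set
    }
};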
Here we're discussing bloom filters in the context of database engineering. If you're not familiar with how databases join tables - here's a quick primer: a hash join matches rows from two tables. First, it loads the smaller table into a hash table (that's the build side). Then it scans the larger table row by row, looking up each value in the hash table to find matches (that's the probe side). Most of the work happens on the probe side, because the larger table can have billions of rows.
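If that primer is easier to read as code, here's a schematic hash join; the Row type and the function shape are invented for illustration.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Row { int64_t key; /* ... payload columns ... */ };

std::vector<std::pair<Row, Row>> hashJoin(const std::vector<Row>& build,   // smaller table
                                          const std::vector<Row>& probe) { // larger table
    // Build side: load the smaller table into a hash table.
    std::unordered_multimap<int64_t, Row> ht;
    for (const Row& r : build) ht.emplace(r.key, r);

    // Probe side: scan the larger table row by row, look up each key.
    std::vector<std::pair<Row, Row>> out;
    for (const Row& r : probe) {
        auto range = ht.equal_range(r.key);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(it->second, r);
    }
    return out;
}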
When processing millions of rows we want to avoid all the extra work we can. Don't decompress data you won't use. Don't probe hash tables for keys that don't exist. Discard rows as soon as you can - it's called being efficient (not lazy!)
We use bloom filters in two critical places: pushed down into the storage engine scan, to discard rows before full decompression, and inside the hash join itself, to skip hash table lookups at probe time.
Let's imagine a situation where we join two tables and only 1% of 10 billion probe-side rows will be matched. Without filtering we would need to decompress and probe 99% of those rows just to discard them.
What we do instead (the adaptive part is sketched in code after this list):
Build phase: during the build phase we populate the bloom filter with the hashes of the build-side keys.
Pushdown: after the build phase is complete, we push the bloom filter - which at this point is read-only - down to the storage engine.
First-pass filtering: The storage engine decompresses only the columns needed for bloom filtering. It checks each value against the bloom filter and marks rows whose values definitely do not match the build side, so they can be discarded before the remaining columns are decompressed.
Adaptive behaviour: Here it gets interesting. We keep statistics on how many rows we skip. If we end up discarding almost no rows, we don't bother with first-pass filtering and disable it. But we keep checking decompressed rows so we can re-enable filtering if the stats improve.
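Here's a rough sketch of that adaptive toggle. All names, the window size, and the threshold are invented for illustration; the engine's actual policy differs.

#include <cstdint>

// Hedged sketch of adaptive first-pass filtering.
struct AdaptiveFilterStats {
    uint64_t checked = 0;   // rows tested against the bloom filter
    uint64_t skipped = 0;   // rows the filter let us discard
    bool enabled = true;    // is first-pass filtering currently on?

    void record(bool rowSkipped) {
        ++checked;
        if (rowSkipped) ++skipped;
        if (checked == 100000) {                   // evaluate once per window
            double skipRate = double(skipped) / double(checked);
            enabled = skipRate > 0.05;             // keep filtering only if it pays off
            checked = skipped = 0;                 // fresh window: allows re-enabling later
        }
    }
};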
Rough estimate, with the numbers above (10 billion probe-side rows, 1% true matches, and a filter that lets roughly 10% of rows through):
Without bloom filtering: all 10 billion rows are decompressed and probed.
With bloom filtering: the first pass decompresses only the join-key columns, and only ~1.1 billion rows (true matches plus false positives) survive to full decompression and probing.
That's a huge 9x reduction in scan and I/O!
Why do we need to keep the filtering adaptive? Because sometimes bloom doesn't help: when most probe-side rows actually match the build side, almost nothing gets discarded and the first-pass checks are pure overhead; and when the filter is oversaturated, its false positive rate is too high to discard much of anything.
For hash joins we use a simpler, almost textbook-style bloom filter: insert values into it during the build phase, then read it at probe time before probing the hash buckets.
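In code, that probe-time check looks roughly like the sketch below; the function, its types, and the hash choice are illustrative, not the engine's API.

#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Sketch of a probe loop guarded by a bloom filter. `Bloom` is any type
// with contains(hash); everything here is named for illustration only.
template <typename Bloom>
uint64_t probeWithBloom(const std::unordered_multimap<uint64_t, uint64_t>& ht,
                        const std::vector<uint64_t>& probeKeys,
                        const Bloom& bloom) {
    uint64_t matches = 0;
    for (uint64_t key : probeKeys) {
        uint64_t h = std::hash<uint64_t>{}(key);
        if (!bloom.contains(h)) continue;  // definitely absent: skip the bucket probe
        matches += ht.count(key);          // only survivors touch the hash buckets
    }
    return matches;
}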
We landed on a fixed 256KB bloom filter per join as the sweet spot between size and efficiency. Go bigger - you waste memory and overflow the L2/L3 cache (cache misses hurt). Go smaller - you might as well flip a coin.
Why fixed size? Predictable performance. No dynamic allocation. Compiler can optimize the hell out of it. Lock-free access. The last one is especially critical when we're talking about a concurrent performance-first database engine.
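Concretely, a fixed 256KB filter is just 65,536 uint32 words sized at compile time. A hedged sketch of the shape (names invented, not our actual declarations):

#include <cstdint>

// 65,536 uint32 words = 256 KB = 2^21 bits, all known at compile time:
// no allocation, and an atomic OR per write keeps access lock-free.
constexpr uint32_t kIdxBits = 16;              // 2^16 words
constexpr uint32_t kWords   = 1u << kIdxBits;  // 65,536

struct FixedBloom {
    alignas(64) uint32_t bits[kWords] = {};    // cache-line aligned, zeroed
};

static_assert(sizeof(FixedBloom) == 256 * 1024, "exactly 256 KB");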
All of the above works well only if the bloom filter is actually useful and doesn't lie too often; if it does, it's useless. In our engine we measure bloom filter quality with a simple threshold on the number of bits set. Why does that matter? To understand, we need to dive deeper into the theory and look at the false positive rate of a bloom filter.
Why false positives? As we insert more elements (n), more bits get set to 1. Eventually, random elements will have all their k bits set to 1 by pure chance - even though they were never inserted. That's a false positive.
The occupancy problem: As we insert more elements, more bits get set to 1 and the filter gets saturated. For our single-hash (k=1) approach, that means the false positive rate climbs quickly - up to 10% and above - that's way too high!
There's a well-known equation for false positive rate:
(1 - e^(-kn/m))^k
Where:
k = number of hash functions
m = number of bits in the filter
n = number of inserted elements
You could just trust the formula. Or you could derive it in 30 seconds and actually understand why bloom filters break down:
The intuition: Every time we insert an element, we flip some bits from 0 to 1. Eventually, so many bits are set that random elements look like they were inserted - even though they weren't.
The math: with one insertion, a given bit stays 0 with probability (1 - 1/m). After inserting n elements with k hash functions each, it stays 0 with probability (1 - 1/m)^(kn) ≈ e^(-kn/m). So any given bit is 1 with probability 1 - e^(-kn/m), and a false positive - all k probed bits set - happens with probability (1 - e^(-kn/m))^k.
Plug in our numbers (k=1, n=256K elements, m=2M bits): FP = 1 - e^(-0.125) ≈ 11.5%
Here is a really nice interactive tool where you can play around with different parameters of bloom filters to see how they scale: Bloom Filter Calculator
Our old implementation:
static inline uint32_t idx32(HashKey32 h, uint32_t mask) { return (h >> 5) & mask; }  // word index
static inline uint32_t mask32(HashKey32 h) { return 1u << (h & 31u); }                // low 5 bits: bit position

inline void put(HashKey32 h) {
    const uint32_t m = mask32(h);
    __sync_fetch_and_or(&bits[idx32(h, idx32Mask)], m);  // atomic, lock-free set
}

inline bool contains(HashKey32 h) const {
    const uint32_t m = mask32(h);
    return (bits[idx32(h, idx32Mask)] & m) != 0;         // single bit check (k=1)
}
That implementation is simple, and it works. But it is way too simple - with a single bit per element (k=1) it lies too often. Let's look for something better:
Goal: find something that's still fast as hell, but lies to us less often.
We started experimenting with ideas to see how they perform:
Naive approach: set two bits using two independent hash functions - terrible: two independent positions mean two random memory accesses, and potentially two cache misses per operation
Alternative 1: store two bits in the same cache line - better, but still bad
Alternative 2: split uint32 into halves. Use lower 16 bits for first bit position, upper 16 bits for 2nd bit position - better, we are getting there
Best solution: two bits, one uint32
Here's the insight: both bits live in the same uint32 variable. We use a single 32-bit hash value to compute: the index of the uint32 word in the filter (16 bits, since a 256KB filter holds 2^16 words), the first bit position within that word (5 bits), and the second bit position (another 5 bits).
That's 26 bits total, leaving us with 6 unused bits.
Why this is beneficial: one hash computation gives us everything, both bits land in the same word, so there is a single memory access (at most one cache miss), and a single atomic OR sets both bits at once.
There is a minor trade-off: the two bits are not truly independent anymore, which slightly increases collision probability. But the performance gain from a single memory access crushes the theoretical disadvantage.
The new code is nearly identical in structure - just one extra bit in the mask. We shift and mask by 5 bits because uint32_t has 32 bit positions (2^5).
static constexpr uint32_t IDX_BITS  = 16;  // low 16 bits of the hash: word index (2^16 words = 256 KB)
static constexpr uint32_t MASK_5BIT = 31u; // selects a bit position 0..31

static inline uint32_t uint32Idx(HashKey32 h) { return h & ((1u << IDX_BITS) - 1u); }
static inline uint32_t bitLoc1(HashKey32 h) { return (h >> IDX_BITS) & MASK_5BIT; }        // first 5-bit offset (0..31)
static inline uint32_t bitLoc2(HashKey32 h) { return (h >> (IDX_BITS + 5)) & MASK_5BIT; }  // next 5-bit offset (second bit)

inline void put(HashKey32 h) {
    const uint32_t idx = uint32Idx(h);
    const uint32_t mask = (1u << bitLoc1(h)) | (1u << bitLoc2(h));  // both bits in one mask
    __sync_fetch_and_or(mBuf + idx, mask);                          // one atomic OR, one cache line
}

inline bool contains(HashKey32 h) const {
    const uint32_t data = mBuf[uint32Idx(h)];                       // single memory access
    const uint32_t mask = (1u << bitLoc1(h)) | (1u << bitLoc2(h));
    return (data & mask) == mask;                                   // both bits must be set
}
The performance hit? Negligible. From our benchmarking:
put(): 9.12 → 9.70 cyc/op (+6%)
contains(): 1.97 → 3.16 cyc/op (+60% in percentage terms, +1.2 cycles in reality)
For context: even the "slower" version executes faster than a function call or a branch misprediction. We're talking about a nanosecond.
The win? Massive.
Let me spell it out: we spend one extra nanosecond per row to avoid reading dozens of extra gigabytes.
I'll take that trade every single time.
Two bits in one uint32 gave us 2x better bloom filter accuracy at essentially zero cost: with the same n=256K and m=2M bits, the formula gives FP ≈ 5% for k=2, down from ~11.5% for k=1.
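If you'd rather check those numbers than trust my arithmetic, here is a tiny standalone program that evaluates the formula for both k values (nothing engine-specific in it):

#include <cmath>
#include <cstdio>

// Evaluate the false positive formula (1 - e^(-kn/m))^k for our
// parameters: n = 256K inserted elements, m = 256 KB worth of bits.
int main() {
    const double n = 256.0 * 1024;        // inserted elements
    const double m = 256.0 * 1024 * 8;    // bits in a 256 KB filter
    for (int k = 1; k <= 2; ++k) {
        double fp = std::pow(1.0 - std::exp(-k * n / m), k);
        std::printf("k=%d: FP = %.2f%%\n", k, 100.0 * fp);
    }
    // Prints roughly: k=1: FP = 11.75%   k=2: FP = 4.89%
}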
Adaptive filtering at the storage engine layer saves even more, allowing us to completely avoid decompressing rows that will not be needed. But because the first-pass decompression is still costly, optimizing this code path is less trivial, and we didn't touch it at the time. When working on Floe, we will definitely apply some of this gained knowledge to smarter pushdowns.
Why stop at two bits? Fair question. You can check the scaling yourself with the same interactive tool: Bloom Filter Calculator
The beauty of 2 bits: they fit in a single uint32 with minimal collisions.
It is the sweet spot between "too simple" and "too complex".
"But what about Cuckoo filters or XOR filters?"
Great structures! But they require dynamic resizing or more complex addressing. We wanted: predictable performance, no dynamic allocation, and lock-free access.
Two bits in a fixed bloom filter gave us all three.
We also wrote a version that checks 8 elements at a time using SIMD instructions. But that is a story for another day.