A Bloom Filter is a probabilistic data structure with a build in method allowing for efficient testing for the presence of an element in some given set. Please refer to Wikipedia for further details.

The two basic operations on a Bloom Filter are

1. insertion: hash the value to be inserted k times and update a bit array at the index equal to each hashed value.
2. querying: hash the value k times and in case of k collisions, then return true; otherwise false.

Querying a Bloom Filter is very fast but speed comes with a price. Because the underlying implementation depends on hash function(s) there is a chance of false positive answer when checking if a certain element is present in the Bloom Filter or not. The algorithm guarantees that no false negatives answers are given when querying. In other words the Bloom Filter might indicate that some element is present in the set even if it is not. However if one query for membership of some element known not to be in the set the answer will always be that the given element is not present in the set.

# Challenge and approach

Say you want to join two tables A and B based on some field x and that the number of records in A is considerable larger than the number of records in B. In the book Data Algorithms by Mahmoud Parsian the first approach to solve this is simple by brute force illustrated by this pseudo-code:

A variant of this approach is to sort the tables A and B by x before joining them. Such a method is described here.

Both approaches can be improved considerable by incorporating the use of a Bloom Filter. The idea is simple to build a Bloom Filter based on the smaller set B and then use the filter to rule out all records of A not present B.

# Implementation

The given F# implementation is inspired by the Java implementation by Ian Clarke. The underlying MapReduce function is very simple and do NOT represent a realistic example of a proper implementation of the MapReduce framework. There is just one mapper and one reducer. A real world working implementation would definitely include some degree of parallelism. The point here is not to build a MapReduce framework in F# but merely to show the performance improvement one can gain from using Bloom Filter for joining two tables.

Source code is available at GitHub.

# FsCheck

Given a desired false negative probability rate $p$ and the number of elements $n$ being filtered one can determine the length of the bit array by the following formula:

From this formula we see that the following relation holds for all probability rates $p1$ and $p2$ with $p1 < p2$:

length of bit array based on p1 < length of bit array based on p2

Instead of writing just a couple of unit tests verifying this property we can use the test framework FsCheck (inspired by the original Haskell version called QuickCheck). FsCheck cannot prove that the property holds for all lists but it can automatically generate a large number of lists and verify that the relation holds for all these lists. Please see the source code for details.

# Results

In the table below we see the results from simulating the join of two tables in memory. The larger table contains the number of elements given in the column “Number of elements” and the smaller table contains 1/100 elements of the larger table.

Run times are shown in milliseconds for the two cases: brute force and Bloom Filter. We observe a significant performance improvement when using the Bloom Filter approach for larger tables.

Number of elements Without BF With BF Improvement
10000 54 21 61,1%
25000 297 44 85,2%
50000 1192 94 92,1%
100000 4778 213 95,5%
1000000 484404 8446 98,3%

My computer is a MacBook Pro with a Intel Core i5 2,6 GHz processor and 16 GB of memory.

# Alternative language implementations

Bloom Filter implementations in other programming languages and frameworks

• Algebird from Twitter in Scala. Contains lots of other interesting things as well.
• Java. Alternative Java implementation of Ian Clarkes original implementation.
• Scala. Also inspired by Ian Clarke.