Bloom Filters – Implementation in Java

By  Ananth Kumar Ganesna

Introduction:

A Bloom filter is a very compact data structure that supports approximate membership queries on a set, allowing false positives.

The term Bloom filter names a data structure that supports membership queries on a set of elements. It was introduced by Burton Bloom [1970]. It differs from ordinary dictionary data structures, as the result of a membership query might be “true” although the element is not actually contained in the set. Since the data structure is randomized by using hash functions, reporting a false positive occurs with a certain probability, called the false positive rate (FPR).

Use cases:

Cache is important to handle CPU intensive operations to store and quickly access the the result  of such operations.

Sometimes the due to changes  in IO, you do not want to hit the database repeatedly and would prefer  to cache the results and update the cache only with underlying data changes.

Similarly there are other use cases where we need to perform a quick look up to decide what to do with an incoming request. For example, consider the use case where you have to identify that an URL points to a malware site or not. There could be many URLs like that, to do it in one instance, if we cache all the malware URLs in memory, that would require a lot of space to hold them.

Another use case could be to identify if a user typed string has any reference to a place in USA. Like “museum in Washington” – in this string, Washington is a name of a place in USA. Should we keep all the places in USA in memory and then lookup? How big the cache size would be? Is it effective to do it without any database support?

This is where we need to move away from basic map data structure and look for answers in more advanced data structure like Bloom Filter.

You can consider bloom filter, like any other java collection where you can put items in it and ask it whether an item already present in it or not (like a HashSet). If Bloomfilter mentions that it does not contain the item, then definitely that item is not present. But if it mentions that it has seen the item, then that may be wrong ( False positive ). If we are careful enough, we can design a bloom filter such that the probability of the wrong is controlled. (more…)

Read More