Bloom Filters – Implementation in Java

By  Ananth Kumar Ganesna

Introduction:

A Bloom filter is a very compact data structure that supports approximate membership queries on a set, allowing false positives.

The term Bloom filter names a data structure that supports membership queries on a set of elements. It was introduced by Burton Bloom [1970]. It differs from ordinary dictionary data structures, as the result of a membership query might be “true” although the element is not actually contained in the set. Since the data structure is randomized by using hash functions, reporting a false positive occurs with a certain probability, called the false positive rate (FPR).

Use cases:

Cache is important to handle CPU intensive operations to store and quickly access the the result  of such operations.

Sometimes the due to changes  in IO, you do not want to hit the database repeatedly and would prefer  to cache the results and update the cache only with underlying data changes.

Similarly there are other use cases where we need to perform a quick look up to decide what to do with an incoming request. For example, consider the use case where you have to identify that an URL points to a malware site or not. There could be many URLs like that, to do it in one instance, if we cache all the malware URLs in memory, that would require a lot of space to hold them.

Another use case could be to identify if a user typed string has any reference to a place in USA. Like “museum in Washington” – in this string, Washington is a name of a place in USA. Should we keep all the places in USA in memory and then lookup? How big the cache size would be? Is it effective to do it without any database support?

This is where we need to move away from basic map data structure and look for answers in more advanced data structure like Bloom Filter.

You can consider bloom filter, like any other java collection where you can put items in it and ask it whether an item already present in it or not (like a HashSet). If Bloomfilter mentions that it does not contain the item, then definitely that item is not present. But if it mentions that it has seen the item, then that may be wrong ( False positive ). If we are careful enough, we can design a bloom filter such that the probability of the wrong is controlled.

Understanding the Error in evaluation process before implementing  Bloom filter .

false positive error, or in short false positive, commonly called a “false alarm“, is a result that indicates a given condition has been fulfilled, when in fact has not been fulfilled.

Example: In the case of  “crying wolf” – the condition tested for was “is there a wolf near the herd?”, the actual result was that there had not been a wolf near the herd. The shepherd wrongly indicated there was one, by calling “Wolf, wolf!”.

false negative error, or in short false negative, is where a test result indicates that a condition failed, while in fact it was successful.

Example: A guilty prisoner freed from jail. The condition: “Is the prisoner guilty?” is true ( yes, the prisoner is guilty). But the test (a court of law) failed to realize this, and wrongly decided the prisoner was not guilty. ( A false positive in scientific research suggests an effect that is not actually there, while a false negative fails to detect an effect that exists there. )

Bloom filter implementation in Java:

A Bloom filter is an implementation of a set which allows a certain probability of false positives when determining if a given object is a member of the set, in return for a reduction in the amount of memory required for the set.

It effectively works as follows:

  • Allocate m bits to represent the set data;
  • Write a hash function that, instead of a single hash code, produces k hash codes for a given object;
  • To add an object to the set, derive bit indexes from all k hash codes and set those bits;
  • To determine if an object is in the set, calculate again the corresponding hash codes and bit indexes, and say that it is present if and only if all corresponding bits are set.

The margin of error (or false positive rate) thus comes from the fact that as we add more and more objects to the set, we increase the likelihood of “accidentally” setting a combination of bits that correspond to an object that isn’t actually in the set.

Since the data structure is randomized by using hash functions, reporting a false positive occurs with a certain probability, called the false positive rate (FPR) .The advantage of a Bloom filter over the established dictionary structures is space efficiency.

To add item:

In order to add any item, it needs to be feed through k hash functions. Each hash function will generate a number which can be treated as a position of the bit array (hash modulo array length can give us the index of the array) and we shall set that value of that position to 1.

For example – first hash function on item 1 produce a bit position x, similarly second hash function produce position y

So we shall set: S[x]=S[y]=1

To find item:

Similar process will be repeated, item will be hashed two times through k hash functions. Each hash function will produce an integer which will be treated as a position of the array. We shall inspect those x,y positions of the bit array and see if they are set to 1 or not. If no, for sure no one ever tried to add this item into bloom filter, but if all the bits are set, it could be a false positive.

Java Example:

From the above explanation, it becomes clear that to design a good bloom filter we need to keep track of the following things:
• Boom Filter testing is effective for false negative
• Good hash functions that can generate wide range of hash values as quickly as possible
• The value of  m  (size of the bit array) is very important. If the size is too small, all the bits will be set to 1 quickly and false positives will grow large.
• Number of hash functions(k) is also important so that the values get distributed evenly.

Formula to determine m (number of bits for the bloom filter) is as below:

m =  -nlogp / (log2)^2;
where p = desired false positive probability

Formula to determine k (number of hash functions) is as below:

k = m/n log(2) ;
where k = number of hash functions, m=number of bits and n= number of items in the filter