By Ananth Kumar Ganesna

**Introduction:**

A Bloom filter is a very compact data structure that supports approximate membership queries on a set, allowing false positives.

The term Bloom filter names a data structure that supports membership queries on a set of elements. It was introduced by Burton Bloom [1970]. It differs from ordinary dictionary data structures, as the result of a membership query might be “true” although the element is not actually contained in the set. Since the data structure is randomized by using hash functions, reporting a false positive occurs with a certain probability, called the false positive rate (FPR).

**Use cases:**

Cache is important to handle CPU intensive operations to store and quickly access the the result of such operations.

Sometimes the due to changes in IO, you do not want to hit the database repeatedly and would prefer to cache the results and update the cache only with underlying data changes.

Similarly there are other use cases where we need to perform a quick look up to decide what to do with an incoming request. For example, consider the use case where you have to identify that an URL points to a malware site or not. There could be many URLs like that, to do it in one instance, if we cache all the malware URLs in memory, that would require a lot of space to hold them.

Another use case could be to identify if a user typed string has any reference to a place in USA. Like “museum in Washington” – in this string, Washington is a name of a place in USA. Should we keep all the places in USA in memory and then lookup? How big the cache size would be? Is it effective to do it without any database support?

This is where we need to move away from basic map data structure and look for answers in more advanced data structure like Bloom Filter.

You can consider bloom filter, like any other java collection where you can put items in it and ask it whether an item already present in it or not (like a HashSet). If Bloomfilter mentions that it does not contain the item, then definitely that item is not present. But if it mentions that it has seen the item, then that may be wrong ( **False positive** ). If we are careful enough, we can design a bloom filter such that the probability of the wrong is controlled.

**Understanding the Error in evaluation process before implementing Bloom filter .**

A **false positive error**, or in short **false positive**, commonly called a “**false alarm**“, is a result that indicates a given condition has been fulfilled, when in fact has not been fulfilled.

**Example:** In the case of “crying wolf” – the condition tested for was “is there a wolf near the herd?”, the actual result was that there had not been a wolf near the herd. The shepherd wrongly indicated there was one, by calling “Wolf, wolf!”.

A **false negative error**, or in short **false negative**, is where a test result indicates that a condition failed, while in fact it was successful.

**Example:** A guilty prisoner freed from jail. The condition: “*Is the prisoner guilty?*” is true ( yes, the prisoner is guilty). But the test (a court of law) failed to realize this, and wrongly decided the prisoner was not guilty. ( A false positive in scientific research suggests an effect that is not actually there, while a false negative fails to detect an effect that exists there. )

**Bloom filter implementation in Java:**

A Bloom filter is an implementation of a set which allows a certain probability of **false positives** when determining if a given object is a member of the set, in return for a reduction in the amount of memory required for the set.

It effectively works as follows:

- Allocate
to represent the set data;*m*bits - Write a hash function that, instead of a single hash code, produces
for a given object;*k*hash codes - To add an object to the set, derive
**bit indexes from all**and set those bits;*k*hash codes - To determine if an object is in the set, calculate again the corresponding hash codes and bit indexes, and say that it is present if and only if
.*all*corresponding bits are set

The margin of error (or *false positive rate*) thus comes from the fact that as we add more and more objects to the set, we increase the likelihood of “accidentally” setting a combination of bits that correspond to an object that isn’t actually in the set.

Since the data structure is randomized by using hash functions, reporting a false positive occurs with a certain probability, called the false positive rate (FPR) .The advantage of a Bloom filter over the established dictionary structures is space efficiency.

**To add item:**

In order to add any item, it needs to be feed through k hash functions. Each hash function will generate a number which can be treated as a position of the bit array (hash modulo array length can give us the index of the array) and we shall set that value of that position to 1.

For example – first hash function on item 1 produce a bit position x, similarly second hash function produce position y

So we shall set: S[x]=S[y]=1

**To find item:**

Similar process will be repeated, item will be hashed two times through * k* hash functions. Each hash function will produce an integer which will be treated as a position of the array. We shall inspect those

**x,y**positions of the bit array and see if they are set to 1 or not. If no, for sure no one ever tried to add this item into bloom filter, but if all the bits are set, it could be a false positive.

**Java Example:**

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
/* * Java Program to Implement Bloom Filter */ import java.util.*; import java.security.*; import java.math.*; import java.nio.*; /* Class BloomFilter */ class BloomFilter { private byte[] set; private int keySize, setSize, size; private MessageDigest md; /* Constructor */ public BloomFilter(int capacity, int k) { setSize = capacity; set = new byte[setSize]; keySize = k; size = 0; try { md = MessageDigest.getInstance("MD5"); } catch (NoSuchAlgorithmException e) { throw new IllegalArgumentException("Error : MD5 Hash not found"); } } public int getSize() { return size; } /* Function to get hash - MD5 */ private int getHash(int i) { md.reset(); byte[] bytes = ByteBuffer.allocate(4).putInt(i).array(); md.update(bytes, 0, bytes.length); return Math.abs(new BigInteger(1, md.digest()).intValue()) % (set.length - 1); } /* Function to add an object */ public void add(Object obj) { int[] tmpset = getSetArray(obj); for (int i : tmpset) set[i] = 1; size++; } /* Function to check is an object is present */ public boolean contains(Object obj) { int[] tmpset = getSetArray(obj); for (int i : tmpset){ if (set[i] != 1) return false; } return true; } /* Function to get set array for an object */ private int[] getSetArray(Object obj) { int[] tmpset = new int[keySize]; tmpset[0] = getHash(obj.hashCode()); for (int i = 1; i < keySize; i++){ tmpset[i] = (getHash(tmpset[i - 1])); } return tmpset; } } public class BloomFilterTest { public static void main(String[] args) { Scanner scan = new Scanner(System.in); System.out.println("Bloom Filter Test\n"); //Set capacity and key size"); BloomFilter bf = new BloomFilter(5 , 2); char ch; System.out.println("Enter integer element to insert"); bf.add( new Integer(scan.nextInt()) ); System.out.println("Enter integer element to search"); System.out.println("Search result : "+ bf.contains( new Integer(scan.nextInt()) )); System.out.println("\nSize = "+ bf.getSize() ); } } |

From the above explanation, it becomes clear that to design a good bloom filter we need to keep track of the following things:

• Boom Filter testing is effective for false negative

• Good hash functions that can generate wide range of hash values as quickly as possible

• The value of ** m** (size of the bit array) is very important. If the size is too small, all the bits will be set to 1 quickly and false positives will grow large.

• Number of hash functions(

*) is also important so that the values get distributed evenly.*

**k**Formula to determine m (number of bits for the bloom filter) is as below:

m = -nlogp / (log2)^2;

where p = desired false positive probability

Formula to determine k (number of hash functions) is as below:

k = m/n log(2) ;

where k = number of hash functions, m=number of bits and n= number of items in the filter