0:00

So, in this video, we're going to discuss Bloom filters, a data structure developed, appropriately enough, by Burton Bloom back in 1970. Bloom filters are a variant of hash tables, and you'll recognize a lot of the ideas from our hash table discussion. The win you get with Bloom filters is that they are more space efficient than run-of-the-mill hash tables, but they do allow for errors: there is a nonzero false positive probability when you do lookups. That trade-off is still a win for some applications. So, it's a very cool idea and a very cool data structure, and you see it used quite a bit in practice, so let's start talking about it.

We'll go through the usual topics that we cover whenever we discuss a new data structure. First, I want to tell you what operations Bloom filters support and what kind of performance you should expect from those operations; in other words, what is the API of the data structure. Secondly, I'll talk a little bit about what they're good for, that is, some potential applications. And then we'll take a peek under the hood.

I'll tell you some of the implementation details, with an emphasis on explaining why you get the kinds of performance trade-offs that you do with Bloom filters.

To a first order, the raison d'être of a Bloom filter is exactly the same as that of a hash table: it supports super fast inserts and super fast lookups. You can put stuff in there, and you can remember what you put in earlier. Now, of course, what you should be wondering is: we already know a data structure that supports super fast inserts and lookups, namely the hash table. Why am I bothering to tell you about yet another data structure with exactly those same operations? So, let me tell you about the pros and cons of Bloom filters relative to the run-of-the-mill hash tables we've already discussed.

The big win is that Bloom filters are more space efficient than hash tables, no matter whether the hash tables are implemented with chaining or with open addressing. You use much less space per object; in fact, as we'll see, less space than the size of an object itself.

As far as the cons, well, first of all, this is really for applications where you just want to remember which values you have seen. You are not trying to store pointers to the objects themselves; you are just trying to remember values. So, the first drawback of the Bloom filter is that, because we want to be so space efficient, we don't even remember the object itself, just whether or not we've seen it before.

We're not going to be able to store the objects, or even pointers to the objects, in a Bloom filter; we're just going to remember what we've seen and what we haven't. Some of you might know the terminology "hash set" for this kind of variant of a hash table, as opposed to a full-blown hash table or hash map.

The second con is that, at least in the vanilla implementation of Bloom filters that I'm going to describe here, deletions are not allowed. You can only insert; you can't delete. The situation with deletions is very similar to hash tables implemented with open addressing. It's not that you can't have a Bloom filter that accommodates deletions, you can, and there are variants that do, but that requires significantly more work, and we're not going to discuss it here. So, to a first order, at least for vanilla Bloom filters, you want to think of them as suitable for applications where deletions are not a first-order operation.

Now, the third con, and this is a drawback that we have not seen previously in any of our data structures, is that Bloom filters can actually make mistakes. Now, what kind of mistake could this kind of data structure possibly make, when all you're really doing is looking something up? Well, one kind of mistake would be a false negative: you have inserted something previously, then you look it up, and the Bloom filter says it's not there. Bloom filters will not have false negatives of this form. If you insert something and look it up later, the filter is definitely going to confirm that you inserted it in the past. But Bloom filters will have false positives. That means that despite the fact that you have never inserted, say, a given IP address into the Bloom filter, if you look it up later, it may say that you have. So, there will sometimes be, in some sense, phantom objects in Bloom filters: objects which the filter thinks have been inserted even though they haven't been.

So, given that, I am now showing you two data structures with essentially the same functionality, hash tables and Bloom filters, at least if we ignore the deletion issue. You might wonder which one is more appropriate, which one is more useful. And because there are these trade-offs between the two, the answer, as you'd expect, is that it depends on the application. If it's an application where space is really at a premium, you might want to turn to Bloom filters, especially if a small chance of a false positive is not a deal breaker. If you have some kind of application where false positives are absolutely out of the question, then of course you should not use a Bloom filter, and you want to think about a hash table.

So, what are some situations where people actually do use Bloom filters, where you either really care about space and/or don't really care about a small false positive probability? One of the earliest applications of Bloom filters, and this is now a long time ago, something like 40 years, was spell checkers. How would you implement a spell checker using a Bloom filter? Well, first you have an insert phase, where you basically just go through the entire dictionary word by word and insert every valid word into the Bloom filter.
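This spell-checker pattern can be sketched in a few lines of Python. Everything here is an illustrative assumption, not anything fixed by the lecture: the toy dictionary, the bit-array size, and the trick of getting k "hash functions" by salting Python's built-in `hash()`.

```python
K = 3            # number of hash functions
N_BITS = 64      # size of the bit array (tiny, for illustration)
bits = [False] * N_BITS

def indices(word):
    # k "hash functions" obtained by salting one hash function.
    return [hash((salt, word)) % N_BITS for salt in range(K)]

# Insert phase: mark the bits for every valid dictionary word.
dictionary = ["algorithm", "bloom", "filter", "hash", "table"]
for word in dictionary:
    for i in indices(word):
        bits[i] = True

# Check phase: a word whose bits are not all set is flagged as misspelled.
def looks_valid(word):
    return all(bits[i] for i in indices(word))

print(looks_valid("bloom"))   # True: inserted words are always found
print(looks_valid("blom"))    # usually False, but a false positive is possible
```

Note that the filter never stores the words themselves, only bit positions derived from them, which is exactly why it is so compact and why a misspelling can occasionally slip through.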

Then, afterwards, when you're presented with a new document that somebody has written, you go through the document word by word, and for each word you ask: is this in the Bloom filter? That is, is this one of the legitimate words from the dictionary that we previously inserted? If the Bloom filter says yes, this word is in the dictionary, as in we've stored and seen it before, then you treat it as a correctly spelled word; and if it's not in the Bloom filter, then you treat it as an incorrectly spelled word. Now, the false positive probability means this isn't a perfect spell checker. Sometimes you're going to look up a misspelled word and the Bloom filter won't catch it; with small probability, it will actually say yes, this is a legitimate word. So, it's not ideal, but the English language is pretty big, and space was definitely at a premium 40-plus years ago. So, it was a win for that application, at that time, to use a Bloom filter to implement a spell checker.

Another application which remains relevant today is keeping track of a list of forbidden passwords. Now, why would you have forbidden passwords? Well, maybe you want to keep track of passwords which are too weak, too easy to guess, or too common. You may yourself have used a piece of software or a website at some point where it asked you for a password and, if you typed in something too simple or too easy, rejected it and asked you to type in another one. One way to implement a list of forbidden passwords is with a Bloom filter, and the idea is similar to the spell checker. You first insert into the Bloom filter all of the passwords that you don't want anybody to use, for whatever reason. Then, when a client comes and tries to set a new password, you look it up in the Bloom filter, and if you get a positive lookup, you tell the user: no, that's no good, you can't use that password, choose another one. This is an application where you really don't care about the errors, about the fact that there's a false positive rate. Let's say the error rate is something like one percent, or 0.1%. What would that mean in context? It would just mean that once in a while, one in a hundred clients, or one in a thousand, types in a perfectly strong password that gets rejected by the Bloom filter and has to type in a second one. Okay, big deal. If space is at a premium, it's definitely a win to use this super lightweight data structure to keep track of the blocked passwords.

These days, certainly one of the killer applications of Bloom filters is in software deployed on network routers, that is, the machinery out in the Internet which is responsible for transmitting packets from one place to another. What are the reasons why Bloom filters have found fertile application in network routers? Well, first of all, you typically do have a budget on space on a network router.

There are a lot of things a router has to do, and you don't want to waste much memory on a data structure for one specific task. So, you have a budget on space, and you also need super, super fast data structures, right? Packets are coming in at a torrential rate which you can't even imagine, and you want to process them in real time, sending them off to the next hop. Bloom filters are the workhorse behind a lot of the different tasks done in a network router. You can imagine keeping track of blocked IP addresses; you can imagine keeping track of the contents of some cache so you don't do spurious lookups; you can imagine maintaining statistics to check for denial-of-service attacks, and so on and so forth.

So, summarizing, as an expert programmer, what is it that you should remember about Bloom filters? What purpose does this tool serve in your toolbox? Well, as far as the operations supported, which are the same as a hash table's, the point is to have super fast inserts and super fast lookups.

But Bloom filters are a more lightweight version of a hash table: they are more space efficient, but they do have the drawback of a small error probability. Those are the key features you should remember when deciding whether or not you are working on an application that could make good use of this data structure.

So, having discussed the supported operations and what these data structures are good for, let's take it to the next level. Let's peer under the hood and see how they are actually implemented, because this is a really quite simple, quite cool idea. Like hash tables, Bloom filters have essentially two ingredients.

First of all, there's an array, and second of all, there's a hash function, or in fact several hash functions. We're going to have a random access array, except that instead of having n buckets, or n slots as we've been calling them, each entry in this array is just a single bit. Each entry can only take on two values, zero or one. The way to think about the space occupied by a Bloom filter is in terms of the number of bits per object that has been inserted. So, if you have inserted the data set capital S, then the total number of bits is n, the number of objects that have been inserted is the cardinality of S, and n / |S| is the number of bits this data structure uses per entry of the data set. Now, you can tune a Bloom filter so that this ratio is any number of different quantities, but for now, I encourage you to think of this ratio as being eight. That is, for each object stored in the Bloom filter, you are using only eight bits of memory. That should help you appreciate just how amazing these data structures are, right? Maybe our data set is something like IP addresses, which are 32 bits each. So if this ratio is eight, we are definitely not actually storing the IP addresses: we have this 32-bit object that we are inserting, and we are only using eight bits of memory. That is how we are going to remember whether it's there or not. And certainly, eight bits per object is way less than keeping a pointer to some associated memory somewhere. So, this is a really impressive, minimal use of space to keep track of what we've seen and what we haven't.

Secondly, we need mappings: given an object, say a given IP address, what are the relevant bits for checking whether we've seen this IP address before or not? In a Bloom filter, it's important to have not one hash function but several. So, k is going to denote the number of hash functions in the Bloom filter, and you should think of k as some small constant, you know, three, four, five, or something like that.
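One cheap way to obtain k hash functions is to derive them all from just two base hash values, taking the i-th function to be h1 + i·h2 modulo the array size. A sketch, where the use of the two halves of a SHA-256 digest as the base hashes is an illustrative assumption rather than anything prescribed here:

```python
import hashlib

def k_indices(item: str, n_bits: int, k: int):
    """Derive k bit-array indices for `item` from two base hash values,
    using the combination h1 + i*h2 (mod n_bits)."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # avoid a zero stride
    return [(h1 + i * h2) % n_bits for i in range(k)]

# Three indices for one IP address in a 100-bit array.
print(k_indices("173.194.0.1", n_bits=100, k=3))
```

The same item always maps to the same k indices, which is all the insert and lookup operations below will rely on.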

So, obviously it's a little bit more complicated to use multiple hash functions as opposed to just one, but it's really not that big of a deal. Recall from our discussion of, say, universal hashing that we have identified entire families of hash functions which work well on average. So, instead of choosing just one hash function at random from a universal family, you make k independent random choices from the family. In fact, in practice, it seems to typically be enough to use just two different hash functions and then generate k different linear combinations of those two. But for the purposes of this video, let's just assume that we've done enough work to come up with k different good hash functions, and that's what we're going to use in our Bloom filter.

So, the code for both insert and lookup is very elegant. Let's start with insertion. Suppose we have some new IP address and we want to stick it into the Bloom filter; what do we do? Well, we just evaluate each of our k hash functions on this new object. Each of those gives us an index into our array of bits, and we set those k bits equal to one. When we do this insert, we don't even bother to look at what the previous values of those bits were. Zero or one, we don't care; we just blithely go in and set these k bits to one, whatever they were before.

So, what about lookup? How are we going to implement that? Well, all you have to do is check for the footprint that would inevitably have been left by a prior insertion. If we're looking up an IP address that we know was inserted sometime in the past, what happened then? When we evaluated the k hash functions, we went to the appropriate positions in the array and set all of those bits to one. So now we just check that that indeed happened: when we get a new IP address that we're looking up, we evaluate all k hash functions, look at the corresponding k positions, and verify that indeed those k bits have been set to one.

What I hope is clear fairly quickly from inspecting this very elegant code is that we will never have false negatives, yet we might have false positives. Let's discuss those one at a time. Remember, a false negative would mean that the Bloom filter says something isn't there when in fact it is; that is, we insert something, we look it up later, and the Bloom filter rejects us. Well, that's not going to happen, because when we insert something, we set the relevant k bits to one, and notice that once a bit is one, it remains one forevermore. Bits are never reset back to zero. So, if anything was ever inserted, then when we subsequently look it up, we will definitely confirm that all those bits are one. We're never going to be rejected on something we inserted before.

On the other hand, it is totally possible that we will have a false positive, that there will be a phantom object: we do a lookup and the Bloom filter returns yes even though we never inserted that object. Suppose, for example, that k = 3, so we're using three different hash functions. Consider some fixed IP address, and maybe the three hash functions tell us the relevant bits are 17, 23, and 36. Maybe we never inserted this IP address, but we have inserted IP address number two, and in its insertion the 17th bit got set to one. We inserted some other IP address, number three, and the 23rd bit got set to one. And then we inserted IP address number four, and the 36th bit got set to one. So, three different IP addresses were responsible for setting these three different bits, but it's not like we are remembering that. And then, when we look up the IP address we really care about, what do we do? We inspect bit 17: it's one. We inspect bit 23: it's one. We inspect bit 36: it's also one. For all we know, this thing really was inserted, and the Bloom filter is going to say yes, it's in the table. So, that's how we get false positives: all of the bits indicating whether or not a given object is in the Bloom filter were previously set by insertions of other objects.

So, there are two points that I hope are clear at this stage of the discussion. First of all, this Bloom filter idea does suggest the possibility of a super space efficient variant of a hash table. We've been talking about setting the number of bits to be, say, roughly eight times the number of objects you're storing, so you're only using eight bits per object, and for most objects that's going to be radically smaller than even a simple array storing the objects themselves. Again, if these are IP addresses, we're using only 25 percent as much space as if we actually stored those IP addresses in just an array, with no extra bells and whistles. The second point is that we're inevitably going to have errors in a Bloom filter: we will have false positives, where we look something up and it says it's there when in fact it's not.

So, those two points I hope are clear. What's actually not clear is the bottom line: is this actually a useful idea? For this to be useful, it had better be the case that the error probability can be quite small even while the space per object is quite small. If we can't get those two things small simultaneously, this is a bad idea and we should always just use a hash table instead. So, to evaluate the quality of this idea, we're going to have to do a little bit of mathematical analysis. That's what I'm going to show you in the next couple of slides.
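Before the analysis, the two claims above, no false negatives ever, and a small false-positive rate even at roughly eight bits per object, can be checked empirically. A sketch in Python; the parameters (1,000 inserted items, k = 6 hash functions, SHA-256-based double hashing, 10,000 probe items) are all illustrative assumptions:

```python
import hashlib
import random

N_ITEMS, BITS_PER_ITEM, K = 1000, 8, 6
N_BITS = N_ITEMS * BITS_PER_ITEM  # eight bits per inserted object
bits = [False] * N_BITS

def indices(item):
    # k indices derived from two base hashes via h1 + i*h2 (mod n).
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # avoid a zero stride
    return [(h1 + i * h2) % N_BITS for i in range(K)]

def insert(item):
    for idx in indices(item):
        bits[idx] = True

def lookup(item):
    return all(bits[idx] for idx in indices(item))

random.seed(1970)  # the year of Bloom's paper; any seed works
inserted = [f"ip-{random.getrandbits(32)}" for _ in range(N_ITEMS)]
for item in inserted:
    insert(item)

# No false negatives: every inserted item must be found.
assert all(lookup(item) for item in inserted)

# False positives: probe items that were never inserted.
probes = [f"probe-{i}" for i in range(10_000)]
fp_rate = sum(lookup(p) for p in probes) / len(probes)
print(f"false-positive rate at 8 bits/object: {fp_rate:.3%}")
```

With these parameters the measured rate typically comes out around a couple of percent, which foreshadows the trade-off the mathematical analysis will make precise.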
