0:00

In this video we'll begin our discussion of hash tables; we'll focus first on the supported operations and on some of the canonical applications. So hash tables are insanely useful. If you want to be a serious programmer or a computer scientist, you really have no choice but to learn about hash tables. In fact, I'm sure many of you have used them in your own programs in the past. Now, on the one hand, what's funny is they don't actually do that many things in terms of the number of supported operations, but what they do do, they do really, really well. So what is a hash table? Well, conceptually, ignoring all aspects of the implementation, you may want to think of a hash table as an array. One thing that arrays do super well is support immediate random access. If you're wondering what's in position number 17 of some array, boom, with a couple of machine instructions you can find out. Want to change the contents of position number 23 in some array? Done, in constant time. So let's think about an application in which you want to remember

your friends' phone numbers. If you're lucky, your friends' parents were all unusually unimaginative people, and all of your friends' names are integers, let's say between 1 and 10,000. If that's the case, then you can just maintain an array of length 10,000, and to store the phone number of, say, your best friend 173, you can just use position 173 of this modest-sized array. So this array-based solution would work great, even if your friends change over time (you gain some here, you lose some there), as long as all your friends' names happen to be integers between 1 and 10,000. Now, of course, your friends have more interesting names:

Alice, Bob, Carol, whatever, and last names as well. So in principle you could have an array with one position for every conceivable name you might encounter, say every string of at most 30 letters. But of course this array would be way too big, something like 26 raised to the 30th power positions, and you could never implement it. So what you'd really want is an array of reasonable size, say ballpark the number of friends you'd ever have, so in the thousands or something, where its positions are indexed not by the integers between 1 and 10,000 but rather by your friends' names. And what you'd like is random access to this array based on your friend's name: you just look up the quote-unquote "Alice" position of this array and, boom, there would be Alice's phone number, in constant time. And this, on a conceptual level, is basically what a hash table can do for you. Now, there's a lot of magic happening under the hood of a hash table, and that's something we'll discuss to some extent in other videos. You have to have a mapping between the keys that you care about, like your friends' names, and the numerical positions of some array. That's done by what's called a hash function. But properly implemented, this is the kind of functionality that a hash table gives you: in effect, an array with its positions indexed by the keys that you're storing.
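To make this concrete, here's a minimal sketch; Python's built-in `dict` type is exactly such a hash table, and the names and phone numbers below are invented for the example:

```python
# A hash table ("dict" in Python) acts like an array indexed by names
# rather than by integer positions. All names and numbers are made up.
phone_numbers = {}

phone_numbers["Alice"] = "555-0143"   # constant-time insert
phone_numbers["Bob"] = "555-0178"

print(phone_numbers["Alice"])          # constant-time lookup: 555-0143

del phone_numbers["Bob"]               # constant-time delete
print("Bob" in phone_numbers)          # False
```

Under the hood, the language maps each name to an array position via a hash function, which is exactly the magic described above.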

So you can think of the purpose of a hash table as maintaining a possibly evolving set of stuff, where of course the set of things you're maintaining will vary with the application; it can be any number of things. If you're running an e-commerce website, maybe you're keeping track of transactions. Maybe you're keeping track of people, for example your friends and various data about them. Or maybe you're keeping track of IP addresses, for example if you want to know who the unique visitors to your website were.

And so on. So, a little more formally, here are the basic operations. You need to be able to insert stuff into a hash table. In many, but not all, applications you need to be able to delete stuff as well. And typically the most important operation is lookup. For all three of these operations, you do it in a key-based way, where as usual a key should just be a unique identifier for the record you're concerned with. So, for example, for employees you might be using Social Security numbers; for transactions you might have a transaction ID number; and IP addresses can act as their own keys. Sometimes all you're keeping track of is the keys themselves. With IP addresses, for example, maybe you just want to remember a set of IP addresses, with no associated data. But in many applications, along with the key comes a bunch of other stuff: along with the employee's Social Security number, you've got to remember a bunch of other data about that employee. But when you do the insert, the delete, or the lookup, you do it based on this key, and then, for example, on lookup, you feed the key into the hash table and the hash table spits back out all of the data associated with that key. You'll sometimes hear people refer to data structures that support these operations as a dictionary.
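As a small illustration of key-based operations with associated data, here's a sketch; the ID number and record fields are invented for the example:

```python
# Hash table keyed by an ID string; the value is the associated record.
# The ID and fields below are made up for illustration.
employees = {}

# Insert: the key plus all of its associated data.
employees["123-45-6789"] = {"name": "Carol", "department": "Engineering"}

# Lookup: feed in the key, get back everything associated with it.
record = employees["123-45-6789"]
print(record["name"])        # Carol

# Delete is also key-based.
del employees["123-45-6789"]
print(len(employees))        # 0
```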

So the main thing a hash table is meant to support is lookup, in the spirit of a dictionary. I find that terminology a little misleading, actually. Most dictionaries that you'll find are in alphabetical order, so they'll support something like binary search. And I want to emphasize something a hash table does not do: maintain an ordering on the elements that it stores. So if you're storing stuff and you do want order-based operations, say finding the minimum or the maximum or something like that, a hash table's probably not the right data structure; you want something more, like a heap or a search tree. But for applications in which all you have to do is basically look stuff up, where you've got to know what's there and what's not, a light bulb should go off in your head: let me consider a hash table, that's probably the perfect data structure for this application. Now, looking at this menu of supported operations, you may be left kind of unimpressed. A hash table, in some sense, doesn't do that many things, but again, what it does, it does really, really well. So, to a first order:

What hash tables give you is the following amazing guarantee: all of these operations run in constant time. Again, this is in the spirit of thinking of a hash table as just like an array whose positions are conveniently indexed by your keys. Just as an array supports random access in constant time (you can see whether there's anything in an array position, and what it is), similarly a hash table lets you look up based on the key in constant time. So what's the fine print? Well, there are basically two caveats. The first is that hash tables are easy to implement badly, and if you implement them badly you will not get this guarantee; the guarantee is for properly implemented hash tables. Now, of course, if you're just using a hash table from a well-known library, it's probably a pretty good assumption that it's properly implemented. You'd hope. But in the event that you're forced to come up with your own hash table and your own hash function (and unlike with many of the other data structures we'll talk about, some of you probably will have to do that at some point in your career), then you'll get this guarantee only if you implement it well. We'll talk about exactly what that means in other videos. The second caveat is that, unlike most of the problems that we've solved in this course, hash tables don't enjoy worst-case guarantees. You cannot say, for a given hash table, that for every possible data set you're going to get constant-time operations. What's true is that for non-pathological data, you will get constant-time operations in a properly implemented hash table. We'll talk about both of these issues a bit more in other videos, but for now the high-order bits are: hash tables, constant-time performance, subject to a couple of caveats. So now that I've covered the operations that hash tables

support, and the recommended way to think about them, let's turn our attention to some applications. All of these applications are, in some sense, kind of trivial uses of hash tables, but they're also all really practical; these come up all the time. The first application we'll discuss, which again is a canonical one, is removing duplicates from a bunch of stuff, also known as the deduplication problem. In the deduplication problem, the input is essentially a stream of objects, where when I say a stream I have two different things in mind as canonical examples. First, you can imagine you have a huge file: say, a log of everything that happened on some website you're running, or all of the transactions that were made in a store on some day, and you do a pass through this huge file, sitting in the middle of some outer for loop going line by line through this massive file. The other

example of a stream that I had in mind is one where you're getting new data over time. Here, you might imagine that you're running software deployed on an Internet router, and data packets are coming through this router at an extremely fast rate, so you might be looking at, say, the IP addresses of the sender and receiver of each data packet going through your router. That would be another example of a stream of objects. And now, what have you got to do? You've got to ignore the duplicates, remembering just the distinct objects that you see in this stream. I hope you find it easy to imagine why you might want to do this in various applications. For example, if you're running a website, you might want to keep track of the distinct visitors that you saw in a given day or a given week. If you're doing something like a web crawl, you might want to identify duplicate documents and only remember them once. For example, it would be annoying if, in search results, both the top link and the second link led to identical pages at different URLs; search engines obviously want to avoid that, so you want to detect duplicate web pages and only report unique ones. And the solution using a hash table is laughably

simple. Every time a new object arrives in the stream, you look it up. If it's there, then it's a duplicate and you ignore it. If it's not there, then this is a new object and you remember it. QED, that's it. And then after the stream completes, so for example after you finish reading some huge file, if you just want to report all of the unique objects, hash tables generally support a linear scan through them, and you can just report all of the distinct objects when the stream finishes.
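The whole solution can be sketched in a few lines using Python's built-in `set` (a hash table storing keys only); the list of addresses here is just an invented stand-in for a file or packet stream:

```python
# Deduplicate a stream with a hash table of the objects seen so far.
# The list below is a stand-in for lines of a file or incoming packets.
stream = ["1.2.3.4", "5.6.7.8", "1.2.3.4", "9.9.9.9", "5.6.7.8"]

seen = set()          # hash table: keys only, no associated data
distinct = []
for obj in stream:
    if obj not in seen:      # constant-time lookup
        seen.add(obj)        # constant-time insert
        distinct.append(obj)

print(distinct)   # ['1.2.3.4', '5.6.7.8', '9.9.9.9']
```

One lookup plus at most one insert per object, so the whole pass is linear time, subject to the usual caveats.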

So let's move on to a second application, slightly less trivial maybe but still quite easy; this is the subject of Programming Project number 5. The problem is called the two-sum problem. You're given as input an array of n integers, in no particular order, and a target sum, which I'll call t. What you want to know is: are there two integers from among the n you're given that sum to t? Now, the most obvious and naive way to solve this problem is just to go over all n choose 2 pairs of integers in the input and check each one separately. That's clearly a quadratic-time algorithm. But now, of course, we need to ask: can we do better? And yes, we can. First, let's see what you'd do if you couldn't use any data structures. If you were clever, but you didn't use a data structure like a hash table, here would be a reasonable improvement over the naive solution. The first step of a better solution is to sort the array up front, for example using merge sort or heapsort, something that runs in n log n time. You may be asking about the motivation for sorting. Well, one thing is, whenever you're trying to do better than n squared, you might think that sorting your data somehow helps, and you can do it almost for free, in n log n time.

Now, why would sorting the array up front help us? Well, the clever insight is that for each entry of the array, say the first entry, we now know what we're looking for to achieve the given target. If the target sum we're trying to hit is 100 and the first entry in the sorted array is 43, then we know we're looking for a 57 somewhere else in this now-sorted array. And we know that searching a sorted array is easy: that's just binary search, which takes logarithmic time. So for each of the n array entries, we can look for a complementary entry; namely, for each x we can look for t - x using binary search.

And each binary search takes log n time. So the sorting up front speeds up this entire batch of n searches; that's why it's a win. In the second step, because we do a linear number of binary searches, the cost is just n, the number of searches, times log n, the time per search, so this step is just another Theta(n log n). All right, so that's pretty cool. I don't think you could come up with this n log n solution without having some basic facility with algorithms, and it's already a really nice improvement over the naive n-squared solution. But we can do even better; there's no reason we're stuck with an n log n bound for the two-sum problem. Obviously, because the array is unsorted, we have to look at all the integers, so we're not going to do better than linear time. But we can achieve linear time via a hash table. A good question you might ask at this point is: what's the clue about this problem that suggests we want to use a hash table? Well, hash tables are going to dramatically speed up any

application where the bulk of the work is just repeated lookups. And if we examine this n log n solution, once we have the idea of searching for t - x for each value of x, we realize that the only thing we needed the sorted array for was to support lookups; that's all binary search is doing, just looking stuff up. So we say, aha: all of the work in step two is from repeated lookups. We're paying a relatively exorbitant logarithmic amount of time per lookup, whereas hash tables can do them in constant time. So, repeated lookups, ding, ding, ding: let's use a hash table, and indeed that's what gives us linear time for this problem. From the amazing guarantee of hash tables, we get the following amazing solution for the two-sum problem, although again this is subject to the same fine print: you'd better use a properly implemented hash table, and you'd better not have pathological data. So rather than

sorting, you just insert everything in the array into a hash table. Insertions run in constant time, so this step is going to take linear time rather than the n log n we were paying before. Once all the stuff is in the hash table, we do the same thing as in the n log n solution: for each x in the array, we look for its matching element t - x in the hash table, using the constant-time lookup operation exported by the hash table. And of course, if for some x you do find the matching element t - x, then you can just report x and t - x; that proves there is indeed a pair of integers with target sum t. If for every single element of the input array you fail to find the matching element t - x in the hash table, then for sure there is no pair of integers in the input that sums to t. So this solves the problem correctly. Moreover, with constant-time insertions, the first step is going to take O(n) time, and with constant-time lookups, the second step is also going to be linear time, all subject to the caveats that we discussed on the previous slide.
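Putting those two steps together gives a linear-time sketch; the function name is my own, and here the insert and lookup passes are interleaved, which also avoids matching an element against itself:

```python
def two_sum_hash(nums, t):
    """Return True if two entries of nums sum to t, using a hash table."""
    table = set()
    for x in nums:
        if t - x in table:   # constant-time lookup for the matching element
            return True
        table.add(x)         # constant-time insert
    return False

print(two_sum_hash([43, 12, 57, 8], 100))  # True  (43 + 57)
print(two_sum_hash([43, 12, 57, 8], 99))   # False
```

One lookup and one insert per element, each constant time for non-pathological data, so the whole thing is linear.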

So it's kind of amazing how many different applications of computer science boil down, in their essence, to repeated lookup operations. Therefore, having a super-fast lookup operation, like the one supported by a hash table, permits these applications to scale to fantastic sizes. It's really amazing, and it drives a lot of modern technology. Let me just mention a couple of examples; if you look around or do some research on the web, you'll quickly find many more. Originally, what prompted researchers to think hard about data structures supporting super-fast lookups was back when people were first building compilers, a long time ago, in the fifties or so. Repeated lookups, to figure out what has and has not been defined before, were emerging as a bottleneck in compilers in the early days of programming languages. So one of the early applications of hash tables was to support super-fast lookups to speed up compile time, keeping track of function and variable names and things like that. Hash table technology is also super

useful for software on routers in the Internet. For example, you might want to block network traffic from certain sources. Maybe you suspect that a certain IP address has been taken over by spammers, so you just want to ignore any traffic coming from that IP address, and you don't even want to let it get to the end host, to the computer on someone's desktop or to someone's mobile device; rather, inside the Internet, you want to just drop packets coming from certain senders. So what does that problem boil down to? Well, you might have a blacklist of IP addresses that you're refusing traffic from, and then the task faced by the router is really the lookup problem: as each data packet comes in at some insanely fast rate, you immediately look up whether its source is in the blacklist or not. If it is in the blacklist, you drop the packet; if it's not, you let it go through.
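That per-packet check is again a single hash-table lookup. A sketch, with invented addresses:

```python
# Blacklist check on a router: one constant-time hash-table membership
# test per packet. The addresses below are invented for illustration.
blacklist = {"203.0.113.7", "198.51.100.23"}

def should_drop(source_ip):
    """Drop the packet if its source address is blacklisted."""
    return source_ip in blacklist   # constant-time lookup

print(should_drop("203.0.113.7"))   # True: drop it
print(should_drop("192.0.2.1"))     # False: let it through
```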

A very different application is speeding up search algorithms. When I say a search algorithm, what I'm thinking about here is something like a chess-playing program, something that does game-tree exploration. We've already talked a fair amount about graph search in this class, but in our discussion of breadth-first and depth-first search, we were thinking about graphs that you could basically write down: you could store them in the main memory of your machine or, in the worst case, on some big cluster. So maybe graphs about the size of the web graph, or possibly smaller. But in the context of something like a chess-playing program, the graph you're interested in is way, way bigger than the web graph. So what's the graph we care about for a chess-playing program? Well, the nodes of the graph correspond to all possible configurations of chess pieces on a chessboard, every board position you might ever encounter in a game of chess. That's a massive, massive number of configurations, and you're never going to be able to write down all these vertices. The edges in this graph take you from one configuration to another, and they correspond to legal moves. So if you can move a bishop from one place to another, taking you from one configuration to another configuration, there's an edge in the graph corresponding to that move.

Now, you can't write down this graph, so you can't implement breadth-first or depth-first search exactly as we discussed them before. But you'd still like to do graph exploration, right? You'd like your computer program to reason about at least the short-term ramifications of your possible next moves, and that corresponds to searching through this graph. Now, remember that a really important property of graph search was that you don't want to do redundant work: you don't want to re-explore things you've already explored.

That would be silly and might lead to infinite loops and so on. And since you can't write down the graph, just remembering where you've been is suddenly a non-trivial problem. But what is remembering where you've been, fundamentally? Fundamentally, that's a lookup operation, and that is exactly what hash tables are for. To be a little more concrete, one way you'd use a hash table in, say, a chess-playing program is: you'd take the initial configuration and imagine trying all possible moves from it; then you'd consider all moves from your opponent, and then all your moves in response. And as you were doing this reasoning, you would always remember every chessboard configuration you'd ever looked at before by sticking it in the hash table. Before exploring some configuration, you'd look it up in your hash table to see if you've already explored it in the past, and if you have, you don't bother: you've already seen it. All right, so chess-playing programs operate by systematically exploring as many configurations as they have time for; obviously, with a budget of three minutes or whatever, you don't want to waste any work exploring a given configuration more than once. How do you remember where you've been? Everything you've explored, you stick in a hash table, and before you explore a configuration, you look it up in the hash table to see if you've already done it.
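The same visited-set pattern applies to any graph exploration, not just chess. Here is a generic sketch over a tiny explicit graph (the graph and names are invented; in a chess program, `neighbors` would generate the configurations reachable by legal moves on the fly):

```python
from collections import deque

def explore(start, neighbors):
    """Explore a graph from start, using a hash table to avoid revisiting nodes."""
    visited = {start}          # hash table of configurations already seen
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node):
            if nxt not in visited:   # constant-time "have I been here?" lookup
                visited.add(nxt)
                queue.append(nxt)
    return visited

# Tiny stand-in graph; a real game-tree search would never write this down.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
print(sorted(explore("a", lambda v: graph[v])))   # ['a', 'b', 'c', 'd']
```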

So these, of course, are just scratching the surface. I just wanted to highlight a couple of fairly different-looking applications to convince you that hash tables come up all the time, and the reason they come up all the time is that the need for fast lookups comes up all the time. It's kind of amazing how much technology is driven just by repeated fast lookups. So as homework, I encourage you to think about your own life, or about technology out there in the world, and come up with some guesses about where hash tables are probably making something run blazingly fast. I think it won't take you more than a few minutes to come up with some good examples.
