0:03

In Segment 1.1 we're going to talk about cryptographic hash functions.

We'll talk about what they are and what their properties are, and then later we'll move on and talk about what their applications are.

So, a cryptographic hash function is a mathematical function, and it has three attributes that we need to start with. First, a hash function can take any string as input, absolutely any string of any size. Second, it produces a fixed-size output; we'll use 256 bits in this series of lectures, because that's what Bitcoin does. Third, it has to be efficiently computable, meaning that, given a string, you can figure out what the output is in a reasonable length of time.

So that's a hash function, but we're going to need hash functions that are cryptographically secure. The cryptographic properties of hash functions are a complicated topic in general, but we're going to focus here on three particular properties, and I'll explain in a minute what those are.

0:53

In particular: that the function is collision-free, that it has a hiding property, and that it's puzzle-friendly. For each of these, I'll talk about what the property is and what it means, and then why it's useful to have a function with that property.

So, first, collision-free. The first property that we need from a cryptographic hash function is that it's collision-free. What that means is that nobody can find values x and y, such that x and y are different, and yet the hash of x is equal to the hash of y.

1:25

And so, if we look at the operation of the function as depicted by one of these red arrows, here's x and H(x), and here's y and H(y). Nobody can find a situation like this: an x and a y that are separate, and yet, when you hash them, they hash to the same value.

Now, one thing to notice is that I said nobody can find a collision. I didn't say that no collision exists, because, if you think about it, there has to be a collision.

Collisions do exist, and to understand why, we can use this diagram. Over here on the left, I'm depicting all of the possible inputs to the function, which can be strings of any size. And over here on the right, I have all of the possible outputs, which have to be strings of exactly 256 bits. So on the right-hand side, among the outputs, there are only 2 to the 256 possibilities; over here on the left, there are more. And if you think of every point on the left as being mapped by an arrow to some point on the right, you can see that, as you go from all the points on the left over to the right, it has to get crowded: there have to be multiple values on the left that map to the same output on the right. In fact, in general, there will be a very large number of possible inputs that map to any particular output. So collisions do exist.

I said before that nobody can find a collision, and that's the key question. We know collisions exist; the question is, are there any collisions that are findable by regular people using regular computers?

2:50

Okay, now, to make things even worse, I said that it has to be impossible to find a collision. So let me tell you how to find a collision, because there's a method that's guaranteed to work. The method works like this: we pick 2 to the 130 randomly chosen inputs from the left-hand cloud of that previous diagram. If we pick those 2 to the 130 randomly chosen inputs, it turns out there's a 99.8% chance that at least two of them will collide. So this is a simple method for finding a collision, and it works no matter what the hash function is. But, of course, the problem is that it takes a very, very long time: you have to compute the hash function 2 to the 130 times, and that is, of course, an astronomical number. Still, this method works no matter which hash function we're using; there's a 99.8% probability that it works, and if it doesn't work, just try it again, and it'll probably work the next time.
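To make this birthday method concrete, here's a toy sketch in Python (not from the lecture): SHA-256 truncated to 16 bits, so a collision turns up after a few hundred tries instead of 2 to the 130 steps.

```python
# Toy birthday attack: find a collision in SHA-256 truncated to 16 bits.
# The same generic strategy against the full 256-bit output would need
# on the order of 2^130 hash evaluations, which is infeasible.
import hashlib
import os

def tiny_hash(data: bytes) -> bytes:
    """SHA-256 truncated to 2 bytes (16 bits) -- deliberately weak."""
    return hashlib.sha256(data).digest()[:2]

def find_collision():
    seen = {}  # maps each 16-bit output back to an input that produced it
    while True:
        x = os.urandom(8)              # a fresh random 8-byte input
        h = tiny_hash(x)
        if h in seen and seen[h] != x:
            return seen[h], x          # two distinct inputs, same hash
        seen[h] = x

a, b = find_collision()
assert a != b and tiny_hash(a) == tiny_hash(b)
```

By the birthday bound, a collision in a 16-bit output is expected after only a few hundred random inputs, which is why this toy version runs instantly.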

But this doesn't really matter, and the reason is that this procedure takes 2 to the 130 steps in order to reach that high probability. So we can say something like this: if every computer ever made by humanity had been computing since the beginning of the entire universe, up to now, the odds that they would have found a collision are still infinitesimally small. So small that it's way less than the odds that the Earth will be destroyed by a giant meteor in the next two seconds. Which didn't happen.

4:14

Okay, so we know how to find a collision, but this method takes too long to matter. The question is whether there's some other method that could be used on a particular hash function in order to find a collision. And that question is harder to answer: is there a faster way to find collisions? Well, for some hash functions, of course there is. For example, if our hash function simply took the input modulo 2 to the 256, that is, if it just selected the last 256 bits of the input, then we would know an easy way to find a collision: one collision would be the values 3 and 3 plus 2 to the 256. So, for some possible hash functions, it's very easy to find a collision. For others, we don't know.
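That weak "last 256 bits" hash can be checked in a couple of lines; a quick sketch:

```python
# The weak hash from the text: h(x) = x mod 2^256, i.e. keep only the
# last 256 bits of the input. Any two inputs that differ by a multiple
# of 2^256 collide.
def weak_hash(x: int) -> int:
    return x % (1 << 256)

x, y = 3, 3 + (1 << 256)              # the collision pair from the text
assert x != y
assert weak_hash(x) == weak_hash(y)   # both hash to 3
```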

5:01

Now, one thing I need to note is that there's no hash function in existence that has been proven to be collision-free. There are just some for which people have tried really, really hard to find collisions and haven't succeeded. And so we choose to believe that those are collision-free.

Okay, now, what good does collision-freedom do us? If we can assume that we have a hash function that is collision-free, then we can use that hash function as a message digest. What I mean by that is the following: if we know that x and y have the same hash, then it's safe to assume that x and y are the same. Because if someone knew an x and a y that were different but had the same hash, that would, of course, be a collision; and since there's no collision that we know of, then knowing the hashes are the same, we can assume that the values are the same. And this lets us use the hash as a kind of message digest.

Suppose, for example, that we had a really big file, and we wanted to be able to recognize later whether another file was the same as the file we saw the first time. One way to do that would be to save the whole big file, and then, when we saw another file later, just compare them. But because we have hashes that we believe are collision-free, it's more efficient to just remember the hash of the original file. Then, if someone shows us a new file and claims it's the same, we can compute the hash of the new file and compare the hashes. If the hashes are the same, we conclude that the files must have been the same. That gives us a very efficient way to remember things we've seen before and recognize them again. And, of course, this is useful because the hash is small, only 256 bits, while the original file might be really big.

So the hash is useful as a message digest, and we'll see, later in this lecture and in subsequent lectures, why it's useful to use a hash as a message digest.
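As a sketch of this digest-and-compare idea in Python (the file names are placeholders, not anything from the lecture):

```python
# Use SHA-256 as a message digest: store a 32-byte hash instead of the
# whole file, and later compare hashes rather than file contents.
import hashlib

def digest(path: str) -> bytes:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so arbitrarily large files fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.digest()   # 32 bytes = 256 bits, however big the file is

# saved = digest("original.bin")    # remember only this small value
# ...later, when someone presents a file claimed to be the same...
# is_same = (digest("candidate.bin") == saved)
```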

8:01

And so, if we're going to have a hiding property like this, it needs to be the case that there's no value of x which is particularly likely. That is, x has to be chosen from a set that's, in some sense, very spread out, so that the adversary's method of just trying all the possible values of x, or trying the few values of x that are especially likely, is not going to work. So the hiding property that we're going to need is a little more complicated to set up. And the way we're going to fix this problem with a common value x, like heads or tails, is that we're going to take the x and put next to it, concatenate with it, a value r which is chosen from a distribution that's really spread out.

8:43

And so this H(r | x) means: take all the bits of r, and put after them all the bits of x. What we're going to say is that, given the hash of r together with x, it's infeasible to find x. Formally stated, the property is that if r is a random value chosen from a distribution that has high min-entropy, then, given H(r | x), it's infeasible to find x.

And what does high min-entropy mean? Well, it captures the intuitive idea that r is chosen from a distribution that's really spread out. What that means, specifically, is that there's no particular value that r could have taken that would occur with more than a negligible probability. So, for example, if r is chosen uniformly from among all of the strings that are 256 bits long, then any particular string was chosen with probability 1 in 2 to the 256, which is truly a negligible value. So, as long as r is chosen that way, the hash of r concatenated with x is going to hide x. And that's the hiding property that the hash function will be deemed to have.
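A small Python sketch of why the spread-out r matters (the heads/tails strings are just illustrative):

```python
# Without r, hiding fails for a value drawn from a tiny set: an
# adversary can hash every candidate and compare. Prepending a random
# 256-bit r makes that brute force infeasible.
import hashlib
import os

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# No r: the adversary recovers x by trying both possibilities.
target = H(b"heads")
recovered = next(v for v in (b"heads", b"tails") if H(v) == target)
assert recovered == b"heads"      # x was not hidden at all

# With r: the adversary would also have to guess the 256-bit r,
# so trying "heads" and "tails" alone no longer helps.
r = os.urandom(32)                # 256 bits, high min-entropy
hidden = H(r + b"heads")          # H(r | x)
assert H(b"heads") != hidden      # hashing the bare guesses fails
```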

Okay, now let's look at an application of that hiding property. In particular, what we want to do is something called a commitment. This is a kind of digital analog of taking a value, a number, sealing it in an envelope, and putting that envelope out on the table where everyone can see it. When you do that, you've committed to what's in the envelope.

10:09

But you haven't opened it; it's secret from everyone else. Later, you can open the envelope and take out the value, but until then it stays sealed. So: commit to a value now, and reveal it later. We want to do that in a digital sense.

To be more specific about the API we're going to provide here, the commitment API looks like this; there are two things you can do. First, you can commit to a message, and that returns two values: a commitment and a key. Think of the commitment as the envelope that you're going to put on the table, and the key as a secret key for unlocking the envelope. Then, later, you allow someone else to verify, given the commitment, a key which you've told them in the meantime, and the message, that the commitment, key, and message really do go together. And this will return true or false.

Okay, now, to seal a message, msg, in an envelope, what we do is commit to the message. That returns a commitment and a key, and then we publish the commitment; that's putting the envelope on the table. Later, to open the envelope, we publish the key and the message that we committed to, and then anybody can use the verify call, with the commitment that we published previously and the key and message that we just announced, to check the validity of our opening of the envelope.

Okay, and the property we want from this, of course, is that it behaves like sealing an envelope. In particular, the two security properties are these. First, given com, the commitment, the envelope on the table, someone just looking at the envelope can't figure out what the message is.

12:03

To commit to a value, msg, we're going to generate a random 256-bit value and call it the key. Then, as the commitment, we're going to return the hash of the key concatenated together with the message, and as the key value, we're going to return the key itself. Later, to verify, someone computes this same hash of the key they were given, concatenated with the message, and checks whether it's equal to the commitment that they saw. So this is a way of using hash functions in both the commitment and the verification.
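Here's one way this scheme might be sketched in Python, using SHA-256 as the hash function H (the message string is just an example):

```python
# commit(msg) draws a random 256-bit key and returns the pair
# (H(key | msg), key); verify(com, key, msg) recomputes the hash.
import hashlib
import os

def commit(msg: bytes):
    key = os.urandom(32)                       # random 256-bit key
    com = hashlib.sha256(key + msg).digest()   # H(key | msg)
    return com, key

def verify(com: bytes, key: bytes, msg: bytes) -> bool:
    return hashlib.sha256(key + msg).digest() == com

com, key = commit(b"my bid is 42")    # publish com: envelope on the table
# ...later, reveal key and the message...
assert verify(com, key, b"my bid is 42")        # the opening checks out
assert not verify(com, key, b"my bid is 99")    # a swapped message fails
```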

So now, the security properties. If we go down to the security properties that were at the bottom of the previous slide, and we just plug in the definitions of how we're implementing this: where it used to say "given com, infeasible to find msg," we plug in what com is, and com is the hash of key concatenated with msg. And similarly down here: this is what happens when we take what was written there before and plug in the definitions of verify and com.

Okay, so the first property becomes: given H(key | msg), it's infeasible to find msg. Well, it turns out that's exactly the hiding property we talked about before. The key was chosen as a random 256-bit value, and the hiding property says that if we take the message and put before it something chosen from a very spread-out distribution, such as a random 256-bit value, then it's infeasible to find the message. So this is exactly the hiding property. And the one down here turns out to be exactly the collision-free property: if someone can find two messages that have the same hash like this, then they have an input value here and an input value there that are different and yet hash to the same value. And so, because of the two security properties we've talked about for hashes so far, this commitment scheme will work, in the sense that it will have the necessary security properties.

Okay, so that's the second security property of hashes, that they're hiding, and the application of that is commitments. The third security property we're going to need is that they're puzzle-friendly. This is, again, a little bit more complicated, so let me go through it bit by bit. Take any possible output value y that you might want from the hash function; we're going to use y as an output value of the hash later. The property says that if k is chosen from a distribution that has high min-entropy, that is, if k is chosen randomly from some set that's super spread out, then it's infeasible to find an x such that the hash of k concatenated with x is equal to y. What this means, basically, is that if someone wants to target the hash function, to make it come out to some particular output value y, then, as long as part of the input is chosen in a suitably randomized way, it's very difficult to find a value x that hits exactly that target.

So the application we're going to make of this is to build a search puzzle. What that means is that we're going to build a mathematical problem which requires searching a very large space in order to find the solution, and where there are no shortcuts: no way to find a valid solution other than searching that large space. That's a search puzzle.

To be more specific, the idea is that we're given a puzzle ID, which is chosen from some high min-entropy distribution, that is, some very spread-out probability distribution, and we're given a target set, Y, which someone wants to make the hash output fall into. Then we want to try to find a solution, x, such that if we hash the puzzle ID together with the solution x, we get a result that's in the set Y. So the idea is: Y is a target range, a set of hash results that we want; the ID specifies a particular puzzle; and x is a solution to the puzzle.

15:57

And the puzzle-friendly property implies that there's no solving strategy for this puzzle which is much better than just trying random values of x. So if we want to pose a puzzle that's difficult to solve, we can do it this way, as long as we can generate puzzle IDs in a suitably random way. We're going to use that later, when we talk about Bitcoin mining; that's the sort of computational puzzle we're going to use.
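A miniature version of such a search puzzle can be sketched in Python; the target set Y here is my own illustrative choice (outputs whose first 16 bits are zero), which keeps the search small enough to run:

```python
# Search puzzle: given a high min-entropy puzzle id, find x such that
# H(id | x) lands in the target set Y (here: first 16 bits are zero).
# Puzzle-friendliness means there's no strategy much better than this
# brute-force loop over candidate x values.
import hashlib
import os

def solve(puzzle_id: bytes, difficulty_bits: int = 16) -> int:
    x = 0
    while True:
        h = hashlib.sha256(puzzle_id + x.to_bytes(8, "big")).digest()
        # In Y? i.e. are the top difficulty_bits of the hash all zero?
        if int.from_bytes(h, "big") >> (256 - difficulty_bits) == 0:
            return x
        x += 1

puzzle_id = os.urandom(32)    # chosen from a high min-entropy distribution
x = solve(puzzle_id)
h = hashlib.sha256(puzzle_id + x.to_bytes(8, "big")).digest()
assert h[:2] == b"\x00\x00"   # the solution lands in the target set
```

On average this takes about 2 to the 16 hash evaluations; Bitcoin mining uses the same structure with a vastly smaller target set, and hence a vastly longer search.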

Okay, so we've talked about three properties of hash functions, and one application of each of those. Now let me talk just very briefly about the particular hash function we're going to use. There are lots of hash functions in existence, but this is the one Bitcoin uses, and it's a pretty good one. It's called SHA-256, and it works like this.

Basically, it takes the message that you're hashing and breaks it up into blocks that are 512 bits in size. The message isn't, in general, going to be exactly a multiple of the block size, so we add some padding at the end. The padding consists of, at the very end, a 64-bit length field, which is the length of the message in bits; and before that, a 1 bit followed by some number of 0 bits, where you choose the number of 0 bits so that the padding comes out exactly to the end of a block.

Once you've padded the message so that its length is exactly a multiple of the 512-bit block size, you chop it up into blocks and then execute this computation. You start with a 256-bit value called the IV; that's just a number that you look up in a standards document. You take the IV and the first block of the message, those 768 total bits, and run them through a special function, c, the compression function, and out come 256 bits. You then take that result together with the next 512 bits of the message, run it through c again, and keep going. Each iteration of c crunches in another 512-bit block of the message and mixes it, sort of logically, into the result. When you get to the very end, you have consumed all of the blocks of the message plus the padding, and the result is the hash: a 256-bit value.
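The padding rule just described can be sketched in Python; this computes only the padding step, leaving the compression function and the IV abstract:

```python
# SHA-256 padding: append a single 1 bit (0x80 starts with it), then
# enough 0 bits, then the 64-bit big-endian message length in bits,
# so the total is a whole number of 512-bit (64-byte) blocks.
def pad(message: bytes) -> bytes:
    bit_len = 8 * len(message)
    padded = message + b"\x80"            # the 1 bit, then seven 0 bits
    while (len(padded) + 8) % 64 != 0:    # leave room for the length field
        padded += b"\x00"
    return padded + bit_len.to_bytes(8, "big")

padded = pad(b"abc")
assert len(padded) % 64 == 0              # exact multiple of the block size
# The padded message is then split into 64-byte blocks and fed, one at a
# time, through the compression function c, starting from the standard IV.
```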

And it's easy to show that if this compression function, c, is collision-free, then the entire hash function will also be collision-free. The other properties are a little bit more complicated, so I won't talk about them here.