0:00

One of the more interesting stories in the field of probabilistic graphical models relates to the connection between the algorithm of loopy belief propagation and the problem of decoding messages sent over a noisy channel. This is a connection that was discovered fairly recently and has proven to have a tremendous impact on both disciplines: that of probabilistic graphical models and that of message decoding.

So let's understand the message decoding problem. Imagine that I want to send K bits across a noisy channel. If I just send those K bits directly, then it's impossible for me to know, from the noisy bits that I receive at the other end, what the correct bits ought to have been, because we don't know which ones got corrupted and which ones didn't. And so what people have developed is the field of coding theory, where those K bits are passed through an encoder, and what we get at the end are N bits, where N is usually greater than K. Those are the bits that are actually transmitted across the noisy channel, giving us a set of noisy bits Y1 up to Yn, which we now hope to be able to decode in a way that ideally gives us bits V1 up to Vk that are close to, and hopefully completely accurate relative to, the bits that were originally sent, U1 up to Uk. That's message coding and decoding in a nutshell.

The rate of a code such as this is K over N; that is, you send N bits, of which K carry actual information. The bit error rate one gets, and that's a function of the channel itself, is the overall probability that a bit we get on the receiving side, say the first bit, differs from the original bit; the bit error rate is the average of this probability across the K decoded bits.
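In symbols, the two quantities just defined are:

```latex
\text{rate} \;=\; \frac{K}{N},
\qquad
\text{bit error rate} \;=\; \frac{1}{K}\sum_{i=1}^{K} P(V_i \neq U_i).
```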

1:57

So we can now think about different channels and their properties, and there's this very important notion called the channel capacity. Let's first think about the different error modes that a channel might exhibit.

Here's something called the binary symmetric channel, often abbreviated BSC. The binary symmetric channel says that when I send a zero, then with probability 0.9 I get a zero, but with probability 0.1 I get a one. And conversely, when I send a one, I get a one with probability 0.9 and a zero with probability 0.1. You can see why it's called the binary symmetric channel: the error is symmetric between zero and one.

A different error model is an erasure channel, where the bits don't get corrupted, they just get destroyed. The big difference with the binary erasure channel, abbreviated BEC, is that when a bit gets destroyed, I know it. What I get at the end is a question mark, as opposed to a bit where I don't actually know whether it was the original bit that was sent or a corrupted one; here, I know the bit was corrupted.

And finally, here is what's called a Gaussian channel, where the noise that gets added on top of the signal is actually analog noise, so there is a Gaussian distribution relative to the signal that was sent. It turns out these different channels have very different notions of what capacity means.
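As a minimal sketch, the three channel models can be simulated as follows, assuming the flip and erasure probability of 0.1 quoted above (the noise level `sigma` for the Gaussian channel is an arbitrary illustrative choice):

```python
import random

def bsc(bit, p=0.1, rng=random):
    """Binary symmetric channel: flip the bit with probability p."""
    return bit ^ 1 if rng.random() < p else bit

def bec(bit, p=0.1, rng=random):
    """Binary erasure channel: destroy the bit (-> None) with probability p."""
    return None if rng.random() < p else bit

def awgn(bit, sigma=0.5, rng=random):
    """Gaussian channel: transmit +/-1 and add analog Gaussian noise."""
    return (1.0 if bit else -1.0) + rng.gauss(0.0, sigma)
```

Note the qualitative difference: `bec` tells the receiver which bits were lost, while `bsc` and `awgn` force the decoder to infer which bits are trustworthy.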

We'll define exactly what implications capacity has in a moment. Very smart people have computed the capacity for these different kinds of channels, and it turns out, for example, that the capacity of the binary symmetric channel above is a little over 0.5, whereas the capacity of the binary erasure channel is 0.9, where 0.9 is the probability of getting the correct bit. If you think about it, that makes perfect sense: the capacity of a channel where you know which bits were erased is higher than one where you have to figure out whether bits are right or wrong, so erasure is a more benign type of noise. What you see over here is the capacity of the Gaussian channel; it's an expression that involves the width of the Gaussian, and the wider it is, the lower the capacity.

Now Shannon, in a very famous result known as Shannon's theorem, related the notion of channel capacity and bit error probability in a way that defines an extremely sharp boundary between codes that are feasible and codes that are infeasible.
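For reference, the standard capacity formulas behind the numbers just quoted (with p the flip or erasure probability, and S/N the signal-to-noise ratio of the Gaussian channel):

```latex
C_{\mathrm{BSC}} = 1 - H_2(p), \qquad H_2(p) = -p\log_2 p - (1-p)\log_2(1-p),
\quad\text{so } C_{\mathrm{BSC}} \approx 0.53 \text{ at } p = 0.1;
```
```latex
C_{\mathrm{BEC}} = 1 - p = 0.9 \text{ at } p = 0.1;
\qquad
C_{\mathrm{AWGN}} = \tfrac{1}{2}\log_2\!\left(1 + \tfrac{S}{N}\right).
```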

So let's look at the diagram. The x axis is the rate of the code; remember, the rate was K over N in the example that we gave, and here the rate is measured in multiples of the channel capacity. So once you define the channel capacity, we can look at the rate of a code, and each code could sit at a different point of the spectrum in terms of the rate. On the y axis, we have the bit error probability, and obviously lower is better.

What Shannon proved is that in this region over here, the region that I'm marking in blue, the attainable region, you can construct codes that achieve any point; that is, for any point in this two-dimensional space of rate and bit error probability, you can construct a code that achieves it. He didn't show it with a constructive argument; it was a non-constructive argument, but he proved that such codes exist. Conversely, he showed that anything that falls on the other side of the boundary, the forbidden region, is just not attainable. That is, no matter how clever a coding theorist you are, you could not construct a code that had a rate above a certain value and a bit error probability below a certain value, which is the shape of this boundary that we see over here. And you can see why the channel capacity is called capacity: the boundary sits at a rate of one times the capacity. So this was Shannon's theorem, and it gave, as we said, a non-constructive proof that the blue region was attainable.

But the question is: how can you achieve something that's close to the Shannon limit? Up to a certain point in time, the mid-90s, this was a diagram of what was achievable in terms of codes, and it's a diagram we'll see several times in a minute. Here on the x axis we have the signal-to-noise ratio, measured in dB, and on the y axis we have the log of the bit error probability. What you see here are codes that were used. First, up here, we see the uncoded version, and you can see that the uncoded version is not very good; it has very high bit error rates, and higher here is worse. And of course, as the signal-to-noise ratio decreases toward the left, the error rate grows. What you see in these two other lines are the curves achieved by two NASA space missions, Voyager and Cassini, and you can see that there was a good improvement between Voyager and Cassini, which is 1977 to 2004.

Â 7:59

And then in May '93, a revolution happened in coding theory. This was a paper by Berrou et al., and you can read the title of the paper; it was a shocking title. It says "Near Shannon limit error-correcting codes," and they called these turbo codes. Initially, when they tried to submit this paper, people didn't believe that this was possible, because it was so much better than any of the previous codes that had been proposed, and people said no, this can't possibly be the case. Then it was checked, and it turned out that, in fact, their code worked. But nobody really understood why.

Okay, so now you might wonder why the heck I am telling you all this; what does it have to do with probabilistic graphical models? First, why was this so surprising? If you look at the turbo codes in terms of the diagram that I showed you a minute ago, you can see that over here we have, again, the uncoded version; this is the same kind of diagram, signal-to-noise ratio on the x axis, log bit error probability on the y axis. Here, once again, we have Voyager and Cassini. And now look at the turbo codes.

Â 9:25

The turbo codes are way to the left of any of these previous codes. And again, remember, left is better: left means that you are achieving the same bit error probability at a lower signal-to-noise ratio.

So what is the magic of turbo codes? This is where we bring ourselves back to probabilistic graphical models, because you might be sitting there thinking, well, fine, coding theory is great, but this isn't a coding theory class, so what does this have to do with probabilistic graphical models?

So to understand that, let's look at what turbo codes do. Turbo codes take the same set of bits that we want to transmit, U1 up to Uk, and pass them through two encoders, encoder one and encoder two. Then those bits are passed through our noisy channel over here, so that what we get at the end is noisy versions of those two sets of bits. And what we'd like to do is really probabilistic inference: decoding is a probabilistic inference problem when you think about it, because what it's trying to do is compute a probability distribution over the message bits given my noisy evidence Y1 and Y2, which are the noisy bits I received on the channel.
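Written out schematically, with enc1 and enc2 standing for the two (deterministic) encoders, the inference problem just described is:

```latex
P(u_1, \ldots, u_k \mid y^{(1)}, y^{(2)})
\;\propto\;
P(u)\, P\!\left(y^{(1)} \mid \mathrm{enc}_1(u)\right)\, P\!\left(y^{(2)} \mid \mathrm{enc}_2(u)\right).
```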

But instead of trying to solve the probabilistic inference problem exactly, what Berrou et al. proposed in this turbo-coding paper is a sort of iterative approach, where we have a decoder that matches each encoder. So decoder one matches encoder one, and decoder one takes the noisy bits that come from encoder one after they pass through the noisy channel. Decoder two gets a separate set of bits: the bits that were passed through encoder two and then the noisy channel. And what happens, roughly speaking, in turbo codes is that the two decoders start to communicate with each other. Decoder one looks at its bits, and it doesn't exactly know what the correct bits are, but it computes a posterior over them, and it passes that posterior to decoder two, which now uses it as a prior over the values of the target bits U, combines it with the evidence from its own noisy bits Y, and computes a more informed posterior over the U's, which it then sends back to decoder one. This process continues to iterate until convergence: each decoder computes a posterior and passes it to the other decoder, which uses it as a prior.

If you look at the implications of this iterative algorithm, you can see here the different curves for turbo decoding. What you see here, again, is the signal-to-noise ratio on the x axis and the log of the bit error probability on the y axis, and you can see that after the first iteration of turbo decoding, the curve is not that great. But then, as the iterations proceed, the bit error rate goes down; for example, for a fixed signal-to-noise ratio, say here, the bit error rate goes down from here to here over the iterations. And what you see as the dark line at the bottom is the optimal bit decision if we were to do exact probabilistic inference. So the surprising part was how close this totally heuristic and, as it seemed at the time, unmotivated algorithm comes to the optimal bit decision.

The revolution that happened in this field came when other people, people like McEliece, MacKay, and Frey, realized that what was really going on here is that this algorithm is running a variant of loopy belief propagation: each of these decoders is actually running exact inference over a subnetwork that is tractable, and then passing messages; the beliefs that are being passed are what we have as the delta i,j's between, in this case, these two components. Really, what was going on here were two decoders communicating with each other via a process that is exactly identical to loopy belief

14:20

propagation. And this caused a revolution, actually, in two different fields. It caused a revolution in the field of probabilistic graphical models because, up until that point, people had known that you could pass loopy messages over a graph; in fact, this was an observation made as early as 1988 by Judea Pearl, when he wrote the seminal book that introduced Bayesian networks to the field of artificial intelligence. He wrote at the time that, sure, you can pass these messages over a loopy graph, but you have no guarantees of convergence and no guarantees that it gets the right answers, and so it's perhaps not a good idea. And from 1988 until the mid-1990s, the algorithm had been largely abandoned, because people said that it had these limitations, so it wasn't a good idea. But when people realized that, in the context of coding theory, this algorithm, which is obviously heuristic and has all these potential limitations, nevertheless succeeds in achieving performance like this, people said, wait a minute: if it works so well for decoding, maybe it's a good algorithm to think about after all. And from the mid-1990s until today, this set off a huge amount of work in the field of probabilistic graphical models on understanding when and why loopy belief propagation works, as well as on constructing improved versions of loopy belief propagation, some of which we briefly mentioned in this course.
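As a minimal, self-contained sketch of the algorithm in question, here is sum-product loopy belief propagation on the smallest loopy graph, a three-cycle of binary variables; all potential values here are arbitrary illustrative choices, and exact marginals from brute-force enumeration are included for comparison:

```python
import itertools

# Smallest loopy graph: a 3-cycle of binary variables, one pairwise
# potential per edge plus a unary potential on variable 0.
EDGES = [(0, 1), (1, 2), (2, 0)]
UNARY = {0: [2.0, 1.0]}  # variable 0 mildly prefers state 0

def phi(a, b):
    return 1.8 if a == b else 1.0  # mild preference for agreement

def unary(v, x):
    return UNARY.get(v, [1.0, 1.0])[x]

def exact_marginals():
    """Marginals by brute-force enumeration over all 2^3 joint states."""
    marg = [[0.0, 0.0] for _ in range(3)]
    for x in itertools.product((0, 1), repeat=3):
        w = unary(0, x[0])
        for i, j in EDGES:
            w *= phi(x[i], x[j])
        for v in range(3):
            marg[v][x[v]] += w
    return [[a / (a + b), b / (a + b)] for a, b in marg]

def loopy_bp(iters=50):
    """Synchronous sum-product message passing around the cycle."""
    directed = EDGES + [(j, i) for i, j in EDGES]
    msg = {e: [1.0, 1.0] for e in directed}
    for _ in range(iters):
        new = {}
        for i, j in directed:
            out = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    p = unary(i, xi) * phi(xi, xj)
                    for k, i2 in directed:
                        if i2 == i and k != j:  # messages into i, excluding j
                            p *= msg[(k, i)][xi]
                    s += p
                out.append(s)
            z = out[0] + out[1]
            new[(i, j)] = [out[0] / z, out[1] / z]
        msg = new
    beliefs = []
    for v in range(3):
        b = [unary(v, x) for x in (0, 1)]
        for k, v2 in directed:
            if v2 == v:
                b = [b[x] * msg[(k, v)][x] for x in (0, 1)]
        z = b[0] + b[1]
        beliefs.append([b[0] / z, b[1] / z])
    return beliefs
```

On this small example the loopy beliefs converge and land close to the exact marginals, though slightly overconfident, which is the typical behavior on a single cycle.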

So that's the first revolution. The second revolution occurred in the field of coding, where people said: well, if all that's going on is that we're running loopy belief propagation over a graph that computes the posterior over the U's, the message bits, given the noisy bits, then we can construct lots of other codes, use those codes, and run loopy belief propagation on the resulting graph. And sure enough, some of the most successful codes in use today are, in fact, not necessarily the turbo codes that Berrou et al. originally came up with, but a class of codes called low-density parity-check codes, which are a really elegant match to loopy belief propagation. Now, it turns out low-density parity-check codes were actually invented by Robert Gallager in 1962, and from 1962 up until the late '90s they had been totally abandoned, because exact probabilistic inference on them was computationally intractable. But the realization that one could run loopy belief propagation as a way of decoding these codes reintroduced them into the field, and now they're wildly successful, and there are all sorts of variations on them and extensions and so on.

So what are these parity-check codes? In a parity-check code, we are actually sending two types of bits. These are the original bits U1 up to U4. The first set of sent bits are just the original message bits; I denote them separately here even though X1 is actually identical to U1, just to make clear that X1 is the sent bit. So X1 is just a copy of U1, and X4 is a copy of U4. Once again, the Ys are the noisy bits that are received after transmission. On the other side, the Xs that you see here on the bottom are parity bits. Parity bits are like checksums on a file: each one records the parity of a different subset of the bits. So X5 here is the parity, that is, whether the sum is even or odd, zero if even and one if odd, of U1, U2, and U3. X6 tests the parity of a different set of bits, U1, U2, and U4, and X7 does U2, U3, and U4. So here we're sending four original bits and three parity bits; in total, this is a code whose rate is four over seven, because there are seven bits sent, of which four are message bits.
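A minimal sketch of this (7,4) encoder, with the parity bits computed exactly as described above (the function names are mine):

```python
def encode_74(u):
    """Encode 4 message bits into 7 sent bits: x1..x4 copy the
    message, and x5..x7 are parities (XORs) of the stated subsets."""
    u1, u2, u3, u4 = u
    x5 = u1 ^ u2 ^ u3   # parity of U1, U2, U3
    x6 = u1 ^ u2 ^ u4   # parity of U1, U2, U4
    x7 = u2 ^ u3 ^ u4   # parity of U2, U3, U4
    return [u1, u2, u3, u4, x5, x6, x7]

def parity_checks(x):
    """Each constraint XORs to zero on an uncorrupted codeword."""
    u1, u2, u3, u4, x5, x6, x7 = x
    return [u1 ^ u2 ^ u3 ^ x5,
            u1 ^ u2 ^ u4 ^ x6,
            u2 ^ u3 ^ u4 ^ x7]
```

A received word with any single bit flipped violates at least one check, which is exactly the structure a belief-propagation decoder exploits as factors in its graph.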

Â 19:50

And then these are the large factors, and some of you will recognize this as a Bethe cluster graph, a Bethe-structured cluster graph.

Here is just a list of applications. This is probably one of the most ubiquitous applications of probabilistic graphical models in use today, simply because it exists in so many instantiations: in many set-top boxes, in digital video broadcasting, in communications, mobile television, mobile telephones, satellites, NASA missions, wireless networks, and so on. It's one of the really powerful applications of loopy belief propagation and of probabilistic graphical models, and it arose from this unexpected confluence of the two disciplines.

So, really, you could say that loopy belief propagation, which was originally proposed by Judea Pearl in 1988, was rediscovered by practitioners of coding theory; these were not theoreticians of coding, these were people who actually sat down and engineered codes. Understanding those turbo codes in terms of loopy belief propagation, which was done by a bunch of more machine-learning and information-theory type people, led to the development of many, many new and better codes, and the current codes are actually coming gradually closer and closer to the Shannon limit in a constructive, as opposed to a purely theoretical, way. And at the same time, the resurgence of interest in belief propagation led to much of the work that we see today on approximate inference in graphical models, both on the theoretical-understanding side and in new algorithms that people are coming up with, derived from that new understanding.
