0:00

Now that we have the n x n blocks for each one of the planes, Y, Cb, and Cr, we are ready to go into the forward transform. So the first thing we need to understand is why we are going to do a forward transform.

Let us first explain how we measure the error that we create in images when we're doing lossy compression.

And the basic idea is that the error is measured with what's called the mean squared error, MSE. It's computed as follows:

MSE = (1/N) * sum over all pixels of [f_rec(x, y) - f(x, y)]^2

where N is the number of pixels; the 1/N is just a normalization. We sum over all the pixels, and for each one we compute the square of the error: the reconstructed image, let's call it f_rec, minus the original image, squared. We go over every single pixel.

So let's say the original f was 100 and we reconstructed 98; the difference is 2, we square it, and it's 4. We do that for every single pixel, sum, and normalize by the number of pixels. And most of the time, because we have a square here, we take a square root at the end; then it's called the root mean squared error, RMSE.
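The error measure just described can be sketched in a few lines of NumPy (an illustrative snippet, not part of the lecture; the 100 vs. 98 example is the one above):

```python
import numpy as np

def rmse(original, reconstructed):
    """Root mean squared error between two images of the same shape."""
    diff = original.astype(float) - reconstructed.astype(float)
    mse = np.mean(diff ** 2)      # (1/N) * sum of squared pixel errors
    return np.sqrt(mse)

# The lecture's example: original pixel 100, reconstructed 98 -> squared error 4.
f = np.array([[100.0]])
f_rec = np.array([[98.0]])
print(rmse(f, f_rec))  # 2.0
```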

That's how we measure the error. Now, why are we going to do a transform?

Remember, we are talking about n x n blocks now. In general, we're talking 8 x 8, but let's keep it as n x n. So we have n x n pixels, because we're working on each one of the blocks independently.

Imagine that, although these are n x n, say 64 pixels, I want to transmit only one. I'm going to give you the option to transmit only one of them. So if we take just a regular image and we transmit only one, we are going to get a tremendous error.

But we might be able to do a transform into a different 8 x 8; the transform has to be reversible, I should be able to go back.

And we do this transform, we have a new 8 x 8, and maybe in that transform domain,

2:48

we can take only the first one and still do a reasonable job.

It turns out that there is a transform that allows us to do that, and it's called the Karhunen-Loeve transform. That transform has something very particular. You go into that transform domain; I'm going to write down in a second what I mean by a transform, but think of it as nothing else than a multiplication by a matrix, nothing else than that. So I have my 8 x 8.

I do something. I go into a new 8 x 8 block or 8 x 8

image. And then, if what I did is a

Karhunen-Loeve transform, that's also sometimes written as KLT, Karhunen-Loeve

transform. If I'm in that domain,

if I take the first of those, just the first out of the 64 and I compute the

mean square error, remember, I basically did a reconstruction with only one.

So I take my image, I do a Karhunen-Loeve transform.

I remove 63 of the coefficients. I come back and I compute the mean square error.

I get the smallest possible mean squared error I could have gotten with basically only one coefficient, just transmitting one.

If I want three, I can do actually exactly the same.

And the interesting part with the Karhunen-Loeve transform is not only that with one coefficient I get the best error, but that that one is actually the first one. So you have 64, you go to a new 64, you pick the first one, and that's the best you can do if you're only allowed to use one coefficient. If you're allowed to use three, you pick the first three.

So it's a great transform, but it has only one major problem.

The problem of this transform is that it's image dependent.

I don't know how to do it in advance. I told you it's going to be a matrix multiplication, but I don't know the coefficients of the matrix until I see the image. You give me the image, or you give me some

information about the image, then I will be able to compute the

coefficients of that matrix. And that's not very practical,

it's very slow, and also, I cannot do things on the fly

as things come. So to replace that we count what's called

a discrete cosine transform, which is what is actually used in JPEG.
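To make that image dependence concrete, here is a rough sketch of how a KLT basis would be computed: from the eigenvectors of the data's covariance matrix. The blocks below are synthetic, made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 1000 "blocks" flattened to 64-vectors; the cumulative sum
# creates correlation between neighboring entries, a bit like image pixels.
blocks = rng.standard_normal((1000, 64)).cumsum(axis=1)

# The KLT basis is the set of eigenvectors of the data's covariance matrix,
# so it cannot be written down before seeing the image (or its statistics).
cov = np.cov(blocks, rowvar=False)             # 64 x 64 covariance
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending order
basis = eigvecs[:, np.argsort(eigvals)[::-1]]  # sort by decreasing variance

# In the transform domain, the first coefficient carries the most energy,
# the second the next most, and so on.
coeffs = blocks @ basis
energy = coeffs.var(axis=0)
print(energy[:4] / energy.sum())  # fractions decrease from the first coefficient
```

Different data would give a different `basis`, which is exactly the practical problem being described.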

So let's just write down the equations for the discrete cosine transform, and we are going to understand a bit more of what a transform is.

But let's keep in mind that what we actually want is a Karhunen-Loeve transform. What the Karhunen-Loeve is doing is decorrelating the image: it's putting a lot of the information in the first coefficient, and a bit more of the information in the second, which is independent of the first, and so on, in such a way that if we compute the mean squared error, we are optimal. We're going to be suboptimal with the discrete cosine transform, but it's going to have a fixed matrix, fixed coefficients, so it's universal; we're going to be able to use it for every image without having to recompute it. So what is a transform?

The idea is very simple. We have an image, let's call it f(x, y), and with the transform we're going to go into a new domain, T(u, v).

Okay? And those of you who are familiar with Fourier: the Fourier transform is one of these transforms. So we're going to sum over all the pixels in our image; remember, we basically have n in each direction, so both sums go from 0 to n - 1:

T(u, v) = sum from x = 0 to n - 1, sum from y = 0 to n - 1, of f(x, y) r(x, y, u, v)

We take our image f and multiply it by a transform coefficient.

I still have to give you the formulas for these numbers for the DCT, for the discrete cosine transform. For the Karhunen-Loeve, they would be completely image dependent: they would depend not only on the positions, but on the actual gray values, on the actual statistics of your image.

So this is a transform. I sum over x and y, which means that x and y, let's say, disappear after my summation, and I get a new image, which is again 8 x 8, or n x n. It just lives in a transform domain, and you can write this in matrix notation.

Now, of course, a transform is only useful if you actually can invert it.

It wouldn't be very nice if you gave me an image and I went into a transform domain where visually I don't understand what's going

on there. So there's also an inverse formula, so I can get back f(x, y):

f(x, y) = sum from u = 0 to n - 1, sum from v = 0 to n - 1, of T(u, v) s(x, y, u, v)

where s, the inverse of the transform, is sometimes equal to r and sometimes not.

So, I take an image, I do this operation, I get a new image in a transform domain,

and then I go and invert it. That's a transform.
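For a separable transform, the double sums just written collapse into matrix products, which is one way to see the matrix notation. A minimal sketch, assuming an orthogonal matrix A as my stand-in for any such transform:

```python
import numpy as np

def forward(f, A):
    """Separable 2-D transform in matrix notation: T = A f A^T."""
    return A @ f @ A.T

def inverse(T, A):
    """For an orthogonal A (A A^T = I), the inverse is f = A^T T A."""
    return A.T @ T @ A

n = 8
# Any orthogonal matrix gives an invertible transform; a random one for illustration.
A, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))
f = np.random.default_rng(2).standard_normal((n, n))
T = forward(f, A)
print(np.allclose(inverse(T, A), f))   # True: we can always go back
```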

So, to specify the transform, I basically have to tell you: what are these guys, what's r and what's s? And once again, it's not hard, and you can do it as optional homework, to write these in matrix notation. So I need to specify these two functions, r and s. As I said, ideally, to minimize the mean squared error, to be optimal, we would like them to be the Karhunen-Loeve transform, but we can't, so we are going to use the DCT, which I'm going to write next.

So what's the DCT? In the DCT, r(x, y, u, v) is actually equal to s(x, y, u, v), and both are equal to

alpha(u) alpha(v) cos[(2x + 1) u pi / 2n] cos[(2y + 1) v pi / 2n]

where alpha is a normalization coefficient: alpha(0) = sqrt(1/n), and alpha(u) = sqrt(2/n) for u > 0.
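This formula can be implemented directly; thanks to separability, the double sum becomes two matrix products. An illustrative NumPy sketch, assuming the orthonormal alpha normalization written above:

```python
import numpy as np

def dct_basis_matrix(n):
    """C[u, x] = alpha(u) * cos((2x + 1) u pi / (2n)); rows are the 1-D DCT basis."""
    u = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    alpha = np.full(n, np.sqrt(2.0 / n))
    alpha[0] = np.sqrt(1.0 / n)
    return alpha[:, None] * np.cos((2 * x + 1) * u * np.pi / (2 * n))

def dct2(f):
    """Forward 2-D DCT: T = C f C^T, the same as the double sum in the formula."""
    C = dct_basis_matrix(f.shape[0])
    return C @ f @ C.T

def idct2(T):
    """Inverse 2-D DCT: C is orthogonal, so f = C^T T C."""
    C = dct_basis_matrix(T.shape[0])
    return C.T @ T @ C

f = np.arange(16, dtype=float).reshape(4, 4)
T = dct2(f)
print(T[0, 0])                    # 30.0: the DC term, (1/n) * sum of all pixels
print(np.allclose(idct2(T), f))   # True: the transform is reversible
```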

12:13

So here you see, in this case, just to make our illustration easy, we use n = 4. These are each one of the bases. If I go back to the formulas and I fix u = 0 and v = 0, I get a cosine which is a function of only x and y. The formulas were functions of u, v, x and y, but now I fix u and v, I make both of them 0, and I get a formula in x and y. I actually get a 4 x 4 image, which would be this one. If I fix v = 1 and u = 0, I get this one, and so on.

So basically this r and s are nothing else than what's called basis functions.

And what we're trying to do is decompose our small n x n image into a linear combination of these basis functions.
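These n = 4 basis images can be generated straight from the cosine formula; a small sketch (the outer products below are the 16 pictures being described):

```python
import numpy as np

n = 4
k = np.arange(n)
alpha = np.full(n, np.sqrt(2.0 / n))
alpha[0] = np.sqrt(1.0 / n)
# 1-D DCT basis: row u is alpha(u) * cos((2x + 1) u pi / (2n))
C = alpha[:, None] * np.cos((2 * k[None, :] + 1) * k[:, None] * np.pi / (2 * n))

def basis_image(u, v):
    """The 4 x 4 basis image for fixed (u, v): an outer product of two cosine rows."""
    return np.outer(C[u], C[v])

# (u, v) = (0, 0) is the flat (constant) basis image.
print(np.round(basis_image(0, 0), 3))   # every entry is 0.25

# The 16 basis images are orthonormal: each block is a unique combination of them.
print(np.round(np.sum(basis_image(1, 2) ** 2), 10))                 # 1.0
print(np.round(np.sum(basis_image(1, 2) * basis_image(2, 1)), 10))  # 0.0
```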

So if your image is flat, your T(u, v) is going to be nonzero only for this first one. If your image has a lot of variations, you're going to have coefficients also for T(3, 3). So each one of the T(u, v) tells you how much of this type of component you have.

Okay? So basically you have represented your n

x n image into a linear combination of this type of images.

The important part is that these images are constant. Again, in the Karhunen-Loeve they won't be, and that makes it very difficult to have an efficient hardware implementation that can do it in real time. Here, this basis is fixed. You give me an image, I'm going to represent it as a linear combination of this basis, and T(u, v) tells me how much of each one of these I have in my tiny n x n image. So, that's a transform.

Basically, it's nothing else than going into coefficients that tell us how much we have of each one of these components. So I decompose my image into

components which are kind of frequency components.

You see that there is more variation as I increase v and as I increase u in the

vertical or horizontal direction. There's one thing I haven't told you.

I said, okay, I can't do the Karhunen-Loeve transform, I cannot compute bases for a Karhunen-Loeve because it's too expensive, so I go and do a discrete cosine transform. But why a discrete cosine transform? Why not a Fourier transform, or a Hadamard transform, or any other transform? There are many transforms out there. So there are a couple of reasons why we

use the DCT. One of them is that, although I want to do a Karhunen-Loeve transform, I can do a DCT instead.

It turns out that for particular cases the DCT is actually exactly equal to the Karhunen-Loeve transform. Those particular cases are when the image is what's called Markovian. So a pixel depends basically on the pixel

next to it in a very particular form, and again, the pixel next to the pixel

next to it. So those are images that are called

Markovian. So if you assume something about your

image, you can show that the Karhunen-Loeve transform ends up being

the DCT. So, at the end, it is a constant basis like we see here. That's one reason to basically decide to use the DCT. The other reason we use the DCT is

because of the following. Remember that I'm working with the whole image divided into n x n blocks. Okay? Someone may wonder, why not use, for example, the Fourier transform?

If we go back into the theory of Fourier transform, when you do a discrete Fourier

transform, you're making an underlying assumption of periodicity.

So you are basically assuming that your image repeats itself as we can see here.

So this comes back here and this comes back here.

Okay? So there is a repetition of your image.

That's the underlying mathematical assumption to be able to formally do a

Fourier transform of this 8 x 8 block. And here, I'm drawing just one line of

the 8 x 8 block, but the same assumption basically goes in the other direction and

then in two dimensions. Now this basically means that when

your block ends, you're expecting the pixel in the next block to basically be

identical to the pixel that was here. Okay? So you expect this block to basically repeat here and repeat here. That's your underlying assumption.

It's not very probable that the pixel value at this end is the same as the pixel value at the other end; they may not even be close in an 8 x 8 block.

On the other hand, DCT assumes a different type of periodicity.

Basically DCT assumes that at the boundary, you have mirror symmetry.

So your image basically repeats, but it kind of folds back on itself. So the assumption says, okay, I'm assuming that the pixel here is similar to the pixel right next to it. Not that the pixel is identical to one that comes 8 pixels later, just to the pixel right next to it, and that's a much more reasonable assumption.
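The two periodicity assumptions can be made concrete with a tiny experiment on a ramp signal, one row of a block whose two ends differ a lot (the ramp is my illustrative choice, not from the lecture):

```python
import numpy as np

n = 8
ramp = np.arange(n, dtype=float)   # one row of a block; its two ends differ a lot

# The DFT implicitly repeats the block: ... 5 6 7 | 0 1 2 ... -> a jump at the seam.
print(np.concatenate([ramp, ramp])[7:9])        # [7. 0.]  artificial discontinuity
# The DCT implicitly mirrors the block: ... 5 6 7 | 7 6 5 ... -> no jump at the seam.
print(np.concatenate([ramp, ramp[::-1]])[7:9])  # [7. 7.]  smooth boundary

# Consequence: the DCT packs such a block into far fewer large coefficients.
u = np.arange(n)
alpha = np.full(n, np.sqrt(2.0 / n)); alpha[0] = np.sqrt(1.0 / n)
C = alpha[:, None] * np.cos((2 * u[None, :] + 1) * u[:, None] * np.pi / (2 * n))
dct_energy = (C @ ramp) ** 2
fft_energy = np.abs(np.fft.fft(ramp) / np.sqrt(n)) ** 2
print(dct_energy[:2].sum() / dct_energy.sum())  # ~0.996 of the energy in 2 coeffs
print(fft_energy[:2].sum() / fft_energy.sum())  # ~0.80 for the DFT
```

The fake jump at the DFT seam shows up as energy spread across many frequencies; the mirrored extension does not create one.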

So, those are two of the main reasons why we use a DCT.

One is because for certain types of images the DCT is the Karhunen-Loeve.

The other is because the type of periodicity assumption is much more applicable to images once we have divided them into these n x n or 8 x 8 blocks.

So now we have the DCT. We started with an image. We divided it into n x n blocks. We do a discrete cosine transform, where every coefficient tells us how much of each basis function we are using. I keep talking about 8 x 8. How did we come to 8 x 8? Actually, there were a lot of studies before JPEG was defined to get to 8 x 8, but let me just illustrate what happens here.

This is an image where, basically, we took the whole image; in this case, we don't divide. This is already a block of that image, Lena, that we talked about before.

this entire block and we took 25% of the coefficient.

So we took basically only 1/4 of the coefficients, and we made everybody else

0, and then we did the inverse, and that's the result that we got.

We have introduced error, and remember, what the DCT is going to try to do is bring us into a domain where introducing error, basically killing some of the coefficients by making them 0 or making errors on them, won't be as noticeable. But in this case, we did it very simply: we just took 25% of the DCT coefficients. Which ones will become very clear soon, but just imagine that we took the 1/4 largest coefficients.

coefficients. Now, first we did it for the whole image.

Here, we actually took only 2 x 2 blocks. So we went here and we divided in 2 x 2

blocks and we need a DCT of every 2 x 2 and we took 25%..

So be is, basically ended with only one and we came back.

And you see this very, very annoying blocking artifact.

Then we did the same for 4 x 4 and we do the same for 8 x 8,

And we see that basically its gets better and better.
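This experiment can be sketched as follows, with a small smooth synthetic image standing in for the Lena block (the test image is invented; the keep-the-largest-25% rule follows the description above):

```python
import numpy as np

def dct_matrix(n):
    u = np.arange(n)
    a = np.full(n, np.sqrt(2.0 / n)); a[0] = np.sqrt(1.0 / n)
    return a[:, None] * np.cos((2 * u[None, :] + 1) * u[:, None] * np.pi / (2 * n))

def compress(img, b, keep=0.25):
    """Blockwise DCT; in each b x b block keep the `keep` fraction of
    largest-magnitude coefficients, zero the rest, and invert."""
    C = dct_matrix(b)
    out = np.empty_like(img, dtype=float)
    k = max(1, int(round(keep * b * b)))
    for i in range(0, img.shape[0], b):
        for j in range(0, img.shape[1], b):
            T = C @ img[i:i+b, j:j+b] @ C.T
            cutoff = np.sort(np.abs(T).ravel())[-k]   # k-th largest magnitude
            T[np.abs(T) < cutoff] = 0.0
            out[i:i+b, j:j+b] = C.T @ T @ C
    return out

# A smooth synthetic 32 x 32 test image (a stand-in for the Lena block).
y, x = np.mgrid[0:32, 0:32]
img = 128 + 50 * np.sin(x / 5.0) * np.cos(y / 7.0)
for b in (2, 4, 8):
    err = np.sqrt(np.mean((compress(img, b) - img) ** 2))
    print(b, round(err, 2))   # for this smooth image, RMSE shrinks as b grows
```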

So you might wonder, okay, 2 x 2 is too little, but why stop at 8 x 8? Why not go to 16 x 16 or 32 x 32? And, you know, why not just do the DCT of the entire image? There are a couple of reasons for that. One is computational: it's basically cheaper to do multiple small DCTs than one large DCT. So if you have, let's say, a 16 x 16 image, I'd rather do four 8 x 8 DCTs than one 16 x 16 DCT. That's one of the reasons.

Another reason is this magic approximation of the DCT to the Karhunen-Loeve. I said that happens under certain conditions, a Markovian condition. If you're in a small region, an 8 x 8 block, that Markovian condition will probably apply. But if you look at a very large image, the whole image won't satisfy the Markovian condition. So basically, in small regions you expect that assumption to hold, but it doesn't hold in large regions, and it's not hard to see that. If you're just, let's say, looking at me right now, you see that there's a lot of correlation here, more or less the same gray values. But if you start moving away and you ask, is there a correlation between the pixel value here and the pixel value over there, very far away? There's actually no correlation; I could change my shirt, and without prior information you couldn't predict that. So there is no correlation when you get far away in your image; pixels that are far apart basically start being decorrelated. So you need to get to a compromise. On

one side, you want very small blocks because you want to be computationally very efficient. On the other hand, you don't want blocks that are too small, because you get into this problem of not having enough information in the block, enough correlation for the transform to be able to decorrelate. And if you go to very large blocks, you start paying in computation, and you start paying because you have violated the assumptions for which the DCT is good. So 8 x 8, as I said, was significantly studied; it turned out to be a very, very good compromise, and that's why it is used today in JPEG. So, to conclude this part on the DCT, I

just want to show you once again this image, Lena. These are just two different cases, but don't worry about that for the moment.

Here, basically, we took 8 x 8 blocks, we did a DCT, we kept 12.5% of the coefficients, and then we came back. No quantization yet, which is going to be our next step; just a DCT and the inverse DCT.

And it looks pretty good; at least at first glance at the image, we have actually achieved compression, because we only kept 12.5% of the coefficients and we got an image which looks almost identical to the original one, and that's because we did it in the DCT domain. If we were to actually throw away a lot of pixels here and keep only 12.5% of them, we would not be able to get this quality of image.

This is just the error between the original and the reconstructed.

And as you see, it's a relatively small error.

So this is pretty good compression by nothing else than doing a DCT,

basically removing a lot of coefficients and coming back.

So now, we have the front end and we have the back end.

What's missing is the middle. I'm in the DCT domain, can I achieve even

more compression by doing quantization in a smart fashion?

And that's going to be the subject of our next video clip.