In this course, you will learn to design the computer architecture of complex modern microprocessors. All the features of this course are available for free. It does not offer a certificate upon completion.




From the lesson

Vector Processors and GPUs

This lecture covers the vector processor and optimizations for vector processors.

#### David Wentzlaff

Associate Professor

Today, we move on to our new topic: vector computers.

So, a little bit of introduction on vectors.

A vector machine is a vector processor. Broadly, it's a way to get at

data-level parallelism. Many times for, let's say, array

operations, you're going to want to take one whole array and add it to another

whole array. And let's say, these arrays are large.

Does it really make sense to have a processor sit in a tight loop doing load,

add, store, load, add, store, load, add, store

in a loop? And the insight that comes out is that if you have computations that

work on vectors or matrices or even multi-dimensional matrices,

you can think about building an architecture where you don't have to have

as much instruction fetch, instruction decode bandwidth.

And you don't have to sit there and fetch new instructions and continually operate

on those new instructions. You could just have an instruction which

encodes some large amount of computation, because it's all the same. That's the

insight. Also, in today's lecture, we're going to

be talking about single-instruction, multiple-data architectures.

This is kind of a degenerate case of vector architectures.

And a good example of this is something like the multimedia extensions, or MMX, in the

Intel processors, or AltiVec in the PowerPC architecture.

The newer thing that Intel has added now, they call SSE.

Streaming something Extensions. I actually don't know what the second S

stands for. And then they also now have something they

call AVX, which is even wider. They

basically continually add in more instructions to make the short-vector

nature better. And then, finally today, if we have time,

we'll be talking about graphics processing units.

So, I have some examples here. This is the ATI FirePro 3DV7800 and then

we have the Nvidia equivalent, the Nvidia competitor, which is the Nvidia Tesla. I

think this is the C075. Both of these are very fast

processors. And what is interesting is, these started out as graphics

processors. So, they started out to play video games,

effectively, or to do some sort of rendering of three-dimensional data.

So, you're taking some data, you operate on it, and there's massive

parallelism there: lots of different triangles in a

three-dimensional image, for instance, in three-dimensional rendering.

And people had this insight that the same processing architecture that is good

at rendering triangles might be good at doing, let's say, dense matrix operations

also. And we've seen this outgrowth and we've seen a whole programming model come

up around this, and this is very recent. To some extent,

these architectures don't come from the same lineage as normal

processors. They really come from

fixed-function hardware that was there to render video games and

three-dimensional sorts of scenes. So, their architectures look quite a bit

different and the naming is very different, so if you go pick up the

manual that tells you how to program one of these things and you come from a computer

architecture background, you're just not going to understand any of the words.

Your book, actually, the Hennessy and Patterson book, has a very good table which

maps those names onto more familiar terms, and that makes life a lot easier. Okay.

So, let's get started looking at vector processors, and let's

look at the programming model first before we look at the architecture.

So, this would be the software model, not what the hardware looks like.

So, to start off here, a couple of things to note. In the

traditional vector architecture, you're going to have some scalar registers.

And these are registers like in a normal microprocessor.

They just hold one value. They're maybe, let's say, 32 bits or 64

bits in width. And then, you have a second register file,

which holds vectors.

And when you go to access one of these vectors, it's the same thing as a register

file here. If you go to access, let's say, vector

register three, or something like that, that doesn't denote one

value. Instead, it denotes many values at one

time. And we have a fixed width

drawn here, but typically these things have very long widths.

So, for instance, something like the Cray-1 processor had a

maximum vector length of 64 elements, where each element was 64 bits.

So, it's a lot of data that you're moving around at one time with one

operation. And an important piece of

architectural, or at least programming-model, hardware here is the vector-length

register. The vector-length register says how many

of these elements are actually populated. And we'll see why that's important.

But for right now, let's just think of having the vector-length register be

equivalent to the maximum number of elements in the vector.

So, think of it as having 64 elements, and the vector-length register just says

you're always operating on all 64 entries of data in parallel.
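The programming-model state described so far can be written down as a plain C struct. This is only a sketch under stated assumptions: the names `VectorState`, `set_vlr`, and `vector_register_bits` are hypothetical, and the register counts are picked for illustration, since the lecture does not give them. Only the maximum vector length of 64 elements of 64 bits comes from the Cray-1 example above.

```c
#include <assert.h>
#include <stdint.h>

enum {
    NUM_SCALAR_REGS = 32, /* assumed count, not stated in the lecture */
    NUM_VECTOR_REGS = 8,  /* assumed count, not stated in the lecture */
    MVL = 64              /* maximum vector length, as in the Cray-1 example */
};

/* Software-visible state only; this says nothing about how many ALUs exist. */
typedef struct {
    uint64_t r[NUM_SCALAR_REGS];      /* scalar registers: one value each */
    uint64_t v[NUM_VECTOR_REGS][MVL]; /* vector registers: MVL values each */
    unsigned vlr;                     /* vector-length register */
} VectorState;

/* Set the vector-length register; it may never exceed MVL. */
static void set_vlr(VectorState *s, unsigned n) {
    assert(n <= MVL);
    s->vlr = n;
}

/* Number of data bits named by a single vector register. */
static unsigned vector_register_bits(void) {
    return MVL * 64u;
}
```

With these numbers, one vector register name stands for 4096 bits of data, which is what "a lot of data with one operation" amounts to here.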

Now, if we go look at the programming model connected to this, we need to add some

extra instructions. In our scalar processors, all the

processors we've been talking about up to this point,

an instruction operates on one register with one other register.

And that still exists in this model, but it operates only on these scalar

registers. Now, the reason why we still have the

scalar registers around in this model is that things like branch

conditions and address computation are not vectorizable.

You know, you don't have 64 addresses.

Maybe you do in certain cases. But typically, you're not going to have

that lying around. You're just going to have an address, and

you need to load from that address. And for branches, you need to do the branching

based on one value, and not all 64 values.

But, we now add some special extra instructions.

So, if you go look in your book, they develop this architecture they call VMIPS

or vector MIPS. And they add some extra instructions here

which look very similar to normal MIPS but all of a sudden they put some Vs at the

end here. So, VV means it operates on a

vector with another vector. They also developed some instructions

which have a V and an S, where the S is the scalar, so you can do a vector plus a

scalar. That would be something along the lines of, let's say, an

add vector-scalar, where you're adding one vector with a scalar register, where the

scalar register, let's say, is loaded with one.

You could do this add and it'll increment every element of the vector by one.
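As a rough illustration of what the vector-vector and vector-scalar forms mean, here is a minimal C sketch of their semantics. The function names `add_vv` and `add_vs` are hypothetical stand-ins, not the book's mnemonics, and the loops only describe the result; they say nothing about how many elements the hardware processes per cycle.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64 /* maximum vector length assumed for this sketch */

/* Vector-vector form: vd[i] = va[i] + vb[i] for the first vl elements. */
static void add_vv(double vd[MVL], const double va[MVL],
                   const double vb[MVL], size_t vl) {
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++) {
        vd[i] = va[i] + vb[i];
    }
}

/* Vector-scalar form: vd[i] = va[i] + s for the first vl elements. */
static void add_vs(double vd[MVL], const double va[MVL],
                   double s, size_t vl) {
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++) {
        vd[i] = va[i] + s;
    }
}
```

Calling `add_vs(vd, va, 1.0, MVL)` is the "increment every element by one" case from the lecture.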

You also have loads and stores, which can pull very large chunks of memory in and

put very large chunks back out to the arrays in memory.

But if you look at what's going on in one of these instructions, we're taking one

vector and another vector, putting them into some sort of arithmetic operation, and then

storing the result into another register. This is a register-register vector

architecture. There have been some register-memory and

memory-memory vector architectures out there, where instead of naming

vector registers, you can name places in memory, but the

register-register variants are the most popular,

just like the register-register scalar computer architectures are now the most

popular. One thing I did want to point out here is,

we've said nothing about how many ALUs there are in this architecture.

This is just the abstract programming model.

So, don't get this confused with having one, two, three, four, five, six

functional units or something like that. This is just an abstract model right now;

we have not talked about the hardware. So, this brings up, how do we get data?

And we have an instruction here that we'll call load vector.

Load vector has a destination, which is a vector register, and the source is a scalar register, and you

might have another offset in a register. But let's say there's only one register

in our basic load vector operation here,

and this holds the address that points to the base of the vector in memory.

And when you go to do this load, it's actually going to pull in from memory into

our vector register. You could also start to think about having

interesting offsets or strides here. So, what this picture here is

trying to show is we have a base pointer pointed to by register one, which is a scalar

register, and note it has different naming: these have Vs and these are Rs. And

then we have a stride here, which says where in

memory to take from. So, you can think about having something

where you access multiple locations in memory,

but you want every fifth element or something like that.

So, you could load register two here with five, register one here with the base

address, and then do this load vector instruction,

and it'll take each fifth piece of memory of some data size and load it into the

vector register. And this is our abstract model, but at

the beginning here, let's assume what's called unit stride, which

basically means this here is always one, so it's always getting the next value in a

row. We'll talk in more complicated

cases about having non-unit stride. Okay.
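The base-plus-stride load just described can be sketched in C as follows. The name `load_vector_strided` is hypothetical, and the sketch assumes the stride is counted in elements rather than bytes, which the lecture leaves open ("of some data size").

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64 /* maximum vector length assumed for this sketch */

/*
 * Fill the first vl elements of vector register vd from memory.
 * base   : address of the first element (the scalar base register)
 * stride : distance between consecutive elements, counted in elements
 *          (assumption); stride == 1 is the unit-stride case
 */
static void load_vector_strided(double vd[MVL], const double *base,
                                size_t stride, size_t vl) {
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++) {
        vd[i] = base[i * stride];
    }
}
```

With a stride of five, element `i` of the vector register comes from `base[5 * i]`, which is the "every fifth element" example. With a stride of one, the load simply walks consecutive memory locations.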

So, let's look at what this does to code. Here we have a basic code example. It's

going to multiply, element-wise, the different elements of vectors

A and B here, and deposit the result into vector C.

Now, this is in memory, because this is C code, so these are actually arrays.

Now, obviously this is not a, you know, matrix multiplication here, because matrix math

is much more complicated. This is an element-wise multiplication.

If you go look at the scalar assembly code,

well, first of all, we need to have a loop.

We have to load the first value, load the second value, do the multiply, do the

store. This is showing code for floating-point

double-precision multiplies. Then, you have to increment a bunch of

pointers, check the boundary case, and loop

around. On your vector architecture, life gets a

little bit easier here because we can do all 64 of these in one instruction; we

don't have to loop. And all you really have to do is

load vector, load vector, multiply, and store.

And this instruction on the top here loads the vector-length register.

And we load the vector-length register here with 64, because we're trying to

do 64. But if we were to load the vector-length

register with, say, 32, we would only do the first 32 multiplications.
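To make the comparison concrete, here is a hedged C sketch of the two versions. `multiply_scalar` is the ordinary loop; `multiply_vector` emulates the four-step vector sequence (load vector, load vector, multiply, store) under a given vector-length value. All function names are hypothetical, and each emulated "instruction" is written as a loop purely to show its effect.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64 /* maximum vector length assumed for this sketch */

/* Scalar version: load, load, multiply, store, then loop bookkeeping. */
static void multiply_scalar(double *c, const double *a, const double *b,
                            size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}

/* Emulated unit-stride vector load of the first vlr elements. */
static void load_vector(double v[MVL], const double *base, size_t vlr) {
    for (size_t i = 0; i < vlr; i++) v[i] = base[i];
}

/* Emulated vector-vector multiply of the first vlr elements. */
static void mul_vv(double vd[MVL], const double va[MVL],
                   const double vb[MVL], size_t vlr) {
    for (size_t i = 0; i < vlr; i++) vd[i] = va[i] * vb[i];
}

/* Emulated unit-stride vector store of the first vlr elements. */
static void store_vector(double *base, const double v[MVL], size_t vlr) {
    for (size_t i = 0; i < vlr; i++) base[i] = v[i];
}

/* Vector version: no loop in the program, just four vector operations. */
static void multiply_vector(double *c, const double *a, const double *b,
                            size_t vlr) {
    double v1[MVL], v2[MVL], v3[MVL];
    assert(vlr <= MVL);          /* the vector-length register is capped */
    load_vector(v1, a, vlr);
    load_vector(v2, b, vlr);
    mul_vv(v3, v1, v2, vlr);
    store_vector(c, v3, vlr);
}
```

Calling `multiply_vector(c, a, b, 32)` touches only the first 32 elements of `c` and leaves the rest unchanged, which mirrors loading the vector-length register with 32.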

And you can set that vector-length register all the way up to the maximum vector

length. So, besides the vector-length register,

there's this value here we call the vector-length register max,

which is the largest

length a vector can have. The vector-length register says, for the

given operation we're about to compute, how many of those operations we should do.

So, you could easily have something with a maximum vector length of a

thousand, but you only want to do, let's say, the

first 64 operations, so you can load your vector-length register with 64 and only do

64 operations. A good example of this, actually, is some

of the supercomputers. The Cray machines have relatively short

vector-length maxes, but if you go look at something like the NEC

supercomputers from Japan, the NEC SX-8 or SX-9 or something like that, which is, I think,

actually now probably the fastest computer in the world, the SX-9, I think, or

whatever is the newest. I think it's the SX-9, the new

Japanese vector supercomputer. They have very long vector-length maxes, so

they can actually have a vector length of a thousand. So, in one instruction,

they can basically encode a thousand operations, which is pretty fancy.

But you still need to be able to set the vector length, because maybe you

don't want to do all thousand all the time.

Okay, so what are some of the advantages of this vector stuff?

Machines like the Control Data Corp 6600 or the Cray-1

have very deep pipelines. And if you think about the architecture

we've been building up to this point, we had to add a lot of forwarding logic and a

lot of bypasses to be able to bypass one value to the next value.

Well, if you have a very deep pipeline, and you have back-to-back multiplies or

something like that, you're going to stall a lot.

But in a vector computer, because you know you're operating on, let's say, 64

operations at a time anyway, this actually allows you to take out a lot

of the bypassing. So, many of these vector architectures have

no bypassing in them, because if you're going to be operating on

64 things, and your pipeline length is six anyway, there's no possibility that you'll

ever actually have to forward data back to, let's say, itself or something like

that, and you could do all the bypassing between different operations in

the register file itself. Also, you know, deep pipelines are good

because you can have very fast clock rates. So, to give you an example, the old Cray-1

had an 80 megahertz clock. Now, you might say, 80 megahertz, ooh,

that's not very fast. But, you know, 80 megahertz back in,

probably, the late '60s or early '70s was a very fast clock rate for a processor.

I mean, these were supercomputers, mind you, but they were very aggressive, and

they could do that because they had deep pipelining and lots and lots of

logic, and these things were physically large.

I mentioned the memory system, and vector computers have some

interesting changes that you have to think about in the memory system.

One of the things you can do, because you have so many memory operations going

on, is use a vector load.

You can actually overlap going out to main memory with doing the next load,

effectively, even if you're doing them sequentially.

And most of these vector architectures have many, many memory banks.

And what's nice is, if you have unit stride, you know that your one

load is going to go to this bank, the next operation is going to the next bank,

then that bank, that bank, that bank, and you have basically a very good bank

distribution or bank utilization. And this is assuming right now that we are

actually only doing one memory load at a time.

And I have a little note up here that says, okay, well, each load takes, let's

say, four cycles of bank busy time, and you have a twelve-cycle

latency to get out to memory in this Cray-1 machine.

Well, on a normal architecture, this would be

pretty bad, because you'd be stalling twelve cycles, let's say, to go out to

your memory system. I mean, that's not the end of the

world, but that's not great if you have a load, and then a use, a

load, a use, and just keep going back and forth between load and use.

But in the vector architecture, because we have a long vector length and we're

loading 64 different values and we know that they're going to have good

distribution over many different memory banks,

we can effectively do this one load and we can overlap the latencies in the memory

banks with each other. So, we'll start one load here, and then

one load here, one load here. And if, you know, it has four-cycle

occupancy on the respective bank, and we have a 64-entry vector, definitely by the

time we wrap around and get back to using this bank again, that first operation will

be done. So, it's a relatively effective way to

increase the bandwidth of your architecture and guarantee that you're not

going to have bank conflicts.
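The bank argument can be checked with a small simulation. The sketch below is a simplified model under assumptions that are not all from the lecture: one element address is issued per cycle, banks are interleaved by element index, and a bank stays busy for a fixed number of cycles after each access. The function name `count_bank_stalls` and the choice of 16 banks in the usage note are illustrative; only the four-cycle bank busy time and the 64-element vector come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 64 /* upper bound for this sketch only */

/*
 * Count the cycles a vector load spends waiting on busy banks.
 * num_banks   : number of interleaved memory banks
 * busy_cycles : cycles a bank is occupied per access
 * stride      : distance between elements, in elements
 * vl          : number of elements loaded
 */
static unsigned long count_bank_stalls(size_t num_banks,
                                       unsigned long busy_cycles,
                                       size_t stride, size_t vl) {
    unsigned long free_at[MAX_BANKS] = {0}; /* cycle each bank frees up */
    unsigned long cycle = 0;
    unsigned long stalls = 0;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);

    for (size_t i = 0; i < vl; i++) {
        size_t bank = (i * stride) % num_banks;
        if (free_at[bank] > cycle) {        /* bank still busy: wait */
            stalls += free_at[bank] - cycle;
            cycle = free_at[bank];
        }
        free_at[bank] = cycle + busy_cycles; /* occupy the bank */
        cycle++;                             /* issue next address */
    }
    return stalls;
}
```

Under this model, `count_bank_stalls(16, 4, 1, 64)` reports no stalls, because with unit stride a bank is revisited only every 16 cycles, long after its four busy cycles are over. As a contrast that goes slightly beyond the lecture, a stride of 16 sends every access to the same bank, and `count_bank_stalls(16, 4, 16, 64)` reports 189 stall cycles (three for each of the 63 accesses after the first).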
