Okay, so we left off talking a little bit about how to encode instructions in VLIWs, or how VLIWs encode instructions. And as you may recall, in a VLIW instruction encoding, you have a very long instruction word. So it could be, I don't know, many, many bytes.
So if you look at something like the Multiflow processors, the Multiflow TRACE processors, they had very long instruction words. They could, I think, execute something like twenty-some instructions in their largest machine configuration.
And they had to encode this, but one of the challenges was that you didn't necessarily want to have all that encoding space used all the time. So if you're executing a simple instruction, or something which doesn't actually need all the encoding space, you end up with a whole lot of no-ops.
Or, if you have a very long dependency string in your execution trace, if you will (you have an instruction which is dependent on the next instruction, which is dependent on the next instruction, and so on), then you cannot fill any of the extra slots, and you're going to end up having a lot of no-ops in your program.
So how do we go about solving this? There are a couple of different solutions here. One of these solutions, which is what's actually used in the Intel Itanium, or IA-64, processor line, and also in the TI DSPs, the C6000 series, is to have the ability to effectively mark where a bundle starts and stops. So it's something sort of in between a variable-length instruction and a fixed-format VLIW instruction. We'll talk more later in today's class about how IA-64, Intel's Itanium, handles this.
But roughly what happens is, they have a fixed instruction format, and you can fit three instructions per what they call a bundle. And then, if you want to go beyond that, you can keep going: they have special bits in the instruction that say, oh, this is going to express some more parallelism later, so you need to look at the next bundle, fetch that bundle, and all of that can be executed in parallel.
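Just to make the idea concrete, here is a rough sketch of an IA-64 style bundle layout in C. Treat it as illustrative only: the field widths follow the published IA-64 encoding, but the struct and helper are assumptions for the sake of the sketch, not a real decoder.

    #include <stdint.h>

    /* Illustrative IA-64 style bundle: 128 bits total, a 5-bit
       template plus three 41-bit instruction slots. The template
       encodes which slots can issue together and where a "stop"
       falls, so a parallel group can continue into the next bundle. */
    typedef struct {
        uint64_t lo;  /* template (bits 0-4), slot 0, start of slot 1 */
        uint64_t hi;  /* rest of slot 1, slot 2 */
    } bundle_t;

    /* Pull out the 5-bit template field that marks the stops. */
    static inline unsigned bundle_template(bundle_t b) {
        return (unsigned)(b.lo & 0x1f);
    }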
So it still has parallel execution semantics, but the formats are not necessarily fixed. And this is sort of outside the bounds of classical VLIW now; we're getting a little bit more into modified, or more contemporary, VLIWs.
So today, as I said, we're going to talk about how to go about getting all the performance that out-of-order superscalars can get, but in the context of a VLIW. We're going to slowly build up all the different techniques that out-of-order superscalars have, all the ways that they can get instruction-level parallelism, and then apply them inside of a VLIW, or see how we can make small changes to classical VLIWs to get a lot of the performance back.
Okay, so let's talk about the first trick here. One of the questions, or problems, that comes up here is that if you have branches and you mispredict, this can limit your instruction-level parallelism in a VLIW.
Okay, so why is this worse than in an out-of-order superscalar? Well, an out-of-order superscalar can dynamically schedule around branch mispredicts. So if you have a branch mispredict, you can basically schedule other code, or speculatively schedule code depending on your prediction. But on something like a VLIW processor, that's a lot harder to do, because you don't have a dynamic scheduling engine behind there; the compiler had to do all that scheduling statically, at compile time. So how do we go about fixing this? Well, one solution is just to eliminate all the hard-to-predict branches. You take out the branches that might be hard to predict. Your branch predictor is trying really hard (you could still have a branch predictor in your VLIW processor), but some of these things you just don't know. Take a data-dependent branch: how do you know which way it's going to go?
And, you know, this is a problem even for superscalars, but at least in a superscalar you can try to sort of backfill instructions, or move things around, if you will: move loads or long-latency instructions around branches. So in an out-of-order superscalar you can potentially move a load beyond a branch. But in a VLIW processor, that's not easy to do, because the compiler has to make the decision of whether that branch is going to be predicted taken or not taken. And that may change over the lifetime of the program, and it may also change depending on the input sets to the program. So it may not be something that's even knowable at compile time.
Okay, so our first technique here is called predication. So let's define what predication is. Well, first of all, predication is a technique that allows you to basically change the result of a register based on some other register. And it's going to let us transform hard-to-predict branches into dataflow, or data dependencies. So we're going to add something to our instruction set, and it's going to take what looks like a small branch, or short-distance branch, and we're going to basically execute both sides of that branch, the taken and the not-taken code paths. And then at the end, we decide which one was the right one, with something like a select operator from C. So this is like the question mark colon operator, which no one ever uses in C. This is sort of the equivalent of that, but we're going to do it in one instruction, or maybe two instructions.
So let's take a look at how this helps with small branch regions, or branches which are hard to predict. We're going to introduce two instructions. And I also want to point out that predication sometimes shows up in superscalar processors, or in sequential instruction sets, too. A good example of this is that two instructions very similar to these are actually in x86; they're called cmov, or conditional move. And we're going to introduce conditional move as the most basic form of predication.
So let's look at the semantics of the two instructions we add for a conditional move. The first instruction here is move zero. So what does move zero do? Well, move zero tests whether one of its input operands here, rt, is equal to zero. And if it's equal to zero, it's going to move rs into the result register, the destination register rd. So it's going to move rs into rd if rt is equal to zero. And if rt is not equal to zero, it's just going to leave rd alone.
Okay, on first appearance that seems like a very simple instruction. It's basically just doing a copy operation; it's a gated copy. It looks a lot simpler than something like add or subtract. We don't have to do any math, at least. And the compare is easy: it's an equals-zero test, not a less-than-or-equals. So it's a pretty simple instruction.
And we're also going to introduce move not zero, which is kind of the complement of move zero. We're going to want this because when we go to execute code, we're going to want to, let's say, take an if-then-else and execute the then and the else at the same time, or roughly at the same time, indiscriminately, not depending on the branch. And at the end, we're going to decide which one was the correct one. So we need two instructions here: one to choose the case where the condition value is true, and one for the case where it's false.
So, move not zero. Move not zero is going to do the same thing, except the sense of the condition is going to be different. It's going to test if rt is not equal to zero, and if rt is not equal to zero, then rs is going to go into rd; else we're just going to leave rd alone with the original value of rd.
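Here is a minimal C sketch of those two instructions' semantics, written as functions. The names follow the MIPS movz/movn flavor, and the pointer-to-rd style is just for illustration:

    /* movz rd, rs, rt : if rt == 0, copy rs into rd; else leave rd alone. */
    void movz(int *rd, int rs, int rt) {
        if (rt == 0) *rd = rs;
    }

    /* movn rd, rs, rt : if rt != 0, copy rs into rd; else leave rd alone. */
    void movn(int *rd, int rs, int rt) {
        if (rt != 0) *rd = rs;
    }

Note that in both cases the destination keeps its old value when the condition fails; that non-destructive behavior is what lets the pair compose, as we'll see below.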
Okay, so let's walk through a quick code example here and see how this helps. Here we have a code example: we have some C code, we have non-predicated MIPS-like assembly code, and then we have MIPS-like assembly code with these two fancy predication instructions. And all three of these are doing the same thing.
So let's look at this piece of code; the C code is probably the easiest to understand. It starts off saying: if a is less than b. So there's a condition, and there's a then and an else. So what does this actually implement? Well, this is a minimization function, a min function: x gets the minimum of a and b. If a is less than b, x gets a; otherwise, x gets b. So x is assigned the smaller of a and b. And these if and else clauses, you can see, are pretty short here; we don't have a whole lot of code in either the then case or the else case. And that's going to be important in a second; you'll see why.
Mainly it revolves around the fact that if you have lots and lots of code, or one of the sides of the branch is long, it may not make sense to actually predicate this, because there's some cost to predication. Okay, so here's some assembly code that does a similar sort of thing. Set less than is a comparison operation in MIPS. So we do a set less than, where R2 and R3 hold our two values, a and b. If a is less than b, R1 gets set to one; if not, it gets set to zero. And then we have, effectively, a branch-if-zero here. Okay, it's written as a branch-equal against zero, which is a little odd; we probably should have just put a branch-zero there instead. But it's going to check whether this value is zero or one. If it's the one direction, it's going to jump to L1; if it's the other direction, it's going to fall through and then jump over the other move. And these two moves here are the two assignments, the x = a and the x = b. Okay, that's not so bad.
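The slide itself isn't reproduced here, but the non-predicated sequence is along these lines; the exact mnemonics, registers, and labels are reconstructed, so treat them as illustrative:

    /* Branchy min: C version, with the MIPS-like code as comments. */
    int min_branchy(int a, int b) {
        int x;                /*     slt  r1, r2, r3  # r1 = (a < b)    */
        if (a < b) {          /*     beqz r1, L1      # branch if r1==0 */
            x = a;            /*     move r4, r2      # x = a           */
        } else {              /*     j    L2          # jump over else  */
            x = b;            /* L1: move r4, r3      # x = b           */
        }
        return x;             /* L2: ...                                */
    }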
Let's analyze how long this takes: how many cycles this will take to execute on a sequential machine. Okay, there's the set less than; that's one. The branch; that's two. Let's say the branch is not taken and we fall through: three, four. So the fall-through case is going to be one, two, three, four instructions. And let's say we have a two-cycle branch mispredict penalty, and look at the other case. The other case is going to be one, two, and then two cycles of mispredict, which gets us up to four; the instruction we branch to makes five. So it can be five cycles. So all of a sudden, we have our branch mispredict penalty coming into the equation here. And, you know, two-cycle branch mispredicts are relatively benign; if that penalty starts to get longer, then you really start to have problems with this. Okay.
So let's take this code sequence and look at how to predicate it. If we're going to predicate this, we're actually going to execute both sides, regardless of the value of the if clause, the conditional clause here. So if we look at what we do, we've actually restructured the code a little bit. This is our compare, our if, still up here: we have a set less than. And then we say: if R1 is zero, move R2 to R4, else just leave it alone. And then here: if R1 is not zero, move R3 to R4. And what's important to note here is that the non-destructive behavior, especially of this last instruction, is really important. Because if the move zero executed and copied a new value into R4, and then this move not zero were to somehow change R4, you would lose the result from the previous one. This really depends on these operations being non-destructive when their condition is not true. Okay.
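Again reconstructed rather than copied from the slide; in particular, the compare's operand order, and therefore the movz/movn polarity, is an assumption chosen to match the narration:

    /* Predicated min: three instructions, no branch, same cost
       regardless of which way the condition goes. */
    int min_predicated(int a, int b) {
        int r1 = (b < a);      /* slt  r1, r3, r2  # r1 = (b < a)      */
        int x = 0;
        if (r1 == 0) x = a;    /* movz r4, r2, r1  # if r1 == 0: x = a */
        if (r1 != 0) x = b;    /* movn r4, r3, r1  # if r1 != 0: x = b */
        return x;              /* exactly one of the two moves fires   */
    }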
So let's analyze this from a number-of-instructions perspective. The first question is: does it matter which way the condition goes? Is it going to take a different number of instructions, or a different number of cycles, depending on the direction of the condition? Well, no: there's no branch, so nothing's going to change. In all cases, this is going to take three cycles. So all of a sudden we've transformed something which could take, as I said, four or five cycles with a branch (say the branch goes each direction 50 percent of the time, so we split the difference and call it four and a half cycles on average) into something which takes three.
Well, that's a great win. So where else is this useful? Because you don't always have both if and else clauses to make this work out. Well, this also works well for small code hammocks. A small code hammock is just an if which jumps around a small piece of code. If you can't predict that if very well, then it makes sense to just always execute what's inside that code hammock, that small piece of code. If, for instance, you have two instructions inside of a code hammock, your mispredict penalty is something like ten cycles, and you don't know which way the branch goes (it's taken 50 percent of the time either way), then all of a sudden your average mispredict penalty is going to be something like five cycles. You could have just executed the code in the middle; that would have been faster. You could have executed the two instructions and then just done a conditional move at the end, and that would have been basically always better.
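Putting rough numbers on that argument (these are the lecture's assumed figures, not measurements):

    #include <stdio.h>

    /* Small hammock: 2 instructions, 10-cycle mispredict penalty,
       branch unpredictable (wrong ~50% of the time). */
    int main(void) {
        double avg_penalty = 0.5 * 10.0; /* ~5 cycles lost per branch, on average */
        double predicated  = 2.0 + 1.0;  /* always run the body + one cmov: ~3    */
        printf("branchy avg penalty ~%.1f cycles, predicated ~%.1f cycles\n",
               avg_penalty, predicated);
        return 0;
    }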
Okay, so we have some questions here. What if the if-then-else has lots of instructions? Hm, so let's say we have an if-then-else here, but there's a thousand instructions in the then clause and a thousand instructions in the else clause. Should we predicate this? Let's say the branch mispredict penalty is ten cycles, and the branch goes 50 percent either way, but there's a thousand instructions in both the then clause and the else clause. Well, no, that would never make sense to do. Because you're basically doubling your code, and if there's a lot of code there, you'd be better served by just doing the branch. You'd be taking something that could be 1,000 instructions and turning it into always 2,000 instructions, doubling the number of instructions executed, versus just adding, let's say, ten cycles of branch mispredict penalty (or on average about five cycles) on top of the 1,000 instructions you have to execute. So all of a sudden, with the branch you're only adding a little overhead. So where this predication really helps is for small pieces of code where you don't know the branch outcome, or can't predict the branch outcome very well. Okay, what if the code is unbalanced?
So you have an if-then and an else, but the then clause and the else clause are very different in size: one is three instructions and the other is a thousand instructions. Does predication make sense? Well, it goes back to the same argument we had before with the very large if-then-else clauses. If the code is really unbalanced (you have a lot in the then clause and a little bit in the else clause, or something like that), and let's say the branch is taken 50 percent of the time, it might make sense just to use an actual branch there and deal with the mispredict penalty cost versus trying to somehow predicate it. Because, in fact, say you have three instructions and a thousand instructions on the two different sides: if you predicate, you're going to be executing a thousand and three instructions every single time, plus maybe some conditional moves at the end, so there's overhead involved. Versus with the branch, the 50-50 chance says: I'm taking either the thousand or the three, so we add those two together and take the average, plus a little bit of branch overhead penalty. So, you know, the average of that is going to be something like 501, or about 502 cycles. That's going to be better than always executing a thousand-plus cycles.
Let's see, anything else we wanted to say here? You know, one last thing I wanted to say before we move off this slide: move zero and move not zero, sometimes called cmoves, conditional move instructions, are what is called partial predication. We are now going to move on to full predication, where we can actually put conditions on every single instruction in our ISA. But partial predication is a little bit easier to do than full predication; it's sort of the first step that people typically implement. And just to hammer this home one more time: predication is really turning what was control flow into a dataflow operation, or data operation.
Okay, so let's take a look at full predication. With full predication, we're actually going to change the instruction set such that all of the instructions, or almost all the instructions, can be executed conditionally, based on some predicate. If the predicate's false, the instruction is not going to do anything; it's not going to have any side effects. So let's look at a code sequence here.
In this code sequence we have an if, a then, an else, and then some code after we're done. So here we have the if. This looks a little different: it's not actually MIPS code, just sort of serial code, but this is our if here. If a = b, then branch to B2, else fall through. Here we're going to execute our else clause, and when we're done, we're going to branch over the then clause. So, nothing particularly interesting here: four basic blocks. I don't know if we've introduced this term yet in this class, but I wanted to introduce the term basic block. A basic block is a piece of code which has strictly one entry point and one exit point. So you can enter it from other places; you can jump to it from other pieces of code. But once you enter the piece of code, you're going to execute it all the way to the end, and there's one place you can leave. You're going to jump to someplace else, or branch (and you can fall out to two different places), but the exit point is only one place. So the big differentiation here is: if a piece of code has, for instance, two branches in it, that's not called a basic block; that's a bigger type of block. This is a compiler term, and it's just good to know.
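To make that concrete, here is the if-then-else from this slide written out in C with explicit gotos, with the four basic blocks marked. The labels and statement bodies are just placeholders:

    int example(int a, int b, int x) {
        /* BB1: the test. One entry (from above), one exit: the branch. */
        if (a == b) goto then_clause;
        /* BB2: the else clause. Entered only by falling through from
           BB1, with one exit: the jump to done. */
        x = x - 1;
        goto done;
    then_clause:
        /* BB3: the then clause. Entered only via the branch from BB1. */
        x = x + 1;
    done:
        /* BB4: the code after the if-then-else. */
        return x;
    }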
Okay, so let's apply full predication here. These are relatively short code hammocks: two instructions in the else clause and two instructions in the then clause. This is a really great candidate for predication.
Okay, so let's take a look at the predication here. Before we do that, I wanted to say that one of the big reasons we're trying to predicate on a VLIW in particular is that in a VLIW we have lots of extra, or could potentially have extra, parallel execution slots. So in our instruction sequence, or instruction encoding, we have extra places to put code, so it may not actually hurt us to predicate. Because, as we said last time, when we go to predicate we're basically going to execute both sides of the if-then-else. And it's bad if we have to execute both sides of the if-then-else sequentially when they're big. But if you can slide them up next to each other, if you have extra no-op slots or extra slots in your VLIW, you can actually decrease the critical path through your program.
So let's look at this; we're actually going to do that in this example. In this example here, we took this piece of code and added full predication. When we say full predication, we added some extra things to these instructions, and they look a little strange at first. So here are instruction three, instruction four, instruction five, instruction six; these are the same ones that were back over here. But we've added some extra notation in front of them: these parentheses with p's in them. Those are predicate registers. So we're going to add a second register file to our processor where we can store effectively one-bit values, sort of true-or-false values, that are going to determine whether an instruction executes or doesn't execute.
Excuse me. So what this means is: if p1 is one, execute this; if p1 is zero, nullify this instruction three here. And we have a double pipe here. What am I trying to show with this? What I'm trying to show is that this is parallel: this and this are executing at the same time, because we have a very long instruction word processor that can execute multiple things at the same time. And what we've done here is actually slid the then up next to the else, and we're going to execute them at the same time, dependent on these predicates. So this is the else block, and this is the then block, and we're going to execute them concurrently. Okay. The stuff before the if, and the stuff after the if-then-else is all done, are still there; those are from that four-basic-block example.
What's this instruction? This instruction has very interesting semantics, and it's something that you probably have to add if you're going to have full predication. Okay, so let's compare; this is our comparison: if a = b. And it writes into two outputs. Ooh, that looks broken; that looks wrong somehow. How can we have one instruction write to two things? And what are the semantics of that? What does it write here and here? Well, what this is actually doing is writing the positive outcome of the compare to one output, and the inverse, the negation of that, to the other. And this is useful, because then we can go use these predicates down here. We can use the positive one and the negative one, and those can drive whether we execute the else clause or the then clause.
So instruction sets that have full predication typically have something like this. That's one option: you have, sort of, two outputs, and you load two predicate registers. The other option is, when you actually go to encode the instructions, you have sort of a NOT bit. The instruction denotes the predicate register, but it also denotes whether it's the predicate register being true, or the predicate register being false, that drives whether the instruction executes in our full predication scheme.
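To wrap up, here is a small C model of the full-predication semantics we just walked through. The IA-64 style syntax in the comments and the helper names are illustrative, not taken from any real toolchain:

    /* A predicate register holds a single true/false bit. */
    typedef struct { int val; } pred_t;

    /* cmp.eq p1, p2 = a, b : one compare writes both the positive
       outcome and its negation, so the two predicates can steer the
       then clause and the else clause respectively. */
    void cmp_eq(pred_t *p1, pred_t *p2, int a, int b) {
        p1->val = (a == b);
        p2->val = !(a == b);
    }

    /* (p) add rd = ra, rb : if the predicate is false, the instruction
       is nullified and has no side effects at all. */
    void pred_add(pred_t p, int *rd, int ra, int rb) {
        if (p.val) *rd = ra + rb;
    }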