Okay, so I want to briefly give a case study here of one of the more interesting
modern-day VLIW architectures, probably the most famous and possibly
also the most infamous VLIW processor out there.
This is the Intel Itanium, also known as IA-64, or what's known as an EPIC
processor: Explicitly Parallel Instruction Computing.
And a lot of this work was actually done in collaboration between Intel and HP.
HP uses these a lot in their big servers, their sort of big mainframe, well, not
quite mainframes, but big, heavy, big-iron computers. And Intel was trying to use
this to effectively kill all of the other workstation vendors.
And this was going to be their 64-bit solution to computing.
So it's a modern, non-classical VLIW, and this was going to be Intel's chosen ISA.
They were going to deprecate x86 and choose IA-64 as the 64-bit ISA.
And as we now know, going a few years forward after the creation of all
this stuff, that didn't really happen.
Intel went and did this; it built a bunch of processors with this instruction set.
You can still buy processors with this instruction set, but it never got as good
an acceptance as its competitor. The competitor at the time was called AMD64,
which is a 64-bit extension to what people already had.
And that's what people ended up wanting: just a 64-bit extension to what we
already had versus, you know, something totally different.
Okay, so a couple of features here. It's an object-code-compatible VLIW, so it's
not quite a VLIW in the classical sense. Object-code compatible means different
generations, different micro-architectures of this VLIW can run the same
instruction code, the same binaries, without having to recompile.
And how they did this, effectively, as I alluded to before, is they had the
ability to have parallelism straddle across instruction bundles.
And they had this notion of groups, which we'll talk about in a second.
So, Merced was the first Intel Itanium implementation.
It was kind of like the 8086 of x86, the first of the family.
And Merced, as you'll realize if you look at Intel code names, is named after
a river. Intel likes to name their things after either rivers or places.
I think this has something to do with the fact that you can't trademark a place
name, so they get around any trademark issues by choosing place names for all
their code names. One of the big problems here: it was supposed to ship in 1997,
and the first customer shipment was not until 2001. That's a four-year miss.
And superscalar was another thing that had sort of caught up with it by that time.
It was supposed to be faster and better than everything else,
and the first one was not very good.
It had low clock rates and was not as high performance as it was supposed to be.
And the x86 side of Intel's business line actually had almost the same
performance as the first Itanium, and then very quickly surpassed it.
So their high-end processor wasn't actually high end.
A couple of other things here. McKinley was the second implementation, and it
shipped pretty quickly after that. This was a much better implementation, but,
you know, these are still hard to do, and they're still building these things.
So, in 2011 at ISSCC, Intel introduced the Poulson processor.
Big, big machine here: eight cores in 32 nanometers, and lots and lots of
on-chip memory, 32 megabytes of shared L3 cache. Big processor:
544 square millimeters in 32 nanometers.
So at the time this came out, this was the biggest processor ever built, with
the most transistors, over three billion, or at least the biggest commercial
one. Intel might have had a research prototype with more transistors than this;
I think their many-core processor, what they call the SCC, their Single-chip
Cloud Computer, might have had more, but I don't remember the transistor count.
From a commercial processor perspective, though, this is a huge chip.
But they are selling these into extremely expensive sorts of sockets.
They're selling this chip at a premium; it was going into big mainframes.
That was not what this was originally destined for.
It was destined for both big mainframes and workstations.
But now, standing here in 2012, this is not used in lots of other places except
for sort of bigger hardware, mainframe sorts of things.
So a few of the interesting things here: the cores are multi-threaded, you can
fetch six instructions per cycle, and you can execute up to twelve instructions
per cycle, per core. And then there are eight cores.
So this is a beast of a machine, a very, very high-performance computer.
Okay, so let's dive into some of the details here of Itanium.
Itanium has a 128-bit instruction bundle, and inside of there you can fit three
operations, and then there's a field called the template bits, which sort of
says what is in the instruction bundle. So it's not actually a fixed-format
bundle; the instruction boundaries can move around a little bit.
And they did that so you can mix in something like an instruction with an
immediate alongside an instruction which doesn't have an immediate, and get
more space in the bundle for the immediate bits, or a branch offset, or
something like that.
These template bits also describe how a particular bundle relates to other
bundles around it. So sometimes these are called begin and end bits, or start
and stop bits. They say the number of instructions which can execute
explicitly in parallel.
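To make this concrete, here is a small sketch in Python, just for illustration. The field positions in the first function follow the documented IA-64 bundle layout: a 5-bit template in the low bits, then three 41-bit instruction slots. The second function captures the grouping idea: stop bits carve the instruction stream into parallel groups, and a narrower implementation just spends more cycles per group.

```python
def decode_bundle(bundle):
    """Split a 128-bit IA-64 bundle (held here as a Python int) into
    its fields: a 5-bit template in bits 0-4, then three 41-bit
    instruction slots in bits 5-45, 46-86, and 87-127."""
    slot_mask = (1 << 41) - 1
    template = bundle & 0x1F             # bits 0-4: template field
    slot0 = (bundle >> 5) & slot_mask    # bits 5-45
    slot1 = (bundle >> 46) & slot_mask   # bits 46-86
    slot2 = (bundle >> 87) & slot_mask   # bits 87-127
    return template, (slot0, slot1, slot2)


def cycles_for_groups(group_sizes, issue_width):
    """Stop bits mark groups of operations that may execute in
    parallel; an implementation that is only issue_width wide simply
    takes ceil(size / width) cycles per group, with no recompile."""
    return sum((size + issue_width - 1) // issue_width
               for size in group_sizes)
```

So the same binary runs on wide and narrow implementations; only the cycle count changes, which is what makes the different micro-architecture generations object-code compatible.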
And the machine doesn't necessarily have to execute these in parallel.
So for instance, if you say twenty operations can execute in parallel, but your
machine is only two wide, or they built a two-wide implementation of Itanium or
IA-64, you're just going to execute, you know, two wide for ten cycles, or
something like that. But what's really cool here is that the compiler is able,
just like in all the other VLIWs, to express the parallelism to the machine
explicitly.
Some interesting things about the registers. Because this is a VLIW processor,
and because you're going to have to do code scheduling like what we saw last
class, that increases the general-purpose register pressure.
You don't have a register renamer, so you can't go and use different names for
things, and the hardware is not going to rename things for you.
So instead, the compiler and the software are going to have to do the renaming.
So they had 128 general-purpose registers and another 128 floating-point
registers. And they also have these predicate registers.
They're not quite full predication, but they're pretty close to full
predication. So you can have bits that say whether later instructions are
going to execute or not, and you have to compute those into a little register
file. So they had a predicate register file that you have to bypass.
That's sort of interesting to see.
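Here is a tiny sketch of the predication idea, using a made-up mini-ISA in Python rather than real IA-64 encodings: a compare writes a pair of complementary bits into the predicate register file, and later instructions carry a guarding predicate and only commit if that bit is set, so both sides of a branch can be scheduled together with no branch at all.

```python
def run_predicated(program, regs):
    """Execute a toy predicated instruction list. preds is the small
    predicate register file; p0 is hardwired to true, as on Itanium."""
    preds = {"p0": True}
    for op in program:
        if op[0] == "cmp.lt":                 # cmp.lt pT, pF = a, b
            _, p_true, p_false, a, b = op
            preds[p_true] = regs[a] < regs[b]
            preds[p_false] = not preds[p_true]
        elif op[0] == "add":                  # (guard) add dst = a, b
            _, guard, dst, a, b = op
            if preds[guard]:                  # squashed if guard is false
                regs[dst] = regs[a] + regs[b]
    return regs


# max(r1, r2) into r3 with no branches: both adds are issued, but
# only the one whose guarding predicate is true actually commits.
regs = {"r0": 0, "r1": 3, "r2": 7, "r3": 0}
program = [
    ("cmp.lt", "p1", "p2", "r1", "r2"),   # p1 = (r1 < r2), p2 = !p1
    ("add", "p1", "r3", "r2", "r0"),      # (p1) r3 = r2 + 0
    ("add", "p2", "r3", "r1", "r0"),      # (p2) r3 = r1 + 0
]
run_predicated(program, regs)
```

The two guarded adds are exactly the kind of values that have to flow through that little predicate register file and its bypass network.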
And then they had a really interesting feature here, which is called a
rotating register file. So let's talk about what a rotating register file is.
The problem this is trying to solve is in a code sequence like we saw before,
in the last lecture. If you have a very-long-instruction-word scheduled piece
of code, and you want to get good performance, you're going to have to unroll
the loop, and then you're going to have to software pipeline the loop.
But when you do this, this is going to increase your register pressure, or
increase how many register names you need to use.
And, as we saw, you're going to have to add extra special code in the prologue
and the epilogue, which are different than the main loop body.
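Here is a rough sketch of the mechanism, as a hypothetical Python model that simplifies the real hardware: a logical register name is offset by a rotation base to pick a physical slot, and the loop-closing branch bumps the base, so each iteration automatically gets a fresh set of registers. A value one iteration writes to logical r0 shows up in the next iteration as r1, which is exactly the renaming software pipelining needs, without the compiler unrolling and hand-renaming.

```python
class RotatingRegisterFile:
    """Toy model of a rotating register file: logical register r maps
    to physical slot (r + base) % size, and rotating the base renames
    every register at once, in hardware, with no extra instructions."""

    def __init__(self, size):
        self.phys = [0] * size
        self.base = 0

    def _slot(self, r):
        return (r + self.base) % len(self.phys)

    def read(self, r):
        return self.phys[self._slot(r)]

    def write(self, r, value):
        self.phys[self._slot(r)] = value

    def rotate(self):
        # Done by the loop-closing branch: after rotation, the value
        # that was addressed as logical r0 is now addressed as r1.
        self.base = (self.base - 1) % len(self.phys)


# Each iteration writes its produced value to r0 and can consume the
# previous iteration's value as r1, the one before that as r2, etc.
rrf = RotatingRegisterFile(8)
for i in range(4):
    rrf.write(0, i * 10)   # this iteration's result
    rrf.rotate()           # loop branch rotates the base
```

After the loop, reading r1 gives the last iteration's value and r2 the one before it, so the in-flight values of a software-pipelined loop each keep a distinct name for free.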