Then this basic design here, because now we actually have to read from the
register file four different threads simultaneously, and we have to fetch
different code from different program counters simultaneously in this
processor. So where simultaneous multithreading came from was that people
were building complex out-of-order superscalars.
And these complex out-of-order superscalars had all this logic
to be able to track dependencies between different
instructions, and to basically restart sub-portions of instruction sequences.
So this is like when you take a branch mispredict:
you have to kill all the instructions that were dependent on the mispredicted
branch and leave the other ones, which were not, alone.
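To make the kill step concrete, here is a minimal sketch. It assumes that "dependent on the mispredicted branch" means the instructions that come after the branch in program order; the `InFlight` record, its `seq` field, and `squash_after` are hypothetical names for illustration, not the structures of any real processor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InFlight:
    seq: int                # program-order sequence number (hypothetical field)
    op: str                 # mnemonic, for display only
    squashed: bool = False  # set when the instruction is killed


def squash_after(window: List[InFlight], branch_seq: int) -> List[InFlight]:
    """Kill every instruction younger than the mispredicted branch.

    Assumption: anything with a larger sequence number than the branch was
    fetched down the wrong path. Older instructions, and the branch itself,
    are left alone.
    """
    for inst in window:
        if inst.seq > branch_seq:
            inst.squashed = True
    return [inst for inst in window if not inst.squashed]


# Four instructions in flight; the branch with seq == 2 was mispredicted.
window = [
    InFlight(1, "add"),
    InFlight(2, "beq"),
    InFlight(3, "lw"),
    InFlight(4, "mul"),
]
survivors = squash_after(window, branch_seq=2)
print([inst.op for inst in survivors])
```

The point is only that the machine already has to tell instructions apart by where they sit relative to a branch, which is the kind of tracking logic the lecture is referring to.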
And when you have this out-of-order mechanism, you have all this extra logic
there to figure that out. Dean Tullsen, Susan Eggers, and Henry Levy came
up with this idea: well,
what if we try to utilize all the dead slots in our out-of-order superscalar
by intermixing different threads in there simultaneously to fill the time?
So they did this study, published at ISCA in '95, and ran a bunch of different
applications, and this rightmost bar here is our composite, or our
average. And what you should figure out from this
is that this black bar on the bottom is how long the processor is busy actually
doing work, and the rest of this is different reasons the processor was
stalled. So we were stalled on instruction cache
misses, branch mispredictions, load delays, just pipeline interlocking,
memory conflicts, and other sorts of things, and we're using this
processor less than 20% of the time. So, to show this a different way:
We have our multi-issue processor, and we have time here.
We have all these purple boxes, which are just dead time.
And we might only be able to use subsets of the issue slots.
This cycle is very good: we actually issued four instructions in
this cycle. But here, we only issued two.
Here, we issued one. Here, we issued two, to these two
pipelines in the middle. Maybe these are the two ALU pipes,
and there's, like, a load pipe here and a branch pipe there, or something like
that. This is
kind of a disaster from an IPC perspective.
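As a rough worked example of why this is bad for IPC, the sketch below counts used issue slots for just the four cycles called out above (4, 2, 1, and 2 instructions issued). It assumes a 4-wide machine, and the function names are made up for illustration. The real diagram also has cycles with less issued, so the actual numbers would be lower.

```python
from typing import List


def ipc(issued_per_cycle: List[int]) -> float:
    """Average instructions issued per cycle."""
    return sum(issued_per_cycle) / len(issued_per_cycle)


def slot_utilization(issued_per_cycle: List[int], issue_width: int) -> float:
    """Fraction of issue slots that did useful work."""
    total_slots = issue_width * len(issued_per_cycle)
    return sum(issued_per_cycle) / total_slots


ISSUE_WIDTH = 4          # assumption: four pipelines, as in the example
issued = [4, 2, 1, 2]    # the four cycles described in the lecture

print(f"IPC = {ipc(issued)}")
print(f"slot utilization = {slot_utilization(issued, ISSUE_WIDTH):.4f}")
```

Even in these comparatively good cycles, 7 of the 16 slots are wasted.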
Can we try to reuse that hardware? And we talked about our coarse-grain
multithreading, which was effectively temporally slicing up the cycles here.
So you run one thread, switch to a different thread, run that thread,
and you can temporally switch between the threads.
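The temporal slicing can be sketched as below. This is a simplified model, not a real switch policy: it assumes a fixed switch quantum (real coarse-grain designs typically switch on a long-latency event such as a cache miss), and each thread is just a hypothetical list of how many instructions it would issue in each of its cycles.

```python
from typing import List, Tuple

ISSUE_WIDTH = 4  # assumption: 4-wide machine


def coarse_grain_schedule(
    threads: List[List[int]], quantum: int
) -> List[Tuple[int, int]]:
    """Return one (thread_id, slots_used) pair per cycle.

    Each cycle belongs to exactly one thread. Threads take turns running
    for up to `quantum` cycles until all of them are finished.
    """
    positions = [0] * len(threads)  # next cycle index for each thread
    schedule: List[Tuple[int, int]] = []
    current = 0
    while any(pos < len(t) for pos, t in zip(positions, threads)):
        for _ in range(quantum):
            if positions[current] >= len(threads[current]):
                break  # this thread is done; move on to the next one
            schedule.append((current, threads[current][positions[current]]))
            positions[current] += 1
        current = (current + 1) % len(threads)
    return schedule


# Two hypothetical threads, each given as per-cycle issue counts.
threads = [[4, 2, 1, 2], [1, 3, 2, 2]]
schedule = coarse_grain_schedule(threads, quantum=2)
used = sum(slots for _, slots in schedule)
print(schedule)
print(f"utilization = {used / (ISSUE_WIDTH * len(schedule)):.3f}")
```

Notice that slots left unused within a cycle stay unused, because only one thread owns each cycle. Slicing in time only changes which thread is running.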