As superscalar machines grow more complex, so do the difficulties of scheduling instruction issue. The on-chip hardware devoted to resolving dependencies and deciding on instruction issue is growing as a proportion of the total. In some ways, the situation is reminiscent of the trend towards more complex CISC processors — a trend that eventually led to the radical change to RISC machines.
Another way of looking at superscalar machines is as dynamic instruction schedulers — the hardware decides on the fly which instructions to execute in parallel, out of order, etc. An alternative approach would be to get the compiler to do it beforehand — that is, to statically schedule execution. This is the basic concept behind Very Long Instruction Word, or VLIW machines.
VLIW machines have, as you may guess, very long instruction words — in which a number of `traditional' instructions can be packed. (Actually for more recent examples, this is arguably not really true but it's a convenient mental model for now.) For example, suppose we have a processor which has two integer operation units, a floating point unit, a load/store unit, and a branch unit. An `instruction' for such a machine would consist of (up to) two integer operations, a floating point operation, a load or store, and a branch. It is the compiler's responsibility to find the appropriate operations, and pack them together into a very long instruction – which the hardware can execute simultaneously without worrying about dependencies (because the compiler has already considered them).
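The packing described above can be sketched in a few lines of Python. The slot names and operation strings here are invented for illustration, not taken from any real instruction set:

```python
# A minimal sketch of VLIW instruction packing. Slot names and
# operations are hypothetical; a real encoding would be binary.

SLOTS = ["int0", "int1", "fp", "ldst", "branch"]

def pack_bundle(ops):
    """Pack independent operations into one very long instruction.

    `ops` maps slot names to operations; any slot the compiler
    could not fill is padded with a NOP.
    """
    return {slot: ops.get(slot, "nop") for slot in SLOTS}

# Two integer ops, a floating-point op and a load, but no branch:
bundle = pack_bundle({
    "int0": "add r1, r2, r3",
    "int1": "sub r4, r5, r6",
    "fp":   "fmul f0, f1, f2",
    "ldst": "ld r7, 0(r8)",
})
print(bundle["branch"])   # the unfilled branch slot holds a NOP
```

The hardware then issues all five slots simultaneously, trusting the compiler's guarantee that they are independent.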
VLIW has both advantages and disadvantages. The main advantage is the saving in hardware — the compiler now decides what can be executed in parallel, and the hardware just does it. There is no need to check for dependencies or decide on scheduling — the compiler has already resolved these issues. (Actually, as we shall see, this may not be entirely true either.) This means that much more hardware can be devoted to useful computation, bigger on-chip caches etc., meaning faster processors.
Not surprisingly, there are also disadvantages.
First, compilers will obviously be harder to build. In fact, to get the best out of current, dynamically scheduled superscalar processors it is already necessary for compilers to do a fair bit of code rearranging to `second guess' the hardware, so this technology is developing anyway. We will consider compilers in chapter 14: for now, we will simply observe that building good compilers for VLIW is non-trivial.
Secondly, programs will get bigger. Most of the time, the compiler will not be able to find enough independent operations to fill all the available slots in an instruction; the unfilled slots must be padded out, leaving empty slots in instructions.
It is likely that the majority of instructions, in typical applications, will have empty code slots, meaning wasted space and bigger code. (It may well be the case that to ensure that all scheduling problems are resolved at compile time, we will need to insert some completely empty instructions.) Memory and disk space are cheap; memory bandwidth is not. Even with large and efficient caches (see chapter 16), we would prefer not to have to fetch large, half-empty instructions (let alone completely empty ones).
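A rough back-of-the-envelope calculation shows how quickly empty slots inflate code size. The occupancy and operation-size figures below are assumptions invented for the example, not measurements:

```python
# Illustrative estimate of code-size inflation from empty VLIW slots.
# All figures are assumed for the sake of the example.

slots_per_instruction = 5     # as in the machine described above
avg_slots_filled = 2.0        # assumed average occupancy
op_size_bytes = 4             # assumed size of one operation slot

# Bytes fetched per *useful* operation, VLIW vs an ideally packed encoding:
vliw_bytes_per_op = (slots_per_instruction * op_size_bytes) / avg_slots_filled
scalar_bytes_per_op = op_size_bytes
inflation = vliw_bytes_per_op / scalar_bytes_per_op
print(inflation)   # 2.5 — two and a half times the memory and fetch bandwidth
```

Under these assumed figures, every useful operation costs two and a half times the memory traffic of a fully packed encoding, which is exactly the bandwidth problem noted above.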
Unfortunately, it is not possible at compile time to identify all possible sources of pipeline stalls and their durations. For example, suppose a memory access causes a cache miss, leading to a longer than expected stall. If other, parallel, functional units are allowed to continue operating, sources of data dependency may dynamically emerge.
For example, consider two operations which have an output dependency. The original scheduling by the compiler would ensure that there is no consequent WAW hazard. However, if one stalls and the other `runs ahead', the dependency may turn into a WAW hazard.
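This effect can be made concrete with a small simulation. The register names and values are invented; the point is only the ordering of the two writes:

```python
# A small simulation of how a stall can turn an output dependency into
# a WAW hazard. Operations A and B both write r3; the compiler
# scheduled them so that A's write completes before B's.

def run(stall_first):
    regs = {}
    writes = [("A", "r3", 1), ("B", "r3", 2)]  # program order: A then B
    if stall_first:
        # A stalls (e.g. on a cache miss) while B's functional unit
        # runs ahead, so the writes reach the register file reversed.
        writes = [("B", "r3", 2), ("A", "r3", 1)]
    for name, reg, val in writes:
        regs[reg] = val
    return regs["r3"]

print(run(stall_first=False))  # 2 — the value program order demands
print(run(stall_first=True))   # 1 — WAW hazard: the stale write lands last
```

With no stall, r3 ends up holding B's value, as the compiler intended; with the stall, A's stale value lands last and later reads see the wrong result.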
So if you are serious about getting the compiler to do all dependency resolution, you must stall all pipeline elements together. This is another performance problem.
A more significant issue is that VLIW breaks down the barrier between architecture and implementation which has existed since the IBM System/360 in the early-to-mid 1960s. It will be necessary for compilers to know exactly what the capabilities of the processor are — for example, how many functional units are there?
This has, to a degree, happened before — for example, delayed branches in RISC processors meant that the compiler had to know something about the pipeline length. Also, as stated, compilers that best exploit superscalar processors know a bit about what is going on inside. However, if you upgrade your current, ageing, 2-way superscalar processor to a 4-way processor with the same clock speed (and the same pipeline length), your existing binary programs should show some speedup. Do the same sort of upgrade with a VLIW processor and you will get essentially no speedup: to see the performance increase, you will need to recompile.
Worse still, suppose that a manufacturer finds they want to remove a functional unit — say, it turns out to be not as useful as first thought to have three floating point units. To do so could mean loss of binary compatibility: at the very least, some old programs will not run. While sophisticated and corporate users might be able to deal with recompilation, the average home user may be traumatised…
There are possible solutions to this last problem. One is to provide some (transparent) binary code translation. Another is to be a little less `strict' in one's definition of VLIW. For example, rather than grouping operations together into single, large `instructions', you might group together some more conventional instructions into issue packets.
An issue packet is a group of instructions that have been assembled by the compiler and are guaranteed dependency-free. The hardware is free to execute as many, or as few, of them simultaneously as its resources allow. This is essentially the approach taken by Intel with the IA-64 architecture.
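The compiler's side of this can be sketched as a simple greedy pass over the instruction stream. The instruction encoding (destination register plus source registers) is invented for illustration:

```python
# A sketch of compiler-side issue-packet formation: greedily group
# instructions so that no instruction in a packet reads or writes a
# register written earlier in the same packet. The tuple encoding
# (dest, src_regs) is hypothetical.

def form_packets(instrs, max_packet=4):
    """instrs: list of (dest, src_regs) tuples in program order."""
    packets, current, written = [], [], set()
    for dest, srcs in instrs:
        depends = dest in written or any(s in written for s in srcs)
        if depends or len(current) == max_packet:
            packets.append(current)          # close the current packet
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        packets.append(current)
    return packets

prog = [("r1", ()), ("r2", ()), ("r3", ("r1",)), ("r4", ("r2",))]
print(form_packets(prog))
# two packets: {r1, r2} together, then {r3, r4} together
```

Here r3 and r4 each depend on a result from the first packet, so the compiler starts a new packet; within each packet, a 2-way and a 4-way implementation are both free to issue whatever they can.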
Another relaxation of the VLIW principle is that typically the complete processor does not stall when one component does. Instead, sufficient hazard resolution hardware is included to deal with any dependencies that dynamically occur during execution.
A few years ago, hopes were high that VLIW would provide the `way forward' for high performance microprocessor design. That seems to no longer be the case; VLIW-based chip designs have been less successful than anticipated, with relatively disappointing performance, given the effort put in. In time, the technology might have matured to a point where it lived up to its promise (and it might yet) — however, it seems that the emerging dominant trend in increasing performance of microprocessors is towards multi-core designs geared towards increasing thread level parallelism rather than instruction level parallelism. We will consider these issues further in chapter 18. Since VLIW is architecturally interesting, and still with us for at least a few years to come, we'll look at the main VLIW offering, IA-64, in chapter 12.