Superscalar is a term coined in the late 1980s. Superscalar processors arrived as the RISC movement gained widespread acceptance, and RISC processors are particularly suited to superscalar techniques. However, the approach can be used on non-RISC processors (e.g. Intel's P6-based processors, the Pentium 4, and AMD's IA32 clones), with considerable effort. All current desktop and server market processors are now superscalar.
The idea, basically, is: as a logical development of pipelined design, where a new instruction starts on successive machine cycles, start up a number of machine instructions on the same machine cycle. By taking advantage of available memory bandwidth (if it is available), fetch a number of instructions simultaneously, and start executing them. (In practice, instructions need to be in a dedicated high-speed cache to be available at a high enough rate for this to work.)
The development of superscalar from `vanilla' pipelining came about via more sophisticated, though still non-superscalar, pipelining technologies – most of which were seen on supercomputers from the mid 60s onwards. However, true superscalar processors are a concept that is unique to microprocessor-based systems.
The first developmental step is to observe that as resources grew in availability, high-performance computer systems (we are really talking about the most powerful machines of their day – Control Data 6600 and 7600, Cray-1, top-of-the-range IBM mainframes) started to acquire dedicated hardware to execute different types of instruction. So for example there may be separate execution units (EX) for floating point and integer operations. See figure 7.1.
Once instructions had been decoded, they would be sent to the appropriate execution unit. In general, as we have seen in chapter 6, floating point operations are more complex than integer ones (though integer multiply/divide can also be time consuming), and an FP unit would be broken down into further pipeline stages. Suppose we have an FP operation followed by an integer one. Our basic pipeline (chapter 4) would simply halt execution until the FP operation had exited the EX phase, effectively stalling the pipeline because the FP EX stage takes several cycles. However, there is no real need to (always) do that. We could allow the following integer operation to enter the Integer EX unit. Since integer operations typically take quite a bit less time than FP ones, this means that the integer result may well be available before the FP one that preceeded it in program order. This is known as out-of-order completion, and brings potential performance benefits as well as big problems, as we will see.
A further step would be to observe that now we have admitted the possibility of at least partially executing instructions out of program order, we have a potential mechanism for reducing the impact of data dependencies (i.e. pipeline stalling). Instead of just leaving a pipeline idle while we are waiting for a dependency to resolve, we could pick other, later instructions and execute them instead. We have moved from a situation, with the simple pipeline, where instructions are statically scheduled or ordered, by the compiler, to one where they may be dynamically scheduled by the hardware. (Actually, the situation is a bit more complex than this and we will return to it in section 7.2.1, below.)
As resources continued to grow, yet more execution units could be made available. What is more, these would often be duplicates: there might be two integer units, and two FP units, and perhaps others (for example, to handle memory accesses). To justify the existence of such a large number of execution units, it may no longer sufficient to simply attempt to fetch and decode a single instruction at a time. It is necessary to also attempt to fetch, decode and write back multiple instructions as well – see figure 7.2.
Note however that it is not the case that we would typically be able to fetch, decode etc. as many instructions as there are execution units – for example, in figure 7.2 most of the pipeline can handle three instructions, but there are five execution units. We will return to memory and cache bandwidth later (chapters 15 & 16 respectively) — in practice this is a limiting factor (i.e. how many instructions we can economically get at a time).
Of course, the potential problems that must be solved to do this are complex, and we will get to them later. For now however, we can ask ourselves: how many instructions can we profitably attempt to execute in parallel? Like less sophisticated pipelined machines, there seem to be limits on the number of instructions that can effectively be executed in parallel, before speed-up advantages become small – typically, between 2 and 8.
All current superscalar machines lie within this band (for example, P6-derived processors and the Pentium 4 have three parallel pipelines). However, techniques are being looked at – and some workers are optimistic – that may increase this dramatically. It has to be said though that other workers are very pessimistic.
Before proceeding, we will define some useful terminology and concepts. The key problems we have to solve when building superscalar machines concern the various data dependencies, resource conflicts, and branches (especially conditional ones). However, not all phases of instruction execution are equally affected. For example, we can happily fetch and decode instructions regardless of any dependencies. Even though we may be fetching/decoding the wrong instructions, doing so will not affect the state of a program. However, other phases are more critical.
A processor that attempts to execute n instructions simultaneously is said to be of degree n, or n-way. Of course, a superscalar processor will be pipelined as well: for example, a Pentium 3 has 14 pipeline stages. as well as being degree 3. Notice that a key word is `attempts': there is no guarantee that a superscalar processor will succeed in executing as many instructions as is, in principle, possible.
This refers to the process of deciding in what order to execute instructions. The choices are static, when the hardware has no control, and dynamic, when the hardware is permitted to re-order instructions to at least some degree.
This is essentially when we fetch operands for an instruction. Before we do this, we must be sure that they are either actually available (that is, they have already been computed and stored – either in their final destination or at least somewhere we can access them) or we at least know where they will be available in the future. In practice, resource constraints can also stop instruction issue – we will look at this in chapter 9.
In our simple five-stage pipeline, the issue phase occurs when instructions leave the decode stage and enter the execute stage.
Instruction issue is generally in program order.
Once issued, instructions can be executed once all their operands are actually available. Execution may in program order (in-order execution) or out of program order (out-of-order execution), depending on the machine.
Because we are predicting the outcome of branches (and maybe other things), we do not necessarily know for sure that when we execute an instruction we will actually end up using the result. This process is known as speculation. Once we for sure an instruction is going to be executed then it is committed, or enters the committal phase. The next stage is to store the result.
Ultimately we must store instruction results in their final destinations. (Actually, it is possible to not do this in all cases – for example, if a result actually represents a temporary, intermediate value that will be soon overwritten. However, as we will see in section 7.4, this raises awkward questions when dealing with exceptions, so in general all results are stored.)
This phase is known as write back and before proceeding we must be sure that (a) the instruction will actually be committed, and (b) the destination is free to be overwritten – that is, no other instruction is waiting to use its current value as a source operand. Note that in some machines, (a) does not necessarily hold – they are prepared to write uncommitted results and then `back out' of them later on. However, we will not deal with such machines in this module. It is relatively easy to establish (b) when the destination is a register, but difficult when it is a memory word because of the possibility of pointers creating aliases.
Finally, an instruction finishes and leaves the pipeline. Typically this happens immediately after write back and we say the instruction is completed or retired.
We can divide superscalar processors into a number of classes of varying complexity.
So for example, in a degree 2 machine, it is possible to issue and execute two instructions simultaneously: given instructions i1 and i2, we may choose to issue both, or only i1 (depending on the presence of hazards). We may not just issue i2.
To complicate and confuse matters, because the hardware has a choice (albeit limited) about issuing instructions, we say that instruction issue is dynamic. However, as the actual execution of instructions is in-order we say that scheduling is static.
There are a range of issues we must address to successfully build a superscalar machine. We will look at reorder buffers and scheduling algorithms in chapters 8 & 9, later. Here we will concentrate more on the problems themselves.
First, in addition to the data dependency constraints that apply to simple pipelines, there are two others we must consider. They are sometimes grouped together as name dependencies, because they are a function of the name of the source or target operands (usually registers) of instructions. They arise because we are executing instructions out of program order, and are not `real' in that they are not a property of programs themselves – hence the `true' in true data dependency, which is `real'. The two name dependencies are usually called output dependency and anti-dependency
This occurs when two instructions both store their result in the same place – and is only a problem if the execution order of instructions could be reversed. This would mean that the first instruction's result could overwrite the second, which is the opposite of what we would expect.
MUL R1, R4, 15 ; R1 = R4 * 15 ADD R2, R1, 1 ; R2 = R1 + 1 MOVE R1, R3 ; R1 = R3
There is a true data dependency between the first two instructions, because the ADD uses the result of the MUL. There is a different type of dependency between the first and third instructions – an output dependency.
The MOVE instruction doesn't depend on any of the earlier instructions for operands, but we must delay writing back the result until the first instruction has finished (and also the second instruction has fetched its operands) – otherwise, the result of the MOVE will overwrite that of the MUL (which is exactly the opposite of what is required), and the wrong value will be used by the ADD. Observe though, in this example at least, there does not seem to be any compelling reason why the MOVE cannot simply store its result elsewhere. This is the key to resolving output dependency in practice.
Consider the code fragment:
ADD R1, R2, 1 ; R1 = R2 + 1 MOVE R2, R3 ; R2 = R3
The MOVE instruction cannot complete until the ADD has begun, because it would overwrite the R2 operand needed by the ADD. Again, there seems to be no compelling reason for MOVE to use R2.
Antidependency is so named because it is fundamentally true data dependency in reverse. Observe the difference between output dependency and antidependency – the first deals with the case when two instructions are to both write to the same place, while antidependency deals with the case when the first instruction is reading, and the second is writing. In true data dependency the first is writing, and the second is reading. Just as is the case with true data dependency, these two name dependencies can give rise to hazards. Output dependency can give rise to a Write After Write (WAW) hazard, and antidpendency a Write After Read (WAR) hazard.
We will now look at the next major problem that arises with superscalar systems. If we build a naive implementation which completes instructions out of order, we will obviously end up writing instruction results out of order. Initially, this may not seem to be a problem – provided all the correct results are in the right place when a program completes (recognizing that many programs do not `complete' as such), then so what?
The problem arises because program execution can be interrupted – either by external interrupts (devices, clocks etc.) or by exceptions (runtime errors, virtual memory page faults etc.). If the source of the interruption is some terminal error, then it may not be a big problem (except for debugging). However, suppose it is one of the many forms of interruption which is supposed to lead ultimately to program execution being restarted? If we have written results out of order, we may be in the situation where, say, instructions i1 and i3 have completed, but i2 has not. This leads to two problems.
Recall that they are describing the programmer`s view of the architecture that should not reference any particular implementation, and think about how you reason about the behaviour of your own programs. For example, the P6 architecture predates the concept of superscalar. A correct implementation of such an architecture must maintain this illusion of sequential execution. That is, if program execution is interrupted at some point p, all instructions before p must be complete, and all after not yet started.
The above points require that we are able to retrieve a precise architectual state – that is, a processor state that corresponds exactly to some state of the (implementation independent) architecture. Sometimes we say, equivalently, that the processor has precise exceptions. In some circumstances, as we shall see, this requirement can be relaxed.
The basic solution to these problems is: always write results to architectural components (i.e. programmer-visible registers and memory) in the original program order. It seems at first that these restrictions must entirely remove the point of superscalar implementations – what is the point of speeding up execution by computing results out of order, if we must wait to write them in program order? The key point is the word architectural – as we shall see in chapter 8, the massively-useful reorder buffer comes to our rescue.
We mentioned above that there was an exception to the case where a precise architectural state was necessary. Simplifying somewhat there are two conditions that must be met:
The typical exception that falls into this category is an error generated by a floating point operation, which is almost certain to be terminal (e.g. divide by zero). Also floating point operations take more cycles in general than integer ones, so they potentially put more pressure on dynamic scheduling. For example, if we are required to wait to see if an FP instruction is going to cause an exception before issuing another instruction that will overwrite one of the FP's operands, then we may cause a substantial delay in a pipeline. Such exceptions are (naturally) called imprecise, and potentially performance-enhancing for FP operations. One minor irritation is the IEEE 754 standard for floating point effectively requires precise exceptions, and it is currently unacceptable in most cases not to implement the IEEE standard. Also debugging code with imprecise exceptions is difficult. Some RISC processors that allow imprecise exceptions (DEC/Compaq/HP Alpha, IBM's POWER) have chosen to implement a separate precise mode, which runs much more slowly, for debugging and standards-compliance.
Doing some simple arithmetic on a processor like the P6-derived Pentiums (see chapter 11) would lead you to believe that it was completing three instructions per cycle, and had perhaps 40 instructions in various stages of execution at any one time. In practice, these figures are wildly optimistic. The actual analysis of the performance benefits of particular implementation techniques are complex: Hennessy & Patterson devote considerable space to it. We will not address this in any detail, beyond looking at the limiting factors and some actual performance statistics. There are three main implementation limitations that we must consider.
In addition, there is (the program-dependent) true data dependency, which current implementations cannot do anything about. How much of an issue this is depends on the problem domain. For example, in many graphical applications, computations (for individual pixels, or polygons) are quite independent of their neighbours. This means potentially very high levels of parallelism, and is one reason why specialised parallel graphics processors (like ATI and Nvidia) and vector processing instruction sets have been developed. For example, the SSE (and the previous MMX) for Pentium-series processors, and Altivec for the PowerPC.
If we want to see some actual performance figures, Hennessy & Patterson consider a degree 4 PowerPC processor with six execution units. In principle, it should be able to manage 4 instructions per cycle. In practice, it manages to fetch about that many (slightly less) but can only commit about 1.25 per cycle. This is not untypical of real superscalar processors: it is fairly clear that a lot of work is currently going into gaining very small improvements and perhaps we need to look at something different (eg VLIW — chapter 10).