Pipelining is a natural method of attempting to speed up program execution and the most common form of parallelism around. Its use in computers as a performance enhancing mechanism can be traced back to the late 50s/early 60s, which is remarkably close to the start of the computing era. Most current reasonably state-of-the-art machines are quite heavily pipelined. A common aim in processor design is to increase instruction level parallelism (ILP), though a more useful measure is Cycles per Instruction (CLI) – the lower the better.
The principle of a pipelined processor is that of the assembly line — many components are in different stages of construction at once. The stages of the assembly line are arranged so they all take the same amount of time, and once every time cycle, a new product drops off the end — which may have taken hours to progress from start to finish.
This is a critical point: the length of time an individual instruction takes to execute (latency) is irrelevant. What is important is the total throughput; the length of time a whole block of instructions takes to execute. In practice, an individual instruction is likely to take longer to execute, because of the overheads imposed by implementing pipelining.
We divide the execution of an instruction up into a number of stages, and try to overlap them — when stage i for instruction j has finished, we (a) start stage i+1 for j, and (b) start i for j+1 — or at least we try! See figure 4.1 for a graphical representation of the potential speedup with n stages.
The following five-stage example assumes a load-store RISC architecture.
This five-stage pipeline will potentially produce results five times as fast as a non-pipelined implementation. Five stages is a [commonly-used] example — the (vanilla) Pentium uses five stages; the P6 (core of the Pentium II, III and variants) uses 14; the Pentium 4 uses in excess of 20. There are a number of points to concern us when deciding how many stages to use.
You don't want to keep parts of the processor waiting because some stages are slow. The obvious things to (try) to do in such cases are (a) try to make sure each stage takes the same amount of time in the first place; and if necessary (b) to break down the slow stage into smaller parts. This assumes that we can simply ensure (a) that all stages will always take the the same amount of time; and (b) that the same number of stages is always appropriate.
Consider, say, an integer add and a floating point (FP) multiply. If we use the same number of stages in each case, the FP multiply will take much longer than the add for the EX stage. In practice, in an advanced design we would probably pipeline the execution phase of FP operations, meaning more pipeline stages in this case. All this makes life complicated, and we will return to it when we consider high-performance pipelines in chapter 6.
It is necessary to `tune' the stages so that there are no resource conflicts between stages. For example, Instruction Fetch and Operand Fetch both may require memory access simultaneously. Separating the instruction cache and data cache would help here, and this is a big reason why this is now popular. We will return to the subject of caches in chapter 16.
In general, resource conflicts can be resolved by resource duplication: for example, while one instruction is being fetched, an address computation requiring the program counter (PC) may be required for another (e.g. a PC-relative memory access). Note that in both these cases, an instruction needs to know the value of the `program counter', but it is a different value for each instruction. In this case, it is usual to have two program counter registers, out of step with each other — one (the fetch program counter, FPC) for the next instruction to be fetched, and the other (the `real' PC) for the instruction being executed.
Although many such conflicts can be resolved like this, it is not possible (or economic) to do so for all of them. In such cases, the two competing demands for the same resource must take turns, and one must wait. In such circumstances, the pipeline halts, or stalls, briefly. Resource conflicts are an example of a hazard which we will look at in more detail in section 4.3 and chapter 5.
Although it might initially seem tempting to make pipelines longer and longer, there are drawbacks (you may have already thought of one one connected with branching — see chapter 5 otherwise). Although there are benefits from lengthening pipelines (the Pentium 4 designers have done so) there are also drawbacks. Essentially, these centre around circumstances in which the pipeline can become empty. If this occurs, it takes a long time to refill, which causes a significant performance penalty. Much effort goes into preventing this as we'll see, but it inevitably sometimes happens.
The term hazard was borrowed from digital electronic design, where it referred to the possible occurrence of an incorrect data value for a short period of time. Consider figure 4.2.
On the face of it the output should always be false. However, this ignores real-life behaviour, where computation time is non-zero. Consider what happens when the input changes from false to true. The signal that does not pass through the NOT gate will be quicker than the one that does, meaning there is a small period of time when both inputs to the AND gate are true, and therefore so is the output of the AND gate. There is a risk (`hazard') of this data value being captured and used instead of the correct one if we are not careful (we do not have time to go into the details of how this capture occurs and how to stop it). Although this is a silly example, the same problem happens in useful circuits. We will use the term hazard to refer to the potential for incorrect computation.
For a pipelined processor, there are three sources of hazard:
In the next section, we will confine ourselves to (a) what RAW is; (b) what must happen to the pipeline when it occurs; and (c) some ways to `fix' it.
Consider the following sequence of instructions, and assume it is being executed by a machine with a five-stage pipeline, as outlined in section 4.2.
ADD R1, R1, R3 ; R1 = R1+R3 SUB R4, R1, R2 ; R4 = R1-R2
The second SUB instruction uses the result of the first ADD instruction. That is, there is a data dependency (sometimes called a true data dependency) between the two instructions.
There are inevitably many such dependencies in programs – and many of them do not concern us, because the instructions are some distance apart. (By that, we mean the first one has finished before the second starts, or at least that the first has passed some `critical stage' before the second starts some other `critical stage'.)
However, in this particular case the SUB will read R1 during its second pipeline stage (ID), and the ADD will write its result to R1 during its fifth pipeline stage (WB). This means that the SUB will try to read R1 while the ADD is still in the EX stage: that is, it is still computing the value that SUB is expecting to be in R1. Unless we take steps to manage this, SUB is going to use the old value of R1. See figure 4.3.
In this case, the dependency, which we can think of as a potential problem turns into a RAW hazard — a real problem that we must do something about. The things we need to do are:
In this example, the potential delay is three cycles, which is serious and something we would like to address. Another way of looking at it is that we are effectively putting a sequence of (in this case three) NO-OPs between the ADD and SUB. Actually, if we really did put the NO-OPs in, then the hardware would have to do nothing. This would involve getting the compiler to identify the hazard and is the approach taken in the early MIPS project — the `without Interlocked Pipeline Stages' means that the processor does not have the hardware to detect and avoid such hazards, but instead relies on the compiler. In a different way, Intel's new IA-64 architecture returns to the idea of relying on the compiler to identify certain situations.
What can we do about this? The first idea is to try to stop it occurring at all – get the compiler to try to avoid generating such sequences by reordering instructions (we look at some possibilities in chapter 14).
For example, maybe we can move some unrelated instructions between the ADD and SUB. The chances are that such a sequence of instruction is the result of a statement like:
There is a tendency to think that the instructions implementing a single statement should be together, but this is not necessary — just a hangover from hand-written assembler programs. A good compiler will try to reorder instructions, but it is hard to completely eliminate RAW hazards: there just are not enough unrelated instructions available because of branches (though we will look at ways around this in chapter 5).
A second strategy is to observe that although the ADD has not yet computed its result by the time the SUB reads its arguments, by the time the SUB comes to actually compute its result, the new value of R1 is known (though it has not yet been stored).
We could add a feedback loop from the output of the execute (EX) stage hardware back to the input. We could then add a bit of hardware to identify cases where such a feedback value can be usefully used instead of the contents of a register provided by the decode (ID) stage to proceed without stalling. Such hardware is actually not that complicated. This is a special case of something called forwarding, or bypassing. In general, we may need a value that does not come from the immediately preceding instruction, but from an earlier one: consider the code sequence above with another instruction
ADD R5, R1, R3 ; R5 = R1 + R3
following it. This also uses the value of R1 computed by the first ADD, which is still not yet stored in R1, but in this case a simple feedback loop will not work (we will get the result of the SUB instead). There is some devious hardware called a reorder buffer which fixes this by generalising forwarding, and solves other problems, which we will look at in chapter 8.
Our simple five-stage pipeline is a bit shorter than those often used in practice, and it is usually the case that we cannot completely eliminate RAW hazard delays using forwarding, but only reduce them. One idea, which has not yet made it into commercial processors, is to `guess', or predict values. This is not as mad as it sounds - as we shall see, prediction has been used for branches very successfully. It also appears in some other areas as well.
How do we predict the value of operands? One way is to pattern match on past values. Suppose in the example above that the code forms part of a loop body, and also suppose that R2 is constant. We could quite easily predict future values in this case. Of course, it is possible to be wrong. However, we have lost nothing because we would have had to wait in any case if we had not predicted the value. But in such a case we must not have written any results to registers/memory until we knew are prediction was correct. (Alternatively, but less commonly, we must be able to `undo' instructions.) The corollary of this is that we need some way to mark instruction execution as provisional, or speculative, until we know otherwise. It turns out that this is one of those other problems that reorder buffers solve.