So far, we have considered pipelines for the general instruction execution cycle. However, the execute stage of instruction execution is generally itself pipelined. Typically, modern microprocessors will have a number of functional units, responsible usually for integer, floating point, load/store and branch operations. (Not all of these are always present — for example, branch may be combined with integer, floating point may be absent. Some units may be duplicated. We will return to duplicated functional units when we consider superscalar processors in chapter 7.)
Pipelines without loops are relatively easy to design and control. However, many pipelines need loops if the hardware requirements are to be kept to a reasonable level — particularly arithmetic pipelines. For example, consider a floating point pipeline performing a multiplication. Typically, towards the end of the process the result must be renormalised. That is, the mantissa must be shifted (and the exponent correspondingly adjusted) so it is of the form: 0.1XXX..., where X is any binary value. After renormalisation, the result must be rounded (typically using one of a number of different, and selectable, rounding modes). Under some circumstances, for example when dealing with denormal numbers, the result may need to be renormalised again.
As another example, consider floating point addition. Before we add two FP numbers we must shift one of them so that their exponents are the same (if we are to get the correct answer). If the exponents differ substantially, this could involve a significant amount of shifting: possibly more than the shifter we feel justified in including in our processor is capable of managing in one cycle.
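The alignment step can be sketched in software (the function name and the integer-mantissa representation here are illustrative, not from the text): the operand with the smaller exponent has its mantissa shifted right until the exponents match.

```python
def align(m1, e1, m2, e2):
    """Align two (mantissa, exponent) pairs so the exponents agree.

    The operand with the smaller exponent is shifted right by the
    exponent difference; a hardware shifter may need several cycles
    to do this if the difference is large."""
    if e1 < e2:
        m1 >>= (e2 - e1)   # shift the smaller-exponent mantissa right
        e1 = e2
    else:
        m2 >>= (e1 - e2)
        e2 = e1
    return m1, e1, m2, e2
```

A difference of several bits means several positions of shift, which is exactly the case where a small shifter must loop over multiple cycles.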
Both these examples will involve pipeline loops. Also, because we are assuming there is a single FP execution unit, we must ask: what happens when we try to do an add followed by a multiply, or some other combination? How do we manage and control the contents of the pipeline? Of course, we could simply use a more capable (though more expensive) shifter, or in the first example we could duplicate the normalisation unit. We could also separate out the floating point multiplication and addition units. However, these steps are expensive and not all processors can justify them. Even when they are justifiable, there may be other, yet more complex, operations present (division, square root). In general then, we must be prepared to deal with pipelines that (a) have loops and (b) have multiple paths through their component parts.
Before we proceed, some notation. In what follows, we will use the terms operation and computation more or less equivalently. We are mainly concerned here with the fragment of the complete pipeline that corresponds to the execution (EX) phase of the complete microprocessor pipeline. Most of the time we will (somewhat sloppily) refer to `the pipeline', meaning just the EX part. However, occasionally we will also need to refer to the complete microprocessor pipeline, and in those cases we will try to be more careful.
The obvious problem is collisions between the stages — one stage being required by more than one instruction simultaneously. We will look at a technique for managing pipelines that can be implemented in hardware relatively cheaply. This algorithm is credited to E S Davidson, and like many of the techniques now used in microprocessors it has its roots in problems first encountered when developing supercomputers in the 60s and 70s.
Consider the pipeline shown in figure 6.1, with the first two stages used twice.
If we start an instruction in the pipeline, at what future time can we start another instruction? We can construct a reservation table which shows which stages in the pipeline are in use in any given time cycle, assuming that just a single instruction was started at time 1 — see figure 6.2.
      T1   T2   T3   T4   T5   T6
S1     X                   X
S2          X                   X
S3               X
S4                    X
Suppose having started one computation, we start a new computation k clock cycles later — we can superimpose a new copy of the reservation table on figure 6.2, offset by k clock cycles. If any crosses overlap, then k is a forbidden latency. Otherwise, k is an allowable latency. That is, we can start a new computation at that point. Clearly, that leads to a further question: once we have started a second computation, and the first is still in the pipeline, when/if can we start a third?
We can construct a collision vector from the reservation table that forms the basis of a very simple control algorithm. Consider an n-bit vector: Cv = cn, cn−1, ⋯, c3, c2,
where ci = 1 if time i is a forbidden latency, and ci = 0 otherwise. Note that we omit time 1 because the collision vector assumes that an operation has already started at time 1.
In the case of figures 6.1 & 6.2, Cv = 1000.
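The forbidden latencies, and hence the collision vector, can be derived mechanically. A sketch, taking the stage-usage times that figure 6.2 implies (S1 busy at times 1 and 5, S2 at 2 and 6, S3 at 3, S4 at 4):

```python
# Reservation table of figure 6.2: stage -> times it is busy,
# assuming a single operation started at time 1.
table = {"S1": {1, 5}, "S2": {2, 6}, "S3": {3}, "S4": {4}}

# A latency k is forbidden if some stage is busy at both t and t+k:
# a second operation started k cycles later would collide there.
forbidden = {b - a for row in table.values()
             for a in row for b in row if b > a}

# Collision vector, written c_n ... c_2 as in the text (the bit for
# latency k corresponds to starting at time k+1).
cv = "".join("1" if k in forbidden else "0"
             for k in range(max(forbidden), 0, -1))
# forbidden == {4} and cv == "1000", as in the text
```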
That is, assuming an operation starts at time 1, we can start new operations at times 2, 3 and 4, which correspond to the three zeros in the collision vector reading from the right. However we cannot start a new operation at time 5, corresponding to the 1. After time 5, any operation that started at time 1 cannot affect any subsequent operations. Therefore, in this case, there is no need to make Cv any longer. (Although we could add one or more leading zeros if we wished: it will not affect the algorithm described below, apart from requiring a bit more [unnecessary] hardware.)
We can use the collision vector plus some very simple hardware to implement a pipeline management scheme. We need a shift register of length n, a set of n OR gates and some simple control circuitry. These are arranged as shown in figure 6.3.
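A minimal software sketch of this hardware (assuming the 4-bit collision vector 1000 from figure 6.2): each cycle the controller tests the rightmost bit of the shift register, shifts it right, and ORs in Cv when a new operation is granted.

```python
CV = 0b1000   # collision vector from figure 6.2 (bits c5 c4 c3 c2)

def step(reg, want_start):
    """One clock cycle of the controller of figure 6.3.

    A start is granted iff one is requested and the rightmost bit
    of the shift register is 0; the register is then shifted right,
    with CV ORed in on a grant."""
    granted = want_start and (reg & 1) == 0
    reg >>= 1
    if granted:
        reg |= CV
    return reg, granted

# Greedy schedule: try to start a new operation on every cycle.
reg, started = 0, []
for t in range(1, 10):
    reg, granted = step(reg, want_start=True)
    if granted:
        started.append(t)
# started == [1, 2, 3, 4, 9]: four starts, then a four-cycle gap,
# reproducing the sequence of register values in figure 6.4
```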
Let's look at what happens in the above pipeline when we start a new computation every cycle for as long as possible in figure 6.4.
Initial   Granted?   Start?   Shifted   ORed
0000      Yes        Yes      0000      1000
1000      Yes        Yes      0100      1100
1100      Yes        Yes      0110      1110
1110      Yes        Yes      0111      1111
1111      No         -        0111      -
0111      No         -        0011      -
0011      No         -        0001      -
0001      No         -        0000      -
0000      Yes        Yes      0000      1000
In figure 6.5, we do not start new operations on every possible cycle.
Initial   Granted?   Start?   Shifted   ORed
0000      Yes        Yes      0000      1000
1000      Yes        No       0100      -
0100      Yes        Yes      0010      1010
1010      Yes        No       0101      -
0101      No         -        0010      -
0010      Yes        Yes      0001      1001
1001      No         -        0100      -
0100      Yes        Yes      0010      1010
In these examples, there are some times at which it is not possible to start new operations. What happens if a new operation arrives at one of those times? It has to wait — either by stalling the whole microprocessor pipeline, or by waiting in a queue, or buffer, at the start of the execution unit pipeline. Given we are likely to be dealing with multiple execution units, if a bit of a backlog builds up at one we may be able to clear it later when more instructions are directed at the others, possibly without stalling (much). We will return to this later when we consider scheduling algorithms, in chapter 9.
In both the examples above, we have managed to start four new operations in eight cycles of the pipeline. We can ask ourselves some questions. (1) Is there a theoretical limit on how many operations we can start per cycle? (2) Will our simple scheduling algorithm always achieve the best possible throughput? (3) Can every pipeline, if carefully scheduled, actually reach the theoretical limit?
The answer to (1) is that, yes, there is a theoretical limit. For a given pipeline's reservation table, find the row with the maximum number of entries. If there are D entries, then the maximum number of operations that can be started per clock cycle is 1/D. In the example from figure 6.1, D=2, so the limit is 1/2. Since we can start four operations in eight cycles, this pipeline meets the theoretical limit.
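The bound can be read straight off the reservation table. A sketch, again assuming the stage-usage times of figure 6.2:

```python
# Reservation table of figure 6.2: stage -> times it is busy.
table = {"S1": {1, 5}, "S2": {2, 6}, "S3": {3}, "S4": {4}}

# The busiest stage is occupied for D cycles per operation, so it can
# serve at most one new operation every D cycles.
D = max(len(times) for times in table.values())
bound = 1 / D   # here D = 2, so at most 1/2 an operation per cycle
```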
The answer to (2) is that no, it is not guaranteed that for every pipeline our simple scheduling algorithm will be optimal in terms of throughput. If we wish to maximise throughput (and we might not), then we need to construct a reduced state diagram — see section 6.2.3.
The answer to (3) is again no — there are some pipelines where even if we are careful with operation scheduling, the theoretical throughput limit cannot be reached. However, this can be fixed, as we'll see in section 6.3.
While easy, the simple control algorithm above is not always optimal. In some pipelines, we can at least in principle increase throughput by being more careful with scheduling. To do this, we need to construct a graph of all the possible operation sequences through the pipeline. This is called a reduced state diagram, and that for figure 6.1 is shown in figure 6.6.
The nodes are labelled with the shift register contents, and the arcs with the number of cycles required to move from one node to the next. Although it looks quite complex, such diagrams are relatively straightforward to construct — though the process is error-prone if done manually. We can inspect this diagram to identify minimal latency cycles — cycles with the shortest waiting times.
In this case, there are several. One is the simple `greedy' cycle where we simply start new operations as soon as possible:
1000 → 1100 → 1110 → 1111 → 1000
Of course, in our case, we already knew this. There is another such cycle:
1010 → 1101 → 1011 → 1001 → 1010
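Both cycles can be checked mechanically. In this sketch (the helper `next_state` is illustrative), starting the next operation k cycles after the previous one is allowed iff bit k-1 of the register is zero, and the new register is the old one shifted k places with Cv ORed in:

```python
CV = 0b1000   # collision vector from figure 6.2

def next_state(state, k):
    """Move along one arc of the reduced state diagram: from one
    granted start, grant the next start k cycles later."""
    assert (state >> (k - 1)) & 1 == 0, "latency is forbidden"
    return (state >> k) | CV

# The non-greedy minimal-latency cycle quoted above, with latencies
# 1, 2, 3, 2 between successive starts:
s = 0b1010
for k in (1, 2, 3, 2):
    s = next_state(s, k)
# s is 0b1010 again: a cycle giving four starts in 1+2+3+2 = 8 cycles,
# i.e. throughput 1/2, the same as the greedy cycle
```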
The effort involved in this is not always worth it — perhaps not usually worth it. We would have to abandon a very simple and general control algorithm for one that is quite complex and much less easily modified if the pipeline later changes (because the details of its implementation may be much more specific to a particular pipeline — depending on how we choose to implement it). Also, we are (as is usual in pipelines) trading throughput against latency.
Normally, looking at the microprocessor pipeline as a whole, this is a trade-off that is heavily biased towards throughput. However, this may not be the case with a fragment of a pipeline. It may simply be the case that there is not enough demand for a particular execution unit pipeline to need to maximise throughput.
In addition, the performance of the complete pipeline must be considered. For example, suppose we have a relatively simple pipeline with no means of buffering at the execution unit. Now suppose that an operation arrives at a time that is non-optimal for throughput. If we delay it to an optimal time, we are definitely stalling the entire pipeline. The hope is that this will mean fewer future stalls; however, we cannot know this for sure because we probably do not know when operations will arrive in the future.
On the other hand if we are buffering operations (and this is very likely in advanced implementations because it is a consequence of a common scheduling algorithm we'll meet in chapter 9) then we do stand a chance of knowing what operations are required in the near future because at least we can see what is in the buffer, and make a judgement based on that information. Even that is not guaranteed to work: consider the case where there is only one operation waiting, and we choose to schedule it at a non-optimal time — and then a lot more operations arrive soon after.
Recall the theoretical bound on throughput discussed in section 6.2.2. If a pipeline cannot reach the maximum throughput that bound would lead you to expect, it is possible to insert delays into the pipeline to improve performance. Consider the reservation table in figure 6.7.
     1   2   3   4   5   6   7
A    X       X   X
B        X               X
C                    X   X
D                            X
The collision vector is: Cv = 1111
The reduced state diagram (figure 6.8) shows a minimal latency of 5 cycles. However, the reservation table has one row with 3 entries, leading us to believe that it should be possible to get a result out every 3 cycles…
The way we do this is illustrated in table 5, table 6, table 7 and table 8.
For each column of the reservation table in turn, we mark both the operation itself (with X), and all those entries which we wish to remain free for subsequent initiations (with R). At time 1, in table 5, we have marked every third cycle (times 4 and 7), so we can initiate later operations at these times. Notice that we have marked time 4, and on the initial reservation table (table 4), pipeline unit A is used at time 4.
Table 5, Time 1
In table 6 we continue the process for time 2.
Table 6, Time 2
In table 7 we do the same for time 3.
Table 7, Time 3
We continue for each subsequent column (time) of the table. There are no problems except at time 4 (table 8), where we want to use pipeline unit A but it is reserved at that time: we want A to be free then so that a new operation can start. We must therefore delay this use of A. Effectively, this means inserting a new Delay (NO-OP) unit in the pipeline: we direct the result of using A at time 3 to the delay unit at time 4, and send it back to A at time 5, when A is no longer busy.
Table 8, Time 4
It turns out we can complete the rest of the table with no further problems — the final result is shown in table 9.
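As a check, here is a sketch using one placement of the reservation table entries consistent with the narrative (the exact columns of tables 4 to 9 are not reproduced here, so this layout is an assumption): after the delay is inserted, no unit is busy at two times whose difference is a multiple of 3, so initiations every third cycle never collide.

```python
# Hypothetical modified table after inserting the delay stage:
# A's time-4 use is rerouted through the delay and returns at time 5.
modified = {"A": {1, 3, 5}, "Delay": {4},
            "B": {2, 6}, "C": {5, 6}, "D": {7}}

# Superimposing copies of the table every 3 cycles is safe iff no
# unit is busy at two times differing by a multiple of 3.
ok = all((b - a) % 3 != 0
         for row in modified.values()
         for a in row for b in row if b > a)
# ok is True: the delayed pipeline sustains one start every 3 cycles,
# meeting the 1/3 bound set by the busiest unit
```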
Table 9 — Final table.
As we have said, it is common practice to combine a number of arithmetic functions in a single pipeline — each function uses a different sequence of stages of the same pipeline. For example, consider a combined floating point multiplier and adder. It is relatively straightforward to modify our simple scheduling algorithm to handle this. First, we must construct collision vectors for every sequence of two of the possible operations. In this case, there are four: multiply followed by multiply, multiply followed by add, add followed by multiply, and add followed by add:
The hardware to implement this requires two shift registers — one is used to grant permission to perform a multiply, and the other for an add: see figure 6.9.
Suppose we wish to perform a multiply. We can do so provided the rightmost bit of the grant-multiply shift register is zero. After we have granted the multiply, we then OR the shift register with either the Add followed by Mult or the Mult followed by Mult collision vector, depending on the type of the previous operation to enter the pipeline. (Note that in some descriptions of the algorithm used here, particularly that in Stone's book which deals with multi-function pipelines, collision vectors are represented the other way round.)
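One common software formulation of this two-register scheme can be sketched as follows. The cross collision vector values below are illustrative placeholders, not the ones from the text's figures: when an operation of type X is granted, the register for each type Y is ORed with the X-followed-by-Y vector.

```python
# Hypothetical cross collision vectors, indexed (previous op, next op).
# Real values would be derived from the combined reservation tables.
CV = {("mul", "mul"): 0b1010,
      ("mul", "add"): 0b0011,
      ("add", "mul"): 0b0110,
      ("add", "add"): 0b0101}

class MultiFunctionController:
    """Two shift registers, one granting multiplies, one granting adds."""

    def __init__(self):
        self.reg = {"mul": 0, "add": 0}

    def cycle(self, request=None):
        """Advance one clock, optionally requesting an operation type.

        The request is granted iff the rightmost bit of that type's
        register is zero; both registers shift right every cycle, and
        on a grant of type X each register Y is ORed with CV[(X, Y)]."""
        granted = request is not None and (self.reg[request] & 1) == 0
        for f in self.reg:
            self.reg[f] >>= 1
        if granted:
            for f in self.reg:
                self.reg[f] |= CV[(request, f)]
        return granted
```

With these placeholder vectors, a multiply granted on the first cycle blocks a following add until the Mult-followed-by-Add constraints have shifted out of the grant-add register.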