In the 1960s techniques were developed by Thornton (working on the Control Data CDC6600) and Tomasulo (working on floating point for the IBM 360/91) for resolving dependencies and speeding up machine execution – in particular, they were aimed at data dependency, antidependency and output dependency. Although old, these machines were considerably ahead of their time – many of the techniques they used were `rediscovered' and used in later pipelined and superscalar processors. Neither was superscalar, though both were pipelined, and did permit a degree of out-of-order execution. Note: in describing both algorithms, we will assume a modern RISC load-store architecture (the 6600 was in fact load-store).
Thornton's technique was quite simple and centralised – relying on a piece of hardware called a scoreboard to keep track of the progress of instructions, and to decide whether to allow further instructions to proceed. It resolved all forms of data dependency at different stages of instruction execution. Unlike the more sophisticated Tomasulo's Algorithm, which `resolved' output and antidependencies by getting rid of them completely, avoiding stalls (by renaming essentially), Thornton's Algorithm held up execution in all cases. The basic structure of execution is as follows:
Note that this effectively deals with output dependency (WAW hazards), since if an instruction will write to the same destination address as an earlier one that has yet to complete, it will not be issued.
This effectively deals with true data dependency (RAW hazards) since all writes to an instruction's operands must be complete before it is allowed to read them.
This deals with antidependency (WAR hazards) when instructions are executed out of order, since data for preceding instructions cannot be overwritten.
Note that in the above list `pipeline stalls' does not necessarily mean the entire processor stalls: in a superscalar machine it is more likely to be just that part of the pipeline concerned with a single execution unit.
Thornton's Algorithms is highly centralized. The whole process centres around the scoreboard, where the hardware keeps track of the status of each functional unit and register. The details of what gets stored in the scoreboard depends on the sophistication of the machine. For example, if you allow out-of-order execution you need extra information about the destination registers of instructions that you have skipped over (because of dependencies) and will execute later. If you don't allow out-of-order, you can leave this out.
We will consider an example adapted and simplified from Hennessy & Patterson. Consider a machine with the following functional units:
Floating point Multiplies and Divides are assumed to take `significantly' longer than other operations (exactly how much is not important).
Now consider the following code sequence.
LOAD R6 A+B ; Load R6 from address A+B (1) LOAD R2 C+D ; Load R2 from address C+D (2) MUL R1, R2, R4 ; R1 = R2 * R4 (3) SUB R8, R6, R2 ; R8 = R6 - R2 (4) DIV R10, R1, R6 ; R10 = R1 / R6 (5) ADD R6, R8, R2 ; R6 = R8 + R2 (6)
We will assume that A, B, C and D are unrelated to the other registers used. There are numerous dependencies between these instructions, which may or may not become hazards.
Here is a (not guaranteed exhaustive) list:
The information a scoreboard will typically keep track of includes: status of instructions (i.e. which stage listed in section 9.2 has been reached); the status of the functional units; and the status of the registers (specifically which ones have been reserved for writing).
The tables in figures 9.1 & 9.2 show the status of the scoreboard when all but the last instruction have been issued. Note that in practice we would not retain information about instructions once their results have been written. In figure 9.2, the information we store for each unit is respectively: whether it is busy or not; the operation being performed; the destination register; and for each of the two operands, the source register; whether it is available or not; and the unit the result will come from (if applicable).
Instruction Issued? Ops Read? Executed? Result Written? LOAD R6 A+B yes yes yes yes LOAD R2 C+D yes yes yes MUL R1, R2, R4 yes SUB R8, R6, R2 yes DIV R10, R1, R6 yes ADD R6, R8, R2
Unit Busy? Op Dest Src 1 Ready? Unit Src 2 Ready? Unit Integer Yes LOAD R2 C – D – Mult Yes MULT R1 R2 No Integer R4 Yes Add/Sub Yes SUB R8 R6 Yes R2 No Integer Div Yes DIV R10 R1 No Mult R6 Yes
We can immediately see from the tables in figures 9.1 & 9.2 table that:
Note that the WAW hazard between instructions LOAD (1) and ADD (6) is not an issue because (1) has completed and (6) has not yet been issued.
We use `-' to mark operands we have read: in practice, we would not distinguish operands that are not ready and ones that have been read.
Figures 9.3 & 9.4 show the situation later in the execution cycle.
Instructionn Issued? Ops Read? Executed? Result Written? LOAD R6 A+B yes yes yes yes LOAD R2 C+D yes yes yes yes MUL R1, R2, R4 yes yes yes SUB R8, R6, R2 yes yes yes yes DIV R10, R1, R6 yes ADD R6, R8, R2 yes yes yes
Unit Busy? Op Dest Src 1 Ready? Unit Src 2 Ready? Unit Integer No Mult Yes MULT R1 R2 – R4 – Add/Sub Yes ADD R6 R8 – R2 – Div Yes DIV R10 R1 No Mult R6 Yes
Now, registers R1, R6 and R10 are reserved for writing by pending instructions. The DIV (5) instruction is waiting for a result from the MUL (3), which has finished but not yet written its result to R1 (we are assuming no forwarding). Until DIV has read its operands, the ADD cannot write its result. (Actually, the destination register for ADD (6) – R6 – is available, but the other operand for DIV – R1 – is not.)
Finally, the tables in figures 9.5 & 9.6 show the situation when all instructions have finished executing, though DIV (5) has not yet written its result.
Instruction Issued? Ops Read? Executed? Result Written? LOAD R6 A+B yes yes yes yes LOAD R2 C+D yes yes yes yes MUL R1, R2, R4 yes yes yes yes SUB R8, R6, R2 yes yes yes yes DIV R10, R1, R6 yes yes yes ADD R6, R8, R2 yes yes yes yes
Unit Busy? Op Dest Src 1 Ready? Unit Src 2 Ready? Unit Integer No Mult No Add/Sub No Div Yes DIV R10 R1 – R6 –
A more sophisticated alternative to Thornton's Scoreboarding is Tomasulo's Algorithm which uses the concept of Reservation Stations. Reservation stations are closely-related to register renaming systems. You can almost think of them as individual reorder buffers on the inputs of functional units.
Unlike Thornton's scheme, Tomasulo's Algorithm is not based on a centralised control system, but is instead distributed. Also, Tomasulo's Algorithm is able to eliminate WAW and WAR hazards arising from output and antidependencies, instead of stalling. (Actually, it can only do this when resources are available and there is always a practical limit on the resources we can provide.)
The disadvantage of Tomasulo's Algorithm is that it is much more complex to implement. Nevertheless, pressure for performance is such that some variant of Tomasulo's Algorithm is widely used in modern processors. Note though that the basic algorithm does not provide for speculation beyond branches. Therefore, we must add some branch prediction scheme together with hardware to deal with speculative execution of instructions (for example, a reorder buffer — see chapter 8).
The idea is that each functional unit has a queue of reservation stations which are used to hold instructions and operands prior to execution, as well as information on those instructions and operands (e.g. if operands are available, and if so where from).
A typical example is shown in figure 9.7. (Note that some details concerned with memory access have been left out to make the diagram clearer).
Whenever an instruction retires, the result is sent simultaneously not only to its destination register, but also to any other reservation stations requiring that result. This has the effect of speeding up execution by bypassing the architectural registers – there is no need to wait for results to be stored, they can be used immediately. However, it can also eliminate antidependency and output dependency.
Antidependency and output dependency are only problems in so far as they cause conflicts between available registers. For example, suppose that two instructions I1 and I2 write to the same register R1 (output dependency). We can eliminate this by simply making one of them write to a different register. If there are enough registers available, the compiler can (and should) do this. However, with a legacy architecture, this may not be the case. With reservation stations, the results of I1 and I2 would both be sent to different reservation stations – both I1 and I2's results would be sent to the reservation stations containing those instructions that needed them. If I2 and I1 are executed out of order, it is a relatively simple matter to ensure that only the result of I2 actually gets written back to the architectural destination register – unless an error occurs, I1's result need not be written there at all.
The steps involved are:
If the operands are available in architectural registers, read them in to the reservation station. Otherwise, mark which functional unit (and corresponding reservation station) will (eventually) produce them. When they are available, read them in.
This step effectively removes WAW and WAR hazards by not writing results to architectural registers, and hence not overwriting later results or operands if execution is out of order.
This eliminates RAW hazards because we wait for the required operands to become available. If more than one instruction becomes available for a particular functional unit simultaneously, usually it is possible to pick one arbitrarily.
An exception to this is loads and stores, where we must ensure that pairs of load/store instructions to the same memory address are maintained in order. (In fact, sequences of loads without any intervening stores can be reordered, but we would hope that a compiler would not generate more than one load to the same address without an intervening store unless it was very short of registers.)
This writing is usually managed by means of a common data bus that connects the registers and reservation stations. (In practice, it is usual to only write back the last in a sequence of results to the architectural registers unless an exception occurs.)
Notice that unlike Thornton's algorithm, there is no checking, or stalling, for antidependency and output dependency. The renaming that (effectively) takes place when results are copied to the reservation stations that will use them eliminates these forms of dependency.
We will now consider the same example code sequence we used with Tomasulo's Algorithm in section 9.2.2. To take advantage of the extra parallelism that we will now find, we will add a second Add/Sub unit. We will start off at the same stage of execution as figure 9.1 above.
See figures 9.8 and 9.9. The columns in figure 9.9 are similar to those for Thornton's Algorithm (e.g. in figure 9.2), except that now we store either the architectural source of the operand (`Src') or the reservation station it is coming from (`Res').
Instruction Issued? Executed? Result Written? LOAD R6 A+B yes yes yes LOAD R2 C+D yes yes MUL R1, R2, R4 yes SUB R8, R6, R2 yes DIV R10, R1, R6 yes ADD R6, R8, R2 yes
Unit Busy? Op Dest Src 1 Res 1 Src 2 Res 2 Integer Yes LOAD R2 C D Mult Yes MULT R1 Integer R4 Add/Sub1 Yes SUB R8 R6 Integer Add/Sub2 Yes ADD R6 Add/Sub1 Integer Div Yes DIV R10 Mult R6
Notice that now we have been able to issue ADD (6) because the WAR hazard (antidependency) with DIV (5) has been resolved: the value of R6 has been read into the revervation station being used by DIV (5). ADD (6) will be able to proceed with execution as soon as the result of SUB (4) is available. However, DIV (5) must wait for MUL (3) to finish execution before its other argument will become available.
Figures 9.10 and 9.11 show the situation near the end of execution.
Instruction Issued? Executed? Result Written? LOAD R6 A+B yes yes yes LOAD R2 C+D yes yes yes MUL R1, R2, R4 yes yes SUB R8, R6, R2 yes yes yes DIV R10, R1, R6 yes ADD R6, R8, R2 yes yes yes
Unit Busy? Op Dest Src 1 Res 1 Src 2 Res 2 Integer No Mult Yes MULT R1 R2 R4 Add/Sub1 No Add/Sub2 No Div Yes DIV R10 Mult R6
The difficulty of effectively implementing algorithms like Thornton's and especially Tomasulo's has led researchers to try and find alternatives that do not require such large quantities of complicated control hardware. We will look at one (VLIW) in the next chapter.