Previous Up Next

Chapter 9  Dynamic Scheduling

9.1  Introduction

In the 1960s techniques were developed by Thornton (working on the Control Data CDC6600) and Tomasulo (working on floating point for the IBM 360/91) for resolving dependencies and speeding up machine execution – in particular, they were aimed at data dependency, antidependency and output dependency. Although old, these machines were considerably ahead of their time – many of the techniques they used were `rediscovered' and used in later pipelined and superscalar processors. Neither was superscalar, though both were pipelined, and did permit a degree of out-of-order execution. Note: in describing both algorithms, we will assume a modern RISC load-store architecture (the 6600 was in fact load-store).

9.2  Thornton – Scoreboarding

Thornton's technique was quite simple and centralised – relying on a piece of hardware called a scoreboard to keep track of the progress of instructions, and to decide whether to allow further instructions to proceed. It resolved all forms of data dependency at different stages of instruction execution. Unlike the more sophisticated Tomasulo's Algorithm, which `resolved' output and antidependencies by getting rid of them completely, avoiding stalls (by renaming essentially), Thornton's Algorithm held up execution in all cases. The basic structure of execution is as follows:

  1. Issue — The scoreboard allows an instruction to be issued if (a) its target function unit is free; and (b) its destination register is free. If not, the pipeline stalls until (a) and (b) are true, when it proceeds.

    Note that this effectively deals with output dependency (WAW hazards), since if an instruction will write to the same destination address as an earlier one that has yet to complete, it will not be issued.

  2. Read Operands — The scoreboard allows an instruction to read its operands provided they're not being written to by an instruction that is either already executing, or has been issued for execution (step 1). Again, the pipeline stalls if this is not the case.

    This effectively deals with true data dependency (RAW hazards) since all writes to an instruction's operands must be complete before it is allowed to read them.

  3. Execution — Once operands are read, this can proceed with no possible ill effects.
  4. Write Back — Once execution has finished, an instruction can write back its result provided that no instruction that precedes it in program order that has not yet read its operands, would have (one of) those operands overwritten as a result. Once again, the pipeline stalls if this is not the case.

    This deals with antidependency (WAR hazards) when instructions are executed out of order, since data for preceding instructions cannot be overwritten.

Note that in the above list `pipeline stalls' does not necessarily mean the entire processor stalls: in a superscalar machine it is more likely to be just that part of the pipeline concerned with a single execution unit.

9.2.1  The Scoreboard

Thornton's Algorithms is highly centralized. The whole process centres around the scoreboard, where the hardware keeps track of the status of each functional unit and register. The details of what gets stored in the scoreboard depends on the sophistication of the machine. For example, if you allow out-of-order execution you need extra information about the destination registers of instructions that you have skipped over (because of dependencies) and will execute later. If you don't allow out-of-order, you can leave this out.

9.2.2  Example

We will consider an example adapted and simplified from Hennessy & Patterson. Consider a machine with the following functional units:

Floating point Multiplies and Divides are assumed to take `significantly' longer than other operations (exactly how much is not important).

Now consider the following code sequence.

    LOAD R6 A+B     ; Load R6 from address A+B    (1)
    LOAD R2 C+D     ; Load R2 from address C+D    (2)
    MUL R1, R2, R4  ; R1 = R2 * R4                (3)
    SUB R8, R6, R2  ; R8 = R6 - R2                (4)
    DIV R10, R1, R6 ; R10 = R1 / R6               (5)
    ADD R6, R8, R2  ; R6 = R8 + R2                (6)

We will assume that A, B, C and D are unrelated to the other registers used. There are numerous dependencies between these instructions, which may or may not become hazards.

Here is a (not guaranteed exhaustive) list:

The information a scoreboard will typically keep track of includes: status of instructions (i.e. which stage listed in section 9.2 has been reached); the status of the functional units; and the status of the registers (specifically which ones have been reserved for writing).

The tables in figures 9.1 & 9.2 show the status of the scoreboard when all but the last instruction have been issued. Note that in practice we would not retain information about instructions once their results have been written. In figure 9.2, the information we store for each unit is respectively: whether it is busy or not; the operation being performed; the destination register; and for each of the two operands, the source register; whether it is available or not; and the unit the result will come from (if applicable).


InstructionIssued?Ops Read?Executed?Result Written?
LOAD R6 A+Byesyesyesyes
LOAD R2 C+Dyesyesyes 
MUL R1, R2, R4yes   
SUB R8, R6, R2yes   
DIV R10, R1, R6yes   
ADD R6, R8, R2    
Figure 9.1: Scoreboarding: instruction status — all but last instruction issued


UnitBusy?OpDestSrc 1Ready? UnitSrc 2Ready?Unit
IntegerYesLOADR2C D 
MultYesMULTR1R2NoIntegerR4Yes 
Add/SubYesSUBR8R6Yes R2NoInteger
DivYesDIVR10R1NoMultR6Yes 
Figure 9.2: Scoreboarding: functional unit status — all but last instruction issued


We can immediately see from the tables in figures 9.1 & 9.2 table that:

  1. R1, R2, R8 and R10 are busy — so we cannot issue any further instructions that write to these registers because of WAW hazards (see section 9.2).

    Note that the WAW hazard between instructions LOAD (1) and ADD (6) is not an issue because (1) has completed and (6) has not yet been issued.

  2. SUB (4) and MUL (3) are waiting for R2 — which is being written by LOAD (2). Therefore, the RAW hazards between instructions LOAD (2) and MUL (3), and LOAD (2) and SUB (4) are stalling Mult1 and Add/Sub.
  3. LOAD (2) can proceed — there is nothing preventing the result of LOAD (2) being written back (no WAR hazards).
  4. Registers we can write to — we cannot write to any registers that are marked as having operands ready (because we have not yet used their values). Conversely, any registers with operands not ready, or marked `-' in figure 9.2, can be written to.

    We use `-' to mark operands we have read: in practice, we would not distinguish operands that are not ready and ones that have been read.

Figures 9.3 & 9.4 show the situation later in the execution cycle.


InstructionnIssued?Ops Read?Executed?Result Written?
LOAD R6 A+Byesyesyesyes
LOAD R2 C+Dyesyesyesyes
MUL R1, R2, R4yesyesyes 
SUB R8, R6, R2yesyesyesyes
DIV R10, R1, R6yes   
ADD R6, R8, R2yesyesyes 
Figure 9.3: Scoreboarding: instruction status — later in execution cycle


UnitBusy?OpDestSrc 1Ready? UnitSrc 2Ready?Unit
IntegerNo        
MultYesMULTR1R2 R4 
Add/SubYesADDR6R8 R2 
DivYesDIVR10R1NoMultR6Yes 
Figure 9.4: Scoreboarding: functional unit status — later in execution cycle


Now, registers R1, R6 and R10 are reserved for writing by pending instructions. The DIV (5) instruction is waiting for a result from the MUL (3), which has finished but not yet written its result to R1 (we are assuming no forwarding). Until DIV has read its operands, the ADD cannot write its result. (Actually, the destination register for ADD (6)R6 – is available, but the other operand for DIVR1 – is not.)

Finally, the tables in figures 9.5 & 9.6 show the situation when all instructions have finished executing, though DIV (5) has not yet written its result.


InstructionIssued?Ops Read?Executed?Result Written?
LOAD R6 A+Byesyesyesyes
LOAD R2 C+Dyesyesyesyes
MUL R1, R2, R4yesyesyesyes
SUB R8, R6, R2yesyesyesyes
DIV R10, R1, R6yesyesyes 
ADD R6, R8, R2yesyesyesyes
Figure 9.5: Scoreboarding: instruction status — all executed, DIV to write back


UnitBusy?OpDestSrc 1Ready? UnitSrc 2Ready?Unit
IntegerNo        
MultNo        
Add/SubNo        
DivYesDIVR10R1 R6 
Figure 9.6: Scoreboarding: functional unit status — all executed, DIV to write back


9.3  Tomasulo – Reservation Stations

A more sophisticated alternative to Thornton's Scoreboarding is Tomasulo's Algorithm which uses the concept of Reservation Stations. Reservation stations are closely-related to register renaming systems. You can almost think of them as individual reorder buffers on the inputs of functional units.

Unlike Thornton's scheme, Tomasulo's Algorithm is not based on a centralised control system, but is instead distributed. Also, Tomasulo's Algorithm is able to eliminate WAW and WAR hazards arising from output and antidependencies, instead of stalling. (Actually, it can only do this when resources are available and there is always a practical limit on the resources we can provide.)

The disadvantage of Tomasulo's Algorithm is that it is much more complex to implement. Nevertheless, pressure for performance is such that some variant of Tomasulo's Algorithm is widely used in modern processors. Note though that the basic algorithm does not provide for speculation beyond branches. Therefore, we must add some branch prediction scheme together with hardware to deal with speculative execution of instructions (for example, a reorder buffer — see chapter 8).

The idea is that each functional unit has a queue of reservation stations which are used to hold instructions and operands prior to execution, as well as information on those instructions and operands (e.g. if operands are available, and if so where from).

A typical example is shown in figure 9.7. (Note that some details concerned with memory access have been left out to make the diagram clearer).


Figure 9.7: Hardware for a Tomasulo-Based Pipeline

Whenever an instruction retires, the result is sent simultaneously not only to its destination register, but also to any other reservation stations requiring that result. This has the effect of speeding up execution by bypassing the architectural registers – there is no need to wait for results to be stored, they can be used immediately. However, it can also eliminate antidependency and output dependency.

Antidependency and output dependency are only problems in so far as they cause conflicts between available registers. For example, suppose that two instructions I1 and I2 write to the same register R1 (output dependency). We can eliminate this by simply making one of them write to a different register. If there are enough registers available, the compiler can (and should) do this. However, with a legacy architecture, this may not be the case. With reservation stations, the results of I1 and I2 would both be sent to different reservation stations – both I1 and I2's results would be sent to the reservation stations containing those instructions that needed them. If I2 and I1 are executed out of order, it is a relatively simple matter to ensure that only the result of I2 actually gets written back to the architectural destination register – unless an error occurs, I1's result need not be written there at all.

The steps involved are:

  1. Issue — Get an instruction from the buffer, and send it to the appropriate functional unit, where a reservation station is reserved. (If one is available – otherwise the instruction has to wait before being issued.)

    If the operands are available in architectural registers, read them in to the reservation station. Otherwise, mark which functional unit (and corresponding reservation station) will (eventually) produce them. When they are available, read them in.

    This step effectively removes WAW and WAR hazards by not writing results to architectural registers, and hence not overwriting later results or operands if execution is out of order.

  2. Execute — When both operands are in the reservation station, proceed with execution.

    This eliminates RAW hazards because we wait for the required operands to become available. If more than one instruction becomes available for a particular functional unit simultaneously, usually it is possible to pick one arbitrarily.

    An exception to this is loads and stores, where we must ensure that pairs of load/store instructions to the same memory address are maintained in order. (In fact, sequences of loads without any intervening stores can be reordered, but we would hope that a compiler would not generate more than one load to the same address without an intervening store unless it was very short of registers.)

  3. Write Result — When the result is available, send it to all waiting reservation stations, and to the architectural registers.

    This writing is usually managed by means of a common data bus that connects the registers and reservation stations. (In practice, it is usual to only write back the last in a sequence of results to the architectural registers unless an exception occurs.)

Notice that unlike Thornton's algorithm, there is no checking, or stalling, for antidependency and output dependency. The renaming that (effectively) takes place when results are copied to the reservation stations that will use them eliminates these forms of dependency.

9.3.1  Example

We will now consider the same example code sequence we used with Tomasulo's Algorithm in section 9.2.2. To take advantage of the extra parallelism that we will now find, we will add a second Add/Sub unit. We will start off at the same stage of execution as figure 9.1 above.

See figures 9.8 and 9.9. The columns in figure 9.9 are similar to those for Thornton's Algorithm (e.g. in figure 9.2), except that now we store either the architectural source of the operand (`Src') or the reservation station it is coming from (`Res').


InstructionIssued?Executed?Result Written?
LOAD R6 A+Byesyesyes
LOAD R2 C+Dyesyes 
MUL R1, R2, R4yes  
SUB R8, R6, R2yes  
DIV R10, R1, R6yes  
ADD R6, R8, R2yes  
Figure 9.8: Tomasulo's: instruction status — all but last instruction issued


UnitBusy?OpDestSrc 1Res 1Src 2Res 2
IntegerYesLOADR2C D 
MultYesMULTR1 IntegerR4 
Add/Sub1YesSUBR8R6  Integer
Add/Sub2YesADDR6 Add/Sub1 Integer
DivYesDIVR10 MultR6 
Figure 9.9: Tomasulo's: functional unit status — all but last instruction issued


Notice that now we have been able to issue ADD (6) because the WAR hazard (antidependency) with DIV (5) has been resolved: the value of R6 has been read into the revervation station being used by DIV (5). ADD (6) will be able to proceed with execution as soon as the result of SUB (4) is available. However, DIV (5) must wait for MUL (3) to finish execution before its other argument will become available.

Figures 9.10 and 9.11 show the situation near the end of execution.


InstructionIssued?Executed?Result Written?
LOAD R6 A+Byesyesyes
LOAD R2 C+Dyesyesyes
MUL R1, R2, R4yesyes 
SUB R8, R6, R2yesyesyes
DIV R10, R1, R6yes  
ADD R6, R8, R2yesyesyes
Figure 9.10: Tomasulo's: instruction status — near end of execution


UnitBusy?OpDestSrc 1Res 1Src 2Res 2
IntegerNo      
MultYesMULTR1R2 R4 
Add/Sub1No      
Add/Sub2No      
DivYesDIVR10 MultR6 
Figure 9.11: Tomasulo's: functional unit status — near end of execution


The difficulty of effectively implementing algorithms like Thornton's and especially Tomasulo's has led researchers to try and find alternatives that do not require such large quantities of complicated control hardware. We will look at one (VLIW) in the next chapter.


Previous Up Next