Between 10 and 30% of instructions in typical programs are branches — mainly conditional branches. A [taken] branch interrupts the normal sequence of program execution. As we shall see, if we fail to take steps to prevent it, this will cause a pipeline to (more or less) completely empty. We have been seeing processor pipelines increase in length — recall the Pentium 4 with over 20 stages. There are reasons to think this trend may not continue, and many other processors have much shorter pipelines, but the same problems will still apply, if not so dramatically (and there are no guarantees that pipelines will stop getting longer).
The problem is made worse because processors like the Pentium 4 are superscalar, meaning they have multiple parallel pipelines (see chapter 7). In such processors, there may be of the order of 100 instructions in some stage of execution — and 10-30 of them are going to be branches.
If we accept naively that branches should cause pipelines to empty, huge amounts of work fetching (the wrong) instructions will be wasted, and such pipelines will never be operating at full efficiency. In effect, the vast majority of the time such processors will be simply refilling pipelines, with the occasional burst of productive activity. Most of their speedup potential will be lost.
This is actually an instance of Amdahl's Law, named after Gene Amdahl, main architect of the IBM 360 series. This basically states that if you focus on speeding up one part of a computation (in this case, everthing that is not a branch), then eventually the remaining, slow, parts (the branches) will dominate. (You may remember that such a strategy is more-or-less that of early RISC processors — speed up the common cases and ignore the rest.)
Branches are therefore a significant problem; particularly conditional branches because there are two possible outcomes, and we do not know which will occur.
Given the problems they cause, one possible solution may occur to you: modify your compiler to use less. This is less silly than it sounds, and has been done. We will return to the issue of how compilers can help speedup hardware and how hardware can help compilers later, in chapters 14 & 12, respectively.
There are two main difficulties. The first, which applies to all branches, is that the next instruction to be fetched doesn't come from the next sequential address, but from some other location. This problem can be resolved by decoding branch instructions and addresses as quickly as possible — if it can be done in the first stage of the pipeline, no penalty is incurred for an unconditional branch. This is obviously optimistic, but early decoding can reduce the delay.
Related to this is the problem of determining the branch target. Even if you are dealing with an unconditional branch, you may well have to compute the target of the branch, which would typically be done late in the pipeline. This is usually managed by adding a branch target buffer, that stores the branch targets of the last few hundred (or thousand) branches (conditional and unconditional), so they only have to be computed once. In most typical machines, the branch target will not change between calls (though see below).
In a typical pipeline, we will know at the end of the second (ID) stage that we are dealing with a branch. Suppose it is an unconditional branch or a conditional branch that we believe will be taken (see most of the rest of this chapter). We know (or at least believe in the conditional case) that the instruction following the branch in the first (IF) stage of the pipeline is invalid. If we have a branch target buffer, we can dump that instruction and start fetching from the target in the buffer, meaning we only lose one cycle.
A slight variation on this scheme is to store not only the branch target address but also the branch target instruction. That way, we can bypass fetching the branch target instruction, feeding it straight into the rest of the pipeline with potentially no delay at all.
An important exception to the rule that branch target addresses do not change is procedure call returns. A typical procedure (where `procedure' here is shorthand for procedure, function, method etc.) can be called from different places, so the return address will change. This complicates matters because (a) it means that the branch target buffer value may be wrong; and (b) we cannot easily use a simple branch target buffer to deal with return addresses.
We can deal with (a) by observing that the value from the branch target buffer will appear early in the pipeline, and provided we spot any error before any results are written, no damage will be done (apart, possibly, from wasting time). We can use the intervening time to do some checking.
We can deal with (b) by providing a separate stack-based buffer of return addresses. Since procedure call returns occur in the reverse order of calls this should work perfectly provided our stack does not overflow. In practice, procedure calls are not nested very deeply so this is not a problem. However, it does mean that the hardware must be able to recognize a procedure call return — in most cases this is possible, but not all.
The second problem applies to conditional branches — there are two possible next instructions, and in general you don't know which one you want until some computation has taken place, either in the branch instruction itself, or some earlier instruction.
The simplest approach to dealing with this is: when you encounter a branch, you wait until you know what the branch destination is. In the case of a conditional branch, this generally causes a delay even if you have a Branch Target Buffer. Recall that computation typically occurs in one of the final stages, so you have to wait until the branch instruction has nearly completed its journey through the pipeline. This essentially flushes the pipeline [almost] completely — a major performance penalty when you recall that a high proportion (10-30%) or instructions are branches, mostly conditional. (This flushing is actually quite easy to manage – one way is to let the pipeline simply continue as normal but tell the writeback (WB) phase to not store any results.)
Note that a slight complication is that in some (but not all) instruction sets, the computation determining whether or not a branch is taken is done by an instruction preceding the branch - exploiting this can be part of the strategy for managing branches (see chapter 5).
We will look at four strategies for addressing the problem of conditional branches: delayed branching (section 5.4); multiple condition codes (section 5.5); branch prediction (section 5.6); and branch history (section 5.7).
We will also consider some more advanced forms of history: correlating predictors (section 5.8) and tournament predictors (section 5.9).
This strategy that was popular with early `pure' RISC machines — machines that adhered very closely to the principles of the Berkeley RISC processors. The idea is that a branch instruction does not cause an immediate branch, but is delayed by some number of cycles, depending on the length of the pipeline. The system works best with `pure' RISC processors where all instructions take the same time to execute (or nearly all), and the pipeline is short (often two stages).
In this case, the instruction immediately following the branch is always executed regardless of the result of the branch. The compiler will generally generate code in which each branch is followed by a NO-OP instruction. This of course gets us very little — we will still effectively flush the pipeline because the NO-OP does not do any useful work. The only obvious gain is that we no longer need any hardware to flush the pipeline — it happens automatically; we just waste a little memory on the NO-OP.
The gain generally comes from an optimising pass of the compiler, which trys to get rid of the NO-OP by moving an instruction from before the branch to the slot immediately after the branch.
The following (slightly contrived) example illustrates the basic point. The conditional branch essentially consists of the pair of instructions CMP and JNE (compare and jump not equal respectively). An important point is that the ADD instruction is unrelated to the branch. In a `normal' non-delayed branch machine, the NO-OP would not be present. In this case however, it is filling what is commonly called the branch delay slot:
ADD R2, R3, R4 CMP R1,0 JNE somewhere NO-OP
The compiler can now rearrange the code by observing that the ADD need not be computed before the CMP — it can be moved after the JNE, replacing the NO-OP, safe in the knowledge that it will be executed.
CMP R1,0 JNE somewhere ADD R2, R3, R4
Experiments with the RISC II processor showed that over 70% of NO-OPs could be removed by this technique. The approach could be applied to longer pipelines, by inserting extra NO-OPs — however, it starts to perform badly. A 6-stage pipeline would require 5 NO-OPs, and it is not likely that you could find 5 instructions from before the branch to move, given that, as we have said, on average 10-30% of instructions are branches.
In general, this approach is being abandoned, because with longer pipelines, it is increasingly difficult to realise the potential performance gains — you simply cannot get rid of all the NO-OPs.
One trick that can be used to improve matters is the cancelling or nullifying branch. This behaves exactly like a normal delayed branch when the branch is taken, but if the branch is not taken, then any instructions in branch delay slot(s) become NO-OPs. How does this help? It allows the compiler to choose instructions from the branch destination. Because of the way code is often structured, you may get more to choose from here than from immediately before the branch.
The disadvantages are: you cannot `mix and match' instructions from before and after (at least not in any examples I am familiar with); and it does not help when a branch is not taken. (Actually as we will see, many branches are taken far more times than not, so this may not be a problem). In practice, this technique helps a bit but not enough. You can also start to see increasing complexity in the code. For example, consider a branch that is part of a loop. You are effectively moving code from the start of the loop to the end, which is likely to mean you need to take special measures to ensure the loop works properly the first time you execute it.
Another drawback of delayed branching in general is that the organisation 'shows through' to the architecture — [low-level] programmers, and compiler writers, need to know how many pipeline stages there are. This is generally considered A Bad Thing. Since the 1960s (after the introduction of the IBM 360 series), the received wisdom has been that the architecture (which the programmer can see) should not be coupled to its implementation (which they cannot – or at least should not – see). An important pragmatic reason for this is code portability: if you need to change the architecture when you change the organisation, then old code breaks. However, for a variety of reasons, this separation is possibly weakening — we will see why later.
The basic problem with conditional branches in pipelines is that they [usually] depend on the value of some condition code that has not yet been computed (because the instruction computing the value of the code is still in the pipeline). Usually the condition code is computed either by the branch itself, or by some other, earlier instruction.
In the first case, the multiple condition code technique cannot be applied. However, in the other case it can. Traditionally, the condition code computation is done by the instruction immediately preceding the branch, but this is not necessary — it is simply easier for programmers to organise code that way when manually writing assembly code, and compiler writers often follow hand-coding practices, at least initially. We could deal with conditional branches quicker if we could compute the condition codes earlier.
SUB R1,R1,4 ADD R3,memval,5 CMP R2,0 JNE somewhere
could be changed to
CMP R2,0 SUB R1,R1,4 ADD R3,memval,5 JNE somewhere
The condition codes computed by the CMP instruction do not depend on the previous 2 instructions — hence the code can be re-ordered. By being clever with the hardware it would be possible to speed up conditional branches because the condition code values would be known early on in the pipeline, and hence we would know where the branch was going before we got to it. The problem with the scheme is that it is fundamentally the same as delayed branching, and faces exactly the same difficulties — there are simply not enough instructions available to be moved when the pipeline gets long.
A solution to this problem is to have more than one register of condition codes. Each comparison and branch operation must designate which condition code register it will use. This allows us to have different branch operations controlled by different codes.
Consider the case when we need a gap of three machine instructions to compute the condition codes before a conditional branch can proceed without delay.
SUB R1,R1,4 SUB R4,R4,2 PUSH memval CMP R2,0 JNE somewhere CMP R3,0 JNE somewhereelse
By using two condition code registers a and b this becomes:
CMP-a R2,0 CMP-b R3,0 SUB R1,R1,4 SUB R4,R4,2 PUSH memval JNE-a somewhere JNE-b somewhereelse
It works because the two jumps can `share' the instructions that need to be inserted between evaluating the conditions, and the actual jump. The use made of this scheme is a little hard to judge since many modern RISC architectures do away with special condition code registers and allow general-purpose registers to effectively be used for storing condition codes. So there is nothing to stop a compiler writer using more than one register at a time: i.e. effectively use multiple condition codes. This means that even though the implementations of such architectures generally employ branch history schemes (see below), compiler writers are free to use variants of the multiple condition codes scheme to further improve performance.
We said earlier, truthfully, that we do not know the outcome of a conditional branch. However, we can guess. An effective way of reducing the conditional branch penalty is to try to predict the outcome of branches — make an `informed guess' essentially. Between the time the guess is made, and either confirmed or found to be wrong, results produced are marked as provisional, or speculative, and are retained within the pipeline — they do not overwrite memory or registers (we will look at strategies to enable this using reorder buffers in chapter 8). If the prediction was correct then values are written out; otherwise the pipeline is flushed, and restarted.
On face value, we would expect to be correct 50% of the time — a lot better than nothing, but not great. How can we improve the quality of our `guess'? If we know a bit about the software, it is not difficult to become very good at prediction. For example, backward branches are almost certain to be part of a loop. Loops are usually executed quite a few times, so it is reasonable to guess that they will be taken, and start fetching from the branch address. On the last iteration, the guess will be wrong — if the loop executes 100 times, this is pretty good guessing.
Good prediction on forward branches is a bit more problematic — it is really necessary to know if they are part of a loop termination condition, or a conditional. If they are part of a while loop termination condition, then in general they will either be taken or not taken (depending on the sense of the branch) most of the time. If they are part of a conditional, it is not clear what to do without more information (which we will consider shortly, in section 5.7). In practice, in most cases, the processor does not know what type of structure a forward branch belongs to, and there are studies to suggest that the best option is to predict all branches taken in such cases.
It is possible to have some mechanism for the compiler to let the hardware know what sort of structure the branch is part of. There are a number of possible ways to do this. For example some architectures include software hints — special codes to indicate the purpose of a particular branch instruction (which the compiler is free to pay attention to, or not). The idea is that several semantically equivalent branch instructions exist with different opcodes. The compiler chooses the appropriate one, depending on whether it is generating code for a loop, conditional, etc. There is no guarantee that compiler writers will use the correct one. However, most compiler writers would want their code to be as fast as possible, so are likely to use them appropriately.
In general, the more information you have, the more likely you are to make a correct, informed decision. A more sophisticated prediction strategy is to use the past behaviour of a branch. The idea is to keep some information around – branch history information – about what a particular branch instruction did the last time it was executed. It would obviously be difficult to keep information around about all branches, but because of the general locality of execution of code within a program this is not necessary — a table with a few hundred (or thousand) entries is very useful.
How many bits of information do we need to keep? A single bit (usually called a 1-bit predictor) is enough to keep track of what the branch did last time, and a very simple and effective prediction strategy is to predict that a branch will go the same way as last time.
There are however, a number of cases when it does not work well. For example, if you have a backward branch that was not taken last time, but was the time before, this is likely to be a loop that has been executed and has terminated once, which is just about to restart. A single bit of branch history would get this wrong. In such circumstances a second bit of information is useful. In general, figure 5.1 shows the behaviour of a 2-bit predictor that overlooks a single mis-prediction before it changes from predicting a branch taken to not taken (or vice versa).
On a large test using a set of benchmarks called SPEC89, this predictor produced hit rates of between 82% and 99%, with an average around 89%, using a buffer of 4096 entries (i.e. it keeps track of the last 4096 branches). Further tests show that (a) using an unlimited number of entries does not materially improve performance much and (b) using more bits of information does not help much either. To do substantially better, we need to think of something else.
A significant problem with straightforward schemes like those above is that they consider branches in isolation. That is, each individual branch is predicted solely on its own past behaviour. In practice many branches, particularly those that are part of conditionals rather than loops, depend quite heavily on the behaviour of other branches.
Correlating predictors exploit this by basing their prediction on the behaviour not only of the branch being predicted, but of one or more earlier branches. For example, consider the following code, adapted and simplified from Hennessy & Patterson.
JNEZ R1 next ; if R1 != 0 -> next (b1) SET R1, 1 ; set R1 = 1 next: SUB R2,R1,1 ; subtract 1 from R1 JNEZ R2 somewhere ; if R2 != 0 -> somewhere (b2)
It takes a bit of working out what this does…If R1 initially contains 2, then the first jump is taken, and so is the second (because R1=2 initially, we then subract 1 and the second jump checks if R1=1). If R1 initially contains zero, then the first jump is not taken and then neither is the second (because we set R1=1 if the first jump is not taken and then subtract 1). So if this code is part of a loop and each time we start it we alternate the value of R1 between 2 and 0, then both branches are alternately taken then not taken. If we use 1 bit of branch history and initially predict not taken, then we will always get this wrong! (Exercise: check this.) This might seem a contrived example, but it actually quite realistic. It is exactly the kind of thing that happens when code that consists of a sequence of conditionals is compiled.
The simplest correlating predictor bases its behaviour on that of the immediately preceding branch as well as the curent one. In the example above we assume the code forms the body of a loop and that branch b1 immediately precedes b2 and vice versa. (This is actually not realistic. Exercise: explain why.) Instead of a single 1-bit predictor for each branch we now have two: one will be used when the previous branch is taken, and the other when the previous branch is not taken.
In the following, the notation X/Y means we will do X when the previous branch is not taken, and Y when it is. X and Y can be one of T for taken and NT for not taken. Brackets around X or Y mean that we will use that particular predictor.
Branch b1 Branch b2 R1 pred actual new pred pred actual new pred 2 (NT)/NT taken T/NT NT/(NT) taken NT/T
Branch b1 Branch b2 R1 pred actual new pred pred actual new pred 0 T/(NT) not taken T/NT (NT)/T not taken NT/T 2 (T)/NT taken T/NT NT/(T) taken NT/T
So after the first pass (illustrated in figure 5.3(a)) through we predicted b1 not taken, but it was taken. Therefore we update the predictor for b1 to T/NT: that is, if b2 is not taken we predict b1 to be taken.
We do not update the predictor for b1 when b2 is taken because we have no information about that case. Note that this first pass through is a little artificial because we have no `real' information about the actual behaviour of b2: only the value of its predictors which have both been initialised to not taken. A similar situation holds for b2: we also got it wrong (not looking too good so far), and have updated its predictor to taken if b1 is not taken. Note however that this is `real' information since we know that b1 was not taken. The next steps are shown in figure 5.3(b).
Now we are getting it right every time. We can of course expand this scheme to look at more than one previous branch, and to use more than one bit of history per branch. The scheme above is a (1,1) scheme: a (2,2) scheme would look at two previous branches and hold two bits of information per branch. Note that this means that each branch now has four lots of two bits per branch, meaning four times as much state information as before. However, if we repeat the previous tests using a 1024 entry buffer, meaning we hold the same total state as before, we find prediction rates of 94-96%, which is a substantial improvement. Also, the actual hardware required is not particularly complex in practice.
A logical extension of the concept of a correlating predictor is to observe that while many branches will depend on others in their neighbourhood, not all will. Therefore, even though a correlating predictor seems to do better overall, on some branches, a conventional `local' predictor will do better. A Tournament Predictor will typically run a simple branch history scheme in parallel with a more sophisticated scheme (like a correlating predictor) and will choose which one to use on the basis of their past performance for a particular branch. (The name comes from the idea that the two forms or predictor are competing against each other.)