
Chapter 14  Compilers

14.1  Introduction

Intel's IA64, and similar VLIW/EPIC architectures, require compilers to generate code in a way that is substantially different to the traditional model. In fact, the same is largely true of superscalar and pipelined processors: for code to exploit the hardware effectively, compilers must be more aware of what resources are available and must take a different approach to optimisation.

Traditional code optimisation basically meant reducing the number of instructions, since there was a direct link between the number of instructions executed and the time taken. As soon as heavily pipelined systems appeared, it became naive to assume that instruction count alone determined performance. For example, the time lost to a pipeline flush or stall is long enough to execute many instructions, so a longer code sequence with a lower chance of a stall is preferable to a shorter sequence with a higher one.

There are basically three areas of optimisation open to compilers for modern microprocessors. In increasing order of difficulty, they are: basic block optimisation; loop optimisation (or software pipelining); and global optimisation.

14.2  Basic Block Optimisation

This is the simplest, and most commonly done. It is also the least effective. A basic block is a group of instructions between breaks in control flow (i.e. branches). A basic block optimiser, or scheduler, first identifies these blocks, and then attempts to speed them up by rearranging the code. This generally involves constructing a directed graph representing the dependences between the instructions; finding the critical path through that graph; and shortening it by applying heuristics (e.g. “try to move loads up the code so operands are available in registers when needed”).
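
As an illustration, here is a minimal sketch in C of such a scheduler (the example block, its dependences and the latencies are all invented). An instruction is `ready' once all its predecessors have been issued, and the heuristic picks the ready instruction with the longest remaining dependence path, i.e. the one on the critical path:

    #include <stdio.h>

    #define N 4

    /* Invented example block (roughly the loop body used in 14.3.1):
         0: load  R1, b(R31)
         1: fpmul R1, R1, 2.0    (depends on 0)
         2: store a(R31), R1     (depends on 1)
         3: incr  R31            (independent of the others)         */
    static int dep[N][N] = {     /* dep[i][j] != 0: j depends on i   */
        {0, 1, 0, 0},
        {0, 0, 1, 0},
        {0, 0, 0, 0},
        {0, 0, 0, 0},
    };
    static int latency[N] = {2, 3, 1, 1};   /* assumed latencies     */

    /* Longest dependence path starting at instruction i.            */
    static int height(int i) {
        int h = latency[i];
        for (int j = 0; j < N; j++)
            if (dep[i][j] && latency[i] + height(j) > h)
                h = latency[i] + height(j);
        return h;
    }

    int main(void) {
        int done[N] = {0};
        for (int n = 0; n < N; n++) {
            int best = -1;
            for (int i = 0; i < N; i++) {
                int ready = !done[i];           /* not yet issued... */
                for (int p = 0; p < N; p++)     /* ...and all of its */
                    if (dep[p][i] && !done[p])  /* predecessors are  */
                        ready = 0;              /* already issued    */
                if (ready && (best < 0 || height(i) > height(best)))
                    best = i;
            }
            printf("issue %d (height %d)\n", best, height(best));
            done[best] = 1;
        }
        return 0;
    }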

In practice, moving operations with a high latency (memory accesses, floating point) nearer the start of a basic block is a common goal of such simple optimisation. That is because their results are then more likely to be available to later, dependent, operations without stalling the pipeline (or stalling it for less time). The general term for rearranging code in this way is code motion.
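
For example, in a minimal C sketch (invented variables; assume each assignment compiles to roughly one instruction), hoisting a load lets its latency overlap independent arithmetic:

    /* Before scheduling: z stalls waiting for the load of y.    */
    t1 = x + 1;
    t2 = t1 * 3;
    y  = *p;         /* high-latency load                        */
    z  = y + t2;     /* stalls until y arrives                   */

    /* After code motion: the load issues first, and its latency
       is hidden behind the independent arithmetic.              */
    y  = *p;
    t1 = x + 1;
    t2 = t1 * 3;
    z  = y + t2;     /* y is (hopefully) ready by now            */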

Although this approach is successful, the scope for improvement is limited, because many basic blocks are very small. To get better results, we need to link together as many basic blocks as we can. Since basic blocks are separated by branches (generally conditional ones), linking blocks necessarily requires us to span branches, which means we must be quite confident that the intervening branches are well predicted. We must also account for the possibility that a prediction will be wrong.

In general, a series of such linked basic blocks is called a trace. We will briefly consider trace scheduling in section 14.4. However, a special – and common – case of a trace is a basic block forming a loop body, which is likely to be executed multiple times.

14.3  Loop Optimisation

The basic technique for loop optimisation takes as its starting point loop unrolling, a long-established technique developed for pre-pipelined systems. It can be illustrated quite simply with a high-level language example. Consider the loop:

    for (i=0; i<1024; i++) {
        a[i] = b[i] * 2.0;
    }

We can speed this up by not evaluating the loop termination condition on every iteration:

    for (i=0; i<1024;) {
        a[i] = b[i] * 2.0; i++;
        a[i] = b[i] * 2.0; i++;
        a[i] = b[i] * 2.0; i++;
        a[i] = b[i] * 2.0; i++;
    }

On the face of it, this only seems to work for (a) loops that execute a convenient multiple of the unrolling factor times; and (b) loops that execute a known number of times (i.e. for loops). We can solve (a) by simply adding a second loop to finish off any `left over' iterations. For example, suppose we have a loop that executes 95 times, and we wish to unroll it so that 8 iterations are computed per pass. We can execute the unrolled loop 11 times, and then add a second loop to finish off the final 7 iterations:

    for (i=0; i<88;) {
        body1; i++;
        ...
        body8; i++;
    }
    for (; i<95; i++) {
        body;
    }

or, equivalently, counting passes rather than elements (with the updates to i folded into the bodies):

    for (j=0; j<11; j++) {
        body1;
        ...
        body8;
    }
    for (j=0; j<7; j++) {
        body;
    }

The second problem is harder to fix, at least for the compiler. If it is possible to determine a minimum number of iterations, however, the technique can still be applied. At the moment, this technique is probably of more use to the programmer than to the compiler.
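
As an illustration, here is a minimal sketch (reusing the array example from above, but assuming a trip count n known only at run time) of unrolling by four with a remainder loop:

    for (i = 0; i + 4 <= n; i += 4) {   /* main unrolled loop */
        a[i]   = b[i]   * 2.0;
        a[i+1] = b[i+1] * 2.0;
        a[i+2] = b[i+2] * 2.0;
        a[i+3] = b[i+3] * 2.0;
    }
    for (; i < n; i++) {    /* finish off the left-over iterations */
        a[i] = b[i] * 2.0;
    }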

As an aside, if you wish to apply this sort of technique to your code by hand, make sure you profile it first! That is, use appropriate software tools to find out where your code actually spends most of its time, so you can target your optimisations appropriately. If you do not, you are quite likely to spend more time on the optimisation than will ever be saved during execution. Also, remember that `slow and right' is better than `fast and wrong', and that a common outcome of programmers' optimisations is broken code. So be careful.

In loops that iterate many times, substantial speed-ups can be obtained with this technique simply because it removes many of the loop termination tests. This is particularly true if the loop body is small, where much more time can be spent testing the termination condition than executing the body itself. In pipelined systems, there is another major benefit – we have made the basic block forming the loop body bigger, giving ourselves more scope for optimisation. This leads on to the technique of software pipelining, which interleaves operations from successive loop iterations.

14.3.1  Software Pipelining

Consider the same code example as above:

    for (i=0; i<1024; i++) {
        a[i] = b[i] * 2.0;
    }

Before basic block optimisation, the loop body might be represented as follows:

    load  R1, b(R31)     ; Load b[i] into R1
    fpmul R1, R1, 2.0    ; FP multiply R1 by 2.0
    nop                  ; Wait for fpmul
    nop                  ; Wait some more...
    store a(R31), R1     ; Store R1
    incr R31             ; Increment loop counter (R31)

We have assumed that (a) the floating point multiply takes three cycles, and (b) we must explicitly insert nops to delay the store (usually this would be taken care of by the hardware). We have also (somewhat dubiously) assumed an addressing mode like a(R31-1), which looks up the memory word at address a+R31-1. However, these assumptions only simplify the example, and can be circumvented in real code. The increment does not depend on the multiply, so we can perform some simple basic block optimisation to move it up, eliminating one nop between the multiply and store operations:

    load  R1, b(R31)     ; Load b[i] into R1
    fpmul R1, R1, 2.0    ; FP multiply R1 by 2.0
    incr R31             ; Increment loop counter (R31)
    nop                  ; Wait for fpmul
    store a(R31-1), R1   ; Store R1

Now we interleave two consecutive loop iterations, as they might issue on a 2-way superscalar machine. We assume there are no resource conflicts.

    Cycle    Iteration 1             Iteration 2
    -------+-----------------------+-------------------
    c        load R1, b(R31)
    c+1      fpmul R1, R1, 2.0       load R2, b(R31+1)
    c+2      incr R31                fpmul R2, R2, 2.0
    c+3      nop                     incr R31
    c+4      store a(R31-2), R1      nop
    c+5                              store a(R31-1), R2

Note that we have renamed register R1 to R2 in iteration 2, so that the two iterations do not compete for the same register. Observe that the code in iteration 2 can proceed in parallel with that in iteration 1. Therefore, by interleaving the code sequences for two loop iterations, we can keep a 2-way superscalar machine's pipeline full except for the two cycles involving a nop, and (possibly) one cycle at the beginning and another at the end: though we may be able to fill these with loop control code. If we had a 4-way machine, we could interleave four iterations, and so on.
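
The same interleaving can be sketched at source level (a minimal sketch; t1 and t2 play the roles of R1 and R2, and as before the trip count is conveniently even):

    double t1, t2;
    for (i = 0; i < 1024; i += 2) {
        t1 = b[i];          /* `load' for iteration i          */
        t2 = b[i+1];        /* `load' for iteration i+1        */
        t1 = t1 * 2.0;      /* the two multiplies (and stores) */
        t2 = t2 * 2.0;      /*   can now overlap               */
        a[i]   = t1;
        a[i+1] = t2;
    }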

14.4  Global Optimisation

Global optimisation goes beyond loop and basic block scheduling by looking at larger regions of code. The name is not entirely accurate, as such optimisers do not look at complete programs.

A variety of techniques are used, but a common one is based on the concept of a trace: a sequence of basic blocks along a frequently executed path, spanning conditional branches that are expected to be well predicted. It is very much an experimental technique, and it is not yet clear that for most code the benefits would outweigh the disadvantages – basically, the need to include substantial amounts of compensation code to deal with the case when execution does not follow the trace as expected (for example, when a branch is incorrectly predicted). Traces are quite a general concept, and can have multiple entry and exit points.

A slightly less general idea, with fewer complications, is the superblock. Superblocks can have multiple exit points but only one entry point. This makes life easier: when we try to move an operation X nearer the start of the trace, we do not have to worry about moving it above a side entry point whose incoming path requires it.
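
To see the problem a superblock avoids, consider the following sketch (the code and names are invented; `mid' stands for a side entry point reached by a goto from elsewhere in the program):

        a = a + 1;
    mid:                    /* side entry point               */
        y = f(x);           /* operation X, needed by both
                               the trace and the side entry   */
        z = y * 2.0;

If the scheduler hoists X above `mid' (say, to overlap the latency of f with the increment), the side-entry path reaches the use of y without ever computing it:

        y = f(x);           /* hoisted X                      */
        a = a + 1;
    mid:
        z = y * 2.0;        /* wrong if we entered at `mid'   */

In a superblock there is no `mid' label to worry about, so the hoist is safe.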

