
Chapter 3  Example: Berkeley RISC II

3.1  Introduction

Of the three early RISC projects we will look at one, which took place in the early 1980s and which gave the name to the concept. The RISC II project and its predecessor RISC I (there were others: RISC Blue and RISC Gold), together with MIPS and the earlier IBM 801, contributed both a fundamental philosophy of architecture design and some specific design `tricks' – some of which are still used, and others of which have been abandoned.

3.2  Code Studies

The essential idea was to build a `stripped down' microprocessor, which concentrated only on necessary features. The first part of the project's effort consisted of finding out what was necessary. They analysed a large number of Unix programs, nearly all written in C and from the then-current BSD (Berkeley Software Distribution) version of Unix. They also did some runtime code profiling of these programs.

3.2.1  High Level Language Statements

They discovered that the following proportions of high level language statements were executed in the programs they considered (figures rounded to the nearest 5%).

This is probably more or less what you would expect — most people would guess that loops were more frequent than call/return, but remember that this effectively includes all the calls to pre-defined library functions (or APIs as we would probably now call them).

However, the list is misleading because some of these constructs need far more machine instructions, or memory accesses than others. So, if we weight them according to the number of machine instructions required, we get:

If we look at the number of memory accesses we get:

From this it is clear that although procedure calls are not that common, they are very expensive in terms of both instructions and memory accesses, and so it is worth devoting some effort to making them fast. It is also clear that branching should be fast, to speed up loops and conditionals. Depending on the precise choice of instructions this might also help with speeding up call/return (or at least the return part). The concern about memory accesses was serious at the time, since processors operated at about twice the speed of memory. Now, processors operate at about 100 times the speed of memory, and dealing with that is a very important issue that we will return to when we look at caches in chapter 16.

3.2.2  Operands & Expressions

Although assignments were clearly not the most resource-intensive constructs, the researchers also did some analysis of the operands in expressions. They found that 66% of all assignment statements were of the form:

    a:=b;

and 20% were of the form:

    a:=b op c;

i.e. most were very simple. They found that types of operands were distributed as follows:

That is, over half of all operands were scalars, and the vast majority of them were either parameters of functions/procedures or local variables (and hence potentially storable in registers). Nearly all arrays and structures were global1. These all need to be stored in memory (actually not strictly true, but in practice this is the case); however, the pointers to them (recall we are essentially talking about C programs) could be kept in registers.

3.2.3  Addressing Modes

When looking at addressing modes, they found the following:

That is, 75% of all operand accesses use these four modes. Memory direct accesses can be easily synthesised using indexed addressing — and so (with a possible overhead in time and instructions) can more advanced modes, which do not get used much anyway.

3.2.4  Parameters & Local Variables

When looking at the numbers of arguments and local variables for procedures/functions, we get the following:

In other words, most procedures/functions have very few arguments, and very few local variables. When looking at the nesting depth of procedure/function calls, it was found that, when executed, only 0-15% of programs nested deeper than 4, and only 0-3% nested deeper than 8. These last two may seem unusual things to analyse; as we will see, they were significant in later design decisions.

3.3  Architecture/Organisation Requirements

Compiler writers are often said to adhere to the following policy: “make the frequent cases fast and the infrequent cases correct”. That is, concentrate time and effort on optimisations of things that happen often, and do not worry about optimising things that happen only occasionally — simply make sure they work correctly. (Of course, it goes without saying that the frequent cases should also be correct.) This philosophy is essentially the one applied to RISC II (and the other, related, projects) — the designers did not worry about some, relatively rare, cases going relatively slowly because they could concentrate resources on making the frequent cases faster.

For example, consider the issue of complex addressing modes, of which some machines had many. In total these account for only 25% of all memory accesses (in the particular analysis done) — each one individually will account for much less than that. Therefore it makes sense to not include them, and simply use sequences of simpler instructions instead. This may not be as fast (though sometimes it was faster, or the penalty was small) but the resources freed-up could be devoted to making the common cases faster, giving a performance gain overall.

The analysis quantified the requirements by identifying the common cases. The processor should make procedure/function calls with few local variables and parameters as fast as possible. We are also looking for an implementation that allows fast access to scalar variables, since there are lots of these. Complex addressing modes are unnecessary.

3.4  RISC II

The final architecture and corresponding implementation developed for the project – the RISC II microprocessor – generally meets the requirements highlighted by the code analysis. The vast majority of the chip is occupied by the data unit, unlike normal microprocessors of the time, which were dominated by control. Furthermore, the vast majority of the data unit consists just of a huge file of registers — 138 of them. (This incidentally made RISC II a good match to emerging chip design techniques which we'll consider in chapter 17.)

3.4.1  The Register File

It was generally thought at the time that there was a limit to the number of usable registers. A typical procedure only uses a few for parameters and local variables. Furthermore, if you have too many, you will have to save and restore them every time a procedure is called (actually you do not have to save/restore all of them — just the ones you are using). However, not all the registers in the RISC II are available at the same time. The first 10 registers are global — they can be seen by every procedure/function currently active. (Register zero is hardwired to the constant zero — we will look at this later.) The rest of the registers are arranged as a stack.


Figure 3.1: The RISC II register file

Whenever a new procedure/function is called, it is allocated a window of 22 registers from the top of the stack. This means that we do not have to save/restore the contents of registers when calling/returning from procedures/functions. Furthermore, these register windows overlap — the bottom 6 registers are also part of the calling procedure/function's register window, and the top 6 registers will be part of the window of any new procedure/function that is called. This allows very efficient parameter passing — just put the parameters, and the return value, in the appropriate registers. (As we shall see, there are other reasons for not having too many registers. At the time of RISC II, however, these did not seem to apply to the kind of system being targeted.)

These features are shown in figure 3.1.

If there are more arguments/local scalar variables than can be stored in registers, the extra ones are held in memory (this is rare however — remember that studies were done to identify the typical number of local variables and parameters, and the number of registers per window was chosen with that in mind). Arrays and structures are held in memory — remember that studies showed these were generally global anyway. If procedures were nested more than 8 deep, it was necessary to save the register file to memory, which was slow. But again, this is rare — again recall that the studies analysed typical nesting depths.
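The window arrangement above can be sketched as a simple address mapping. This is a minimal model, not the actual RISC II hardware: it assumes 32 logical registers per procedure (10 globals plus a 22-register window) mapped onto the 138 physical registers, with consecutive windows sharing 6 registers, so that 10 + 8 × (22 − 6) = 138.

```python
# Toy model of RISC II-style overlapping register windows.
# The logical-register numbering (0-31) is an assumption for illustration.

N_GLOBALS = 10      # R0-R9 visible to every procedure (R0 wired to zero)
WINDOW_SIZE = 22    # registers per procedure window
OVERLAP = 6         # registers shared between caller and callee
N_WINDOWS = 8       # nesting depth before spilling to memory
N_WINDOW_REGS = N_WINDOWS * (WINDOW_SIZE - OVERLAP)   # 128
assert N_GLOBALS + N_WINDOW_REGS == 138

def physical_register(window, reg):
    """Map (current window, logical register 0-31) to a physical register."""
    if reg < N_GLOBALS:            # globals are shared by every window
        return reg
    i = reg - N_GLOBALS            # position within the 22-register window
    return N_GLOBALS + (window * (WINDOW_SIZE - OVERLAP) + i) % N_WINDOW_REGS

# The callee's bottom 6 window registers are the caller's top 6,
# so parameters passed there need no copying at all:
caller, callee = 3, 4
for j in range(OVERLAP):
    assert physical_register(callee, N_GLOBALS + j) == \
           physical_register(caller, N_GLOBALS + 16 + j)
```

The modulo in the mapping models the circular nature of the stack of windows: a call deeper than 8 wraps around, which is why the oldest window must then be saved to memory.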

Pointers to arguments held in registers are a problem, because a register does not have a memory address. A scheme was devised to map registers to memory, but not implemented. In RISC II, the compiler must detect any pointers to variables and store those variables in memory. Pointers are very common in C, so this is a potential performance problem. However, usually they point to arrays/structures, which will be in memory.

3.4.2  Instruction Set

The instruction set was very basic. The arithmetic operations include only add, subtract, AND, OR, XOR, and shifting (and there is no floating point). Other operations must be synthesized. For example, a register move is just:

    Rd <- Rs + R0

Remember that R0 is hardwired to 0. This is useful not only in synthesis of arithmetic operations, but also when dealing with memory, as we will see. Incidentally, it was not a new idea but first appeared in Ferranti's 1956 Pegasus machine.

A clear operation can be constructed as follows:

    Rd <- R0 + R0

Remembering how 2's complement representation works (−1 is the all-ones pattern), bitwise NOT is:

    Rd <- Rs XOR -1

and a full negate can then be synthesised by adding 1, or directly with subtract as Rd <- R0 - Rs.

In RISC II, all ALU operations are performed on registers — if you want to operate on memory operands, you must read them into a register first. Such an architecture is called load-store, and as we saw in chapter 2 a large proportion of RISC machines since the Berkeley RISC II are load-store.
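These syntheses can be sketched in Python rather than RISC II assembly. This is a minimal illustration, not the real datapath: the 32-bit mask and the register values are assumptions, and R0 stands for the hardwired-zero register.

```python
# Sketch: how a hardwired-zero register lets a minimal ALU (add, xor)
# synthesise move, clear, NOT and negate. Values are illustrative.

MASK = 0xFFFFFFFF          # model 32-bit registers

def alu_add(a, b):
    return (a + b) & MASK

def alu_xor(a, b):
    return a ^ b

R0 = 0                     # hardwired zero

rs = 0x1234
move  = alu_add(rs, R0)                 # move:  Rd <- Rs + R0
clear = alu_add(R0, R0)                 # clear: Rd <- R0 + R0
bnot  = alu_xor(0x0F, MASK)             # NOT:   Rd <- Rs XOR -1 (all ones)
neg   = alu_add(alu_xor(7, MASK), 1)    # 2's complement negate: NOT then +1

assert move == 0x1234
assert clear == 0
assert bnot == 0xFFFFFFF0
assert neg == (-7) & MASK
```

The point is that none of these operations needs its own opcode; a constant-zero register and a couple of ALU functions cover them all.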

3.4.3  Addressing Mode & Instruction Format

There is only one addressing mode:

    Rd <-> M[Rs1+S2]

where S2 can either be another register or an immediate value (i.e. a constant). This single addressing mode is sufficient to synthesize everything we require. Examples of usage from high level languages include global scalars:

    M[Rb+imm]

where Rb is the base pointer, pointing to a block of global variables, and imm is the offset of the appropriate variable within the block. A pointer reference (e.g. *p in C) is:

    M[Rp+R0]

where Rp contains the pointer. We can reference a structure field (p->field in C) as follows:

    M[Rp+fieldoffset]

where fieldoffset is the number of bytes between the start of the structure and field. An array element (a[i]) can be accessed as follows:

    M[Ra+Ri]

where Ra contains the address of the base of the array a, and Ri contains i * eltsize, where eltsize is the size in bytes of an element of a (e.g. typically 4 currently for an integer).
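The four uses of the single indexed mode can be sketched as follows. This is an illustrative model only: the memory is a Python dictionary, and the base addresses, offsets and values are assumptions.

```python
# Sketch: synthesising global-scalar, pointer, structure-field and
# array accesses from the single mode M[Rs1 + S2].

memory = {}

def store(rs1, s2, value):
    memory[rs1 + s2] = value       # M[Rs1 + S2] <- value

def load(rs1, s2):
    return memory[rs1 + s2]        # value <- M[Rs1 + S2]

R0 = 0                             # hardwired zero

Rb = 0x1000                        # base of the global-variable block
store(Rb, 8, 42)                   # global scalar:      M[Rb + imm]

Rp = Rb + 8                        # a pointer to that scalar
assert load(Rp, R0) == 42          # dereference *p:     M[Rp + R0]

store(Rp, 4, 99)                   # structure field:    M[Rp + fieldoffset]

Ra = 0x2000                        # base of int array a
i, eltsize = 3, 4
store(Ra, i * eltsize, 7)          # array element a[i]: M[Ra + Ri]
assert load(Ra, i * eltsize) == 7
```

Note how the hardwired zero earns its keep again: plain pointer dereference is just the indexed mode with R0 as the index.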

The instruction format was fixed, which greatly simplified instruction decoding. This was unlike most other architectures of the time, which used several instruction formats, and instructions of differing lengths, in order to save memory — something that is less of an issue now that memory is very cheap (though we must still consider memory bandwidth — getting instructions and data from/to memory). However, as we shall see, even at the time the extra memory cost of fixed-length instructions had little practical effect.
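With a fixed format, decoding is just a handful of shifts and masks, with no sequential computation of instruction length. The following sketch illustrates this; the exact field widths used here (7-bit opcode, a set-condition-codes bit, 5-bit destination, 5-bit rs1, an immediate flag and a 13-bit short source) are an assumption for illustration rather than a guaranteed description of the RISC II encoding.

```python
# Sketch: decoding a fixed-format 32-bit instruction word.
# Field widths (7 + 1 + 5 + 5 + 1 + 13 = 32) are assumed for illustration.

def decode(word):
    """Split a 32-bit instruction word into its fixed fields."""
    return {
        'opcode': (word >> 25) & 0x7F,   # bits 31-25
        'scc':    (word >> 24) & 0x1,    # set-condition-codes bit
        'dest':   (word >> 19) & 0x1F,   # destination register
        'rs1':    (word >> 14) & 0x1F,   # first source register
        'imm':    (word >> 13) & 0x1,    # 1 => short2 is an immediate
        'short2': word & 0x1FFF,         # second source: register or constant
    }

# Build a word field-by-field and check it decodes back:
word = (0x11 << 25) | (1 << 24) | (3 << 19) | (5 << 14) | (1 << 13) | 42
fields = decode(word)
assert fields['opcode'] == 0x11 and fields['dest'] == 3
assert fields['rs1'] == 5 and fields['imm'] == 1 and fields['short2'] == 42
```

Because every field is in the same place in every instruction, all of them can be extracted in parallel by fixed wiring — which is precisely why fixed formats make the decoder so much simpler.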

The RISC II architecture used a two-stage pipeline (actually, arguably, a three-stage pipeline but we will not get into that). That is, while one instruction is being executed, the next is being fetched. Pipelining has many ramifications for processor design, which we will get into in chapter 4. It turns out that a two-stage pipeline is sufficiently short to avoid many of the more irritating problems (with care), but one is unavoidable: what happens when the current instruction is a branch?

The RISC philosophy is to keep things simple, and a simple solution was to use delayed branching (section 5.4) — one of the first uses of this in practice. Essentially it means that we change the semantics of branch instructions so that the jump does not occur until after the instruction immediately following the branch has executed. At least for short pipelines, this makes the problem more-or-less go away (though there are drawbacks, as we shall see).
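The changed branch semantics can be sketched with a toy interpreter. This is an illustrative model, not RISC II itself: the two-instruction mini instruction set (`add`, `br`) is an assumption, and the point is simply that the instruction in the "delay slot" after a taken branch still executes.

```python
# Sketch: delayed-branch semantics. A taken branch only redirects the
# program counter *after* the following (delay-slot) instruction runs.

def run(program):
    """Each instruction is ('add', reg, value) or ('br', target)."""
    regs = {}
    trace = []                       # record of executed adds
    pc = 0
    pending_branch = None
    while pc < len(program):
        op = program[pc]
        next_pc = pc + 1
        if pending_branch is not None:   # branch from the previous cycle
            next_pc = pending_branch     # takes effect only now
            pending_branch = None
        if op[0] == 'br':
            pending_branch = op[1]       # defer the jump by one instruction
        else:
            _, reg, value = op
            regs[reg] = regs.get(reg, 0) + value
            trace.append((reg, regs[reg]))
        pc = next_pc
    return trace

# The 'add r1' in the delay slot executes even though the branch is taken;
# 'add r2' is skipped:
trace = run([('br', 3), ('add', 'r1', 1), ('add', 'r2', 1), ('add', 'r3', 1)])
assert trace == [('r1', 1), ('r3', 1)]
```

A compiler targeting such a machine tries to fill the delay slot with useful work (often the instruction that would otherwise sit just before the branch); when it cannot, it inserts a no-op, which is one of the drawbacks alluded to above.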

3.5  Performance in Practice

How well does all this work? Below are some test results, comparing RISC II with other popular processors of the time, and with the time for RISC II normalised to 1.

The VAX-11/780 was a very popular, large minicomputer/mainframe; the PDP-11/70 was a minicomputer (generationally, the predecessor to the VAX); and the M68000 was the most powerful commercial microprocessor in general use at the time.

This shows that RISC II is substantially faster than comparable microprocessors (e.g. 68000), and even beats what was at the time a large and expensive machine (the VAX-11/780) nearly all the time.

These figures are open to some criticism — the RISC II has no virtual memory overhead, while the VAX does (though the others don't). More serious, perhaps, is that the C code run on all machines was generated using the notoriously dreadful PCC (Portable C Compiler). PCC sacrifices speed for relatively straightforward portability, in a way that the simple RISC II instruction set (only 39 instructions) was likely to benefit from. It has been suggested that with better compilers, particularly DEC's own highly-regarded VAX C compiler, things would not have been quite so clear cut. You can argue that it is fairer to use the same compiler; you can also argue that if you are trying to evaluate the relative performance of simple and complex architectures, it is unfair to (effectively) prevent the complex one from using its complex instructions; and the counter to that is that in such circumstances a more advanced optimising compiler for RISC II might have done even better.

Speaking of C… the designers analysed C programs almost exclusively when deciding what to devote resources to on the RISC II — it was suggested that it would not do so well on other programming languages.

Another issue is multiprocessing — in an environment with a lot of processes running, the overhead of saving 138 registers every time a context switch occurred would be serious. (In fact, such large register sets are generally no longer seen for this reason, though the idea seems to have reappeared in IA64 (chapter 12).) Similarly, delayed branching has been dropped because it only works well with short pipelines. All these points and more were made by critics of the project.

Even so, for a project that consumed little in terms of time and money, the resulting performance is remarkable. The other potential area of concern was code compactness and memory usage. Because RISC II sometimes used a lot of simple instructions instead of a few complex ones, it was felt that programs might become excessively big. The following table gives some figures, with RISC II again normalised to 1.

As can be seen, RISC code was potentially twice as long as that of the DEC processors, and potentially about 50% longer than Motorola's. However, in some cases it could actually be a bit shorter. Again, we might expect more compact code from more sophisticated compilers. However, you also need to remember that this is only static code size — not runtime size, which would include data memory. There is no real reason to think that RISC II data would be larger than the others, which means that the overall memory size differences would be less.

The controversy over the RISC movement, and the various processors involved, went on for some years. However, the RISC argument has clearly won. Partly this is because the original arguments about high- and low-level instruction sets (as encountered in section 1.4) turned out to be correct: it was more useful to the compiler writer to concentrate on making the most-used, simple instructions that were the building blocks of most code, as fast as possible. It was also because pipelining — a technology previously reserved for top-of-the range machines — was making its way into microprocessors, as we will see next.


1
Probably not true any more, by the way…
