Intel invented the microprocessor, almost by accident, in 1971 — the 4-bit 4004 was designed for a Japanese electronic calculator company called Busicom. (NAH actually used one of these at school - they were bigger than a PC and slow, though a significant advance over the mechanical and slide rule alternatives).
Intel engineer Ted Hoff was asked to design a custom 12-chip set for Busicom, but realised that a simple, general-purpose chip could do the same thing more cheaply — hence the 4004. Intel later realised they might have something significant, though views were very much divided. They bought back the rights to the 4004 for what they were paid for it, and developed the 8008 (1972). It was a huge success, selling orders of magnitude more than expected. An improved version, the 8080, was released in 1974 (later extended by Zilog's very successful Z80), and the seminal 8086 in 1978.
Each new generation was more sophisticated than the predecessors. The 8008 was an 8-bit machine; the 8080 an 8-bit machine with 16-bit memory addresses; the 8086 was a 16-bit machine (though the cheaper code-compatible 8088 — used in the original IBM PC — had an 8-bit external data bus, making it slower).
The architecture also evolved. The 8080 and earlier processors were accumulator machines. That is, they had a single register that was used as an operand and destination for pretty much all arithmetic operations. The 8086/88 had more registers, though they were not for the most part general-purpose: each had a dedicated role in a range of instructions. This architecture has been called an extended accumulator architecture.
In the late 1970s, Intel was somewhat concerned with providing compatibility with previous processors, but not to the extent of full binary compatibility. Because early users were relatively sophisticated (pre-PC, pretty much anyone who used a microprocessor could program, commonly in assembler), it was sufficient to enable automatic translation of assembler programs. However, subsequently, with the huge success of the 8086/88, and the growth in naïve users, Intel had problems. (And even sophisticated users might not be happy at the prospect of somehow translating massive amounts of code.) Furthermore, the decisions they took were not always in retrospect particularly good — though that was not necessarily clear at the time.
The first development was the 8087 Floating Point coprocessor chip. For reasons that seemed good at the time, but turned out not to be, they made the 8087 a hybrid stack architecture. A stack architecture is one in which all expressions are evaluated on a stack. This seems a very elegant and clean model of expression evaluation, and to a lesser extent other forms of computation. Consequently, various generations of hardware architects have been drawn to it. Unfortunately, it has been repeatedly found to be inefficient — there are problems with common sub-expressions being repeatedly evaluated; managing stack overflow/underflow and saving/restoring the stack to/from memory etc. It has now more or less been abandoned. (The Java Virtual Machine (JVM) is stack-based, but proposed hardware implementations of the JVM have not been successful.)
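The common-subexpression problem mentioned above is easy to see in miniature. The following sketch (hypothetical code, not any real ISA; the function names and fixed stack depth are invented for illustration) evaluates (a+b)*(a+b) the way a pure stack machine must: the postfix sequence recomputes a+b twice, where a register machine would compute it once and reuse the result.

```c
#include <assert.h>

/* Minimal sketch of a stack machine evaluating (a+b)*(a+b).
   Illustrative only; a real stack ISA would also have to handle
   stack overflow/underflow and spilling to memory. */
static double stack[8];
static int sp = 0;

static void push(double v) { stack[sp++] = v; }
static double pop(void)    { return stack[--sp]; }
static void add(void) { double b = pop(), a = pop(); push(a + b); }
static void mul(void) { double b = pop(), a = pop(); push(a * b); }

double eval(double a, double b) {
    /* Postfix: a b + a b + *
       Note that a+b is evaluated twice: the common-subexpression
       problem described in the text. */
    push(a); push(b); add();
    push(a); push(b); add();
    mul();
    return pop();
}
```

A register machine needs only one addition and one multiplication here; the stack version needs two additions, and the gap grows with expression complexity.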
Intel thought it had solutions to these problems, and introduced a hybrid model where arithmetic operations could also access any of a set of eight FP registers forming the stack. Unfortunately, though it helped with common subexpressions, the mechanism proposed for dealing with overflow/underflow (a software interrupt handler) turned out to be unworkable, because it was very difficult to tell after the fact if an interrupt was caused by overflow/underflow or another form of error (basically, nobody had tried to write the code before the hardware was built). Consequently, compiled FP code for the 8087 (and all subsequent Intel processors) tends to play it safe, and make about twice as many loads and stores as are in principle necessary to guard against overflow.
The 8086 had an address space of 1Mbyte (20 address bits) — though effectively using it was non-trivial (see section 11.2.1). Quite soon after it was launched, this was obviously too small. Subsequently, the 80286, launched in 1982 (the 186 was not really a `mainstream' processor), extended it to 24 bits, with an elaborate and difficult-to-use memory mapping system. It also added a few more instructions. Additionally, it had a real mode of operation, in which it pretended to be an 8086. This was for backward compatibility — for many years, substantial amounts of code ran in 8086 real mode and it is still present in the architecture. The alternative to real mode is called protected mode.
The 80386 was launched in 1985. It was a 32-bit extension to the 286, with new instructions, and an even more elaborate memory model (though you could effectively turn it off). The new instructions reduced the degree to which the machine's data registers were dedicated to specific tasks. The data registers were also extended to 32 bits. By now, the architecture was basically that of a general register machine — except for the floating point and a few dark corners. Subsequent machines (80486, Pentium, Pentium Pro/II/III/IV etc.) have made very few changes to the architecture (except for MMX and SSE), though the implementation has changed significantly. It is effectively this architecture that we are referring to as `Intel architecture', or `x86', or (more properly) `IA32'.
The result is a messy collection of past decisions on architecture; variable length instructions that take variable times to execute (and hence are hard to pipeline); and instructions that reference memory a lot (which is currently slow compared to registers). The register set is shown in figure 11.1, with the original 8086/87 set highlighted to distinguish it from the 80386 extensions.
Like most CISC architectures of its generation, the IA32 provides a variety of addressing modes, most of which would be left out of a modern RISC. However, the main problem with IA32 memory is the legacy of past architectures.
The original 8086 mode, or real mode, is shown in figure 11.2. Because all internal registers were 16 bits, and addresses 20 bits, some deviousness was required. The concept of segment registers was introduced. There were four of these, pointing to the start of segments of memory — one for code, one for the stack (effectively, variables in compiled code), one for data (effectively dynamic data structures in compiled code), and one for extra data. To get to the memory address you were interested in, you took the segment register, shifted it left four bits, and added a 16-bit offset. The problem with this was that it was difficult to access a single data item/program/etc. that wouldn't fit into a 64Kbyte segment. This resulted in non-standard, and horrible, extensions to programming languages by compiler vendors.
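The real-mode calculation just described can be written out directly. A sketch (the function name is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* 8086 real-mode address formation: shift the segment register left
   four bits and add the 16-bit offset, giving a 20-bit physical
   address. The final mask models the 8086's 20-bit wrap-around
   (the origin of the infamous A20-line issue on later processors). */
uint32_t real_mode_addr(uint16_t segment, uint16_t offset) {
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}
```

One consequence worth noting: many segment:offset pairs alias the same physical address (for example 0x1234:0x0010 and 0x1235:0x0000 both yield 0x12350), which complicated pointer comparison in real-mode compilers.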
The 80286 (figure 11.3) extended addresses to 24 bits, but kept 16-bit internal registers. This made matters worse: to look up a word in memory, the segment register indexed a segment table, which yielded a segment descriptor containing a 24-bit base address, as well as segment protection information. The 24-bit base was added to a 16-bit offset to generate a 24-bit address: so, even though the total address space was now 16Mbyte, it was still difficult to use blocks bigger than 64Kbyte, meaning the non-standard extensions were still needed.
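The two-step lookup can be sketched as follows (a hypothetical simplification: real 286 descriptors also carry limit checks, access rights, and a local/global table choice, all omitted here, and the structure layout is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of 80286 protected-mode translation: the segment register
   (now called a selector) indexes a descriptor table; the descriptor's
   24-bit base is added to a 16-bit offset. */
typedef struct {
    uint32_t base;   /* 24-bit segment base address */
    uint16_t limit;  /* segment size: note, still at most 64Kbyte */
} descriptor;

uint32_t protected_addr(const descriptor *table, uint16_t selector,
                        uint16_t offset) {
    /* The low 3 selector bits hold protection/table information,
       so the table index is the selector shifted right by 3. */
    const descriptor *d = &table[selector >> 3];
    return d->base + offset;   /* 24-bit physical address */
}
```

Because the offset is still 16 bits, each segment remains limited to 64Kbyte even though the descriptor base can point anywhere in 16Mbyte, which is exactly why the non-standard language extensions survived.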
Worse, the segment protection information, designed to implement memory protection in multitasking operating systems, was in an inconvenient place (on the least-significant end of the address). This meant that the address arithmetic needed to generate addresses to data and code larger than 64Kbyte was complicated. Irritatingly, Microsoft's operating systems did not implement memory protection. Only a few, rarely-used, Unix-derivatives did — this was pre-Linux. It must have been very frustrating for the 286's designers to have expended so much effort on something that was essentially unused.
Finally, the 80386 (and successors — figure 11.4) allowed all this to be fixed by extending both the address space and the internal registers to 32 bits. Although the segmentation system was basically the same, a 32-bit offset is used, so by simply loading the same value into all the 16-bit segment registers you can pretend that there is a simple flat address space. Optionally, the resulting 32-bit address can be used to implement a virtual memory system, via a two-level page table system. (Figure 11.4 does not show the details of this, but the version in the online appendix of Hennessy & Patterson does.)
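The two-level page-table walk splits the 32-bit linear address into three fields: a 10-bit directory index, a 10-bit table index, and a 12-bit offset within a 4Kbyte page. A sketch of the walk (heavily simplified and hypothetical: real 386 entries hold physical addresses plus present/permission bits, and a TLB caches recent translations; here plain arrays and pointers stand in so the sketch runs anywhere):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 80386's two-level page-table walk for a 32-bit
   linear address with 4Kbyte pages. Present bits, permission checks
   and the TLB are all omitted. */
uint32_t translate(uint32_t **page_dir, uint32_t linear) {
    uint32_t dir_idx   = linear >> 22;           /* top 10 bits    */
    uint32_t table_idx = (linear >> 12) & 0x3FF; /* middle 10 bits */
    uint32_t page_off  = linear & 0xFFF;         /* bottom 12 bits */

    uint32_t *page_table = page_dir[dir_idx];    /* first-level lookup  */
    return page_table[table_idx] + page_off;     /* frame base + offset */
}
```

The two-level split means the full table need not exist in memory at once: directory entries for unused 4Mbyte regions can simply be marked absent.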
Intel had real problems in using modern techniques with its architecture — and it has done spectacularly well, considering. Nonetheless, a lot of performance is arguably lost when you compare it with more modern competitors.
To get the same performance, an IA32 processor must run faster than, say, a Power series, Sparc or Alpha processor. For example, decoding is much more difficult, so you need a longer pipeline, which means more penalty on a mis-predicted branch — meaning better (more expensive) branch prediction, etc. Economy of scale (Intel — and clone-maker AMD — far outsell the other, RISC, processors) is sufficient to overcome these disadvantages.
The first pipelined x86 was the 80386. The 80486 had two functional units, and the Pentium was the first to be superscalar (2-way). The P6 was Intel's internal name for the successor to the Pentium. This first appeared as the Pentium Pro, in 1995. The Pentium Pro was unusual in that it came as a dual cavity chip package, with a closely-coupled Level 2 cache (see chapter 16) packaged with the processor. This was fast, but expensive. In 1997, the original Pentium gained MMX — multimedia extensions, essentially a fairly basic set of vector operations — and later the same year the Pentium II was released.
The Pentium II was a P6-based processor, like the Pentium Pro, but with MMX added and larger Level 1 instruction and data caches (16KB each vs. 8KB each). The larger Level 1 caches offset the fact that the closely-coupled Level 2 cache was replaced by ordinary, off-chip static RAM (see chapter 15), which was cheaper.
The P6 organisation was also subsequently used for the Pentium III, with MMX being replaced by SSE (streaming SIMD extension). Pentium IIIs had either an on-chip Level 2 cache, or a (larger) off-chip Level 2 cache, depending on the version. In addition, the `low-end consumer' Celerons (with, for example, smaller Level 2 caches) and the server application Xeons (bigger Level 2 caches, multiprocessor support) were also based on the P6. Although the organisation of the P6 remained unchanged through these various processors, it was re-engineered in detail a number of times, to increase performance — partly by making improvements and partly to take advantage of improving chip technology (see chapter 17 for more on chip technology). The various versions of the P6 are typically known by their internal Intel names. For example, Katmai was the first Pentium III, which was later replaced by Coppermine. This is normal practice in the microprocessor industry; the same happened with the Pentium 4 and is happening with Intel Core & Core 2.
Modern superscalar and pipelined techniques work best with RISC machines, because instructions are a uniform length (which makes prefetching instructions easier), and because they take a (generally) uniform time to execute (thus making scheduling of pipelines easier). Also, by making modern architectures load/store, memory traffic (another bottleneck) is reduced.
Stuck with an existing architecture, Intel had problems. One approach would have been that adopted by DEC for the MicroVAX — a `small' VAX. DEC chose to implement the trickier, more complex instructions in software. This enabled them to implement a simpler instruction set in hardware (actually, a subset of the original), hence leaving them with more on-chip space available for performance-increasing features.
Unfortunately, Intel's view was that unlike in the MicroVAX case, the long/complex instructions were commonly-used by compilers, and putting them in software would unacceptably impact performance. Therefore, they implemented a hardware decoder front-end, which breaks down the more complex instructions into simpler micro-ops, or µ-ops. These, together with the simpler instructions, are then sent to a high-performance RISC processor for execution. This technique is not really new, and was used in a similar situation for a high-performance VAX implementation. The same technique is also used by AMD, to solve the same problem.
This solution, while ingenious, isn't perfect — first of all, it means that a part of the chip is taken up by the translator. The µ-ops are chosen to make decoding etc. easy, which keeps the translator compact and fast. However, the translator is still an overhead — in space and time. Furthermore, although the translator makes life easier for the RISC part of the P6, it still has to deal with the variable length and variable execution time of the IA32 instructions.
The overall structure of the pipeline is shown in figure 11.5. The primary path of data and control signals through the pipeline is shown in bold; secondary data and control is shown in solid, non-bold; and dotted lines show some control-only signals. The RISC core has five execution units — two for integer ops, one for FP ops, two address generation units for loads and stores — and can execute three instructions simultaneously (three-way superscalar). It has 14 pipeline stages.
(Note that it is often said that the P6 has 10 pipeline stages. This is not correct: however, operation results become available for other instructions to use after 10 clock cycles, before the instruction that generated them retires. This is probably the source of the confusion.)
Instructions are fetched from the level 2 cache (assuming they're not in the main instruction cache) by a bus interface unit. This gets two lines from the level 2 cache, totalling 64 bytes. IA32 instructions can be long, so fetching this much is necessary to ensure that an entire instruction is fetched. The fetched instruction is then stored in the primary instruction cache, which has a branch target buffer attached.
Instructions are then taken, three at a time, by the instruction decoder. This consists of three separate decoders — two simple, one complex. The simple decoders are able to handle those instructions that can be represented by a single µ-op; instructions that expand to up to four µ-ops are handled by the third, complex, decoder.
All this can be done in a single clock cycle: most instructions can be represented by a single µ-op, and most of the rest by four or fewer. The few remaining possibilities are handled by a microprogrammed decoder, which may take a number of cycles. The IA32 instruction set is complex, and instructions can have various modifiers. The worst case is several hundred µ-ops — however, such cases will be very rare. The actual µ-ops themselves are very similar to typical RISC instructions — three operands: two source, one destination.
At this point, the µ-ops are executed as if they were conventional RISC instructions. The P6 uses a register-renaming scheme based on a reorder buffer with 40 entries. As they are decoded, µ-ops are sent via a register alias table, which keeps track of the eventual destination of re-order buffer entries. After this, instructions go to the reservation stations, which resolve dependencies and issue those instructions to the functional units. Results go either to the reorder buffer, or to the data cache.
Ultimately, data may have to be read from the main memory, if it isn't present in either cache. The main data bus of the P6 (the frontside bus) is transactional — that is, you don't need to wait for a result, you can just dispatch a read request and get on with something else. When the result becomes available, then you can return to processing the instruction that needs it. The memory order buffer ensures that memory operation results are dealt with properly (i.e. in the correct order).
With such a long pipeline, the penalty of missing a branch prediction would be high. The P6 uses a branch history scheme very similar to the correlating predictor we saw in section 5.8. Predicting up to 15 branches ahead is possible (says Intel…).
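The idea behind a correlating predictor of the kind referred to above can be sketched in a few lines. This is a hypothetical gshare-style miniature (table size, index function and names are invented; the P6's actual predictor tables are, of course, larger and Intel has not published their details): a global history of recent branch outcomes is combined with the branch address to index a table of 2-bit saturating counters.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of a correlating branch predictor: a global history
   register is XORed with the branch address to index a small table of
   2-bit saturating counters. All sizes are illustrative. */
#define HIST_BITS  4
#define TABLE_SIZE (1 << HIST_BITS)

static uint8_t counters[TABLE_SIZE]; /* 2-bit counters, values 0..3 */
static uint8_t history = 0;          /* last HIST_BITS branch outcomes */

int predict(uint32_t pc) {
    uint8_t idx = (uint8_t)((pc ^ history) & (TABLE_SIZE - 1));
    return counters[idx] >= 2;       /* upper half of range: predict taken */
}

void update(uint32_t pc, int taken) {
    uint8_t idx = (uint8_t)((pc ^ history) & (TABLE_SIZE - 1));
    if (taken  && counters[idx] < 3) counters[idx]++;   /* saturate up   */
    if (!taken && counters[idx] > 0) counters[idx]--;   /* saturate down */
    history = (uint8_t)(((history << 1) | (taken & 1)) & (TABLE_SIZE - 1));
}
```

Because the history forms part of the index, the same branch can be predicted differently depending on how its neighbours behaved — the `correlating' part, as in section 5.8.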
When it was launched, Intel claimed that the P6 was twice as fast as the Pentium. But…they compared a 100MHz Pentium with a 133MHz P6!
The reasoning behind this apparently dubious logic is that rather than comparing clock speeds, you should compare processing technology. The processing technology you use to make a chip determines its clock speed — generally, the more advanced it is, the faster your chip will go (actually, as we will see in chapter 17 when we look at chip technology, there are other factors involved). On this basis, it would seem sensible to compare chips of the same speed. However, Intel claimed that because there are more pipeline stages in a P6, and hence the stages are shorter, you can make a faster P6 than Pentium on a given process for the same money. This is because clock speed also depends on how far signals have to travel across a chip — more stages means smaller pipeline units, and hence a faster clock. This is actually a perfectly plausible point of view.
Unfortunately though, we cannot be sure — another factor in the process is `alchemy'. Chip manufacturers essentially make large batches of chips, and then test them: quite a lot don't work at all. Some will run faster than others, because of uncertainties in the manufacturing process — these are then sold for more money!
For example, all the early Klamath-based Pentium IIs (clock speeds 233–300MHz) probably came from the same fabrication line: they would then have been separated out according to the predefined speed bands (233MHz, 266MHz, 300MHz) and sold accordingly. The same is probably true of all the other P6 variants (and indeed of all other microprocessors). For a high-cost, high-performance chip (like the P6 when it was launched), Intel may be quite happy to bin a huge number of underperforming chips, just keeping the small proportion that run very fast (and charging enough for them to offset the cost). In the case of lower-end processors, they may set the advertised clock speed lower, simply to ensure that a higher proportion of their production run is saleable, and hence cheaper/more profitable/both (and to differentiate products: high-end=fast=expensive; low-end=slow=cheap).
This information is commercially very sensitive, and there is no way that Intel is ever going to tell us — so we cannot know for sure if their claims are really true. Directly comparing processors of the same clock speed, the P6 was only about 1/3 faster than the Pentium — this clearly shows just how difficult it now is to get performance upgrades from the IA32 architecture. (Actually, to be fair, it is now difficult to get substantial performance gains from any established architecture.)
As of around 2000, Intel had moved on from the P6 to a newer organisation, called NetBurst. (The term `P7' is also commonly found describing the same thing — though it is not used by Intel.) This forms the basis of the Pentium 4s, but a new architecture (`Intel Core') started replacing it in 2006.
In many ways, NetBurst is substantially similar to the P6. For example, it is also 3-way superscalar. However, there have obviously been substantial `improvements'. The quotes are because some of the changes are debatable from a purely technical point of view. Some of the more interesting changes are considered below.
Branch prediction is better in NetBurst and the Level 1 instruction cache has been replaced by a trace cache. Instead of caching IA32 instructions, this caches µ-ops. More importantly, it attempts to identify traces — sequences of instructions that cross conditional branches.
A consequence of this different caching policy is that it is not easy to directly compare cache sizes. The current versions can cache about 12K µ-ops, and the actual cache size is 20KB. However, it is not very meaningful to directly compare this to a P6 cache.
An additional integer unit and an additional address computation unit were added. Also, the integer units operate at twice the core clock speed. That is, a 3.0GHz Pentium 4 will have integer ALUs operating at 6.0GHz. (Thankfully, Intel marketing do not try to describe such chips as `6 GigaHertz'.)
The P6's reorder buffer is replaced by a register renaming scheme. Also, there are more registers available (128 versus the 40 in the P6 reorder buffer).
This is the most contentious change: NetBurst has a pipeline that is nearly twice as long as the P6.
The net result of this is that the IPC (instructions per cycle) of NetBurst is less than that of the P6. That is, a 1GHz P6 will outperform a 1GHz NetBurst. The reason for this is the higher penalties for things like branch mis-prediction. Also, the Level 1 data cache (at least for the current processors) is only 8KB versus 16KB for most P6-based processors.
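The clock/IPC trade-off can be made concrete with the standard relation: execution time = instructions / (IPC × clock rate). The sketch below uses this relation; note the IPC figures in the usage comment are made up purely for illustration, since the real figures are workload-dependent and not published.

```c
#include <assert.h>

/* Execution time for a fixed instruction count, from the standard
   performance relation: time = instructions / (IPC * frequency).
   Illustrative sketch only. */
double run_time(double instructions, double ipc, double ghz) {
    return instructions / (ipc * ghz * 1e9);  /* seconds */
}

/* Example (made-up numbers): if a P6-style core achieved an IPC of 1.0
   and a NetBurst-style core only 0.8, the latter would need a 25%
   higher clock just to match run times on the same program. */
```

This is why a lower-IPC, higher-clock design is only a win if the process and pipeline really do deliver the extra clock speed — the heart of the debate over NetBurst.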
This was presumably an economic trade-off by Intel — they must have felt that the resources were better used elsewhere. The smaller data cache is at least partly offset by the faster, on-chip, Level 2 cache (256KB in the early Willamette processors, 2MB in the final Pentium 4, `Cedar Mill', due 2006).
The net consequence of the longer pipeline is that Intel must clock NetBurst significantly faster than the P6. Actually another consequence of the longer pipeline is that it is much easier to do this — as we will see when we look at chip manufacture in chapter 17.
It has been suggested that this was not obviously the best technical decision — perhaps making a 6-way superscalar P6 might have been better. In fact, there are many complex issues here — many of the details of which are not going to leave Intel. To be fair, it is certainly not obvious that the longer pipeline is a bad decision. Hennessy & Patterson call it `at least reasonable'. However, it is worth considering that Intel have been in a `MegaHertz War' with AMD, and a design that had higher clock rates would be attractive even if it did not reflect real performance gains. As we shall see, this may not have turned out so well for Intel after all...
The first NetBurst Pentium 4s — the Willamette series — were launched in late 2000, with clock rates of about 1.3GHz. The Willamette processors continued until late 2001, when they had reached 2GHz. Willamette processors have 256Kbyte Level 2 caches, about 42 million transistors, consume about 50 Watts and use a chip fabrication technology with a minimum feature size of 0.18 microns (0.18 thousandths of a mm). The minimum feature size is the smallest size object that can be reliably manufactured on a chip.
They were replaced in early 2002 by the Northwood series, with clock rates starting from 2GHz, and reaching 3GHz in 2003. They have bigger Level 2 caches (512Kbyte), as well as other changes (for example, in the chip packaging). Northwood processors have about 55 million transistors, and consume 55-70 Watts, depending on the clock rate. They have a minimum feature size of 0.13 microns, so despite having more transistors they are actually smaller than Willamette processors. Commercially, Northwood was more successful than Willamette, which some saw as a `rushed to market' stopgap introduced in response to pressure from AMD's highly successful Thunderbird.
Prescott succeeded Northwood in early 2004, with a minimum feature size of 0.09 microns. They have 1MB Level 2 caches, clock rates from 3.2-3.8GHz, and 100 million transistors. Unfortunately, these turned out to produce 60% more heat than Northwood without significant performance benefits — to negative acclaim all round, naturally. The thermal problems proved so severe that Intel abandoned the Prescott processor series altogether, and the whole affair is now commonly perceived to be a result of `marketing led design' in which increasing clock speeds became the overriding concern, at the expense of architectural matters.
The final series of NetBurst Pentium 4s were the Smithfield and Cedar Mill processors, which were primarily interesting for introducing dual-core to the architecture series. In addition to the `vanilla' Pentium 4s, there are Celeron and Xeon variants. Foster and Prestonia are the Xeon equivalents of Willamette and Northwood, and include dual-processor support.
NetBurst processors ran rather hot in general, and as such were unsuitable (in particular) for mobile computing (i.e. laptops). As such, in parallel with NetBurst, Intel revived the P6 line, making a number of modifications helpful in the mobile setting (mainly aimed at improving thermal and power performance), and called it the Pentium M family. In 2006 Intel launched the Core microarchitecture, focusing on multiple cores and hardware virtualisation as key components, and sporting better thermal properties.
In general, the future of high-performance 32-bit chips seems limited, with 64-bit processors becoming available. 32-bit architectures like IA32 are clearly approaching the end of their life. They cannot (easily – there are nasty hacks) address more than 4Gbytes of memory and this is already a problem in high-end applications requiring lots of memory (graphics/video etc., some server applications). There are a number of potential 64-bit competitors, and we will look at one architecturally interesting one, Intel's IA-64, next, in chapter 12. The legacy of the 8086 continues, however, and Intel's next generation x86 architecture, `Core 2', unveiled in early 2006, is indeed to be 64-bit. It should also be noted that AMD's `AMD64' architecture extends (and cleans up) x86 quite successfully, and has even been licensed back to Intel (who call it EM64T and use it in the Core 2), in an interesting reversal of fortune.