
Chapter 12  Intel IA-64 and Itanium

12.1  Introduction

Intel's IA-32 architecture is problematic for a variety of reasons. One is simply that it is hard to implement using modern techniques, although Intel (and AMD) have successfully addressed this, at some cost. Another is that it is a 32-bit architecture, and 32-bit architectures are clearly coming to the end of their life. High-end applications and servers are already running out of memory address space: 4GB is simply not enough. Even consumer systems are now routinely sold with 512MB or more. The usual guideline is that memory requirements double every two years. At that rate, `top-of-the-range' consumer systems will reach 4GB in three or four years, and `entry-level' ones in five to eight years. We can reasonably conclude that a decade from now, not many people will be buying 32-bit systems if memory use continues to grow at the same rate.
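The arithmetic behind these estimates can be checked with a short sketch. It assumes only the rule of thumb quoted above (memory sizes double every two years); the function name, and any starting size other than 512MB, are illustrative assumptions rather than figures from the text.

```c
#include <assert.h>

/* Years until a system that ships with `start_mb` of memory today reaches
 * `limit_mb`, assuming (per the rule of thumb) that typical memory size
 * doubles once every `years_per_doubling` years. */
unsigned years_to_reach(unsigned long start_mb, unsigned long limit_mb,
                        unsigned years_per_doubling)
{
    unsigned doublings = 0;

    assert(start_mb > 0);   /* a zero-sized start would never reach the limit */
    while (start_mb < limit_mb) {
        start_mb *= 2;
        doublings++;
    }
    return doublings * years_per_doubling;
}
```

Calling `years_to_reach(512, 4096, 2)` counts three doublings, i.e. six years, which falls inside the five to eight year range given for entry-level systems. A system starting at 1GB (an assumed figure for a high-end machine) gets there in four.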

Major chip manufacturers have come to the same conclusion, and for some time have been developing, and in some cases selling, 64-bit systems. Until recently, the most prominent desktop example was Apple's G5 — however, in this chapter, we are going to focus on successors to IA-32.

In addition to the memory issue, 64-bit processors promise higher performance: partly because they process twice as much data at a time and partly because they can abandon earlier design decisions that look bad in retrospect. Floating point performance, for example, tends to be substantially better on the new 64-bit chips, which already makes them attractive to certain `niche' consumer markets.

The first of these 64-bit systems was DEC's Alpha architecture, which we examine in chapter 13. DEC's market was high-end systems, and Alphas have been available for over a decade now (though as we will see, the implementations at least were not truly 64-bit). An interesting point to consider is that, given we are seeing the end of 32-bit systems, there is no absolute guarantee that the same companies will continue to dominate the market for 64-bit systems. Having said that, (a) it is unlikely that Microsoft is going away; and (b) successors to IA-32 are in a very strong position because for many years to come they will be able to execute IA-32 code. So 64-bit SPARCs, Alphas and PowerPCs have an outside chance at best of dominating the consumer market (although the current 64-bit market is dominated by Sun and IBM).

However, AMD is in a strong position, and could `replace' Intel. AMD and Intel have chosen different 64-bit architectures, and it is not yet clear who will come to dominate. AMD's design is much more conventional than Intel's IA-64, which was developed in conjunction with HP (and this is why we are going to look at IA-64). AMD have extended IA-32 in much the same way that Intel extended its predecessor to 32 bits. The existing registers have been extended to 64 bits, and a further 8 registers have been added (a total of 16 GP registers). New 64-bit instructions have been added, but the 32-bit architecture still remains. AMD have stated that they believe that CISC architectures now rival RISC architectures in performance. There are two possible reasons for this.

  1. Economy of Scale — AMD and Intel sell a lot more processors than anyone else, so they can afford to do things that others cannot in both development and manufacture. Note that this reason does not really have anything to do with RISC vs. CISC.
  2. CISC Overhead Relatively Fixed — It may be the case that the overhead of translating CISC instructions into internal RISC instructions (which both AMD and Intel do with IA-32) is not growing, or not growing as rapidly, as chip resources grow. This means that over time the relative proportion of resources devoted to translation is falling.

All this means AMD has a probable advantage in the short term: it is much easier to adapt compilers etc. to take effective advantage of AMD's architecture than of IA-64, hence Intel's own adoption of AMD's extensions in the Core 2 family, for example. However, in the long term, the potential performance gains for IA-64, given the necessary developments in compilers etc., are probably greater. Intel have chosen a VLIW-based approach (see chapter 10), with modifications designed to address some of the disadvantages of conventional VLIW processors. They call the approach EPIC: Explicitly Parallel Instruction Computing.

12.2  IA-32 `vs' IA-64

In the IA-64 project, Intel have discarded all the expertise they developed, at considerable cost, on the IA-32 series of processors (of course they haven't actually discarded it: the point is that it isn't being used here). This essentially explains why the early Itaniums are very big, use a lot of power, and don't perform well. Intel are also asking software developers to discard their expertise. IA-64 processors run IA-32 code in hardware, but not very quickly. Companies have in the past changed architectures successfully, usually using emulation software to run old code (for example, Apple). However, none of these companies faced the direct competition that Intel does.

For the first few years at least, IA-64 processors have been outperformed by IA-32 successors (which are now increasingly 64-bit themselves). However, in the long term (maybe), they will be able to exploit more instruction-level parallelism, leading them to outperform those successors. In the short to medium term at least, however, it seems that the move towards multi-core will continue to drive performance increases in the non-VLIW world, and it seems highly likely that this will squeeze IA-64 into a footnote of history. Even so, Intel have spent about $1 billion on IA-64, and they can be expected to support it for many years before it either becomes successful, or they are forced to abandon it. It is also quite possible, or even likely, that AMD have a project shadowing IA-64.

To complicate matters further, the existing 64-bit architectures (i.e. the ones from HP/Compaq and especially Sun and IBM) are well-established in the existing, specialised, markets (e.g. high-end servers) and there does not seem to be any good reason for their customers to change. Initially at least, both AMD and Intel will be trying to get part of this market. In the case of servers, Intel at least has experience of building processors for the market (the 32-bit Xeons). AMD may have a cost advantage if their 64-bit system costs are more in line with typical PC-based systems than current 64-bit systems: established 64-bit systems are not cheap, and neither are current IA-64 systems.

In the case of high-performance desktop systems, Windows and Linux are both currently available for IA-64, though they have yet to reach any level of maturity (especially Windows). Initially at least, performance on AMD processors is not going to be substantially better because of the need to migrate to the new architectural extensions (though at least legacy code will not be slow).

The first IA-64 processor was called Itanium, and previously had the internal name Merced¹.

Intel's collaboration with HP started in 1994, and the first processor was not released generally until mid-2001 (though samples were available to hardware manufacturers before that, as well as an emulation system for software development). The release of Itanium was delayed at least twice, indicating that Intel had trouble with it. This is not particularly surprising, as it is a radical departure. However, rumour had it that the problems concerned IA-32 compatibility. There are two main areas we will look at: the IA-64 architecture; and the Itanium's implementation of IA-64 and its performance.

12.3  IA-64 Architecture

IA-64 is a load/store architecture, with 64-bit memory addresses and registers. There are 128 64-bit general-purpose registers, 128 82-bit floating point registers, 64 1-bit predicate registers (see section 12.3.2) and 8 64-bit branch registers (used to hold branch destination addresses). There are also assorted control registers that we will not concern ourselves with. Intel have chosen to use a RISC 2-style register stack (see chapter 3), with some modifications. The first 32 registers are global, and the remainder are allocated to procedures/functions. The size of the block of registers allocated to a procedure/function is variable. Also, instead of overlapping blocks of registers, the calling procedure's first local register is used as a pointer to a block of the called procedure's local registers for passing parameters.

12.3.1  Groups and Bundles

Rumours had it for some years that IA-64 would be a VLIW machine, and this turns out to be basically the case. Consequently, Intel have had to face and solve the classic problem of VLIW machines: how to take advantage of performance increases in newer processors when running old code. The solution is quite sophisticated and clever, though (in the early processors at least) it seems to remove some of the usual VLIW advantages. That is, it is still necessary to provide significant quantities of hardware to resolve dependencies, rename registers, etc.

IA-64 divides instructions up into groups and bundles. A group is a sequence of instructions that could in principle be executed in parallel provided:

The architecture does not put any bound on the length of a group. However, the boundary between groups must be explicitly marked by a stop. A stop is coded between instructions as part of a bundle.

A bundle is a set of three instructions packed into a 128-bit word. Each actual instruction is 41 bits long (3 × 41 = 123 bits in total), with the remaining 5 bits being used for the template. One of the roles of the template is to indicate the presence of a stop within a bundle. Stops can appear at the end of a bundle, or between instructions. There may be no stop within a bundle at all.
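To make the packing concrete, here is a minimal sketch that splits a 128-bit bundle into its template and three instruction slots. The text above only fixes the field widths. The bit positions used here (template in the lowest 5 bits, followed by the three slots in order) are our understanding of Intel's published layout and should be treated as an assumption, and the type and function names are invented for illustration.

```c
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
typedef struct { uint64_t lo, hi; } Bundle;

typedef struct {
    unsigned template_field;   /* 5 bits  */
    uint64_t slot[3];          /* 41 bits each */
} DecodedBundle;

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)

/* Assumed layout: bits 0..4 template, 5..45 slot 0, 46..86 slot 1,
 * 87..127 slot 2. */
DecodedBundle decode_bundle(Bundle b)
{
    DecodedBundle d;
    d.template_field = (unsigned)(b.lo & 0x1F);
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;
    /* Slot 1 straddles the halves: 18 bits from lo, 23 bits from hi. */
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    d.slot[2] = (b.hi >> 23) & SLOT_MASK;
    return d;
}

/* Inverse of decode_bundle, useful for checking the field arithmetic. */
Bundle encode_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2)
{
    Bundle b;
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)(tmpl & 0x1F) | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}
```

Encoding and then decoding a bundle should return the same four fields, which is a quick way to confirm that 5 + 3 × 41 really does fill 128 bits with nothing left over.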

Figure 12.1 shows an example of 11 bundles and 6 groups. The table shows how groups and bundles are related (a bracketed bundle is split between two groups). Note that the subdivision of instructions into groups and bundles is essentially independent: bundles can span group boundaries and vice versa. Also note that there are limited options for more than one stop within a bundle. However, it is possible to insert No-Ops. In practice, a good compiler will probably insert a significant number of No-Ops. We will look at compilers in more detail in chapter 14.


Figure 12.1: IA-64 Bundles and Groups

As well as encoding the presence of stops, the template encodes the type of each of the three instructions in a bundle. The options are as follows.

For example:

The fact that both I and M type instructions can do integer ALU operations increases flexibility without having to increase the size of the template field. For example, suppose you wished to put two ADDs and a SUB in a bundle. This would need a template pattern of I, I and I (which does not exist) if M type instructions could not do integer ALU operations.

The actual instructions themselves are 41 bits long. The first 4 bits are the major opcode. Together with the template, they specify the instruction type in more detail. A (variable) number of further bits are used to precisely define the operation. The next 6 bits are used to specify the predicate register (see section 12.3.2). The remaining bits are used to specify operands (or are unused).
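The field widths just described can be sketched as a small decoder. The widths (4-bit major opcode, 6-bit predicate field, 41 bits in total) come from the text. The bit positions chosen here, with the major opcode in the top 4 bits and the predicate in the lowest 6, are an assumption, as are the names, and the variable-width middle fields are simply lumped together as `rest`.

```c
#include <stdint.h>

typedef struct {
    unsigned major_opcode;  /* 4 bits */
    unsigned predicate;     /* 6 bits: selects one of 64 predicate registers */
    uint32_t rest;          /* remaining 31 bits: opcode extension, operands */
} InsnFields;

/* Split one 41-bit instruction slot into its fixed-width fields.
 * Assumed positions: bits 37..40 major opcode, bits 0..5 predicate. */
InsnFields split_instruction(uint64_t insn)
{
    InsnFields f;
    insn &= (UINT64_C(1) << 41) - 1;            /* keep only the 41 slot bits */
    f.major_opcode = (unsigned)(insn >> 37) & 0xF;
    f.predicate = (unsigned)(insn & 0x3F);
    f.rest = (uint32_t)((insn >> 6) & ((UINT64_C(1) << 31) - 1));
    return f;
}
```

Note that 4 + 6 + 31 = 41, so the three fields account for every bit of the slot.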

12.3.2  Predication

Another aspect of IA-64 that is interesting (though not really new, as a simpler version appeared in the ARM² architectures) is a way of dealing with (some) conditional branches without needing to flush the pipeline or predict their outcome.

The way it works is that each instruction is conditional. That is, it is only executed if a specific condition, or predicate, is true. The IA-64 architecture includes a set of 64 1-bit predicate registers. Comparison instructions can be used to set and clear these predicate registers, and subsequent instructions are conditionally executed, depending on their values.

For example, consider the following C (or C++ or Java or C#) code:

if (a == b) {
    a = 0;
} else {
    a = b;
}

Traditionally, assuming a and b are held in registers Ra and Rb, this would translate to something like:

        CMP Ra,Rb        ; Check a=b
        JNE else         ; No - goto else part
        MOV Ra, 0        ; Otherwise, Ra := 0
        JMP end          ; Skip past else
  else  MOV Ra, Rb       ; Ra := Rb
  end   whatever         ; Rest of program

The problem with this, of course, is that you've got to predict the outcome of the branch, and if you get it wrong you've wasted work (though not much in this case). A predicated version might look like this (note that this is made up code, not IA-64).

        CMPEQ  Ra,Rb,P1/P2   ; Check a=b
    [P1]MOV Ra, 0            ; If true, Ra := 0
    [P2]MOV Ra, Rb           ; else Ra := Rb

The CMPEQ instruction compares a and b, and sets predicate register P1 to true if they are equal and to false if they are not. Predicate register P2 is set to the inverse of P1. The next two instructions are each predicated on one of P1 and P2. The first MOV only actually sets Ra to zero if P1 is true: otherwise, it just becomes a no-op. The net result is that there is no need to predict the outcome of a branch, and stall if we are wrong. In IA-64, (nearly) all instructions are predicated, with the 6-bit predicate field selecting one of the 64 predicate registers.
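The behaviour of the made-up predicated code can be modelled in a few lines of C. This is only a software sketch of the semantics: the `Machine` struct and the function names are invented here, and the `if` inside `mov_pred` stands in for hardware that issues the instruction regardless and simply suppresses the write.

```c
#include <stdbool.h>

/* Toy machine state: two registers and two 1-bit predicate registers. */
typedef struct {
    long ra, rb;
    bool p1, p2;
} Machine;

/* CMPEQ Ra,Rb,P1/P2: P1 := (Ra == Rb), P2 := !P1. */
void cmpeq(Machine *m)
{
    m->p1 = (m->ra == m->rb);
    m->p2 = !m->p1;
}

/* [P] MOV dst, src: the write only takes effect when the predicate is set;
 * otherwise the instruction behaves as a no-op. */
void mov_pred(bool pred, long *dst, long src)
{
    if (pred)
        *dst = src;
}

/* The predicated sequence: all three instructions are always issued and
 * there is no branch in the instruction stream. */
long run_predicated(long a, long b)
{
    Machine m = { a, b, false, false };
    cmpeq(&m);
    mov_pred(m.p1, &m.ra, 0);      /* [P1] MOV Ra, 0  */
    mov_pred(m.p2, &m.ra, m.rb);   /* [P2] MOV Ra, Rb */
    return m.ra;
}

/* The original if/else, for comparison. */
long run_branching(long a, long b)
{
    if (a == b) {
        a = 0;
    } else {
        a = b;
    }
    return a;
}
```

For any inputs, `run_predicated` and `run_branching` return the same value. The difference is that the predicated version always issues exactly three instructions, one of which is squashed.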

It may seem that the usefulness of this is questionable: although you avoid the overhead of stalls on mis-prediction, you have to execute instructions that turn into no-ops. Surely this is effectively the same as stalling and flushing the pipeline, except that you have to do it every time? Not necessarily: in the case above, you execute one extra instruction, taking probably one cycle. If you mispredict the branch, you will have fetched many instructions ahead, which is going to be costly: 14 in the P6 series processors (see chapter 11), and about 20 in NetBurst (i.e. Pentium 4).

The tradeoff balances the likelihood of mis-prediction (higher with conditional statements than loops) against the number of instructions in each part of a conditional (many have only a few). In practice, there will be a break-even point: for if statements with fewer instructions than the break-even point, predication will be better, and for larger ones normal conditional branches will be better. The compiler must make appropriate decisions for each case.
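A very rough cost model illustrates where the break-even point comes from. It is a sketch under simplifying assumptions that are ours, not Intel's: each instruction costs about one cycle, a mis-prediction costs a fixed number of wasted instruction slots, and the compare and jump instructions themselves are ignored. All names and the example probabilities are hypothetical.

```c
/* Expected cost (in instruction slots, assuming roughly one per cycle) of
 * compiling an if/else as a conditional branch: only one arm runs, but a
 * mis-prediction wastes `penalty` slots. */
double branch_cost(double n_then, double n_else, double p_then,
                   double p_mispredict, double penalty)
{
    double path = p_then * n_then + (1.0 - p_then) * n_else;
    return path + p_mispredict * penalty;
}

/* Cost of the predicated form: both arms are always issued, and the arm
 * whose predicate is false turns into no-ops. */
double predicated_cost(double n_then, double n_else)
{
    return n_then + n_else;
}

/* Nonzero when predication is expected to be the cheaper choice. */
int predication_wins(double n_then, double n_else, double p_then,
                     double p_mispredict, double penalty)
{
    return predicated_cost(n_then, n_else)
         < branch_cost(n_then, n_else, p_then, p_mispredict, penalty);
}
```

With one instruction in each arm, a 10% mis-prediction rate and a 14-slot penalty (the P6 figure quoted above), predication costs 2 slots against an expected 2.4 for the branch, so predication wins. With 20 instructions in each arm the predicated form costs 40 against about 21.4, and the branch wins.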

12.4  The First IA-64 Processors: Itanium

Merced was finally released as Itanium in mid-2001. It was an 800MHz machine, with 295 million transistors. The vast majority of these formed a Level 3 cache on a separate chip, packaged with the main processor (which had only about 25 million transistors). The whole package was very large, and consumed 130 Watts, which was substantially more than other processors. There are 10 pipeline stages, divided into four parts. The first is responsible for fetching and branch prediction (3 stages); the second for instruction issue and register renaming (2 stages); the third for operand fetch (2 stages); and the fourth for execution and write-back (3 stages). It should be noted that 10 stages is a lot for a VLIW-based machine. There are nine functional units: 2 I-units; 2 M-units; 3 B-units; and 2 F-units. All of these are internally pipelined.

Itanium was only really intended as an engineering exercise. The first commercial processor was supposed to be McKinley (Itanium 2). However, it is useful to compare the performance of Itanium with other processors. On integer computation, even if we scale performance to account for clock rate differences, it is substantially outperformed by a 2GHz Pentium 4 and a 1GHz Alpha 21264. However, for floating point computation, even without scaling, it does better than both of them a substantial part of the time. Floating point code usually has more potential parallelism (though it is also much less common). None of this looks particularly good, but remember that Intel are starting more or less from scratch in terms of expertise, and so are the compiler writers. The power dissipation is a potentially serious problem — and not only for mobile systems (not that early IA-64 processors will be aimed at them). Getting rid of heat is already a big problem, and more power means more and bigger heatsinks and fans (which themselves use power), and bigger and more expensive motherboards, cases, etc.

12.5  Subsequent/Later IA-64 Processors

The next IA-64 processor, McKinley (Itanium 2), was intended to be a commercial processor. However, Intel changed their mind and that too became essentially an engineering exercise. It had a 1GHz clock, with the Level 3 cache integrated on the main chip. The total transistor count was 221 million, which means the actual chip is very big. As we will see in chapter 17, this means it is inherently expensive. The pipeline was shorter, and there were other changes as well: performance improved, but it was still slower than competing IA-32 processors.

The first `real' IA-64 processors were Madison and Deerfield (2003), with Deerfield being the low-cost/low-power version. Madison has 500 million transistors (!), a higher clock rate (up to 1.6GHz), a much faster memory bus, and a bigger on-chip Level 3 cache (up to 9MB). At 130W, power consumption was still a real issue with Madison, though Deerfield improved matters at 62W, which is more in line with IA-32 figures. Future Itanium processors are all expected to feature two or more processor cores, faster clocks, and Intel's HyperThreading technique (first seen on Xeon processors).

HyperThreading introduces the idea of thread level parallelism, and is a way of making better use of under-utilized functional units in superscalar (and VLIW) processors. It is equally applicable to IA-32 and has started to appear in later Pentiums (and on TV adverts). Essentially, in many processors much of the time, many functional units are under-used. Since a lot of modern code is multi-threaded (particularly server code, and servers will be targeted first) it makes sense to try to interleave multiple threads simultaneously. The hardware pretends to be multiple virtual processors. This has to be done carefully. For example, suppose we have a simple machine with one integer unit and one floating point unit, executing mainly integer code. The limiting factor in such a machine is likely to be the single integer unit, and unless we can interleave a separate mainly floating point thread, we are unlikely to see any increase in performance. In fact, we are likely to slow things down. However, suppose we have many integer units. Inter-instruction dependencies are now likely to be the limiting factor, and one integer thread is unlikely to keep many units busy simultaneously. In this case, it does make sense to try to interleave multiple threads. More on this later.
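The two scenarios above can be illustrated with a deliberately crude model. It is a toy sketch, not a description of how HyperThreading is implemented: it assumes each integer thread can supply a fixed average number of independent instructions per cycle, and it ignores the contention overheads that cause the slowdown mentioned above. The function name and the example figures are hypothetical.

```c
/* Integer instructions issued per cycle in a toy model: each thread can
 * supply at most `ilp_per_thread` independent instructions per cycle, and
 * the machine has `int_units` integer units to run them on. */
double issue_rate(unsigned int_units, double ilp_per_thread, unsigned threads)
{
    double demand = ilp_per_thread * threads;
    return demand < int_units ? demand : (double)int_units;
}
```

With one integer unit and threads that can each keep it busy, `issue_rate(1, 1.0, 1)` and `issue_rate(1, 1.0, 2)` are both 1.0: the second thread gains nothing. With four integer units and an assumed 1.5 independent instructions per thread, one thread achieves 1.5 and two threads achieve 3.0, so interleaving roughly doubles the use of the units.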


1
This is a river in California, which runs through the Yosemite National Park. In case you were wondering, most of the internal names that get chosen are somebody's favourite place. There are exceptions, however. Exercise: Why was one of Apple's machines called `Yikes!', and why (allegedly) did a now-dead famous(ish) person threaten to sue them when they (allegedly) planned to use his name? Very Obscure Hint: a name they did use at the time was `Piltdown Man'.
2
http://www.arm.com/
