
Chapter 15  High Performance Memory

15.1  Introduction

Typically, the programmer of a system is aware of the presence of registers and memory. In practice, for performance reasons, we also usually have one or more levels of cache (see chapter 16), and the memory itself will not be the simple, uniform structure that is (abstractly) visible to the programmer.

15.2  Aside — Memory Technology


Figure 15.1: n-type and p-type Field Effect Transistors

Before we get into how we can improve performance, let us consider the basics of memory technology. There are essentially two forms of memory: static RAM (SRAM) and dynamic RAM (DRAM). To understand how each works, we first need to cover some basics of silicon chip technology.

The vast majority of current chips use a technology called CMOS – complementary metal oxide semiconductor. The `metal oxide semiconductor' part is a hangover from the past; the `complementary' part refers to the fact that two types of transistor are used: n-type and p-type. The two types, and their behaviour, are shown in figure 15.1.

The two types of transistor act as switches. Essentially, and simplifying a lot, if the value on the gate of an n-type transistor is 1, then the source and drain are connected, and if the gate is 0, then they are not. The behaviour of the p-type transistor is reversed (`complementary'). (There are also Silicon technologies that use only one of the transistor types – nMOS and pMOS. However, they are now outdated – especially pMOS.)
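
As a purely illustrative sketch, the switching behaviour can be written down in C, treating the transistors as ideal digital switches (real devices are, of course, analogue); the function names are made up for this example.

    #include <stdio.h>

    /* Simplified digital model: return 1 if source and drain are connected. */
    int n_type_conducts(int gate) { return gate == 1; }   /* n-type: on when the gate is 1 */
    int p_type_conducts(int gate) { return gate == 0; }   /* p-type: on when the gate is 0 (complementary) */

    int main(void) {
        for (int gate = 0; gate <= 1; gate++)
            printf("gate=%d  n-type on=%d  p-type on=%d\n",
                   gate, n_type_conducts(gate), p_type_conducts(gate));
        return 0;
    }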

15.2.1  SRAM — Static RAM


Figure 15.2: SRAM Cell

SRAM is logically the more complex technology – requiring six transistors per bit – however, its electrical behaviour is simpler, so we will look at it first. An SRAM cell is shown in figure 15.2.

Four of the transistors are used to make a pair of inverters – NOT gates, essentially. Each inverter requires a pair of transistors – if the input is 0, then the p-type transistor will be on, and the n-type off. This will connect the output to power, which is equal to logic 1. Otherwise, if the input is 1, the output will be connected to ground, or logic 0. The two inverters are connected in a loop, with the output of one the input of the other.

This arrangement has two stable states: we interpret these two states as 1 and 0. The other two transistors are used to control reading and writing. To read the contents of the RAM cell, the word line is set high, allowing the contents of the cell to be read out onto the bit line (and its inverse onto the not bit line). To write the cell, we again set the word line high, but this time we set the bit line (and its inverse) to the value we wish to store, forcing the cell into the appropriate state.
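
To make this more concrete, here is a toy model of an SRAM cell in C – purely illustrative, with made-up names (sram_cell, sram_write, sram_read): the cross-coupled inverter pair is represented by a single stored value, and the word line simply enables access to it.

    #include <stdio.h>

    /* Toy SRAM cell: the two cross-coupled inverters hold one bit; the word
     * line gates access to the bit line. All names here are illustrative. */
    typedef struct { int q; } sram_cell;     /* q and its complement are the two inverter outputs */

    void sram_write(sram_cell *c, int word_line, int bit_line) {
        if (word_line)                       /* access transistors on: the bit line overrides the loop */
            c->q = bit_line ? 1 : 0;
    }                                        /* word line low: the inverter loop simply holds q */

    int sram_read(const sram_cell *c, int word_line) {
        return word_line ? c->q : -1;        /* -1: cell isolated, nothing driven onto the bit line */
    }

    int main(void) {
        sram_cell c = {0};
        sram_write(&c, 1, 1);                            /* word line high, drive a 1 onto the bit line */
        printf("read back: %d\n", sram_read(&c, 1));     /* word line high again to read it out */
        return 0;
    }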

15.2.2  DRAM — Dynamic RAM

DRAM only uses one transistor per bit. However, understanding its behaviour is a little more difficult because we have to look beyond the purely logical behaviour and consider some of the physics. A DRAM cell is shown in figure 15.3.


Figure 15.3: DRAM Cell

To store a bit, the word line is set high, the value to be stored is put on the bit line, and then the word line is set low. If the stored bit is a 1, then a charge will be stored on capacitor Cbit. To read a bit, the bit line is set high, and then isolated – leaving a charge on the line. The word line is then set high and low again. If the stored value was a 1, then the charge on the bit line and capacitor Cbit will be (more or less) equal, meaning the charge on the bit line will be essentially unchanged. However, if the stored value was a 0, then a proportion of the charge on the bit line will flow into Cbit – meaning there will be a small, but detectable, drop in the charge on the bit line. Specialized hardware (sense amplifiers) is needed to detect the change. There are a number of consequences of this method of storing bits:

- Reading disturbs the charge on Cbit, so a read is destructive and the row must be rewritten afterwards.
- The charge on Cbit gradually leaks away, so every cell must be periodically refreshed – hence `dynamic' RAM.
- The changes being sensed are very small, so reading a DRAM cell is slow compared with reading an SRAM cell.
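
The charge sharing that the sense amplifier must detect can be sketched numerically. The capacitance and voltage values below are arbitrary illustrative numbers, not taken from any real device; the only assumption that matters is that the bit line capacitance is much larger than Cbit.

    #include <stdio.h>

    /* Toy model of a DRAM read: the bit line is precharged and isolated,
     * then connected to the cell capacitor Cbit. Charge sharing gives the
     * final voltage; the sense amplifier detects the small difference. */
    int main(void) {
        double c_line = 10.0, c_bit = 1.0;   /* bit line capacitance much larger than Cbit */
        double v_precharge = 1.0;

        for (int stored = 0; stored <= 1; stored++) {
            double v_cell = stored ? 1.0 : 0.0;
            /* conservation of charge across the connected capacitors */
            double v_final = (c_line * v_precharge + c_bit * v_cell) / (c_line + c_bit);
            printf("stored %d: bit line goes from %.3f to %.3f\n",
                   stored, v_precharge, v_final);
        }
        return 0;
    }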

15.2.3  SRAM versus DRAM

Clearly, SRAM cannot compete on memory density with DRAM, because it requires six times as many transistors. It also requires more wiring, which also takes up space. Conversely, DRAM cannot compete with SRAM on speed – the changes that are being detected when reading an SRAM bit are much more substantial than those in DRAM. Consequently, each type of memory has been developed to maximise its strengths, which has further differentiated them. That is, DRAM designers know there is no point in trying to compete on speed with SRAM, so they don't bother trying: they focus on density instead.

Typically, as a rule of thumb, DRAM is about five times as dense as SRAM, and about 10 times slower; SRAM is correspondingly much more expensive per bit. This means that DRAM is used for main memory and SRAM is used for caches.

15.2.4  DRAM Chips


Figure 15.4: A typical DRAM chip

The (logical) structure of a typical DRAM chip is shown in figure 15.4. (Note that in practice – for efficiency reasons – the memory array will be divided into many smaller arrays. However, the basic principles are not affected.)

To reduce the number of signal lines to the chip, memory addresses are divided into two halves. The half representing the row of the bit we are accessing within the memory array is sent first, signalled by the Row Address Strobe (RAS). When reading a bit, all the bit lines in the array are charged up, and the row address bits are used to select the word line controlling a complete row of bits. This complete row is read out into a Row Buffer. While this is going on (and hence no time is wasted), the other half of the address is sent, signalled by the Column Address Strobe (CAS). These address bits are used to select a single bit from the Row Buffer. (The writing process is substantially similar.)
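
As a rough illustration of the address split, here is a sketch in C of how a controller might divide an address into its row and column halves; the 12-bit row and column widths (i.e. a 4096 x 4096 array) and the example address are assumptions, not taken from any particular chip.

    #include <stdio.h>

    /* Splitting an address into the row half (sent with RAS) and the
     * column half (sent with CAS). Field widths are illustrative only. */
    #define ROW_BITS 12
    #define COL_BITS 12

    int main(void) {
        unsigned address = 0x00A3C5;                                    /* arbitrary example address */
        unsigned row = (address >> COL_BITS) & ((1u << ROW_BITS) - 1);  /* sent first (RAS) */
        unsigned col = address & ((1u << COL_BITS) - 1);                /* sent second (CAS) */
        printf("address 0x%06X -> row %u, column %u\n", address, row, col);
        return 0;
    }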

15.3  Main Memory Systems

Memory has improved enormously in capacity, but not much in speed, over the past few decades. In 1980, you could buy a 64Kbit Dynamic RAM chip with an access time of 150ns, and a cycle time of 250ns. In 2002, capacity had risen to 1Gb (about 4 orders of magnitude), but access time had only dropped to 40ns and cycle time to 80ns (a factor of 4).

Although caches substantially reduce the amount of data traffic to and from the memory, we would still like to increase memory speed – doing so would reduce the penalty of a cache miss (when the word we are looking for is not in the cache) without compromising performance of cache hits (because we wouldn't have to mess with the design of the cache). However, we must accept that because caches are so effective, we must make substantial improvements in memory performance to see much effect. Making memory faster would be the ideal solution, but that is too expensive. The main techniques used are based on the Truck of Tapes principle.

Suppose you have two sites 1Km apart, linked by a 2Gb/sec link. If you want to transfer a Terabyte over that link, it will take you over an hour:

             1 TB = (1024 ^ 4) * 8 bits

             2 Gb = 2 * (1024 ^ 3) bits

   1 TB at 2 Gb/s = ((1024 ^ 4) * 8) / (2 * (1024 ^ 3))
                  = 4096 seconds
                  = about 1 hr 10 minutes

On the other hand, if you put all the data (on tape, say) in the back of a car (or truck), it'll take you a few minutes (ignoring the obvious logistical problems of reading and writing a Terabyte's worth of tapes). The car wins in this case not because it's fast, but because of the sheer volume it can carry.

So if our memory is slow we can at least try to get it to transfer a lot of data slowly – which with careful use is nearly as good as fast transfer. Basically, we want to increase the memory bandwidth, and we can do that in a number of ways.

15.3.1  Make Memory Words Wider

This is the most obvious technique, based directly on the principle outlined above. Typically, architectural memory is byte addressable – that is, each byte has its own memory address. However, that doesn't mean that the physical memory in the machine has to consist of 8 bit words – provided that it appears that way to the architecture, we can (and do) make memory words (much) bigger.

At first sight, it's not obvious that this will actually help. For example, suppose that a program wants to access a single byte that isn't in the cache (cache miss), and that memory words are 128 bits. It may seem that there would be no difference in access time, since you'd still have to wait for an entire memory access – it's just that you'd get 16 bytes (15 of which you're not interested in – at least at the moment) rather than one.

However, what actually happens on a cache miss is that you have to wait for an entire cache block to be loaded, and blocks are typically quite a bit bigger than memory words. So if cache blocks are 256 bits long and memory words are 128 bits instead of 8, you can load an entire block in 2 memory cycles, as opposed to 32 cycles with 8-bit memory words.
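
The cycle counts quoted above follow from a trivial calculation; here is a quick sketch in C using the 256-bit block and the 8- and 128-bit word sizes from the text.

    #include <stdio.h>

    /* Memory cycles needed to fill one cache block, for the figures used
     * in the text: a 256-bit block and 8- or 128-bit memory words. */
    int main(void) {
        int block_bits = 256;
        int widths[] = { 8, 128 };

        for (int i = 0; i < 2; i++) {
            int cycles = block_bits / widths[i];   /* one memory access per memory word */
            printf("%3d-bit memory words: %2d cycles per block fill\n", widths[i], cycles);
        }
        return 0;
    }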

There are, of course, disadvantages. First, you need a wider bus between the processor (or caches) and memory. This costs money. Furthermore, processors don't generally deal with wider words, so you need to put hardware (called a multiplexor) between the processor and the memory, to allow the processor to select a single word – this can slow down memory access (though in practice it probably won't much if you put it between level 1 and 2 caches). Another problem is that it increases the minimum size memory upgrade: if you quadruple the memory word width, you quadruple the smallest possible memory upgrade.

A final problem is connected with memory error correction codes. It is now common to include error correction bits on memory words, to correct single-bit `soft' errors (at least on high-end systems). If the memory system allows parts of words to be written, then there are difficulties. For example, suppose you wish to write a 32-bit word within a 128-bit memory word. To recompute the error correction bits, you first need to read the remaining 96 bits, which takes (a long) time. The usual `solution' to this is to compute separate error correction for every 32 bits (instead of the complete – potentially much longer – memory word), since most writes are 32 bits. (At least at the moment – until we move to fully 64-bit systems.)

15.3.2  Memory Interleaving

Another approach is to divide the memory into banks, each of which can be accessed in parallel. The memory is arranged so that adjacent banks contain adjacent memory words – the banks are interleaved. For example, if there are four memory banks, bank 0 will contain words with addresses 0, 4, 8, …; bank 1 will contain words with addresses 1, 5, 9, … and so on. Hence, in this example, four adjacent words can be fetched from memory at once.

The nice thing about this system is that you don't need wider buses or multiplexors – you just need the bus to be able to operate faster. For example, if you have four banks of memory, and a bus that can transfer four words in the time it takes to do a single read, then the bus can just be one word wide. It is quite easy to make a bus that operates at several times the speed of memory.

There are some problems with interleaving. The first is the possibility of conflict – needing two (or more) words that happen to be in the same bank at the same time, and being forced to wait. An obvious solution to this is to have lots of banks (though see below for why we can't always do that any more), making conflicts unlikely. However, this doesn't always work. The way to locate a word is to compute the bank number and the address within that bank:

Bank No = address MOD number of banks
Address in bank = address/number of banks
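
These two formulas translate directly into code. Here is a sketch in C for the four-bank example above; bank_no and addr_in_bank are names made up for the sketch.

    #include <stdio.h>

    /* Sequential interleaving: a direct translation of the two formulas
     * above, for the four-bank example. */
    #define NBANKS 4

    int bank_no(unsigned addr)      { return addr % NBANKS; }
    int addr_in_bank(unsigned addr) { return addr / NBANKS; }

    int main(void) {
        for (unsigned addr = 0; addr < 12; addr++)
            printf("word %2u -> bank %d, offset %d\n",
                   addr, bank_no(addr), addr_in_bank(addr));
        return 0;
    }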

To make the hardware used to select the appropriate bank simple using this mapping scheme, you need a number of banks equal to a power of 2. Unfortunately, we use powers of two for lots of things – suppose you had 16 memory banks, and a program that accessed an array with 16 columns in column order. All the elements you want at/near a given time will be in the same bank. This is shown in the example program below and accompanying table. (Note that we are simplifying quite a bit here – for example, we haven't considered the relative sizes of memory words and elements of the array a.)

    int a[16][16];    /* 16-column array, stored in row-major order */
    int i, j;

    for (i=0; i<16; i++) {
        for (j=0; j<16; j++) {
            a[j][i] = a[j][i] * 2;    /* column-order access: successive elements are 16 words apart */
        }
    }
    Bank 0      Bank 1      ...    Bank 15
    a[0][0]     a[0][1]     ...    a[0][15]
    a[1][0]     a[1][1]     ...    a[1][15]
      ...         ...       ...      ...
    a[15][0]    a[15][1]    ...    a[15][15]


Repeated conflicts in interleaved memory
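
To see the conflict concretely, here is a sketch in C that computes which bank each access in the first column of the loop above touches, assuming 16 banks and (as the text's simplification suggests) one array element per memory word, with a stored row-major from word address 0.

    #include <stdio.h>

    /* Which bank each access of the loop above touches: 16 banks, one
     * array element per memory word, a stored row-major from address 0. */
    #define NBANKS 16
    #define COLS   16

    int main(void) {
        int i = 0;                                       /* first iteration of the outer loop: column 0 */
        for (int j = 0; j < 16; j++) {
            unsigned addr = (unsigned)(j * COLS + i);    /* word address of a[j][i] */
            printf("a[%2d][%d] -> word %3u -> bank %u\n", j, i, addr, addr % NBANKS);
        }
        return 0;                                        /* every access lands in bank 0 */
    }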

In practice, this kind of situation is depressingly common. One way to minimise conflicts is to have a prime number of banks – which, on the face of it, would make the hardware complex. However, provided the prime number chosen is one less than a power of two (e.g. 3, 7, 31), it turns out we can simply replace the computation of the address within a bank by:

Address in bank = address MOD number of words in bank

This produces a new mapping scheme called modulo interleaving, shown below with three memory banks (along with the old scheme, called sequential interleaving). A minor drawback of this scheme is that for other reasons prime numbers are inconvenient – for example, combining standard memory chips into a prime number of banks leads to `funny' size memories.

              Sequential                           Modulo
    Bank 0    Bank 1    Bank 2         Bank 0    Bank 1    Bank 2
      0         1         2              0         4         8
      3         4         5              9         1         5
      6         7         8              6        10         2
      9        10        11              3         7        11


Sequential and Modulo Interleaving
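
The modulo scheme is just as easy to express in code. The sketch below (in C) reproduces the Modulo columns of the table, assuming 3 banks of 4 words each.

    #include <stdio.h>

    /* Modulo interleaving with 3 banks of 4 words, as in the table above:
     * the bank number is still address MOD banks, but the address within
     * a bank becomes address MOD words-per-bank. */
    #define NBANKS          3
    #define WORDS_PER_BANK  4

    int main(void) {
        for (unsigned addr = 0; addr < NBANKS * WORDS_PER_BANK; addr++)
            printf("word %2u -> bank %u, offset %u\n",
                   addr, addr % NBANKS, addr % WORDS_PER_BANK);
        return 0;
    }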

Memory interleaving also suffers from the same problem of minimum upgrades, and there's another related problem. Because memory chips have got larger, it is becoming difficult to have many banks, because it's not realistically possible for a single chip to be part of more than one bank. If you want to make a banked 2GB memory with 256Mb chips, you can easily do 8 banks of 256MB, with 8 chips per bank. However, with 1Gb chips, you can only have two banks (again with 8 chips per bank). This is because the traditional internal structure of a memory chip means that each one contributes 1 bit to a memory word, meaning you need 8 for each bank of byte addressable memory (assuming no error correction).
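
The chip and bank counts above follow from a simple calculation; here is a sketch in C that reproduces them, assuming byte-addressable memory words and no error correction bits.

    #include <stdio.h>

    /* Chip count and bank count for a 2GB memory, reproducing the figures
     * in the text: each chip supplies one bit per word, so a bank of
     * byte-addressable memory needs 8 chips. */
    int main(void) {
        unsigned long long total_bits = 2ULL * 1024 * 1024 * 1024 * 8;   /* 2GB of memory */
        unsigned long long chip_bits[] = { 256ULL << 20, 1ULL << 30 };   /* 256Mb and 1Gb chips */

        for (int i = 0; i < 2; i++) {
            unsigned long long chips = total_bits / chip_bits[i];
            printf("%4lluMb chips: %2llu chips in total -> %llu banks of 8 chips\n",
                   chip_bits[i] >> 20, chips, chips / 8);
        }
        return 0;
    }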

This problem is likely to get worse as the rate of growth in density of memory chips has generally outstripped the rate of growth in memory size. That is, over time, we have been using fewer individual memory chips. One other consequence of this is that the minimum memory size of such systems tends to be large – unless a manufacturer uses smaller (i.e. old, slow, relatively expensive and harder to get) RAM chips, which is obviously not attractive.

The problem has been postponed somewhat by reorganising RAM chips so they each supply more than one bit per word. For example, in the case of our 2GB memory, suppose our 1Gb chips were internally arranged to supply bytes rather than bits – that is, each is `really' a 128MB chip. Now we could have 8 banks of 256MB again, with two chips per bank. Another problem that arises from this is that memory technology improves over time, and often we wish to take advantage of that when upgrading. Memory controllers that are flexible enough to manage interleaving with a mix of different size memory chips are quite complex.

One obvious question to ask is: how many banks are optimal? A rule of thumb is that you should have enough to ensure that if memory is being accessed sequentially (e.g. when processing an array), then by the time you try to read a second word from any given bank, the first access to that bank has finished. For example, suppose it takes 8 cycles to access a memory word. If you have 8 (or more) banks, then by the time you come back to a bank, its previous access will have finished – meaning no delays.
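
This rule of thumb can be checked with a very small simulation. The sketch below (in C) issues one sequential access per cycle, assumes an 8-cycle bank access time as in the example, and counts the cycles lost waiting for busy banks; the details of the model are, of course, made up for illustration.

    #include <stdio.h>

    /* Tiny simulation of sequential access over interleaved banks: each
     * access occupies its bank for ACCESS_CYCLES, and a new access is
     * issued every cycle. Counts cycles lost waiting for a busy bank. */
    #define ACCESS_CYCLES 8

    int stall_cycles(int nbanks, int naccesses) {
        long busy_until[64] = {0};                /* cycle at which each bank becomes free */
        long now = 0, stalls = 0;
        for (int a = 0; a < naccesses; a++) {
            int bank = a % nbanks;                /* sequential addresses */
            if (busy_until[bank] > now) {         /* bank still busy: wait */
                stalls += busy_until[bank] - now;
                now = busy_until[bank];
            }
            busy_until[bank] = now + ACCESS_CYCLES;
            now++;                                /* next access issued a cycle later */
        }
        return (int)stalls;
    }

    int main(void) {
        int banks[] = { 2, 4, 8, 16 };
        for (int i = 0; i < 4; i++)
            printf("%2d banks: %d stall cycles over 64 sequential accesses\n",
                   banks[i], stall_cycles(banks[i], 64));
        return 0;
    }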

A solution to the problems of interleaving is to have separate, independent memories – sometimes called superbanks – each with its own address and data buses, all connected to a controller, which is in turn connected to the processor. The controller's job is to decide which memory request to send to which superbank (and do any address translation needed). Because the memories are independent, multiple simultaneous requests are possible. Furthermore, the processor need know nothing about it, so it can be applied to existing designs. The controller is an obvious cost in time and money, however. This is essentially what Rambus did with their ill-fated (as it turned out) RDRAM technology.

15.4  RAM Chip Techniques

A final set of techniques we can consider (and which are widely used) relates to the way RAM chips work. Consider figure 15.4 and notice that as part of the process of reading a single bit, we actually read out a whole row of bits into the Row Buffer. Once in the Row Buffer, these bits can be accessed much more quickly than when they are in the memory array. Instead of simply discarding these bits, we can keep them on the grounds that future memory accesses are likely to require some of them. This is the basis of page mode chips, which have existed in some form for a considerable time now. There are a variety of different RAM chip types that use this and other techniques.
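
As a rough illustration of page mode, here is a toy model in C in which accesses to the currently open row are served from the Row Buffer at a much lower cost; the latency figures, the 12-bit column field and the example addresses are arbitrary assumptions.

    #include <stdio.h>

    /* Toy page-mode model: the chip keeps the last row it read in its Row
     * Buffer; accesses to the same row skip the slow array read. */
    #define COL_BITS       12
    #define ROW_HIT_COST    1     /* read straight from the row buffer         */
    #define ROW_MISS_COST  10     /* read the whole row out of the array first */

    static int open_row = -1;     /* no row currently held in the buffer */

    int access_cost(unsigned address) {
        int row = (int)(address >> COL_BITS);
        if (row == open_row)
            return ROW_HIT_COST;
        open_row = row;           /* the array read refills the row buffer */
        return ROW_MISS_COST;
    }

    int main(void) {
        unsigned seq[] = { 0x1000, 0x1004, 0x1008, 0x9000, 0x9004, 0x1000 };
        for (int i = 0; i < 6; i++)
            printf("access 0x%04X: cost %d\n", seq[i], access_cost(seq[i]));
        return 0;
    }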

15.5  Visible Effects of Performance Increases

Because (as we will see in chapter 16) cache is so effective, the effects of memory performance improvements are largely masked. Even quite dramatic changes will have relatively small observable effects if most memory accesses find their data in caches. What is more, the techniques described in section 15.4 do not help with the time taken to retrieve the first bit: only the subsequent ones.

The net effect is that you are unlikely to be able to measure much, if any, performance increase between, say, conventional SDRAM and RDRAM, or DDR RAM. You will probably need a very memory-intensive application to notice a worthwhile difference. Ironically, the more expensive chips are most likely to be useful in systems that are unlikely to use them – those with cheaper processors, and smaller less-effective caches.

