Looking at the relative speed-ups of CPUs and memories since 1980 we see that memory has increased in speed by a factor of about 4; CPUs by a factor of about 20000. The mismatch, which continues to worsen, means that CPUs are continually held back by slow memory – unless we do something. As we saw in chapter 15, it isn't that we can't build high-speed memory – we just can't afford it – at least, not enough of it.
One obvious strategy is: if memory is too slow, don't use it as much. This is not as stupid as it sounds (and perhaps wasn't really that obvious) – if you put enough registers on your processor, you can dramatically reduce the number of memory accesses once the data has been read in for the first time.
This strategy was used extensively in the early RISC processors, which often had hundreds of registers. However, it does nothing for instruction access. At least one memory access is required per instruction – in a pipelined processor, this means one per cycle. Furthermore, it is difficult to put arrays etc. in registers, and every time you swap processes you need to save them all, which is slow. (Actually you don't always need to save them all, but complications arise if you don't. For example, you need some mechanism for saying how many you will save, which complicates your instruction set. It can also get complex when you deal with separate compilation – typically the caller does the saving and may not know how many to save.)
A longstanding solution to this problem is the cache – a small(ish) block of high-speed memory (small enough to be affordable). Cache memory takes its place in the memory hierarchy – see figure 16.1. The vast majority of references to instructions and data are local – that is, the next data item/instruction we will use is going to be near the last one. Locality of reference means, if we are careful, we can keep the things we are using most high up the pyramid, and only rarely need to access the lower levels.
Just as with virtual memory – where most of the memory pages of the current processes sit on the hard disk, with only a little in main memory – small parts of the main memory are copied into the cache (and some of this, in turn, into the registers). The system works very similarly to virtual memory in practice: when a memory word is accessed, it is looked for in the cache. If it's there – fine.
Otherwise, we have a cache miss (like a page fault in virtual memory), and have to read the word, usually together with a few of the surrounding words, into the cache. However, there are some significant differences. Typically, the access time for a word in a cache is 1 clock cycle – if a cache miss occurs, and the word needs to be fetched from main memory, the access time could be up to 300 cycles: over two orders of magnitude more, which seems a lot. However, if an access to main (virtual) memory results in a page fault, the access time could be 10–20 million clock cycles: five orders of magnitude. Therefore, it is much more important to avoid page faults than cache misses. Furthermore, the whole point of a cache is that it's fast – many methods used to reduce the miss rate also increase complexity, which is likely to reduce speed.
Therefore, in the tradeoffs between complexity and effectiveness, simplicity is much more favoured by cache designers than virtual memory designers. (Another reason is that given such long delays, it makes no sense to stall the processor for a page fault. Instead, a context switch will be performed by the operating system to another process while we wait for the page to load. Hence page faults are handled in software and cache misses in hardware. It is a lot easier to implement relatively sophisticated algorithms in software than hardware.)
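These cycle counts can be combined into a single figure using the standard average memory access time (AMAT) formula. Here is a minimal Python sketch; the 1-cycle hit and 300-cycle miss penalty come from the discussion above, while the 2% miss rate is an assumed illustrative value:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in clock cycles."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 300-cycle miss penalty, assumed 2% miss rate:
print(amat(1, 0.02, 300))   # → 7.0 cycles per access on average
```

Even a 2% miss rate multiplies the average access time by seven, which is why so much effort goes into keeping the miss rate low.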
Figure 16.2 shows the relationship between main memory and a cache. Each slot in the cache can hold one block of (contiguous) memory words – a block is usually some number of main memory words (in figure 16.2 it is 3). The cache memory is accessed not via a memory address as such, but by pattern matching on a tag stored in the cache. The tag is constructed from the main memory address (usually it's just a subset of the bits), and means that a block may be stored in a vacant slot in the cache. (Actually not necessarily any vacant slot – see later.)
The obvious problem is: how do we look things up in the cache? The mapping function is the strategy used to store and locate memory words within the cache. There are several variants, but in all cases we have a mapping from main memory address to location in cache.
Note that we cannot use the system typically used for virtual memory, i.e. a lookup table where a separate page table in memory is consulted to translate virtual address to physical addresses – it would be too slow. Speed is essential with cache: adding an extra lookup cycle would be unacceptable, and completely defeat the point, in fact.
There are three strategies, forming a continuum:
The simplest strategy is direct mapping – each block of main memory can map to only a single slot in the cache, deterministically. Since a cache is, by its nature, much smaller than main memory, each cache slot has many main memory blocks which could map on to it. Obviously only one can do so at a time, so the questions are: how do we tell whether the block currently occupying a slot is the one we want, and how do we find the word we want within it?
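The deterministic mapping is just modular arithmetic on the block number. A tiny Python sketch, using made-up sizes (not the ones from the figures):

```python
NUM_SLOTS = 8          # slots in the cache (hypothetical size)
WORDS_PER_BLOCK = 4    # words in each block (hypothetical size)

def slot_for(address):
    # The block number determines the slot, deterministically.
    block_number = address // WORDS_PER_BLOCK
    return block_number % NUM_SLOTS

# Blocks 0, 8, 16, ... (addresses 0, 32, 64, ...) all compete
# for slot 0, and can evict one another.
print(slot_for(0), slot_for(32), slot_for(64))   # → 0 0 0
```

Any two addresses whose block numbers differ by a multiple of the slot count collide; the tag field (below) is what distinguishes them.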
A simple example is shown in figure 16.3. We divide the memory address up into three fields: the slot field, which identifies which slot in the cache can contain this memory address; the tag field is then used to check if a particular block is in the cache (i.e. does the slot contain the memory address we actually want, or some other address which happens to have the same slot field); and the word field is used to identify a particular memory word.
The slot field is used to look up a block within the cache, just like a memory address. (The slot field is often also called the index.) When a block is stored in the cache, its tag field is stored in the tag field of the appropriate cache slot. Therefore, if the tag field of the address matches the tag stored in the cache, then the required block is present in that particular slot. The word field identifies the actual word being accessed within the cache slot.
The example in figure 16.3 is a specific one, with 32-bit memory addresses and 2K cache slots, each of which can store 8 words. Therefore the total cache size is 2K × 8 = 16K words. The last 3 bits of the memory address are used to access one of the 8 words within a cache slot (3 bits gives us the range 0–7); i.e., the word field is 3 bits. A cache with 2K slots requires 11 address bits: therefore, the slot field is 11 bits. The remaining bits (32 − 11 − 3 = 18) form the tag field. (Note: you really need to know the powers of 2 quite well to work out the various field sizes effectively.)
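The field extraction is just shifting and masking. A Python sketch of the figure 16.3 layout (3-bit word field, 11-bit slot field, 18-bit tag):

```python
WORD_BITS = 3    # 8 words per block
SLOT_BITS = 11   # 2K slots
TAG_BITS = 32 - SLOT_BITS - WORD_BITS   # = 18

def split_address(addr):
    """Split a 32-bit address into (tag, slot, word) fields."""
    word = addr & ((1 << WORD_BITS) - 1)
    slot = (addr >> WORD_BITS) & ((1 << SLOT_BITS) - 1)
    tag = addr >> (WORD_BITS + SLOT_BITS)
    return tag, slot, word

# An address built from tag=5, slot=3, word=6 splits back apart:
print(split_address((5 << 14) | (3 << 3) | 6))   # → (5, 3, 6)
```

This is exactly what the cache hardware does in parallel with combinational logic; the Python is only a model of the bit layout.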
Notice the order of the tag and slot fields – if we swap them around, then adjacent blocks of memory will map to the same cache slot. This would not be good for performance, since most programs show locality of access – both to code and data. So mapping adjacent memory blocks to the same cache slot would mean that in all probability blocks would be continually swapped in and out of the cache.
Past experience (i.e. exam question answers) says that a lot of people have trouble working out the various field sizes used with different sizes of cache and different slot sizes. So figure 16.4 shows some values, assuming 32-bit memory addresses and 8-bit words:
                       Block size (words)
    Cache size    2        4        8        16       32
    1KB           22/9/1   22/8/2   22/7/3   22/6/4   22/5/5
    2KB           21/10/1  21/9/2   21/8/3   21/7/4   21/6/5
    4KB           20/11/1  20/10/2  20/9/3   20/8/4   20/7/5
    8KB           19/12/1  19/11/2  19/10/3  19/9/4   19/8/5
    16KB          18/13/1  18/12/2  18/11/3  18/10/4  18/9/5
    32KB          17/14/1  17/13/2  17/12/3  17/11/4  17/10/5

    tag/slot/word field sizes
Notice the various relationships between the fields. The word field is independent of the others: it depends only on the number of words in a block. The slot field depends on the number of slots (divide the total cache size by the number of words per block). The tag field is whatever is left over. Doubling the cache size means the slot field gets larger by one bit, and the tag gets smaller by one.
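These relationships can be captured in a few lines of Python, reproducing the table entries (a sketch assuming, as in the table, 32-bit addresses and 8-bit words, so cache size in bytes equals cache size in words):

```python
from math import log2

def field_sizes(cache_bytes, block_words, addr_bits=32):
    """Tag/slot/word field widths for a direct-mapped cache of 8-bit words."""
    word = int(log2(block_words))                 # bits to pick a word in a block
    slot = int(log2(cache_bytes // block_words))  # bits to pick a slot
    tag = addr_bits - slot - word                 # whatever is left over
    return tag, slot, word

print(field_sizes(1024, 8))        # → (22, 7, 3)   matches the 1KB row
print(field_sizes(32 * 1024, 32))  # → (17, 10, 5)  matches the 32KB row
```

Doubling `cache_bytes` adds one bit to the slot field and removes one from the tag, exactly as described above.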
The advantage of direct mapping is simplicity. However, it is inflexible, and if two commonly-used blocks clash, slow – because we will have to keep swapping the two blocks in and out of the cache. Associative mapping is the opposite extreme – any block of memory can map to any slot in the cache. This is illustrated in figure 16.5 – notice that now we need to use nearly the whole memory address as the tag. There is no slot field because, obviously, we do not look up a particular slot – a word can be stored anywhere we can find space for it. Associative cache is flexible, but expensive and/or slow, since we need to search all cache slots simultaneously. There is special memory that allows this – associative memory, or content-addressable memory (CAM) – but it is neither cheap nor particularly fast.
See figure 16.6 for another table, this time containing only tag & word field sizes. Notice that this time the cache size has no effect.
                       Block size (words)
    Cache size    2       4       8       16      32
    1KB           31/1    30/2    29/3    28/4    27/5
    ⋯             …       …       …       …       …
    32KB          31/1    30/2    29/3    28/4    27/5

    tag/word field sizes
A compromise is to allow each block of memory to occupy one of a small set of slots in the cache (typically 2 or 4). This is set-associative mapping. Figure 16.7 illustrates the basic idea with two-way mapping. Effectively, we are dividing the cache into a number of smaller caches, each of which may contain the word (or, more properly, the block) we are looking for, and which we must search simultaneously. We now only need to search a few slots simultaneously (instead of every one in the cache), and we retain sufficient flexibility to avoid clashes the vast majority of the time. We don't need special memory – just a few small ordinary memories that we can access in parallel.
This is a widely used technique in practice (e.g. 80486 uses 4-way, P6 uses 2-way for the instruction cache, 4-way for the data cache).
Once again, we have a table — see figure 16.8. This time, we keep the block size fixed at 8 words, and instead vary the degree of associativity.
                       Degree of associativity
    Cache size    1-way    2-way    4-way    8-way    16-way
    1KB           22/7/3   23/6/3   24/5/3   25/4/3   26/3/3
    2KB           21/8/3   22/7/3   23/6/3   24/5/3   25/4/3
    4KB           20/9/3   21/8/3   22/7/3   23/6/3   24/5/3
    8KB           19/10/3  20/9/3   21/8/3   22/7/3   23/6/3
    16KB          18/11/3  19/10/3  20/9/3   21/8/3   22/7/3
    32KB          17/12/3  18/11/3  19/10/3  20/9/3   21/8/3

    tag/slot/word field sizes
Notice that as the degree of associativity rises, the size of the slot field falls. When it reaches zero, we have a fully-associative cache. At the other extreme, notice that a 1-way set-associative cache is just a direct-mapped cache. In general, direct-mapped and fully-associative caches are special cases of the more general set-associative cache.
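The special cases drop out of one general calculation. A Python sketch (same assumptions as the tables: 32-bit addresses, 8-bit words):

```python
from math import log2

def sa_field_sizes(cache_bytes, block_words, ways, addr_bits=32):
    """Tag/slot/word field widths for a set-associative cache of 8-bit words."""
    sets = cache_bytes // (block_words * ways)  # number of sets
    word = int(log2(block_words))
    slot = int(log2(sets))
    return addr_bits - slot - word, slot, word

print(sa_field_sizes(1024, 8, 1))    # 1-way = direct mapped  → (22, 7, 3)
print(sa_field_sizes(1024, 8, 128))  # one set holding every block
                                     # = fully associative    → (29, 0, 3)
```

Setting `ways=1` reproduces the direct-mapped table; setting `ways` to the total number of blocks makes the slot field vanish, leaving only tag and word fields as in the fully-associative table.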
When a memory write occurs to a block in the cache, we can either update the corresponding main memory word immediately, perhaps in the background (write through), or just write out the changed block when it is thrown out of the cache (write back). The latter is obviously more efficient, but can complicate things like direct memory access (DMA) I/O, where an input/output device accesses memory directly, bypassing the processor – because if the cache and memory are inconsistent (that is, the version of the word in main memory is out of date, because the copy in the cache has been changed), we must be careful about DMA'ing blocks of memory out to disk.
The problem with writing is one reason for the trend towards separate instruction and data caches – the instruction cache shouldn't need to write out at all. However, a more pressing reason for this trend is pipelined architectures – separate caches allow simultaneous instruction and operand fetching by different instructions in the pipeline. In addition, a specialised instruction cache means we can pre-decode parts of instructions, so it is very easy and quick to tell if they are, for example, branches; and an instruction cache can be linked with a branch target buffer (see chapter 5).
A further issue to consider with writes is: if you get a cache miss on a write, do you load the block into the cache or not? Doing so is called write allocate, or fetch on write. It's done in the expectation that if a write is made to a word, then adjacent words will be either read or written soon. The alternative is called no write allocate, or write around, and here the block is not loaded. For not especially compelling reasons, write-back caches tend to be write allocate, and write-through caches tend to be no write allocate.
Another point is: when we come to replace a block in the cache, which one do we choose to discard? With direct mapping, the solution is easy – we have no choice. But in other circumstances, we do. The three most commonly used algorithms are Least Recently Used, First In First Out and Random. The performance penalty of using Random (or pseudo-random, of course) is very small, particularly as caches get bigger, which means it is becoming more popular – especially as it's very easy to implement.
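Least Recently Used can be modelled compactly. A toy Python sketch (the class name and capacity are illustrative, not from any real cache):

```python
from collections import OrderedDict

class LRUSlots:
    """Toy LRU replacement over a fixed number of slots."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # tag -> block contents, oldest first

    def access(self, tag):
        if tag in self.slots:
            self.slots.move_to_end(tag)       # now most recently used
            return True                       # hit
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)    # evict least recently used
        self.slots[tag] = None                # load the missing block
        return False                          # miss
```

In hardware, maintaining this ordering needs extra bits per slot and update logic on every access, which is why Random, needing nothing but a pseudo-random number, is so attractive.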
One of the first things to think about when considering performance is: how big should the cache be? Studies show that cache miss rates for instructions are much lower than for data. An instruction cache of 8KB has a miss rate of less than 1%; for 256KB this falls to about 0.002%. There does not seem to be much point in making instruction caches much bigger. For data caches, however, the miss rate at 8KB is 4%, and at 256KB it is still 3%. So there is still room for improvement – on the other hand, you do not have a whole lot to show for increasing the cache size by a factor of 32: just a 25% fall in the miss rate!
Although it seems obvious that the bigger the better, this isn't always true. Eventually, we may get to the stage where the extra performance gain is not worth it. Bigger caches reduce the chance of a cache miss, but when the percentage probability of a cache hit is already in the high 90s, diminishing returns set in. Also, larger caches are slower (bigger search space). In practice, however, caches have been getting bigger. Another factor driving up cache sizes is the footprint of a process in a cache. When process A is suspended, and process B takes over, process B may have to spend some time getting its data and instructions into the cache. If the cache is quite small, the slots occupied by B (i.e. B's footprint) may well significantly overlap A's. Hence when A is restarted, it must read significant amounts back into the cache, slowing it down. Bigger caches make this less likely.
Microprocessors have historically tended to go for smaller caches, but keep them on-board the processor chip. This reduces the communication delays, and speeds up access, making up for the smaller size (e.g. 486 has on-board 8K cache, Pentium and Pentium Pro/II have 16K, split between data and instructions).
A trend in processor design is to provide multi-level caches. Behind a relatively small on-chip cache, there may be a much larger level 2 cache, originally implemented with fast memory chips and now usually integrated onto the processor chip as more on-chip resources become available. We are beginning to see level 3 and level 4 caches on microprocessors. The point of these extra caches is to reduce the performance penalty in the event of a cache miss, without compromising the performance of the level 1 cache by growing it – which may adversely affect the speed of cache lookup (as we have said, bigger caches are slower).
A question that arises with multi-level caches is: should the data in one also be in another? That is, should the contents of the Level 1 cache also be in the Level 2 cache? The `natural' answer is yes: if we consider the main memory to be just another, very large and slow, cache, then we would not delete blocks of memory that were moved to the cache. However, if we have a small Level 2 cache, then exclusion can be a sensible thing to do because it stops us wasting space in the Level 2 cache on data/instructions that are already present at Level 1. The cost of doing this is making some operations more complex (for example, checking the consistency of data in the cache and main memory).
The purpose of a cache is to maximise the speed at which memory accesses occur. This basically means: make sure that (a) cache hits are very quick; (b) cache misses are rare; and (c) when they do occur, they're not too slow. These goals partly conflict (at least if we take a naïve approach) – for example, the fastest caches are small and simple, but that is likely to increase the chances of a cache miss. In general, therefore, most microprocessor Level 1 caches are quite small, and are either direct-mapped or 2- to 4-way set-associative, to keep them fast. Nonetheless, there are other parameters we can play with to speed things up. In retrospect, early work on caches put too much focus on cache misses – in practice, we don't care specifically how many misses there are, as long as our system is fast overall. Cache misses are an important part of that – but not the only one. The first thing to do is to categorise misses: compulsory misses (the first access to a block can never be a hit), capacity misses (the cache simply cannot hold everything the program is using), and conflict misses (blocks competing for the same slot or set evict each other).
The two traditional approaches to reducing miss rates have been (a) to increase the block size; and (b) to increase the degree of associativity. The idea behind (a) is that data and programs are localised, and hence by loading a bigger block you'll get data/program that you'll eventually need. It certainly reduces the compulsory misses, but it can increase the others – by having bigger blocks, you have to have fewer of them, which means more sharing and hence conflict misses. Also, because the blocks are bigger, you could be loading stuff you don't need – which could waste space in a small cache (leading to more capacity misses). Also, when a miss does occur, it takes longer to load the (bigger) blocks.
The idea behind (b) is to increase flexibility and hence reduce conflict misses. A rule of thumb is that a k-way associative cache of j Kbytes is about as effective as a 2k-way associative cache of j/2 Kbytes. However, as we've seen, such caches get slower – in reducing the number of misses, we're also reducing the speed of hits.
There are newer techniques as well. Some involve hardware and others focus on software. For example, a victim cache is a small (2–8 entry) fully-associative cache sitting between the main cache and the data path to memory. It is used to store the last few blocks thrown out of the cache, and is automatically checked when a miss occurs. If the block is found to be present, it's swapped back into the main cache. The idea is to catch references to blocks that have just been discarded, but which really should have been kept. Because it is small, such a fully-associative cache can be made fast and does not consume too much chip space.
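The behaviour can be sketched in a few lines of Python (class and method names are illustrative only; a real victim cache is pure hardware):

```python
from collections import OrderedDict

class VictimCache:
    """Toy victim cache: a tiny fully-associative buffer of evicted blocks."""
    def __init__(self, entries=4):
        self.entries = entries
        self.blocks = OrderedDict()   # tag -> block contents, oldest first

    def store(self, tag, data):
        # Called when the main cache throws a block out.
        if len(self.blocks) >= self.entries:
            self.blocks.popitem(last=False)   # discard the oldest victim
        self.blocks[tag] = data

    def reclaim(self, tag):
        # Called on a main-cache miss: return the block (removing it,
        # since it is about to be swapped back in) or None.
        return self.blocks.pop(tag, None)
```

A block recovered by `reclaim` costs far less than a trip to main memory, which is the whole point of the structure.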
A less hardware-intensive approach is to get the compiler to understand and exploit the limits of the cache, by increasing locality of reference. By grouping data, and by combining (or, in some circumstances, splitting) loops, we can arrange for data and code to be swapped in and out of caches less often.
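Loop combining (fusion) is easy to illustrate. A Python sketch of the idea (both functions compute the same result; the fused version makes one pass over memory instead of two):

```python
# Two separate passes touch every element twice; if the array is
# bigger than the cache, the second pass re-fetches blocks that the
# first pass has already evicted.
def two_passes(a):
    for i in range(len(a)):
        a[i] += 1
    for i in range(len(a)):
        a[i] *= 2

# Fusing the loops does both operations while the element's block
# is still resident, halving the number of passes over memory.
def fused(a):
    for i in range(len(a)):
        a[i] = (a[i] + 1) * 2
```

In Python the saving is invisible, of course; the transformation matters when a compiler applies it to large arrays in a language like C, where each pass streams blocks through the cache.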