Note well: this chapter is supplementary, and is not examinable (or included in any lectures). Feel free to ignore it.
In this chapter, we are going to look at storage technology – particularly, non-volatile storage technology. That is, data storage that does not rely on continuous power to retain its content. Currently, this means magnetic disk and optical disk systems, plus a few alternatives – such as flash memory.
In this section, we will look at various issues to do with disk systems. We will not describe in detail how they work – it is assumed that you already know this. (If not, there are many readable descriptions available, on-line and in text books, that are easy to find.) We will also describe the characteristics of flash memory.
By magnetic disks, we mainly mean fixed, hard-drive-type disks. There are of course numerous removable magnetic disk systems available – some more popular than others. However, their performance is generally poor for a number of reasons: they have more environmental exposure, and must be robust (though some are not really); and if they are to stand any chance of being reasonably popular, the removable data storage medium must be relatively cheap. Fixed disks do not have the same environmental exposure (though some need to withstand considerable shocks in practice), and people are generally prepared to pay more – they also quite reasonably expect much more in the way of performance (speed and capacity).
Magnetic disks, and semiconductor memory, have been the dominant storage technologies for decades. There have been numerous attempts to develop newer technologies, but all have failed. Although some showed initial promise, by the time sufficient development had been done to make them viable, the established technologies had moved on as well. (This is largely because of the general principle that it is cheaper and easier to develop established technologies than to develop new ones. Eventually though, the established technology reaches some kind of limit – cost, physical laws etc. – and you have to develop a new one.) The main drawback of disks is speed. If you consider them to be part of the memory hierarchy, below main memory, then you would expect them to have a higher capacity and to be slower. However, they are massively slower – the relative step in access speeds between disks and main memory is larger than any of the others in the hierarchy – typically, disks are five orders of magnitude slower than main memory. Consequently, some attention is given to addressing that (see section 19.4), in much the same way the relative speed of main memory was addressed (see chapter 15).
The reason for the speed problem is essentially that disks are mechanical. This should also mean that they are relatively unreliable – and this is probably true, compared with the rest of the components of a computer system. In fact though, modern disks are extremely reliable and also very cheap when you consider the engineering tolerances required in their manufacture. Most modern disks are 3.5 inches in diameter (approx. 89mm), with formatted capacities of up to (about) 100GB. However, there are also smaller versions for laptops (2.5 inches/approx. 63.5mm) and smaller devices, like digital cameras (1 inch/approx. 25.4mm). The smaller devices have less capacity (a few tens of GB for the 2.5 inch ones, and a few GB for the 1 inch ones). They are also generally slower, and are designed for a less stressful duty cycle – that is, the proportion of time they are expected to be powered on during their lifetime, and the proportion of that time that they are actually doing anything, is significantly less than for the larger disks. A typical 3.5 inch disk is designed to be powered on all the time, and to be actively operating 90% of the time. A typical laptop disk (2.5 inch) is only designed to be powered on half the time, and in active use about a quarter of the time. On the other hand, the smaller drives are designed to be powered on and off much more often, use much less power (2W versus 20W) and have a much higher shock tolerance (about 10G for a desktop drive in actual operation, and over 100G for a portable).
There are three main components contributing to the speed of a modern disk.
In addition to the actual disk itself, we have to consider the control electronics. All modern disks will include a buffer – a block of RAM intended to hold data ahead of that which was requested. Given the long access times for data, it makes sense to read more than was requested, and store it in RAM because locality of access means there is a good chance that precisely that data will be requested in the immediate future. Typical buffers are currently a few MB for desktop drives. We also need to consider the interface between the drive and the rest of the system. We will look at some buses in section 19.3.
However, for now, we will just observe that there are basically two sets of standards normally used for this. Integrated Drive Electronics (IDE) is the cheaper, and generally the slower of the two, and is sometimes called ATA – AT-bus Attachment, where AT is Advanced Technology, from an early IBM PC (all the acronyms are essentially meaningless). The other is Small Computer Systems Interface (SCSI). Generally, SCSI is faster (provided you compare modern SCSI – e.g. Ultra320 – with modern IDE) at transferring data, and allows multiple requests to be outstanding simultaneously. However, IDE is normally preferred – because it is much cheaper – unless performance is a serious issue (e.g. servers, high-performance workstations). Also, IDE is normally only used for disks (magnetic or optical) whereas SCSI is used for a variety of devices.
Optical disks have more or less replaced floppy disks for software distribution, and are likely to replace them (and many other, less common, magnetic disk technologies) for most, relatively small-scale, data storage and transfer. (The internet is also a contender for this of course, but at the moment it has a few drawbacks – I'm sure I do not need to list them.) We are currently seeing the first generation of optical disks (CD-ROM and writeable derivatives) being replaced by the second (DVD and writeable derivatives – though this technology, particularly writeable and re-writeable disks, has yet to properly stabilize). The reasons for the popularity of optical disks are as follows.
The drawbacks of these technologies when compared with magnetic disks are capacity and speed. However, they are clearly not aimed at the same market. One contributor to the relatively low speed is the way data is recorded. Rather than concentric rings of data divided into sectors, CDs and DVDs use a spiral track of data. In the case of CDs the reason for this is obvious – they were designed to replace vinyl LPs for music, which also had spiral tracks, and access to music data is generally sequential. Random access seeking is still possible with this model, but more difficult and slower. Given that at the time the idea of using CDs for computer data storage had not occurred to anyone, these decisions were not unreasonable. However, it is a surprise that the same decisions were made for DVDs.
Although it is possible to manufacture small, low power disks, there are some applications that cannot justify them. For example, PDAs, mobile phones and other embedded applications. Such systems generally use flash memory, which is a variant on EEPROM – electrically erasable programmable read only memory. It also gets used for small ROMs on, for example, interface cards because it allows periodic upgrades. Flash (and EEPROM) work by storing charge on floating gates – essentially, insulated capacitors that can store charge almost indefinitely. The main difference between them is that EEPROM allows data to be written and read byte-by-byte, but Flash requires writes to be in much larger blocks (a few KB, typically). This is because the writing control hardware is quite large, and the space saved by reducing writing flexibility can go towards making the memory capacity of a chip larger. Reading Flash memory is quite quick – comparable to SRAM. However, writing is complex and slow. First, a block of memory has to be erased. This may take a few ms for a block of a few KBs. Then the new data must be written – this may take 1-2 microseconds per byte. If you do not wish to erase all the data in a block, you must copy the `good' data somewhere else first. This means you must always have some free blocks around. Data can easily become fragmented, and you can end up with small amounts of data distributed over many blocks. Hence there is software available that collects good data together and packs it into as few blocks as possible. The last issue with Flash is that it has a limited lifespan of between 100 000 and 1 000 000 write cycles. However, typical Flash applications do not do that much writing, and this actually compares well with the five-year design life of a typical disk. You could perform between 50 and 500 writes a day for five years.
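The erase-then-write cycle described above can be sketched in a few lines of Python. The 4KB block size and the endurance arithmetic are illustrative assumptions, not a description of any particular device:

```python
# Sketch of an in-place update to one flash block, assuming a hypothetical
# 4KB block size. Flash cannot overwrite in place: the whole block must be
# erased (all bits set to 1) and then rewritten in full, even if only a
# few bytes changed.
BLOCK_SIZE = 4096  # bytes -- an assumed, typical block size

def rewrite_block(old_contents, offset, new_data):
    """Return the block contents after writing new_data at offset,
    modelling the erase-then-write cycle."""
    merged = bytearray(old_contents)
    merged[offset:offset + len(new_data)] = new_data
    block = bytearray(b"\xff" * BLOCK_SIZE)   # erase: every bit set to 1
    block[:len(merged)] = merged              # write the merged data back
    return bytes(block)

# Endurance arithmetic from the text: 100 000 erase cycles spread over a
# five-year design life allows roughly 55 rewrites of a block per day.
writes_per_day = 100_000 / (5 * 365)
```

Running the numbers confirms the claim in the text: 100 000 cycles over five years is about 55 rewrites per day; 1 000 000 cycles is about 550.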
We need some mechanism to interface our storage devices with our computer. Typically, we use buses – shared data transfer media. The drawback of a bus is that because it is shared, it is a potential performance limiter. However, the alternative – each device connected individually – has high associated costs and is inflexible because it is difficult to design adaptable and expandable systems.
There are various classes of bus that can be involved in data storage, and a number of ways to characterize them. However, a convenient first step is to divide them into synchronous and asynchronous buses. Synchronous buses use clock signals to control when data is to be transmitted/received. Asynchronous buses use a sequence of request/acknowledge operations. For example, in figure 19.1, a device A requests data from some address. First it sends the address; then it asserts a read request signal. The device B that will service the request first de-asserts a wait signal, letting A know that its request is about to be handled. A in turn de-asserts its read signal, to let B know that it is ready to receive data. Finally, B sends the data. Note that we have shown data/address signals separately, but in many buses they are actually shared, or multiplexed, to reduce costs.
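As a sketch, the read transaction of figure 19.1 can be modelled as an ordered sequence of signal events. The signal names follow the description above; timing is deliberately ignored, since it is the ordering of the handshake that matters:

```python
# A minimal model of the asynchronous read handshake: device A requests
# data from device B, and the two negotiate via read/wait signals.
def async_read(address, memory):
    """Simulate A reading from B over an asynchronous bus, returning the
    data and the ordered list of handshake events."""
    events = []
    events.append(("A", "address", address))     # A places the address
    events.append(("A", "read", "asserted"))     # A asserts its read request
    events.append(("B", "wait", "de-asserted"))  # B signals it will service it
    events.append(("A", "read", "de-asserted"))  # A signals it is ready
    data = memory[address]
    events.append(("B", "data", data))           # finally, B sends the data
    return data, events
```

Each step only happens in response to the previous one, which is why the negotiation costs time but needs no shared clock.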
The obvious disadvantage of asynchronous buses is the complexity of the negotiation (`handshaking') between reader and writer, which takes time and limits the speed of the bus. The problem with synchronous buses is that as they get longer, they begin to be affected by clock skew. That is, the signal is delayed as it passes along the bus, meaning that it arrives at different points on the bus at different times. Of course, the data and control signals will also be delayed, so this is not per se a problem, except that in practice it is impossible to ensure that the clock, data and control signals are all delayed equally. After a bus exceeds a certain length, the timing differences between signals will stop it working. You can of course `fix' this by slowing the clock down (which increases the acceptable skew), but in practice inherently long buses are asynchronous and short buses are synchronous.
A second way to characterize buses is: do they allow only one or multiple bus masters? A bus master is able to control the bus, and only allowing one simplifies the hardware because there is no need to consider and resolve the issue of arbitration – what happens if more than one device tries to control the bus at the same time? For some types of device, only allowing one bus controller may have no real impact on performance. However, in the case of disks, the time between initiating a read/write and its completion is quite long, and much of the time is wasted from the point of view of the bus. Allowing multiple bus masters would mean that a read, say, could be initiated by the device controller, and then the bus released. Other transactions could then take place until the required data was ready, when the disk would take control of the bus and initiate the transfer.
We can also divide buses into serial and parallel. A general rule of thumb is, other things being equal, serial is slower than parallel. However, often things are not equal – performance gains from technology mean that one generation of parallel technology can be outperformed by more recent serial technology. For example, serial IDE is currently being promoted as a future disk interface, and it is faster than current (parallel) IDE. As well as being faster, there are other advantages: cabling is more convenient for instance. Cabling and its associated costs are an important issue. Serial cable is generally simpler because fewer `wires' are required – where `wires' need not be electrical. In the case of wireless data transmission, serial is the only realistic option. In the case of conventional, electrical wires, having fewer means it is easier to shield them from interference from their neighbours (crosstalk). You do not have to ensure that the data signals in different wires in the cable remain in step – a very similar problem to clock skew, and something that makes cables expensive (especially long ones). They are easier to run, and cheaper to buy – which is an issue in persuading people to accept a bus. Also, it is easier to build robust, `user-proof', connectors which are small (portable computers do not have much room for connectors).
Figure 19.3 shows a range of buses, divided into synchronous/asynchronous, single/multiple bus masters, and serial/parallel. Putting Ethernet in is a bit of a cheat, since Ethernet is a generic term covering a variety of interfaces, not all of which can be buses (for example, GigaBit), and many of which in practice are not in modern installations (100baseT for example). However, as initially conceived, Ethernet was a bus technology.
                  Synchronous               Asynchronous
                  Serial   Parallel         Serial                             Parallel
 Single Master             IDE
 Multiple Master           PCI, AGP         RS-232 (`Serial'), USB,            SCSI
                                            IEEE 1394 (`Firewire', `i.Link')
There are a number of things we can observe from this table.
Because buses have various strengths and weaknesses, and because systems have various different requirements, you typically need several. We can categorize them as follows.
They are generally connected to the CPU-Memory bus by a bus adaptor, which handles translation between the buses. Commonly, this is integrated into the CPU-Memory bus controller logic. While these buses can be used to directly control devices, they are normally used as `bridges' to other buses. (Sometimes to see this we need to think carefully about what is going on – for example, AGP bus devices – i.e. video cards – act as bridges between the CPU-Memory bus and the actual display device: the monitor(s).) For this reason, these buses are sometimes called mezzanine buses.
In the following section, we will consider some example buses and their operation.
PCI – Peripheral Component Interconnect – was originally developed by Intel as a replacement for the previous `standard' bus, ISA (Industry Standard Architecture). The key problem with ISA at the time was that it was becoming too slow to support some of the faster IO buses that were clearly going to be available soon – it only provided 16-17MB/second bandwidth at most. At the time, this essentially meant video devices, but now ISA would also be unsuitable for GigaBit ethernet, and a bit borderline even for 100baseT if other devices are also attempting to use the bus at the same time. The issue is most obvious when considering the amount of data required in a simplistic view of full-screen, full-motion video. With a display resolution of 1024 by 768, and 30 frames/second, we will need a bandwidth of nearly 70MB/second. This assumes that the data originates in the CPU – if we are, say, playing a DVD, then the data must first come from the DVD bus (IDE, typically) to the CPU, and then back to the graphics card. The video data on the DVD is, of course, compressed so we are not talking about doubling the data rate (at standard playback speed, DVDs need about 1.4MB/second of bandwidth). However, there is nevertheless an extra overhead – and this all assumes nothing else is going on.
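The bandwidth figure quoted above can be checked directly, assuming 3 bytes per pixel (24-bit colour), which the calculation leaves implicit:

```python
# Worked version of the full-motion video bandwidth estimate: an assumed
# 24-bit colour depth at 1024x768 resolution and 30 frames/second.
width, height = 1024, 768
bytes_per_pixel = 3          # assumption: 24-bit colour
frames_per_second = 30

bandwidth = width * height * bytes_per_pixel * frames_per_second
megabytes_per_second = bandwidth / 2**20   # about 67.5 MB/s -- "nearly 70"
```

This comfortably exceeds the 16-17MB/second that ISA could provide, which is the point of the comparison.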
Consider also 100baseT ethernet, which can transfer (peak rate) 100Mbit/second of data – say 10MB/second for simplicity. Although this is within the capabilities of ISA, remember that once again this data will typically come from a disk, and unlike DVD video, this data is unlikely to be compressed. So your internal bus may need 20MB/second bandwidth. Hence the principal aim of PCI was to provide more bus bandwidth. Ironically, the move to intelligent graphics cards that take over much of the display processing from the CPU means that we are now sending more data to the graphics card than PCI can handle in some systems (for example, texture maps). This has led to an extension of PCI called AGP – Accelerated Graphics Port. Because it is basically a variation of PCI, we will look at PCI first and briefly consider AGP later.
PCI has been through a variety of versions, and is currently (2002) at version 2.2, with maximum data rates of 528MB/second. PCI is a synchronous, multiple master bus. At any one time, one device on the bus is the master (initiator) and is sending data to a target (assuming the bus is not idle of course). Like all multiple master schemes, we must consider the issue of arbitration – who gets to control the bus? PCI uses a centralized controller (arbiter) for this purpose, and each device has two signals for the purpose: request (REQ) and grant (GNT). The PCI specification does not include an algorithm, so implementers are free to choose – it obviously makes sense to choose something fair to avoid lockout.
To gain control of the bus, a device asserts its REQ signal and waits for a responding GNT from the controller. After it has finished its transaction, the device should relinquish the bus. However, if nothing else wants to use it, and the device still has transactions to complete, it can retain control of the bus. However, as soon as another device requests control, the central arbiter will contact it using the GNT signal, and it must release the bus. In order to transfer data, the initiator must send the address of the data, and then either send the data itself (for a write operation), or wait for the target to send it (for a read). To reduce the cost overhead, the address and signal lines in PCI are shared, or multiplexed. The minimum number of clock cycles for a transaction is three: on the first cycle, the address is put on the bus by the initiator; on the second it is removed; and on the third the data is put on the bus (by either initiator or target, as appropriate). Note that more cycles may be required if, for example, the target cannot respond quickly enough (e.g. it is a relatively slow device).
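Since the PCI specification leaves the arbitration algorithm to the implementer, here is a sketch of one fair choice – a simple round-robin arbiter over the REQ lines. The class and its interface are illustrative, not taken from any real chipset:

```python
# Round-robin arbitration sketch: each device has a REQ line, and GNT is
# given to the next requesting device after the one most recently served,
# so no device can be locked out indefinitely.
class RoundRobinArbiter:
    def __init__(self, n_devices):
        self.n = n_devices
        self.last = -1           # device most recently granted the bus

    def grant(self, req):
        """req is a list of booleans, one per device's REQ line.
        Returns the device number to receive GNT, or None if idle."""
        for i in range(1, self.n + 1):
            candidate = (self.last + i) % self.n
            if req[candidate]:
                self.last = candidate
                return candidate
        return None
```

With two devices requesting continuously, the grants alternate between them – the fairness property the text suggests an implementer should aim for.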
The actual signal set used by PCI is divided into two parts: mandatory signals, which must be present, and 51 optional signals which may be present. The mandatory part contains: 32 data/address signals; a parity bit for the data/address signals; a clock; and a variety of control and error-handling signals (including REQ and GNT). The optional signals include: a further 32 data/address signals (to handle 64-bit addresses and data); a further parity bit for the extra data/address signals; and further control signals (to handle, among other things, multiprocessor systems).
Although PCI is quite fast, it is not comfortably fast enough for current and near-future graphics cards. To address this, a variant of PCI, called AGP, was developed by Intel. AGP differs from PCI in a number of ways.
PCI is not suitable for low-speed, external systems: first, it is too expensive; and second it is limited to distances of a few cm. (This is actually becoming a serious irritation for high-performance systems, and alternatives are coming along.) A variety of different solutions have been used, and are still used. Some of these are very old: for example, the `serial' interface still found on PCs is more or less the same as RS-232 which is four decades old. Others are very specialized: for example, PS/2 is used only for PC keyboards and mice; as is the (now dropped) Apple Desktop Bus (ADB). Recently, there has been an attempt to rationalize the large number of buses. Two that seem likely to replace more or less all their predecessors are the Universal Serial Bus (USB) and IEEE 1394, usually known as Firewire and sometimes i.Link. Typically, USB is used for slow devices (keyboards, mice, scanners, printers, digital still cameras, PDAs) and Firewire for faster devices (digital video cameras, external disks). The new version of USB is actually faster than Firewire: however, a faster version of Firewire will be available soon. The two systems have similarities: both are serial, highly distributed, and hot pluggable. We will consider them both, but look at USB in a bit more detail. Both USB and Firewire communicate with the CPU-Memory bus via a Backplane Bus (usually PCI).
Typically, USB and Firewire devices are connected together in a tree-like structure: the root of the tree (usually a computer, but this is not necessary) will contain a hub, or bridge, providing one or more sockets. Devices connected to the root can in turn contain hubs, or bridges, to which further devices can be connected (in fact, a device may be nothing more than a hub). Systems are hot pluggable – that is, there is no need to manually configure them: new devices should be automatically detected and incorporated. (In practice, this is slightly limited in the sense that for non-trivial devices you often have to install drivers. However, once this is done, a device can be removed and replaced arbitrarily.)
Connected USB devices are all assigned a unique address, and up to 127 devices can be connected together. When a new device is connected, it has an address of zero. The root hub periodically searches for such devices, and when it finds them, it informs the operating system. The operating system asks the device how much of the available bandwidth it needs, and if there is enough it assigns it a new, unique, address. Firewire works similarly, except that the addressing scheme is a little more complex: up to 63 devices can be normally connected together, but this can be extended to 1022 with bridges. Data flow in USB is between the root hub and a device. It is not possible for two devices to communicate directly – they must do so via the root hub, even if the tree topology makes this theoretically possible. USB is fundamentally intended for devices that only communicate with a computer, and not with each other. Firewire on the other hand, is intended for device-device communications (a so-called peer-to-peer system). In fact, the root hub of a Firewire system can be dynamically decided by the system at startup – that is, any hub can act as the root. Traffic over USB is divided into frames and there are four types.
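The enumeration process described above can be sketched as follows. The bandwidth accounting is deliberately simplified, and the function name and its interface are invented for illustration:

```python
# Sketch of USB device enumeration: a new device (initially at address 0)
# is given the lowest free address in 1..127, provided its bandwidth
# request can be met; otherwise it is refused.
MAX_DEVICES = 127

def enumerate_device(assigned, requested_bw, available_bw):
    """Try to admit a new device. Returns (address, remaining_bandwidth),
    with address None if the device cannot be accommodated."""
    if requested_bw > available_bw or len(assigned) >= MAX_DEVICES:
        return None, available_bw
    address = next(a for a in range(1, MAX_DEVICES + 1) if a not in assigned)
    assigned.add(address)
    return address, available_bw - requested_bw
```

The key point is that addresses are assigned by the host, not chosen by the device – which is why USB needs no user-set IDs, unlike SCSI.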
SCSI (Small Computer System Interface) is the high-performance IO bus of choice for internal disk drives, and is also used for a range of high data rate external peripherals (e.g. samplers, some scanners, external drives). However, the inconvenience and cost of SCSI for such devices mean that it is rapidly being replaced by USB and Firewire. For internal disk drives, though, SCSI is still the fastest commonly-available system, though its high costs mean that it is limited to high performance systems. SCSI dates from 1979, when it was originally introduced as SASI – Shugart Associates System Interface – named after the company that developed it. (Incidentally, the founder of Shugart invented the floppy disk.) Shugart made the interface open to all who wanted to use it, and it became SCSI (and a US ANSI standard) in 1986. The first version was slow by today's standards, only able to transfer 5MB/second. Current implementations can manage 320MB/second, and are commonly called Ultra4 SCSI, or Ultra320 SCSI. There are a number of (irritatingly confusing) variations.
SCSI devices are `daisy chained' together. This means that each SCSI device has two connectors (which are integrated together for internal devices) – one for each direction of the daisy chain. The devices at either end of the chain are electrically terminated – either by plugging a special terminator into the appropriate connector, or setting a switch. Termination stops electrical signals being `reflected' back along the signal wires, potentially causing interference. Also, SCSI devices must have unique identifiers. Initially only eight devices were allowed (with device zero always being the controller this means an effective limit of seven). However, this has been increased to 16 (effectively 15). These identifiers are allocated by the user, by setting a switch. SCSI is not hot-swappable – generally devices are identified on startup, and if you plug a new device in it will not be visible. This is not a big problem for internal drives, but is one of the irritations for external devices. If you forget to turn something on before/while the system starts up, you either need to run an application to get the SCSI controller to re-probe the bus (if you have one) or you need to reboot the system. The other irritations are: needing to set the device ID, and making sure the one you pick has not been taken by something else (made harder because some devices restrict your choices); getting termination wrong, leading to unpredictable and hard to diagnose errors; and big expensive cables and connectors.
SCSI is fast because unlike IDE it allows multiple masters. This means that there is no need for the device controller to wait for data accesses to complete. Once a request has been initiated, the device controller can relinquish the bus, and when the data is ready the disk can take control of the bus to send it back. This means that the bus can be used for other things in the intervening time. Note that on cheap systems with only one disk (or at least, only one disk per IDE bus), much of this advantage is lost – meaning there is not a lot of point in using SCSI. However, high-performance servers will typically have lots of disks – and may be using RAID (see section 19.4), which works much better with SCSI. Another, important, reason for the performance difference between SCSI and IDE is product differentiation. IDE is a cheaper interface, but it is not likely to be able to compete on speed with SCSI in large, multi-disk RAID environments – even if development narrowed the performance gap, it would also erode IDE's cost advantage. Similarly, SCSI has an inherent speed advantage over IDE, but it is never going to be as cheap. Consequently, each interface is targeted at a particular market, which means they move further apart. The main emphasis of IDE is low cost – so it uses electronics and drives that are cheaper and slower (and incidentally less reliable). On the other hand, the main emphases of SCSI are reliability and speed, so it uses fast (and more reliable) drives and electronics.
Just like memory, we can use the truck of tapes principle to reduce the impact of the slow speed of disks. However, in the case of disks, we also have an incentive to address the reliability of disks. This is for two reasons: first, they are likely to be less reliable than many other parts of the system (because they are mechanical); and secondly because disks are where we permanently store data. A range of schemes intended to address these issues are collectively called RAID – originally redundant array of inexpensive disks, now commonly redundant array of independent disks. (The change from `inexpensive' to `independent' is partly simply an error, and partly because the disks turned out not to be so inexpensive. For there to be any real performance impact, SCSI disks are needed.) The precise form of RAID we would choose depends on circumstances: some are better at protecting data than speeding up access; some are more expensive than others; some emphasise response time; others rate of transfer.
This is not part of the original RAID system (for one thing, it isn't `redundant' at all, and also it predates RAID). It works on exactly the same principle as memory interleaving: data is divided into strips (and the technique is often called striping) which are distributed over more than one disk. The main benefit of RAID 0 is that the rate of data transfer is increased (though initial response time is not) for the same reasons that interleaved memory is able to supply data at a faster rate than non-interleaved. The main disadvantage is that reliability is reduced. This is because there are now more disks to fail, and the failure of any one will result in data loss. How big should the strips be? It depends on the desired performance. If the data rate is important, then it is better for the strips to be small. This is because a typical access will span a number of strips, meaning each disk can be accessed in parallel, to provide part of the required data. If response time is more important, then it is better for the strips to be large. This is because a single access is likely to be contained in a single strip, so multiple, parallel, accesses can be performed.
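The RAID 0 address mapping is exactly memory interleaving applied to strips – logical strip s lives on disk s mod N, at offset s div N within that disk. As a sketch:

```python
# RAID 0 (striping) address mapping: consecutive logical strips are
# distributed round-robin over the disks, so a multi-strip access can
# read from several disks in parallel.
def locate_strip(logical_strip, n_disks):
    """Map a logical strip number to (disk, strip-within-disk)."""
    return logical_strip % n_disks, logical_strip // n_disks
```

With four disks, strips 0-3 land on disks 0-3 and strip 4 wraps back to disk 0 – which is why a large sequential read keeps all four disks busy at once.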
RAID 1 is another scheme that predates the RAID nomenclature – the same data is written to two separate disks. For obvious reasons, this is often called mirroring. RAID 1 is the most expensive scheme, needing double the normal number of disks. However, data recovery is very simple, since a complete copy is available. It would seem that RAID 1 does nothing for speed: however, it does on average improve response time. This is because while one disk is busy dealing with a data access, the other disk is free. Furthermore, there are no restrictions on which data is accessed on the second disk – unlike some other schemes. RAID 1 can be combined with RAID 0 to increase performance.
A scheme that is less expensive overall, but which requires a large number of disks, and can be improved on, is RAID 2. This uses error correcting codes (e.g. Hamming Codes), similar to those used on some systems for memory protection (and also, in extreme circumstances, for data communication). In RAID 2, each bit of a data word is written to a separate disk (that is, the data is divided into 1-bit strips). Then the error correcting bits are written to a separate set of disks (one per bit). Using 8-bit words, a Hamming Code scheme able to detect and correct single bit errors would need four extra disks.
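The four extra disks follow from the Hamming bound: r check bits can protect m data bits provided 2^r >= m + r + 1. A short check:

```python
# Number of Hamming check bits (and hence, in RAID 2, check disks) needed
# to correct a single-bit error in a word of the given width: the smallest
# r satisfying 2**r >= data_bits + r + 1.
def hamming_check_bits(data_bits):
    r = 0
    while 2**r < data_bits + r + 1:
        r += 1
    return r
```

For 8-bit words this gives 4 check disks, as stated above; note that the overhead falls as words get wider (16-bit words need only 5), though the total disk count grows quickly either way.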
RAID 2 systems are not available commercially. This is because RAID 3 schemes provide much the same redundancy with much less overhead.
RAID 3, or bit-interleaved parity, reduces the overhead needed to just a single disk. It works like RAID 2, except that instead of computing a multi-bit error correcting code, we just compute a single parity bit – typically the exclusive OR of all the bits in a word. Normally, a parity bit can only tell us that an error has occurred, but not where the error is. However, in the case of a disk error, we can reasonably expect to know which disk has failed, and hence which bit. In such a case, the value of the faulty bit is the exclusive OR of all the remaining data bits and the parity bit. When an error has occurred, the disk array can continue to operate in a reduced mode, with the data on the failed disk being computed on-the-fly (though of course there is no redundancy in such circumstances).
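Reconstruction really is just exclusive OR, as a short sketch shows – using 1-bit strips, and assuming the index of the failed disk is known, as the text argues it will be:

```python
# RAID 3 recovery sketch: the parity bit is the XOR of the data bits, so a
# known-failed bit is recovered by XOR-ing the surviving bits with parity.
from functools import reduce

def parity(bits):
    """Exclusive OR of a list of bits."""
    return reduce(lambda a, b: a ^ b, bits)

def reconstruct(word, parity_bit, failed_index):
    """Recover the bit lost from the failed disk, given the full data word
    (the failed position is ignored) and the stored parity bit."""
    survivors = word[:failed_index] + word[failed_index + 1:]
    return parity(survivors + [parity_bit])
```

Whichever disk fails, XOR-ing everything that survives yields the missing bit – this is also exactly the computation done on-the-fly in the reduced mode described above.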
Because it has very small (1-bit) strips, the data transfer rate of RAID 3 is very high (at least when reading data – as we will see, there are problems when writing). However, the response time is no better than a single disk, because, clearly, with 1-bit strips it is not possible for more than one access to be active at a time.
RAID 4, or block-interleaved parity, is conceptually similar to RAID 3, except that instead of using 1-bit strips, larger blocks are used, and corresponding blocks of parity bits are stored on the check disk. Such systems can improve access times at the expense of data transfer rate (compared with RAID 3).
One problem with RAID 4 (and RAID 3) is that every write to the disk array means a write to the check disk. This is a potential bottleneck. To reduce this RAID 5, or distributed block-interleaved parity, distributes the parity information (and the data) over all the disks. In this way, there is no single bottleneck.
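One possible placement rule simply rotates the parity block across the disks, stripe by stripe. The particular rotation used here is an illustrative choice – RAID 5 itself does not mandate a layout:

```python
# Sketch of a rotating RAID 5 parity placement (an assumed left-rotating
# layout): the parity block for stripe s sits on disk (n - 1 - s) mod n,
# so successive stripes place their parity on successive disks.
def parity_disk(stripe, n_disks):
    """Disk holding the parity block for the given stripe."""
    return (n_disks - 1 - stripe) % n_disks
```

Over any n consecutive stripes, every disk holds parity exactly once – which is precisely how the single-check-disk bottleneck of RAID 4 is removed.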
RAID 6 was added to the hierarchy after the original publications, and extends RAID 4 and RAID 5 by adding a second check block, allowing recovery from a second error. This, obviously, requires an extra disk. Currently, there are no commercial implementations of RAID 6.