Previous Up Next

Chapter 18  Parallel Processing

18.1  Introduction

Arguably, we have been considering parallel processing systems throughout this module, since pipelined and superscalar systems are clearly computing instructions in parallel (instruction level parallelism). However, this is not parallel processing as it is commonly understood. For many years, parallel processing was largely restricted to specialised and expensive hardware. This was because the number of computing tasks that could benefit from such technology was quite small, and furthermore different kinds of task demand different forms of parallelism. Such a situation leads to very small markets, and so prices are high – which further reduces demand. Also, generally, parallel systems are not very suitable for more mundane computing tasks. So if you were going to buy one of the expensive parallel computers available, you had to have a task that was really worth it, and you had to have a lot of money. This mainly meant large-scale numerical problems where results were needed quite quickly and which someone rich could be persuaded to pay for – such as weather forecasting, and cryptanalysis for security agencies (the `spooks'…). Consequently, companies building parallel machines tended to come and go fairly frequently, and the market has rarely been particularly healthy. To some extent, this has now changed for two reasons. Firstly, desktop graphics/multimedia applications are now very common, and (somewhat limited) specialised parallel processing hardware is part of every high-performance processor. Secondly, we now have cheap, high-speed networking, which means off-the-shelf PC-class hardware can be relatively cheaply assembled into a parallel machine.

As has been noted earlier, we are now starting to see multi-core CPUs, bringing parallelism to the desktop and living room, where the big question is: what are we going to use it for? Most `everyday' tasks are not particularly inherently parallelisable — historically parallel machines have been used for number-crunching tasks such as weather predication which are, and for good reason (the resources were expensive). But how does Microsoft Word improve with 8 cores in your CPU? One answer is that, well, it doesn't. The response to that is that you're probably not just running Word; you probably have many other processes running simulataneously, and more cores means your overall `computing experience' could become rather smoother. It remains to be seen if this is actually going to be worthwhile for most users. Three factors will probably fuel the growth in multicore technology:

  1. Users who really do want to run many threads simultaneously, e.g. web hosting services, and companies of the Google, Flickr, MySpace, etc. variety. In such instances, cheap multicores are an improvement on buting multiple single-core CPUs.
  2. Marketing and the `new best thing' factor; flashy Intel adverts, pushy PC World salesdroids, etc.
  3. Being able to run all the spyware, trojan horses, zombies, and spambots you've inadvertantly installed on your Windows box without it having any noticeable performance hit.

18.2  Classification of Parallel Machines

There are a variety of ways we can subdivide parallel computers. However, the most usual, and the first we will consider, is how much data is being operated on, and by how many different operations. A Single Instruction Single Data machine (SISD) is essentially a normal sequential computer – one instruction is operating on one data item at any one time (we ignore the subtleties of pipelining/superscalar here). A Single Instruction Multiple Data machine (SIMD – pronounced `sim-dee') simultaneously applies the same operation to multiple, different data items. Such machines are generally useful for processing arrays and vectors of data. A multiple instruction multiple data machine (MIMD – `mim-dee') allows different simultaneous operations on multiple data items. Such machines are currently often assembled, relatively cheaply, from networked workstations. They are more flexible than SIMD machines. The final possibility, multiple instruction single data (MISD) does not at first sight appear to make any sense. On closer examination, this is not so obviously the case – there are a few machines with MISD-like properties, but we will not consider them.

18.2.1  SIMD

Specialiased SIMD machines are typically arranged in an array (and numbers of very small processors. For example, the ICL DAP (1970s) had 4096 1-bit processors; the Connection Machine (1980s) had 64K 1-bit processors. During the 1980s, SIMD machines rose in popularity, and there were a range of commercial versions available (e.g. Thinking Machines) as well as the usual more experimental ones. However, it then faded again because it is relatively inflexible (lots of problems do not map well to SIMD) and is only really competitive for very large-scale problems, needing lots of processors. Also, it cannot take advantage of off-the-shelf microprocessors, but instead needs special purpose (i.e. expensive) hardware. However, there has been a fairly consistent demand for simpler SIMD-type machines called vector processors. These generally contain a one-dimensional array of processors, and have usually been a (possibly optional) part of a more conventional computer. For example, the architypical 1970s `supercomputer' the Cray-1 had an attached vector processor. (Incidentally, the Cray-1 had an 80MHz clock and 8Mbytes of memory – any reasonably-modern PC now substantially outperforms it for a tiny fraction of the cost. Even PDAs (or phones!) have more memory and faster processors. In practice, they are unlikely to be as fast as a Cray but would probably give it a run for its money.) Also, vector processors are useful for a range of graphical/multimedia applications, and so are incorporated in modern microprocessors – Intel's MMX and SSE, and IBM/Motorola's AltiVec are just simple vector processors.

18.2.2  MIMD

MIMD computers are currently much more successful than SIMD machines and can be broadly divided into two classes. Shared memory machines (sometimes called closely-coupled) as their name suggests share a single memory, though each processor has its own cache (which leads to problems with coherence — see section 18.3) which is used for communication between processors. The current normally-used terms for such processors are symmetric multiprocessors (SMP) or uniform memory access (UMA) machines. In practice, this technique is best suited to machines with relatively small numbers of processors, because large numbers of processors swamp the memory bus. (However, by adding extra buses, or using switches instead of buses, the number of processors can be increased.) The basic idea is shown in figure 18.1.


Figure 18.1: Shared memory multiprocessor (SMP or UMA)

The second class of MIMD processors (loosely coupled) uses physically distributed memories, usually associated with the processors, and the processors communicate using network-type technology. Such machines are further divided into two sub-types. Distributed shared memory (DSM) machines have a common, shared address space for all the memories. The inclusion of `shared' in the name DSM is unfortunate, because it implies the memory is directly shared between the processors. It is not: if a processor wishes to write to a memory that is not local to itself, the data must be sent via the interconnecting network. However, the shared address space provides the illusion of direct access. DSM machines are also commonly called nonuniform memory access (NUMA) machines. The basic idea is shown in figure 18.2. The second sub-type does not share the address space. In this kind of machine there is no need for the different processors to be physically integrated, and a currently popular and low-cost system is a cluster. Clusters use off-the-shelf computers and network hardware (and there is no real need for the components to be solely dedicated to the cluster: they can be used as individual workstations if necessary).


Figure 18.2: Distributed memory multiprocessor (DSM or NUMA)

18.3  The Cache Coherence Problem

An important problem that must be addressed in many parallel systems – any system that allows multiple processors to access (potentially) multiple copies of data – is cache coherence. That is, the possibility of `stale' data being accessed by one processor because another processor has changed it, and not all changes have yet been propagated. Suppose we have two processors, A and B, each of which is dealing with memory word X, and each of which has a cache. If processor A changes X, then the value seen by processor B in its own cache will be wrong, even if processor A also changes the value of X in main memory (which it – ultimately – should).

There are a number of approaches to this problem based on hardware, and a commonly-used one is some form of snooping – that is, keeping track of other processor's memory writes. The most common variant of snooping is a write invalidate protocol. In the example above, when processor A writes to X, it broadcasts the fact and all other processors with a copy of X in their cache mark it invalid. When another processor (B, say) tries to access X again then there will be a cache miss and either (i) in the case of a write-through cache the value of X will have been updated (actually, it might not because not enough time may have elapsed for the memory write to complete – but that's another issue); or (ii) in the case of a write-back cache processor A must spot the read request, and substitute the correct value for X.

An alternative (but less-common) approach is write broadcast. This is intuitively a little more obvious – when a cached value is changed, the processor that changed it broadcasts the new value to all other processors. They then update their own cached values. The trouble with this scheme is that it uses up more memory bandwidth. A way to cut this is to observe that many memory words are not shared – that is, they will only appear in one cache. If we keep track of which words are shared and which are not, we can reduce the amount of broadcasting necessary. There are two main reasons why more memory bandwidth is used: in an invalidation scheme, only the first change to a word requires an invalidation signal to be broadcast, whereas in a write broadcast scheme all changes must be signaled; and in an invalidation scheme only the first change to any word in a cache block must be signalled, whereas in a write broadcast scheme every word that is written must be signalled. On the other hand, in a write broadcast scheme we do not end up with a cache miss when trying to access a changed word, because the cached copy will have been updated to the correct value.

Another important aspect of coherence is serialisation of writes – that is, if two processors try to write `simultaneously', then (i) the writes happen sequentially (and it doesn't really matter who gets to write first – provided we have sensible arbitration); and (ii) all processors see the writes as occuring in the same order. That is, if processors A and B both write to X, with A writing first, then any other processors (C, D, E) all see the same thing. In practice, these issues are managed by a memory bus, which by its very nature ensures write serialisation, and also allows us to broadcast invalidation signals (we essentially just put the memory address to be invalidated on the bus). We can add an extra valid bit to cache tags to mark then invalid. Typically, we would use a write-back cache, because it has much lower memory bandwidth requirements. Each processor must keep track of which cache blocks are dirty – that is, that it has written to – again by adding a bit to the cache tag. If it sees a memory access for a word in a cache block it has marked as dirty, it intervenes and provides the (updated) value. There are numerous other issues to address when considering cache coherence (most especially the related problem of consistency – how to ensure that memory reads closely following memory writes that have yet to complete do not yield the wrong value). However, we will not address them here.

18.4  Current SIMD Processors: Multimedia Instruction Sets

Over the past decade, we have seen a substantial rise in the use of applications that can potentially benefit from vector processing – simultaneously applying the same operation(s) to multiple data items. These applications include: graphics (including the normal graphical UI in some cases); animation/games; video; and sound/speech processing. To speed up the processing of such data, microprocessors now generally contain specialised hardware – registers and instructions – to support such operations. The most well-known is Intel's MMX. However, there are others, including Motorola's AltiVec, and Intel's SSE.

18.4.1  MMX

MMX (Multi-Media Extention) first appeared in the Pentium II and provides a set of 64-bit registers which can be treated in a number of ways. At their simplest, they can simply contain 64-bit operands, to which a range of fairly conventional (with a couple of exceptions) arithmetic/logical operations can be applied. However, it is also possible to treat the contents of each register as a pair of 32-bit operands, which can be processed in parallel. Alternatively, you can treat them as four 16-bit operands; or eight 8-bit operands. For example, you may have sets of data representing pixel colour information, with 24-bits of data per pixel: eight bits each for red, green and blue. With MMX, you could load up the red data bytes (say) for eight pixels into one MMX register, and process them all simultaneously. Part of the beauty of these systems is that they are actually quite easy to implement. In the case of logical operations, there is no difference in the way you treat one 64-bit operand, eight 8-bit operands, or any other combination. In the case of arithmetic, comparison and shift operations, all that is necessary is a set of switches every eight bits to control the carry in/out from each byte to the next.

As well as the usual arithmetic/logical operations, MMX provides a few other operations. Some of these are obvious – like instructions to pack and unpack data to and from registers. However, it also includes saturating arithmetic. This is arithmetic in which any result that is too large to be represented is recorded as the maximum possible value: that is, there is no overflow or `wrapping around'. This is mainly because the applications that use MMX are often real-time, and it would not be helpful to introduce exception-handling delays. The fact that there might be an error is usually inconsequential. In fact, this is often the expected behaviour. For example, consider an image processing example, in which a particular pixel is rendered white (i.e. maximum red, green and blue values). There may be other pixels in the data which should strictly be rendered `even whiter' – however, the display system has already reached its contrast/brightness limit.

18.4.2  AltiVec

Motorola's AltiVec, which first appeared in the G4, is very similar in intent to MMX, with a similar set of instructions. However, it has 128-bit registers, potentially doubling the potential degree of parallelism. It also provides a larger set of instructions. For example, there are absolute difference (unsigned subtraction) and max/min operations, operations to shuffle bytes and half-words, as well as a combined multiply and add (useful in many signal processing operations).

18.4.3  SSE

In the Pentium III, Intel introduced SSE – Streaming SIMD Extensions – which added more functionality to MMX. SSE includes a further eight 128-bit registers, which can be treated as four, parallel, single-precision floating point operands, together with a range of new operations. Most of these are for the new quad floating point registers, but there are some new MMX operations as well. Some of the new SSE instructions allow the programmer to control the cacheability of data – that is, it is possible to mark data so it is not stored in the cache. You may wish to do this if you know that data will not subsequently be reused, so saving space in the cache.

18.4.4  The Problem…

One of the important lessons learned quite early in the development of vector processors was: it's often not the processing of data that takes the time, but getting it in the right form in the first place. Hennessy & Patterson have an example of this, but it is not explained at length, so we will look at it further. Suppose you have graphics data where every pixel is represented by three bytes, coding RGB (red, green, blue) data. Suppose you wish to convert this data to YUV (chromiance and luminosity – an alternative graphical data representation). The algorithm is easy:

Y = (9798*R + 19235*G + 3736*B) / 32768;
U = (-4784*R -- 9437*G + 4221*B) / 32768 + 128;
V = (20218*R -- 16941*G -- 3277*B) / 32768 + 128;

Each pixel is independent of all the others, so we can do eight pixels in parallel in MMX, and 16 in AltiVec. The problem is that we need all 8/16 pixels red data in one register; all the green data in another; and all the blue in another. However, if we simply naively load data into registers, this is not what we will get: see figure 18.3.


Figure 18.3: Mapping pixel data from memory to vector registers

This means that actual loading (and storing) process is quite complex, partially reducing the gains we make from processing data in parallel. Specialised vector processors recognize this and provide special addressing modes. Strided addressing loads/stores data items to/from registers with constant-sized gaps between the memory words. For example, in 18.3 we could use strided addressing with a gap (stride) of three to load the data we want. (Actually more likely four in practice.) generalisation of strided addressing is gather/scatter addressing, where the gaps between each data item are specified independently. Unfortunately, none of the current vector processing additions to existing microprocessors include these addressing modes.

18.5  MIMD Processors: Beowulf-Type Clusters

A currently-popular (and relatively cheap) way to build a MIMD parallel machine is a cluster – such clusters are commonly called Beowulfs, though there is no specific set of rules as to what exactly constitutes a Beowulf. Typically, they consist of a collection of standard workstations (usually PCs, quite likely rack-mountable server-style machines), running some operating system (usually Linux/Unix) perhaps with some modifications, connected together by standard networking hardware. Programs are generally at least partially, and usually completely, parallelized manually, and employ one of two sets of software designed for distributing parallel programs over sets of independent workstations – PVM (Parallel Virtual Machine) or the newer MPI (Message Passing Interface).

The `off-the-shelf' nature of the components is a big attraction in reducing cost. There has been much interest particularly in government/military organisations in COTS – commercial off-the-shelf – solutions to problems (instead of the `bespoke one-off-thousand-dollar-toilet-seat' approach traditionally taken) to reduce costs. In practice, although they are much cheaper than specialised supercomputers and parallel machines, you cannot simply take bottom-of-the-range PCs, a bunch of cheap 100baseT network interface cards and switches, and expect to make a competative cluster. You need to choose high-performance PCs – usually rack-mounted servers, with large L2 caches (read: more expensive) to reduce network data traffic, Gigabit Ethernet and high-performance SCSI disk systems. So not as cheap as you might at first think.

As you might expect, there are disadvantages to the approach. The first is interprocessor communication bandwith. Traditional parallel machines generally use the memory bus to transfer data, which is much faster than the networking hardware typically available. Early clusters (beginning of the 1990s) used the then-state-of-the-art 10M/bit Ethernet, which is quite slow. Techniques were developed to allow multiple, parallel ethernet connections – a process known as bonding – to speed up traffic. With bonding, data is striped, or interleaved, over multiple links – in much the same way that RAID and memory interleaving (see chapter 15) work. With faster Ethernet connections, the need for bonding has reduced – at the moment. It may come back – 100M/bit Ethernet is still not that fast; Gigabit is still expensive, and the time will come when even that is no longer regarded as fast.

A second `traditional' disadvantage that has reduced in seriousness is that in a traditional machine, the memory is shared, meaning that if one process needed most of it, there was no problem – provided the other processes could make do with what was left. In clusters, the memory is distributed, and only that memory available on a local node is available to an individual process. This problem has been reduced by the falling cost of memory – you just put lots more in.

Paradoxically, the cost of clusters has also been an issue – although they are relatively cheap to aquire, they are more expensive to run: instead of a single, specialised multi-processor machine, you have a whole load of independent machines, which you must maintain individually (more or less – there are tools to help with the software). This takes more time, and hence money. On the other hand, if something breaks, you stand a much better chance of getting a replacement quickly (it's all off-the-shelf), and parts are likely to be much cheaper.

As well as the (relatively) low aquisition costs, clusters gain by being reliable (one failed node usually does not stop them working, though a network switch might) and scalable (it is easy to expand clusters – though there are issues with consuming more network bandwidth with more nodes).

Typically, clusters run some variant of Unix or Linux, possibly with some modifications. For example, changes to allow bonding of multiple Ethernet interfaces. Also, modifications are commonly made to create a common process ID space over all machines: that is, each individual process will have a different process ID to any others running elsewhere on the network – this makes administration easier, for example. Beowulf clusters will generally have a master server, which is the prime repository for all data/program files. However, it would be massively inefficient not to put local copies of much of this on each node, so a remote distribution process is responsible for replicating files across all nodes (and managing changes, of course). In practice, a significant amount of the work in using a cluster goes into generating parallel code. This is typically managed by using special-purpose APIs (and possibly an associated IDE) to `parallelize' code – a (mainly) manual process. For many years, PVM was the most popular system, and it is still widely-used. However, implementations of the newer MPI are now the most widely used.

18.5.1  Aside: Who was Beowulf?

Beowulf was the hero of the early English (fictional) epic of the same name. In this, he is responsible for defeating the `monster' Grendel. Consequently, lots of large Beowulf clusters tend to pick names (and acronyms) from English/Nordic/Germanic mythology (Avalon, Loki, VALHAL(la)).


Previous Up Next