CHAPTER 6 PARALLEL ALGORITHMS 6.0 P and NP for Parallel Computers ----------------------------------- We saw in the last chapter that unbounded parallelism is not practical. On a practical parallel computer, even a massively parallel one, we can only expect to use a polynomial number of processors during a program which takes a polynomial amount of time. Thus we re-define P as follows. The (new) class P is the class of problems which can be solved on a parallel computer in polynomial time using a polynomial number of processors. With this definition, it turns out that P is exactly the same as on sequential computers. Restricting to polynomial time and number of processors is roughly equivalent to saying polynomial time and space. On a sequential computer, polynomial time implies polynomial space, as we saw in the proof of Cook's theorem. To be sure that P is the same, we have to show that a parallel P problem can be carried out in polynomial time on a sequential computer. Suppose that a problem can be solved in a polynomial time p(n) using a polynomial number of processors q(n). Then the parallel computer can be simulated on a sequential computer by simulating one machine-code instruction of each parallel processor (taking time O(q(n))) and repeating for p(n) steps, giving total time O(p(n)*q(n)) which is a polynomial in n. Given that P is the same, NP must obviously also be the same. The (new) class NP is the class of problems which can be solved by a single exponential search through some structures, each structure being tested in parallel polynomial (new P) time. Since the new P is the same as the old, the new NP is the same as the old. The definitions of NP-Complete, NP-Hard and many other complexity classes also turn out the same. -------------------------------------------------------------- | The classes P, NP, NP-Complete and NP-Hard are the same on | | parallel computers as on sequential ones. | -------------------------------------------------------------- This means that technically, parallel computers do not make any difference to the problems which are practical. Of course parallel computers do make a difference. For example, massive parallelism should make it possible to build a computer which "understands" human speech enough to convert it into text. However, the point is that this is a practical problem NOW -- there is a working sequential algorihtm which does the job extremely well. The trouble is that it uses a huge semantic dictionary and takes about a day to analyse a sentence. Thus parallel computers do change the problems which are "commercially viable".

6.1 The Class NC ---------------- Not only is the set of practical problems the same, but also some problems are naturally sequential, and (apparently) cannot be speeded up significantly on a parallel computer -- the sequential algorithm is essentially the only one and nothing is to be gained by using more than one processor. Thus we would like to investigate those problems which CAN be significantly speeded up. These will be a sub-class of P. It turns out that often a linear (i.e. O(n)) algorithm on a sequential computer, if it can be speeded up at all, can be speeded up to an O(log(n)) or O(log(n)^2) algorithm. Similarly O(n^2) algorithms can often be speeded up to O(log(n)^2) or O(log(n)^3) etc. The number of processors needed is typically O(n) or O(n^2) etc. This suggests a tentative definition for fast parallel problems. A problem is in the class NC if can be solved on a parallel computer in a time which is polynomial in log(n) and using a polynomial number of processors. NC stands for "Nick's Class" after the inventor. This is not the only possible definition of the class of problems having fast parallel algorithms, but it seems to be one of the most popular and easy to deal with at the moment. One of the advantages is that it is (reasonably) independent of the architecture of the parallel computer. There are two possible lines of attack open to us. One is to find fast parallel algorithms showing that problems are in NC, and the other is to show that problems, although polynomial, are not in NC so that we cannot expect substantial speedup. We will ignore the second problem, which is tough, and concentrate on finding fast parallel algorithms. Even so, experience with parallel computers and algorithms is so limited that some of these algorithms have never been used "for real". 6.2 A Model of Parallelism -------------------------- The subject of designing and analysing parallel algorithms is in its infancy, partly because designing parallel computers is also in its infancy. The main problem is in how to cope with communication, synchronisation and contention. These are problems well understood in operating systems, but only for loosely coupled processors (i.e. ones running almost independent programs which communicate only seldom). We are here talking about tightly coupled processors, cooperating closely in running a single program. There are many different models of parallel computation which give greater or lesser weight to communication costs. We will use a model which ignores communication costs altogether. There are p processors numbered 1...p, each having its own local memory, and there is a global memory shared by them all. The shared memory can be accessed by any processor in O(1) time with no contention as if it were local. Simultaneous reading and writing is allowed. If processors simultaneously access the same global memory location, the effect is as if the operations were carried out in some random order, but the time taken is O(1). The processors all execute the same program, but each has a variable i in its local memory which holds its sequence number (i=1 for processor 1, i=2 for processor 2 etc.), so different processors can act differently by testing i. There is a "synchronise" instruction which ensures that a processor waits until all the others have caught up before continuing.

One advantage of this model is that we can use our experience and intuition of sequential computers on it. Another is that it can simulate any fixed connection pattern (topology) of processors. Another is that it is realistic as far as testing for membership in NC goes. It is (theoretically) possible to build a computer of the above form with p processors and q memory cells using about O(p*q) "intelligent" switching units connecting the processors to the memory. The time taken to access a memory location is O(log(p*q)) rather than O(1), but this does not change membership in the class NC. Synchronisation can be done using (possibly) simultaneous "increment" operations on a value which is initially zero, followed by (possibly) simultaneous reads waiting until its value is p. The reason this kind of architecture is not often proposed is that all the processing power in the switching units would probably be better used in having a lot more processors. However, the New York Ultracomputer and one or two others have an architecture something like this. 6.3 Finding the Maximum is in NC -------------------------------- In fact we will show that the maximum of an array a[1]...a[n] of numbers can be found in O(1) time on our model computer (and thus O(log(n)) time on a more realistic one) using O(n^2) processors. The algorithm is as follows: Global Variables ---------------- n - size of vector a[] (with p = n^2) a[1]...a[n] - the input values m[1]...m[n] - m[i] indicates whether a[i] is maximum, initially TRUE max - the result Local Variables --------------- i - the processor's sequence number j, k - integers specifying a pair of values a[j], a[k] To find the maximum of a[1]...a[n] (program executed by all processors) ---------------------------------- j := i div n { each processor chooses an a[j] and a[k], } k := i mod n { using integer division and remainder } if a[j] < a[k] then m[j] := FALSE synchronise { make sure all processors have finished } if i <= n then { examine the results in m[i] } if m[i] = TRUE then max := a[i] Check that you understand how this works. Note that although simultaneous writing occurs, the values written simultaneously to a location are all identical. It has been shown that the ability to write simultaneously is essential - without it O(1) time is impossible. Also, with only n processors rather than n^2, it turns out that the problem requires O(log(log(n))) time.

6.4 Addition of large numbers is in NC -------------------------------------- Suppose we have two numbers, each n bits long, to be added. The bits are stored in a[n]...a[1] and b[n]...b[1] (so that a[1], b[1] are the least signicant bits). The usual bit-by-bit addition algorithm requires O(n) time because you have to allow the carry to propagate leftwards from position 1 to position n. We will show that it can be done in O(log(n)) time with O(n) processors using the "carry-lookahead" algorithm, so that the problem is in NC. Assume for simplicity that n is a power of two. We need to simulate a binary tree of processors, with one leaf processor for each bit, so we number the processors needed as in the "heap" array implementation of a full binary tree used for priority queues. The processors can work out where they are in the tree structure using their processor numbers (the parent of p[i] is p[i div 2] and the children are p[2*i] and p[2*i+1]). p1 / \ p2 p3 / \ / \ p4 p5 p6 p7 ..................... / / / / \ p[n] p[n+1] p[n+2] p[n+3] ..... p[2*n-1] a[n] a[n-1] a[n-2] a[n-3] ..... a[1] b[n] b[n-1] b[n-2] b[n-3] ..... b[1] The idea behind the algorithm is sort out the carries BEFORE doing the addition. Each processor p[i] has global variables c0[i], c1[i] and c[i] associated with it for communicating with its neighbours. The variables are Global Variables ---------------- n - size of numbers (n a power of 2, p = 2*n-1) a[n]...a[1] - the bits of the first number b[n]...b[1] - the bits of the second number c0[1]...c0[2*n-1] - carry generated (assuming no carry passed in) c1[1]...c1[2*n-1] - carry generated (assuming a carry passed in) c[1] ...c[2*n-1] - carry received from the right s[1]...s[n] - the result - the sum of a[i] and b[i] Local Variables --------------- i - the processor's sequence number The algorithm works in stages. In the first stage, the values of c0[i] and c1[i] pass up the tree from the leaves. Their initial value is taken to be "UNKNOWN", and they are set to "TRUE" or "FALSE". The value of c0[i] for a particular processor is the value of the carry generated by the group of bits in the processor's subtree, assuming that no carry is passed into that group of bits. For example, c0[2] is the carry generated in calculating the bits s[n]...s[n/2+1] assuming that no carry is generated in calculating s[n/2]...s[1]. Similarly, c1[i] is the carry generated by a group of bits, assuming that a carry is passed in.

To find the values of c0[i], c1[i] (program executed by all processors) ---------------------------------- if i >= n then bit := 2*n-i {leaf -- look at bits from a,b} c0[i] := a[bit] and b[bit] {1+1 generates a carry} c1[i] := a[bit] or b[bit] {0+1+C, 1+0+C, 1+1+C make a carry} else l := 2*i; r := 2*i+1 {non-leaf -- look at children} wait for c0[l], c0[r], c1[l], c1[r] {read them repeatedly} c0[i] := c0[l] or (c1[l] and c0[r]) {l gives carry (with r's help?)} c1[i] := c0[l] or (c1[l] and c1[r]) synchronise After this stage, which takes O(log(n)) steps for the information to pass up the tree, all the values of c0[i] and c1[i] are known. Next we find the values of c[i], the actual carries received from the right by the same groups of bits, by passing information down the tree. Again, c[i] is initially "UNKNOWN", then set to "TRUE or "FALSE". To find the values of c[i] (continuation of program) -------------------------- if i = 1 then c[1] := FALSE {root -- no carry into whole sum} else u := i div 2; r := i+1 {look at parent and sibling} if i odd then c[i] := c[u] {right half - same carry as parent} else c[i]:=c0[r] or (c1[r] and c[u]) {left half - carry from sibling} synchronise Again, the time taken for the information to filter down the tree is O(log(n)). Finally, the addition itself can be carried out, with all the information about carries already known. To find the values of s[i] (continuation of program) -------------------------- if i >= n then bit := 2*n-i s[bit] := (a[bit] + b[bit] + c[i]) mod 2 {treat c[i] as a number now} This takes O(1) time, so the whole algorithm takes O(log(n)) time with O(n) processors. This algorithm is better suited to a direct hardware implementation inside an arithmetic processing unit, rather than on a parallel processor, but as arithmetic is done on larger and larger words, algorithms like these become more important. 6.5 Sorting is in NC -------------------- The mergesort and quicksort algorithms (in fact any divide-and-conquer method) is naturally suitable for use on a parallel computer because the subproblems can be carried out simultaneously. However, these algorithms, which are O(n*log(n)) on a sequential computer, only come out as O(n) on a parallel computer. We have to do better to show that sorting is in NC. We will show that sorting can be done in O(log(n)^2) time using a network of O(n*log(n)^2) processors. This network can be simulated by O(n) processors. In fact there is an optimum algorithm taking O(log(n)), but it is more complicated. The method is called the odd-even sorting method. It uses a recursively defined network of simple 2-in 2-out comparator/swapper components. Our processors can simulate the components. We will assume that n is a power of 2 for simplicity. The n numbers are passed into the network using the n inputs on the left. They then pass through the network from left to right, and emerge sorted on the n outputs on the right. The sorter for 8 elements is illustrated: ----- odd-even ------------ odd-even ---------------------- ----- SORTER ---. .--- MERGER --------- COMP ----- ----- for n/2 ----\--/---- for n/2 ---. .-- SWAP ----- ----- items ---. \/ .--- items --. \/ \/\/ \/`--- COMP ----- /\/\ /\.--- SWAP ----- ----- odd-even ---' /\ `--- odd-even --' /\ ----- SORTER ----/--\---- MERGER ---' `-- COMP ----- ----- for n/2 ---' `--- for n/2 --------- SWAP ----- ----- items ------------ items ---------------------- <--- odd-even MERGER for n items ---> <--------------- odd-even SORTER for n items -----------------> The network sorts the first n/2 elements, and sorts the second n/2 elements to form two sorted lists. Then it merges the two sorted lists. The odd numbered elements from both lists are passed to the upper merger, the even numbered elements to the lower merger. After that, the elements pass through n/2-1 comparator/swappers to complete the sort. It is not easy to see that the merger works -- see Sedgewick's book for a partial explanation. The merger is also useful for the Fast Fourier Transform, polynomial evaluation etc. The network is obviously O(n) units wide, and the length of the sorter and merger, say s(n) and m(n), are given by recursive equations: m(n) = m(n/2) + 1 s(n) = s(n/2) + m(n) m(n) = O(log(n)) s(n) = O(log(n) + log(n/2) + log(n/4) + ...) = O(log(n) + (log(n)-1) + (log(n)-2) + ...) = O(1+2+3+...+k where k=log(n)) = O(log(n)^2) The network can be simulated by O(n) processors which perform the operations from left to right across the network.

6.6 Communication Costs ----------------------- As an example of the difficulties caused by fixed topologies where communication costs have to be taken into account, consider solving the maximum problem on such a computer. Assume that processors only have local memory, and that they can only communicate by sending messages to neighbouring processors. Note that such parallel processors are easily built, but this may well not be the best way to minimise communication costs. To simplify things, we will suppose that the processors are connected in a circle, and that each processor can only send messages to its "right-hand" neighbour. Each has a message buffer so that messages are never lost. For the maximum problem, we assume that there are n processors with processor i having a[i] in its local memory. Assuming that the cost of sending messages is high compared to processing costs, we would arrange for the processors to be doing other work while solving this problem, so we take the cost of solving the problem to be the (average) number of messages sent by each processor. The obvious algorithm for this problem takes O(n) time. In the first time step, each processor sends its own value a[i] to its neighbour. In the remaining steps, processors pass on received values until all processors know all the values a[1]...a[n], and thus can work out the maximum. This takes O(n) steps, and at each step each processor sends a message. Thus the cost is O(n) messages per processor. It has recently been discovered that only an average of O(log n) messages per processor is needed as follows. We can assume that the a[i] are all distinct. The algorithm executed by all processors is: Local Variables --------------- rightval val - initially a[i] leftval To find maximum of a[1]...a[n] ------------------------------ 1 send message containing val 2 wait for message and put incoming value in leftval 3 if leftval = val then stop (val is the maximum value) 4 send message containing leftval (relay the incoming value) 5 rightval := val; val := leftval (move along one) 6 wait for message and put the value in leftval 7 if (val > leftval) and (val > rightval) then goto step 1 8 repeatedly wait for a message and relay it At the first stage, each processor i sends a[i] to its neighbour which passes it on to its neighbour. Now each processor knows its own value and the two values to the left - three consecutive values. If the middle value is larger than the other two (a local maximum), then the processor retains that value. Otherwise, it becomes "passive" and takes no further part, except that it relays all messages it receives. Now there at most n/2 active values. The process is repeated to give at most n/4 active values. There are O(log n) stages, after which the there is one remaining active value. The active processor sends this to itself (relayed all round the circle) and recognises it at step 3 and stops. This maximum value could then be relayed to all other processors if desired. Note that there is no proper synchronisation - some processors may be on one stage while others are on the next. However, the message buffers and the act of waiting for a message ensure sufficient "local" synchronisation for the algorithm to work. Note also that the stages may take different amounts of time - the first is likely to take O(1) time, the last O(n). However, at each stage, each processor (including the passive ones) sends exactly 2 messages, for a total of O(log n) each.