```
CHAPTER  6

PARALLEL  ALGORITHMS

6.0 P and NP for Parallel Computers
-----------------------------------

We saw in the last chapter that unbounded parallelism is not practical. On a
practical parallel computer, even a massively parallel one, we can only expect
to use a polynomial number of processors during a program which takes a
polynomial amount of time.  Thus we re-define P as follows.

The (new) class P is the class of problems which can be solved on a parallel
computer in polynomial time using a polynomial number of processors.

With this definition, it turns out that P is exactly the same as on sequential
computers.  Restricting to polynomial time and number of processors is roughly
equivalent to saying polynomial time and space. On a sequential computer,
polynomial time implies polynomial space, as we saw in the proof of Cook's
theorem.

To be sure that P is the same, we have to show that a parallel P problem can be
carried out in polynomial time on a sequential computer. Suppose that a problem
can be solved in a polynomial time p(n) using a polynomial number of processors
q(n). Then the parallel computer can be simulated on a sequential computer by
simulating one machine-code instruction of each parallel processor (taking time
O(q(n))) and repeating for p(n) steps, giving total time O(p(n)*q(n)) which is
a polynomial in n.

Given that P is the same, NP must obviously also be the same. The (new) class
NP is the class of problems which can be solved by a single exponential search
through some structures, each structure being tested in parallel polynomial
(new P) time. Since the new P is the same as the old, the new NP is the same as
the old.

The definitions of NP-Complete, NP-Hard and many other complexity classes also
turn out the same.

--------------------------------------------------------------
| The classes P, NP, NP-Complete and NP-Hard are the same on |
| parallel computers as on sequential ones.                  |
--------------------------------------------------------------

This means that technically, parallel computers do not make any difference to
the problems which are practical. Of course parallel computers do make a
difference. For example, massive parallelism should make it possible to build a
computer which "understands" human speech enough to convert it into text.
However, the point is that this is a practical problem NOW -- there is a
working sequential algorihtm which does the job extremely well. The trouble is
that it uses a huge semantic dictionary and takes about a day to analyse a
sentence. Thus parallel computers do change the problems which are
"commercially viable".

6.1 The Class NC
----------------

Not only is the set of practical problems the same, but also some problems are
naturally sequential, and (apparently) cannot be speeded up significantly on a
parallel computer -- the sequential algorithm is essentially the only one and
nothing is to be gained by using more than one processor.  Thus we would like
to investigate those problems which CAN be significantly speeded up. These will
be a sub-class of P.

It turns out that often a linear (i.e. O(n)) algorithm on a sequential
computer, if it can be speeded up at all, can be speeded up to an O(log(n)) or
O(log(n)^2) algorithm. Similarly O(n^2) algorithms can often be speeded up to
O(log(n)^2) or O(log(n)^3) etc. The number of processors needed is typically
O(n) or O(n^2) etc.  This suggests a tentative definition for fast parallel
problems.

A problem is in the class NC if can be solved on a parallel computer in a time
which is polynomial in log(n) and using a polynomial number of processors. NC
stands for "Nick's Class" after the inventor. This is not the only possible
definition of the class of problems having fast parallel algorithms, but it
seems to be one of the most popular and easy to deal with at the moment. One of
the advantages is that it is (reasonably) independent of the architecture of
the parallel computer.

There are two possible lines of attack open to us. One is to find fast parallel
algorithms showing that problems are in NC, and the other is to show that
problems, although polynomial, are not in NC so that we cannot expect
substantial speedup. We will ignore the second problem, which is tough, and
concentrate on finding fast parallel algorithms. Even so, experience with
parallel computers and algorithms is so limited that some of these algorithms
have never been used "for real".

6.2 A Model of Parallelism
--------------------------

The subject of designing and analysing parallel algorithms is in its infancy,
partly because designing parallel computers is also in its infancy. The main
problem is in how to cope with communication, synchronisation and contention.
These are problems well understood in operating systems, but only for loosely
coupled processors (i.e. ones running almost independent programs which
communicate only seldom). We are here talking about tightly coupled processors,
cooperating closely in running a single program.

There are many different models of parallel computation which give greater or
lesser weight to communication costs. We will use a model which ignores
communication costs altogether.

There are p processors numbered 1...p, each having its own local memory, and
there is a global memory shared by them all. The shared memory can be accessed
by any processor in O(1) time with no contention as if it were local.
Simultaneous reading and writing is allowed. If processors simultaneously
access the same global memory location, the effect is as if the operations were
carried out in some random order, but the time taken is O(1).

The processors all execute the same program, but each has a variable i in its
local memory which holds its sequence number (i=1 for processor 1, i=2 for
processor 2 etc.), so different processors can act differently by testing i.
There is a "synchronise" instruction which ensures that a processor waits
until all the others have caught up before continuing.

One advantage of this model is that we can use our experience and intuition of
sequential computers on it. Another is that it can simulate any fixed
connection pattern (topology) of processors. Another is that it is realistic
as far as testing for membership in NC goes.

It is (theoretically) possible to build a computer of the above form with p
processors and q memory cells using about O(p*q) "intelligent" switching units
connecting the processors to the memory.  The time taken to access a memory
location is O(log(p*q)) rather than O(1), but this does not change membership
in the class NC. Synchronisation can be done using (possibly) simultaneous
"increment" operations on a value which is initially zero, followed by
(possibly) simultaneous reads waiting until its value is p.

The reason this kind of architecture is not often proposed is that all the
processing power in the switching units would probably be better used in having
a lot more processors. However, the New York Ultracomputer and one or two
others have an architecture something like this.

6.3 Finding the Maximum is in NC
--------------------------------

In fact we will show that the maximum of an array a[1]...a[n] of numbers can be
found in O(1) time on our model computer (and thus O(log(n)) time on a more
realistic one) using O(n^2) processors. The algorithm is as follows:

Global Variables
----------------
n             -   size of vector a[] (with p = n^2)
a[1]...a[n]   -   the input values
m[1]...m[n]   -   m[i] indicates whether a[i] is maximum, initially TRUE
max           -   the result

Local Variables
---------------
i             -   the processor's sequence number
j, k          -   integers specifying a pair of values a[j], a[k]

To find the maximum of a[1]...a[n]   (program executed by all processors)
----------------------------------
j := i div n         { each processor chooses an a[j] and a[k], }
k := i mod n         { using integer division and remainder }
if a[j] < a[k] then m[j] := FALSE
synchronise          { make sure all processors have finished }
if i <= n then       { examine the results in m[i] }
if m[i] = TRUE then
max := a[i]

Check that you understand how this works. Note that although
simultaneous writing occurs, the values written simultaneously to a
location are all identical.

It has been shown that the ability to write simultaneously is essential -
without it O(1) time is impossible.  Also, with only n processors rather than
n^2, it turns out that the problem requires O(log(log(n))) time.

6.4 Addition of large numbers is in NC
--------------------------------------

Suppose we have two numbers, each n bits long, to be added. The bits are stored
in a[n]...a[1] and b[n]...b[1] (so that a[1], b[1] are the least signicant
bits). The usual bit-by-bit addition algorithm requires O(n) time because you
have to allow the carry to propagate leftwards from position 1 to position n.
We will show that it can be done in O(log(n)) time with O(n) processors using
the "carry-lookahead" algorithm, so that the problem is in NC.

Assume for simplicity that n is a power of two. We need to simulate a binary
tree of processors, with one leaf processor for each bit, so we number the
processors needed as in the "heap" array implementation of a full binary tree
used for priority queues. The processors can work out where they are in the
tree structure using their processor numbers (the parent of p[i] is p[i div 2]
and the children are p[2*i] and p[2*i+1]).

p1
/    \
p2        p3
/  \      /  \
p4    p5  p6    p7

.....................

/      /      /       /             \
p[n]   p[n+1]  p[n+2]  p[n+3]  .....  p[2*n-1]

a[n]   a[n-1]  a[n-2]  a[n-3]  .....    a[1]
b[n]   b[n-1]  b[n-2]  b[n-3]  .....    b[1]

The idea behind the algorithm is sort out the carries BEFORE doing the
addition. Each processor p[i] has global variables c0[i], c1[i] and c[i]
associated with it for communicating with its neighbours. The variables are

Global Variables
----------------
n                  -   size of numbers (n a power of 2, p = 2*n-1)
a[n]...a[1]        -   the bits of the first number
b[n]...b[1]        -   the bits of the second number
c0[1]...c0[2*n-1]  -   carry generated (assuming no carry passed in)
c1[1]...c1[2*n-1]  -   carry generated (assuming a carry passed in)
c[1] ...c[2*n-1]   -   carry received from the right
s[1]...s[n]        -   the result - the sum of a[i] and b[i]

Local Variables
---------------
i                  -   the processor's sequence number

The algorithm works in stages. In the first stage, the values of c0[i] and
c1[i] pass up the tree from the leaves. Their initial value is taken to be
"UNKNOWN", and they are set to "TRUE" or "FALSE". The value of c0[i] for a
particular processor is the value of the carry generated by the group of bits
in the processor's subtree, assuming that no carry is passed into that group of
bits. For example, c0[2] is the carry generated in calculating the bits
s[n]...s[n/2+1] assuming that no carry is generated in calculating
s[n/2]...s[1]. Similarly, c1[i] is the carry generated by a group of bits,
assuming that a carry is passed in.

To find the values of c0[i], c1[i]   (program executed by all processors)
----------------------------------
if i >= n then
bit := 2*n-i                          {leaf -- look at bits from a,b}
c0[i] := a[bit] and b[bit]            {1+1 generates a carry}
c1[i] := a[bit] or b[bit]             {0+1+C, 1+0+C, 1+1+C make a carry}
else
l := 2*i; r := 2*i+1                  {non-leaf -- look at children}
wait for c0[l], c0[r], c1[l], c1[r]   {read them repeatedly}
c0[i] := c0[l] or (c1[l] and c0[r])   {l gives carry (with r's help?)}
c1[i] := c0[l] or (c1[l] and c1[r])
synchronise

After this stage, which takes O(log(n)) steps for the information to pass up
the tree, all the values of c0[i] and c1[i] are known. Next we find the values
of c[i], the actual carries received from the right by the same groups of bits,
by passing information down the tree. Again, c[i] is initially "UNKNOWN", then
set to "TRUE or "FALSE".

To find the values of c[i]           (continuation of program)
--------------------------
if i = 1 then
c[1] := FALSE                         {root -- no carry into whole sum}
else
u := i div 2; r := i+1                {look at parent and sibling}
if i odd then c[i] := c[u]            {right half - same carry as parent}
else c[i]:=c0[r] or (c1[r] and c[u])  {left half - carry from sibling}
synchronise

Again, the time taken for the information to filter down the tree is O(log(n)).
Finally, the addition itself can be carried out, with all the information about

To find the values of s[i]           (continuation of program)
--------------------------
if i >= n then
bit := 2*n-i
s[bit] := (a[bit] + b[bit] + c[i]) mod 2   {treat c[i] as a number now}

This takes O(1) time, so the whole algorithm takes O(log(n)) time with O(n)
processors. This algorithm is better suited to a direct hardware implementation
inside an arithmetic processing unit, rather than on a parallel processor, but
as arithmetic is done on larger and larger words, algorithms like these become
more important.

6.5 Sorting is in NC
--------------------

The mergesort and quicksort algorithms (in fact any divide-and-conquer method)
is naturally suitable for use on a parallel computer because the subproblems
can be carried out simultaneously. However, these algorithms, which are
O(n*log(n)) on a sequential computer, only come out as O(n) on a parallel
computer. We have to do better to show that sorting is in NC. We will show that
sorting can be done in O(log(n)^2) time using a network of O(n*log(n)^2)
processors. This network can be simulated by O(n) processors. In fact there is
an optimum algorithm taking O(log(n)), but it is more complicated.

The method is called the odd-even sorting method. It uses a recursively defined
network of simple 2-in 2-out comparator/swapper components. Our processors can
simulate the components. We will assume that n is a power of 2 for simplicity.

The n numbers are passed into the network using the n inputs on the left. They
then pass through the network from left to right, and emerge sorted on the n
outputs on the right. The sorter for 8 elements is illustrated:

-----  odd-even  ------------  odd-even  ----------------------
-----   SORTER   ---.    .---   MERGER   ---------  COMP  -----
-----  for n/2   ----\--/----  for n/2   ---.  .--  SWAP  -----
-----   items    ---. \/ .---   items    --. \/
\/\/                   \/`---  COMP  -----
/\/\                   /\.---  SWAP  -----
-----  odd-even  ---' /\ `---  odd-even  --' /\
-----   SORTER   ----/--\----   MERGER   ---'  `--  COMP  -----
-----  for n/2   ---'    `---  for n/2   ---------  SWAP  -----
-----   items    ------------   items    ----------------------

<--- odd-even MERGER for n items --->

<--------------- odd-even SORTER for n items ----------------->

The network sorts the first n/2 elements, and sorts the second n/2 elements to
form two sorted lists. Then it merges the two sorted lists. The odd numbered
elements from both lists are passed to the upper merger, the even numbered
elements to the lower merger. After that, the elements pass through n/2-1
comparator/swappers to complete the sort. It is not easy to see that the merger
works -- see Sedgewick's book for a partial explanation. The merger is also
useful for the Fast Fourier Transform, polynomial evaluation etc.

The network is obviously O(n) units wide, and the length of the sorter and
merger, say s(n) and m(n), are given by recursive equations:

m(n) = m(n/2) + 1
s(n) = s(n/2) + m(n)

m(n) = O(log(n))
s(n) = O(log(n) + log(n/2) + log(n/4) + ...)
= O(log(n) + (log(n)-1) + (log(n)-2) + ...)
= O(1+2+3+...+k where k=log(n))
= O(log(n)^2)

The network can be simulated by O(n) processors which perform the operations
from left to right across the network.

6.6 Communication Costs
-----------------------

As an example of the difficulties caused by fixed topologies where
communication costs have to be taken into account, consider solving the maximum
problem on such a computer. Assume that processors only have local memory, and
that they can only communicate by sending messages to neighbouring processors.
Note that such parallel processors are easily built, but this may well not be
the best way to minimise communication costs.

To simplify things, we will suppose that the processors are connected in a
circle, and that each processor can only send messages to its "right-hand"
neighbour. Each has a message buffer so that messages are never lost.  For the
maximum problem, we assume that there are n processors with processor i having
a[i] in its local memory. Assuming that the cost of sending messages is high
compared to processing costs, we would arrange for the processors to be doing
other work while solving this problem, so we take the cost of solving the
problem to be the (average) number of messages sent by each processor.

The obvious algorithm for this problem takes O(n) time. In the first time step,
each processor sends its own value a[i] to its neighbour. In the remaining
steps, processors pass on received values until all processors know all the
values a[1]...a[n], and thus can work out the maximum. This takes O(n) steps,
and at each step each processor sends a message. Thus the cost is O(n) messages
per processor. It has recently been discovered that only an average of O(log n)
messages per processor is needed as follows. We can assume that the a[i] are
all distinct. The algorithm executed by all processors is:

Local Variables
---------------
rightval
val     -   initially a[i]
leftval

To find maximum of a[1]...a[n]
------------------------------
1 send message containing val
2 wait for message and put incoming value in leftval
3 if leftval = val then stop          (val is the maximum value)
4 send message containing leftval     (relay the incoming value)
5 rightval := val;   val := leftval   (move along one)
6 wait for message and put the value in leftval
7 if (val > leftval) and (val > rightval) then goto step 1
8 repeatedly wait for a message and relay it

At the first stage, each processor i sends a[i] to its neighbour which passes it
on to its neighbour. Now each processor knows its own value and the two values
to the left - three consecutive values. If the middle value is larger than
the other two (a local maximum), then the processor retains that value.
Otherwise, it becomes "passive" and takes no further part, except that it relays
all messages it receives. Now there at most n/2 active values. The process is
repeated to give at most n/4 active values. There are O(log n) stages, after
which the there is one remaining active value. The active processor sends this
to itself (relayed all round the circle) and recognises it at step 3 and stops.
This maximum value could then be relayed to all other processors if desired.
Note that there is no proper synchronisation - some processors may be on one
stage while others are on the next.

However, the message buffers and the act of waiting for a message ensure
sufficient "local" synchronisation for the algorithm to work. Note also that
the stages may take different amounts of time - the first is likely to take
O(1) time, the last O(n). However, at each stage, each processor (including the
passive ones) sends exactly 2 messages, for a total of O(log n) each.
```