CHAPTER 4 NETWORK ALGORITHMS 4.0 Definitions --------------- A network is a set of NODES (or vertices) joined by EDGES (or directed edges or arcs or links). Each edge goes from one node to another, or possibly from a node to itself. Each edge may have a LABEL. (For mathematicians, this is a labelled directed graph with loops). ____ / | 4 2 / 5| A -----> C <----- E <----- | ^ | | | | 5| 6| 3| | | | v 3 | 2 v B -----> D <----- F A network may be used to represent airline routes between towns, electronic circuit layouts, jobs in a manufacturing process where some jobs must be finished before others can be started, flow through a network of pipes, etc. The labels on the edges might represent distances, times, capacities etc. If edges don't have a direction, e.g. if they represent roads, each road can be represented as two edges, one in each direction. A network can be represented as a data structure in two main ways, depending on whether it is dense or sparse. If it is dense, (most possible edges present), the LABEL MATRIX (adjacency matrix) is suitable. If the network has N nodes, this is an N by N square matrix of labels. The i,j entry "label[i,j]" represents the label on the edge from node i to node j. Some special value must be chosen to represent edges which do not exist - e.g. a large (infinite) number if the labels are distances, or zero if the labels are capacities. If the above network represents capacities, then it has label matrix: A B C D E F A 0 5 4 0 0 0 B 0 0 0 3 0 0 C 0 0 0 0 0 0 D 0 0 6 0 0 0 E 0 0 2 0 5 3 F 0 0 0 2 0 0 If the network is sparse, then the matrix is sparse, and we can do better by having a list of the edges out of each node. The list could be an array or linked list -- we will assume a linked list representation. For each edge in a list, we must record its label and its target node. For example, the above network is represented by the following linked lists.

A -> 5B -> 4C B -> 3D C . D -> 6C E -> 2C -> 5E -> 2F F -> 2D This is just a compact representation of a sparse matrix, and in fact network theory techniques can be used in the design of sparse matrix algorithms generally. What is the size of a network problem for a network with N nodes and E edges? For a dense network, using the matrix representation, the problem size is the number of elements in the matrix. For a sparse network, using the linked list representation, it is the total length of the lists, which is proportional to the number of edges. DENSE SPARSE n = N^2 n = E One important special kind of network is the PLANAR NETWORK. This is a network which can be drawn on a flat piece of paper with no crossings, and is particularly important for laying out electronic circuits on single-layer chips. Obviously, the labels and directions of the edges make no difference in trying to determine whether a network is planar (i.e. can be drawn flat) or not, and loops make no difference either, so we can remove all these things and leave ourselves with an ordinary mathematical GRAPH. The two smallest graphs which are not planar are called "K5" and "K3,3" by mathematicians. The graph "K5" consists of 5 nodes A,B,C,D,E which are kompletely konnected -- i.e. every node is connected to every other. The graph "K3,3" consists of six nodes A,B,C,D,E,F in which the first three A,B,C are all connected to each of the second three D,E,F (but there are no connections between A,B,C, or between D,E,F). The graph "K3,3" is sometimes called the "water, gas, electricity" graph, because the nodes A,B,C can represent the three mains services, and the nodes D,E,F can represent three houses in a street to which the services are to be connected. The services cannot be connected without one service conduit crossing another somewhere. You should try drawing these two graphs on paper without crossings, and convince yourself that it can't be done. It turns out ("Kuratowski's theorem") that any larger graph which can't be drawn flat must "contain" one of these two simple examples in some way. In a drawing of a planar graph on a piece of paper, the paper is divided into a number F of FACES, or regions, including the "outside" face which extends to the edges of the paper (imagine cutting the paper along all the edges and counting the pieces). The theory of planar graphs is heavily based on EULER'S FORMULA which holds for all connected planar graphs (i.e. planar graphs which are in one piece). Euler's formula is N - E + F = 2 which you can remember by noting that the items on the left occur in order of increasing dimension (numbers of points, lines and surfaces) and that the signs alternate (and this hints at the n-dimensional generalisation).

One way to see that the formula is true is to imagine measuring the Euler number (N-E+F) of a graph by simplifying it in easy stages. Imagine removing an edge between two faces. The two faces become one, and so both E and F drop by one, leaving the Euler number the same. Repeat this as often as you can. You are left with edges which have the same face on each side, so that if you removed one, the graph would fall into two pieces. In fact the graph is now a TREE. Now you look for a LEAF -- i.e. a node connected by only one edge to only one other node. Removing the leaf node and its edge doesn't change the Euler number, so you repeat this as often as possible. You are now left with a single node, a single face and no edges, so the Euler number is 2. The most important property of planar graphs (or networks) for us is that they are sparse. In fact E is O(N) rather than O(N^2). The maximum number of edges in a planar graph is 3*N-6 (apart from a couple of trivial exceptions - can you find them?). A planar network can have two directed edges between each two nodes, and a loop at every node, giving a maximum of 7*N-12 edges. PLANAR GRAPH PLANAR NETWORK E <= 3*N - 6 E <= 7*N - 12 To prove the planar graph result, first make sure that your planar graph has as many edges as possible by adding diagonals to any face with more than three sides, until all the faces have three sides (including the "outside" one). If we count the total number of edges by multiplying the number of faces by three, we have actually counted each edge twice (once for the face on either side), so the total number of edges is E = 3*F/2. Plugging this into Euler's formula gives E = 3*N-6, so our original graph had E <= 3*N-6. So we always use the sparse linked-list representation for planar graphs or networks, with problem size n = N. For example, for a circuit with 1000 nodes, the matrix would need 4 M-bytes, whereas the linked list requires 150 K-bytes, assuming 4 bytes per pointer or integer. Times taken to process planar networks are similarly shorter, simply because it takes a lot longer to scan 4 M-bytes of data than to scan 150 K-bytes of data. 4.1 Scanning Networks --------------------- The simplest way of scanning a network uses a recursive design. Suppose we want to find (and mark) all the nodes we can reach from a particular node by following edges. The following design will do: To scan from node x ------------------- mark x for each edge out of x, to a node y, do if y is unmarked then scan from y ____ / | 4 2 / 5| A -----> C <----- E <----- | ^ | | | | 5| 6| 3| | | | v 3 | 2 v B -----> D <----- F

To find all nodes reachable from E, we call scan(E). This calls scan(C) which returns immediately, then calls scan(F). Then scan(F) calls scan(D). As there are no edges out of D to unmarked nodes (C will have been marked), scan(D) returns immediately. Then scan(F) returns, and scan(E) returns. The result is that E,C,F and D have been marked. We can estimate the time taken with a label matrix representing the network as follows. Only unmarked nodes are scanned, and then they are immediately marked. Thus no node is scanned more than once, so O(N) nodes are scanned. In the case of a label matrix, scanning the edges out of a node involves looking at all N elements of the corresponding row of the matrix. Thus the algorithm takes at most O(N^2) steps, i.e. O(n) time -- linear time. We can estimate the time for a linked-list representation as follows. Again, no node is scanned twice. To scan the edges out of a node, we just run through the list of edges out of it. The total time is at most the total length of all the lists, so the time taken is O(E) which is O(n) -- linear time again. Using a stack gives a DEPTH FIRST scan in which we follow a path as far into the network as possible before backtracking. If we had used a queue (so we scan the oldest node first), we would get a BREADTH FIRST scan in which all the neighbours of the initial node are scanned, then all their neighbours etc. 4.2 Shortest Path Problems -------------------------- Suppose the labels in a network represent distances, and the object is to find the shortest distance from one node to another. There seems to be no better way of finding the shortest distance from A to B other than finding all the shortest distances from A to other nodes. The overall strategy for doing this is as follows. For each node we keep track of the shortest known distance to it so far. Then we scan the edges and update the shortest known distances as follows. Suppose we examine an edge from node X to node Y, and suppose the distances are 5 & 8, and the edge length 2. Then we update the distance to Y to be 7, since we have found a shorter route. 5 2 8 5 2 7 X ------> Y ===> X ------> Y There is a problem. Consider the following network: 0 7 00 2 00 A --------> B --------> D | ^ | | 1| 4| | | v / C -------- 00 We start with distances from A to A,B,C,D of 0,00,00,00 (00 means infinite). After scanning the edges from A, we get distances 0,7,1,00. After scanning the edges from B, we get 0,7,1,9. Then we scan the edges from C getting 0,5,1,9. Since the distance to D is actually 7, something has gone wrong. If we had scanned the edges from C before the edges from B, we would have succeeded.

To fix this, we keep a list of nodes to be scanned, but this time at each stage we take from the list the node with smallest (current) distance. The most suitable data structure for the list is a priority queue, nodes with the shortest distance having the highest priority. Initially, the priority queue contains all the nodes, the starting node having distance 0, all others having infinite distance (maxint, say). The algorithm is: To find shortest distances -------------------------- initialise distances and priority queue while priority queue not empty do remove node of shortest (current) distance from priority queue, say x mark it for each edge out of x to an unmarked node y if (distance to x) + (length of edge xy) < distance to y update distance to y shuffle y up to a new position in the priority queue We can prove that the algorithm works as follows. At each stage, the marked nodes have their final correct distances recorded. The distances of the rest are the shortest distances via marked nodes only. When we mark a new node, we know its shortest distance via previously marked nodes. There can be no shorter route via unmarked nodes, as we have chosen it to be the closest one. Thus the node we are marking has its final correct distance recorded. We can estimate the time taken as follows. For a sparse representation, we can use the usual "heap" structure in an array to represent the priority queue. The priority queue has length O(N), so removing or re-positioning an item takes at most O(log(N)) time. There is at most one removal per node, and one re-positioning per edge, giving O(N*log(N) + E*log(N)) = O(E*log(N)) time. Since we can take N to be O(E), this gives us O(E*log(E)) = O(n*log(n)) time. For a matrix representation Length[1..N,1..N], the above algorithm would repeat the inner loop N^2 times, giving an O(N^2*log(N)) algorithm, but we can do better as follows. We use an array Distance[1..N] which we use as priorities, a boolean array Marked[1..N] and variables QSize and FrontOfQ. The item at the front of the queue can be removed from the queue by setting Marked[FrontOfQ] := TRUE, but then FrontOfQ needs updating by scanning the queue entries to find the one with shortest distance. This would take O(N) time giving an O(N^3) overall time for the algorithm, but the scan can be combined with the scan of edges out of a node, giving an O(N^2) algorithm. As before, all distances start out infinite, except Distance[1] = 0, and all nodes are in the queue, with FrontOfQ = 1. To find shortest distances (dense network variation) -------------------------- initialise Distance[1..N], IsInQ[1..N], QSize, FrontOfQ while QSize not zero do take node x from front of queue ( x := FrontOfQ; QSize := QSize-1 ) mark node x ( Marked[x] := TRUE ) for each node y, if y is unmarked then if Distance[x] + Length[x,y] < Distance[y] then update Distance[y] update FrontOfQ ( if FrontOfQ=x or Distance[y]<Distance[FrontOfQ] then FrontOfQ := y )

This is clearly O(N^2) = O(n), so priority first searching is essentially no more expensive than depth first searching in dense networks, but in sparse networks it involves an extra factor of log(n). We can find the shortest paths to each node as well as the shortest distances by recording for each node not just the current distance to that node, but also the "predecessor" node which caused that distance to be recorded. When the algorithm is over, the predecessor nodes can be traced back to the initial node to find the path. This general technique of using a priority queue while searching is called PRIORITY FIRST searching. It generalises breadth first and depth first search, and is used in a variety of situations. For example in chess playing programs, possible moves are often searched in order of the strength of the position so that more promising lines of play are examined first. 4.3 Longest Path Problems -------------------------- To find the longest path from one node to another in a general network is an INTRACTABLE problem connected with the Travelling Salesperson problem (see later). Often however, the network involved is acyclic (a directed acyclic graph, or DAG for short), i.e. it has no cycles: A -----> C <----- E -----> G | ^ | ^ | | | | | | | | | | | | v | v | B -----> D <----- F -----> H Efficient algorithms on DAG's rely on the following fact. A DAG can be TOPOLOGICALLY SORTED, i.e. sorted into an order in which all the edges "point forwards". For mathematicians, a DAG represents a partial order which can be embedded in a total order. -----------------------------------------------> | ---------------------------------> | | | | | E ---> F ---> H ---> G A ---> B ---> D ---> C | | | | ------------------> -------------------> The topological sort algorithm works as follows. Start anywhere and follow edges until you find a node with no edges out of it (like C). Make this node the last one in the sorted list, remove it and the edges into it, and topologically sort the remaining nodes. This crude algorithm is O(N^2). To improve it, we arrange to take up each search for a node where we left off the previous search. This gives us an ordinary depth first search in which we add a node to the sorted list when we backtrack from it. However, not all nodes may be reachable from the first node we choose, so we may have to do several depth first searches. The algorithm is:

Topologically sort a DAG ------------------------ Initialise the sorted list to be empty While there is an unmarked node u in the DAG do depth first scan from u Depth first scan from x ----------------------- mark x for each edge out of x, to a node y, do if y is unmarked then scan from y add x to the front of the sorted list Check that in the above example, with edges out of a node listed in alphabetical order, this algorithm causes two depth first scans (from A and E) and gives the sorted list shown above. The algorithm is clearly linear, O(n). To find the longest path from A to Z in a DAG, it is just as quick to find all the longest paths from other nodes to Z (finding all longest paths from A can be done best by changing the representation of the network so that each node has a list of the edges coming in). To find the longest paths to Z in a DAG ---------------------------------------- Topologically sort the DAG Consider each node x in REVERSE topological order For each edge xy out of x calculate (length of xy) + (distance of y) Set distance of x to be largest of these It works because the ordering ensures that y is to the right of x, and so must have been dealt with already. There is one calculation per edge, so this is O(E). Variations on this are used in Critical Path Analysis for job scheduling. 4.4 The Network Flow Algorithm ------------------------------- Suppose we have a network representing flow through pipes, or traffic flow. Each edge has a label representing the maximum flow along it. There is a source node A and a sink node Z. We want to achieve the maximum flow from A to Z. The idea is to start with any feasible flow (i.e. one which respects the capacities). In this problem, a zero flow will do. Then we find a path from A to Z along which flows can be increased. (Pictures from Sedgewick 436-7: note misprints) 0/8 2/6 2/4 -----> B -----> C -----> / \ / \ / \ |/ 0/2 \ / \ /-- \ A \ F After ADEBCF \ / \__ / \ / |\ 2/3 / \ / \ / -----> D -----> E -----> 2/2 2/5 0/5

The first number is the current flow, the second is the capacity (max flow). Now after ABCDEF, we get: 2/8 4/6 2/4 2/2 2/3 2/2 4/5 2/5 After ABCF we get: 4/8 6/6 4/4 2/2 2/3 2/2 4/5 2/5 Now there are no paths of the above type along which the flow can be increased. However, the flow can still be increased along ABEF as decreasing the flow along EB is equivalent to increasing the flow along BE. We get: 6/8 6/6 4/4 2/2 0/3 2/2 4/5 4/5 Thus we define a "flow-augmenting path" to be a path of forward and backward edges from A to Z in which the forward edges are not saturated (do not have maximum flow) and the backward edges have non-zero flow. It is easy to work out the maximum amount by which the flow can be increased along such a path (min of amounts by which the flow can be increased along each edge). It is now fairly easy to show that when this procedure cannot be repeated further, we have a maximum flow. (A search for paths will divide the nodes into "visited" and "unvisited". The set of edges from the visited to the unvisited nodes forms a barrier across which no more can flow - called a "min cut".) There is still the question of what order the paths should be searched for. If we were to use a depth first method, we would be likely to find long paths from A to Z which could be disastrous: 1000 1000 --> B --> / | \ A | 1 C \ v / --> D --> 1000 1000 If the paths found were alternately ABDC ADBC, the flow would only be increased by 1 each time. It is much better to do a PRIORITY FIRST search looking for paths producing the greatest increase. The number of flow augmenting paths needed is hard to analyse, but is small in practice with these methods. Each path needs one scan of the network, O(N^2) for dense networks, O(E*log(N)) for sparse ones.

4.5 Problems related to Network Flow. -------------------------------------- Suppose that we have a flow problem which also has capacities on the nodes, i.e. there is a maximum flow possible through each node as well as each edge. This can be converted into a standard flow problem by introducing dummy edges: 6 6 --->--- X --->--- ---->--- X1 ---->---- X2 ---->---- A problem with multiple sources and sinks can be converted into a standard problem by adding a "supersource" and "supersink": 10 A ... Y 10 A --10--> ... --10->Y / \ B --7--> ... --12->Z A1 - 7 B ... Z 12 - Z1 C --5--> \ 5 C A problem with undirected edges (flow in either direction) can be represented by replacing each undirected edge by a pair of directed edges (as usual in network algorithms). Suppose each edge has a minimum flow as well as a maximum capacity. This can be handled in the same way, except that a zero flow is not feasible, and an initial feasible flow may be hard to find. There is a separate (similar) algorithm for finding an initial flow, again by changing the network and applying the flow algorithm to the new network. (non-exam: not in the books). In the MATCHING problem, we have a number of applicants and a number of jobs. Each applicant is qualified only for certain jobs. The problem is to match up as many pairs as possible. This is a multiple source/sink flow problem: (Picture of bipartite network turned into flow problem) (Not the only algorithm) A generalization is the ASSIGNMENT problem in which there is a cost associated with each edge (representing suitability of applicant for job). A further generalization is the TRANSPORTATION problem as follows. We have a number of factories each producing a maximum amount called its supply. We have a number of warehouses, each with a (minimum) requirement or demand. There is a cost associated with transporting a unit of product from a particular factory to particular warehouse. (Assignment is transportation with unit supplies and demands). This is again a multiple source/sink flow problem in which we want a minimum-cost maximum-flow. The flow algorithm can be improved to cope, or there is a "Hungarian" algorithm which is better. The MIN CUT problem is as follows. Suppose we have an electronic circuit which we need to divide into two parts (to go on two sides of a circuit board or two layers of a chip). We want the number of wires connecting the two parts to be a minimum. If we treat this as a flow problem (unit flow along unit-capacity wires between components - NOT electrical flow), then when we have found the maximum flow, we will have found a minimum cut as described above.

4.6 Pattern Matching With Regular Expressions --------------------------------------------- Regular expressions are a way of specifying patterns. Suppose we only have patterns of letters, other symbols having special meanings. The special operations on patterns are CONCATENATION, UNION (+) and CLOSURE (*), and brackets may be used. Concatenation means "followed by", union means "either/or" and closure means "any number of (including zero)". We also allow the abbreviation "." meaning any single letter (short for a+b+...+z). For example: Pattern Meaning Matches ------- ------- ------- a matches the letter a a ab matches a followed by b ab a+b matches an a or a b a, b a* any number of a's (maybe 0) "", a, aa, aaa, aaaa ... ba*b some a's between two b's bb, bab, baab, baaab ... (a+b)b an a or b followed by a b ab, bb (a+b)* any number of a's or b's "", a, b, aa, ab, ba, bb, aaa ... (ab)* any number of ab's "", ab, abab, ababab ... aa* any positive number of a's a, aa, aaa, aaaa ... a(b+c)d b or c between a and d abd, acd . any single letter a, b, c, d, e ... .* any sequence of letters "", a, be, sea, xyzzy ... .*cat.* anything containing cat cat, scat, catch, zzcatzzzz ... .*(ie+ei).* anything containing ie or ei Exercise: Find an expression matching any sequence of a's and b's NOT containing two consecutive a's. Note that there is a convention that where there are no brackets, closure is done before concatenation which is done before union (like exponents, multiplication and addition). Also note that the regular expressions in the UNIX editors "ed", "vi" have no union, work with any characters and not just letters, use \( and \) for brackets and have some useful extra abbreviations. These expressions are called REGULAR expressions and form a regular algebra which is an interesting one with a differential calculus etc. (see "Regular algebra and finite machines" by Conway). Regular expressions are important not so much for their convenience for the user, but more because they correspond EXACTLY to patterns which can be recognised by a single scan without using any memory. Thus there is no regular expression which matches every palindrome since there is no alternative but to remember the first half in order to match it with the second half. Similarly there is no regular expression matching every pattern with an equal number of a's and b's because there is no alternative to keeping a count. We can build a network to represent a pattern as follows. The network will have a start node A and a finish node Z. The networks for patterns "a", "a+b", "ab" and "a*" are:

a A -------> Z a ------> B -----> / \ A Z \ b / ------> C -----> a b A ------> B -----> Z A ------> Z / \ ^ | a| | | v \ / B Each node is either a matching node with one edge out labelled with a letter, or a choice node with two unlabelled edges out. Such a network has a sparse representation without needing linked lists. A network for an arbitrary pattern is built up out of networks for smaller subexpressions. For union, we use a new start node with edges to the old start nodes, and join the old finish nodes together. Thus (a+b)+a* gives: a ------> C -----> / \ ------> B \ / \ b \ / ------> D ------>-\\ A Z \ / \ / ------> E -----------------> / \ ^ | a| | | v \ / F For concatenation, the finish node of the first network is joined to the start node of the second network, so (a+b)c* gives:

a ------> B -----> / \ A D ------> Z \ b / / \ ------> C -----> | | c^ | | v \ / E Finally, for closure, we put the new start node at the old finish node, add a new finish node, and add two new edges (from the new start node to the old start node and to the new finish node). Thus (a+b)* gives: A ------> Z / \\ ^ ^ | a| b| | | | | B C | ^ ^ | | | v \ // D For example, suppose we want a network to represent (ac+a*b)d. We get a network similar to the one in Sedgewick: a c B -----> C -----> / \ d A D -----> Z \ b / E -----> F -----> / \ ^ | a| | | v \ / G We can use a network like this to scan a string to see if it matches the pattern as follows. At any time, there is a set of active nodes. To begin with, only the start node is active. For each character in the string, we do the following. Each active choice node is first replaced by its two successor nodes until no active choice nodes remain. Then the current character is compared to the edge-labels out of the active nodes. Nodes which do not match simply become inactive. Nodes which match are replaced by their successors. If the finish node winds up active, the string matches the pattern. For example, suppose we want to know if "aabd" matches the pattern (ac+a*b)d. We use the network above. We keep a list of active nodes, with the choice nodes first.

Step 1 ------ A -> E,B -> F,G,B -> E,C (replace A by its successors, then same for E, then match first "a") Step 2 ------ E,C -> F,G,C -> E (replace E by its successors, then match second "a") Step 3 ------ E -> F,G -> D (replace E, then match "b") Step 4 ------ D -> Z (match the "d") The finish node Z is active at the end, so the string aabd does match the pattern. If we try with abc, we get: Step 1 ------ A -> E,B -> F,G,B -> E,C Step 2 ------ E,C -> F,G,C -> D Step 3 ------ D -> (no states left active) If we run out of active states like this, or if the finish node Z is not active at the end, then the string does not match. We now know abc does not match the pattern. In order to implement this pattern matching algorithm, we need to be able to extract the first item from the list, and add items to the beginning or end of the list (according to whether they are choice nodes or matching nodes. Such a list is often called a "deque" or "double ended queue" and can be implemented in an array in the same way as a queue. No node ever appears twice on the list, so an array of length N can be used for an N-node network. There are many variations on these networks. The ones in Sedgewick are almost identical, but with extra unlabelled edges which make the building of the network simpler. It is possible to design simple building rules which eliminate choice nodes. However, a node can then have any number of edges out of it, the work done in each step of the matching algorithm is not reduced, so it is not worth it. It is also possible to design networks which not only eliminate choice nodes, but also have only one active node at any time. However, there are no simple building rules for these. Such a network for (ac+a*b)d is:

b ---------------------> / --> \ / a a \a/ b \ d A -----> B -----> C -----> D -----> E \\ b // \-------------->/ \ c / ------------> Networks like these are often better, but may sometimes yield very large (in fact disastrous) numbers of nodes. Also note that there can be more than one edge from one node to another (as B to D above). This is no problem in a sparse representation.