[ Bristol CS | Index ]

# Graph Structures

Functional programmers often want to create structures which are not purely tree-like. Examples are parse trees with cross-references from occurrences of variables to their declarations, general graphs used to implement well known algorithms, and direct access to heap objects via pointers.

This working paper reviews the array and monadic approaches to graph structures, and makes a proposal for implementation in Brisk.

## Trees Versus Graphs

Trees are well supported by functional languages using recursive data types. In trying to generalise to graphs, two issues arise concerned with sharing and updating which mean that some different kind of representation is needed.

With tree structures, there is no distinction between copying a subtree and sharing it. Thus a tree is often represented internally as a graph, with identical subtrees being shared. This makes updating reasonably efficient. Updating a tree consists of replacing those nodes which refer directly or indirectly to the new information. Typically, this involves replacing the nodes on the path from the root down to the point of change. Other parts of the tree remain unchanged, shared by both the old and new versions of the tree. In a tree with n nodes, an update often involves replacing only about log n nodes.

With a graph structure, there is a distinction between copying a structure and sharing it, otherwise you would not be able to distinguish cycles of different lengths, or to distinguish a cycle from an infinite path. This suggests that each node in a graph must have an identity so that, for example, if you traverse a cycle, you can tell when you have got back to where you started from.

The situation with updating is also different in a graph from a tree. In general, a node must always be replaced by a new one if any of the items it points to represent different values from before. In a graph, the update of any one node would lead to a need to replace all the nodes from which it is accessible, which could easily be the entire graph! This suggests that edges in a graph structure should be represented indirectly via some lookup mechanism.

## The Array Approach

One approach is to use an array. An edge is represented by (say) an Int which can be used as an array index. Each entry in the array represents a node and uses the Int type to refer to other nodes. For example:
```   type Graph = Array Int Node
type Node = [Int]
```
This provides the necessary indirection so that a node can be updated in isolation. However, there are disadvantages to the array approach:
• Updating an array entry may not be constant time.
• Having the Int type visible is unsafe; it should be made abstract.
• Even then, it would not be possible to prevent someone from indexing one array with an index which belongs to another.
• Adding nodes is awkward; the array may need to be increased in size.
• Deleting nodes leaves holes in the array which need to be garbage collected explicitly.
This last point is particularly problematical when graphs are being used to model run time heaps, or to gain direct access to the current run time heap. It would be much pleasanter if garbage collection of the graph structure was handled automatically by garbage collection of the program.

## The Monadic Approach

In the monadic approach, actions are provided to create new references (also called mutable variables) and to change what they point to. In this approach, the structure in which the references are looked up is implicit in the monadic state. This allows references to be represented directly as addresses, it ensures constant time update, and it allows automatic garbage collection. However, it does have disadvantages:
• It is difficult to have a number of different graphs around at the same time.
• It can still be difficult to prevent one reference from being looked up in a graph to which it does not belong.
• All graph handling is forced to be monadic.
The proposals for handling independent substates in Brisk's monadic facilities can be used to support multiple graphs at the same time, and it is proposed that such facilities be implemented in Brisk. However, non-monadic facilities are also needed. The situation is similar to arrays; monadic incremental arrays are available, but monolithic arrays are also provided for conventional use. Both kinds are are useful.

## The Proposal

What is proposed for Brisk is that, as well as monadic actions for handling references, non-monadic monolithic facilities are also provided, and it is these we concentrate on here.

We assume that references are abstract, and that equality is the only direct operation on them which might need to be available. We will call the structure in which references are looked up a Heap rather than a graph, since the item which a reference points to is arbitrary, and is not restricted to being a simple node. The operations proposed are:

```newHeap :: IO (Heap a)                      -- create a new heap
newRef :: Heap a -> a -> (Ref, Heap a)      -- generate a new pointer
lookup :: Heap a -> Ref -> a                -- like ! in arrays
update :: Heap a -> Ref -> a -> Heap a      -- like // in arrays
```
The operation to create a new heap is monadic, in fact an I/O operation. This is to ensure that every heap can be given a separate identity, and references from different heaps can be distinguished. Otherwise, if newHeap just had type Heap a, and we were to define two heaps by applying identical operations to newHeap, then referential transparency would force us to conclude that the two heaps, and all the references generated from them, were identical values. This means that references from one heap could be used in another. By making newHeap monadic, an attempt to use a reference from one heap to look something up in another can be made an error.

There is no operation to find all the references in a heap. This is deliberate; it allows references to be garbage collected. A programmer implementing a graph must explicitly hold onto a node or nodes from which all other nodes of interest are accessible.

## Implementation

A reference can be implemented not directly as an address, but as a small indirection node containing the identity of the heap which it belongs to, and a pointer to the value currently being referenced.

The problem of accessing different versions of a heap at different times can be dealt with using trailers, as with the proposed Brisk implementation of arrays. Thus a heap is represented as node containing just its identity. When a heap is updated, a copy is made and the original is over-written with a node representing an unevaluated expression which refers to the new version and undoes the update.

The lookup operation is implemented by first evaluating the node representing the heap to bring it to the correct version if necessary, then checking that the reference node contains the correct heap identifier, and then indirecting to find the associated value.

Dr. Ian Holyer, ian@cs.bris.ac.uk. Last modified on Friday 22 August 1997 at 16:59.