Welcome to the world of the Data Diffusion Machine (DDM): a virtual shared memory architecture in which data is free to migrate through the machine.
Shared memory machines are convenient to program but do not scale beyond tens of processors. The DDM overcomes this problem by providing a shared memory abstraction on top of a distributed memory machine: a DDM appears to the user as a conventional shared memory machine but is implemented with distributed memory hardware. This approach is generally known as Virtual Shared Memory (VSM).
The DDM is not the only architecture implementing virtual shared memory; other machines, such as DASH, KSR, and MIT Alewife, implement it too. There are, however, essential differences between these machines. The DDM and KSR machines are cache-only memory architectures (COMAs): each data item migrates to the processor(s) where it is used. There are no fixed home locations where data will always be found; instead, data is located by means of directories. Data addresses no longer correspond to physical locations but simply serve as names (or tags) for data. Because data is free to move around the system, the programmer sees a uniform memory access (UMA) system in which all data is equally accessible. In contrast, the DASH and Alewife architectures store each data item in a home location and cache it where it is frequently used. In these architectures memory accesses are non-uniform (NUMA), which generally means that the programmer is more restricted in how data is laid out in the address space.
The purpose of the DDM architecture is to provide the user with a scalable shared memory machine. To do so, the DDM uses a hierarchical structure:
The leaves of the tree consist of processors with large set-associative memories that comprise the sole store for data. The nodes above the leaves are directories that keep track of the data items below them. A data item can be either writable in the memory of exactly one processor, or read-only in the memories of several processors. When the data a processor requires is not available locally, a request is posted upwards in the tree. The request propagates until a directory indicates that the data is available in some branch of the tree, at which point the data is fetched and brought back to the requesting node. If necessary, ordinary caches can be placed between the processor and the DDM memory, or private memory can be attached to nodes to store, for example, program code.
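The lookup described above can be sketched in a few lines of code. This is a toy model only: the class names, the `insert`/`read` functions, and the bookkeeping are illustrative assumptions, not the DDM protocol itself (the real specification is far more involved and covers writes, invalidation, and replacement).

```python
# Toy model of hierarchical data location in a COMA such as the DDM.
# All names here (Leaf, Directory, insert, read) are illustrative
# assumptions, not taken from the DDM protocol specification.

class Node:
    def __init__(self):
        self.parent = None

class Leaf(Node):
    """A processor whose set-associative memory is the sole store for data."""
    def __init__(self):
        super().__init__()
        self.memory = {}                     # item tag -> value

    def insert(self, tag, value):
        self.memory[tag] = value
        d = self.parent                      # register the tag in every
        while d is not None:                 # directory on the path to the root
            d.present.add(tag)
            d = d.parent

class Directory(Node):
    """An inner tree node that tracks which tags exist somewhere below it."""
    def __init__(self, children):
        super().__init__()
        self.children = children
        self.present = set()
        for c in children:
            c.parent = self

def read(leaf, tag):
    """Hit locally if possible; otherwise propagate the request upwards
    until a directory shows the tag in its subtree, then descend to a
    copy, fetch it, and replicate it (read-only) at the requester."""
    if tag in leaf.memory:
        return leaf.memory[tag]
    node = leaf.parent
    while tag not in node.present:           # climb: not in this branch
        node = node.parent
    while isinstance(node, Directory):       # descend towards a copy
        node = next(c for c in node.children
                    if (tag in c.memory if isinstance(c, Leaf)
                        else tag in c.present))
    leaf.insert(tag, node.memory[tag])       # data diffuses to where it is used
    return leaf.memory[tag]

# Two processors under a two-level tree; "x" initially lives only in p1:
p0, p1 = Leaf(), Leaf()
root = Directory([Directory([p0]), Directory([p1])])
p1.insert("x", 42)
print(read(p0, "x"))                         # -> 42; a copy now sits in p0 too
```

Note how an address ("x") names the data rather than a physical location: after the read, both leaves hold a read-only copy, and either can serve future requests.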
The original concept of the DDM envisaged an arbitrary interconnect, although the published DDM protocol was elaborated for a bus-based implementation. A bus-based prototype of a DDM was designed and built at the Swedish Institute of Computer Science (SICS). Using a bus as the communication mechanism has the advantage that posted transactions are broadcast, so all children at a level observe the same stream of transactions.
A link-based DDM was developed at the University of Bristol within the scope of the PEPMA project. Without a broadcast medium, the protocol must explicitly broadcast where necessary. This makes the protocol slightly more complex, but it obviates the need for snooping hardware, which complicates the design of a bus-based system. The DDM developed at PACT in the scope of the HORN project is also link-based, but has improved scalability and is starvation-free. The network used is no longer a pure tree but a split tree; the advantage of this network is that the directories are smaller and less prone to contention. Furthermore, the protocol has been enhanced so that all requests made by a node are eventually served: it is not possible for two nodes to lock out a third.
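The difference between bus snooping and explicit broadcast can be illustrated with a small sketch. On a bus, every child sees every transaction for free; over point-to-point links, a directory must send explicit messages, but only to the children that actually hold a copy. The function below is a hypothetical illustration, not part of the DDM protocol: it computes the invalidation messages a directory would send before granting one child write access.

```python
# Hedged sketch: explicit invalidation over point-to-point links.
# The message format and the per-directory copy set are assumptions
# made for illustration; they are not taken from the DDM specification.

def invalidate(copies, writer):
    """copies: indices of the children holding a read-only copy.
    Return the explicit messages a link-based directory sends so that
    only the prospective writer's copy survives."""
    return [("invalidate", child)
            for child in sorted(copies)
            if child != writer]

# A directory tracking copies in children {0, 2, 5}; child 2 wants to write:
msgs = invalidate({0, 2, 5}, writer=2)
print(msgs)      # -> [('invalidate', 0), ('invalidate', 5)]
```

On a bus the same effect falls out of snooping hardware: one transaction on the shared medium, observed by everyone. Over links the directory pays one message per remote copy instead, which is the extra protocol complexity the text refers to.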
We are currently completing the design of this link-based DDM. A specification of the protocol that we are using can be found in technical report CSTR-93-17 of the University of Bristol Computer Science Department. We evaluate the architecture by means of a (software) emulator that has been calibrated to reflect realistic timings. Accompanying tools give a real-time animated graphical visualisation, allowing the user to observe the behaviour of a program and spot contention in parts of the memory (a demonstration of this last tool is available on-line). Preliminary evaluation results for up to 72 nodes are very promising.