The fourth UK Many-Core developer conference (UKMAC 2012)

UKMAC 2012 programme

Time Speaker
09:30 Simon McIntosh-Smith, University of Bristol
  Welcome and Introductions
09:40 Andreas Olofsson, Adapteva
  Keynote address: Ultra energy efficient many-core computer architectures
10:30 Alan Gray, EPCC
  Scaling Soft Matter Physics to a Thousand GPUs and Beyond
The GPU architecture is inherently more suitable than the traditional CPU for many types of intensive parallel computation, since it features a large number of lightweight cores offering a relatively high ratio of performance to power consumption. Consequently, an increasing number of massively parallel supercomputers are based on heterogeneous node architectures featuring CPUs coupled with GPUs as compute accelerators. Such systems provide a possible template for future exascale systems (for which power consumption will be a key issue). We present our experiences in scaling the "Ludwig" lattice Boltzmann fluid dynamics application on the Cray XK6 hybrid supercomputer (with nodes comprising an AMD Interlagos CPU and an NVIDIA Tesla X2090 GPU, coupled using the Cray Gemini interconnect). This versatile application is capable of simulating the hydrodynamics of complex fluids (e.g. mixtures, surfactants, liquid crystals, particle suspensions) to allow cutting-edge research into soft matter physics.

We will describe the techniques used to port to the GPU architecture in an incremental fashion, while retaining the ability to test for correctness throughout the process. We will present significant performance gains achieved on each GPU through data layout and code restructuring to reduce off-chip memory accesses. We will also discuss how this work has motivated optimizations to the traditional CPU-based version, where matrix-vector multiplication operations within key loops have been restructured to improve utilization of the SIMD vector units in modern CPU cores, significantly improving the overall performance. We will present a new halo-exchange communication phase for the code, developed to allow efficient parallel scaling to many GPUs, including the combination of CUDA stream functionality with MPI communications allowing the overlapping of separate stages within the communication phase to reduce the overall communication time.
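
As a rough illustration of the halo-exchange idea mentioned above, the following pure-Python sketch fills one-cell halos for 1D subdomains arranged in a periodic ring. It is not taken from Ludwig: the function name `exchange_halos`, the list layout and the three "ranks" are assumptions made for the example, and plain list copies stand in for MPI messages. The overlapping of communication stages with CUDA streams is not shown.

```python
def exchange_halos(subdomains):
    """Fill the one-cell halos of 1D subdomains arranged in a periodic ring.

    Each subdomain is a list laid out as [left_halo, interior..., right_halo].
    """
    n = len(subdomains)
    # Snapshot the boundary interior cells first, so that every "send"
    # uses pre-exchange data (as it would with real message passing).
    left_edges = [sub[1] for sub in subdomains]
    right_edges = [sub[-2] for sub in subdomains]
    for rank, sub in enumerate(subdomains):
        sub[0] = right_edges[(rank - 1) % n]   # receive from left neighbour
        sub[-1] = left_edges[(rank + 1) % n]   # receive from right neighbour
    return subdomains


# Three hypothetical "ranks", each with two interior cells and empty halos.
domains = [[None, 1, 2, None], [None, 3, 4, None], [None, 5, 6, None]]
exchange_halos(domains)
```

After the call, each subdomain holds a copy of its neighbours' edge values, so a stencil update on the interior cells needs no further communication until the next timestep.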

For a binary fluid benchmark, excellent scaling is observed up to 936 NVIDIA X2090 GPUs on a prototype Cray XK6 machine (the largest resource available at the time of writing). We compare on a node-by-node basis with the Cray XE6 CPU-based architecture: each GPU is compared with 2 fully utilized AMD Interlagos 16-core CPUs (using the SIMD-tuned code). The Cray XK6 performance advantage ranges from a factor of 1.5 to a factor of 1.8, depending on the number of nodes utilized.

We will also describe work to enable colloidal particles in the simulation, required for a number of systems of interest, with minimal overhead. These are implemented in such a way that we avoid a major divergence of the CPU and GPU codebases, whilst minimising data transfers between the CPU and GPU at each timestep. We keep the majority of the (relatively inexpensive) particle-related code on the CPU, while offloading to the GPU (where the fluid data resides) only those parts responsible for the interaction with the fluid.
11:00 Coffee
11:30 Shuo Li and Jim Cownie, Intel
  Extending Parallelism from Intel Xeon Processor to Intel Xeon Phi Coprocessor: A Structured, Stepwise Approach to Manycore Programming
In this presentation, we set the background by looking at the technical details of the latest announcements of the Intel Xeon Processor and Intel Xeon Phi Coprocessor at Supercomputing 12. We then compare and contrast the hardware execution resources and runtime facilities that support parallel programming on these two product lines, together with the latest release of Intel Parallel Studio XE 2013. Related new features and feature extensions in the Intel C/C++ Compilers with Intel Cilk Plus Technology, the OpenMP and Intel TBB runtime libraries, Intel MKL and the Intel MPI Library will be discussed.

With that, we present a structured, stepwise approach that extends parallel programming from the Intel Xeon Processor to the Intel Xeon Phi Coprocessor. It combines program execution characterization, synchronous SIMD vectorization, concurrent execution and runtime behavior analysis in a framework that guides the many-core developer, step by step, towards sustained, scalable concurrent execution of SIMD code.

This presentation also attempts to engage the audience in a collective effort to optimize a popular problem in derivative finance using this stepwise framework. The audience is given a sample program built with GCC, and the stepwise parallelization method is then used to introduce parallelism into the program. Each step covers a specific topic and attempts to improve performance using one method. The framework consists of the following components:

  1. Efficiency of serial and scalar optimization, with traditional program characterization at the microarchitecture level.
  2. SIMD data parallelization with explicit or compiler-based loop vectorization.
  3. Concurrent multithreading that can span from multicore to many integrated cores.
  4. Load balance, sustainability and scalability of the parallel runtime environment.

The presentation is punctuated by a series of demos of partially optimized code. A vectorized, multithreaded version of the application, built using Intel Parallel Studio XE 2013, emerges at the end and runs with high performance on Intel Xeon Processors. In the last step, we demonstrate that, using the same Intel programming tools, this example application can be compiled with no additional source code changes to run on the Intel Xeon Phi Coprocessor.
12:30 Dave Glowacki, University of Bristol
  danceroom Spectroscopy: Interactive quantum molecular dynamics accelerated on GPU architectures using OpenCL
danceroom Spectroscopy (dS) is an interactive audiovisual installation and performance tool built from algorithms commonly used to simulate and analyze quantum molecular dynamics. Using an array of up to seven simultaneous depth sensors, dS literally interprets and renders humans as “energy landscapes”. The result is an interactive system where movement is interpreted as perturbations in a virtual energy field. This interpretative leap (i.e., imagining humans as “energy fields”) allows users to perceive the emergent physics arising from their movement within a real-time simulation of an ensemble of atoms comprising an atomic liquid. Graphically, users perceive this interaction via projections of their energy field embedded in a simulation of thousands of interactive atoms that fluidly react to the real-time motion of their fields. Sonically, we have implemented a set of analysis algorithms that detect transient structures within the ensemble dynamics, package the data into appropriate structures, and send it to an electronica musician. This allows users to hear the sonic effect of their field perturbations within the atomic nano-physics. dS has recently been featured at the London 2012 Cultural Olympiad and London’s Barbican Arts Centre.

The interactivity that we have so far achieved with dS relies on a high-performance custom-built workstation. Our code utilizes OpenCL C# wrappers run on a heterogeneous platform that includes an Intel i7 hexacore hyper-threaded CPU and two NVIDIA GTX 590 GPUs. One of the GPUs is reserved exclusively for DirectX 11 graphics, and the other for real-time physics computations and mathematical analysis. Using profiling tools, we ported a number of our most expensive computations to the physics GPU, including calculation of the external and internal forces acting on the atoms, propagation of atomic positions and velocities, collision detection, and construction/updates of an ensemble-averaged velocity autocorrelation function. The CPU runs cheaper tasks that require fast access to global atomic ensemble data (e.g., several of the sonification algorithms). Our GPU accelerated code allowed us to substantially increase the performance of the system and reliably simulate up to 10,000 interactive quantum atoms, compared to fewer than 300 with CPU-only physics computations. In this presentation, I will demonstrate the dS system, and discuss the parallel design driving those portions of the code that led to maximum performance enhancements upon porting to GPUs.
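
For readers unfamiliar with the ensemble-averaged velocity autocorrelation function mentioned above, here is a minimal pure-Python sketch of the textbook definition, C(t) = <v_i(0) . v_i(t)>, averaged over atoms i with a single time origin. It is not the dS implementation (which uses OpenCL via C# wrappers): the function name `velocity_autocorrelation`, the data layout and the sample velocities are assumptions chosen for illustration.

```python
def velocity_autocorrelation(trajectory):
    """Return C(t) = <v_i(0) . v_i(t)>, averaged over all atoms i.

    trajectory[t][i] is the velocity vector (a tuple) of atom i at step t.
    """
    v0 = trajectory[0]
    n_atoms = len(v0)
    vacf = []
    for frame in trajectory:
        # Sum the dot products v_i(0) . v_i(t) over every atom.
        total = sum(
            sum(a * b for a, b in zip(v0[i], frame[i]))
            for i in range(n_atoms)
        )
        vacf.append(total / n_atoms)
    return vacf


# Two hypothetical atoms in 2D over three time steps.
frames = [
    [(1.0, 0.0), (0.0, 2.0)],
    [(0.5, 0.0), (0.0, 1.0)],
    [(-1.0, 0.0), (0.0, -2.0)],
]
vacf = velocity_autocorrelation(frames)
```

Each atom's contribution is independent of the others, which is one reason this kind of calculation maps naturally onto a GPU: one work-item per atom followed by a reduction.
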
13:00 Lunch
14:00 Zheng Wang, University of Edinburgh
  Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems
General-purpose GPU-based systems are highly attractive as they give potentially massive performance at little cost. Realizing such potential is challenging due to the complexity of programming. This paper presents a compiler-based approach to automatically generate optimized OpenCL code from data-parallel OpenMP programs for GPUs. Such an approach brings together the benefits of a clear high-level language (OpenMP) and an emerging standard (OpenCL) for heterogeneous multi-cores. A key feature of our scheme is that it leverages existing transformations, especially data transformations, to improve performance on GPU architectures and uses predictive modeling to automatically determine whether it is worthwhile running the OpenCL code on the GPU or the OpenMP code on the multi-core host. We applied our approach to the entire NAS parallel benchmark suite and evaluated it on two distinct GPU-based systems: Core i7/NVIDIA GeForce GTX 580 and Core i7/AMD Radeon 7970. We achieved average (up to) speedups of 4.51x and 4.20x (143x and 67x) respectively over a sequential baseline. This is, on average, a factor of 1.63 and 1.56 faster, respectively, than a hand-coded, GPU-specific OpenCL implementation developed by independent expert programmers.
14:30 Andy Page, BAE Systems
  The Use of GPU Frameworks
The ATC has been developing interest and expertise in GPUs since early 2006, beginning with a project called CFMS. CFMS was a Technology Strategy Board (TSB) grant-aided project involving leading players from the aerospace, motor sport and marine industries, with the aim of dramatically increasing the power of design simulation. During the project, we investigated new and emerging highly parallel computing and computer-gaming-related IT technologies to provide information on their potential impact on aerospace product design process tools, and developed the world's first Computational Fluid Dynamics solver running on GPUs, delivering a 100x computational improvement at 10% of the power consumption of a typical cluster. Since then, we have transferred knowledge across the company and have implemented many algorithms on GPU hardware for image detection, real-time tracking, online data processing, optimisation, etc.

The topic of my presentation is the use of dedicated GPU-enabled frameworks which, when integrated within an existing code, offer tremendous performance and productivity enhancements in a completely transparent way. By abstracting the execution model, these frameworks free the developer from worrying about internal execution mechanisms such as function dispatch, memory management, etc. and simply require user-provided programs to control problem-specific models. Examples of these frameworks include Jacket, a GPU-enabled engine for Matlab; OptiX, a ray tracing engine from NVIDIA; FlameGPU, an agent-based framework; and OP2, an open source framework for unstructured grid applications on GPUs or clusters of multi-core CPUs.

For the type of work that we are currently involved with, a key capability is the ability to rapidly implement tasks that can perform in near real time. With this as an objective, we have recently made intensive use of the NVIDIA OptiX ray tracing engine to compute electromagnetic and infra-red signatures, as well as the FlameGPU platform to simulate large numbers of autonomous agents in the context of crowd management. In both cases, outstanding performance was achieved. OptiX could trace ~100 million rays per second and let us demonstrate, after a 3-week period of development, physics functionality in near real time (Fig. 1). FlameGPU could simulate 160,000 pedestrians at ~20 frames per second and allowed us to develop, relatively rapidly, a real-time decision support tool for emergency commanders (Fig. 2).

In this presentation I will show how these frameworks may be used to develop efficient codes with limited effort and demonstrate the level of flexibility that can be achieved. I will also provide some personal opinion on how the same approach may be used to solve other types of problems.
15:00 Alexey Kravets, ARM
  A taste of CARP: benchmark analysis, language design and kernel verification
General-purpose computation on graphics processing units (or GPGPU) is growing in popularity, expanding from desktop and supercomputer to mobile and embedded applications. The issues of software correctness, efficiency, portability and longevity, however, are casting shadows over the landscape of GPU programming, which is fragmented between similar but incompatible technologies (NVIDIA CUDA, Khronos OpenCL, Android Renderscript).

We will give an overview of initial results from the EU-funded project CARP: Correct and Efficient Accelerator Programming.

First, we will discuss our analysis of two open-source OpenCL benchmarking suites, Rodinia and SHOC. For each benchmark, we provide a high-level algorithm description (mathematical formulation, abstract data structures, iteration domains, memory access and dependence information) and low-level OpenCL implementation details (mapping between abstract data structures and OpenCL buffers, partitioning of computation into kernel invocations, work-groups and work-items, target-specific optimisations and optimisation opportunities). To illustrate the analysis, we will walk through one relatively simple benchmark, and will also present highlights for more challenging, irregular cases such as scan, breadth-first search and sparse matrix-vector multiplication.

Second, we will provide initial insights into the design of PENCIL, a platform-neutral compute intermediate language (joint work with INRIA), and a proof-of-concept domain-specific language (DSL) for linear algebra that can be lowered into PENCIL to achieve a full DSL-to-OpenCL workflow.

Third, we will present a user’s perspective of applying GPUVerify (http://multicore.doc.ic.ac.uk/tools/GPUVerify/) to ensure race- and divergence-freedom of GPU kernels (work of Imperial College London).
15:30 Coffee
16:00 Pedro Gonnet, Durham University
  Using asynchronous task-based parallelism directly on GPUs
Task-based parallelism is a versatile form of shared-memory parallel programming, in which a computation is divided into a set of inter-dependent tasks which are then scheduled, concurrently and dynamically, on any number of CPUs or computational cores. Correct program execution is ensured, and memory conflicts are avoided, by specifying dependencies between tasks, which need to be enforced by the scheduler. Several libraries such as Cilk, QUARK, StarPU, SMP superscalar, Intel's Threading Building Blocks, and to some extent OpenMP, provide such frameworks for general-purpose multi-core computers.
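
To make the role of task dependencies concrete, the following pure-Python sketch runs a small task graph so that no task starts before the tasks it depends on have finished. It is only an illustration of the general idea, not the API of any of the libraries named above: `run_tasks`, the task names and the dependency table are all hypothetical, and tasks are executed one at a time where a real scheduler would dispatch ready tasks to idle cores.

```python
from collections import deque


def run_tasks(tasks, deps):
    """Run every task once, never before the tasks it depends on.

    tasks: dict mapping a task name to a zero-argument callable.
    deps:  dict mapping a task name to the names it must wait for.
    Returns the order in which the tasks were executed.
    """
    waiting = {name: len(deps.get(name, ())) for name in tasks}
    dependents = {name: [] for name in tasks}
    for name, needed in deps.items():
        for d in needed:
            dependents[d].append(name)

    ready = deque(name for name in tasks if waiting[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()            # a real scheduler would hand this to a free core
        order.append(name)
        for nxt in dependents[name]:
            waiting[nxt] -= 1    # one fewer unresolved dependency
            if waiting[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


log = []
tasks = {
    "load": lambda: log.append("load"),
    "forces": lambda: log.append("forces"),
    "integrate": lambda: log.append("integrate"),
    "output": lambda: log.append("output"),
}
deps = {"forces": ["load"], "integrate": ["forces"], "output": ["load"]}
order = run_tasks(tasks, deps)
```

Here "forces" and "output" both become ready as soon as "load" completes, so a parallel scheduler would be free to run them concurrently; only "integrate" has to wait for "forces".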

This approach does not, a priori, seem suitable for SIMT-parallel GPUs, which are usually treated as large vector processors. The underlying hardware, however, is organized in much the same way as regular multi-core processors and can be treated as a large multi-core, in which each core executes in strict lock-step SIMT parallelism. The use of task-based parallelism directly on the GPU would open up the latter to a large number of algorithms and applications which do not vectorize well.

In this talk, we present a CUDA-based implementation of task-based parallelism directly on the GPU for Molecular Dynamics simulations. We show that this approach scales well over the number of Multiprocessors used, with very little overhead for task management, and thus provides a problem-independent infrastructure for task-based parallelism directly on GPUs.
16:30 Carlo Bertolli, Imperial
  Performance Portability for Unstructured Mesh Applications using the OP2 Library
Unstructured mesh applications underpin many of the more sophisticated numerical methods in computational science. Examples from the Computational Fluid Dynamics (CFD) field include weather and climate modelling, blood flow in the heart, and turbomachinery components of jet engines. An unstructured mesh is especially challenging from an optimisation point of view on high-performance architectures, as the mesh connectivity does not necessarily follow a regular pattern and is not known at compile-time. It is instead represented in terms of arrays of pointers which are an input to the application. This makes all standard optimisations used for the simpler case of structured meshes (e.g. based on the polyhedral model) hard or impossible to achieve at compile-time.

In this talk I will introduce the OP2 library, which aims to simplify the process of development and optimisation of unstructured mesh applications across a wide range of architectures. OP2 is supported in C/C++, FORTRAN, and Python, and it offers an interface to: (i) declare an unstructured mesh in terms of its basic sets (e.g. vertices and edges), associate data with these sets (e.g. 3D coordinates of each vertex), and define the mesh connectivity (e.g. mapping between edges and vertices); (ii) compute over the mesh using parallel loops. A parallel loop applies a user-defined kernel to all elements of a mesh set (e.g. to all edges), and each user kernel invocation can access datasets defined over other sets through mappings (or indirections). In OP2 this indirection is expressed in a declarative manner through mesh access descriptors. Thanks to source-to-source compiler technology, programs using OP2 can be run on a variety of parallel architectures, such as clusters of CPUs and GPUs. Currently, OP2 is able to generate implementations of its constructs in CUDA, OpenMP, OpenCL, MPI and mixed code such as MPI+CUDA and MPI+OpenMP.
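
The sets, mappings and parallel loops described above can be illustrated with a small pure-Python sketch. This is not OP2's actual API: `par_loop`, the `(data, mapping)` argument convention and the kernel `accumulate_edge_weight` are hypothetical names invented for the example, and the loop runs sequentially where a generated back end would execute iterations in parallel and resolve conflicting updates.

```python
def par_loop(kernel, n_elements, *args):
    """Apply kernel to every element of a set.

    Each argument is (data, mapping): mapping is None for data defined on
    the iteration set itself, or a list giving, for each element, the
    indices into `data` that the kernel may access (the indirection).
    """
    for e in range(n_elements):
        views = []
        for data, mapping in args:
            if mapping is None:
                views.append((data, [e]))         # direct access
            else:
                views.append((data, mapping[e]))  # access through the map
        kernel(*views)


def accumulate_edge_weight(edge_arg, vertex_arg):
    """User kernel: add each edge's weight to both of its end vertices."""
    weights, (e,) = edge_arg
    totals, (v0, v1) = vertex_arg
    totals[v0] += weights[e]
    totals[v1] += weights[e]


# A hypothetical triangle mesh: 3 vertices, 3 edges.
edge_to_vertex = [(0, 1), (1, 2), (2, 0)]   # mesh connectivity
edge_weight = [1.0, 2.0, 4.0]               # data on the edge set
vertex_total = [0.0, 0.0, 0.0]              # data on the vertex set
par_loop(accumulate_edge_weight, len(edge_to_vertex),
         (edge_weight, None), (vertex_total, edge_to_vertex))
```

Because two edges can update the same vertex, a parallel implementation of such a loop has to avoid write conflicts, for example by colouring the edges or by using atomic updates; the sequential loop in this sketch sidesteps that issue.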

In the talk, I will focus on the implementation of OP2 on two architectures that form the key building blocks of today’s HPC systems, namely CPUs with vector support and GPUs. I will show how important optimisations are developed targeting these architectures achieving near-optimal performance and portability: (i) GPU-related optimisations, including improved data locality through shared memory, coalescing of global memory accesses, and complex loop fission; (ii) CPU-related optimisations, including a preliminary study on vectorization, and a short description of how to improve data locality across successive parallel loops.

The main driving force of OP2 is a large turbomachinery CFD simulation application used at Rolls-Royce, called HYDRA. I will use HYDRA as an example of how optimisations acquire a central role in achieving performance portability for large real-world industrial cases. I will also compare the performance achievable for simpler benchmarks with that of HYDRA, that is, a study of how performance scales from benchmarks to actual industrial code.
17:00 Simon McIntosh-Smith
  Closing remarks
17:15 Depart


UKMAC 2012 chair: Simon McIntosh-Smith, Microelectronics Research Group, University of Bristol

UKMAC Steering Committee: Prof Mike Giles (Oxford), Prof Paul Kelly (Imperial), Dr Graham Pullan (Cambridge), Simon McIntosh-Smith (Bristol)

Local organisers: Simon McIntosh-Smith, Lesley Jones (Bristol)

Conference Sponsors

Intel Imagination NVIDIA Ind. Maths KTN ASEArch CPP

Organised in conjunction with the OpenCL user group, Comportability.