Taxonomy of programming languages 2010:
o Comms libraries (MPI)
o Shared memory (OpenMP)
o GPU programming models (Cuda, OpenCL)
o Hybrid models (OpenMP+MPI)
o Traditional PGAS (UPC, Co-Array Fortran)
o HPCS (Chapel, X10, Fortress)
What is MPI really good for? Not ideal at anything really. Not likely to last in the long term.
MPI 3.0 underway right now: improved resilience, hybrid computing, asynch collectives, improved scalability.
Multiresolution languages: some are quite low-level and let you get really close to the machine (e.g. MPI, pthreads). Others are higher-level – HPF, ZPL, but these don’t let you bypass the higher level abstractions and let you get close to the machine when necessary.
Chapel is designed to have a spectrum of features from the low level to the high level.
PGAS isn’t shared memory, it’s shared namespaces, an important distinction! Has a strong sense of ownership on variables – local and remote variables. E.g. UPC, Co-Array Fortran, Titanium, several more. Traditional PGAS is almost entirely SPMD today (Single Programme Multiple Data).
Chapel is starting to target GPUs now – just have to express parallelism and locality, the compiler/run-time takes care of everything else. Goes beyond SPMD parallelism, an advantage over traditional PGAS languages.
Two main thrusts today: evolutionary based on MPI 3.0 and OpenMP 4.0, and revolutionary, based on Chapel/X10 etc.
Graphs are sets of vertices and edges. Edges are “meta data”. Applications in homeland security, biology etc.
Crunching graphs doesn’t work well on traditional commodity architectures because it doesn’t have much spatial or temporal data locality – better suited to massively multi-threaded architectures with almost no memory hierarcy (close to randomly accessing memory).
Cray’s “Threadstorm” XMT design is more like this – no processor caches, instead use chip real estate for thread contexts. Becomes massively threaded and latency tolerant, suited for graph algorithms.
This is a NERSC research project. They have a 4-way Opteron-based cluster.
Want to be able to optimise for multiple different architectures – labour intensive, so need this to be automatic.
This is a “severn dwarfs” approach, optimising multiple kernels across entire family of architectures.
Being driven by LBMHD – Lattice Boltmann Magnetohydrodynamics (CFD + Maxwell’s equations).
Essentially this is a stencil operation. Used for plasma turbulence simulation.
One thing autotuning can do is maximise cache efficiency by automatically padding as appropriate.
Can also autovectorise to reuse data in the stencil operations, reduce TLB misses etc.
Have extended their auto-tuner to now target OpenMP+MPI – distributed memory for the first time.
Search space for best configuration of optimisations is too large to search exhaustively. They use a 3 stage greedy algorithm instead.
Cuboid data partitioning isn’t always best, because while this maximises surface to volume ratio, it may not maximise cache performance.
Takes 10-12 hours to do the tuning, the calculation then runs for days, so the trade-off is worth it. Not necessarily always the case though.
Getting close to maximum theoretical performance per node, have still to look at this in the distributed memory case though.
This work focuses on single node tuning.
Fast multipole is O(n) on n particles, compared to O(n^2) for direct methods and O(nlogn) for Barnes-Hut.
Two steps: build tree then evaluate potential on n targets.
Recursively divide tree until each space has no more than q points.
Tuning included optimising for FFTW plans, SIMDization, Newton-Raphson approximation, structure of arrays and matrix free computation (what’s this?)
Had an interesting result that an Intel Nehalem was as fast as a GPU, but really didn’t understand why this was the case.
90% of codes at NASA still written in MPI. Used for performance and portability, but far from ideal.
Cart3D uses MPI and OpenMP (either/or, not hybrid). Achieved close to ideal scaling up to 512 cores.
Often have problems with machine stability on large jobs bringing MPI down.
OpenMP 3.0 being extended to address locality (currently assumes it’s a flat memory space).
Darren was part of Stephen Jarvis’s performance modelling team at Warwick.
Did a lot of the initial performance modelling for Roadrunner (Cell-based) before it was procured.
Also modelling Blue Waters ahead of its installation at UIUC.
Exascale programme began three years ago in the US. DOE funded. Sequence of workshops looking at building a science case for EXAFLOPS.
The current #1 system, Jaguar at ORNL is 2 PFLOP peak, 6MW just for the machine, not including cooling! 225,000 cores, MTBF a few days.
Want an EXAFLOP machine capped at 20MW, $200M. System concurrency is 10,000 times more than the largest systems today!
The memory capacity and memory bandwidth bottlenecks will tighten by an order of magnitude.
The average core count in the top 20 systems of the Top 500 is 90,000 already.
Need to break the bulk synchronous processing traditional paradigms.
Don’t forget that bandwidth is improving much more slowly than compute.
Benefits of these algorithms are for more skinny matrices in things like QR. Squarer matrices negate a lot of the benefits.
Have done some interesting work proving lower bounds for FLOPS, bandwidth and messages.
Computational finance accounts for 10% of the systems in the Top500 (!) These are primarily used for throughput computing – vast numbers of small, independent tasks.
Financial institutions want to increase their simulations by an order of magnitude over the next year to do better risk management.
Computational financing covers:
o Options pricing – investment banks – Monte Carlo (60%), PDE (30%), other semi analytic (10%)
o High prefquency algorithmic trading (not computationally intensive)
o Actuarial science – pension funds, insurance
A ‘call option’ give you the right (not an obligation) to buy something at a fixed price.
Stochastic differential equations (SDEs) used to model the behaviour of stock prices etc.
Geometric Brownian motion (Black-Scholes) used to model stock prices.
Might do 100,000 path simulations to value a single option.
Monte Carlo is trivially parallel – each path needs an independent set of random numbers. This is expensive because banks have to analyse their position over night, every night. Also have to constantly assess the sensitivity of the value to changes in various parameters, such as changes in interest rates. The aim of the banks is to hold offsetting risk so they don’t gamble.
PDEs are an alternative approach. Solve a Black-Scholes PDE (geometric Brownian Motion) backwards in time from the payoff value. These are simple, scalar PDEs.
PDEs suffer the “curse of dimensionality”. Cost of a Monte Carlo is linear in dimension, but this same cost is exponential in PDEs. Don’t see many PDEs larges than 3 dimensions because of this.
Bloomberg has a large GPU cluster – 48 Teslas each with 4 GPUs. For them this was instead of buying 2000 CPUs. BNP Paribas has a small cluster. Many more doing proof of concept trials.
If you have a background in MPI+OpenMP learning Cuda isn’t too difficult.
A key issue in parallel RNGs is having a skip-ahead capability. Can jump N places in O(logN) operations.
On-chip caches in future GPUs will make PDEs much easier – automates the data reuse of halo cells etc.
Looking forward to greater C++ support in next-gen GPUs since most finance codes are written in this language.
Lightwave research lab, Columbia U.
On-chip networks can use 30-50% of chip power budget (not sure I believe this, it was a tiny fraction on ClearSpeed’s processors).
Off-chip comms is certainly a big power hog.
It’s typical for a processor to have 10X more intra-chip bandwidth than inter-chip.
Photonics may change the rules for bandwidth per watt. Once bits have been modulated they can be transferred as far as you want, either on-chip or off. Off-chip BW can be the same as on-chip BW (not sure I believe this either!).
Having 3D memory layers may benefit from also have layers of photonic network on chip.
Their group is looking at BW for exascale systems, designing a chip with 10 TBytes/s on-chip and off-chip BW.
Photonics can be integrated in CMOS processes, including things like optical waveguides (nano-wires).
First complete optical link in Aug 2009 (with collaborators at Cornell).
Can easily build things like a torus on-chip.
Have a simulator called Phoenix-sim about to be publically released.
They can do 38 10Gbps links at 3.2W total. (Very low power!) (simulation)
All current exascale system designs are exascale in FLOPS only, leaving memory size and bandwidth behind.
First Conjugate Gradient (CG) solutions on GPUs appeared in 2003 as simple shaders – took heroic efforts!
Have 10X speedup on a GPU vs optimised code on 4 core host (ported ICCG).
MiniFE assembles a sparse linear-system, solving it using a conjugate gradient algorithm.
Sandia has a huge investment in MPI-parallel, unstructured-mesh, finite-element applications.
MineFE is available as open source, download from the Mantevo website – http://software.sandia.gov/mantevo
The parallel pattern library is written in C++ for data and task parallelism (async).
Have partnered with Intel to use their Threaded Building Blocks (TBB).
Looks quite interesting, but of course it’s OS specific!
This is a class of MC methods particularly well suited to parallelisation on shared memory systems.
Include SMC samplers, particle filters and population-based Markov Chain Monte Carlo (MCMC).
Open source – code.google.com/p/opencurrent (Apache 2.0 license)
A general library for solving PDEs over structured grids. Initially targeting fluid simulations but it’s quite general.
Has been balancing ease-of-programming vs. performance.
Want to be able to work from a high-level algorithmic description.
Worked hard to avoid all serial bottlenecks.
Have implemented a synchronous co-array model to support multiple GPUs.
Now as full incompressible Navier-Stokes working on two GPUs. Should scale by about 90% on the two GPUs.
Going to add kernel generation via C++ template meta-programming. Can make programming much easier and doesn’t have to cost much performance.
PDEs are important in a whole variety of applications.
Want a suitable level of abstraction to separate the user’s specification of the app from the details of the parallel implementation. Aim to achieve code longevity and near-optimal performance through re-targeting the back-end to different hardware.
This work is in collaboration with Paul Kelly at Imperial.
Mike produced OPlus unstructured solver for Rolls-Royce 10 years ago. MPI-based fo HYDRA CFD codes on clusters of up to 200 nodes.
OP2 is an open source project. It’s an “active library” using code transformation to generate Cuda, OpenCL or OpenMP/AVX code for GPUs and CPUs.
Has abstractions for sets (nodes, edges, faces), datasets (flow variables), pointers (e.g. from edges to nodes), and parallel loops. These operate over all members of one set. At most one level of indirection. User specifies how data is used ( read only, write only or increment).
No dynamic grid adaptation.
Potential hazard when incrementing indirectly-referenced unstructured grids. Can solve this by giving edges owners, and edges with two owners get incremented twice. Alternatively can colour edges so that no two edges of the same colour update the same node. This avoids data conflicts by construction. A third option is to use atomic updates. Current hardware support for this is slow though.
Has a prototype running today.
Have an “energy efficient computing” webpage on LBNL – worth looking up!
Check out the energy “spaghetti” chart. 55% of all energy is just lost, half of this lost in electricity transmission.
Two interrelated issues: “bits” and “buildings”.
Khazzoom-Brookes postulate – “work will fill the available time”. I.e. saving energy at the micro level leads to higher energy consumption at the macro level.
Bit IT responsible for $16B/year energy cost, 200TWhours/year. Paper from Jonathan Koomey from Stanford in 2006 estimating total power consumption by servers in the US and the world.
“SMART 2020: enabling the low carbon economy in the information age: - The Climate Group. – global ICT carbon footprint by subsector, geography etc.
It can cost $5-10M to go from a 2 to 3 MW power supply for a supercomputer.
Microsoft has a million server datacentre in Chicago, 400 containers, 300MW power supply!
Datacentre Infrastructure Efficiency (DCiE). Similar to PUEs. Average is about 1.9, state of the art is about 1.2. But don’t forget this is just a ratio, not am absolute number!
LBNL is aiming for a PUE of 1.05 for their new data centre.
Dale Sartor is pushing data centre energy efficiency from UC Berkeley.
One of the biggest energy issues in the electronics is in off-chip energy movement – bandwidth costs power.
Mentioned Green Flash as an example as an energy efficient, application-specific supercomputer for climate modelling.
Green Flash is aiming for 100x power efficiency over mainstream HPC approaches.
Power consumption is going to change the economics of computing – might drive more people to used clouds.
John is one of the main people behind the “Green Flash” climate modelling supercomputer design at LBNL.
Concerned with the possibility of 3 disruptive transitions over the next 10 years – would rather just have one!
Sustained to peak flop ratio is the wrong metric if flops are cheap.
The programming language should reflect the new reality of the actual costs of the underlying system
Locality of data is really going to matter (and matter more and more in the future). The energy cost of accessing off-chip memory is 100X higher than on-chip. Caches get in the way of this because they hide the true cost of accessing that data.
Must have data-centric parallel expressions.
Need hierarchical data layout statements. (And must be multi-dimensional).
Fine-grained power management makes homogeneous systems look heterogeneous. Fault resilience can do the same.
Really need to go to asynchronous programming models to scale in the future.
Used Trilinos C++ libraries to parallelise their codes. Also GMRESR.
“Motif” is the new name for “Dwarf”!
Working on “a pattern language for parallel motifs” at the Berkeley ParLab.
Collecting at least one classical example of each dwarf. Defined in the abstract, but comes with input test data and verification of the expected results so you can port a dwarf and test the implementation.
Minimum vector length of 32 per SM (8 cores per SM, each of which needs a minimum of 4 threads each). Nvidia calls these 32 threads a “warp”. Warps can be active or inactive (waiting for data).
Mike’s rule of thumb is that you want at least 10,000 threads to run to get anywhere near peak performance on the GPUs. That’s about 40 threads per core.
Memory access from device memory has a delay of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations and can be managed by the compiler.
Code is written from the point of view from an individual thread.
Warp divergence is important to avoid if at all possible – threads have to execute all branches of a conditional statement. There’s a special case called ‘warp voting’ that’s fairly heavyweight which can coalesce branches at run-time – the compiler inserts additional code to check for this where appropriate.
For codes with boundary conditions, treat them differently depending on the cost of computing those conditions. If they’re cheap, just branch as necessary (small overhead). If expensive, have two kernels, one of interior points and a second for boundary points.