Department of
Computer Science

ISC'09 day 1

International SuperComputing 2009 (ISC '09)

ISC'09 is the second largest supercomputing conference on the calendar, and the largest outside the US.

The following notes are pretty much my live transcription as the conference unfolded, so please forgive any typos, unexplained acronyms etc. I hope you find them useful and/or interesting; please don't hesitate to get in touch if you have any questions!

Conference website.

See also day 2 and day 3 of the conference.

ISC, now in its 24th year, is a conference that's been growing quite fast. Attendance is up from 647 in 2005 in Heidelberg to 1375 last year in Dresden to 1527 this year in Hamburg. The conference also had 120 exhibitors, both vendors and High Performance Computing (HPC) users, so it's now quite a large conference.

Talks range from a keynote by Andy Bechtolsheim, co-founder of Sun, to a session from Thomas Sterling, one of HPC's stalwarts and an industry commentator.

Highlights from the latest Top500 list of the fastest supercomputers (slides)

Dr Eric Strohmaier (LBNL)

  • The top 2 systems stayed the same; both are in the US and are the only systems capable of over 1 PetaFLOPS (one thousand trillion floating point operations per second!)
    • These systems have over 100,000 cores each
    • The fastest system is RoadRunner, a heterogeneous many-core system using IBM's Cell processor at LANL
  • Europe now has the third fastest system, 825 TFLOPS at Juelich in Germany with a system of nearly 300,000 cores (the most cores on the list)
  • Eight of the top ten systems are based in the US and are from IBM, Cray and Sun
  • The average power consumption of a top 10 system is 2.45 MegaWatts (!)
  • The number 10 system is 275 TFLOPS, again at Juelich in Germany (first time Europe has two in the top 10)
  • Saudi Arabia now has a system at #14, the highest ever showing from a middle eastern system
  • The fastest system in Asia is now in China, with a 180 TFLOP system in Shanghai
  • Only one vector processor-based system in the Top500 these days (the new NEC-based upgrade of the Earth Simulator)
  • Performance trends are remaining steady and on Moore's Law, even amid the current economic downturn
  • IBM and HP share most of the market between them
  • Industry's share of the Top500 keeps growing, though few of these systems are large (most are outside the top 50)
  • UK has been growing particularly strongly in terms of the number of systems in the top 500
  • For the first time China has more systems in the top 500 than Japan
  • Almost all systems in the top 500 are clusters or Massively Parallel Processing (MPP) systems such as BlueGene
  • 53% of all 500 systems use Intel Xeon 54xx Harpertown quad core processors
  • Over 75% are various Intel quad core processors, nearly 50% if measured by performance
  • 77% of systems already have four cores per socket, 21% are still dual core
  • Eight core processors from Intel expected to show up in the next Top500 list
  • Interconnects are mostly Gigabit Ethernet (GigE) (~280 systems) and InfiniBand (IB) (~140 systems)
    • Although there is very little GigE in the top 50, where it's mostly IB
  • LINPACK efficiency ranges from 93% to below 30% (the special purpose GRAPE system)
    • Lots of systems around 54% using GigE
    • IB systems tend to achieve around 80% efficiency
  • Power consumption:
    • Jaguar uses 7MW!
    • Only a few systems above 1MW
    • Most systems still using several KiloWatts though
    • Most power efficient systems are based on IBM's Cell (RoadRunner)
    • Special purpose systems are next most power efficient (GRAPE-DR)
    • PowerPC 450's are next most power efficient
    • Quad core systems then come in at around a couple of hundred MFLOPS per Watt
  • Many of the most power efficient systems are using IB as an interconnect
  • The slowest systems in the Top500 are now over 17 TFLOPS!

  • For more trends see the press release and performance development graphs.
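
The efficiency and power figures above are all simple ratios. A minimal sketch, assuming the usual Top500 definitions (LINPACK efficiency is sustained Rmax over theoretical Rpeak, power efficiency is sustained FLOPS per Watt). The function names are mine, and the 80-out-of-100 TFLOPS system is a made-up illustration, not an entry on the list:

```python
TERA = 1e12
PETA = 1e15

def linpack_efficiency(rmax_flops, rpeak_flops):
    """Fraction of theoretical peak actually sustained on LINPACK."""
    return rmax_flops / rpeak_flops

def mflops_per_watt(rmax_flops, power_watts):
    """Sustained performance per Watt, in MFLOPS/W."""
    return rmax_flops / power_watts / 1e6

def flops_per_core(rmax_flops, cores):
    """Sustained performance per core, in FLOPS."""
    return rmax_flops / cores

# From the notes: the Juelich system sustains 825 TFLOPS on nearly 300,000 cores.
juelich_per_core = flops_per_core(825 * TERA, 300_000)   # 2.75e9, i.e. ~2.75 GFLOPS per core

# Hypothetical system (not from the list): 80 TFLOPS sustained out of 100 TFLOPS peak.
example_efficiency = linpack_efficiency(80 * TERA, 100 * TERA)

# A system sustaining exactly 1 PFLOPS while drawing 7 MW (the power quoted for Jaguar).
petaflop_at_7mw = mflops_per_watt(1 * PETA, 7e6)          # roughly 143 MFLOPS/W
```

The last figure lands in the same "couple of hundred MFLOPS per Watt" region quoted for quad core systems.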

Keynote speech, "The path from Petaflops to Exaflops" (slides)

Andy Bechtolsheim, co-founder of Sun

  • HPC accounts for 30% of all server sales today and this fraction is growing
    • Most of the rest goes into the web industry
    • Classic "enterprise" computing has been shrinking
  • HPC expected to be worth $20bn (compute, storage and services) by 2012
  • 10 GigE and IB now shipping in volume
  • 100 PetaFLOP system expected in 2016, 1 ExaFLOP by 2020 (one million trillion floating point ops/s)
  • Need performance to double every year to 2020 to hit 1 ExaFLOP
  • Moore's Law staying on track to deliver twice as many transistors per device every two years
  • 8nm process by 2020, having a 10 TFLOP socket (processor) in that year
    • Confirmed by Intel in a later talk
  • Expect ~160 cores per CPU at ~4GHz and 16 FLOPS per cycle per core by 2020
    • Also 2 TeraBytes/s bandwidth per socket by 2020
    • 500W per socket? Hmm...
  • That's aiming at 20 GFLOPS per watt by 2020
  • Would still need 50MW for the ExaFLOP system (!!!)
  • Expecting to use multi-chip 3D packaging
    • Already being used in consumer electronics, for example cell phone chips
  • Could also integrate fabric I/O, i.e. integrate router with the CPU
  • Expecting a combination of mesh and tree interconnect topology
  • Expecting 50 Gbps per lane in 2016, so 100 Gbps by 2020?
  • Believes multi-chip modules (MCM) are the single biggest saving for power use
    • Could save 50% compared to server processors today
  • Microchannel fluidic heat sinks may be required (water cooling right on the chip)
  • Predicts all HPC systems will use water cooling in the future (it's the most power efficient way of cooling)
  • Expect to need 100 TB/s storage BandWidth (BW) by 2020
    • Will need solid state disks (SSDs)
      • Much lower power
      • Better for random access
      • More reliable - no moving parts
  • Average selling prices (ASPs) of flash memory falling by nearly 50% per year per GigaByte
    • Expecting flash to replace disk drives
    • Performance improving rapidly
    • Flash will be fast enough to do random IO
      • I.e. it will just become part of the memory hierarchy, just a larger, slower RAM
  • Expecting to need 16 million cores in the ExaFLOP system
    • In 100,000 sockets
  • Predicts smallest machine in Top500 in 2020 will be 10 PetaFLOP!
  • Clock rates are likely to stay relatively low at around 4GHz
    • This helps in terms of interconnect latencies
  • Mentioned GPU processors as one of the ways forwards - "The jury is out"
    • The economic advantage of a mainstream market is essential, hence x86 and GPU are two frontrunners
  • Said HPC is growing as a market so remains attractive to Sun; exascale data is important to Oracle too
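
The 2020 numbers in this keynote hang together, which is easy to check. A quick sketch using only the figures quoted above; the variable and function names are mine, and the power estimate counts sockets only (memory, interconnect, storage and cooling are ignored):

```python
import math

# Figures quoted in the keynote for a 2020 socket and system
cores_per_socket = 160
clock_hz = 4e9                # ~4 GHz
flops_per_cycle = 16          # per core
sockets = 100_000
watts_per_socket = 500

socket_flops = cores_per_socket * clock_hz * flops_per_cycle   # 1.024e13, the "10 TFLOP socket"
system_flops = socket_flops * sockets                          # 1.024e18, about 1 ExaFLOPS
total_cores = cores_per_socket * sockets                       # 16,000,000 cores
system_power_mw = sockets * watts_per_socket / 1e6             # 50.0 MW
gflops_per_watt = socket_flops / watts_per_socket / 1e9        # about 20 GFLOPS per Watt

def doublings_needed(start_flops, target_flops):
    """Number of performance doublings to get from start to target."""
    return math.log2(target_flops / start_flops)

# From a 1 PetaFLOPS system to 1 ExaFLOPS is a factor of 1000, just under ten doublings.
peta_to_exa = doublings_needed(1e15, 1e18)
```

Ten doublings in roughly ten years is where "double every year to 2020" comes from, and it is twice the pace of the transistor doubling quoted for Moore's Law.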

Implementation of a lattice-Boltzmann method for numerical fluid mechanics using the Nvidia CUDA technology (slides)

E. Riegel, T. Indenger, N.A. Adams, TU München, Institute of Aerodynamics

  • Computational fluid mechanics for incompressible flows (slower than Mach 0.3 at sea level, for example)
    • Spatial discretization by partitioning space into cells
  • Propagation and collision steps
  • Lattice-Boltzmann Method (LBM) is a very parallel algorithm so a good target for GPUs
    • Each cell can be computed independently
      • Can also use multiple GPUs for even greater performance
  • SunlightLB - an open source LBM code (D3Q15) for traditional CPUs written in C
    • They tried just porting this first
    • http://sunlightlb.sourceforge.net/
    • Expected a speed-up of 15X (64x64x64 voxels)
    • Actually only got a speed-up of 1.5X, ten times lower than expected
      • CPU 9.0 MVPS (million voxels per second), GPU 13.4 MVPS
    • GPU memory access patterns were the problem
  • So wrote a new LBM code from scratch targeting GPUs
    • LBultra
      • Written in C++
      • D3Q15, fixed refinement; CUDA and CPU multi-core kernels supported
      • Optimised memory access patterns
      • Reduced data transfer to the GPU by fusing propagation and collision phases
      • Used GPU's "shared" memory for explicit data caching
      • Ported the CUDA code back to a multi-core host version - interesting!
      • Achieved 9.3X speed-up of 1 GPU vs. 1 CPU
        • CPU 8.4 MVPS, GPU 78 MVPS, using three GPUs achieved 191 MVPS
      • Still not quite optimal memory access patterns or distribution code
      • Validated the port with a simulation of a common test case which is a flow around a sphere
      • Has added ability to use an adaptive mesh, though this isn't quite finished yet (warning!!!):
        • Node resolution is location dependent
        • CPU ~ 10 MVPS, (40 GFLOPS) GPU ~400 MVPS (>900 GFLOPS, near peak performance)
        • HASN'T used SSE (SIMD) optimisations on the host so the host could go up to 4 or 8X faster too (i.e. host could get 160-320 MVPS vs. 400 MVPS for the GPU)
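
To make the "propagation and collision steps" concrete, here is a minimal sketch of one lattice-Boltzmann time step. To keep it short I've assumed a 2D D2Q9 lattice with a BGK collision and periodic boundaries, written with NumPy on the CPU; this is not the D3Q15 CUDA code from the talk, and all function names are mine:

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their weights
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4 / 9] + [1 / 9] * 4 + [1 / 36] * 4)

def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution, shape (9, nx, ny)."""
    cu = C[:, 0, None, None] * ux + C[:, 1, None, None] * uy
    usq = ux ** 2 + uy ** 2
    return W[:, None, None] * rho * (1 + 3 * cu + 4.5 * cu ** 2 - 1.5 * usq)

def macroscopic(f):
    """Density and velocity of every cell from its 9 distributions."""
    rho = f.sum(axis=0)
    ux = (f * C[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * C[:, 1, None, None]).sum(axis=0) / rho
    return rho, ux, uy

def collide(f, tau):
    """BGK collision: relax towards equilibrium. Purely local to each cell."""
    rho, ux, uy = macroscopic(f)
    return f - (f - equilibrium(rho, ux, uy)) / tau

def propagate(f):
    """Move each distribution one cell along its velocity (periodic wrap)."""
    out = np.empty_like(f)
    for i, (cx, cy) in enumerate(C):
        out[i] = np.roll(f[i], shift=(int(cx), int(cy)), axis=(0, 1))
    return out

def step(f, tau=0.8):
    """One time step: collision followed by propagation."""
    return propagate(collide(f, tau))

# Small demo: fluid at rest with a slight density bump in the middle
nx, ny = 32, 32
rho0 = np.ones((nx, ny))
rho0[nx // 2, ny // 2] = 1.1
f = equilibrium(rho0, np.zeros((nx, ny)), np.zeros((nx, ny)))
mass_before = f.sum()
for _ in range(50):
    f = step(f)
mass_after = f.sum()   # total mass should be conserved up to rounding
```

The collision touches only a cell's own values, which is why every cell can be computed independently. Here the two phases are separate passes over the array; as I understood it, fusing them into one pass is what cut the memory traffic in LBultra.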

A novel multiple walk parallel algorithm for the Barnes-Hut treecode on GPUs - towards cost-effective, high-performance N-body simulation (slides)

T. Hamada (Nagasaki), K. Nitadori (RIKEN), Japan

  • A strong believer in using GPUs for HPC
  • Want to use N-body for general purpose computing, not just astrophysics
    • E.g. fluid dynamics (Smoothed Particle Hydrodynamics, vortex method etc.)
    • Acoustics, electromagnetics (Boundary Element Method)
  • Research was using Nvidia GT200-based GPUs
  • Want to simulate large-scale cosmological systems
    • Billions of particles
  • Have 256 GPUs connected by a cheap GigE network
  • Running since May 2008
  • 1.5 billion particles using Barnes-Hut on 256 GPUs computed in 17 seconds per time step
  • Of course N-body methods are classically considered to be one of the main classes of algorithms
    • Such as the "View from Berkeley" including it as one of its seven algorithmic exemplars or "dwarfs"
  • Achieving about 450 GFLOPS on the latest GPUs
  • Processed one particle per thread on the GPU (up to 2,048 threads and thus particles per GPU)
    • This was a fairly simple, naive approach though
  • Really needed to take advantage of "cut-off" distances to reduce amount of computation required (down from O(n^2) to O(n))
    • "Multiple Walks" new approach much better (but didn't explain how this works very well)
  • In April this year achieved 50 TFLOPS (single precision) on a large system (500 million particles) using 256 GPUs
    • About 124 MFLOPS per $ including host computer - this is very good indeed!
    • 2 GPUs per host computer
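
For reference, this is the O(n^2) baseline that treecodes and cut-off schemes try to avoid: every particle interacts with every other particle. A minimal NumPy sketch under my own assumptions (gravitational units with G = 1, a small softening length to avoid singularities); it is not the authors' code and contains no Barnes-Hut tree or "multiple walks":

```python
import numpy as np

def direct_accelerations(pos, mass, softening=1e-3):
    """All-pairs gravitational accelerations (G = 1); O(n^2) work and memory."""
    # diff[i, j] is the vector from particle i to particle j
    diff = pos[None, :, :] - pos[:, None, :]
    dist2 = (diff ** 2).sum(axis=-1) + softening ** 2
    inv_d3 = dist2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)          # no self-interaction
    # acc[i] = sum over j of m_j * diff[i, j] / |r_ij|^3
    return (diff * (mass[None, :, None] * inv_d3[:, :, None])).sum(axis=1)

def pair_count(n):
    """Number of distinct particle pairs the direct method has to visit."""
    return n * (n - 1) // 2

# Demo with 64 random particles
rng = np.random.default_rng(0)
pos = rng.standard_normal((64, 3))
mass = rng.uniform(0.5, 1.5, 64)
acc = direct_accelerations(pos, mass)
net_force = (mass[:, None] * acc).sum(axis=0)   # should vanish: forces cancel pairwise
```

On a GPU each row of the pair matrix maps naturally to one thread, the "one particle per thread" layout mentioned above. With billions of particles the pair count makes the direct method hopeless, hence the tree.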

Faster FAST: multicore acceleration of streaming financial data (slides)

Davide Pasetto et al, IBM

  • Message processing rates have exploded since 2006 when stock exchanges became more automated (exponential growth)
  • Around 1 million messages per second in 2008 requiring analytics to be calculated
  • Need the answer in milli- or even microseconds (otherwise might miss the deal)
    • Worth millions of dollars
    • But it's an arms race!
  • Financial institutions already adopting HPC technologies:
    • InfiniBand and 10 GigE
    • Low-latency OS-bypass communication protocols
    • Hardware accelerators (GPUs, FPGAs etc.)
    • Latest generation of multi-core processors
  • The financial institutions are struggling to test and validate massively parallel software
  • To keep latency low the financial datacentres tend to live in Wall St or Canary Wharf, with corresponding space and power supply and cooling limitations
  • The "ticker" plant:
    • Incoming live market, exchange and consolidated feeds
    • Decode and normalize this data
    • Analytics and data caching
    • Then distribute the results of this pre-processing to their users - traders, customers etc.
    • Want the microsecond latency from data arrival to results getting back to the user
  • This paper focused on Options Price Reporting Authority (OPRA) feeds on the US stock exchange
  • Could they do what was required using just off the shelf CPUs?
  • Messages peaking over 1 million per second
    • Distributed in a compressed format
    • Fastest growing data feed, growing exponentially
    • Existing solutions used FPGAs or multi-cores
  • Uses multi-cast technology
  • Also uses bit level encoding - Most Significant Bit denotes if this byte is the end of a field
  • Fields are either unsigned integers or character strings
  • There is a reference decoder for OPRA
  • Built their own implementation of OPRA bottom up, optimizing the most important kernels
  • Did use assembly-level optimisation including SSE, intrinsics etc.
  • Got 3-4X speed-up vs. the reference decoder (I'm surprised it's not more actually)
  • Intel quad core achieved the highest performance
  • This was actually better than using FPGAs (though didn't show FPGA performance as a comparison - naughty)
  • Answered a question from me that FPGAs get around 2.5 million messages per second while a single core of an Intel Nehalem CPU should reach 4 million messages per second
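
The bit-level encoding mentioned above is the "stop bit" scheme used by FAST: each byte carries 7 bits of data, and a set Most Significant Bit marks the last byte of a field. A minimal sketch of decoding the two field types, assuming the most significant 7-bit group comes first; presence maps, templates and field operators are left out, and the function names are mine rather than anything from the reference decoder:

```python
def decode_uint(buf, pos=0):
    """Decode one stop-bit encoded unsigned integer starting at buf[pos].

    Returns (value, next_pos). Raises IndexError if the buffer ends
    before a byte with the stop bit set is seen.
    """
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)   # append the 7 data bits
        if byte & 0x80:                        # stop bit: end of field
            return value, pos

def decode_ascii(buf, pos=0):
    """Decode one stop-bit terminated 7-bit character string."""
    chars = []
    while True:
        byte = buf[pos]
        pos += 1
        chars.append(chr(byte & 0x7F))
        if byte & 0x80:
            return "".join(chars), pos

def encode_uint(value):
    """Inverse of decode_uint, handy for building test messages."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()
    groups[-1] |= 0x80                         # set stop bit on the last byte
    return bytes(groups)

# Two fields back to back: the integer 130, then the string "IBM"
message = encode_uint(130) + b"IB" + bytes([ord("M") | 0x80])
number, offset = decode_uint(message)
symbol, offset = decode_ascii(message, offset)
```

The byte-at-a-time loop with a data-dependent exit is presumably why the decoder is awkward to vectorise, and why hand optimisation of these kernels was needed.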

High Performance Computing for the simulation of large scale aircraft structures

Martin Kussner, Abaqus/3DS MD in Germany

  • Multi-scale modelling is the big thing in aero today
  • Movement towards more non-linear Finite Element (FE) analysis, partly driven by composite materials
  • 1M Degrees Of Freedom (DOF) used to be a big problem but not any more - 50M is a big problem today
  • A PRACE paper reported later has achieved 500M DOF...
  • Driven by need for:
    • Shorter design cycles
    • Improved performance and economy
  • Said a large system today is:
    • 10-20M DOF
    • 3-7M elements
    • 5-10,000 discrete fasteners
    • 2,000 composite layers
  • Described an implicit, direct solver using distributed memory as the worst case for parallelisation
  • Aiming for clusters of >1000 cores for Abaqus software in the future
  • Most users have 64-256 cores per simulation today where the software currently performs quite well

Achievement and future needs in HPC

Detlef Mueller-Wiesner, COO of EADS France

  • PRACE is a European supercomputer initiative to understand large-scale PetaScale systems
  • Want to eliminate the need for physical testing and rely solely on computer simulation of new aircraft
  • Want to be able to sell the first aircraft built, rather than using it just for testing!
  • Also want to be able to predict flight performance prior to the first flight
  • Authorities already accept simulations for an electromagnetic (EM) test of a change to an existing aircraft system
  • EADS believes it will need to increase its HPC performance by 100% per year
  • User interface is critical for simulations - how do the users interact with, understand and interpret the results?
  • Have CFD simulated an A380 in landing and take-off configurations, including ground effect and landing gear, all within 1% of the measured results, an amazing result!
  • Also big users of Fast Multipole Methods (FMM) for large electromagnetic simulations
    • O(nlogn) vs. O(n^3) for traditional, more LINPACK-like methods
  • FMM can be used for EM, acoustics, vibration analysis, heat transfer and elasticity
  • "Supercomputing is innovation in action!"
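
To get a feel for O(nlogn) vs. O(n^3), it helps to ask how much more work a 10X bigger problem costs under each scaling. A rough sketch that ignores constant factors and memory (so it says nothing about which method is faster for a given n, only how each grows); the problem sizes are arbitrary examples of mine:

```python
import math

def cost_ratio(n_small, n_large, cost):
    """How many times more work the larger problem needs under a cost model."""
    return cost(n_large) / cost(n_small)

def dense_cost(n):
    """LINPACK-like dense solve: work grows as n^3."""
    return n ** 3

def fmm_cost(n):
    """Multilevel fast multipole: work grows as n log n."""
    return n * math.log2(n)

# Going from 1 million to 10 million unknowns
dense_growth = cost_ratio(1e6, 1e7, dense_cost)   # about 1000X more work
fmm_growth = cost_ratio(1e6, 1e7, fmm_cost)       # about 11.7X more work
```

So each 10X in unknowns costs the dense method three orders of magnitude, but the multipole method only slightly more than one.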

High scalability multipole methods: solving half a billion unknowns (slides)

J. Mourino (not Jose) et al, Supercomputing Centre of Galicia, Spain

  • Want to be able to use high frequencies in electromagnetic simulations of large objects
    • Real car at 79GHz -> 400 million unknowns
    • These are frequencies used by in-car collision avoidance systems
  • Traditional solution is Method of Moments (MoM) - LINPACK-like
  • Newer method, the Fast Multipole Method (FMM), scales as O(n^(3/2))
  • Multilevel FMM scales as O(nlogn) but poor scalability across many processors
  • Their new method is FMM-FFT
  • Full domain is divided into groups in a 3D circular convolution style
  • Uses the FFT to speed-up the translation stage
  • Modern supercomputers are getting very good at doing large FFTs and scaling well
  • With this method a single global communication step is required at the end of the Matrix Vector Product
  • But this method uses lots of memory - further refinements have addressed this
  • HEMCUVE is the name of their code, written in C++
  • Needs 6 GBytes per core
  • Scaled really well to 1024 processors
  • Have a 2,580 core, 20 TByte, 16 TFLOP system to use for this
    • Called FinisTerrae
    • Uses Intel Itanium 2 Montvale with 64 GB per node (8 GB per core) - HP system rx7640 nodes
    • Hit an MPI problem of a limit of 2GBytes per message
    • The only talk I saw at the whole conference that mentioned Intel's Itanium architecture
  • Still takes 30 hours to run an entire problem
  • Have simulated a Citroen C3 car at 24 GHz (radar frequency) which needed 40M unknowns
  • Also done at 79GHz (the anti-collision frequency), needing 400M unknowns, 10X more
  • They're working on solving 1B unknowns using 2000 cores
  • They haven't measured peak performance so can't say much about how well they're really doing
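
The reason the FFT helps with a "circular convolution style" translation stage is the convolution theorem: a circular convolution becomes an element-wise product after a Fourier transform, so O(n^2) work drops to O(nlogn). A minimal NumPy sketch of just that identity; it is not HEMCUVE's translation operator, and the function names are mine:

```python
import numpy as np

def circular_convolve_direct(a, b):
    """1D circular convolution by the definition: O(n^2) operations."""
    n = len(a)
    out = np.zeros(n, dtype=complex)
    for k in range(n):
        for j in range(n):
            out[k] += a[j] * b[(k - j) % n]
    return out

def circular_convolve_fft(a, b):
    """Circular convolution in any number of dimensions via the FFT."""
    return np.fft.ifftn(np.fft.fftn(a) * np.fft.fftn(b))

# 1D check: both routes give the same answer
rng = np.random.default_rng(1)
a = rng.standard_normal(16)
b = rng.standard_normal(16)
max_diff_1d = np.max(np.abs(circular_convolve_direct(a, b) - circular_convolve_fft(a, b)))

# 3D check: convolving with a unit impulse at (1, 0, 0) shifts the grid by one cell
grid = rng.standard_normal((4, 4, 4))
impulse = np.zeros((4, 4, 4))
impulse[1, 0, 0] = 1.0
shifted = circular_convolve_fft(grid, impulse).real
```

In a distributed setting the 3D FFT itself needs an all-to-all exchange, which fits with the remark above about modern machines being good at large FFTs.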

Parallel scalable PDE-constrained optimisation: antenna identification in hyperthermia treatment planning (slides)

O. Schenk et al, University of Basel (collaborators at Purdue)

  • Many different problems are instances of trying to solve large-scale non-linear optimization
  • Will be using BlueGene/L #3 system and also their own 64-node Intel Xeon cluster
  • Aiming to solve systems of 1M to 1B variables/unknowns
  • The hyperthermia treatment they're looking at uses heat applied to a tumour at around 41-45 degrees C
  • Typically formulated as a PDE-constrained optimisation problem
  • Inequality constraints:
    • State variables: temperature distribution
    • Control variable: EM antenna placement
  • Electrical field used to induce the heat, blood flow diffuses this away
  • Used NLP optimiser called IPOPT (open source C++, >5000 users) with linear equation solver PSPIKE
  • They've been able to scale up to 512 cores for their biomedical PDE-constrained optimisation
  • There was a lot of heated debate about whether they'd done this in a practical way
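
For anyone unfamiliar with the term, a PDE-constrained optimisation has a state (here temperature), a control (here the heat source), and a PDE linking the two as a constraint. The toy below is entirely my own construction to show that structure: a 1D steady heat equation, a quadratic objective and no inequality constraints, solved in one shot through its KKT system with dense NumPy. The real problem is 3D, non-linear and has inequality constraints, and none of this is IPOPT or PSPIKE:

```python
import numpy as np

def solve_toy_problem(n=30, alpha=1e-4):
    """Minimise 0.5*|u - target|^2 + 0.5*alpha*|c|^2 subject to A u = c.

    u is the state (temperature), c the control (heat source), and A the
    finite-difference form of -d2/dx2 on (0, 1) with u = 0 at both ends.
    """
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h ** 2
    target = np.sin(np.pi * x)            # hypothetical desired temperature profile

    # KKT conditions of the Lagrangian, with multipliers lam for A u - c = 0:
    #   u + A^T lam = target,   alpha*c - lam = 0,   A u - c = 0
    I = np.eye(n)
    Z = np.zeros((n, n))
    K = np.block([[I, Z, A.T],
                  [Z, alpha * I, -I],
                  [A, -I, Z]])
    rhs = np.concatenate([target, np.zeros(2 * n)])
    sol = np.linalg.solve(K, rhs)
    return A, target, sol[:n], sol[n:2 * n]

def objective(u, c, target, alpha):
    """Misfit to the target plus a penalty on the control effort."""
    return 0.5 * np.sum((u - target) ** 2) + 0.5 * alpha * np.sum(c ** 2)

A, target, u, c = solve_toy_problem()
best = objective(u, c, target, 1e-4)
do_nothing = objective(np.zeros_like(u), np.zeros_like(c), target, 1e-4)
```

Even this toy gives a KKT matrix three times the size of the PDE system, which hints at why problems with 1M to 1B unknowns need a scalable parallel linear solver.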

© 2009 University of Bristol