MRSC technical day 1

 

KEYNOTE ADDRESS: Green Flash – Exascale performance for Petascale power budgets. David Donofrio, Berkeley Labs - LBNL

 

Existing approaches just aren’t feasible for Exascale computing – require too much power.

Aiming for 100X energy efficiency over existing HPC approaches.

Green Flash’s target application is high-resolution cloud modelling. Aiming at 1km resolution; typical approaches run at 200km, with goals of 25km.

Has 20M-way parallelism that can be expressed in this model. So they’re going to build a 20M core system. (!)

Has an icosahedral domain decomposition approach – works well at the poles.

Nearest neighbour dominated communication (like a stencil approach).
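
A minimal sketch of what nearest-neighbour (stencil) communication means: each grid point updates using only its immediate neighbours, so a domain decomposition only needs to exchange thin halo regions between adjacent domains. This is a hypothetical 1D 3-point averaging stencil, not Green Flash's actual climate kernel.

```python
def stencil_step(u):
    """One 3-point averaging sweep; boundary values stay fixed."""
    v = u[:]  # copy so all reads see the old time step
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v

u = [0.0] * 8
u[4] = 3.0
u = stencil_step(u)
# after one step the spike has spread only to immediate neighbours:
# u[3], u[4], u[5] are 1.0, everything further away is still 0.0
```

Because each update touches only adjacent points, information propagates one cell per step – which is exactly why a parallel decomposition gets away with nearest-neighbour communication.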

Needs 20 PetaFLOPS sustained, 200 PFLOPS peak!

“Small is beautiful” – large arrays of small, simple cores.

They can get 400X more performance per Watt from an embedded core than an IBM Power5 (probably a bit optimistic but the principle is true).

The Tensilica IP core design tools give estimates for speed, core size and core power in real-time as the design is being changed (impressive).

Aiming for a PGAS architecture (Partitioned Global Address Space).

Will use some flash per socket to enable fast checkpoint/restart and therefore improve fault resilience.

Between 64 and 128 cores per chip, also considering on-chip silicon photonics for their Network on Chip (NoC).

Looking beyond DRAM (too power hungry) at Resistive Change RAM (ReRAM). This is non-volatile (no refresh needed) – promising for a 10X improvement in energy efficiency.

They are collaborating with Keren Bergman’s group at Columbia for on- and off-chip photonic network technology.

Described their climate code as a long sequence of nested loops. Did a lot of work measuring all the different cases.

Will design in extra cores for fault tolerance – full ECC too.

Relying on auto-tuning of the software. The auto-tuners have domain-specific knowledge and should thus be able to apply more wide-ranging, aggressive optimisations.
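
The core auto-tuning loop is simple: generate candidate variants of a kernel, time each one on the target machine, and keep the fastest. A toy sketch below tunes the block size of a blocked sum; the function names and candidate sizes are illustrative, not Green Flash's actual tuner.

```python
import time

def blocked_sum(data, block):
    """Sum a list in chunks of `block` elements (the tunable parameter)."""
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def autotune(data, candidates):
    """Time each candidate block size and return the fastest."""
    best, best_t = None, float("inf")
    for block in candidates:
        t0 = time.perf_counter()
        blocked_sum(data, block)
        elapsed = time.perf_counter() - t0
        if elapsed < best_t:
            best, best_t = block, elapsed
    return best

data = [1.0] * 10000
best_block = autotune(data, [16, 256, 4096])
```

Real auto-tuners search much larger spaces (loop orders, unroll factors, data layouts), but the measure-and-select structure is the same.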

Using RAMP to emulate some of their design on FPGA-based emulators.

Have been able to demo the FPGA-based simulation (at SC09). Cores run at about 25MHz – much faster than SW-based simulation (100 KHz).

The total system will use 5MW, $100M, 500 m^2, to get to 200 PFLOPS peak.

Very interesting approach to solving large problems – and an example of application-specific optimisation of a complete system, hardware and software.

 

Offloading Parallel Code on Heterogeneous Multicores: A Case Study using Intel Threading Building Blocks on Cell. George Russell, Codeplay Software Ltd., Edinburgh, UK

 

Codeplay are a compiler company based in Edinburgh.

Used Intel Thread Building Blocks (TBB). This is aimed at shared-memory multiprocessors.

Their work is “Offload C++” – a conservative C++ extension. Targets heterogeneous cores e.g. IBM’s Cell.

The programming model migrates host threads onto an accelerator.

Used a seismic benchmark from Intel’s TBB examples. They could parallelise this using a “parallel for loop”.
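
The "parallel for loop" pattern, sketched with Python's thread pool as a stand-in for TBB: the index range is split into chunks and each chunk is handed to a worker, which is roughly what TBB's `parallel_for` does in C++. This is an illustration of the pattern, not Codeplay's code.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(n, body, workers=4):
    """Apply body(i) for i in range(n), chunked across a thread pool."""
    chunk = max(1, n // workers)
    ranges = [range(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves chunk order, so results come back in index order
        results = pool.map(lambda r: [body(i) for i in r], ranges)
    return [x for part in results for x in part]

squares = parallel_for(8, lambda i: i * i)
# → [0, 1, 4, 9, 16, 25, 36, 49]
```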

Targeting IBM’s Cell, they didn’t port the whole TBB but just specific pieces – “__offload{}” blocks.

Sounds like this would make porting code to Cell much easier.

They’ve only done one simple optimisation so far.

The work was inspired by the gaming industry where the Cell in the PS3 is “the odd one out” for the developers mostly targeting shared memory multiprocessors – Xbox 360, PCs etc.

http://offload.codeplay.com

 

A domain decomposition method for Parallel Molecular Dynamics. Mauro Bisson, Universita' ”La Sapienza”, Rome, Italy

 

Trying to simulate blood flow through the cardiovascular system.

Their problem is how to parallelise MD inside regular domains.

They have interdomain pairs and particle migration between domains.

Because the domains have irregular shapes it’s difficult to implement domain tests that are efficient. They use a regular grid decomposition to get around this, applying approximations to the real shape of the domains.
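
The regular-grid trick can be sketched as follows: instead of testing particles against an irregular domain shape directly, bin them into a regular grid of cells, then assign whole cells to domains. The cell size and coordinates here are illustrative, not MURPHY's actual decomposition.

```python
def bin_particles(particles, cell_size):
    """Map each (x, y) particle to the grid cell that contains it."""
    cells = {}
    for p in particles:
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells.setdefault(key, []).append(p)
    return cells

cells = bin_particles([(0.2, 0.3), (0.8, 0.1), (1.4, 0.2)], cell_size=1.0)
# particles in the same unit cell share a key: cell (0, 0) holds two of them
```

The domain-membership test then becomes a cheap integer lookup per cell rather than a geometric test per particle, at the cost of approximating the domain boundary by cell edges.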

Have a collaboration with EPFL, CNR and Harvard for “hemo-dynamics”. Have developed a simulator called MURPHY. This can run on an IBM BlueGene/L at up to 7 TFLOPS and also on an Nvidia-based GPU system in Fortran 90 and Cuda.

They used Gay-Berne potentials for their MD simulations – not many people doing this yet.

 

HLL mapping to FPGA using a dependency analysis based graphical methodology. Sunita Chandrasekaran, Nanyang Tech. University, Singapore

 

FPGAs are massively parallel devices and using them efficiently remains a hard research problem.

Looking at wavefront-based algorithms. Started with Smith-Waterman in the ClustalW application – multiple sequence alignment. (This team mapped the same problem to the forerunner to ClearSpeed’s architecture, PixelFusion’s F150, in about 2001).

FPGA tools-wise they tried Trimaran, Rose and Impulse C. Also looked at other optimising compilers such as Open64 and OpenUH but they were too difficult to get to grips with.

Managed to get 92 PEs on a Virtex 2 from Xilinx running at 34MHz and got good speed-ups, even better on a Virtex 5 (50X faster than a single 2GHz core).

 

Optimizing data locality for the efficient solution of multiphysics problems on systems with thousands of multicore processors. Lee Margetts, University of Manchester

 

In collaboration with Barcelona Supercomputing Centre.

The target application is magnetohydrodynamic flow.

Interested in how a magnetic field affects the flow of a fluid and vice versa.

Ported this to run on HECToR, the UK’s national HPC facility (>5000 quad core AMD processors).

Ran a system with 12M equations, got to about 5-10% of peak performance on HECToR – target was 10-20%.

Looks like a variant of a sparse matrix problem with domain specific optimisations. This improvement helped but still below target by a long way.

The solver was BiCGSTAB(l).

Should work well on GPUs too.

 

Accelerating Publish/Subscribe Matching on Reconfigurable Supercomputing Platform. Kuen Hung Tsoi, Imperial College London, United Kingdom

 

Used for web services, financial data processing and credit card fraud detection.

Systems typically have millions of subscriptions to search through.

 

Implementation of the K-means clustering algorithm in Mitrion-C on an RC-100. Charles J. Gillan, Queen’s University Belfast, Northern Ireland

 

Single precision isn’t enough for this algorithm – they tried this on older GPUs.

The source of their problem data is a confocal microscope.

They were visualising a volume of about 100x100x100 pixels, so about 1M pixels to be partitioned per time frame. Fortunately the K-means algorithm is massively parallel.
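
Why K-means is massively parallel: in the assignment step every pixel independently finds its nearest centroid, so pixels can be processed in parallel with no communication. A minimal sketch on 1D intensities with made-up data – not the Mitrion-C implementation.

```python
def assign(pixels, centroids):
    """Label each pixel with the index of its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda k: abs(p - centroids[k])) for p in pixels]

def update(pixels, labels, k):
    """Recompute each centroid as the mean of its assigned pixels."""
    sums, counts = [0.0] * k, [0] * k
    for p, l in zip(pixels, labels):
        sums[l] += p
        counts[l] += 1
    return [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]

pixels = [10, 12, 11, 250, 255, 240]   # 16-bit-style intensities
labels = assign(pixels, [0, 200])
# → [0, 0, 0, 1, 1, 1]
centroids = update(pixels, labels, 2)
```

Only the `assign` step needs the full pixel set per iteration, which is why keeping the data in multiple SRAM banks (as the talk describes) matters so much on the FPGA.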

Pixels were 16-bit inputs (A/D converter limitations).

Used Mitrion-C for this implementation. Had to focus on the efficient use of multiple SRAM banks, still quite a low-level consideration.

The RC-100 FPGA blade in the SGI machine includes dual Virtex 4 FPGAs. These are quite old technology now. Also their implementation only used 13% of the flip-flops (and no multipliers – all summation). Mitrion-C also has a restriction to limit clock speed to 100MHz maximum.

The next part of the algorithm needs floating point and does lots of logs and exponentials. They had to implement these on the FPGA themselves and it took reasonable resources.

The belief is that a modern GPU and FPGA would make a similar comparison.

 

Off-loading the Reed-Solomon algorithm to hardware accelerators. Thomas Steinke, Zuse Institute Berlin, Germany

 

Considering very large-scale storage systems – might have 100K hard drives. RAID-5 not good enough for this (data loss every 9 days), RAID-D2 or RAID-D3 required (MTTDL 100 years or 130M years respectively).

IBM’s been looking at using Reed-Solomon as part of the solution to this problem.

Reed-Solomon is non-binary, cyclic block coding circa 1960. Already used on DVDs, CDs etc.

Encoding is a matrix-vector multiplication. Cauchy Reed-Solomon is a more recent algorithmic improvement.
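
The "encoding is a matrix-vector multiplication" point, illustrated with a toy code over the prime field GF(7). Real Reed-Solomon implementations work over GF(2^8) with a Vandermonde or Cauchy generator matrix, but the structure is the same; the field size and data values below are purely illustrative.

```python
P = 7  # toy field size (real implementations use GF(2^8))

def matvec_mod(matrix, vec, p=P):
    """Matrix-vector product with all arithmetic mod p."""
    return [sum(a * x for a, x in zip(row, vec)) % p for row in matrix]

# Vandermonde-style rows [1, i, i^2] evaluate the data polynomial at
# i = 1..5, turning 3 data symbols into a 5-symbol codeword.
G = [[1, i % P, (i * i) % P] for i in range(1, 6)]
data = [2, 5, 3]
codeword = matvec_mod(G, data)
# → [3, 3, 2, 0, 4]: any 3 of these 5 symbols recover the data,
# so 2 of the 5 can be lost (the 5+3 schema in the talk is analogous)
```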

They’ve analysed Nvidia GPUs, FPGAs, PowerXCell 8i and ClearSpeed.

Implemented 5+3 Reed-Solomon schema, Cauchy R-S. Ran problems from 150MB up to 2GB.

An Intel Nehalem could get to around 8.7GBytes/s on 8 cores (4 cores were nearly as fast). Even 1 core could get to 4.6 GBytes/s.

On Cell/BE they got to 14.5 GBytes/s, and had 5.5 GBytes/s on a single core.

 

An unstructured 3D CFD code optimised for multicore and graphics processing units. James Sharp, BAE Systems, Bristol, United Kingdom

 

BAE has products for air, sea and land (submarines, ships, planes etc.).

Here they are focused on high-fidelity CFD simulations, some of which can run on large computers for four or five months!

It’s critical for them to bring down the cost of these simulations: at the moment it’s still cheaper, in some cases, to do real wind-tunnel testing than simulation – a situation which is clearly ludicrous.

Their current CFD simulations are 3D explicit finite volume, 2nd order time and space with a two-equation turbulence model.

They use a completely unstructured grid to enable the finer modelling of areas of greater interest in the models.

They’ve been using the Boost C++ library for the host versions of their code.

Getting good results from using GPU acceleration – their Nvidia Fermi cluster should arrive within a few weeks. See the presentation for performance graphs.

 

Fast Exhaustive Dictionary Search using CUDA - A Requirements-based Approach. Johannes Niedermaier, LNCC, Petropolis, RJ, Brazil

 

Can compress images with better image quality than JPEG2000, H.264 etc., while achieving greater levels of compression.

Highly parallelisable and a good fit for accelerators – maps nicely to most GPUs. Useful even on laptops, phones etc.

 

Accelerating SOLiD short read assembly with GPU. Peter Szanto, Dept. of Measurement and Information Systems, Budapest University of Technology and Economics, Hungary

 

A gene sequencing methodology. They have to find matches in the reference human genome while assembling new gene sequences from many short sequence reads, typically 25-100 bases (letters) long. Some sequences can be up to 1000 bases long, but these are still very expensive.

For 25-base long reads between 1-4 mismatching characters are allowed.

On the GPU each read is processed by a single thread.
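
The per-read matching task can be sketched as follows: count mismatches between a short read and each reference position, keeping alignments within a mismatch budget. Each read is completely independent, which is why one-read-per-thread maps well to the GPU. This brute-force scan is for illustration only – real short-read aligners use indexed data structures.

```python
def align_read(read, reference, max_mismatches):
    """Return reference positions where the read matches within budget."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

hits = align_read("ACGT", "TTACGTTTACTT", max_mismatches=1)
# → [2, 8]: an exact hit at 2, and "ACTT" at 8 differs by one base
```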

 

Design-Space Exploration of Biologically-Inspired Visual Object Recognition Algorithms Using CPUs, GPUs, and FPGAs. Vinay Sriram, The Rowland Institute, Harvard University, Cambridge, MA, USA

 

Working primarily on face recognition with companies like Facebook.

Modelling how the eyes and brain really recognise objects and more particularly faces.

Now trying to do this in real-time. Much higher fidelity than the quick/cheap/dirty face recognition that already exists.

90% of the compute time is in 2D convolution. Kernels range from 3x3 up to 64x64.
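
For reference, the direct form of the 2D convolution that dominates their compute time: every output pixel is a weighted sum over a KxK window, so cost scales with K², which is why 64x64 kernels are so expensive. The image and kernel values are illustrative only.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution (no padding), kernel flipped per convention."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            for j in range(kh):
                for i in range(kw):
                    acc += image[y + j][x + i] * kernel[kh - 1 - j][kw - 1 - i]
            row.append(acc)
        out.append(row)
    return out

box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # 3x3 box kernel (unnormalised)
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
result = convolve2d(img, box)
# → [[54, 63]]: the sum of each 3x3 window
```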

Currently mapping this into some cheap FPGAs.

Also considering OpenCL and mobile phones that could run this – iPhone et al.

Also looking at building custom chips to do this – but I’d be surprised if it was worth it. Sounds like it should just be part of something else. Good idea to do it in OpenCL though.

 

Modeling DNA Radiation Damage on many-core architectures: current GP-GPU and a novel FPGA implementations. Paolo Palazzari, Ylichron Srl, Rome, Italy

 

Uses the HARWEST compiling environment, a “C-to-VHDL toolset which, starting from user specifications written in C language, is able to produce a parallel architecture which performs the original algorithm.”

Very automatic.

This is a computational chemistry application, focusing on quantum mechanics. In particular they model a system with 5 orbitals and 585 basis functions.

 

MRSC technical day 2

 

KEYNOTE ADDRESS: Education: A New Direction for Standards in Reconfigurable Computing. Eric Stahlberg, OpenFPGA and Wittenberg University, USA

 

Reconfigurable computing goals: keep costs down, improve energy efficiency, go fast, and retain scope for optimisation.

Much attention has been diverted from FPGAs to GPUs – cheaper, easier to get going. More opportunities in IP.

FPGA users tend to be hardware designers.

Vendors are focusing on verticals: DRC in security, XtremeData in data.

FPGAs may be at the bottom of the trough of disillusionment right now and so may be heading toward the slope of enlightenment.

HPC will resume growth in mid 2010 – IDC.

“The door remains open for reconfigurable supercomputing.”

Return On Investment (ROI) is key – a greater ROI wins the day.

FPGAs need to translate into mass market products on smaller margins – currently they’re small volume, high margin products, but this status quo is under threat.

There’s a 45M call open summer 2010 for reconfigurable computing research.

Top barriers for RC supercomputing: programming, standards, education, costs.

OpenFPGA has over 500 participants.

www.openfpga.org

 

Efficient OS Services for Heterogeneous and Reconfigurable Manycores. David Andrews, University of Arkansas

 

Even atomic operations are implemented in an incompatible way between different kinds of processors – an issue in heterogeneous systems. How can these kinds of issues be resolved?

The remote procedure call (RPC) model is still most popular – slaves have to request services via the host. (E.g. Cell, Intel’s EXOCHI).

Monolithic OSs use global data structures – microkernels are designed to get around this problem.

This work has taken some of the microkernel and put it in hardware – thread management, synchronisation, interrupts etc.

With this approach can get a lock in under 5 cycles (in a Virtex 5 FPGA) – accessing memory takes 20 clock cycles, so it’s faster than a load/store. Any core can invoke OS APIs.

Looks like an interesting approach – expect to see this sort of technique adopted in MIMD many-cores from Intel et al.

Should also be a low OS jitter approach too.

 

Direct Connection Of High Capacity Mass Storage To Hardware Accelerators. Gyorgy Dancsi, Budapest University of Technology and Economics, Hungary

 

Want to be able to connect things like disks directly to accelerators to avoid the bandwidth bottlenecks (and wasted energy).

Focused on bioinformatics problems – RC-BLAST and Mercury BLAST, both using FPGAs.

Made their own FPGA boards with SATA interfaces. These include DMA engines so they can transfer data under their own control.

 

Doing science on a GPU-accelerated, many-core architecture: the EPFL system. Fabrizio Magugliani, E4 Computer Engineering, Italy

 

E4 are an integrator, i.e. they build systems from disparate components and deliver complete working systems to their end customers.

 

Performance comparison between different compilers and mpi libraries on linux cluster systems. Thomas Blume, MEGWARE Computer GmbH, Germany

 

They’re delivering quite a few accelerated systems these days, both Nvidia and ATI GPUs.


Unified Cluster Portfolio. Alberto Galli, HP, Milano, Italy

 

Nothing interesting.

 

A GPU Based Architecture for Distributing Dictionary Attacks to OpenPGP Secret Keyrings. Fabrizio Milo, Dipartimento di Informatica, Universita' 'La Sapienza', Rome, Italy

 

Most of the run-time in cracking keyrings is in the multiprecision integer libraries.

Might have 1M pass phrases – they have a technique to filter these down to a few hundred.

Can now crack 1M passwords in under 4s, 144X faster than one core of a fast CPU.

Used Python to auto-generate some of their GPU code.

Got a big speedup on Fermi because of its greater number of registers.

 

GPU acceleration of the Long-Wave Rapid Radiative Transfer Model in WRF using CUDA FORTRAN. Massimiliano Fatica, NVIDIA Corporation

 

One of the first Cuda Fortran applications.

WRF is a weather prediction code for mesoscale numerical weather prediction. Can calculate factors such as the energy from the Sun passing through the atmosphere and heating the Earth.

It’s a very compute intensive model. It’s a large Fortran 77 code. They’ve started porting a few thousand lines of it to Cuda Fortran.

Cuda Fortran is a collaborative effort with PGI (Portland Group).

Cuda Fortran is strongly typed – should make it easier to use than Cuda C.

Still early so some features from Cuda C still missing.

They partition the atmosphere into many columns and process these in parallel. Hence each thread processes an entire column, running the whole software stack.

Will make the CPU and GPU code available on the web www.mmm.ucar.edu/wrf/

Got about a 10X speedup compared to a fast quad-core Intel using icc.

The benchmark is in single precision at the moment.

Amenable to a heterogeneous approach – partition the work across the host and GPU and execute on both at the same time.

I have to say this looks very promising! In some ways it’ll be even easier to program Cuda from Fortran than from C.

 

Wavefront Reconstruction for Extremely Large Telescopes using Manycore Graphical Processors. Sofia Dimoudi, Durham University, United Kingdom

 

Adaptive optics can correct for atmospheric blurring on large astronomical telescopes.

Need about 5TFLOPS of matrix-vector multiplication to achieve real-time operation at 1KHz.

60-80,000 unknowns in the systems they need to solve for this.

Use iterative methods to avoid a computationally costly inversion.

Looking at using both FPGAs and GPUs.

They need a sparse matrix-vector call.
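
The sparse matrix-vector product they need, sketched in CSR (compressed sparse row) form – the storage scheme most GPU SpMV kernels use, since it only stores and multiplies the nonzeros. The matrix values below are illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored as CSR (values, col_idx, row_ptr)."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # row r's nonzeros occupy values[row_ptr[r]:row_ptr[r + 1]]
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
# → [5.0, 2.0, 8.0]
```

On a GPU, each row (or group of rows) maps naturally to a thread, which is what makes SpMV the workhorse of iterative solvers like the ones mentioned above.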

 

FPGA Based Reconfigurable Hardware Accelerator. Tamas Raikovich, Budapest University of Technology and Economics, Hungary        

 

They’ve been exploiting dynamic partial reconfiguration – changing some but not all of the logic in the FPGA at run-time. That was pretty much the main thing they were doing that was new.

 

FPGA implementation of cheminformatics and computational chemistry algorithms and its cost/performance comparison with GPGPU, cloud computing and SIMD implementations. Attila Berces, Chemistry Logic, Hungary

 

http://fpga.omixon.com

Have ported lots of bioinformatics apps and kernels to FPGA already. Can drive them through the web portal listed above – a form of cloud computing.

There’s about to be an explosion in the amount of genomic data generated as the cost to sequence a human genome falls below $1000 soon.

Comment from the CEO of Complete Genomics that their major cost will eventually be the electricity required to run their datacentres.

The Broad Institute already has a 4 PetaByte storage system for their genomic database.

Amazon gets its electricity extremely cheaply by building their datacentres next to cheap power sources (hydro etc). Currently at 5 cents per kWh – typical street prices are 4X this.

17 of the top 20 pharmaceutical companies already use Amazon webservices for research, even for docking.

“Data is more secure with Amazon than internally.”

 

BLAST acceleration via FPGA prefiltering. Peter Laczko, BUTE DMIS, Hungary

 

BLAST aligns genetic sequences – essentially a string matching problem. Heuristics have been developed that run in O(n) time and use much less than that in space for the working data set.

Prefiltering aims to reduce the size of the dataset drastically.
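
One common way to prefilter (sketched here as an assumption about the general technique, not necessarily this team's FPGA design): index the query's k-mers, then keep only reference windows that share at least one k-mer with a query, so windows with no shared k-mer never reach BLAST. The k and window sizes are illustrative.

```python
def kmers(seq, k):
    """All length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def prefilter(reference, queries, k=3, window=8):
    """Return the reference windows that share a k-mer with any query."""
    query_kmers = set()
    for q in queries:
        query_kmers |= kmers(q, k)
    kept = []
    for start in range(0, len(reference) - window + 1, window):
        chunk = reference[start:start + window]
        if kmers(chunk, k) & query_kmers:
            kept.append((start, chunk))
    return kept

kept = prefilter("AAAAAAAACGTACGTATTTTTTTT", ["CGTACG"], k=3, window=8)
# only the middle window survives – the other two share no k-mer
# with the query, so BLAST never has to look at them
```

False negatives arise when a true (inexact) alignment happens to share no k-mer with the query – consistent with the 2.2% figure reported below for the EST database.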

Implemented this on an SGI with built-in FPGAs.

Automatically pre-filter ahead of an un-modified BLAST application running on the rest of the system.

Managed to prefilter the CCR5 query against the reference whole human genome and got a 310-fold reduction with no false negatives. On the NCBI EST human database got a 100-fold reduction with 2.2% false negatives. Could prefilter at 42M bases/s with 64 queries in parallel, equivalent to 2.7 GBases/s.