Performance Optimization of Smoothed Particle
Hydrodynamics for Multi/Many-Core Architectures
Dr. Fabio Baruffa fabio.baruffa@lrz.de
Leibniz Supercomputing Centre
MC² Series: Colfax Research Webinar, http://mc2series.com
March 7th, 2017
Work contributors
Dr. Fabio Baruffa
Sr. HPC Application Specialist
Leibniz Supercomputing Centre
● Member of the IPCC @ LRZ
● Expert in performance optimization and HPC systems
Dr. Luigi Iapichino
Scientific Computing Expert
Leibniz Supercomputing Centre
● Member of the IPCC @ LRZ
● Expert in computational astrophysics and simulations
Email contacts: fabio.baruffa@lrz.de, luigi.iapichino@lrz.de
Intel® Parallel Computing Centers (IPCC)
● The IPCCs are an Intel initiative for the modernization of technical computing codes.
● The work focuses primarily on code optimization, increasing parallelism and scalability on multi/many-core architectures.
● Currently ~70 IPCCs are funded worldwide.
● Our target is to prepare the simulation software for new platforms, achieving high node-level performance and multi-node scalability.
Outline of the talk
Preprint of this work: https://arxiv.org/abs/1612.06090
● Overview of the code: P-Gadget3 and SPH.
● Challenges in the code modernization approach.
● Multi-threading parallelism and scalability.
● Enabling vectorization through:
  ● Data layout optimization (AoS → SoA).
  ● Reducing conditional branching.
● Performance results and outlook.
Gadget intro
● Leading application for simulating the formation of the cosmological large-scale structure (galaxies and clusters) and of processes at sub-resolution scale (e.g. star formation, metal enrichment).
● Publicly available, cosmological TreePM N-body + SPH code.
● Good scaling performance up to O(100k) Xeon cores (SuperMUC @ LRZ).
Introduction
Smoothed particle hydrodynamics (SPH)
● SPH is a Lagrangian particle method for solving the equations of fluid dynamics, widely used in astrophysics.
● It is a mesh-free method, based on a particle discretization of the medium.
● The local estimate of the gas density (and of all other quantities derived from the governing equations) is based on a kernel-weighted summation over neighbor particles.
Gadget features
The code can be run at different levels of complexity:
● N-body-only (a.k.a. dark matter) simulations.
● N-body + gas component (SPH).
● Additional physics (sub-resolution) modules: radiative cooling, star formation, …
● More physics → more memory required per particle (up to ~300 B / particle).
Features of the code
● Gadget was first developed in the late 90s as a serial code, and later evolved into an MPI and then a hybrid code.
● After the last public release, Gadget-2, many research groups all over the world have developed their own branches.
● The branch used for this project (P-Gadget3) has been used for more than 30 research papers over the last two years.
● The code has ~200 files and ~400k lines of code, with extensive use of #ifdef and external libraries (FFTW, HDF5).
Basic principles of our development
● Our intention is to ensure:
  ● Portability on all modern architectures (Intel® Xeon/MIC, Power, GPU, …);
  ● Readability for non-experts in HPC;
  ● Consistency with all the existing functionalities.
● We perform code modifications which are minimally invasive.
● The domain scientists must be able to modify the code without having to deal with performance issues.
Code modernization approach
Code modernization
● Scalar optimization: compiler flags, data casting, precision consistency.
● Vectorization: prepare the code for SIMD, avoid vector dependencies.
● Memory access: improve data layout, cache access.
● Multi-threading: enable OpenMP, manage scheduling and pinning.
● Communication: enable MPI, offloading computation.
https://software.intel.com/en-us/articles/what-is-code-modernization; colfaxresearch.com
Goal: preparation for next-generation processors and efficient use of the current hardware.
Target architectures for our project
Intel® architectures
Intel® Xeon processor
● E5-2650v2 Ivy-Bridge (IVB) @ 2.6 GHz, 8 cores / socket. TDP: 95 W, RCP: $1116.
● AVX.
Intel® Xeon Phi™ coprocessor, 1st generation
● Knights Corner (KNC) coprocessor 5110P @ 1.1 GHz, 60 cores. TDP: 225 W, RCP: N/D.
● Native / offload computing.
● Direct login via ssh.
● SIMD 512 bits.
Further tested architectures
Intel® Xeon processors
● E5-2697v3 Haswell (HSW) @ 2.3 GHz, 14 cores / socket. TDP: 145 W, RCP: $2702.
● AVX2, FMA.
● E5-2699v4 Broadwell (BDW) @ 2.2 GHz, 22 cores / socket. TDP: 145 W, RCP: $4115.
● AVX2, FMA.
Intel® Xeon Phi™ processor, 2nd generation
● Knights Landing (KNL) Processor 7250 @ 1.4 GHz, 68 cores. TDP: 215 W, RCP: $4876.
● Available as bootable processor.
● Binary-compatible with x86.
● High bandwidth memory.
● New AVX512 instruction set.
Optimization strategy
● We isolate the representative code kernel subfind_density and run it as a stand-alone application, avoiding the overhead from the whole simulation.
● Like most code components, it consists of two sub-phases of nearly equal execution time (40 to 45% each), namely the neighbour-finding phase and the remaining physics computations.
● Our physics workload: ~500k particles. This is a typical workload per node of simulations with moderate resolution.
● We focus mainly on node-level performance.
● We use tools from the Intel® Parallel Studio XE (VTune Amplifier and Advisor).
Simulation details: www.magneticum.org
Isolation of a kernel code
Data serialization
● Serialization: the process of converting data structures or objects into a format that can be stored and easily retrieved.
● This allows us to isolate the computational kernel using a realistic input workload (~551 MB).
● Dumping data for compression.
[Diagram: Object → byte stream → file / DB / memory → byte stream → Object]
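As a rough illustration of the serialization step, the following C sketch dumps an array of particles to a binary file and reads it back, so that a stand-alone kernel could start from a realistic snapshot. The `struct particle` layout, the file format (a count followed by the raw array) and the function names are assumptions made for this example; they are not the format used by P-Gadget3. A raw dump like this is only portable between machines with the same endianness and struct layout.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical particle record used only for this sketch. */
struct particle {
    float pos[3];
    float vel[3];
    float mass;
};

/* Write the particle count followed by the raw particle array.
   Returns 0 on success, -1 on failure. */
static int dump_particles(const char *path, const struct particle *p, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    int ok = fwrite(&n, sizeof n, 1, f) == 1 &&
             fwrite(p, sizeof *p, n, f) == n;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read a file written by dump_particles. Returns a malloc'd array and
   stores the count in *n_out, or returns NULL on failure. */
static struct particle *load_particles(const char *path, size_t *n_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    size_t n = 0;
    struct particle *p = NULL;
    if (fread(&n, sizeof n, 1, f) == 1) {
        p = malloc((n ? n : 1) * sizeof *p);
        if (p && fread(p, sizeof *p, n, f) != n) {
            free(p);
            p = NULL;
        }
    }
    fclose(f);
    if (p)
        *n_out = n;
    return p;
}
```

The full simulation would call `dump_particles` once, just before entering the kernel; the stand-alone benchmark then calls `load_particles` at start-up and frees the array when done.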
Initial profiling
Multi-threading parallelism
● Severe shared-memory parallelization overhead (thread spinning).
● At later iterations, the particle list is locked and unlocked constantly due to the recomputation.
● Spinning time: 41%.
Algorithm pseudocode
Subfind algorithm
more_particles = partlist.length;
while(more_particles){                    // while loop over the full particle list
  int i=0;
  #pragma omp parallel
  {
    while(!error && i<partlist.length){
      #pragma omp critical                // each thread gets the next particle
      {                                   // (private p) to process
        p = partlist[i++];
      }
      if(!must_compute(p)) continue;      // check for computation
      ngblist = find_neighbours(p);       // actual computation
      sort(ngblist);
      for(auto n:select(ngblist,K))
        compute_interaction(p,n);
    }
  }
  more_particles = mark_for_recomputation(partlist);
}
Removing lock contention
todo_partlist = partlist;                 // creating a todo particle list
while(todo_partlist.length){
  error=0;
  #pragma omp parallel for schedule(dynamic)
  for(auto p:todo_partlist){              // iterations over the todo list
    if(something_is_wrong) error=1;
    ngblist = find_neighbours(p);         // private ngblist, no checks for computation
    sort(ngblist);
    for(auto n:select(ngblist,K))         // actual computation
       compute_interaction(p,n);
  }
  //...check for any error
  todo_partlist = mark_for_recomputation(partlist);
}
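The lockless todo-list scheme can be illustrated with a small, compilable C example. It is only a sketch under simplifying assumptions: `process_particle` is a hypothetical stand-in for the neighbour search and physics (it halves a per-particle residual), and a particle needs recomputation while its residual exceeds a tolerance. Because every loop iteration works on a distinct particle, no critical section is required, and the todo list is rebuilt serially between passes.

```c
#include <stdlib.h>

/* Hypothetical per-particle work: one refinement step that halves the
   residual. Stands in for find_neighbours + compute_interaction. */
static void process_particle(double *residual)
{
    *residual *= 0.5;
}

/* Iterate until every residual is <= tol. Returns the number of passes
   over the todo list, or -1 if the list cannot be allocated. */
static int converge_all(double *residual, int n, double tol)
{
    int *todo = malloc((n > 0 ? n : 1) * sizeof *todo);
    if (!todo)
        return -1;

    /* Initial todo list: only particles that actually need work. */
    int ntodo = 0;
    for (int i = 0; i < n; i++)
        if (residual[i] > tol)
            todo[ntodo++] = i;

    int passes = 0;
    while (ntodo > 0) {
        /* Each iteration touches a distinct particle: no lock needed. */
        #pragma omp parallel for schedule(dynamic)
        for (int k = 0; k < ntodo; k++)
            process_particle(&residual[todo[k]]);

        /* Serial rebuild: keep only particles marked for recomputation. */
        int nnext = 0;
        for (int k = 0; k < ntodo; k++)
            if (residual[todo[k]] > tol)
                todo[nnext++] = todo[k];
        ntodo = nnext;
        passes++;
    }
    free(todo);
    return passes;
}
```

Compiled without OpenMP support the pragma is ignored and the function gives the same result serially, which is convenient for checking correctness before measuring scalability.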
Improved performance
● Lockless scheme: no spinning.
● Time spent in spinning: only 3%.
Improved speed-up
● On IVB: speed-up 1.8x, parallel efficiency 92%.
● On KNC: speed-up 5.2x, parallel efficiency 57%.
Obstacles to efficient auto-vectorization
Target loop
for(n = 0; n < neighboring_particles; n++){    // for loop over neighbors
    j = ngblist[n];
    if (particle n within smoothing_length){   // check for computation
       inlined_function1(…, &w);               // computing physics
       inlined_function2(…, &w);
       rho   += P_AoS[j].mass*w;               // particle properties via AoS
       vel_x += P_AoS[j].vel_x;
       …
       v2 += vel_x*vel_x + … vel_z*vel_z;
    }
}
struct ParticleAoS
{
  float pos[3];
  float vel[3];
  float mass;
};
struct ParticleSoA
{
  float *pos_x, *pos_y, *pos_z;
  float *vel_x, *vel_y, *vel_z;
  float *mass;
};
Data layout
[Figure: memory → vector register, AoS vs SoA. AoS: memory holds pos[0], pos[1], pos[2], vel[0], …, mass for particles[i], then the same fields for particles[i+1]; the values x_i … x_i+7 needed in one vector register are spread across memory. SoA: p.pos_x[i], p.pos_x[i+1], … are contiguous in memory and map directly onto the vector register.]
Data layout: AoS vs SoA
● With AoS, automatically vectorized loops can contain loads from non-contiguous memory locations → non-unit stride.
● In that case the compiler issues hardware gather/scatter instructions.
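The stride difference can be made concrete with a short C sketch. It assumes the seven-float AoS record shown on the slide and 4-byte floats; the helper names are illustrative. In the AoS layout, consecutive x-positions are one whole record apart, while in the SoA layout they are adjacent.

```c
#include <stddef.h>

/* AoS record as on the slide: seven floats per particle. */
struct particle_aos {
    float pos[3];
    float vel[3];
    float mass;
};

/* Byte distance between the x-positions of two consecutive particles. */
static size_t aos_x_stride(void) { return sizeof(struct particle_aos); }
static size_t soa_x_stride(void) { return sizeof(float); }

/* Same reduction over both layouts. The AoS loop reads one float out of
   every record (non-unit stride); the SoA loop reads a dense array. */
static float sum_x_aos(const struct particle_aos *p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].pos[0];
    return s;
}

static float sum_x_soa(const float *pos_x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += pos_x[i];
    return s;
}
```

With 4-byte floats the AoS stride is 28 bytes against 4 bytes for SoA, so an 8-wide vector load of x-positions touches 8 separate records in the first case and one contiguous 32-byte block in the second.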
Proposed solution: SoA
● New particle data structure: defined as Structure of Arrays (SoA).
● From the original set, only variables used in the kernel are included in the SoA: ~60 bytes per particle.
● Software gather / scatter routines.
● Minimally invasive code changes:
  ● SoA in the kernel.
  ● AoS exposed to other parts of the code.
Implementation details
struct Particle_AoS
{
  float pos[3], vel[3], mass;
};
struct Particle_AoS *P_AoS;
P_AoS = malloc(N*sizeof(struct Particle_AoS));

struct Particle_SoA
{
  float *pos_x, … , *vel_x, …, *mass;
};
struct Particle_SoA P_SoA;
P_SoA.pos_x = malloc(N*sizeof(float));
…

// kernel access before (AoS):
…
rho   += P_AoS[j].mass*w;
vel_x += P_AoS[j].vel[0];
…

// kernel access after (SoA):
…
rho   += P_SoA.mass[j]*w;
vel_x += P_SoA.vel_x[j];
…

// software gather: copy the AoS data into the SoA before entering the kernel
void gather_Pdata(struct Particle_SoA *dst, struct Particle_AoS *src, int N)
{
  for(int i = 0; i < N; i++){
    dst->pos_x[i] = src[i].pos[0]; dst->pos_y[i] = src[i].pos[1]; …
  }
}
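A complete, compilable version of the software gather/scatter idea might look like the following. This is a sketch, not the P-Gadget3 routines: the seven-field layout follows the slide, while the single-allocation strategy in `soa_alloc` and all function names are choices made for this example.

```c
#include <stdlib.h>

struct particle_aos {
    float pos[3];
    float vel[3];
    float mass;
};

struct particle_soa {
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    float *mass;
};

/* Allocate one buffer and carve it into seven arrays of n floats.
   Returns 0 on success, -1 on allocation failure. */
static int soa_alloc(struct particle_soa *s, size_t n)
{
    float *buf = malloc(7 * (n ? n : 1) * sizeof *buf);
    if (!buf)
        return -1;
    s->pos_x = buf;         s->pos_y = buf + n;     s->pos_z = buf + 2 * n;
    s->vel_x = buf + 3 * n; s->vel_y = buf + 4 * n; s->vel_z = buf + 5 * n;
    s->mass  = buf + 6 * n;
    return 0;
}

/* pos_x points at the start of the single buffer. */
static void soa_free(struct particle_soa *s) { free(s->pos_x); }

/* Gather: copy the kernel-relevant fields from AoS into SoA. */
static void gather_pdata(struct particle_soa *dst,
                         const struct particle_aos *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst->pos_x[i] = src[i].pos[0];
        dst->pos_y[i] = src[i].pos[1];
        dst->pos_z[i] = src[i].pos[2];
        dst->vel_x[i] = src[i].vel[0];
        dst->vel_y[i] = src[i].vel[1];
        dst->vel_z[i] = src[i].vel[2];
        dst->mass[i]  = src[i].mass;
    }
}

/* Scatter: copy results back so the rest of the code still sees AoS. */
static void scatter_pdata(struct particle_aos *dst,
                          const struct particle_soa *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i].pos[0] = src->pos_x[i];
        dst[i].pos[1] = src->pos_y[i];
        dst[i].pos[2] = src->pos_z[i];
        dst[i].vel[0] = src->vel_x[i];
        dst[i].vel[1] = src->vel_y[i];
        dst[i].vel[2] = src->vel_z[i];
        dst[i].mass   = src->mass[i];
    }
}
```

The intended call order is `soa_alloc`, `gather_pdata`, the kernel operating on the SoA, `scatter_pdata`, then `soa_free`. The copy cost is paid once per kernel invocation, which is why it stays small when the kernel reuses the data many times.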
AoS to SoA: performance outcomes
● Gather + scatter overhead at most 1.8% of execution time → intensive data reuse.
● Performance improvement: on IVB: 13%, on KNC: 48%.
● Xeon/Xeon Phi performance ratio: from 0.15 to 0.45.
● The data structure is now vectorization-ready.
[Figure: performance as 1/exec. time; higher is better.]
Optimizing for vectorization
● Modern multi/many-core architectures rely on vectorization as an additional layer of parallelism to deliver performance.
● Mind the constraint: keep Gadget readable and portable for the wide user community! Wherever possible, avoid programming in intrinsics.
● Analysis with Intel® Advisor 2016:
  ● Most of the vectorization potential (10 to 20% of the workload) is in the kernel “compute” loop.
  ● Prototype loop in Gadget: iteration over the neighbors of a given particle.
● Similarity with many other N-body codes.
Vectorization
Vectorization: improvements from IVB to KNL
● Vectorization through localized masking (if-statement moved inside the inlined functions).
● Vector efficiency (perf. gain / vector length): on IVB: 55%, on KNC: 42%, on KNL: 83%.
[Figure legend: yellow + red bar: kernel workload; red bar: target loop for vectorization.]
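The idea of localized masking can be shown on a reduced example. The sketch below is an assumption-laden illustration: `weight_masked` is a hypothetical linear-falloff weight standing in for the inlined kernel routines, and the arrays hold precomputed distances. The first function keeps the if-statement in the loop body; the second moves the condition into the inlined function, which returns zero outside the smoothing length, so the loop body becomes straight-line code that a compiler can turn into masked vector operations.

```c
/* Hypothetical weight with the mask applied inside the function:
   contributions beyond the smoothing length h are zeroed, not skipped. */
static inline float weight_masked(float r, float h)
{
    float w = 1.0f - r / h;
    return (r < h) ? w : 0.0f;
}

/* Original style: the branch guards the whole loop body. */
static float density_branchy(const float *mass, const float *r, int n, float h)
{
    float rho = 0.0f;
    for (int i = 0; i < n; i++) {
        if (r[i] < h)
            rho += mass[i] * (1.0f - r[i] / h);
    }
    return rho;
}

/* Masked style: no branch in the loop body, every iteration does the
   same work, and out-of-range neighbours contribute exactly zero. */
static float density_masked(const float *mass, const float *r, int n, float h)
{
    float rho = 0.0f;
    #pragma omp simd reduction(+:rho)
    for (int i = 0; i < n; i++)
        rho += mass[i] * weight_masked(r[i], h);
    return rho;
}
```

Both functions return the same sum up to floating-point reordering. The masked form computes weights for neighbours that end up contributing nothing, which is the price paid for a uniform loop body; whether it pays off depends on how many neighbours fall outside the smoothing length.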
Node-level performance comparison between HSW, KNC and KNL
Features of the KNL tests:
● KMP Affinity: scatter; Memory mode: Flat; MCDRAM via numactl; Cluster mode: Quadrant.
Results:
● Our optimization improves the speed-up on all systems.
● Better threading scalability up to 136 threads on KNL.
● Hyperthreading performance is different between KNC and KNL.
Performance results on Knights Landing
Performance comparison: first results including KNL and Broadwell
● Initial vs. optimized, including all optimizations for subfind_density.
● IVB, HSW, BDW: 1 socket w/o hyperthreading. KNC: 1 MIC, 240 threads. KNL: 1 node, 136 threads.
● Performance gain:
  ● Xeon Phi: 13.7x KNC, 20.1x KNL.
  ● Xeon: 2.6x IVB, 4.8x HSW, 4.7x BDW.
Performance results
[Figure: initial vs. optimized; lower is better.]
Summary and outlook
● Code modernization as an iterative process for improving the performance of an HPC application.
● Our IPCC example: P-Gadget3. Key points of our work, guided by analysis tools:
  ● Threading parallelism
  ● Data layout
  ● Vectorization
● This effort is (mostly) portable! Good performance found on new architectures (KNL and BDW) basically out-of-the-box.
● For KNL, architecture-specific features (MCDRAM, large vector registers and NUMA characteristics) are currently under investigation for different workloads.
● An investment in the future of well-established community applications, crucial for the effective use of forthcoming HPC facilities.
https://arxiv.org/abs/1612.06090
Acknowledgements
● Research supported by the Intel® Parallel Computing Center program.
● Project coauthors: Nicolay J. Hammer (LRZ), Vasileios Karakasis (CSCS).
● P-Gadget3 developers: Klaus Dolag, Margarita Petkova, Antonio Ragagnin.
● Research collaborator at Technical University of Munich (TUM): Nikola Tchipev.
● TCEs at Intel: Georg Zitzlsberger, Heinrich Bockhorst.
● Thanks to the IXPUG community for useful discussion.
● Special thanks to Colfax Research for proposing this contribution to the MC² Series, and for granting access to their computing facilities.
