Low Power High-Performance Computing on the
BeagleBoard Platform
E. Principi, V. Colagiacomo, S. Squartini, and F. Piazza
A3Lab, Department of Information Engineering
Università Politecnica delle Marche
5th European DSP Education and Research Conference
13th–14th September 2012, Amsterdam, The Netherlands
Outline
1 Introduction
2 Purpose of this work
3 The BeagleCluster
Hardware Platform
Software Platform
4 Experiments
High-Performance Linpack
Matrix Multiplication
Speaker Diarization
Analysis of power consumption
5 Conclusions and Future Developments
Introduction
High-performance computing clusters are employed in computationally intensive tasks (e.g., weather prediction, astronomical modelling).
Usually, they are evaluated only in terms of Floating Point Operations Per Second (FLOPS), e.g., in the Top500 list.
The costs of energy and infrastructure exceed the costs of the
computational devices, and this gap is expected to grow by 2014
[Belady, 2007].
A new metric: FLOPS/Watt
Trends in the industry
• Use of processors traditionally employed in the mobile world.
• Canonical built a 42-core ARM cluster for compiling the
Ubuntu distribution.
• Calxeda developed the EnergyCore ECX-1000 series of server-on-a-chip processors based on the ARM Cortex-A9.
• Hewlett-Packard Redstone servers:
• Four rack chassis = 2,800 conventional servers
• Energy saving: 90%
• Space saving: 94%
• Currently employed in the TryStack free cloud service (http://trystack.org)
Purpose of this work
Develop
Develop an energy-efficient cluster computer composed of off-the-shelf, inexpensive hardware and open software, and propose it to the scientific community.
Evaluate
Evaluate the cluster both through conventional benchmarks and a
real-time constrained speech processing application.
Measure
Measure the power consumption of the cluster, assess the energy
efficiency, and compare it with a laptop PC.
Hardware Platform
Cluster description
The BeagleCluster is composed of five BeagleBoard-xM boards.
BeagleBoard-xM:
Processor: TI DM3730
ARM subsystem: Cortex-A8 @ 1 GHz
DSP subsystem: C64x+ @ 800 MHz
Graphics accelerator: PowerVR SGX @ 200 MHz
RAM: 512 MB DDR @ 200 MHz
Network interface: Ethernet 10/100
Cluster description (cont.)
• Asymmetric topology: one head node, four worker nodes.
• Nodes are connected to a Hewlett-Packard ProCurve 1410-8G switch through the BeagleBoard-xM 100 Mbit/s Ethernet interface.
• Nodes are powered by a Lambda AC-DC power supply.
Software Platform
• Operating system: Ångström GNU/Linux distribution (worker nodes do not have a GUI).
• Toolchain: CodeSourcery.
• Network File System (NFS): data and code are shared across all cluster nodes.
• Cluster Command and Control (C3): a suite of tools for managing the cluster (e.g., terminating processes, rebooting worker nodes, pushing drive images).
• Message Passing Interface (Argonne National Laboratory MPICH2): an application programming interface that allows the exchange of messages and data among processes running on the nodes of a cluster; a minimal usage sketch follows below.
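For readers new to MPI, here is a minimal sketch of the rank/size pattern the cluster applications below rely on (illustrative only: the file name, build line, and launch line are assumptions, not taken from the slides):

```c
/* Minimal MPICH2 sketch (hypothetical file mpi_hello.c).
 * Typical build and launch, toolchain-dependent:
 *   mpicc mpi_hello.c -o mpi_hello
 *   mpiexec -n 5 ./mpi_hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    if (rank == 0)
        printf("head node: %d processes in the cluster\n", size);
    else
        printf("worker node %d ready\n", rank);
    MPI_Finalize();
    return 0;
}
```

Launched with one process per board, rank 0 plays the head-node role and the remaining ranks act as workers.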
Software Platform (cont.)
• Ganglia: offers a web interface used to monitor cluster activity and to detect abnormal behaviour.
High-Performance Linpack (HPL)
• HPL is the de facto standard benchmark for floating point performance measurement.
• It is employed in the Top500 and Green500 lists.
• HPL solves a dense system of linear equations using double
precision arithmetic.
• Parallelism is obtained by means of MPI.
• Computation is based on BLAS (Vesperix ATLAS-ARM).
High-Performance Linpack (HPL) (cont.)
Measured performance: 258.6 MFLOPS
Energy efficiency: 13.26 MFLOPS/W

For comparison, the 500th entry of the June 2012 Green500 list (Cray XT5 SixCore, Opteron Six Core 6C 2.6 GHz, XT4 internal interconnect) achieves 32.05 MFLOPS/W.
Note
Arithmetic operations are performed in double precision in the Vector Floating Point unit; the NEON unit cannot be employed, since NEON on the Cortex-A8 does not support double precision.
Matrix Multiplication
• This benchmark shows the performance improvement that can be obtained using NEON-optimized code.
• The benchmark multiplies an m × n matrix A by an n × p matrix B.
• It operates by dividing the rows of matrix A into groups and processing each group on a different node.
• Communication among nodes is based on MPI.
Platform Execution time
BeagleCluster 42.13 s
BeagleCluster w/ NEON 5.18 s
NEON-optimized code significantly reduces the execution time ⇒ HPL performance can be improved by properly exploiting NEON (see the sketch below).
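To make the kind of optimization involved concrete, here is a sketch (not the authors' benchmark code) of a single-precision dot-product inner loop, the core of a row-times-column matrix multiplication, written with NEON intrinsics:

```c
#include <arm_neon.h>

/* Single-precision dot product with NEON intrinsics (compile with
 * e.g. -mfpu=neon; assumes n is a multiple of 4). Single precision
 * is used deliberately: as noted above, NEON on the Cortex-A8 has
 * no double-precision support. */
float dot_neon(const float *a, const float *b, int n)
{
    float32x4_t acc = vdupq_n_f32(0.0f);    /* four partial sums      */
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);  /* load four floats       */
        float32x4_t vb = vld1q_f32(b + i);
        acc = vmlaq_f32(acc, va, vb);       /* acc += va * vb, 4-wide */
    }
    /* horizontal sum of the four accumulator lanes */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
}
```

Each vmlaq_f32 performs four multiply-accumulates per call, which, combined with better memory access patterns, is consistent with the roughly 8× reduction in execution time reported above.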
Speaker Diarization
• A speaker diarization algorithm detects “who speaks now”.
• The algorithm addressed here is based on the real-time implementation described in [Colagiacomo et al., 2010].
• The calculation of the cross-correlations between the channel i signal xi(t) and the channel j signal xj(t) is the most computationally demanding part:

Cij(t) = max_τ {IFFT[FFT(xi(t) xj(t − τ)) • FFT(w(t))]}

Here, t is the time index, τ is the correlation lag, w(t) is the Hamming window, and • denotes the element-wise product.
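As an illustration of the quantity being searched, here is a simplified time-domain sketch (not the authors' code: the real implementation evaluates the correlations via FFT/IFFT as in the formula and applies the window in the frequency domain, whereas this sketch applies it pointwise for brevity):

```c
#include <stddef.h>

/* Simplified time-domain analogue of the lag search. The stride
 * dtau anticipates the Delta-tau decimation discussed in the
 * results below (dtau = 1 searches every lag). */
int best_lag(const float *xi, const float *xj, size_t n,
             const float *w, int max_lag, int dtau)
{
    float best_c = -1e30f;
    int best_tau = 0;
    for (int tau = 0; tau <= max_lag; tau += dtau) {
        float c = 0.0f;
        for (size_t t = (size_t)tau; t < n; t++)
            c += w[t] * xi[t] * xj[t - tau];   /* windowed product */
        if (c > best_c) { best_c = c; best_tau = tau; }
    }
    return best_tau;
}
```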
Speaker Diarization (cont.)
• Cluster-wide parallelism has been obtained by assigning the feature extraction stage of each channel to one of the worker nodes.
• The server process on the head node dispatches audio frames to the worker nodes through MPI_Bcast and performs the final classification (see the sketch below).
• Performance has been evaluated in terms of the Real-Time Factor (RTF):

RTF = total execution time / speech segment duration

An RTF below 1 indicates faster-than-real-time processing.
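A structural sketch of this dispatch loop (FRAME_LEN, FEAT_DIM, extract_features(), and classify() are placeholders, not the authors' API):

```c
#include <mpi.h>

#define FRAME_LEN   1024   /* samples per frame: an assumption   */
#define FEAT_DIM    32     /* feature vector size: an assumption */
#define MAX_WORKERS 8

/* Placeholder stubs; not the authors' implementation. */
static void extract_features(const float *frame, float *feat)
{
    for (int i = 0; i < FEAT_DIM; i++) feat[i] = frame[i];  /* stub */
}
static void classify(float feats[][FEAT_DIM], int n_workers)
{
    (void)feats; (void)n_workers;                           /* stub */
}

/* One iteration of the dispatch loop: rank 0 (head node) broadcasts
 * the current multichannel frame block; worker i extracts features
 * for channel i and sends them back; the head classifies. */
void process_frame(int rank, int n_workers, float *frames)
{
    /* frames holds n_workers * FRAME_LEN samples, filled by the head */
    MPI_Bcast(frames, n_workers * FRAME_LEN, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank > 0) {                        /* worker node */
        float feat[FEAT_DIM];
        extract_features(frames + (rank - 1) * FRAME_LEN, feat);
        MPI_Send(feat, FEAT_DIM, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    } else {                               /* head node   */
        float feats[MAX_WORKERS][FEAT_DIM];
        for (int i = 1; i <= n_workers; i++)
            MPI_Recv(feats[i - 1], FEAT_DIM, MPI_FLOAT, i, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        classify(feats, n_workers);
    }
}
```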
Speaker Diarization (cont.)
• Audio data: four lapel microphone signals of meeting ES2009b from the AMI corpus.
• Comparison with an Asus F9SG laptop (Intel Core2 Duo T8300 CPU at 2.4 GHz, 2 GB of RAM).
• Power consumption is measured with the LCD monitor switched off.
Speaker Diarization (cont.)
Single-board implementation results
• Real-time execution is achieved through the NEON instruction set and by reducing the number of cross-correlations: the maximum of Cij(t) is searched by incrementing τ in steps of ∆τ > 1.
∆τ Laptop (RTF) BeagleBoard-xM (RTF)
1 2.47 12.73
16 0.25 1.02
32 0.18 0.63
64 0.14 0.44
128 0.12 0.36
The choice of ∆τ is critical both for the laptop and the BeagleBoard-xM.
Speaker Diarization (cont.)
Cluster-wide implementation results
∆τ Single-board (RTF) Five nodes (RTF)
1 12.73 4.71
16 1.02 1.69
32 0.63 1.63
64 0.44 1.56
128 0.36 1.55
• The MPI version is almost three times as fast as the single-board one when ∆τ = 1.
• As ∆τ increases, the MPI implementation loses its advantage: the communication overhead becomes the new bottleneck.
Speaker Diarization (cont.)
Cluster-wide implementation
• This has been verified on a four-node cluster.
• Nodes read audio data directly from the local file system.
• One of the worker nodes performs both the feature extraction
and the classification tasks.
∆τ Five nodes (RTF) Four nodes, local data (RTF)
1 4.71 3.35
16 1.69 0.33
32 1.63 0.23
64 1.56 0.18
128 1.55 0.16
By reducing the communication overhead, real-time execution can be achieved with ∆τ = 16.
Analysis of power consumption
BeagleCluster: 20.32 W
Laptop: 32.36 W

Energy ratio:

Er = (RTFcluster · Pcluster) / (RTFlaptop · Plaptop) ≅ 1.2
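As a plausibility check, assuming the ratio uses the ∆τ = 1 RTFs (an inference from the tables above, not stated on the slides):

Er = (4.71 · 20.32 W) / (2.47 · 32.36 W) ≈ 95.7 / 79.9 ≈ 1.2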
The communication overhead limits the energy efficiency of the BeagleCluster.
Energy ratio of the four-node cluster: Er ≅ 0.69
With the communication overhead reduced, the BeagleCluster is more efficient than the laptop PC.
Conclusions
• A cluster computer based on the BeagleBoard-xM platform
has been described.
• The cluster is based on open software for executing parallel tasks, managing the cluster, and monitoring node status.
• High Performance Linpack has been used to obtain the
number of floating point operations per second.
• The performance improvement that can be achieved using
NEON optimized code has been shown by means of a matrix
multiplication benchmark.
• Processing time and power consumption have been measured
by means of a cluster-wide speaker diarization algorithm to
evaluate the real-time capabilities and the energy efficiency of
the cluster.
Conclusions (cont.)
• Results showed that using the 100 Mbit Ethernet interface,
the BeagleCluster consumes 1.2 times the energy spent by the
laptop PC.
• Removing the communication bottleneck, the BeagleCluster achieves superior energy efficiency.
• The cost of the five-node cluster is 655 €. Compared to the laptop PC, whose cost is 1100 €, the BeagleCluster is 445 € cheaper.
Future developments
• The software platform will be expanded with a resource
manager and a scheduler to enable the execution of batch
jobs.
• The energy efficiency will be assessed in a High-Availability
scenario, for example using the cluster for hosting websites.
• The use of more efficient hardware platforms (e.g.,
PandaBoards) and of the DM3730 DSP will be considered.
Thank you for your attention!
Emanuele Principi Vito Colagiacomo
e.principi@univpm.it s1037562@studenti.univpm.it
Stefano Squartini Francesco Piazza
s.squartini@univpm.it f.piazza@univpm.it
Power measurement instrument
Manufacturer: AMPROBE
Model: LH41A
Measuring range: 0–40 A, DC or AC peak
Resolution: 1 mA in the 4 A range; 10 mA in the 40 A range
Accuracy: ±1.3% + 5 digits
Frequency range: DC in DC mode; 40 Hz to 400 Hz in AC mode
High-Performance Linpack: details
Rmax: 258.6 MFLOPS
Problem size (N): 15000
Block size (NB): 16
Grid ratio (P × Q): 2 × 2
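These parameters correspond to lines in HPL's HPL.dat input file; an illustrative excerpt in the standard layout (the exact file used is not shown in the slides):

```
1            # of problems sizes (N)
15000        Ns
1            # of NBs
16           NBs
1            # of process grids (P x Q)
2            Ps
2            Qs
```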
H. W. Meuer, “The TOP500 Project: Looking Back Over 15 Years of
Supercomputing Experience,” Informatik-Spektrum, vol. 31, no. 3, pp. 203–222,
2008. [Online]. Available: http://www.top500.org
C. L. Belady, “In the Data Center, Power and Cooling Cost More Than the IT
Equipment It Supports,” Electronics Cooling Magazine, vol. 13, no. 1, May 2007.
W.-c. Feng and K. Cameron, “The Green500 List: Encouraging Sustainable
Supercomputing,” IEEE Computer, vol. 40, no. 12, pp. 50–55, Dec. 2007.
[Online]. Available: http://www.green500.org
I. Ahmad and S. Ranka, Eds., Handbook of Energy-Aware and Green Computing,
1st ed., ser. Information Science. Boca Raton, US: CRC Press, Jan. 2012.
S. Andrade, J. Dourado, and C. Maciel, “Low-power cluster using OMAP3530,”
in Proc. of EDERC, Nice, France, Dec. 2010, pp. 220–224.
K. Fürlinger, C. Klausecker, and D. Kranzlmüller, “Towards energy efficient
parallel computing on consumer electronic devices,” in Proc. of ICT-GLOW.
Berlin, Heidelberg: Springer-Verlag, 2011, pp. 1–9.
M. Brim, R. Flanery, A. Geist, B. Luethke, and S. L. Scott, “Cluster Command
and Control (C3) Tool Suite,” Parallel and Distributed Computing Practices,
vol. 4, no. 4, Dec. 2001.
Argonne National Laboratory, “MPICH2,”
http://www.mcs.anl.gov/research/projects/mpich2/.
M. L. Massie, B. N. Chun, and D. E. Culler, “The Ganglia distributed monitoring
system: design, implementation, and experience,” Parallel Computing, vol. 30,
no. 7, pp. 817–840, 2004.
M. Moattar and M. Homayounpour, “A review on speaker diarization systems
and approaches,” Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012.
E. Principi, R. Rotili, M. Wöllmer, F. Eyben, S. Squartini, and B. Schuller,
“Real-Time Activity Detection in a Multi-Talker Reverberated Environment,”
Cognitive Computation, pp. 1–12, 2012.
V. Colagiacomo, E. Principi, S. Cifani, and S. Squartini, “Real-Time Speaker
Diarization on TI OMAP3530,” in Proc. of EDERC, Nice, France, Dec. 1st-2nd
2010.
InfiniBand Trade Association, “InfiniBand Architecture Specification Release
1.2.1,” Jan. 2008.
N. J. Boden, D. Cohen, R. E. Felderman, A. Kulawik, C. Seitz, J. N. Seizovic,
and W. Su, “Myrinet: A Gigabit-per-second Local Area Network,” IEEE Micro,
vol. 15, no. 1, pp. 29–36, Feb. 1995.
  • 36. Argonne National Laboratory, “MPICH2,” http://www.mcs.anl.gov/research/projects/mpich2/. M. L. Massie, B. N. Chun, and D. E. Culler, “The Ganglia distributed monitoring system: design, implementation, and experience,” Parallel Computing, vol. 30, no. 7, pp. 817–840, 2004. M. Moattar and M. Homayounpour, “A review on speaker diarization systems and approaches,” Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012. E. Principi, R. Rotili, M. W¨ollmer, F. Eyben, S. Squartini, and B. Schuller, “Real-Time Activity Detection in a Multi-Talker Reverberated Environment,” Cognitive Computation, pp. 1–12, 2012. V. Colagiacomo, E. Principi, S. Cifani, and S. Squartini, “Real-Time Speaker Diarization on TI OMAP3530,” in Proc. of EDERC, Nice, France, Dec. 1st-2nd 2010. InfiniBand Trade Association, “InfiniBand Architecture Specification Release 1.2.1,” Jan. 2008. N. J. Boden, D. Cohen, R. E. Felderman, A. Kulawik, C. Seitz, J. N. Seizovic, and W. Su, “Myrinet: A Gigabit-per-second Local Area Network,” IEEE Micro, vol. 15, no. 1, pp. 29–36, Feb. 1995. 25 / 25