Paolo DiTommaso - Notredame Lab 

Center for Genomic Regulation (CRG)
HPC Advisory Council - 22 March 2016, Lugano
Reproducible computational
pipelines with Docker and
Nextflow
@PaoloDiTommaso
Research software engineer
Comparative Bioinformatics,
Notredame Lab
Center for Genomic Regulation (CRG)
TWO MAJOR CHALLENGES
IN COMPUTATIONAL
BIOLOGY
COMPUTATIONAL
REPRODUCIBILITY CRISIS
Replicating the results of a typical computational biology paper requires 280 hours!
WHAT'S WRONG WITH
COMPUTATIONAL
WORKFLOWS?
COMPLEXITY
• Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
• Academic software is experimental by nature and tends to be difficult to install, configure and deploy
• Heterogeneous execution platforms and system architectures (laptop → supercomputer)
CONTAINERS ARE
THE THIRD BIG WAVE
IN VIRTUALISATION
TECHNOLOGY
BENEFITS
• Smaller images (~100MB)
• Fast instantiation time (~1sec)
• Almost native performance
• Easy to build, publish, share and deploy
• Transparent build process
TRANSPARENT EXECUTION
cmd_x --opt file.txt
docker run -v $PWD:$PWD -w $PWD <image> cmd_x --opt file.txt
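The wrapping shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how Nextflow implements it: the function name `wrap_in_docker`, the image name `ubuntu:22.04` and the path `/data/run1` are hypothetical.

```python
import os
import shlex


def wrap_in_docker(image, command, workdir=None):
    """Return the argv list that runs `command` inside `image`.

    The working directory is mounted at the same path inside the
    container and used as the container's working directory, so a
    relative path such as file.txt resolves as it does on the host.
    """
    workdir = workdir or os.getcwd()
    return [
        "docker", "run",
        "-v", f"{workdir}:{workdir}",  # mount the host dir at the same path
        "-w", workdir,                 # start the command in that dir
        image,
        *shlex.split(command),
    ]


if __name__ == "__main__":
    # Hypothetical image and directory, used only for illustration.
    argv = wrap_in_docker("ubuntu:22.04", "cmd_x --opt file.txt", "/data/run1")
    print(shlex.join(argv))
```

Because the host path and the container path are identical, the wrapped command needs no changes to its own arguments, which is what makes the execution transparent.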
Host
NAIVE APPROACH
Docker image
User application
Binary tools
Workflow scripts
Config files
Libraries
Environment
Operating System
Docker engine
SCALING OUT
. . . .
CONTAINER ORCHESTRATION
• Swarm
• Fleet
• Kubernetes
• Marathon
NOT THE RIGHT ANSWER
FOR COMPUTATIONAL
PIPELINES
SERVICE ORCHESTRATION ≠ TASK SCHEDULING
OUR SOLUTION
Nextflow
Host file system
Registry
• A workflow framework that allows the same pipeline to run across different platforms
• Provides a high-level parallelisation model
• Isolates task dependencies using containers
• Enables fast prototyping by reusing any existing piece of software
process foo {
    input:
    val str from 'Hello'        // value input bound to the variable str
    output:
    file 'my_file' into result  // output file sent to the channel 'result'
    script:
    """
    echo $str world! > my_file
    """
}
PROCESS DEFINITION
REACTIVE NETWORK
DATAFLOW
• Declarative computational model for concurrent processes
• Processes wait for data; when an input set is ready, the process is executed
• They communicate through dataflow variables, i.e. asynchronous streams of data called channels
• Parallelisation and task dependencies are implicitly defined by process in/out declarations
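The dataflow idea in the bullets above can be sketched with Python's `asyncio`, using queues as stand-ins for channels. This is a toy model under stated assumptions, not Nextflow's implementation: the names `process`, `run_pipeline` and `DONE` are hypothetical, the "tools" are plain string functions, and each process handles its inputs one at a time, whereas a real workflow engine would launch a separate task per input set.

```python
import asyncio

DONE = object()  # sentinel marking the end of a channel


async def process(func, inp, out):
    """Apply `func` to every item arriving on `inp`; emit results on `out`."""
    while True:
        item = await inp.get()      # wait until an input is ready
        if item is DONE:
            await out.put(DONE)     # propagate end-of-stream downstream
            return
        await out.put(func(item))


async def run_pipeline(samples):
    reads, mapped, transcripts = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    # Dependencies are stated only by which channel feeds which process.
    tasks = [
        asyncio.create_task(process(lambda s: f"{s}.bam", reads, mapped)),
        asyncio.create_task(process(lambda b: f"{b}.gtf", mapped, transcripts)),
    ]
    for sample in samples:
        await reads.put(sample)
    await reads.put(DONE)

    results = []
    while (item := await transcripts.get()) is not DONE:
        results.append(item)
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["ggal_gut", "ggal_liver"])))
```

Neither process calls the other: the second one starts working as soon as the first emits a value, so the execution order follows from the channel wiring alone.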
PLATFORM AGNOSTIC
[Diagram: nextflow components: DSL interpreter, Dataflow, Task dispatcher, Executors (POSIX processes; qsub/bsub/... tasks)]
SUPPORTED PLATFORMS
BATCH SCHEDULER
[Diagram: nextflow on the login node submits tasks to the batch scheduler; the cluster nodes share an NFS file system]
DISTRIBUTED MODE
[Diagram: a job request goes from the login node to the cluster nodes; storage is shared over NFS/Lustre]
Job wrapper:
#!/bin/bash
#$ -q <queue>
#$ -pe ompi <nodes>
#$ -l virtual_free=<mem>
mpirun nextflow run <your-pipeline> -with-mpi
[Diagram: inside the HPC cluster, a nextflow cluster made of one nextflow driver and several nextflow workers]
USE CASE
• Deploying a phylogenetic pipeline on BSC MareNostrum
• 500 lines of Nextflow scripting
• ~400k jobs
• 512 cores, 32 nodes
• ~50k CPU hours
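A quick back-of-the-envelope calculation puts these figures in perspective. It is a rough sketch only: the inputs are the approximate numbers from the slide, the last figure is read as CPU-hours, and the wall-clock estimate assumes all 512 cores stay busy the whole time, which a real run would not achieve. The function name `use_case_summary` is hypothetical.

```python
def use_case_summary(jobs, cores, nodes, cpu_hours):
    """Derive per-node and per-job figures from the use-case totals."""
    return {
        "cores_per_node": cores // nodes,
        "cpu_minutes_per_job": cpu_hours / jobs * 60,
        # Lower bound on elapsed time: assumes perfect utilisation.
        "ideal_wall_hours": cpu_hours / cores,
    }


if __name__ == "__main__":
    summary = use_case_summary(jobs=400_000, cores=512, nodes=32, cpu_hours=50_000)
    for name, value in summary.items():
        print(f"{name}: {value:g}")
```

Under these assumptions the average job is short (a few CPU-minutes), which is the regime where per-task scheduling and container start-up overheads matter most.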
CONFIGURATION FILE
process {
    executor = 'slurm'
    queue = 'cn-el6'
    memory = '10GB'
    cpus = 8
    time = '2h'
    container = 'your/image:latest'
}
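To make the role of these settings concrete, the sketch below renders the same resource requests as a SLURM batch header. It is an illustration of the mapping, not the wrapper script Nextflow actually generates: the function name `sbatch_header` is hypothetical, and the values are converted by hand into the forms SLURM accepts (`10G`, `02:00:00`).

```python
def sbatch_header(queue, memory, cpus, time):
    """Render resource requests as #SBATCH directive lines."""
    return [
        "#!/bin/bash",
        f"#SBATCH --partition={queue}",
        f"#SBATCH --mem={memory}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time}",
    ]


if __name__ == "__main__":
    # Same values as the configuration file, rewritten in SLURM's notation.
    print("\n".join(sbatch_header("cn-el6", "10G", 8, "02:00:00")))
```

Keeping these requests in a configuration file rather than in the pipeline script is what lets the same pipeline move to another scheduler by changing only the `executor` setting.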
DOCKER AT CRG
[Diagram: Nextflow (config file, pipeline script), docker registry, head node, Univa grid engine]
PROS
• Dead easy deployment procedure
• Self-contained and precisely controlled runtime
• Rapidly reproduce any former configuration
• Consistent results over time and across different
platforms
CONS
• Requires a modern Linux kernel (≥3.10)
• Security concerns
• Containers/images cleanup
SHIFTER
• Container technology developed at NERSC
• Nextflow has built-in support for Shifter
• Experimental feature, under test
• It only requires an extra setting in the
configuration file
WHAT ABOUT
PERFORMANCE?
BENCHMARK*
* DiTommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273. https://dx.doi.org/10.7717/peerj.1273
DEMO
$ nextflow run nextflow-io/rnatoy -with-docker
N E X T F L O W ~ version 0.14.3
Pulling nextflow-io/rnatoy ...
downloaded from https://github.com/nextflow-io/rnatoy.git
Launching 'nextflow-io/rnatoy' - revision: 9c61bf5ac5 [master]
R N A T O Y P I P E L I N E
=================================
genome : /User/../data/ggal_1_4885000_49020000.Ggal71.500bp.fa
annotat : /User/../data/ggal_1_4885000_49020000.bed.gff
pair1 : /User/../data/*_1.fq
pair2 : /User/../data/*_2.fq
[warm up] executor > local
[02/b08c28] Submitted process > buildIndex (ggal_1_4885000_49020000.Ggal71)
[ea/97d004] Submitted process > mapping (ggal_gut)
[98/16c9e5] Submitted process > mapping (ggal_liver)
[b5/38a0c7] Submitted process > makeTranscript (ggal_gut)
[00/e5efd6] Submitted process > makeTranscript (ggal_liver)
Saving: transcript_ggal_gut.gtf
Saving: transcript_ggal_liver.gtf
$ nextflow run nextflow-io/rnatoy -revision v1.0
N E X T F L O W ~ version 0.14.3
Launching 'nextflow-io/rnatoy' - revision: 0d0443d8f7 [v1.0]
R N A T O Y P I P E L I N E
=================================
[35/cb611b] Submitted process > prepareTranscriptome (1)
[cd/239926] Submitted process > buildIndex (1)
[c6/f6488d] Submitted process > mapping (2)
[bc/b3ea76] Submitted process > mapping (1)
[f4/8d4628] Submitted process > makeTranscript (1)
[eb/92db7f] Submitted process > makeTranscript (2)
Saving: transcript_ggal_alpha.gtf
Saving: transcript_ggal_beta.gtf
$ vim nextflow.config
process {
    executor = 'slurm'
    memory = 10.GB
    cpus = 32
}
WHO IS USING NEXTFLOW?
International Agency for Research on Cancer
Lyon, France
Writing reproducible and
scalable bioinformatics pipelines
using nextflow, docker and github
Matthieu Foll
Nov. 12th 2015
CONCLUSION
• Containers are a game-changer for packaging and deploying computational workflows
• Nextflow is a reactive/functional framework for computational workflows
• Docker + Nextflow = reproducible, self-contained pipelines
ACKNOWLEDGMENT
Evan Floden, CRG
Emilio Palumbo, CRG
Maria Chatzou, CRG
Cedric Notredame, CRG
THANKS
LINKS
Project home: http://nextflow.io
GitHub repository: http://github.com/nextflow-io/nextflow
Docker benchmark: https://peerj.com/articles/1273/
Docker-Univa white paper: http://www.nextflow.io/misc/Univa-Docker-Whitepaper_FINAL.pdf
