Paolo DiTommaso - Notredame Lab 

Center for Genomic Regulation (CRG)
HPC Advisory Council - 22 March 2016, Lugano
Reproducible computational pipelines with Docker and Nextflow
@PaoloDiTommaso
Research software engineer
Comparative Bioinformatics,
Notredame Lab
Center for Genomic Regulation (CRG)
TWO MAJOR CHALLENGES IN COMPUTATIONAL BIOLOGY
COMPUTATIONAL REPRODUCIBILITY CRISIS
Replicating the results of a typical computational biology paper requires 280 hours!
WHAT'S WRONG WITH COMPUTATIONAL WORKFLOWS?
COMPLEXITY
• Dozens of dependencies (binary tools, compilers, libraries, system tools, etc.)
• Academic software is often experimental and tends to be difficult to install, configure and deploy
• Heterogeneous execution platforms and system architectures (laptop → supercomputer)
CONTAINERS ARE THE THIRD BIG WAVE IN VIRTUALISATION TECHNOLOGY
BENEFITS
• Smaller images (~100MB)
• Fast instantiation time (~1sec)
• Almost native performance
• Easy to build, publish, share and deploy
• Transparent build process
TRANSPARENT EXECUTION
cmd_x --opt file.txt
docker run -v $PWD:$PWD -w $PWD <image> cmd_x --opt file.txt
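The two commands above differ only by a fixed prefix, so the wrapping can be generated mechanically. The following is a minimal Python sketch of that idea, not how Nextflow itself builds the command; the helper name `wrap_with_docker`, the image name `my/image` and the path `/home/user/project` are hypothetical.

```python
import os
import shlex


def wrap_with_docker(command, image, workdir=None):
    """Return the argv list that runs `command` inside `image`.

    The working directory is bind-mounted at the same path inside the
    container (-v) and set as the container working directory (-w), so a
    relative path such as file.txt resolves exactly as it does on the host.
    """
    workdir = workdir or os.getcwd()
    return ["docker", "run", "-v", f"{workdir}:{workdir}", "-w", workdir, image, *command]


# Hypothetical image name and project directory, for illustration only.
argv = wrap_with_docker(["cmd_x", "--opt", "file.txt"], "my/image", workdir="/home/user/project")
print(shlex.join(argv))
```

Because the original command is passed through unchanged, the tool itself does not need to know it is running in a container.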
NAIVE APPROACH
[Figure: stack diagram. Labels: Host; Docker image; User application; Binary tools; Workflow scripts; Config files; Libraries; Environment; Operating System; Docker engine]
SCALING OUT
CONTAINERS ORCHESTRATION
• Swarm
• Fleet
• Kubernetes
• Marathon
NOT THE RIGHT ANSWER FOR COMPUTATIONAL PIPELINES
SERVICES ORCHESTRATION ≠ TASKS SCHEDULING
OUR SOLUTION
[Figure labels: Nextflow; Host file system; Registry]
• A workflow framework that allows the same pipeline to run across different platforms
• Provides a high-level parallelisation model
• Isolates task dependencies using containers
• Enables fast prototyping by reusing any existing piece of software
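PROCESS DEFINITION
process foo {
    // input: bind the value 'Hello' to the variable str
    input:
    val str from 'Hello'

    // output: send the file written by the script into the 'result' channel
    output:
    file 'my_file' into result

    // script: the shell command executed by the task
    script:
    """
    echo $str world! > my_file
    """
}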
REACTIVE NETWORK
DATAFLOW
• Declarative computational model for concurrent processes
• Processes wait for data; when an input set is ready, the process is executed
• Processes communicate through dataflow variables, i.e. asynchronous streams of data called channels
• Parallelisation and task dependencies are implicitly defined by the processes' input/output declarations
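The dataflow model described above can be illustrated with a small Python sketch. It is an analogy built on threads and queues, not Nextflow's implementation: the `process` helper, the `DONE` end-of-stream marker and the sample values are all hypothetical.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not a Nextflow concept


def process(task, inputs, outputs):
    """Run `task` on every item arriving on `inputs`, emitting results to `outputs`."""
    def loop():
        while True:
            item = inputs.get()      # block until data is ready
            if item is DONE:
                outputs.put(DONE)    # propagate end-of-stream downstream
                return
            outputs.put(task(item))
    worker = threading.Thread(target=loop)
    worker.start()
    return worker


# Channels are plain thread-safe FIFO queues in this sketch.
words, upper, greeted = queue.Queue(), queue.Queue(), queue.Queue()

# The wiring alone defines the dependency: the second process consumes
# whatever the first one emits, and both run concurrently.
workers = [
    process(str.upper, words, upper),
    process(lambda s: f"{s} world!", upper, greeted),
]

for word in ["Hello", "Ciao"]:
    words.put(word)
words.put(DONE)

for worker in workers:
    worker.join()

results = []
while True:
    item = greeted.get()
    if item is DONE:
        break
    results.append(item)
print(results)
```

Neither process is told when to start or which one runs first; execution order follows from which channel each one reads and writes.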
PLATFORM AGNOSTIC
[Figure: architecture diagram. Labels: nextflow; DSL interpreter; Dataflow; Task dispatcher; Executors; POSIX processes; qsub/bsub/... tasks]
SUPPORTED PLATFORMS
BATCH SCHEDULER
[Figure labels: nextflow; login node; NFS; batch scheduler; submit tasks; cluster node (×5)]
DISTRIBUTED MODE
[Figure labels: Login node; NFS/Lustre; Job request; cluster node (×2); HPC cluster; nextflow cluster; nextflow driver; nextflow worker (×3)]
Job wrapper:
#!/bin/bash
#$ -q <queue>
#$ -pe ompi <nodes>
#$ -l virtual_free=<mem>
mpirun nextflow run <your-pipeline> -with-mpi
USE CASE
• Deploying a phylogenetic pipeline on BSC MareNostrum
• 500 lines of Nextflow scripting
• ~400k jobs
• 512 cores on 32 nodes
• ~50k CPU hours
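A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below assumes the last figure means about 50,000 CPU hours in total and treats all the approximate values as exact, so the results are rough estimates rather than measured values.

```python
jobs = 400_000
cores = 512
nodes = 32
cpu_hours = 50_000  # assumption: the slide's last figure means total CPU hours

cores_per_node = cores // nodes          # cores available on each node
minutes_per_job = cpu_hours / jobs * 60  # average CPU time of a single job
ideal_wall_hours = cpu_hours / cores     # lower bound with every core always busy

print(cores_per_node)
print(minutes_per_job)
print(round(ideal_wall_hours, 1))
```

Under these assumptions the workload consists of many short jobs, a few minutes each on average, which is the regime where scheduling overhead matters most.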
CONFIGURATION FILE
process {
    executor = 'slurm'
    queue = 'cn-el6'
    memory = '10GB'
    cpus = 8
    time = '2h'
    container = 'your/image:latest'
}
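To make the role of these settings concrete, the Python sketch below renders similar directives as a SLURM batch-script header. It is an illustration only and does not reproduce how Nextflow generates job scripts: the `sbatch_header` helper is hypothetical, and the memory and time values are assumed to be already written in SLURM's own notation (`10G`, `02:00:00`). The `container` setting has no scheduler counterpart in this sketch, since it concerns how the task is launched rather than how it is queued.

```python
def sbatch_header(directives):
    """Render resource directives as #SBATCH lines (illustrative mapping only)."""
    flag_for = {
        "queue": "--partition",
        "cpus": "--cpus-per-task",
        "memory": "--mem",
        "time": "--time",
    }
    lines = ["#!/bin/bash"]
    for name, flag in flag_for.items():
        if name in directives:  # skip resources that were not requested
            lines.append(f"#SBATCH {flag}={directives[name]}")
    return "\n".join(lines)


# Values are assumed to be pre-converted to SLURM's notation.
directives = {"queue": "cn-el6", "cpus": 8, "memory": "10G", "time": "02:00:00"}
print(sbatch_header(directives))
```

Keeping resource requests in a configuration file rather than in the pipeline script is what lets the same pipeline move between schedulers by changing only the `executor` value.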
DOCKER AT CRG
[Figure labels: Nextflow; Config file; Pipeline script; docker registry; head node; Univa grid engine]
PROS
• Dead easy deployment procedure
• Self-contained and precisely controlled runtime
• Rapidly reproduce any former configuration
• Consistent results over time and across different platforms
CONS
• Requires a modern Linux kernel (≥3.10)
• Security concerns
• Containers/images cleanup
SHIFTER
• Container technology developed at NERSC
• Nextflow has built-in support for Shifter
• Experimental feature, under test
• It only requires an extra setting in the configuration file
WHAT ABOUT PERFORMANCE?
BENCHMARK*
* DiTommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273
https://dx.doi.org/10.7717/peerj.1273
DEMO
$ nextflow run nextflow-io/rnatoy -with-docker
N E X T F L O W ~ version 0.14.3
Pulling nextflow-io/rnatoy ...
downloaded from https://github.com/nextflow-io/rnatoy.git
Launching 'nextflow-io/rnatoy' - revision: 9c61bf5ac5 [master]
R N A T O Y P I P E L I N E
=================================
genome : /User/../data/ggal_1_4885000_49020000.Ggal71.500bp.fa
annotat : /User/../data/ggal_1_4885000_49020000.bed.gff
pair1 : /User/../data/*_1.fq
pair2 : /User/../data/*_2.fq
[warm up] executor > local
[02/b08c28] Submitted process > buildIndex (ggal_1_4885000_49020000.Ggal71)
[ea/97d004] Submitted process > mapping (ggal_gut)
[98/16c9e5] Submitted process > mapping (ggal_liver)
[b5/38a0c7] Submitted process > makeTranscript (ggal_gut)
[00/e5efd6] Submitted process > makeTranscript (ggal_liver)
Saving: transcript_ggal_gut.gtf
Saving: transcript_ggal_liver.gtf
$ nextflow run nextflow-io/rnatoy -revision v1.0
N E X T F L O W ~ version 0.14.3
Launching 'nextflow-io/rnatoy' - revision: 0d0443d8f7 [v1.0]
R N A T O Y P I P E L I N E
=================================
[35/cb611b] Submitted process > prepareTranscriptome (1)
[cd/239926] Submitted process > buildIndex (1)
[c6/f6488d] Submitted process > mapping (2)
[bc/b3ea76] Submitted process > mapping (1)
[f4/8d4628] Submitted process > makeTranscript (1)
[eb/92db7f] Submitted process > makeTranscript (2)
Saving: transcript_ggal_alpha.gtf
Saving: transcript_ggal_beta.gtf
$ vim nextflow.config
process {
executor = 'slurm'
memory = 10.GB
cpus = 32
}
WHO IS USING NEXTFLOW?
International Agency for Research on Cancer
Lyon, France
Writing reproducible and scalable bioinformatics pipelines using nextflow, docker and github
Matthieu Foll
Nov. 12th 2015
CONCLUSION
• Containers are a game-changer for packaging and deploying computational workflows
• Nextflow is a reactive/functional framework for computational workflows
• Docker + Nextflow = reproducible, self-contained pipelines
ACKNOWLEDGMENT
Evan Floden, CRG
Emilio Palumbo, CRG
Maria Chatzou, CRG
Cedric Notredame, CRG
THANKS
LINKS
Project home
http://nextflow.io
GitHub repository
http://github.com/nextflow-io/nextflow
Docker benchmark
https://peerj.com/articles/1273/
Docker-Univa white paper
http://www.nextflow.io/misc/Univa-Docker-Whitepaper_FINAL.pdf
