International Agency for Research on Cancer
Lyon, France
Computational workflows for
omics analyses at the IARC
Dr. Matthieu Foll
nextflow CRG course
September 15th
2017
IARC
• Specialised cancer agency (~350 people) of the
World Health Organization (WHO)
• Well known for “blue books” and monographs
• Also produces original research, promoting
collaboration
• Particular interest in low- and middle-income countries
• Cancer causes and prevention
Bioinformatics @IARC
• Data mostly comes from high throughput sequencing and
arrays:
- Genetics, genomics, transcriptomics, epigenetics etc.
- Human mostly, mouse, viruses (HPV, EBV…)
• Research groups with different scientific questions:
- Rarely use standardised methods, no routine work.
- We are not big enough and the work is too heterogeneous to
have a central bioinformatics “service”
• Bioinformatics in each group with an overall coordination
Goals and challenges
• Heterogeneous staff (students, postdocs, research
assistants etc.)
- Writing pipelines is not their primary job in most cases
- Pipelines should bring more benefit than they cost
• Avoid duplicating efforts
• Foster collaboration
• Keep everyone’s speciality
• High turnover of students/postdocs
• Promote best practices
Workflows we like
• Good science: state of the art methods
• Easy to install, easy to use, easy to understand
• Easy to write for the developer
• Reproducible (and useful), portable and scalable
• Open (open source, comments and improvements,
bug tracking, version control)
• Modular
Our philosophy
• “Do It Once, Do It Right, And Use It Everywhere”
• “Keep it simple, stupid” (KISS principle):
- most systems work best if they are kept simple
- simplicity should be a key goal in design
- code easier to maintain and to understand
Our design
• Too much automation is not for us:
- Hard to read, to maintain and to keep modular
- e.g. we prefer to have one alignment pipeline, one variant calling
pipeline, one annotation pipeline and one QC pipeline.
• One pipeline = one GitHub repo
• Docker and Singularity containers
• CircleCI for tests and deployment
• Standardised readme, params, help etc.
• Use GitHub issues and releases
• Master branch ← beta branch ← dev branch
A pipeline life cycle
In practice
• Entry point: GitHub group
- https://github.com/IARCbioinfo
• One central repo references all nextflow pipelines:
- https://github.com/IARCbioinfo/IARC-nf
- List pipelines with a short description
- One pipeline = one repo, ends with “-nf”
- Common instructions to use the pipelines (install nextflow,
configuration, basic usage, docker…)
• A “template-nf” nextflow “hello-world” repo
Examples
• https://github.com/IARCbioinfo/RNAseq-nf
• https://github.com/IARCbioinfo/template-nf
Challenges we face
• We like Unix pipes to avoid intermediate files, but having
multiple processes in nextflow is easier to read/debug/maintain
• What to put as parameters?
• One larger pipeline or split into separate pipelines?
• People might use our pipelines as black boxes
• Users no longer realise they are using an HPC
• CWL? WDL?
• Learning nextflow, a good investment?
What we love
• Integration with GitHub
• Running any pipeline on any machine in <5 minutes
• Running on a cluster is as simple as a one line config file
• Separate the pipeline definition from the execution aspects
• History, log, trace, timeline
• Resume a pipeline
• Docker and Singularity
• The Gitter chat
What we hate
• The learning curve
• When we (think we) have to guess the syntax
• Debugging
• Syncing channels for multiple inputs
• Creating sets/lists/channels to have multiple inputs in a process
• Dealing with optional steps in a pipeline
• Large “work” directories
• Ending up with several logs and trace files in a directory
• Copy/Pasting processes in different pipelines
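
To make the "syncing channels" pain point concrete, here is a minimal Python sketch of the underlying idea: two streams of per-sample files arrive in different orders and have to be paired by sample ID before a step that needs both. This is not nextflow code; the function name `join_by_key` and the sample and file names are hypothetical, chosen only for illustration.

```python
def join_by_key(left, right):
    """Pair items from two streams that share the same key.

    Each item is a (key, value) tuple. The result follows the order of
    `left`; keys present in only one stream are dropped.
    """
    right_by_key = {key: value for key, value in right}
    return [
        (key, value, right_by_key[key])
        for key, value in left
        if key in right_by_key
    ]


# Hypothetical inputs: the two streams are out of order and S3 has no VCF yet.
bams = [("S2", "S2.bam"), ("S1", "S1.bam"), ("S3", "S3.bam")]
vcfs = [("S1", "S1.vcf"), ("S2", "S2.vcf")]

for sample_id, bam, vcf in join_by_key(bams, vcfs):
    print(sample_id, bam, vcf)
```

Pairing by an explicit key rather than by arrival order is what keeps each sample's files together when upstream steps finish at different times.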
What we would love
• Deleting large intermediate files as soon as they
are no longer needed
• Importing processes
• Automatically generating usage from params
• Splitting bed files with a splitBed operator
• A nice html report, an email, WebUI for monitoring
• nextflow available in the clouds we want to use
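
As an illustration of what a "splitBed" step might do, the following Python sketch distributes BED regions over a fixed number of chunks of roughly equal total size, so each chunk could be handed to a separate job. It is an assumption about the desired behaviour, not an existing nextflow operator; the name `split_bed` and the greedy balancing strategy are hypothetical choices.

```python
def split_bed(lines, n_chunks):
    """Distribute BED records over n_chunks lists of similar total length (bp).

    `lines` is an iterable of BED text lines; only the first three columns
    (chrom, start, end) are used. Header and comment lines are skipped.
    """
    records = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        records.append((chrom, int(start), int(end)))

    chunks = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    # Greedy balancing: place the largest region first, always into the
    # chunk that currently holds the fewest base pairs.
    for rec in sorted(records, key=lambda r: r[2] - r[1], reverse=True):
        i = sizes.index(min(sizes))
        chunks[i].append(rec)
        sizes[i] += rec[2] - rec[1]
    return chunks


# Hypothetical four-region BED file split into two chunks.
bed = ["chr1\t0\t1000", "chr1\t2000\t2500", "chr2\t0\t400", "chr2\t500\t600"]
for chunk in split_bed(bed, 2):
    print(sum(end - start for _, start, end in chunk), chunk)
```

Balancing by base pairs rather than by number of regions is one possible design choice; it assumes that run time scales with the amount of sequence covered.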
Conclusion
• nextflow was a good choice for us
• it has dramatically changed the way we work
• how do we work together?
Jon Claerbout
“It's not really for the benefit of other people. Experience shows
the principal beneficiary of reproducible research is you the author
yourself”
Join us!
• PhD student
• PostDoc
• Bioinformatics research assistant
• Staff scientists
http://www.iarc.fr/en/vacancies/
follm@iarc.fr https://github.com/IARCbioinfo
