Computational workflows for omics analyses at the IARC

International Agency for Research on Cancer
Lyon, France
Dr. Matthieu Foll
nextflow CRG course
September 15th, 2017

IARC
• Specialised cancer agency (~350 people) of the
World Health Organization (WHO)
• Well known for “blue books” and monographs
• Also produces original research, promoting
collaboration
• Particular interest in low and middle-income countries
• Cancer causes and prevention

Bioinformatics @IARC
• Data mostly comes from high throughput sequencing and
arrays:
- Genetics, genomics, transcriptomics, epigenetics etc.
- Human mostly, mouse, viruses (HPV, EBV…)
• Research groups with different scientific questions:
- Rarely use standardised methods, no routine work.
- We are not big enough and the work is too heterogeneous to
have a central bioinformatics “service”
• Bioinformatics in each group with an overall coordination

Goals and challenges
• Heterogeneous staff (students, postdocs, research
assistants etc.)
- Writing pipelines is not their primary job in most cases
- Pipelines must bring them more benefit than they cost
• Avoid duplicating efforts
• Foster collaboration
• Keep everyone’s speciality
• High turnover of students/postdocs
• Promote best practices

Workflows we like
• Good science: state of the art methods
• Easy to install, easy to use, easy to understand
• Easy to write for the developer
• Reproducible (and useful), portable and scalable
• Open (open source, comments and improvements,
bug tracking, version control)
• Modular

Our philosophy
• “Do It Once, Do It Right, And Use It Everywhere”
• “Keep it simple, stupid” (KISS principle):
- most systems work best if they are kept simple
- simplicity should be a key goal in design
- code easier to maintain and to understand

Our design
• Too much automation is not for us:
- Hard to read, to maintain and to keep modular
- e.g. we prefer to have one alignment pipeline, one variant calling
pipeline, one annotation pipeline and one QC pipeline.
• One pipeline = one GitHub repo
• Docker and Singularity containers
• CircleCI for tests and deployment
• Standardised readme, params, help etc.
• Use GitHub issues and releases
• Master branch ← beta branch ← dev branch

In practice
• Entry point: GitHub group
- https://github.com/IARCbioinfo
• One central repo references all nextflow pipelines:
- https://github.com/IARCbioinfo/IARC-nf
- List pipelines with a short description
- One pipeline = one repo, ends with “-nf”
- Common instructions to use the pipelines (install nextflow,
configuration, basic usage, docker…)
• A “template-nf” nextflow “hello-world” repo

Examples
• https://github.com/IARCbioinfo/RNAseq-nf
• https://github.com/IARCbioinfo/template-nf

Challenges we face
• We like Unix pipes to avoid intermediate files, but having
multiple processes in nextflow is easier to read, debug and
maintain
• What to put as parameters?
• One larger pipeline or split into separate pipelines?
• People might use our pipelines as black boxes
• Users no longer realise they are using an HPC
• CWL? WDL?
• Learning nextflow, a good investment?

What we love
• Integration with GitHub
• Running any pipeline on any machine in <5 minutes
• Running on a cluster is as simple as a one line config file
• Separate the pipeline definition from the execution aspects
• History, log, trace, timeline
• Resume a pipeline
• Docker and Singularity
• The Gitter chat

What we hate
• The learning curve
• When we (think we) have to guess the syntax
• Debugging
• Syncing channels for multiple inputs
• Creating sets/lists/channels to have multiple inputs in a process
• Dealing with optional steps in a pipeline
• Large “work” directories
• Ending up with several logs and trace files in a directory
• Copy/Pasting processes in different pipelines
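
To illustrate the "syncing channels" problem: two upstream steps may emit results for the same samples in different orders, so the outputs have to be matched by sample ID before a downstream step can consume them together. The sketch below is a Python analogy only, not nextflow syntax; the function name `join_by_key` and the file names are hypothetical.

```python
from typing import Iterable, List, Tuple


def join_by_key(
    left: Iterable[Tuple[str, str]],
    right: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Pair up (sample_id, file) items from two streams by sample_id.

    Items are emitted in the order of `left`; samples missing from
    `right` are dropped rather than paired with the wrong file.
    """
    right_by_key = dict(right)
    return [
        (key, left_file, right_by_key[key])
        for key, left_file in left
        if key in right_by_key
    ]


# Hypothetical outputs of two steps, arriving in different orders.
bams = [("s2", "s2.bam"), ("s1", "s1.bam")]
vcfs = [("s1", "s1.vcf"), ("s2", "s2.vcf"), ("s3", "s3.vcf")]
paired = join_by_key(bams, vcfs)
```

Matching on an explicit key, rather than relying on arrival order, is what keeps each sample's files together when tasks finish out of order.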

What we would love
• Deleting large intermediate files as soon as they
are no longer needed
• Importing processes
• Automatically generating usage from params
• Splitting bed files with a splitBed operator
• A nice html report, an email, WebUI for monitoring
• nextflow available in the clouds we want to use
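
As a rough idea of what a splitBed step could do, the sketch below splits the regions of a BED file into a fixed number of chunks of similar total size, so that each chunk could be processed as a separate task. It is written in Python for illustration and is not an existing nextflow operator; the names `parse_bed` and `split_bed` and the greedy balancing strategy are assumptions.

```python
from typing import List, Tuple

# (chrom, start, end) with BED-style 0-based, half-open coordinates.
Region = Tuple[str, int, int]


def parse_bed(text: str) -> List[Region]:
    """Read the first three columns of BED text, skipping header lines."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def split_bed(regions: List[Region], n_chunks: int) -> List[List[Region]]:
    """Greedy split: place each region, largest first, in the smallest chunk."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Region]] = [[] for _ in range(n_chunks)]
    sizes = [0] * n_chunks
    by_length = sorted(regions, key=lambda r: r[2] - r[1], reverse=True)
    for region in by_length:
        smallest = sizes.index(min(sizes))
        chunks[smallest].append(region)
        sizes[smallest] += region[2] - region[1]
    return chunks


bed_text = "chr1\t0\t100\nchr1\t200\t250\nchr2\t0\t50\n# a comment\n"
regions = parse_bed(bed_text)
chunks = split_bed(regions, 2)
```

Balancing by total bases rather than by number of lines keeps task run times comparable when region sizes vary a lot. Regions are never cut in this sketch, so a single very large region still ends up in one chunk.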

Conclusion
• nextflow was a good choice for us
• it has dramatically changed the way we work
• how do we work together?
“It's not really for the benefit of other people. Experience shows
the principal beneficiary of reproducible research is you the author
yourself”
(Jon Claerbout)

Join us!
• PhD student
• PostDoc
• Bioinformatics research assistant
• Staff scientists
http://www.iarc.fr/en/vacancies/
follm@iarc.fr https://github.com/IARCbioinfo
