Bioinformatics Pipelines
Pipelines are fundamental in any data science environment, because data processing is never a single task. Many pipelines are implemented via ad hoc scripts. While this can be done in a useful way, in many cases such scripts fail on several fundamental criteria, chiefly reproducibility, maintainability, and extensibility.
In bioinformatics, you can find three main types of pipeline systems:
- Frameworks such as Galaxy (https://usegalaxy.org), which are geared toward users, that is, they expose easy-to-use user interfaces and hide most of the underlying machinery.
- Programmatic workflows such as Snakemake (https://snakemake.readthedocs.io/) and Nextflow (https://www.nextflow.io/), which expose code interfaces that, while generic, originate from the bioinformatics space.
- Totally generic workflow systems such as Apache Airflow (https://airflow.incubator.apache.org/), which take a less data-centric approach to workflow management.
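To make the second category concrete, here is a minimal sketch of what a programmatic workflow looks like in Snakemake. The file names, the reference genome path, and the bcftools-based variant-calling command are all hypothetical stand-ins chosen for illustration; the point is the structure: each rule declares its inputs and outputs, and Snakemake derives the execution order from those dependencies.

```
# Snakefile (sketch): a two-rule workflow with hypothetical file names.
# The target rule: asking for this output drives the whole pipeline.
rule all:
    input:
        "results/variants.vcf"

# A processing rule: Snakemake runs it because "all" depends on its output.
rule call_variants:
    input:
        bam="data/sample.bam",     # assumed aligned reads
        ref="data/genome.fa"       # assumed reference genome
    output:
        "results/variants.vcf"
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv -o {output}"
```

Because dependencies are explicit, Snakemake can skip steps whose outputs are already up to date and rerun only what changed, which is a large part of what ad hoc scripts lack.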
In...