Motivation     Solution             Implementation   Demonstration




          Developing distributed analysis
         pipelines with shared community
        resources using CloudBioLinux and
                    CloudMan

                     Brad Chapman
                   Bioinformatics Core
             Harvard School of Public Health


                          22 September 2011

Acknowledgements

     CloudBioLinux – Ntino Krampis, Tim
             Booth, Dawn Field, Pjotr Prins and
             CloudBioLinux community
     CloudMan – Enis Afgan, James Taylor
     Exome pipeline – HSPH, MGH, Win Hide,
             Oliver Hofmann

Follow along



     http://www.slideshare.net/chapmanb

Cue the “lots of data” slide


     ls -lh fastq/
     24G 1_110907_AD08A5ACXX_1_fastq.txt
     21G 1_110907_AD08A5ACXX_2_fastq.txt
     24G 2_110907_AD08A5ACXX_1_fastq.txt
     20G 2_110907_AD08A5ACXX_2_fastq.txt

Rapidly changing tools

Science – fundamental challenge



      75%     one-off experimental
      25%     reused code

Unfortunate result




     http://news.ycombinator.com/item?id=2735537

Hard choices

     Computation
     Demands flexible, well-architected, scalable
     code
     Science
     Requires rapid turnaround and
     experimentation

2 solutions (at least)



         1   Improve your programming skills
         2   Utilize community resources

Become a better coder




     http://software-carpentry.org/

Community resources


             Share painful parts
             Base of well-written, scalable code

     Start each problem from a higher level of
     abstraction

Community components


             CloudBioLinux – install software
             CloudMan – manage cluster
             Exome analysis pipeline – do science

CloudBioLinux

             Amazon image with bioinformatics
             software and libraries
             Automated build framework
             Community effort to maintain and
             extend
     http://cloudbiolinux.org

CloudMan

             SGE cluster plus automation
             Web interface and monitoring
             Persistence and sharing
             Powers the Galaxy Cloud offering
     http://wiki.g2.bx.psu.edu/Admin/Cloud

Exome analysis pipeline

             Existing algorithms
                 Aligners – Bowtie, BWA
                 Variation – GATK
                 Quality assessment – FastQC, Picard
             Messaging system – AMQP
     https://github.com/chapmanb/bcbb/tree/master/nextgen

Fastq lane processing

Sample processing

Variant calling

Parallelization

Amazon


             Virtual machines
                  Share
                  Reproduce
                  Coordinate
             Accessibility

What are we going to do?


             Use AWS console to boot
             CloudBioLinux
             Set up CloudMan in AWS console
             Boot CloudMan instance with demo
             data

What are we going to do?
continued

             Manage cluster with CloudMan interface
             Set up messaging queue
             Run pipeline, examine results
             Share cluster

CloudBioLinux

             Select and launch CloudBioLinux AMI
             from AWS console
             Connect
                 FreeNX graphical client
                 ssh

     Full tutorial PDF: http://j.mp/nnh5TE

Prep work


             Sign up for an AWS account:
             http://aws.amazon.com/
             Create login key pair in AWS Console
             Install NX client:
             http://www.nomachine.com/select-package-client.php
https://console.aws.amazon.com/ec2/
Select CloudBioLinux image from Community AMIs
Enter NX password in user-data (freenxpass: secret)
Launch CloudBioLinux server
Get external hostname from Instances page
Connect using NX client, with ubuntu user and secret password
Connect with ssh, using private ssh key-pair
Terminate the server when finished

Set up CloudMan in AWS console


             Create a custom security group

     Full tutorial:
     http://wiki.g2.bx.psu.edu/Admin/Cloud
Create security group rules following wiki instructions
Final security group specifications

Boot CloudMan instance with
demo data


             Start server
             Pass in CloudMan user data
             Load shared CloudMan image
Follow same procedure as CloudBioLinux
Create CloudMan user-data file


 cluster_name: cbldemo
 password: cbl
 access_key: your_access_key
 secret_key: your_long_AWS_secret_key
Provide user-data from file
Choose created security group
Log in to instance with password from user-data
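
The user-data file created above is a short list of key/value pairs. A minimal Python sketch of generating it, assuming hypothetical helper names (`cloudman_user_data`, `write_cloudman_user_data`) that are not part of CloudMan; only the four keys shown on the slide are used, and the credential values are placeholders:

```python
from pathlib import Path


def cloudman_user_data(cluster_name, password, access_key, secret_key):
    """Return user-data text containing the four keys shown on the slide."""
    fields = [
        ("cluster_name", cluster_name),
        ("password", password),
        ("access_key", access_key),
        ("secret_key", secret_key),
    ]
    # Plain "key: value" lines; values with special YAML characters
    # would need quoting, which this sketch does not attempt.
    return "".join(f"{key}: {value}\n" for key, value in fields)


def write_cloudman_user_data(path, **fields):
    """Write the user-data text to a file that can be supplied at launch."""
    text = cloudman_user_data(**fields)
    Path(path).write_text(text)
    return text
```

Keeping the text generation separate from the file write makes it easy to review the credentials block before anything touches disk.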

CloudMan share-an-instance


             Persist data in a CloudMan cluster
             Easily sharable
     For this demo
     cm-b53c6f1223f966914df347687f6fc818/shared/2011-10-07--14-00
Import shared instance with demo data
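
The share string for this demo appears to have three parts: a bucket name, a `shared` folder, and a snapshot timestamp. A small Python sketch of splitting it, assuming the timestamp layout is `YYYY-MM-DD--HH-MM` with a double hyphen; `parse_share_string` is a hypothetical helper, not a CloudMan function:

```python
from datetime import datetime


def parse_share_string(share_string):
    """Split 'bucket/shared/timestamp' into bucket name and snapshot time."""
    bucket, folder, stamp = share_string.split("/")
    if folder != "shared":
        raise ValueError(f"unexpected share string: {share_string!r}")
    # Assumed timestamp layout: YYYY-MM-DD--HH-MM
    return bucket, datetime.strptime(stamp, "%Y-%m-%d--%H-%M")
```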

Manage cluster with CloudMan


             Web-based console
             Monitor running processes
             Add nodes to cluster as needed
CloudMan console to interact with cluster
Add node to cluster

Set up messaging communication


             Command line access to server
             Adjust RabbitMQ configuration
             Set up messaging queue

Command line access to server

     ssh -i ~/.ec2/id-kunkel.keypair \
         ubuntu@ec2-67-202-14-208.compute-1.amazonaws.com


     Follow approach used to connect to
     CloudBioLinux cluster; can also connect via
     NX

     Edit /export/data/galaxy/universe_wsgi.ini
     configuration file to add internal host name.

         [galaxy_amqp]
         host = ip-10-125-10-182.ec2.internal
         port = 5672
         userid = biouser
         password = tester
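
To confirm the edit took effect, the section can be read back with Python's standard `configparser`. This is a sketch, not part of the pipeline: `read_amqp_settings` is a hypothetical helper, and interpolation is disabled on the assumption that the full configuration file may contain `%(...)s` placeholders elsewhere.

```python
import configparser

# The section exactly as shown on the slide.
EXAMPLE_INI = """\
[galaxy_amqp]
host = ip-10-125-10-182.ec2.internal
port = 5672
userid = biouser
password = tester
"""


def read_amqp_settings(ini_text):
    """Parse the [galaxy_amqp] section and return its values as a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["galaxy_amqp"]
    return {
        "host": section["host"],
        "port": section.getint("port"),
        "userid": section["userid"],
        "password": section["password"],
    }
```

For the real file, the text could be obtained with `open("/export/data/galaxy/universe_wsgi.ini").read()` and passed to the same function.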

Set up messaging queue

     $ sudo rabbitmqctl add_user biouser tester
      creating user 'biouser' ...
      ...done.
     $ sudo rabbitmqctl add_vhost bionextgen
      creating vhost 'bionextgen' ...
      ...done.
     $ sudo rabbitmqctl set_permissions -p bionextgen \
            biouser ".*" ".*" ".*"
      setting permissions for user 'biouser' in vhost 'bionextgen' ...
      ...done.
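
The three `rabbitmqctl` commands above can be scripted so the user, password and vhost stay consistent with the `[galaxy_amqp]` settings. A hedged Python sketch follows; the function names are hypothetical, and the default is a dry run that only prints the commands:

```python
import shlex
import subprocess


def rabbitmq_setup_commands(user, password, vhost):
    """Return the three rabbitmqctl invocations from the slide as argument lists."""
    return [
        ["sudo", "rabbitmqctl", "add_user", user, password],
        ["sudo", "rabbitmqctl", "add_vhost", vhost],
        ["sudo", "rabbitmqctl", "set_permissions", "-p", vhost,
         user, ".*", ".*", ".*"],
    ]


def run_setup(user, password, vhost, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in rabbitmq_setup_commands(user, password, vhost):
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_setup("biouser", "tester", "bionextgen")` prints the same three commands without touching the broker.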

Run pipeline, examine results


             Ready to run distributed pipeline
             Demo data – two paired-end fastq lanes
             Variant calling workflow

Input sequence data


         $ ls -1 /export/data/exome_example/fastq/
         7_100326_FC6107FAAXX_1-chr22.fastq
         7_100326_FC6107FAAXX_2-chr22.fastq
         8_100326_FC6107FAAXX_1-chr22.fastq
         8_100326_FC6107FAAXX_2-chr22.fastq

Run level: YAML Configuration
     $ cat /export/data/exome_example/config/run_info.yaml
     ---
     fc_date: '100326'
     fc_name: FC6107FAAXX
     details:
       - files: [7_100326_FC6107FAAXX_1-chr22.fastq,
                 7_100326_FC6107FAAXX_2-chr22.fastq]
         lane: 7
         description: Test replicate 1
         analysis: SNP calling
         genome_build: hg19
         algorithm:
           quality_format: Standard
           hybrid_bait: hybrid_selection/baits.bed
           hybrid_target: hybrid_selection/targets.bed
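
Each `details` entry pairs the two read files of one lane. As an illustration of how the file names map onto `lane`, `fc_date` and `fc_name`, here is a Python sketch that groups FASTQ names by lane. The naming pattern is inferred from the demo files only, and `group_paired_lanes` is a hypothetical helper, not part of the pipeline:

```python
import re
from collections import defaultdict

# Assumed naming pattern: <lane>_<date>_<flowcell>_<read>[-<region>].fastq
FASTQ_NAME = re.compile(r"^(\d+)_(\d+)_([A-Za-z0-9]+)_([12])(?:-\w+)?\.fastq$")


def group_paired_lanes(filenames):
    """Map (lane, date, flowcell) to that lane's files, ordered by read number."""
    lanes = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            raise ValueError(f"unrecognized fastq name: {name}")
        lane, date, flowcell, read = match.groups()
        lanes[(int(lane), date, flowcell)][int(read)] = name
    return {key: [reads[r] for r in sorted(reads)] for key, reads in lanes.items()}
```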

System level: YAML Configuration
     $ cat /export/data/galaxy/post_process.yaml
     ---
     program:
       bowtie: bowtie
       bwa: bwa
       ucsc_bigwig: wigToBigWig
       picard: /usr/share/java/picard
       gatk: /usr/share/java/gatk
       snpEff: /usr/share/java/snpeff
       fastqc: fastqc
     distributed:
       cluster_platform: sge
       platform_args: '-q all.q'
       cores_per_host: 1
       rabbitmq_vhost: bionextgen
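
The `program` section mixes bare command names, presumably resolved on `PATH`, with absolute directories. Before launching a run it can help to check that each entry resolves. The following is a sketch of such a pre-flight check, not something the pipeline itself provides; `missing_programs` is a hypothetical helper, and the lookups are injectable so the logic can be exercised without the tools installed:

```python
import os
import shutil


def missing_programs(programs, which=shutil.which, isdir=os.path.isdir):
    """Return the names whose command is not on PATH or whose directory is absent."""
    missing = []
    for name, target in programs.items():
        if target.startswith("/"):
            # Absolute entries are treated as installation directories.
            found = isdir(target)
        else:
            # Bare entries are treated as executables looked up on PATH.
            found = which(target) is not None
        if not found:
            missing.append(name)
    return missing
```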

Run exome pipeline


     $ cd /export/data/work
     $ distributed_nextgen_pipeline.py \
         /export/data/galaxy/post_process.yaml \
         /export/data/exome_example/fastq \
         /export/data/exome_example/config/run_info.yaml

What just happened?

Monitoring: SGE queues


     $ qstat
     job-ID prior name state submit/start at    queue
     --------------------------------------------------------------
     1 0.55500 nextgen_an r 18:16:32 all.q@ip-10-125-10-182.ec2.int
     2 0.55500 nextgen_an r 18:16:32 all.q@ip-10-86-254-105.ec2.int
     3 0.55500 automated_ r 18:16:47 all.q@ip-10-125-10-182.ec2.int
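
To see how jobs are spread across nodes, the listing can be summarized per host. This Python sketch parses only the abbreviated column layout shown on the slide (real `qstat` output has additional columns); `jobs_per_host` is a hypothetical helper:

```python
from collections import Counter

# The abbreviated listing from the slide, used as sample input.
SAMPLE_QSTAT = """\
job-ID prior name state submit/start at queue
--------------------------------------------------------------
1 0.55500 nextgen_an r 18:16:32 all.q@ip-10-125-10-182.ec2.int
2 0.55500 nextgen_an r 18:16:32 all.q@ip-10-86-254-105.ec2.int
3 0.55500 automated_ r 18:16:47 all.q@ip-10-125-10-182.ec2.int
"""


def jobs_per_host(qstat_text):
    """Count jobs per execution host in the abbreviated listing."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID and end with queue@host.
        if fields and fields[0].isdigit() and "@" in fields[-1]:
            counts[fields[-1].split("@", 1)[1]] += 1
    return counts
```

In the sample, two jobs run on the first node and one on the added node, which matches the pipeline's server processes being distributed across the cluster.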

Monitoring: Analysis directory


     $ cd /export/data/work
     $ ls -lh
     drwxr-xr-x 4.0 alignments
     -rw-r--r-- 2.0K automated_initial_analysis.py.o11
     drwxr-xr-x   33 log
     -rw-r--r-- 15K nextgen_analysis_server.py.o10
     -rw-r--r-- 15K nextgen_analysis_server.py.o9
     drwxr-xr-x 102 tmp

Monitoring: Log files

     $ less nextgen_analysis_server.py.o10
      INFO: nextgen_pipeline: Processing sample: Test replicate 2;
        lane 8; reference genome hg19; researcher ;
        analysis method SNP calling
      INFO: nextgen_pipeline:
        Aligning lane 8_100326_FC6107FAAXX with bwa aligner
      INFO: nextgen_pipeline:
        Combining and preparing wig file [u'', u'Test replicate 2']
      INFO: nextgen_pipeline:
        Recalibrating [u'', u'Test replicate 2'] with GATK

Retrieve results: Copy files

     $ upload_to_galaxy.py \
         /export/data/galaxy/post_process.yaml \
         /export/data/exome_example/fastq \
         /export/data/work \
         /export/data/exome_example/config/run_info.yaml


     Final files copied into new directory; allows
     cleanup of analysis directory

Retrieve results: Output directory


     $ ls -lh /export/data/galaxy/storage/100326_FC6107FAAXX/7
     -rw-r--r-- 38M 7_100326_FC6107FAAXX.bam
     -rw-r--r-- 22M 7_100326_FC6107FAAXX-coverage.bigwig
     -rw-r--r-- 72M 7_100326_FC6107FAAXX-gatkrecal.bam
     -rw-r--r-- 109K 7_100326_FC6107FAAXX-snp-effects.tsv
     -rw-r--r-- 827K 7_100326_FC6107FAAXX-snp-filter.vcf
     -rw-r--r-- 1.6M 7_100326_FC6107FAAXX-summary.pdf
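
The output files are distinguished by suffix. A small Python sketch that labels them is shown below; the descriptions are inferred from the file names and the tools listed earlier, so treat them as assumptions rather than the pipeline's own terminology:

```python
# Suffix-to-description pairs inferred from the listing; the longer BAM
# suffix is checked before the plain ".bam" so it is not shadowed.
OUTPUT_KINDS = [
    ("-gatkrecal.bam", "recalibrated alignments"),
    ("-coverage.bigwig", "coverage track"),
    ("-snp-effects.tsv", "variant effects"),
    ("-snp-filter.vcf", "filtered variant calls"),
    ("-summary.pdf", "summary report"),
    (".bam", "alignments"),
]


def describe_output(filename):
    """Return an inferred description for a result file, or 'unknown'."""
    for suffix, kind in OUTPUT_KINDS:
        if filename.endswith(suffix):
            return kind
    return "unknown"
```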

Share results


             Share-an-instance
             Uses CloudMan web interface
             Reproducible research
                 CloudBioLinux AMI – software
                 CloudMan – data and configuration
CloudMan console enables push button sharing
Can make public or available to specific collaborators
When finished, turn everything off through CloudMan

Summary

     CloudBioLinux

             Shared machine image of biological
             software
             Boot from AWS console
             Connect with NX graphical client and
             ssh

Summary

     CloudMan

             Cluster setup and management
             Boot from share-an-instance
             Manage cluster through web interface
             Share final results

Summary

     Exome pipeline

             Parallel framework for running analyses
             Run using automated scripts
             Extract alignments, variant calls and
             summary information

Future: interfaces make it easier




     https://bitbucket.org/hbc/galaxy-central-hbc

Future: Simplified file selection

Future: Top level parameters

Future: Galaxy data libraries

Future: Galaxy analysis

Future: External UCSC
visualization

Read more

             Step-by-step instructions
             http://j.mp/rp69nx
             Approaches to parallelism
             http://j.mp/nPQHcm
             Future work
             http://bcbio.wordpress.com
