Generating the count table
and validating assumptions
RNA-seq for DE analysis training
Joachim Jacob
20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
Goal
Summarize the read counts per gene from
a mapping result.
The outcome is a raw count table on
which we can perform some QC.
This table is used by the differential
expression algorithm to detect DE genes.
Status
The challenge
'Exons' are the type of features used here.
They are summarized per 'gene'

Alt splicing
Overlaps no feature

Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads
No normalization yet! Just pure counts, aka 'raw counts'.
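
The concept above can be sketched in a few lines of Python. The per-exon read counts are hypothetical values, chosen only so that they add up to the totals on the slide (215 and 180); the name `summarize_per_gene` is illustrative too.

```python
from collections import defaultdict

# Hypothetical per-exon raw read counts (illustrative values only),
# chosen so that the gene totals match the slide: 215 and 180.
exon_counts = [
    ("GeneA", "exon 1", 60),
    ("GeneA", "exon 2", 45),
    ("GeneA", "exon 3", 70),
    ("GeneA", "exon 4", 40),
    ("GeneB", "exon 1", 90),
    ("GeneB", "exon 2", 50),
    ("GeneB", "exon 3", 40),
]


def summarize_per_gene(records):
    """Sum raw exon counts per gene; no normalization is applied."""
    totals = defaultdict(int)
    for gene, _exon, count in records:
        totals[gene] += count
    return dict(totals)


gene_counts = summarize_per_gene(exon_counts)
print(gene_counts)
```

The result is still a table of raw counts: nothing is scaled by gene length or library size.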
Tools to count features
● Different tools exist to accomplish this:

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting
Dealing with ambiguity
● We focus on the gene level: merge all counts over different isoforms into one, taking into account:
  ● Reads that do not overlap a feature, but appear in introns. Take into account?
  ● Reads that align to more than one feature (exon or transcript). Transcripts can be overlapping, perhaps on different strands. (PE and strandedness can resolve this partially.)
  ● Reads that partially overlap a feature, not following known annotations.
HTSeq count has 3 modes
HTSeq-count recommends the 'union' mode. But depending on your genome, you may opt for the 'intersection_strict' mode. Galaxy allows experimenting!

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
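
The decision rule behind the three modes can be sketched in Python. This is a simplified illustration of the rule described in the HTSeq documentation, not HTSeq's own code: `assign_read` and the labels `no_feature` and `ambiguous` are names chosen for this sketch. A read is represented as a list of sets, one set of overlapping feature names per position covered by the read.

```python
def assign_read(feature_sets, mode="union"):
    """Return the feature a read is counted for, or a special label.

    feature_sets: one set of feature names per position covered by the
    read; an empty set means that position overlaps no feature.
    """
    if mode == "union":
        # All features seen anywhere along the read.
        candidates = set().union(*feature_sets)
    elif mode == "intersection_strict":
        # Features present at every position of the read.
        candidates = set.intersection(*feature_sets) if feature_sets else set()
    elif mode == "intersection_nonempty":
        # Features present at every position that overlaps any feature.
        nonempty = [s for s in feature_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that hangs over the edge of gene_A.
partial = [{"gene_A"}, set()]
# A read that starts in gene_A and runs into a region shared with gene_B.
overlap = [{"gene_A"}, {"gene_A", "gene_B"}]

for mode in ("union", "intersection_strict", "intersection_nonempty"):
    print(mode, assign_read(partial, mode), assign_read(overlap, mode))
```

The two toy reads show why the choice of mode matters: the same read can be counted, dropped as 'no feature', or flagged as ambiguous.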
Indicate the SE or PE nature of your data
(note: 'mate-pair' is not the appropriate
term here)
The annotation file with the coordinates
of the features to be counted
Mode
Reverse stranded: check with mapping viz
Check with mapping QC (see earlier)
For RNA-seq DE we summarize over
'exons' grouped by 'gene_id'. Make sure
these fields are correct in your GTF file.
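
A quick way to check these fields is to parse a few GTF lines yourself. The sketch below is a minimal Python example. The GTF lines are made up for illustration, and `parse_gtf_line` and `is_countable` are hypothetical helper names, not part of any tool.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF line into its feature type, coordinates and attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line needs 9 tab-separated fields")
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attributes": dict(ATTRIBUTE_PATTERN.findall(fields[8])),
    }


def is_countable(record, feature_type="exon", id_attribute="gene_id"):
    """True if the record is an exon that carries a gene_id attribute."""
    return record["feature"] == feature_type and id_attribute in record["attributes"]


# Made-up example lines (not from a real annotation).
good_line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'
bad_line = 'chr1\texample\texon\t300\t400\t.\t+\t.\ttranscript_id "geneB.1";'

good = parse_gtf_line(good_line)
bad = parse_gtf_line(bad_line)
print(good["attributes"]["gene_id"], is_countable(good), is_countable(bad))
```

Lines that fail the check would not be summarized per gene, so it pays to spot them before counting.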
Resulting count table column

One sample!
Merging to create experiment count table
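
Merging the per-sample columns into one experiment table can be sketched in Python as follows. Sample names and counts are hypothetical. Filling a gene that is missing from a sample with 0 is an assumption of this sketch; check what your counting tool reports.

```python
def merge_count_columns(per_sample_counts):
    """Combine {sample: {gene: count}} into one table with a row per gene.

    Assumption: a gene missing from a sample is given a count of 0.
    """
    samples = list(per_sample_counts)
    genes = sorted(set().union(*(counts.keys() for counts in per_sample_counts.values())))
    table = {
        gene: [per_sample_counts[sample].get(gene, 0) for sample in samples]
        for gene in genes
    }
    return samples, table


# Hypothetical single-sample count columns.
per_sample = {
    "sample_1": {"geneA": 215, "geneB": 180},
    "sample_2": {"geneA": 190, "geneB": 202, "geneC": 3},
}

samples, table = merge_count_columns(per_sample)
print("gene", *samples, sep="\t")
for gene, row in table.items():
    print(gene, *row, sep="\t")
```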
Resulting count table
Quality control of count table
Relative numbers

Absolute numbers

In the end, we used about 70% of the reads. Check this for your own experiment.
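
The fraction of reads that ends up in the count table can be computed from one count column. A minimal Python sketch with hypothetical numbers: it assumes that the special counters (reads with no feature, ambiguous reads, ...) are listed next to the genes and start with `__`, as in recent HTSeq versions. Adapt the prefix if your output differs.

```python
def assigned_fraction(counts, special_prefix="__"):
    """Fraction of all reads that were assigned to a gene."""
    total = sum(counts.values())
    assigned = sum(n for name, n in counts.items() if not name.startswith(special_prefix))
    return assigned / total if total else 0.0


# Hypothetical count column for one sample.
column = {
    "geneA": 400,
    "geneB": 300,
    "__no_feature": 200,
    "__ambiguous": 100,
}

fraction = assigned_fraction(column)
print(f"{fraction:.0%} of the reads were assigned to a gene")
```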
Quality control of count table
2 types of QC:
● General metrics
● Sample-specific quality control
QC: general metrics
● General numbers
QC: general metrics
Which genes are most highly present?
Which fractions do they occupy?

42 genes (0.63%) of the 6665 genes take 25% of all counts.
This graph (counts per gene) can be constructed from the count table.
Among the most highly present genes: TEF1alpha, putative ribosomal proteins, ...
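
The question 'how many genes take a given share of all counts' can be answered directly from the count table. A small Python sketch with a hypothetical count column; `genes_covering_fraction` is an illustrative name. The last lines redo the arithmetic of the slide: 42 out of 6665 genes.

```python
def genes_covering_fraction(counts, fraction=0.25):
    """Smallest set of top genes whose counts reach `fraction` of the total."""
    total = sum(counts.values())
    running = 0
    selected = []
    for gene, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        if running >= fraction * total:
            break
        selected.append(gene)
        running += n
    return selected


# Hypothetical raw counts for one sample.
counts = {"g1": 500, "g2": 300, "g3": 100, "g4": 50, "g5": 50}
top = genes_covering_fraction(counts, fraction=0.25)
print(top, f"{len(top) / len(counts):.0%} of the genes")

# The numbers from the slide: 42 of 6665 genes.
share_of_genes = 100 * 42 / 6665
print(f"{share_of_genes:.2f}% of the genes")
```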
QC: general metrics
● We can plot the counts per sample: filter out the zeros and apply a log2 transformation.

The bulk of the genes have counts in the hundreds.
A few are extremely highly expressed.
A minority have extremely low counts.
(x axis of the plot: log2(count))
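
The data behind such a plot is easy to prepare. The Python sketch below drops the zeros, takes log2, and bins the values into a simple histogram; the counts are hypothetical and the helper names are illustrative.

```python
import math
from collections import Counter


def log2_nonzero(counts):
    """Drop zero counts and log2-transform the rest."""
    return [math.log2(c) for c in counts if c > 0]


def histogram(values, bin_width=1.0):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    bins = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Hypothetical raw counts for one sample.
raw_counts = [0, 1, 2, 128, 200, 256, 300, 100000]

values = log2_nonzero(raw_counts)
hist = histogram(values)
for lower_edge, n_genes in hist.items():
    print(f"log2(count) in [{lower_edge:.0f}, {lower_edge + 1:.0f}): {n_genes} genes")
```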
QC: log2 density graph
● We can do this for all samples, and merge the graphs into one plot.
All samples show nice overlap; the peaks are similar.
Strange deviation here (marked on the plot).
QC: log2 merging samples
Here, we take one sample and plot its log2 density graph, then add the counts of another sample and plot again, and so on until we have merged all samples.
We see a horizontal shift of the graph, rather than a vertical shift, pointing to no saturation.
QC: rarefaction curve
What is the total number of detected features, and how does the feature space increase with each additional sample added?
There should be saturation, but here there is none.
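Code:
library(ggplot2)
# nonzero_counts is assumed to hold one row per cumulative set of samples:
#   total  = total number of sequenced reads
#   counts = number of genes with counts > 0
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads",
       y = "number of genes with counts > 0")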
QC: rarefaction curve
rRNA genes
Cumulative sample sets on the plot: Sample A; Sample A + sample B; Sample A + sample B + sample C; etc.

Saturation: OK!
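
The points of a rarefaction curve can be computed by adding the samples one by one. A minimal Python sketch with hypothetical counts; `rarefaction_points` is an illustrative name. Each point is (total number of reads so far, number of genes with counts > 0).

```python
def rarefaction_points(sample_counts):
    """Add samples one by one; after each, record (total reads, genes detected)."""
    merged = {}
    points = []
    for counts in sample_counts:
        for gene, n in counts.items():
            merged[gene] = merged.get(gene, 0) + n
        total = sum(merged.values())
        detected = sum(1 for n in merged.values() if n > 0)
        points.append((total, detected))
    return points


# Hypothetical counts for samples A, B and C.
sample_a = {"g1": 10, "g2": 0, "g3": 5}
sample_b = {"g1": 7, "g2": 3, "g3": 0}
sample_c = {"g1": 4, "g2": 1, "g3": 2}

points = rarefaction_points([sample_a, sample_b, sample_c])
print(points)
```

When the second number stops growing while the first keeps increasing, the curve has reached saturation.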
QC: transformations for viz

Regularized log (rLog) and 'Variance Stabilizing Transformation'
(VST) as alternatives to log2.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
QC: count transformations
Not normalizations!
● Techniques used for microarrays can be applied to VST-transformed counts.

Panels shown: log2, rLog, VST

http://www.biomedcentral.com/1471-2105/14/91
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
QC including condition info
● We can also include condition information, to interpret our QC better.
● For this, we need to gather sample information.

Make a separate file in which sample info is provided (metadata).
QC with condition info

What do the differences in counts between samples depend on?
Here: counts depend on the treatment and the strain.
These factors must match the sample descriptions file.
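
Whether the sample descriptions file matches the count table can be checked with a few lines of Python. The tab-separated layout and the column names `sample`, `treatment` and `strain` are assumptions of this sketch, as are the sample names.

```python
import csv
import io


def read_sample_descriptions(handle):
    """Read a tab-separated metadata file into {sample name: row}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row for row in reader}


def check_samples_match(count_columns, descriptions):
    """Return samples lacking a description, and descriptions lacking a sample."""
    missing = [s for s in count_columns if s not in descriptions]
    unused = [s for s in descriptions if s not in count_columns]
    return missing, unused


# Hypothetical sample descriptions file (normally read from disk).
SAMPLE_FILE = (
    "sample\ttreatment\tstrain\n"
    "wt_1\tcontrol\tWT\n"
    "wt_2\ttreated\tWT\n"
    "mut_1\tcontrol\tmutant\n"
)

descriptions = read_sample_descriptions(io.StringIO(SAMPLE_FILE))
missing, unused = check_samples_match(["wt_1", "wt_2", "mut_2"], descriptions)
print("no description for:", missing)
print("not in count table:", unused)
```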
QC with condition info
Clustering of the distance between samples based on transformed counts can reveal sample errors.

Panels shown: VST transformed, rLog transformed

The colour scale shows the distance measure between samples. Similar conditions should cluster together.
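
The distances behind such a heatmap can be sketched in Python. Euclidean distance is an assumption of this sketch, and the transformed values are hypothetical; in practice they would come from the VST or rLog transformation.

```python
import math


def sample_distances(transformed):
    """Euclidean distance between every pair of samples."""
    names = list(transformed)
    return {
        (a, b): math.dist(transformed[a], transformed[b])
        for a in names
        for b in names
    }


def nearest_sample(name, distances):
    """The other sample with the smallest distance to `name`."""
    others = {b: d for (a, b), d in distances.items() if a == name and b != name}
    return min(others, key=others.get)


# Hypothetical transformed counts (one value per gene) for three samples.
transformed = {
    "wt_1": [5.0, 8.0, 3.0],
    "wt_2": [5.0, 8.0, 4.0],
    "mut_1": [9.0, 5.0, 3.0],
}

distances = sample_distances(transformed)
print(nearest_sample("wt_1", distances))
```

A sample whose nearest neighbour belongs to another condition deserves a closer look.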
QC with condition info
Clustering of transformed counts can reveal sample
errors.

VST transformed

rLog transformed
QC with condition info
Principal component (PC) analysis makes it possible to display
the samples in a 2D scatterplot based on the variability
between the samples. Samples close to each other
resemble each other more.
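
To make the idea concrete, here is a small pure-Python sketch of a PC analysis on hypothetical transformed counts. It uses power iteration on the sample-by-sample matrix, a choice made for this sketch to avoid extra libraries; in practice you would use an existing PCA routine. All names and values are illustrative.

```python
import math


def center_columns(rows):
    """Subtract the mean over samples from every gene (column)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def top_eigen(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    vec = [1.0 + i for i in range(n)]  # fixed start vector keeps the result reproducible
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            return 0.0, [0.0] * n
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n))
    return value, vec


def pca_scores(rows, n_components=2):
    """Coordinates of each sample (row) on the first principal components."""
    centered = center_columns(rows)
    n = len(centered)
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j])) for j in range(n)]
            for i in range(n)]
    scores = [[] for _ in range(n)]
    for _ in range(n_components):
        value, vec = top_eigen(gram)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            scores[i].append(vec[i] * scale)
        # Remove the component just found before looking for the next one.
        gram = [[gram[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return scores


# Hypothetical transformed counts: rows are samples, columns are genes.
names = ["wt_1", "wt_2", "mut_1", "mut_2"]
rows = [[5.0, 8.0, 3.0], [5.2, 7.9, 3.1], [9.0, 4.0, 3.0], [9.1, 4.2, 2.9]]

scores = pca_scores(rows)
for name, (pc1, pc2) in zip(names, scores):
    print(f"{name}\tPC1={pc1:.2f}\tPC2={pc2:.2f}")
```

In this toy data the two WT samples end up on one side of PC1 and the two mutant samples on the other.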
Collect enough metadata

Why do these resemble each other?
QC with condition info
During library preparation, collect as much
information as possible, to add to the sample
descriptions. Pay particular attention to differences
between samples: e.g. day of preparation,
centrifuges used, ...

Collect enough metadata
In the QC of the count table, you can map this
additional info onto the PC graph. In this case, library
prep on a different day had an effect on the WT
samples (batch effect).

Legend of the PC graph (additional metadata): Day 1, Day 2
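
Mapping metadata onto PC scores can be as simple as grouping the scores by a metadata field. A small Python sketch: the scores, the sample names and the field name `prep_day` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(scores, metadata, field):
    """Average PC score per value of one metadata field."""
    groups = defaultdict(list)
    for sample, score in scores.items():
        groups[metadata[sample][field]].append(score)
    return {label: mean(values) for label, values in groups.items()}


# Hypothetical scores of the WT samples on one principal component.
pc_scores = {"wt_1": -2.0, "wt_2": -1.0, "wt_3": 2.0, "wt_4": 1.0}
metadata = {
    "wt_1": {"prep_day": "Day 1"},
    "wt_2": {"prep_day": "Day 1"},
    "wt_3": {"prep_day": "Day 2"},
    "wt_4": {"prep_day": "Day 2"},
}

by_day = mean_score_by(pc_scores, metadata, "prep_day")
print(by_day)
```

A clear difference between the groups, as in this toy example, hints at a batch effect.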
Next step
Now that we know our data inside out, we
can run a DE algorithm on the count table!
Keywords
Raw counts
VST

Write in your own words what the terms mean
Break


RNA-seq for DE analysis: extracting counts and QC - part 4
