Data processing &
visualization
methods
Claire Chung
2021/08/17
LSCI6101 – Techniques in Biocomputing
Overview
• Bioinformatics as data science
• Why data processing & data visualization are important
• Basic data processing skills
• Regular expression
• Basic programming concepts (Extended)
• Processing tabular data using R (Extended)
• Basic data visualization in bioinformatics
• Common plot types in bioinformatics
• Common data visualization packages
• General principles & tips for data visualization
Aims
• The first taste: Learn through hands-on examples
• Get code snippets to play with and modify for future use
• Provide *keywords* and key resources to kick start onward learning
Different “tastes” of bioinformaticians
ENGINEER DATA SCIENTIST “WET LAB”-FOCUSED
Discovery-oriented
Feasibility & elegance of
methodology
Work of a bioinformatician
• Understanding of bioinformatics study design & data
• From data generation experiments to the resulting data
• Selection & usage of proper tools to handle data
• May include software installation, which can be non-trivial
• Communication of progress and presentation of data analysis results
Bioinformatics as Data Science
The data cycle
Ask an interesting question
• Understand the biology behind
• Design the study & experiments needed
Get the data
• Generate data via experiments and/or
• Get relevant data from public db
Explore the data
• Know the metadata (origin? Type? Format specification?)
• Perform quality check
• Perform data cleaning
• Transform data where necessary
• Understand the data distribution preliminarily <= aided by data viz
Analyze the data
• Choose the right tools
• Understand and interpret the results
• Often needs data transformation
Communicate the results
• Effective communication requires intuitive graphics
• Choose the right plot type
• Tune the aesthetics
• Add proper legend
• Clear writing
Data management & operation processes are non-trivial too
Application of skills to learn in this session
Data processing
• Data cleaning
• Data filtering
Data visualization
• Exploratory Data Analysis (EDA)
• Result presentation
• Check the number of data entries
• Check if the data contain irrelevant
entries, missing values, unsupported
characters, extra space
• Fix or remove erratic data
• Changing file formats to fit different tools
• Filter data for downstream analysis, e.g.
filter assembled transcripts by class code
• Check if the data distribution looks reasonable
• Look for trend and/or outliers preliminarily
Effective use of Excel data tools is good for simple, quick handling
Data tab
Sorting & Filtering
https://www.exceltip.com/basic-excel/data-tab.html
https://excelwithbusiness.com/blog/15-excel-data-analysis-functions-need/
Formula bar
Problem with “just” using Excel
• Limited rows
• Slow in opening large files
• Cannot easily “pipe” input & output from and to other processing steps
• Often we just need one line of command to finish
• E.g. `cat XXX.gtf | awk '$3=="gene"' | cut -f9 | sed 's/.*gene_id=\([^;]*\).*/\1/' | sort -u`
extracts unique gene IDs from the entire GTF annotation
• On Excel, you may need many clicks, “Save As” and reopening for the same action
• Less versatile processing functions
• May inadvertently have data changed automatically by Excel formatting
• e.g. “gene symbols such as SEPT2 (Septin 2) and
MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1,
E3 Ubiquitin Protein Ligase] are converted
by default to ‘2-Sep’ and ‘1-Mar’”
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
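The one-line shell pipeline above can also be sketched in Python. This is a minimal sketch, assuming tab-separated annotation lines whose 9th column carries a `gene_id=...` attribute as in the sed pattern above (many GTF files write `gene_id "..."` instead, so adjust the pattern to your file). The toy lines and the function name `unique_gene_ids` are made up for illustration.

```python
import re

def unique_gene_ids(lines):
    """Return sorted unique gene IDs from the 'gene' rows of annotation lines."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment / header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only rows whose 3rd column is "gene"
        match = re.search(r"gene_id=([^;]*)", fields[8])
        if match:
            ids.add(match.group(1))
    return sorted(ids)

# Hypothetical toy input (not real annotation data)
sample = [
    "chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id=G1;gene_name=A",
    "chr1\tdemo\texon\t1\t50\t.\t+\t.\tgene_id=G1;gene_name=A",
    "chr2\tdemo\tgene\t5\t90\t.\t-\t.\tgene_id=G2;gene_name=B",
]
print(unique_gene_ids(sample))
```

For a real file, the same function could be fed an open file handle, since it only iterates over lines.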
Common programming/scripting languages
in bioinformatics
Programming / scripting languages
• Bash: built-in with Unix-like system (MacOS, Linux)
• Python
• R
• Perl: phasing out; still commonly seen in older packages
Other tools
• Microsoft Excel or equivalent spreadsheet software
• More advanced text editors
Data handling tip 1:
Regular expression
Regular expression- basics
• Pattern of strings (a series of characters / digits / symbols /
whitespace)
• Often abbreviated as “regex” or “regexp”
• Useful in searching and/or replacing string (e.g. changing ID formats)
• Available in most if not all programming languages, as well as more
advanced text editors (i.e. most other than Windows notepad)
• Different software may differ slightly in syntax, but mostly similar
Case study: why we need scripting / regular
expression commands?
• If I would like to get only the
DNA sequence into one line?
• What if I have >10k lines?
Example solution
1. Open “Find & Replace function”
• Windows: Ctrl + H
• MacOS: Command ⌘ + Option ⌥ + F
2. Select “Case sensitive”
3. Type “[A-Z]{3}\s” in the “Find” blank
4. Select “Find All”
5. Copy and paste selection to a new file
Example solution
6. Type “\s+\n” to select all trailing
space and the line break on each line
7. Select “Replace All”
DONE!!
Regular expression (example from Python)
• Digit (0-9): \d
• Non-digit: \D
• Whitespace (space, tab): \s
• Non-whitespace: \S
• Line break: \n
https://docs.python.org/3/library/re.html
Check the exact syntax from the
documentation of the tool you use.
For instance, * in some tools only means
repeating the previous item, e.g. \d*
means a series of digits, instead of any
number of any characters following a digit
Regular expression (example from Python)
• Start of line: ^ (when placed at the start of a pattern)
• End of line: $ (when placed at the end of a pattern)
• Present 0 or 1 time: ?
• Present 1 or more times: +
• Repeat n times: {n}
• Present 0 or more times: * (the wildcard for any single character is .)
https://docs.python.org/3/library/re.html
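The symbols listed above can be tried out with Python's built-in `re` module. The following is a small sketch using made-up strings; it only illustrates what each symbol matches and is not tied to any particular dataset.

```python
import re

seq = "ATGCACGAG"
text = "gene12 exon3\tCDS"

digits = re.findall(r"\d+", text)         # \d+ : runs of one or more digits
words = re.split(r"\s+", text)            # \s+ : split on runs of whitespace
codons = re.findall(r"[ACGT]{3}", seq)    # {3} : exactly three repeats
spellings = re.findall(r"colou?r", "color colour")  # ? : "u" present 0 or 1 time
starts_with_atg = re.search(r"^ATG", seq) is not None  # ^ : start of the string
ends_with_gag = re.search(r"GAG$", seq) is not None    # $ : end of the string
zero_digits_ok = re.fullmatch(r"CR\d*", "CR") is not None  # * : 0 or more repeats

print(digits, words, codons, spellings)
print(starts_with_atg, ends_with_gag, zero_digits_ok)
```

Writing patterns as raw strings (`r"..."`) keeps the backslashes intact, which matters because `\d`, `\s` and `\n` all rely on them.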
Common string / file manipulation operations on
MacOS & Linux
Function: Bash command
• Row counting: wc -l
• Sorting: sort
• Selection / Filtering by row: grep, awk, sed
• Replacement: awk, sed
• Column selection: cut
*slight syntax differences between MacOS and Linux sometimes,
e.g. grep -E vs grep -P, for selecting by regex patterns
Since not everyone has access to Linux servers or uses MacOS/Linux
computers, we will get hands-on with some more advanced GUI text
editors, so that you can process small datasets
on your own computers too.
More advanced text editors
• Notepad++
• https://notepad-plus-plus.org/downloads/v8.1.3/
• Entirely free; open source
• Sublime Text (for demo in class)
• https://www.sublimetext.com/download
• More functions; would prompt for license purchase
• Code-ready: Syntax highlighting
• Data-ready: can open larger text files quicker
by loading small chunks of the file one at a time
Hands-on practice
• Download Sublime Text
• On Galaxy, go to Shared data > History > 2021-08-15_DE_analysis
• Download history 1 Drosophila annotation.
• Save as Drosophila_annotation.gff
Task:
• Extract all “gene” features (not CDS / exon / 5’ or 3’ UTR / start or end
codon, etc.), with “gene_name”s starting with “CR” and followed by a 5-digit ID
• Flybase: CG for protein-coding genes, CR for non-protein-coding genes
Hands-on practice
Method 1: Excel
1. Select the 9th column of data
2. Select the “Data” tool tab
3. Select “Text to Columns”
4. Select “Delimited”
5. Select “semicolon”
6. Select ”Standard”
7. Click “Finish”
Extra columns added
Gene features filtered
Filter for values equal to “gene”
Oh no… the Excel filter is not case sensitive
And am I going to add 10 more rules to specify the digits?
Filter for values that contain ‘gene_name “CR’
More caveats with using native Excel
functions for filtering
• Slow with many steps
• Not specific enough
• What if we have a file with non-uniform number of items for different
attribute rows? (It can happen)
• First check the number of rows containing the desired feature, e.g.
“gene_name” is the same as the total row number
• And in the last example, there can also be genes like CRXXX1?
• Not in Drosophila. How do I know? Surely, I didn’t eyeball them…
When you master regex…
• Open the “Drosophila_annotation.gff” file again in Sublime Text
• Open the “Find” function by Ctrl + F (Windows) / Command ⌘ + F (Mac)
• Turn on “Regular Expression” and “Case sensitive” modes
• Input ^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"
• Click “Find All”
• Copy & Paste to a new file
• DONE!!!
You can even speed things up
by using hotkeys rather than
clicking
The regex pattern explained
• ^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"
• Starting ^: a string that starts with
• (): a capture group of
• []: allowed symbols contained within
• ^ within []: not
• \t: tab
• +: present for one or more times
• {n}: repeated n times
• \n: line break
• a string that starts with two repeats of a capture group of non-tab characters followed by a tab,
followed by the string “gene” and a tab, followed by 5 more repeats of the group of non-tab characters
followed by a tab, then some non-line-break (i.e. basically any) characters, finally followed by
the string ‘gene_name "CR\d+"’
• i.e. get tab-separated rows that have the string “gene” in the 3rd column and, in the 9th column,
contain the string gene_name "CRxxx", where xxx is one or more digits
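The same pattern can be applied line by line in Python. This is a minimal sketch, assuming 9 tab-separated columns per row; the three toy rows and their IDs are made up for illustration and are not taken from the real annotation file.

```python
import re

# The pattern from the slide, written as a Python raw string
pattern = re.compile(r'^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"')

# Hypothetical toy rows (made-up IDs), 9 tab-separated columns each
rows = [
    '2L\tFlyBase\tgene\t100\t900\t.\t+\t.\tgene_id "FBgn0000001"; gene_name "CR12345";',
    '2L\tFlyBase\texon\t100\t300\t.\t+\t.\tgene_id "FBgn0000001"; gene_name "CR12345";',
    '2R\tFlyBase\tgene\t500\t800\t.\t-\t.\tgene_id "FBgn0000002"; gene_name "CG54321";',
]

# Keep only rows that are "gene" features with a gene_name starting with CR + digits
matched = [row for row in rows if pattern.search(row)]
for row in matched:
    print(row)
```

Only the first toy row is kept: the second is an exon and the third has a CG name. To insist on exactly five digits, as in the task description, `CR\d{5}"` could replace `CR\d+"`.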
Remarks before you feel like reading spells….
• This is not a contest. Accuracy is always the most important
• Surely it takes time to practice and master
• Searching for reference is just normal and often necessary even when we are
more experienced
• Just use the method you are most confident and comfortable with
• Before getting familiar with regex, just use any method to filter down to the
closest criteria to your target before eyeballing, and you already saved lots of
time
• But when you master the skill, it will save you tonnes of time, and provide a
systematic way to reduce human error
Homework (Extended)
• If you are using Mac or WSL on Windows, you may:
• Open the Terminal
• navigate to the directory you placed the annotation file using command `cd`
• Type `cat Drosophila_annotation.gff | awk '$3=="gene"' | grep -E 'gene_name "CR[0-9]+"'`
and get the results
Homework
1. Download GENCODE human genome annotation
• http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
• Release 38 (GRCh38.p13)
• Comprehensive gene annotation
• GTF format
2. In your own way, extract rows that
• are “genes”
• from “chrX”
• have “level 1” transcript support level
Homework
3. Can you find out the number of genes with name starting with “OR_F”, where _
denotes a number of one or more digit(s)?
• Olfactory Receptor family ____ subfamily F member
4. Check the GFF/GTF File Format specification
• https://asia.ensembl.org/info/website/upload/gff.html
• Find what information can be found in each row
5. Check the meaning of “transcript support level”
• https://www.gencodegenes.org/pages/faq.html
6. (Extended) You may also download your transcript assembly history and try filtering
by different “class_code”
• Real question by your classmate and real-life use case!
Homework Remarks
1. Check the GFF/GTF File Format specification
• https://asia.ensembl.org/info/website/upload/gff.html
• Find what information can be found in each row
2. Check the meaning of “transcript support level”
• https://www.gencodegenes.org/pages/faq.html
Data handling tip 2:
Basic programming logic
Why we may want to do some coding?
• Involves handling high dimensional data (i.e. a whole lot of features)
• Each file can be large (thousands to billions of rows)
• Often need to operate on a large collection of large files
• Auto >>> manual
• Not all useful methods have
graphical user interface (GUI) versions
• Input data formats may not fit
• The output may not match your goals
• e.g. overlaying plots, highlighting targets
Manual: Walk to each haystack,
look with bare eyes for needles
Coding/Programming is basically…. LOGIC
• Variables (to store values)
• Basic data structures (how we organize the data):
• integer, float/decimal, string, boolean, array/list
• 1D: Array / list
• >=2D: tables / matrix
• Operators
• Arithmetic: add +, minus -, multiply *, divide /
• Logical: AND &, OR |
• Equality: greater than >, smaller than <, equal ==
• Flow control: Conditions (IF-THEN-ELSE) & Loops (FOR/WHILE)
Variable
• a = 1 <= assign a value of 1 to the “namespace” of a
• Like how your name represents your being and can be used for calling
• Here, “a” is the variable, and 1 is its value
Assigning a variable
• Bash: a=1 (do not leave space)
• Python: a = 1 (can omit the space, just not tidy or conventional)
• R: a <- 1 (can also use = like Python, just not R-style)
Calling a variable
• Bash: $a
• Python / R: a
String
• A series of characters / digits / symbols / whitespace
• e.g. ‘abc’, ‘ATGCACGAG’, ‘12345’, ‘Hello, World’, ‘dawlekjr;alwejr’, ‘ ‘
• Can concatenate, search for part of the string (“substring”) by pattern, etc.
• Usually denoted by being put into single or double quotes
• Some programming languages have a separate class of data type for single
characters
• Bash: no data type, but you can specify by adding quotes
• Python: str; ‘123’ or str(123) creates a string of ‘123’
• R: character; ‘123’ or as.character(123) creates a string of ‘123’
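The string operations mentioned above (concatenation, substring search, converting a number to a string) look like this in Python. This is a small sketch reusing the example values from the slide.

```python
# Concatenate two strings with +
seq = "ATG" + "CACGAG"
print(seq)

# Search for a substring: membership test and position (counted from 0)
has_cac = "CAC" in seq
position = seq.find("CAC")
print(has_cac, position)

# Turn a number into a string, then treat it as text
as_text = str(123)
joined = as_text + "45"   # text concatenation, not arithmetic
print(joined)
```

Note that `"123" + "45"` joins the characters, whereas `123 + 45` would add the numbers, which is why the data type matters.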
Numeric, i.e. numbers
• Can perform arithmetic operations
• Integers
• Floats / decimal
• Bash: no data types
• Python: int, float
• R: integer, double
• Some programming languages have special data types for larger ranges
of integer, but it is out of scope here
Boolean
• True or False
• Bash: no boolean data type,
but it does have boolean expressions
(comparisons & conditions)
• Python: bool; True, False
• R: logical; TRUE, FALSE
https://www.fallacies.ca/ttable.htm
List or Array
• A collection of elements
• Some collection data types allow collection of elements of different
data types, e.g. [‘a’, 1234, True], some only allow a uniform data type
• For simplicity, we only consider lists/arrays of one data type, such as
what we will use for data visualization later
• Bash array: declare -a my_array; my_array=(1 2 3)
• Python list: my_list = [1, 2, 3]
• R: c(1, 2, 3)
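A few basic operations on the Python list from the slide, as a minimal sketch (the variable name `doubled` is introduced here only for illustration):

```python
my_list = [1, 2, 3]

my_list.append(4)           # add an element to the end
first = my_list[0]          # Python counts positions from 0
length = len(my_list)       # number of elements
total = sum(my_list)        # arithmetic works because all elements are numbers
doubled = [x * 2 for x in my_list]  # apply an operation to every element

print(my_list, first, length, total, doubled)
```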
Operators
• Arithmetic: add +, minus -, multiply *, divide /
• Logical: AND &, OR |
• Equality: greater than >, smaller than <, equal ==
If-then(-else)
• “If I am hungry after class, (then) I go to coffee corner for dinner”
• “else I go home directly”
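The sentence above maps directly onto an if-else statement. A minimal Python sketch, where the function name `after_class` is made up for illustration:

```python
def after_class(hungry):
    """Decide what to do after class based on one boolean condition."""
    if hungry:
        return "go to coffee corner for dinner"
    else:
        return "go home directly"

print(after_class(True))
print(after_class(False))
```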
For-loop
• “For every day in this week, I eat an apple”
• for day in week:
• eat(apple)
• Enumerate (count) element in the for-condition, perform action(s) for
each element
• Bash: for i in `seq 1 20`; do echo $i; done
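A runnable Python version of the two loops above, as a sketch: the first loop goes through each day of the week, and the second is the equivalent of the Bash `seq 1 20` loop. The counter `apples_eaten` is introduced here only to show that the loop body runs once per element.

```python
week = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

apples_eaten = 0
for day in week:
    apples_eaten += 1                  # the action performed for each element
    print(f"{day}: ate an apple")

# Equivalent of: for i in `seq 1 20`; do echo $i; done
numbers = []
for i in range(1, 21):                 # range stops before 21, so this is 1..20
    numbers.append(i)
    print(i)
```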
While-loop
• “While I live, I breathe”
• Looping non-stop until the predicate condition becomes false
• Beware of infinite loop: Add condition check within the loop
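A minimal Python sketch of a while-loop with a safety check against infinite looping. The starting value and the cap `max_iterations` are arbitrary choices for illustration.

```python
value = 100.0
count = 0
max_iterations = 1000   # safety cap so the loop can never run forever

# Keep halving until the value is no longer greater than 1
while value > 1.0:
    value /= 2
    count += 1
    if count >= max_iterations:
        break           # extra condition check within the loop

print(count, value)
```

Here the loop condition itself eventually becomes false, so the cap is never reached; it is there as a guard in case the condition is written incorrectly.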
Data handling tip 3:
Processing tabular data
The Python language
• One of the most used programming languages
in the world
• Versatile usage
The R language
• Designed for statistical analysis
• Bioconductor
• A market of biological analysis-related packages
• Mostly published: proven
• Unified downloading method
No need to reinvent the wheel
Ready-made packages:
https://www.bioconductor.org/install/
DESeq2 is in Bioconductor
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
Packages for data handling on R
• readr for data import
• dplyr for data processing
• ggplot2 for data visualization
Tidyverse:
R packages designed for data
science built with unified
grammar and data structures
Packages for data handling on Python
Data processing & calculation
• pandas
• numpy
• Scikit-learn
• StatsModels
Data visualization
• Matplotlib
• Seaborn/Bokeh(/Plotly)
• Plotly not available in jupyter/datascience-notebook Docker image
• Not necessary until more advanced usage such as interactive dashboard construction
• Name mentioned for popularity
Why show me both R & Python?
• Learning the language(s) allows customized data analysis & visualization
workflows
• To be future-proof, (and as a Pythonista), I strongly encourage you to
learn Python
• To be competent in bioinformatics, you should have good command of R
to use Bioconductor packages when moving outside Galaxy
• So…
• We will only use R tidyverse for demo today
• Basic syntax of Python equivalent is also provided along side for home
practice
Hands on!
Docker
• Application container
• Cross-platform portability
• Reduces installation hassles
• Avoid dependencies issues (missing or version clash)
https://www.docker.com/get-started > Download Docker Desktop
- “an open platform for developing, shipping, and running applications”
Jupyter notebook: interactive coding environment
• Former IPython: I for interactive
• Extended to support kernels of various programming languages
• Python, R, Julia, C++, Ruby, etc.
• Markdown for easy note jotting
• Portable & Runnable
• Commonly used for development and workflow sharing
• Installation: type in Terminal / PowerShell the following
docker pull jupyter/datascience-notebook
Finding your terminal / PowerShell
• MacOS / Linux users:
Find the app “Terminal” from LaunchPad
• Windows users:
Find “PowerShell” from start menu
Download data for input
Dataset
Download
1. On Galaxy, go to
“Shared Data” > “History” >“2021-08-15_DE_analysis”
2. Download the DESeq2 result file (History 35)
• Save as “20210819-DE-result-demo.txt”
Running data science Jupyter notebook on docker
• On your internet browser, go to http://localhost:8888
• When prompted for token,
copy & paste the string behind
“?token=” shown on the
Terminal / Command Prompt
• Different each time
docker run -p 8888:8888 jupyter/datascience-notebook
Create a new R notebook
Upload dataset
Using installed package/library
Python
• import some_package
• import some_package as abbr
R
• library(some_package)
Import the tidyverse library (family)
Not errors: the more user-friendly dplyr::filter() replaces stats::filter() as the function called by
filter(). You may still call stats::filter() by writing its name in full. The same applies to lag()
”Dataframe” – the data type for your table
• 2 axes: columns (vertical; variable) & rows (horizontal; observation)
• col_names / header: name/ID of columns
• row_names / index: name/ID of rows
https://subscription.packtpub.com/book/data/9781784393878/1/ch01lvl1sec03/dissecting-the-anatomy-of-a-dataframe
Importing data
pandas @Python
• import pandas as pd
• df = pd.read_csv('xxx.csv',
header=None, comment='#')
• df = pd.read_csv('xxx.tsv', sep='\t',
header=None, comment='#')
readr (tidyverse) @R
• library(tidyverse)
• df <- read_csv('xxx.csv',
comment = '#')
• df <- read_tsv('xxx.tsv',
comment = '#')
Read our data file
Important to specify the absence of header
row, else you will miss a row of data entry!
Have a glimpse of the dataframe
pandas @Python
• df
• df.head()
• df.tail()
dplyr @R
• df
• head(df)
• tail(df)
head: first few rows
tail: last few rows
df
Renaming columns
pandas @Python
• df.columns = ['iamcol1',
'iamcol2', …]
R
• Base R way:
colnames(df) <- c('iamcol1',
'iamcol2', …)
• dplyr way:
df <- rename(df, new_name =
old_name)
The more advanced `rename_with` function
that allows more complicated custom functions
and the use of regex is left for exploration
rename all
rename specific column(s)
!! Poor “machine readability” => might cause error
Better naming
Filtering rows
pandas @Python
• df[data_mask]
• E.g.
de_result[de_result['pval'] < 0.05]
dplyr @R
• filter(df, data_mask)
• E.g.
filter(de_result, pval < 0.05)
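The pandas filtering step can be sketched as follows: a comparison on a column produces a True/False mask, and indexing the dataframe with that mask keeps only the rows marked True. This is a minimal sketch assuming pandas is installed; the toy table, its values and the column names `gene`, `log2fc` and `pval` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical toy table (made-up values), tab-separated with a header row
toy = io.StringIO(
    "gene\tlog2fc\tpval\n"
    "A\t2.1\t0.001\n"
    "B\t-0.3\t0.40\n"
    "C\t-1.8\t0.03\n"
)
de_result = pd.read_csv(toy, sep="\t")

# The comparison gives one True/False value per row (the data mask)
mask = de_result["pval"] < 0.05

# Indexing with the mask keeps only rows where the mask is True
significant = de_result[mask]
print(significant)
```

Several conditions can be combined with `&` (AND) and `|` (OR), with each condition wrapped in parentheses.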
Selecting columns
pandas @Python
• df['column_name']
dplyr @R
• df$column_name
• Returns a vector
• df['column_name']
• Returns a tibble dataframe
• df %>% select(A, B, E)
• Returns a tibble dataframe
Chaining
pandas @Python
• by dots
• df[data_mask].func().func()
dplyr @R
• by %>%
• starwars %>% group_by(gender)
%>% filter(mass > mean(mass,
na.rm = TRUE))
Example from: https://dplyr.tidyverse.org/reference/filter.html
New package installation (Extended)
Python
• On Terminal / Command Prompt:
python3 -m pip install --user some_package
R
• Within R:
install.packages("some_package")
Most necessary packages for basic data processing
& visualization are installed on jupyter/datascience-
notebook Docker image already
Data visualization basics
Why visualize data?
• Numbers are unintuitive
• Exploratory data analysis: Understand the data properties, QC
• Result presentation: Let readers get the basic ideas at one glimpse
Can you make sense of
these numbers?
How to choose a type of visualization?
• Data type
• Categorical
• Quantitative / Numerical
• Rank / Ordinal
• Relation
• Distribution: e.g. histogram, box plot, violin plot, density plot
• Correlation: e.g. scatter plot, heatmap
• Ranking: e.g. bar plot
• Timeseries: line chart
Data visualization packages on R
• ggplot2: versatile plotting library for various plot types
• gplots
• Specific libraries, e.g. pheatmap, for specific plot types
Data visualization packages on Python
• Matplotlib: versatile, can create any visualization & tune any detail
• Seaborn: more advanced plot types; visually appealing preset styles
• Bokeh / Plotly / HoloViews: visually appealing interactive plots
Data visualization in
bioinformatics
Heatmap
• Showing values with colors
DESeq2 results
Source: Differential gene expression induced by Verteporfin in endometrial cancer cells
(Bang, et al., 2019)
• often drawn using log fold change values
Conditional formatting for
a glimpse
Packages for heatmap drawing
• heatmap by ggplot2 geom_tile:
• Native, more options for data normalization & clustering
• More steps
• https://www.r-graph-gallery.com/heatmap
• pheatmap:
• https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/
• gplots heatmap.2:
• https://www.rdocumentation.org/packages/gplots/versions/3.1.1/topics/heatmap.2
Heatmap
with
Pairwise
clustering
Dendrogram
Heatmap:
- Showing values with colors
Pairwise clustering:
- Group items of high similarity
(i.e. short distances) together
e.g. are technical replicates
clustered together?
Volcano plot
• A type of scatter plot
• Visualizes differential expression results
• Shows up / down expression regulation
• y-axis: statistical significance (-logP)
• x-axis: log fold change
• Can input results from e.g. DESeq2
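The two axes of a volcano plot can be computed per gene as sketched below. This is a minimal sketch, assuming a base-10 logarithm for "-logP" and using example cutoffs (p < 0.05, absolute log2 fold change of at least 1) that are common conventions rather than values given in the slides. The function name `volcano_point` is made up for illustration.

```python
import math

def volcano_point(log2_fold_change, p_value, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return (x, y, label) for one gene on a volcano plot.

    x is the log fold change, y is -log10(p-value).
    The cutoffs are illustrative assumptions, not fixed rules.
    """
    x = log2_fold_change
    y = -math.log10(p_value)
    if p_value < p_cutoff and log2_fold_change >= lfc_cutoff:
        label = "up"
    elif p_value < p_cutoff and log2_fold_change <= -lfc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return x, y, label

# Made-up example values
print(volcano_point(2.0, 0.001))
print(volcano_point(-1.5, 0.01))
print(volcano_point(0.2, 0.6))
```

Because of the negative logarithm, smaller p-values end up higher on the y-axis, so the most significant and most strongly changed genes sit in the upper left and upper right corners.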
https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/tutorial.html
MA plot
Yin, T., Cook, D. & Lawrence, M. ggbio: an R package for extending the grammar of
graphics for genomic data. Genome Biol 13, R77 (2012). https://doi.org/10.1186/gb-2012-13-8-r77
• DE analysis results
• Up & down regulation
• y: log fold change
• x: normalized mean
• Color:
Statistical significance
• Available in DESeq2
Genome-wide Variation: Manhattan plot
• A type of scatter plot
• Shows genome wide location
of loci features, e.g. variants
https://www.researchgate.net/publication/272083683_Genome-wide_association_study_of_clinically_defined_gout_identifies_multiple_risk_loci_and_its_association_with_clinical_subtypes/figures?lo=1
Network
• Connections show
interaction, e.g. between
genes
• Can be drawn by e.g.
Cytoscape
https://www.researchgate.net/publication/321256870_Differential_gene_expression_in_heterophils_isolated_from_commercial_hybrid_and_Thai_indigenous_broiler_chickens_under_quercetin_supplementation/figures?lo=1
Node
Edge
1. On Galaxy, go to
“Shared Data” > “History” >“2021-08-15_DE_analysis”
2. Download the featureCounts result files (Histories 27, 29, 31, 33)
• Save as
“20210819-SRR1210078_WT_rep1-featureCounts.txt”
“20210819-SRR1210079_WT_rep2-featureCounts.txt”
“20210819-SRR1210084_C24_rep1-featureCounts.txt”
“20210819-SRR1210085_C24_rep2-featureCounts.txt”
3. Upload to the Jupyter notebook
Download data for input
ggplot2 basics (hands-on)
Basics
*Plot element layers & settings can be
added by chaining function() with +
Facet’s Panels
Labels
Breaks
*Layers of Geometric Objects
*Legend
Plot Anatomy
Further ggplot2 examples (Extended)
From
• Histogram
https://www.r-graph-gallery.com/220-basic-ggplot2-histogram.html
• Scatterplot
• https://www.r-graph-gallery.com/274-map-a-variable-to-ggplot2-scatterplot.html
• Boxplot
https://www.r-graph-gallery.com/boxplot.html
Python visualization basics (Extended)
• Matplotlib gallery
• https://matplotlib.org/stable/gallery/index.html
• Seaborn gallery
• https://seaborn.pydata.org/examples/index.html
• Python graph gallery
• https://www.python-graph-gallery.com/
Data visualization
general principles
Is it worth a figure?
• Key result?
• Also depends on journal and
current trend
Vale, 2015 ; Accelerating scientific publication in biology
Data visualization is where art meets science
BIG DATA V.01
(https://in.pinterest.com/pin/817614507335199087/)
Adjusting a figure
The most basic plot output before
aesthetic tuning is NOT a done deal
Gestalt
principles of
visual
perception
Matplotlib 2.x by Example, 2017
Plot as visual aid: Perception
• Jakob’s Law in UX web design: “People expect your website to work
the same way as the other websites they're using”
• sO thIs iS odd
• Highlight what is important (you ARE distracted)
• Same color for the same variable
What Are Data Visualization Style Guidelines? by Amy Cesal
https://medium.com/nightingale/style-guidelines-92ebe166addc
Reader-friendliness
• Aim: Be intuitive rather than fancy
• Proper contrast (not like this)
• Be color weakness-friendly (this may be not)
• Proper contrast in hues
• Palette generator:
https://color.adobe.com/create/color-accessibility
Proof tool on Adobe Illustrator
More resources:
• https://www.color-blindness.com/coblis-color-blindness-simulator/
• https://helpx.adobe.com/creative-cloud/adobe-color-accessibility-tools.html
• https://creativepro.com/viewing-color-blind-previews-of-pages/
Proper color contrast for categorical data
https://thenode.biologists.com/data-visualization-with-flying-colors/research/
http://bconnelly.net/posts/creating_colorblind-friendly_figures/
Smooth colormap for quantitative data
• The ”scalebar” for heatmap
• Visual “uniformity” preferred
Visually “amplified” expected value changes
Smooth visual gradient
Post-processing
• Export plot output from code in vector graphics, e.g. SVG, PDF
• Editability: you can open with Illustrator, Inkscape, etc
• Full resolution: export to at least 300 dpi for rasterized graphics, e.g. PNG
• Only make necessary & scientifically allowed (i.e. not misleading)
changes
• E.g. Edit font size, add * to indicate statistical significance where necessary
• Always aim to be clean & intuitive
“Make up till you cannot be recognized is disguise” – Dayo Wong
Final tips
Choose your weapon
• Excel: know formulas & functions well for simple handling
• Bash: quick string / file view & manipulation; scripting, e.g. when
looping through hundreds of files or tonnes of lines
• Python / R: More advanced data analytics & visualization
• At least be good enough in one that you can do everything with it
• Use what you are confident with
Google is your friend!!
• StackOverflow (various StackExchange sites)
• Quora/Reddit/CSDN/Qiita (whatever)
• 🔍 {tool I use} {my purpose}
• 🔍 {the error message}
• 🔍 {tool idk how to use} manual
• 🔍 {tool with hard-to-read official manual} tutorial
• 🔍 {tool} cheatsheet
Books on Data &
Visualization
Now Matplotlib 3.4.2
More free online resources
• RegExr: Learn, Build, & Test RegEx
• https://regexr.com/
• regex101: build, test, and debug regex
• https://regex101.com/
• R cheatsheet
• https://www.rstudio.com/resources/cheatsheets/
• R Tidyverse tutorial by DataCamp
• https://www.datacamp.com/community/tutorials/tidyverse-tutorial-r
• Data Visualisation: the Good, the Bad and the Ugly (1) by Mina Pêcheux
• https://minapecheux.com/website/2018/07/17/data-visualization-the-good-the-bad-and-the-ugly-1/
More free online resources (Extended)
• BASH programming
• https://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
• Advanced Bash-Scripting Guide by Mendel Cooper
• https://tldp.org/LDP/abs/html/
• Bash scripting cheatsheet
• https://devhints.io/bash
• Learnpython.org
• https://www.learnpython.org/
HAVE FUN PLAYING
WITH DATA!!

More Related Content

PPTX
uw cse correct style and speed autumn 2020
PPTX
Reading Notes : the practice of programming
PPTX
Reproducible research concepts and tools
PPTX
Machine Learning with ML.NET and Azure - Andy Cross
PPTX
Python Tutorial Part 1
PPTX
Data analysis patterns, tools and data types in genomics
PDF
From Pipelines to Refineries: Scaling Big Data Applications
PPTX
ANTLR - Writing Parsers the Easy Way
uw cse correct style and speed autumn 2020
Reading Notes : the practice of programming
Reproducible research concepts and tools
Machine Learning with ML.NET and Azure - Andy Cross
Python Tutorial Part 1
Data analysis patterns, tools and data types in genomics
From Pipelines to Refineries: Scaling Big Data Applications
ANTLR - Writing Parsers the Easy Way


Data processing and visualization basics

  • 1. Data processing & visualization methods Claire Chung 2021/08/17 LSCI6101 – Techniques in Biocomputing
  • 2. Overview • Bioinformatics as data science • Why data processing & data visualization are important • Basic data processing skills • Regular expressions • Basic programming concepts (Extended) • Processing tabular data using R (Extended) • Basic data visualization in bioinformatics • Common plot types in bioinformatics • Common data visualization packages • General principles & tips for data visualization
  • 3. Aims • The first taste: Learn through hands-on examples • Get code snippets to play with and modify for future use • Provide *keywords* and key resources to kick start onward learning 3
  • 4. Different “tastes” of bioinformaticians ENGINEER DATA SCIENTIST “WET LAB”-FOCUSED Discovery-oriented Feasibility & elegance of methodology 4
  • 5. Work of a bioinformatician • Understanding of bioinformatics study design & data, from data-generation experiments to the resulting data • Selection & usage of proper tools to handle data; may include software installation, which can be non-trivial • Communication of progress and presentation of data analysis results
  • 6. Bioinformatics as Data Science: the data cycle • Ask an interesting question: understand the biology behind it; design the study & experiments needed • Get the data: generate data via experiments and/or get relevant data from public databases • Explore the data: know the metadata (origin? type? format specification?); perform quality checks; perform data cleaning; transform data where necessary; get a preliminary understanding of the data distribution, aided by data visualization • Analyze the data: choose the right tools; understand and interpret the results; often needs data transformation • Communicate the results: effective communication requires intuitive graphics; choose the right plot type; tune the aesthetics; add a proper legend; clear writing • Data management & operation processes are non-trivial too
  • 7. Application of skills to learn in this session • Data processing: data cleaning, data filtering • Check the number of data entries • Check if the data contain irrelevant entries, missing values, unsupported characters or extra spaces • Fix or remove erroneous data • Change file formats to fit different tools • Filter data for downstream analysis, e.g. filter assembled transcripts by class code • Data visualization: Exploratory Data Analysis (EDA), result presentation • Check if the data distribution looks reasonable • Look for trends and/or outliers preliminarily
  • 8. Effective use of Excel data tools is good for simple, quick handling • Data tab • Sorting & Filtering • Formula bar https://www.exceltip.com/basic-excel/data-tab.html https://excelwithbusiness.com/blog/15-excel-data-analysis-functions-need/
  • 9. Problems with “just” using Excel • Limited number of rows • Slow in opening large files • Cannot streamline (“pipe”) input & output from and to other processing steps • Often we just need one line of commands to finish • E.g. `cat XXX.gtf | awk '$3=="gene"' | cut -f9 | sed 's/.*gene_id=\([^;]*\).*/\1/g' | sort -u` extracts unique gene IDs from the entire GTF annotation • On Excel, you may need many clicks, “save as" and re-opening for the same action • Less versatile processing functions • Data may be inadvertently changed by Excel's automatic formatting • e.g. “gene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’” https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
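For readers who prefer Python to a shell pipeline, the same idea (keep rows whose 3rd column is "gene", take the 9th column, pull out the gene ID, de-duplicate) can be sketched as below. This is only an illustration: the three-row GTF excerpt and its IDs are made up, and the sketch assumes the GTF-style attribute form `gene_id "..."` rather than the `gene_id=` form used in the one-liner.

```python
import re

# Hypothetical GTF excerpt (tab-separated; attributes live in column 9).
GTF_TEXT = (
    'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "A";\n'
    'chr1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G1"; exon_number "1";\n'
    'chr1\tdemo\tgene\t1500\t2500\t.\t-\t.\tgene_id "G2"; gene_name "B";\n'
)

# Assumption: GTF-style attribute, i.e. gene_id "VALUE";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def unique_gene_ids(gtf_text):
    """Return the sorted unique gene IDs of rows whose 3rd column is 'gene'."""
    ids = set()
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # same role as awk '$3=="gene"'
        match = GENE_ID.search(fields[8])  # column 9, like cut -f9 + sed
        if match:
            ids.add(match.group(1))
    return sorted(ids)  # same role as sort -u


print(unique_gene_ids(GTF_TEXT))
```

To run it on a real file, the text could be read with `open(path).read()` and passed to `unique_gene_ids`.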
  • 10. Common programming/scripting languages in bioinformatics Programming / scripting languages • Bash: built into Unix-like systems (MacOS, Linux) • Python • R • Perl: phasing out; still commonly seen in older packages Other tools • Microsoft Excel or equivalent spreadsheet software • More advanced text editors
  • 11. Data handling tip 1: Regular expression 11
  • 12. Regular expression- basics • Pattern of strings (a series of characters / digits / symbols / whitespace) • Often abbreviated as “regex” or “regexp” • Useful in searching and/or replacing string (e.g. changing ID formats) • Available in most if not all programming languages, as well as more advanced text editors (i.e. most other than Windows notepad) • Different software may differ slightly in syntax, but mostly similar 12
  • 13. Case study: why do we need scripting / regular expression commands? • What if I would like to get only the DNA sequence onto one line? • What if I have >10k lines?
  • 14. Example solution 1. Open the “Find & Replace” function • Windows: Ctrl + H • MacOS: Command ⌘ + Option ⌥ + F 2. Select “Case sensitive” 3. Type “[A-Z]{3}\s” in the “Find” blank 4. Select “Find All” 5. Copy and paste the selection to a new file
  • 15. Example solution 6. Type “\s+\n” to select all trailing spaces and the line break on each line 7. Select “Replace All” DONE!!
  • 16. Regular expression (example from Python) • Digit (0-9): \d • Non-digit: \D • Whitespace (space, tab): \s • Non-whitespace: \S • Line break: \n https://docs.python.org/3/library/re.html Check the exact syntax in the documentation of the tool you use. For instance, * in some tools only means repeating the previous item, e.g. \d* means a series of digits, rather than a digit followed by any number of any characters
  • 17. Regular expression (example from Python) • Start of line: ^ (when placed at the start of a pattern) • End of line: $ (when placed at the end of a pattern) • Present 0 or 1 time: ? • Present 1 or more times: + • Present 0 or more times: * • Repeat n times: {n} • Wildcard (any single character): . https://docs.python.org/3/library/re.html
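The tokens on these two slides can be tried out directly with Python's built-in `re` module. The snippet below is a small sketch on a made-up string; the example line and variable names are illustrative only.

```python
import re

line = "gene42\tchrX  1000"  # made-up example: a tab, then two spaces

# \d+ : one or more digits
digit_runs = re.findall(r"\d+", line)

# \S+ : one or more non-whitespace characters (splits on tabs and spaces)
tokens = re.findall(r"\S+", line)

# ^ and $ anchor the pattern to the start and the end of the string
starts_with_gene = re.search(r"^gene", line) is not None
ends_with_4_digits = re.search(r"\d{4}$", line) is not None

# ? makes the previous item optional (present 0 or 1 time)
optional_u = [w for w in ("color", "colour", "colouur")
              if re.fullmatch(r"colou?r", w)]

# * repeats the previous item 0 or more times; . matches any single character
zero_or_more = re.fullmatch(r"AT*G", "AG") is not None
any_char = re.fullmatch(r"A.G", "ATG") is not None

print(digit_runs, tokens, optional_u)
```

Writing patterns as raw strings (`r"..."`) keeps the backslashes intact, which is easy to get wrong when copying patterns between tools.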
  • 18. 18
  • 19. Common string / file manipulation operations on MacOS & Linux Function: Bash command • Row counting: wc -l • Sorting: sort • Selection / Filtering by row: grep, awk, sed • Replacement: awk, sed • Column selection: cut *Syntax sometimes differs slightly between MacOS and Linux, e.g. grep -E vs grep -P for selecting by regex patterns. Since not everyone has access to Linux servers or uses MacOS/Linux computers so far, we will get hands-on with some more advanced GUI text editors, so you can process small datasets on your own computers too.
  • 20. More advanced text editors • Notepad++ • https://notepad-plus-plus.org/downloads/v8.1.3/ • Entirely free; open source • Sublime Text (for demo in class) • https://www.sublimetext.com/download • More functions; would prompt for license purchase • Code-ready: Syntax highlighting • Data-ready: can open larger text files quicker by loading the file in small chunks
  • 21. Hands-on practice • Download Sublime Text • On Galaxy, go to Shared data > History > 2021-08-15_DE_analysis • Download history 1 Drosophila annotation. • Save as Drosophila_annotation.gff Task: • Extract all “gene” features (not CDS / exon / 5’ or 3’ UTR / start or end codon, etc.) whose “gene_name” starts with “CR” followed by a 5-digit ID • Flybase: CG for protein-coding genes, CR for non-protein-coding genes
  • 22. 22
  • 23. Hands-on practice Method 1: Excel 1. Select the 9th column of data 2. Select the “Data” tool tab 3. Select “Text to Columns” 4. Select “Delimited” 5. Select “semicolon” 6. Select ”Standard” 7. Click “Finish” 23
  • 25. 25 Gene features filtered Filter for values equal to “gene”
  • 26. Oh no… Excel filter is not case sensitive. And am I going to add 10 more rules to specify the digits? Filter for values that contain ‘gene_name “CR’
  • 27. More caveats with using native Excel functions for filtering • Slow with many steps • Not specific enough • What if we have a file with non-uniform number of items for different attribute rows? (It can happen) • First check the number of rows containing the desired feature, e.g. “gene_name” is the same as the total row number • And in the last example, there can also be genes like CRXXX1? • Not in Drosophila. How do I know? Surely, I didn’t eyeball them… 27
  • 28. When you master regex… • Open the “Drosophila_annotation.gff” file again in Sublime Text • Open the “Find” function by Ctrl + F (Windows) / Command ⌘ + F (Mac) • Turn on “Regular Expression” and “Case sensitive” modes • Input ^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+" • Click “Find All” • Copy & Paste to a new file • DONE!!! You can even speed things up by using hotkeys rather than clicking
  • 29. The regex pattern explained • ^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+" • Starting ^: a string that starts with • (): a capture group of • []: allowed symbols contained within • ^ within []: not • \t: tab • +: present one or more times • {n}: repeated n times • \n: line break • \d: digit • In words: a string that starts with two repeats of a group of non-tab characters followed by a tab, followed by the string “gene” and a tab, followed by 5 more repeats of the group of non-tab characters followed by a tab, then some non-line-break (i.e. basically any) characters, finally followed by the string ‘gene_name "CR’, one or more digits and a closing double quote • i.e. get tab-separated rows that have the string “gene” in the 3rd column and whose 9th column contains the string gene_name "CRxxx", where xxx is any number of digits
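The GFF pattern from this slide, written with its backslash escapes, can also be applied line by line in Python. The sketch below uses three made-up annotation rows (the FBgn and CR/CG identifiers are hypothetical) to show that only "gene" rows with a CR gene name are kept.

```python
import re

# Made-up GFF/GTF-like rows; IDs are hypothetical and for illustration only.
GFF_LINES = [
    'chr2L\tFlyBase\tgene\t100\t900\t.\t+\t.\tgene_id "FBgn0000001"; gene_name "CR12345";',
    'chr2L\tFlyBase\tgene\t1500\t2500\t.\t-\t.\tgene_id "FBgn0000002"; gene_name "CG11111";',
    'chr2L\tFlyBase\texon\t100\t300\t.\t+\t.\tgene_id "FBgn0000001"; gene_name "CR12345";',
]

# 2 tab-terminated columns, then "gene" in column 3, 5 more columns,
# then anything on the same line up to gene_name "CR<digits>".
PATTERN = re.compile(r'^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"')


def select_cr_genes(lines):
    """Keep whole lines that match the pattern from the start of the line."""
    return [line for line in lines if PATTERN.match(line)]


selected = select_cr_genes(GFF_LINES)
print(len(selected))
```

To require exactly five digits, as in the hands-on task, `\d+` could be replaced by `\d{5}`.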
  • 30. Remarks before you feel like reading spells…. 30 • This is not a contest. Accuracy is always the most important • Surely it takes time to practice and master • Searching for reference is just normal and often necessary even when we are more experienced • Just use the method you are most confident and comfortable with • Before getting familiar with regex, just use any method to filter down to the closest criteria to your target before eyeballing, and you already saved lots of time • But when you master the skill, it will save you tonnes of time, and provides a systematic way to reduce human error
  • 31. Homework (Extended) • If you are using Mac or WSL on Windows, you may: • Open the Terminal • Navigate to the directory where you placed the annotation file using the command `cd` • Type `cat Drosophila_annotation.gff | awk '$3=="gene"' | grep -E 'gene_name "CR[0-9]+"'` and get the results
  • 32. Homework 1. Download GENCODE human genome annotation • http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz • Release 38 (GRCh38.p13) • Comprehensive gene annotation • GTF format 2. In your own way, extract rows that • are “genes” • from “chrX” • have “level 1” transcript support level
  • 33. Homework 3. Can you find out the number of genes with names starting with “OR_F”, where _ denotes a number of one or more digit(s)? • Olfactory Receptor family ____ subfamily F member 4. Check the GFF/GTF File Format specification • https://asia.ensembl.org/info/website/upload/gff.html • Find what information can be found in each row 5. Check the meaning of “transcript support level” • https://www.gencodegenes.org/pages/faq.html 6. (Extended) You may also download your transcript assembly history and try filtering by different “class_code” • Real question by your classmate and real-life use case!
  • 34. Homework Remarks 1. Check the GFF/GTF File Format specification • https://asia.ensembl.org/info/website/upload/gff.html • Find what information can be found in each row 2. Check the meaning of “transcript support level” • https://www.gencodegenes.org/pages/faq.html
  • 35. Data handling tip 2: Basic programming logic 35
  • 36. Why may we want to do some coding? • Involves handling high-dimensional data (i.e. a whole lot of features) • Each file can be large (thousands to billions of rows) • Often need to operate on a large collection of large files • Auto >>> manual • Not all useful methods have graphical user interface (GUI) versions • Input data format does not fit • The output may not match your goals • e.g. overlaying plots, highlighting targets Manual: Walk to each haystack, look with bare eyes for needles
  • 37. Coding/Programming is basically…. LOGIC • Variables (to store values) • Basic data types & structures (how we organize the data): • integer, float/decimal, string, boolean • 1D: array / list • >=2D: table / matrix • Operators • Arithmetic: add +, subtract -, multiply *, divide / • Logical: AND &, OR | • Comparison: greater than >, smaller than <, equal == • Flow control: Conditions (IF-THEN-ELSE) & Loops (FOR/WHILE)
  • 38. Variable • a = 1 <= assign a value of 1 to the “namespace” of a • Like how your name represents your being and can be used for calling • Here, “a” is the variable, and 1 is its value Assigning a variable • Bash: a=1 (do not leave space) • Python: a = 1 (can omit the space, just not tidy or conventional) • R: a <- 1 (can also use = like Python, just not R-style) Calling a variable • Bash: $a • Python / R: a 38
  • 39. String • A series of characters / digits / symbols / whitespace • e.g. ‘abc’, ‘ATGCACGAG’, ‘12345’, ‘Hello, World’, ‘dawlekjr;alwejr’, ‘ ‘ • Can concatenate, search for part of the string (“substring”) by pattern, etc. • Usually denoted by being put into single or double quotes • Some programming languages have a separate class of data type for single characters • Bash: no data type, but you can specify by adding quotes • Python: str; ‘123’ or str(123) creates a string of ‘123’ • R: character; ‘123’ or as.character(123) creates a string of ‘123’ 39
  • 40. Numeric, i.e. numbers • Can perform arithmetic operations • Integers • Floats / decimal • Bash: no data types • Python: int, float • R: integer, double • Some programming languages have special data types for larger ranges of integer, but it is out of scope here 40
  • 41. Boolean • True or False • Bash: no Boolean data type, but it does have Boolean expressions (comparison & conditions) • Python: bool; True, False • R: logical; TRUE, FALSE https://www.fallacies.ca/ttable.htm
  • 42. List or Array • A collection of elements • Some collection data types allow collection of elements of different data types, e.g. [‘a’, 1234, True], some only allow a uniform data type • For simplicity, we only consider lists/arrays of one data type, such as what we will use for data visualization later • Bash array: declare -a my_array; my_array=(1 2 3) • Python list: my_list = [1, 2, 3] • R: c(1, 2, 3) 42
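The variable and data-type slides above (variable, string, numeric, Boolean, list) can be summarised in a few lines of Python. This is a minimal sketch; the variable names and values are arbitrary examples.

```python
# Assigning variables of the basic data types
a = 1                      # integer (int)
log2_fold_change = -1.5    # float
sequence = "ATGCACGAG"     # string (str)
is_significant = True      # Boolean (bool)
counts = [1, 2, 3]         # list

# Converting between types
as_text = str(123)         # the string '123'
as_number = int("123")     # the integer 123

# Typical operations per type
joined = sequence + "TAA"               # string concatenation
has_start = sequence.startswith("ATG")  # substring check at the start
total = sum(counts)                     # arithmetic over a list

type_names = [type(v).__name__
              for v in (a, log2_fold_change, sequence, is_significant, counts)]
print(type_names)
```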
  • 43. Operators • Arithmetic: add +, subtract -, multiply *, divide / • Logical: AND &, OR | • Comparison: greater than >, smaller than <, equal ==
  • 44. If-then(-else) • “If I am hungry after class, (then) I go to coffee corner for dinner” • “else I go home directly” 44
  • 45. For-loop • “For every day in this week, I eat an apple” • for day in week: • eat(apple) • Enumerate (count) element in the for-condition, perform action(s) for each element • Bash: for i in `seq 1 20`; do echo $i; done 45
  • 46. While-loop • “While I live, I breathe” • Looping non-stop until the predicate condition becomes false • Beware of infinite loop: Add condition check within the loop 46
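The three flow-control ideas above (if-then-else, for-loop, while-loop) might look like this in Python. The sketch reuses the FlyBase CG/CR naming convention mentioned earlier; the function names, gene names and numbers are made up for illustration.

```python
def classify_gene(gene_name):
    """If-then-else: label a FlyBase-style gene name by its prefix."""
    if gene_name.startswith("CG"):
        return "protein-coding"
    elif gene_name.startswith("CR"):
        return "non-protein-coding"
    else:
        return "other"


def count_halvings(value, threshold):
    """While-loop: halve `value` until it drops below `threshold`."""
    if threshold <= 0:
        # Guard against an infinite loop: halving never goes below zero.
        raise ValueError("threshold must be positive")
    steps = 0
    while value >= threshold:
        value = value / 2
        steps += 1
    return steps


# For-loop: perform the same action for each element of a list
names = ["CG11111", "CR12345", "mt:ND1"]
labels = []
for name in names:
    labels.append(classify_gene(name))

print(labels, count_halvings(100, 10))
```

The explicit check in `count_halvings` is one way to follow the "beware of infinite loop" advice: the loop condition is only safe if it is guaranteed to become false.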
  • 47. Data handling tip 3: Processing tabular data 47
  • 48. The Python language • One of the most widely used programming languages in the world • Versatile usage
  • 49. The R language • Designed for statistical analysis • Bioconductor • A market of biological analysis-related packages • Mostly published: proven • Unified downloading method No need to reinvent the wheel Ready-made packages:
  • 51. Packages for data handling on R • readr for data import • dplyr for data processing • ggplot2 for data visualization 51 Tidyverse: R packages designed for data science built with unified grammar and data structures
  • 52. Packages for data handling on Python Data processing & calculation • pandas • numpy • Scikit-learn • StatsModels Data visualization • Matplotlib • Seaborn/Bokeh(/Plotly) 52 • Plotly not available in jupyter/datascience-notebook Docker image • Not necessary until more advanced usage such as interactive dashboard construction • Name mentioned for popularity
  • 53. Why show me both R & Python? • Learning the language(s) allows customized data analysis & visualization workflows • To be future-proof (and speaking as a Pythonista), I strongly encourage you to learn Python • To be competent in bioinformatics, you should have a good command of R to use Bioconductor packages when moving outside Galaxy • So… • We will only use R tidyverse for the demo today • Basic syntax of the Python equivalent is also provided alongside for home practice
  • 55. Docker • Application container: “an open platform for developing, shipping, and running applications” • Cross-platform portability • Reduces installation hassles • Avoids dependency issues (missing or version clash) https://www.docker.com/get-started > Download Docker Desktop
  • 56. Jupyter notebook: interactive coding environment • Former IPython: I for interactive • Extended to support kernels of various programming languages • Python, R, Julia, C++, Ruby, etc. • Markdown for easy note jotting • Portable & Runnable • Commonly used for development and workflow sharing • Installation: type in Terminal / PowerShell the following 56 docker pull jupyter/datascience-notebook
  • 57. Finding your terminal / PowerShell • MacOS / Linux users: Find the app “Terminal” from LaunchPad • Windows users: Find “PowerShell” from start menu 57
  • 58. Download data for input 58 Dataset Download 1. On Galaxy, go to “Shared Data” > “History” >“2021-08-15_DE_analysis” 2. Download the DESeq2 result file (History 35) • Save as “20210819-DE-result-demo.txt”
  • 59. Running data science Jupyter notebook on Docker • Run in Terminal / PowerShell: docker run -p 8888:8888 jupyter/datascience-notebook • In your internet browser, go to http://localhost:8888 • When prompted for a token, copy & paste the string behind “?token=” shown in the Terminal / Command Prompt • The token is different each time
  • 60. Create a new R notebook 60
  • 62. Using installed package/library Python • import some_package • import some_package as abbr R • library(some_package)
  • 63. Import the tidyverse library (family) 63 Not errors: the more user friendly dplyr::filter() replaces the stats::filter() to be called by filter(). You may still call stats::filter() by writing it in full as stats::filter(). Similar for lag()
  • 64. “Dataframe” – the data type for your table • 2 axes: columns (vertical; variable) & rows (horizontal; observation) • col_names / header: name/ID of columns • row_names / index: name/ID of rows https://subscription.packtpub.com/book/data/9781784393878/1/ch01lvl1sec03/dissecting-the-anatomy-of-a-dataframe
  • 65. Importing data pandas @Python • import pandas as pd • df = pd.read_csv('xxx.csv', header=None, comment='#') • df = pd.read_csv('xxx.tsv', sep='\t', header=None, comment='#') tidyverse (readr) @R • library(tidyverse) • df <- read_csv('xxx.csv', comment='#') • df <- read_tsv('xxx.tsv', comment='#')
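To try the pandas import call without any file on disk, the text can be wrapped in `io.StringIO`. The sketch below assumes a small made-up tab-separated table with one comment line and no header row; the gene names and numbers are invented.

```python
import io

import pandas as pd

# Made-up tab-separated text: one '#' comment line, no header row.
TSV_TEXT = (
    "# demo DE result without a header row\n"
    "geneA\t10.5\t1.2\t0.01\n"
    "geneB\t3.0\t-0.4\t0.60\n"
)

# header=None tells pandas that the first row is data, not column names;
# comment='#' skips the comment line.
df = pd.read_csv(io.StringIO(TSV_TEXT), sep="\t", header=None, comment="#")

print(df.shape)
print(df.head())
```

With a real file, `io.StringIO(TSV_TEXT)` would simply be replaced by the file path.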
  • 66. Read our data file 66 Important to specify the absence of header row, else you will miss a row of data entry!
  • 67. Have a glimpse of the dataframe pandas @Python • df • df.head() • df.tail() dplyr @R • df • head(df) • tail(df) 67 head: first few rows tail: last few rows
  • 68. 68 df
  • 69. Renaming columns pandas @Python • Rename all: df.columns = ['iamcol1', 'iamcol2', …] R • Base R way (rename all): colnames(df) <- c('iamcol1', 'iamcol2', …) • dplyr way (rename specific column(s)): df <- rename(df, new_name = old_name) The more advanced `rename_with` function, which allows more complicated custom functions and the use of regex, is left for exploration
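In pandas, both styles of renaming (all columns at once, or specific columns only) can be sketched as follows. The table contents and column names are made up for illustration.

```python
import pandas as pd

# Made-up table read without a header, so columns are numbered 0, 1, 2
df = pd.DataFrame([["geneA", 10.5, 0.01], ["geneB", 3.0, 0.60]])

# Rename all columns at once (the list length must match the column count)
df.columns = ["gene_id", "base_mean", "pval"]

# Rename specific column(s): the mapping goes old name -> new name.
# rename() returns a new dataframe and leaves df unchanged.
renamed = df.rename(columns={"pval": "p_value"})

print(list(df.columns), list(renamed.columns))
```

Note the direction of the mapping: pandas uses `{old: new}`, whereas dplyr's `rename()` is written `new = old`.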
  • 70. 70 !! Poor “machine readability” => might cause error Better naming
  • 71. Filtering rows pandas @Python • df[data_mask] • E.g. de_result[de_result['pval'] < 0.05] dplyr @R • filter(df, data_mask) • E.g. filter(de_result, pval < 0.05)
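Row filtering in pandas works by building a Boolean mask (one True/False per row) and indexing the dataframe with it. A minimal sketch on made-up differential expression values, with assumed column names `log2fc` and `pval`:

```python
import pandas as pd

# Made-up DE results; column names are assumptions for this example.
de_result = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4"],
    "log2fc": [2.0, 0.1, -1.5, 0.3],
    "pval": [0.001, 0.2, 0.04, 0.5],
})

# The mask is a Series of True/False values, one per row
mask = de_result["pval"] < 0.05
significant = de_result[mask]

# Combine conditions with & (AND) or | (OR); each condition needs parentheses
up_regulated = de_result[(de_result["pval"] < 0.05) & (de_result["log2fc"] > 0)]

print(significant["gene_id"].tolist(), up_regulated["gene_id"].tolist())
```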
  • 72. 72
  • 73. Selecting columns pandas @Python • df['column_name'] dplyr @R • df$column_name • Returns a collection (vector) • df['column_name'] • Returns a tibble dataframe • df %>% select(A, B, E) • Returns a tibble dataframe
  • 74. Chaining pandas @Python • by dots • df[data_mask].func().func() dplyr @R • by %>% • starwars %>% group_by(gender) %>% filter(mass > mean(mass, na.rm = TRUE)) Example from: https://dplyr.tidyverse.org/reference/filter.html
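On the pandas side, the return type also depends on how the column is selected, much like `df$column_name` versus `select()` in R. A short sketch with a made-up table:

```python
import pandas as pd

# Made-up table with columns named as in the select(A, B, E) example
df = pd.DataFrame({
    "A": [1, 2],
    "B": [3.0, 4.0],
    "C": ["x", "y"],
    "E": [True, False],
})

one_column = df["A"]           # single name: returns a Series (1D)
one_column_frame = df[["A"]]   # list of names: returns a DataFrame (2D)
subset = df[["A", "B", "E"]]   # several columns, in the order given

print(type(one_column).__name__, type(one_column_frame).__name__)
print(list(subset.columns))
```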
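A rough pandas counterpart of the dplyr chain (group by a column, then keep rows above their group mean) can be written by chaining methods with dots. The sketch below does not use the real starwars dataset; the names, genders and masses are invented, and the helper column `group_mean` is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Invented data standing in for the starwars table
people = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "gender": ["m", "m", "f", "f", "f"],
    "mass": [80.0, 60.0, 50.0, 70.0, np.nan],
})

heavier_than_group_mean = (
    people
    # per-row mean of the row's own gender group (NaN is skipped, like na.rm = TRUE)
    .assign(group_mean=lambda d: d.groupby("gender")["mass"].transform("mean"))
    # keep rows whose mass exceeds their group mean (NaN compares as False)
    .query("mass > group_mean")
    # drop the helper column and renumber the rows
    .drop(columns="group_mean")
    .reset_index(drop=True)
)

print(heavier_than_group_mean)
```

Wrapping the chain in parentheses allows one method per line, which reads much like a `%>%` pipeline.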
  • 75. 75
  • 76. 76
  • 77. New package installation (Extended) Python • On Terminal / Command Prompt: python3 -m pip install --user some_package R • Within R: install.packages("some_package") Most necessary packages for basic data processing & visualization are already installed on the jupyter/datascience-notebook Docker image
  • 79. Why visualize data? • Numbers are unintuitive • Exploratory data analysis: Understand the data properties, QC • Result presentation: Let readers get the basic ideas at a glance Can you make sense of these numbers?
  • 80. How to choose a type of visualization? • Data type • Categorical • Quantitative / Numerical • Rank / Ordinal • Relation • Distribution: e.g. histogram, box plot, violin plot, density plot • Correlation: e.g. scatter plot, heatmap • Ranking: e.g. bar plot • Timeseries: line chart 80
  • 81. Data visualization packages on R • ggplot2: versatile plotting library for various plot types • gplots • Specific libraries, e.g. pheatmap, for specific plot types 81
  • 82. Data visualization packages on Python • Matplotlib: versatile, can create any visualization & tune any detail • Seaborn: more advanced plot types; visually appealing preset styles • Bokeh / Plotly / HoloViews: visually appealing interactive plots 82
  • 84. Heatmap • Showing values with colors DESeq2 results Source: Differential gene expression induced by Verteporfin in endometrial cancer cells (Bang, et al., 2019) • often drawn using log fold change values 84 Conditional formatting for a glimpse
  • 85. Packages for heatmap drawing • heatmap by ggplot2 geom_tile: • Native, more options for data normalization & clustering • More steps • https://www.r-graph-gallery.com/heatmap • pheatmap: • https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/ • gplots heatmap.2: • https://www.rdocumentation.org/packages/gplots/versions/3.1.1/topics/heatmap.2
  • 86. Heatmap with pairwise clustering • Dendrogram • Heatmap: showing values with colors • Pairwise clustering: group items of high similarity (i.e. short distances) together, e.g. are technical replicates clustered together?
  • 87. Volcano plot • A type of scatter plot • Visualizes differential expression results • Shows up / down expression regulation • y-axis: statistical significance (-logP) • x-axis: log fold change • Can input results from e.g. DESeq2 https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/tutorial.html
  • 88. MA plot Yin, T., Cook, D. & Lawrence, M. ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol 13, R77 (2012). https://doi.org/10.1186/gb-2012-13-8-r77 • DE analysis results • Up & down regulation • y: log fold change • x: normalized mean • Color: Statistical significance • Available in DESeq2
  • 89. Genome-wide Variation: Manhattan plot • A type of scatter plot • Shows genome-wide location of loci features, e.g. variants https://www.researchgate.net/publication/272083683_Genome-wide_association_study_of_clinically_defined_gout_identifies_multiple_risk_loci_and_its_association_with_clinical_subtypes/figures?lo=1
  • 90. Network • Nodes and edges: connections show interaction, e.g. between genes • Can be drawn by e.g. Cytoscape https://www.researchgate.net/publication/321256870_Differential_gene_expression_in_heterophils_isolated_from_commercial_hybrid_and_Thai_indigenous_broiler_chickens_under_quercetin_supplementation/figures?lo=1
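The data preparation behind a volcano plot (x: log fold change, y: -log10 of the P-value, points labelled up / down / not significant) can be sketched in Python as below. The column names `log2fc` and `padj`, the cut-offs of 1 and 0.05, and the toy values are all assumptions for illustration, not values taken from the course data. Matplotlib is imported inside the plotting function so that the preparation step can be used on its own.

```python
import numpy as np
import pandas as pd


def prepare_volcano(df, lfc_cut=1.0, p_cut=0.05):
    """Add -log10(padj) and an up/down/not significant label per gene.

    Assumes columns named 'log2fc' and 'padj' (hypothetical names).
    """
    out = df.copy()
    out["neg_log10_p"] = -np.log10(out["padj"])
    out["status"] = "not significant"
    significant = out["padj"] < p_cut
    out.loc[significant & (out["log2fc"] >= lfc_cut), "status"] = "up"
    out.loc[significant & (out["log2fc"] <= -lfc_cut), "status"] = "down"
    return out


def plot_volcano(prepared, path):
    """Draw the scatter plot and save it to `path` (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to file without a display
    import matplotlib.pyplot as plt

    colors = {"up": "firebrick", "down": "royalblue", "not significant": "grey"}
    fig, ax = plt.subplots(figsize=(4, 4))
    for status, group in prepared.groupby("status"):
        ax.scatter(group["log2fc"], group["neg_log10_p"],
                   color=colors[status], label=status, s=12)
    ax.set_xlabel("log2 fold change")
    ax.set_ylabel("-log10(adjusted P)")
    ax.legend()
    fig.savefig(path, dpi=300)
    plt.close(fig)


# Toy values, invented for this example
toy = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4"],
    "log2fc": [2.5, 0.2, -1.8, 0.5],
    "padj": [0.001, 0.5, 0.01, 0.04],
})
prepared = prepare_volcano(toy)
print(prepared[["gene_id", "status"]])
```

Calling `plot_volcano(prepared, "volcano.png")` would then write the figure to a file.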
  • 91. 1. On Galaxy, go to “Shared Data” > “History” >“2021-08-15_DE_analysis” 2. Download the featureCounts result files (Histories 27, 29, 31, 33) • Save as “20210819-SRR1210078_WT_rep1-featureCounts.txt” “20210819-SRR1210079_WT_rep2-featureCounts.txt” “20210819-SRR1210084_C24_rep1-featureCounts.txt” “20210819-SRR1210085_C24_rep2-featureCounts.txt” 3. Upload to the Jupyter notebook 91 Download data for input
  • 92. ggplot2 basics (hands-on) Basics *Plot element layers & settings can be added by chaining function() with + Plot Anatomy: Facets / Panels, Labels, Breaks, *Layers of Geometric Objects, *Legend
  • 93. Further ggplot2 examples (Extended) From the R Graph Gallery: • Histogram: https://www.r-graph-gallery.com/220-basic-ggplot2-histogram.html • Scatterplot: https://www.r-graph-gallery.com/274-map-a-variable-to-ggplot2-scatterplot.html • Boxplot: https://www.r-graph-gallery.com/boxplot.html
  • 94. Python visualization basics (Extended) • Matplotlib gallery • https://matplotlib.org/stable/gallery/index.html • Seaborn gallery • https://seaborn.pydata.org/examples/index.html • Python graph gallery • https://www.python-graph-gallery.com/
  • 96. Is it worth a figure? • Key result? • Also depends on journal and current trend Vale, 2015; Accelerating scientific publication in biology
  • 97. Data visualization is where art meets science 97 BIG DATA V.01 (https://in.pinterest.com/pin/817614507335199087/)
  • 98. Adjusting a figure The most basic plot output before aesthetic tuning is NOT a done deal
  • 100. Plot as visual aid: Perception • Jakob’s Law in UX web design: “People expect your website to work the same way as the other websites they're using” • sO thIs iS od d • Highlight what is important (you ARE distracted) • Same color for the same variable What Are Data Visualization Style Guidelines? by Amy Cesal https://medium.com/nightingale/style-guidelines-92ebe166addc
  • 101. Reader-friendliness • Aim: Be intuitive rather than fancy • Proper contrast (not like this) • Be color weakness-friendly (this may not be) • Proper contrast in hues • Palette generator: https://color.adobe.com/create/color-accessibility • Proof tool on Adobe Illustrator More resources: • https://www.color-blindness.com/coblis-color-blindness-simulator/ • https://helpx.adobe.com/creative-cloud/adobe-color-accessibility-tools.html • https://creativepro.com/viewing-color-blind-previews-of-pages/
  • 102. Proper color contrast for categorical data https://thenode.biologists.com/data-visualization-with-flying-colors/research/ http://bconnelly.net/posts/creating_colorblind-friendly_figures/
  • 103. Smooth colormap for quantitative data • The “scalebar” for heatmap • Visual “uniformity” preferred • Visually “amplified” expected value changes • Smooth visual gradient
  • 104. Post-processing • Export plot output from code in vector graphics, e.g. SVG, PDF • Editability: you can open with Illustrator, Inkscape, etc • Full resolution: export to at least 300 dpi for rasterized graphics, e.g. PNG • Only make necessary & scientifically allowed (i.e. not misleading) changes • E.g. Edit font size, add * to indicate statistical significance where necessary • Always aim to be clean & intuitive “Make up till you cannot be recognized is disguise” – Dayo Wong 104
  • 106. Choose your weapon • Excel: know formulas & functions well for simple handling • Bash: quick string / file view & manipulation; scripting, e.g. when looping through hundreds of files or tonnes of lines • Python / R: More advanced data analytics & visualization • At least be good enough in one that you can do everything with it • Use what you are confident with
  • 107. Google is your friend!! • StackOverflow (various StackExchange sites) • Quora/Reddit/CSDN/Qiita (whatever) • 🔍 {tool I use} {my purpose} • 🔍 {the error message} • 🔍 {tool idk how to use} manual • 🔍 {tool with hard-to-read official manual} tutorial • 🔍 {tool} cheatsheet
  • 108. Books on Data & Visualization Now Matplotlib 3.4.2 108
  • 109. More free online resources • RegExr: Learn, Build, & Test RegEx • https://regexr.com/ • regex101: build, test, and debug regex • https://regex101.com/ • R cheatsheet • https://www.rstudio.com/resources/cheatsheets/ • R Tidyverse tutorial by DataCamp • https://www.datacamp.com/community/tutorials/tidyverse-tutorial-r • Data Visualisation: the Good, the Bad and the Ugly (1) by Mina Pêcheux • https://minapecheux.com/website/2018/07/17/data-visualization-the-good-the-bad-and-the-ugly-1/
  • 110. More free online resources (Extended) • BASH programming • https://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html • Advanced Bash-Scripting Guide by Mendel Cooper • https://tldp.org/LDP/abs/html/ • Bash scripting cheatsheet • https://devhints.io/bash • Learnpython.org • https://www.learnpython.org/

Editor's Notes

  • #1: Hello, I’m Claire. Today I would like to share with you on data processing and visualization methods. You guys have done a great job learning the workflow of a basic RNA seq analysis with the guidance from my colleagues.
  • #2: Now that you are better equipped, perhaps you are also more eager to learn further. As the last session of this course on biocomputing, that is, the use of computers to aid biological studies, I would like to invite you to go beyond a specific workflow and look from a wider perspective. So if you are interested in bioinformatics, you can acquire the mindset and some basic skills to kickstart your journey. In this session, I will talk about how we can and should view bioinformatics as a field of data science, leading into the importance of data processing & data visualization. I will also teach you some basic data processing & visualization skills, and discuss some general principles & tips for data visualization
  • #3: Expectation
  • #4: As mentioned in earlier lessons, there are different “tastes” of bioinformatics.
      • The engineer that innovates tools, including algorithms and even hardware to compute
      • The
      • Or maybe you just need bioinformatics data (like PCR as Dr Chan said)
      • You focus on the discovery
      • Surely
      • Just like you won't manually cycle your samples between three water baths to do PCR today, being proficient using the
      • But you need not be the one
      • Yet, one thing
      • Ease your life, A LOT
      • Learning programming
  • #18: https://github.com/rstudio/cheatsheets/blob/master/regex.pdf
  • #21: DESeq2 output columns: GeneID, Base mean, log2(FC), StdErr, Wald-Stats, P-value, P-adj
      cat Drosophila_annotation.gff | awk '$3=="gene"' | wc -l
      => 17807
      cat Drosophila_annotation.gff | awk '$3=="gene"' | grep "CR" | wc -l
      => 2847
      cat Drosophila_annotation.gff | awk '$3=="gene"' | grep -e 'gene_name \SCR' | wc -l
      => 273
      cat Drosophila_annotation.gff | awk '$3=="gene"' | cut -f9 | cut -d';' -f5 | wc -l
      => 17807
      zcat gencode.v38.annotation.gtf.gz | awk '$3=="gene"' | grep -P "gene_name \"OR\d+F"
      chr11 HAVANA gene 86649 87586 . - . gene_id "ENSG00000224777.3"; gene_type "unprocessed_pseudogene"; gene_name "OR4F2P"; level 2; hgnc_id "HGNC:8299"; havana_gene "OTTHUMG00000154192.2";
      chr11 HAVANA gene 4709569 4712421 . + . gene_id "ENSG00000272634.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "OR51F5P"; level 2; hgnc_id "HGNC:31283"; havana_gene "OTTHUMG00000186286.2";
      chr11 HAVANA gene 4736178 4737078 . - . gene_id "ENSG00000272559.1"; gene_type "unprocessed_pseudogene"; gene_name "OR51F3P"; level 2; hgnc_id "HGNC:31281"; havana_gene "OTTHUMG00000186288.2";
      chr11 HAVANA gene 4752046 4752944 . - . gene_id "ENSG00000273051.1"; gene_type "unprocessed_pseudogene"; gene_name "OR51F4P"; level 2; hgnc_id "HGNC:31282"; havana_gene "OTTHUMG00000186289.1";
      chr11 HAVANA gene 4768979 4769917 . - . gene_id "ENSG00000280021.2"; gene_type "protein_coding"; gene_name "OR51F1"; level 2; hgnc_id "HGNC:15196"; havana_gene "OTTHUMG00000066503.3";
      chr11 HAVANA gene 4821321 4822456 . + . gene_id "ENSG00000176925.8"; gene_type "protein_coding"; gene_name "OR51F2"; level 2; hgnc_id "HGNC:15197"; havana_gene "OTTHUMG00000066508.3";
      chr11 HAVANA gene 55993681 55994625 . - . gene_id "ENSG00000149133.1"; gene_type "protein_coding"; gene_name "OR5F1"; level 2; hgnc_id "HGNC:8343"; havana_gene "OTTHUMG00000166825.2";
      chr11 HAVANA gene 56015017 56015957 . - . gene_id "ENSG00000182365.4"; gene_type "unprocessed_pseudogene"; gene_name "OR5F2P"; level 2; hgnc_id "HGNC:15286"; havana_gene "OTTHUMG00000166837.1";
      chr11 HAVANA gene 124207183 124208112 . + . gene_id "ENSG00000239426.4"; gene_type "unprocessed_pseudogene"; gene_name "OR8F1P"; level 2; hgnc_id "HGNC:14691"; havana_gene "OTTHUMG00000154391.1";
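The filtering commands in note #21 can be tried without downloading a full annotation. The following is a minimal sketch, assuming a toy tab-separated file built from three of the gene records above (trimmed to three attributes each) plus one made-up transcript row; the variable names `gtf`, `gene_count`, `coding_count` and `gene_names` are hypothetical and not from the slides.

```shell
#!/usr/bin/env bash
# Toy GTF-like input (assumption: trimmed copies of records from note #21,
# plus one invented "transcript" row so that the filter has something to drop).
gtf=$(mktemp)
printf 'chr11\tHAVANA\tgene\t86649\t87586\t.\t-\t.\tgene_id "ENSG00000224777.3"; gene_type "unprocessed_pseudogene"; gene_name "OR4F2P";\n' > "$gtf"
printf 'chr11\tHAVANA\ttranscript\t86649\t87586\t.\t-\t.\tgene_id "ENSG00000224777.3"; gene_type "unprocessed_pseudogene"; gene_name "OR4F2P";\n' >> "$gtf"
printf 'chr11\tHAVANA\tgene\t4768979\t4769917\t.\t-\t.\tgene_id "ENSG00000280021.2"; gene_type "protein_coding"; gene_name "OR51F1";\n' >> "$gtf"
printf 'chr11\tHAVANA\tgene\t55993681\t55994625\t.\t-\t.\tgene_id "ENSG00000149133.1"; gene_type "protein_coding"; gene_name "OR5F1";\n' >> "$gtf"

# 1. Count gene-level records: keep rows whose 3rd column is exactly "gene".
gene_count=$(awk -F'\t' '$3 == "gene"' "$gtf" | wc -l | tr -d ' ')

# 2. Count protein-coding genes by also matching the attribute column (column 9).
coding_count=$(awk -F'\t' '$3 == "gene" && $9 ~ /gene_type "protein_coding"/' "$gtf" | wc -l | tr -d ' ')

# 3. Extract gene names: print column 9, then keep the quoted value after gene_name.
gene_names=$(awk -F'\t' '$3 == "gene" {print $9}' "$gtf" | sed -E 's/.*gene_name "([^"]+)".*/\1/')

echo "genes: $gene_count"
echo "protein-coding genes: $coding_count"
echo "$gene_names"

rm -f "$gtf"
```

Matching on the attribute name (`gene_name "..."`) rather than on a fixed position such as `cut -d';' -f5` is a design choice here: the order and number of attributes can differ between annotation sources, so a positional cut may silently pick the wrong field.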
  • #23: https://www.laptopmag.com/articles/use-text-columns-excel
  • #30: That’s why I also showed you the usage of the Excel Data tab tools. They are not perfect, but they do help.
  • #31: DESeq2 output columns: GeneID, Base mean, log2(FC), StdErr, Wald-Stats, P-value, P-adj
      zcat gencode.v38.annotation.gtf.gz | awk '$3=="gene"' | grep -P "gene_name \"OR\d+F"
      chr11 HAVANA gene 86649 87586 . - . gene_id "ENSG00000224777.3"; gene_type "unprocessed_pseudogene"; gene_name "OR4F2P"; level 2; hgnc_id "HGNC:8299"; havana_gene "OTTHUMG00000154192.2";
      chr11 HAVANA gene 4709569 4712421 . + . gene_id "ENSG00000272634.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "OR51F5P"; level 2; hgnc_id "HGNC:31283"; havana_gene "OTTHUMG00000186286.2";
      chr11 HAVANA gene 4736178 4737078 . - . gene_id "ENSG00000272559.1"; gene_type "unprocessed_pseudogene"; gene_name "OR51F3P"; level 2; hgnc_id "HGNC:31281"; havana_gene "OTTHUMG00000186288.2";
      chr11 HAVANA gene 4752046 4752944 . - . gene_id "ENSG00000273051.1"; gene_type "unprocessed_pseudogene"; gene_name "OR51F4P"; level 2; hgnc_id "HGNC:31282"; havana_gene "OTTHUMG00000186289.1";
      chr11 HAVANA gene 4768979 4769917 . - . gene_id "ENSG00000280021.2"; gene_type "protein_coding"; gene_name "OR51F1"; level 2; hgnc_id "HGNC:15196"; havana_gene "OTTHUMG00000066503.3";
      chr11 HAVANA gene 4821321 4822456 . + . gene_id "ENSG00000176925.8"; gene_type "protein_coding"; gene_name "OR51F2"; level 2; hgnc_id "HGNC:15197"; havana_gene "OTTHUMG00000066508.3";
      chr11 HAVANA gene 55993681 55994625 . - . gene_id "ENSG00000149133.1"; gene_type "protein_coding"; gene_name "OR5F1"; level 2; hgnc_id "HGNC:8343"; havana_gene "OTTHUMG00000166825.2";
      chr11 HAVANA gene 56015017 56015957 . - . gene_id "ENSG00000182365.4"; gene_type "unprocessed_pseudogene"; gene_name "OR5F2P"; level 2; hgnc_id "HGNC:15286"; havana_gene "OTTHUMG00000166837.1";
      chr11 HAVANA gene 124207183 124208112 . + . gene_id "ENSG00000239426.4"; gene_type "unprocessed_pseudogene"; gene_name "OR8F1P"; level 2; hgnc_id "HGNC:14691"; havana_gene "OTTHUMG00000154391.1";
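The pattern in note #31 uses `grep -P` (Perl-compatible regex), which some systems do not provide, for example the default grep on macOS. As a hedged alternative, the sketch below writes the same pattern in extended regex syntax with `grep -E`, where `[0-9]+` stands in for `\d+`. The input strings and the variable names `attrs`, `matches` and `match_count` are illustrative assumptions; the two non-matching names are included only as negative examples.

```shell
#!/usr/bin/env bash
# Hypothetical attribute strings: three names taken from the note above,
# plus two names that should NOT match the pattern.
attrs='gene_name "OR4F2P";
gene_name "OR51F1";
gene_name "OR8F1P";
gene_name "ORC1";
gene_name "OR4A5";'

# "OR", then one or more digits, then "F".
# [0-9]+ is the extended-regex spelling of the Perl-style \d+.
matches=$(printf '%s\n' "$attrs" | grep -E 'gene_name "OR[0-9]+F')
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

echo "$matches"
echo "matched: $match_count"
```

Single quotes around the pattern also avoid the backslash-escaped double quote needed in the original command.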
  • #35: Fly through
  • #36: By the way, resource considerations are also key: are there enough resources to start running? Is the completion time feasible? What work can be started in parallel?
  • #37: The components, the statements => the arguments
  • #40: http://uc-r.github.io/integer_double/
  • #48: https://www.bd.gov.hk/en/resources/codes-and-references/modular-integrated-construction/index.html
  • #49: https://www.bd.gov.hk/en/resources/codes-and-references/modular-integrated-construction/index.html
  • #50: https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html
  • #51: R packages designed for data science built with unified grammar and data structures
  • #52: R packages designed for data science built with unified grammar and data structures
  • #53: https://www.bd.gov.hk/en/resources/codes-and-references/modular-integrated-construction/index.html
  • #56: Read more: https://hub.docker.com/r/jupyter/datascience-notebook/ try.jupyter.org
  • #62: https://idc9.github.io/stor390/notes/dplyr/dplyr.html
  • #65: https://idc9.github.io/stor390/notes/dplyr/dplyr.html
  • #69: https://dplyr.tidyverse.org/
  • #71: https://dplyr.tidyverse.org/
  • #74: https://dplyr.tidyverse.org/ https://dplyr.tidyverse.org/reference/filter.html
  • #76: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf
  • #77: https://idc9.github.io/stor390/notes/dplyr/dplyr.html
  • #81: https://www.biostars.org/p/152291/
  • #84: https://avikarn.com/2020-07-02-RNAseq_DeSeq2/
  • #85: https://avikarn.com/2020-07-02-RNAseq_DeSeq2/
      library(gplots)  # provides heatmap.2()
      # defining a clustering function
      hclust_fun = function(x) hclust(x, method="complete")
      # defining a distance calculation function
      dist_fun = function(x) dist(x, method="euclidean")
      # vals (numeric expression values) and colors (a colour palette) are assumed to be defined beforehand
      heatmap.2(as.matrix(vals), scale="row", trace="none", dendrogram="both",
                Rowv=TRUE, Colv=TRUE, distfun=dist_fun, hclustfun=hclust_fun, col=colors)
  • #87:
      • Import workflow: https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/workflows/rna-seq-viz-with-volcanoplot.ga
      • Share & publish workflow on local Galaxy: https://137.189.51.116:2220/workflow/sharing?id=20dece60f529586f
      • https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/tutorial.html
      • https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot-r/tutorial.html
  • #89: https://www.researchgate.net/publication/272083683_Genome-wide_association_study_of_clinically_defined_gout_identifies_multiple_risk_loci_and_its_association_with_clinical_subtypes/figures?lo=1
  • #91: https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once
  • #92: heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
  • #93: heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10)) Maybe homework
  • #94: heatmap(nba_matrix, Rowv=NA, Colv=NA, col = cm.colors(256), scale="column", margins=c(5,10))
  • #99: Common sense
  • #101: https://thenode.biologists.com/data-visualization-with-flying-colors/research/
  • #102: distinct