Data processing and visualization basics

Data processing & visualization methods
Claire Chung
2021/08/17
LSCI6101 – Techniques in Biocomputing

Overview
• Bioinformatics as data science
• Why data processing & data visualization are important
• Basic data processing skills
• Regular expression
• Basic programming concepts (Extended)
• Processing tabular data using R (Extended)
• Basic data visualization in bioinformatics
• Common plot types in bioinformatics
• Common data visualization packages
• General principles & tips for data visualization

Aims
• The first taste: Learn through hands-on examples
• Get code snippets to play with and modify for future use
• Provide *keywords* and key resources to kick start onward learning

Different “tastes” of bioinformaticians
ENGINEER DATA SCIENTIST “WET LAB”-FOCUSED
Discovery-oriented
Feasibility & elegance of methodology

Work of a bioinformatician
• Understanding of bioinformatics study design & data, from data generation experiments to the resulting data
• Selection & usage of proper tools to handle data
• May include software installation, which can be non-trivial
• Communicating progress and presenting data analysis results

Bioinformatics as Data Science
The data cycle: Ask > Get > Explore > Analyze > Communicate
Ask an interesting question
• Understand the biology behind
• Design the study & experiments needed
Get the data
• Generate data via experiments
and/or
• Get relevant data from public db
Explore the data
• Know the metadata (origin? Type? Format specification?)
• Perform quality check
• Perform data cleaning
• Transform data where necessary
• Understand the data distribution preliminarily <= aided by data viz
Analyze the data
• Choose the right tools
• Understand and interpret the results
• Often needs data transformation
Communicate the results
• Effective communication requires intuitive graphics
• Choose the right plot type
• Tune the aesthetics
• Add proper legend
• Clear writing
Data management & operation processes are non-trivial too

Application of skills to learn in this session
Data processing
• Data cleaning
- Check the number of data entries
- Check if the data contain irrelevant entries, missing values, unsupported characters, extra spaces
- Fix or remove erratic data
- Change file formats to fit different tools
• Data filtering
- Filter data for downstream analysis, e.g. filter assembled transcripts by class code
Data visualization
• Exploratory Data Analysis (EDA)
- Check if the data distribution looks reasonable
- Look for trends and/or outliers preliminarily
• Result presentation

Effective use of Excel data tools is good for simple, quick handling
Data tab
Sorting & Filtering
https://www.exceltip.com/basic-excel/data-tab.html
https://excelwithbusiness.com/blog/15-excel-data-analysis-functions-need/
Formula bar

Problem with “just” using Excel
• Limited rows
• Slow in opening large files
• Cannot “pipe” input & output from and to other processing steps
• Often we just need a one-line command to finish
• E.g. `cat XXX.gtf | awk -F'\t' '$3=="gene"' | cut -f9 | sed -E 's/.*gene_id "([^"]+)".*/\1/' | sort -u`
extracts unique gene IDs from the entire GTF annotation
• In Excel, you may need many clicks, “Save As” and re-opening for the same action
• Less versatile processing functions
• May inadvertently have data changed automatically by Excel formatting
• e.g. “gene symbols such as SEPT2 (Septin 2) and
MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1,
E3 Ubiquitin Protein Ligase] are converted
by default to ‘2-Sep’ and ‘1-Mar’”
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
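For home practice, the one-line pipeline above can be sketched in Python as well. This is only an illustration: the three GTF-style lines are made up, and `unique_gene_ids` is a hypothetical helper name, not part of any package.

```python
import re

# Made-up GTF-style rows (tab-separated, 9 columns); not real annotation data.
gtf_lines = [
    'chr1\tdemo\tgene\t100\t900\t.\t+\t.\tgene_id "G0001"; gene_name "abc";',
    'chr1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "G0001"; exon_number "1";',
    'chr2\tdemo\tgene\t50\t400\t.\t-\t.\tgene_id "G0002"; gene_name "xyz";',
]


def unique_gene_ids(lines):
    """Return the sorted unique gene_id values of rows whose 3rd column is 'gene'."""
    ids = set()
    for line in lines:
        if line.startswith("#"):          # skip comment / header lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue                      # same role as awk '$3=="gene"'
        match = re.search(r'gene_id "([^"]+)"', cols[8])
        if match:
            ids.add(match.group(1))       # same role as the sed capture group
    return sorted(ids)                    # same role as sort -u


print(unique_gene_ids(gtf_lines))
```

With a real file, the list could be replaced by the lines of an opened file handle.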

Common programming/scripting languages
in bioinformatics
Programming / scripting languages
• Bash: built-in with Unix-like system (MacOS, Linux)
• Python
• R
• Perl: being phased out; still commonly seen in older packages
Other tools
• Microsoft Excel or equivalent spreadsheet software
• More advanced text editors

Data handling tip 1:
Regular expression

Regular expression: basics
• Pattern of strings (a series of characters / digits / symbols /
whitespace)
• Often abbreviated as “regex” or “regexp”
• Useful in searching and/or replacing string (e.g. changing ID formats)
• Available in most if not all programming languages, as well as more
advanced text editors (i.e. most other than Windows notepad)
• Different software may differ slightly in syntax, but mostly similar

Case study: why we need scripting / regular
expression commands?
• If I would like to get only the
DNA sequence into one line?
• What if I have >10k lines?

Example solution
1. Open “Find & Replace function”
• Windows: Ctrl + H
• MacOS: Command ⌘ + Option ⌥ + F
2. Select “Case sensitive”
3. Type “[A-Z]{3}\s” in the “Find” blank
4. Select “Find All”
5. Copy and paste selection to a new file

Example solution
6. Type “\s+\n” to select all trailing spaces and the line break on each line
7. Select “Replace All”
DONE!!
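The same find-and-join idea can be tried in Python. A minimal sketch, assuming a made-up input in which uppercase codons are each followed by a space and the other lines should be dropped; the real file shown in class may look different.

```python
import re

# Hypothetical input: codon lines with trailing spaces, mixed with lines to drop.
text = "ATG GCC AAA \nMet Ala Lys\nTTT GGG TAA \nPhe Gly End\n"

# Python regex is case sensitive by default, like the "Case sensitive" option.
# [A-Z]{3}\s = exactly three uppercase letters followed by one whitespace character.
codons = re.findall(r"[A-Z]{3}\s", text)

# Drop the whitespace around each hit and join everything into one line.
one_line = "".join(codon.strip() for codon in codons)
print(one_line)
```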

Regular expression (example from Python)
• Digit (0-9): \d
• Non-digit: \D
• Whitespace (space, tab): \s
• Non-whitespace: \S
• Line break: \n
https://docs.python.org/3/library/re.html
Check the exact syntax from the
documentation of the tool you use.
For instance, * in some tools only means
repeating the previous item, e.g. \d*
means zero or more digits, instead of any
number of any characters following a digit
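A short sketch of the character classes above using Python's `re` module. The sample string is made up for illustration.

```python
import re

sample = "gene42\tchr2L 1050\n"   # made-up text containing a tab, a space and a line break

digit_runs = re.findall(r"\d+", sample)     # runs of digits
words = re.findall(r"\S+", sample)          # runs of non-whitespace characters
spaces = re.findall(r"\s", sample)          # each whitespace character
digits_only = re.sub(r"\D", "", sample)     # delete every non-digit

# \d* means "zero or more digits", so it also matches an empty string.
empty_ok = re.fullmatch(r"\d*", "") is not None
year_ok = re.fullmatch(r"\d*", "2021") is not None

print(digit_runs, words, spaces, digits_only, empty_ok, year_ok)
```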

Regular expression (example from Python)
• Start of line: ^ (when placed at the start of a pattern)
• End of line: $ (when placed at the end of a pattern)
• Present 0 or 1 time: ?
• Present 1 or more times: +
• Repeat n times: {n}
• Present 0 or more times: *
• Any single character except a line break (wildcard): .
https://docs.python.org/3/library/re.html
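The anchors and repeat symbols can be checked the same way. A small sketch with made-up strings; note that in Python `*` repeats the previous item, while `.` is the any-character wildcard.

```python
import re

names = ["CR12345", "CG12345", "xCR12345", "CR123456"]

# ^ and $ pin the pattern to the whole string; {5} asks for exactly five digits.
exact = [n for n in names if re.search(r"^CR\d{5}$", n)]

# ? makes the previous item optional (0 or 1 time).
optional = re.findall(r"colou?r", "color colour")

# + needs at least one repeat; * also accepts zero repeats.
plus = re.findall(r"AT+G", "AG ATG ATTTG")
star = re.findall(r"AT*G", "AG ATG ATTTG")

# . stands for any single character except a line break.
dot = re.findall(r"C.1", "CR1 CG1 C1")

print(exact, optional, plus, star, dot)
```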

Common string / file manipulation operations on
MacOS & Linux
Function: Bash command
• Row counting: wc -l
• Sorting: sort
• Selection / Filtering by row: grep, awk, sed
• Replacement: awk, sed
• Column selection: cut
*slight syntax differences between macOS and Linux sometimes,
e.g. grep -E vs grep -P, for selecting by regex patterns
Since not everyone has access to Linux servers or uses
macOS/Linux computers yet, we will get hands-on with some more
advanced GUI text editors so you can process small datasets
on your own computers too.
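If you have no Bash at hand, the same operations can be sketched in plain Python. The rows below are made up, and the Bash command named in each comment is only the rough equivalent.

```python
import re

# Made-up tab-separated rows standing in for the lines of a small file.
rows = [
    "geneB\t20\texon",
    "geneA\t10\tgene",
    "geneC\t30\tgene",
]

n_rows = len(rows)                                            # wc -l
sorted_rows = sorted(rows)                                    # sort
gene_rows = [r for r in rows if re.search(r"\tgene$", r)]     # grep / awk row filter
first_column = [r.split("\t")[0] for r in rows]               # cut -f1
renamed = [re.sub(r"^gene", "g_", r) for r in rows]           # sed 's/^gene/g_/'

print(n_rows, sorted_rows[0], gene_rows, first_column, renamed[0])
```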

More advanced text editors
• Notepad++
• https://notepad-plus-plus.org/downloads/v8.1.3/
• Entirely free; open source
• Sublime Text (for demo in class)
• https://www.sublimetext.com/download
• More functions; would prompt for license purchase
• Code-ready: Syntax highlighting
• Data-ready: can open larger text files quicker
by loading small chunks of the file one at a time

Hands-on practice
• Download Sublime Text
• On Galaxy, go to Shared data > History > 2021-08-15_DE_analysis
• Download history 1 Drosophila annotation.
• Save as Drosophila_annotation.gff
Task:
• Extract all “gene” features (not CDS / exon / 5’ or 3’ UTR / start or end
codon, etc.), with “gene_name”s starting with “CR” and followed by a 5-
digit ID
• Flybase: CG for protein-coding genes, CR for non-protein-coding genes

Hands-on practice
Method 1: Excel
1. Select the 9th column of data
2. Select the “Data” tool tab
3. Select “Text to Columns”
4. Select “Delimited”
5. Select “semicolon”
6. Select ”Standard”
7. Click “Finish”

Gene features filtered
Filter for values equal to “gene”

Filter for values that contain ‘gene_name “CR’
Oh no… the Excel filter is not case sensitive
And am I going to add 10 more rules to specify the digits?

More caveats with using native Excel
functions for filtering
• Slow with many steps
• Not specific enough
• What if we have a file with a non-uniform number of items in the attribute column of different rows? (It can happen)
• First check that the number of rows containing the desired attribute, e.g. “gene_name”, is the same as the total number of rows
• And in the last example, could there also be genes like CRXXX1?
• Not in Drosophila. How do I know? Surely, I didn’t eyeball them…

When you master regex…
• Open the “Drosophila_annotation.gff” file again in Sublime Text
• Open the “Find” function with Ctrl + F (Windows) / Command ⌘ + F (Mac)
• Turn on “Regular Expression” and “Case sensitive” modes
• Input ^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"
• Click “Find All”
• Copy & Paste to a new file
• DONE!!!
You can even speed things up by using hotkeys rather than clicking

The regex pattern explained
• ^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"
• Starting ^: a string that starts with
• (): a capture group of
• []: allowed symbols contained within
• ^ within []: not
• \t: tab
• +: present one or more times
• {n}: repeated n times
• \n: line break
• \d: digit
• In words: a line that starts with two repeats of a group of non-tab characters followed by a tab, followed by the string “gene” and a tab, followed by 5 more repeats of that group, then some non-line-break (i.e. basically any) characters, finally followed by the string gene_name "CR\d+"
• i.e. get tab-separated rows that have the string “gene” in the 3rd column and, in the 9th column, contain the string gene_name "CRxxx", where xxx is one or more digits
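The same pattern also works in Python, which is handy for checking it on a few lines before running it on the whole file. A sketch only: the three rows and their IDs are made up, not taken from the Drosophila annotation.

```python
import re

# The pattern from the slide, with the backslashes written out.
PATTERN = re.compile(r'^([^\t]+\t){2}gene\t([^\t]+\t){5}[^\n]+gene_name "CR\d+"')

# Made-up GFF-style rows (9 tab-separated columns) for illustration.
gff_lines = [
    '2L\tFlyBase\tgene\t100\t900\t.\t+\t.\tgene_id "FBgn0000001"; gene_name "CR11111";',
    '2L\tFlyBase\texon\t100\t300\t.\t+\t.\tgene_id "FBgn0000001"; gene_name "CR11111";',
    '2L\tFlyBase\tgene\t1500\t2500\t.\t-\t.\tgene_id "FBgn0000002"; gene_name "CG22222";',
]

# Keep a row only if column 3 is "gene" and column 9 has a CR gene_name.
hits = [line for line in gff_lines if PATTERN.search(line)]

print(len(hits))
for line in hits:
    print(line)
```

Only the first row should pass: the second is an exon and the third has a CG name.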

Remarks before you feel like reading spells….
• This is not a contest. Accuracy is always the most important
• Surely it takes time to practice and master
• Searching for reference is just normal and often necessary even when we are
more experienced
• Just use the method you are most confident and comfortable with
• Before getting familiar with regex, just use any method to filter down to the
closest criteria to your target before eyeballing, and you already saved lots of
time
• But when you master the skill, it will save you tonnes of time and provide a
systematic way to reduce human error

Homework (Extended)
• If you are using Mac or WSL on Windows, you may:
• Open the Terminal
• Navigate to the directory where you placed the annotation file using the command `cd`
• Type `cat Drosophila_annotation.gff | awk '$3=="gene"' | grep -E 'gene_name "CR[0-9]+"'` and get the results

Homework
1. Download GENCODE human genome annotation
• http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
• Release 38 (GRCh38.p13)
• Comprehensive gene annotation
• GTF format
2. In your own way, extract rows that
• are “genes”
• from “chrX”
• have “level 1” transcript support level

Homework
3. Can you find out the number of genes with name starting with “OR_F”, where _
denotes a number of one or more digit(s)?
• Olfactory Receptor family ____ subfamily F member
4. Check the GFF/GTF File Format specification
• https://asia.ensembl.org/info/website/upload/gff.html
• Find what information can be found in each row
5. Check the meaning of “transcript support level”
• https://www.gencodegenes.org/pages/faq.html
6. (Extended) You may also download your transcript assembly history and try filtering
by different “class_code”
• Real question by your classmate and real-life use case!

Basic programming logic

Why might we want to do some coding?
• Involves handling high-dimensional data (i.e. a whole lot of features)
• Each file can be large (thousands to billions of rows)
• Often need to operate on a large collection of large files
• Auto >>> manual
• Not all useful methods have graphical user interface (GUI) versions
• Input data formats may not fit
• The output may not match your goals
• e.g. overlaying plots, highlighting targets
Manual: Walk to each haystack,
look with bare eyes for needles

Coding/Programming is basically…. LOGIC
• Variables (to store values)
• Basic data structures (how we organize the data):
• integer, float/decimal, string, boolean, array/list
• 1D: Array / list
• >=2D: tables / matrix
• Operators
• Arithmetic: add +, minus -, multiply *, divide /
• Logical: AND &, OR |
• Comparison: greater than >, smaller than <, equal ==
• Flow control: Conditions (IF-THEN-ELSE) & Loops (FOR/WHILE)

Variable
• a = 1 <= assign a value of 1 to the name “a”
• Like how your name represents your being and can be used for calling
• Here, “a” is the variable, and 1 is its value
Assigning a variable
• Bash: a=1 (do not leave space)
• Python: a = 1 (can omit the space, just not tidy or conventional)
• R: a <- 1 (can also use = like Python, just not R-style)
Calling a variable
• Bash: $a
• Python / R: a

String
• A series of characters / digits / symbols / whitespace
• e.g. ‘abc’, ‘ATGCACGAG’, ‘12345’, ‘Hello, World’, ‘dawlekjr;alwejr’, ‘ ‘
• Can concatenate, search for part of the string (“substring”) by pattern, etc.
• Usually denoted by being put into single or double quotes
• Some programming languages have a separate class of data type for single
characters
• Bash: no data type, but you can specify by adding quotes
• Python: str; ‘123’ or str(123) creates a string of ‘123’
• R: character; ‘123’ or as.character(123) creates a string of ‘123’

Numeric, i.e. numbers
• Can perform arithmetic operations
• Integers
• Floats / decimal
• Bash: no data types
• Python: int, float
• R: integer, double
• Some programming languages have special data types for larger ranges
of integer, but it is out of scope here

Boolean
• True or False
• Bash: no boolean data type, but it does have boolean expressions (comparisons & conditions)
• Python: bool; True, False
• R: logical; TRUE, FALSE
https://www.fallacies.ca/ttable.htm

List or Array
• A collection of elements
• Some collection data types allow collection of elements of different
data types, e.g. [‘a’, 1234, True], some only allow a uniform data type
• For simplicity, we only consider lists/arrays of one data type, such as
what we will use for data visualization later
• Bash array: declare -a my_array; my_array=(1 2 3)
• Python list: my_list = [1, 2, 3]
• R: c(1, 2, 3)
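The data types from the last few slides, put together in one small Python sketch. All variable names and values here are made up for illustration.

```python
count = 3                      # int
ratio = 0.25                   # float
codons = "ATG" + "CAC"         # str; + concatenates two strings
as_text = str(123)             # number converted to the string '123'
is_long = len(codons) > 5      # bool from a comparison
values = [1, 2, 3]             # list
values.append(4)               # lists can grow

# Arithmetic works on numbers; logical operators combine booleans.
total = count + ratio
both = is_long and (count > 2)

print(type(count).__name__, type(ratio).__name__, type(codons).__name__,
      type(is_long).__name__, type(values).__name__)
print(total, both, as_text, values)
```

Note that Python spells the logical operators `and` / `or` for single True/False values; `&` and `|` are used element-wise on pandas columns.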

Operators
• Arithmetic: add +, minus -, multiply *, divide /
• Logical: AND &, OR |
• Comparison: greater than >, smaller than <, equal ==

If-then(-else)
• “If I am hungry after class, (then) I go to coffee corner for dinner”
• “else I go home directly”
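The sentence above written as a Python if-else. `plan_after_class` is a made-up function name used only for this sketch.

```python
def plan_after_class(hungry):
    """Return what to do after class, depending on a boolean condition."""
    if hungry:
        return "go to coffee corner for dinner"
    else:
        return "go home directly"


print(plan_after_class(True))
print(plan_after_class(False))
```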

For-loop
• “For every day in this week, I eat an apple”
• for day in week:
• eat(apple)
• Go through the elements named in the for-statement one by one and perform the action(s) for
each element
• Bash: for i in `seq 1 20`; do echo $i; done
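A runnable Python version of the two loops above, as a sketch. The list of days and the counter are made up; `range(1, 21)` plays the role of `seq 1 20`.

```python
week = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

apples_eaten = 0
for day in week:               # one pass per element of the list
    apples_eaten += 1          # "eat an apple"

numbers = []
for i in range(1, 21):         # 1 to 20; the end value 21 is excluded
    numbers.append(i)

print(apples_eaten, numbers[0], numbers[-1])
```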

While-loop
• “While I live, I breathe”
• Looping non-stop until the predicate condition becomes false
• Beware of infinite loop: Add condition check within the loop
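A small while-loop sketch in Python. The numbers are made up, and `max_batches` is an extra safety cap added here as one possible way to guard against an infinite loop.

```python
reads_left = 5
batches = 0
max_batches = 100              # safety cap (an assumption for this sketch)

while reads_left > 0:          # the condition is re-checked before every round
    reads_left -= 2            # process two reads per batch
    batches += 1
    if batches >= max_batches:
        break                  # stop even if the condition never turns false

print(batches, reads_left)
```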

Processing tabular data

The Python language
• One of the most used programming languages in the world
• Versatile usage

The R language
• Designed for statistical analysis
• Bioconductor
• A market of biological analysis-related packages
• Mostly published: proven
• Unified downloading method
Ready-made packages: no need to reinvent the wheel

https://www.bioconductor.org/install/
DESeq2 is in Bioconductor
http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Packages for data handling on R
• readr for data import
• dplyr for data processing
• ggplot2 for data visualization
Tidyverse: R packages designed for data science, built with a unified grammar and data structures

Packages for data handling on Python
Data processing & calculation
• pandas
• numpy
• Scikit-learn
• StatsModels
Data visualization
• Matplotlib
• Seaborn/Bokeh(/Plotly)
• Plotly not available in jupyter/datascience-notebook Docker image
• Not necessary until more advanced usage such as interactive dashboard construction
• Name mentioned for popularity

Why show me both R & Python?
• Learning the language(s) allows customized data analysis & visualization
workflows
• To be future-proof, (and as a Pythonista), I strongly encourage you to
learn Python
• To be competent in bioinformatics, you should have good command of R
to use Bioconductor packages when moving outside Galaxy
• So…
• We will only use R tidyverse for demo today
• Basic syntax of the Python equivalent is also provided alongside for home
practice

Docker
• Application container
• Cross-platform portability
• Reduces installation hassles
• Avoids dependency issues (missing or version clash)
https://www.docker.com/get-started > Download Docker Desktop
- “an open platform for developing, shipping, and running applications”

Jupyter notebook: interactive coding environment
• Formerly IPython: I for interactive
• Extended to support kernels of various programming languages
• Python, R, Julia, C++, Ruby, etc.
• Markdown for easy note jotting
• Portable & Runnable
• Commonly used for development and workflow sharing
• Installation: type in Terminal / PowerShell the following
docker pull jupyter/datascience-notebook

Finding your terminal / PowerShell
• MacOS / Linux users:
Find the app “Terminal” from LaunchPad
• Windows users:
Find “PowerShell” from start menu

Download data for input
1. On Galaxy, go to
“Shared Data” > “History” >“2021-08-15_DE_analysis”
2. Download the DESeq2 result file (History 35)
• Save as “20210819-DE-result-demo.txt”

Running data science Jupyter notebook on docker
• In Terminal / PowerShell, run:
docker run -p 8888:8888 jupyter/datascience-notebook
• On your internet browser, go to http://localhost:8888
• When prompted for a token, copy & paste the string behind “?token=” shown on the Terminal / PowerShell
• Different each time

Using installed package/library
Python
• import some_package
• import some_package as abbr
R
• library(some_package)

Import the tidyverse library (family)
Not errors: the more user-friendly dplyr::filter() masks stats::filter() when you call
filter(). You may still call the original by writing it in full as stats::filter(). The same applies to lag()

”Dataframe” – the data type for your table
• 2 axes: columns (vertical; variable) & rows (horizontal; observation)
• col_names / header: name/ID of columns
• row_names / index: name/ID of rows
https://subscription.packtpub.com/book/data/9781784393878/1/ch01lvl1sec03/dissecting-the-anatomy-of-a-dataframe

Importing data
pandas @Python
• import pandas as pd
• df = pd.read_csv('xxx.csv', header=None, comment='#')
• df = pd.read_csv('xxx.tsv', sep='\t', header=None, comment='#')
readr (tidyverse) @R
• library(tidyverse)
• df <- read_csv('xxx.csv', comment='#')
• df <- read_tsv('xxx.tsv', comment='#')

Read our data file
Important to specify the absence of header
row, else you will miss a row of data entry!
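For home practice, the header point can be tried in pandas without any file. A minimal sketch, assuming pandas is installed; the two data rows and the column names are made up and are not the real DESeq2 result.

```python
import io

import pandas as pd

# Made-up tab-separated text with one comment line and NO header row.
raw = "# demo rows, not real DESeq2 output\ngeneA\t2.5\t0.001\ngeneB\t-0.3\t0.4\n"

# header=None tells pandas that the first data line is data, not column names.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, comment="#")
df.columns = ["gene", "log2fc", "pval"]

# Without header=None, the first data row is swallowed as the header.
df_wrong = pd.read_csv(io.StringIO(raw), sep="\t", comment="#")

print(len(df), len(df_wrong))
print(df.head())
```

With a real file, `io.StringIO(raw)` would simply be replaced by the file name.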

Have a glimpse of the dataframe
pandas @Python
• df
• df.head()
• df.tail()
dplyr @R
• df
• head(df)
• tail(df)
head: first few rows
tail: last few rows

Renaming columns
pandas @Python
• Rename all:
df.columns = ['iamcol1', 'iamcol2', …]
R
• Base R way (rename all):
colnames(df) <- c('iamcol1', 'iamcol2', …)
• dplyr way (rename specific column(s)):
df <- rename(df, new_name = old_name)
The more advanced `rename_with` function,
which allows more complicated custom functions
and the use of regex, is left for exploration

!! Poor “machine readability” => might cause error
Better naming

Filtering rows
pandas @Python
• df[data_mask]
• E.g.
de_result[de_result['pval'] < 0.05]
dplyr @R
• filter(df, data_mask)
• E.g.
filter(de_result, pval < 0.05)
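A runnable pandas sketch of row filtering, assuming pandas is installed. The small table is made up; `de_result` and `pval` follow the names used above.

```python
import pandas as pd

# Made-up differential expression table for illustration.
de_result = pd.DataFrame(
    {
        "gene": ["geneA", "geneB", "geneC"],
        "log2fc": [2.5, -0.3, -1.8],
        "pval": [0.001, 0.4, 0.03],
    }
)

# The data mask is a column of True/False values, one per row.
mask = de_result["pval"] < 0.05
significant = de_result[mask]

# Conditions can be combined with & (AND) or | (OR); keep each one in brackets.
strong_down = de_result[(de_result["pval"] < 0.05) & (de_result["log2fc"] < -1)]

print(significant)
print(strong_down)
```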

Selecting columns
pandas @Python
• df[‘column_name’]
dplyr @R
• df$column_name
• Returns a collection
• df[‘column_name’]
• Returns a tibble dataframe
• df %>% select(A, B, E)
• Returns a tibble dataframe
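On the pandas side, the number of brackets decides what comes back. A short sketch with a made-up table, assuming pandas is installed.

```python
import pandas as pd

# Made-up table for illustration.
de_result = pd.DataFrame(
    {"gene": ["geneA", "geneB"], "log2fc": [2.5, -0.3], "pval": [0.001, 0.4]}
)

one_column = de_result["gene"]             # single brackets: a Series (1D)
sub_table = de_result[["gene", "pval"]]    # a list of names: a DataFrame (2D)

print(type(one_column).__name__, type(sub_table).__name__)
print(list(sub_table.columns))
```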

Chaining
pandas @Python
• by dots
• df[data_mask].func().func()
dplyr @R
• by %>%
• starwars %>% group_by(gender)
%>% filter(mass > mean(mass,
na.rm = TRUE))
Example from: https://dplyr.tidyverse.org/reference/filter.html
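Chaining by dots in pandas, as a sketch with a made-up table (assuming pandas is installed). Wrapping the chain in brackets lets each step sit on its own line.

```python
import pandas as pd

# Made-up table for illustration.
de_result = pd.DataFrame(
    {
        "gene": ["geneA", "geneB", "geneC"],
        "log2fc": [2.5, -0.3, -1.8],
        "pval": [0.001, 0.4, 0.03],
    }
)

# Filter, then sort, then keep the first row: each step works on the previous result.
top_hit = (
    de_result[de_result["pval"] < 0.05]
    .sort_values("log2fc", ascending=False)
    .head(1)
)

print(top_hit)
```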

New package installation (Extended)
Python
• On Terminal / Command Prompt:
python3 -m pip install --user some_package
R
• Within R:
install.packages("some_package")
Most necessary packages for basic data processing
& visualization are installed on the jupyter/datascience-notebook
Docker image already

Why visualize data?
• Numbers are unintuitive
• Exploratory data analysis: Understand the data properties, QC
• Result presentation: Let readers get the basic ideas at a glance
Can you make sense of
these numbers?

How to choose a type of visualization?
• Data type
• Categorical
• Quantitative / Numerical
• Rank / Ordinal
• Relation
• Distribution: e.g. histogram, box plot, violin plot, density plot
• Correlation: e.g. scatter plot, heatmap
• Ranking: e.g. bar plot
• Timeseries: line chart

Data visualization packages on R
• ggplot2: versatile plotting library for various plot types
• gplots
• Specific libraries, e.g. pheatmap, for specific plot types

Data visualization packages on Python
• Matplotlib: versatile, can create any visualization & tune any detail
• Seaborn: more advanced plot types; visually appealing preset styles
• Bokeh / Plotly / HoloViews: visually appealing interactive plots

Data visualization in
bioinformatics

Heatmap
• Showing values with colors
• Often drawn using log fold change values
DESeq2 results: conditional formatting for a glimpse
Source: Differential gene expression induced by Verteporfin in endometrial cancer cells
(Bang, et al., 2019)

Packages for heatmap drawing
• heatmap by ggplot2 geom_tile:
• Native, more options for data normalization & clustering
• More steps
• https://www.r-graph-gallery.com/heatmap
• pheatmap:
• https://davetang.org/muse/2018/05/15/making-a-heatmap-in-r-with-the-pheatmap-package/
• gplots heatmap.2:
• https://www.rdocumentation.org/packages/gplots/versions/3.1.1/topics/heatmap.2

Heatmap with pairwise clustering dendrogram
Heatmap:
- Showing values with colors
Pairwise clustering:
- Group items of high similarity (i.e. short distances) together,
e.g. are technical replicates clustered together?

Volcano plot
• A type of scatter plot
• Visualizes differential expression results
• Shows up / down expression regulation
• y-axis: statistical significance (-log10 P)
• x-axis: log fold change
• Can input results from e.g. DESeq2
https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rna-seq-viz-with-volcanoplot/tutorial.html
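For home practice, a volcano plot can be sketched with Matplotlib. The five points, the cut-offs (|log2FC| of 1, p of 0.05) and the colors are all made-up assumptions for illustration, not the course dataset or a recommended threshold.

```python
import math

import matplotlib
matplotlib.use("Agg")          # draw without opening a window
import matplotlib.pyplot as plt

# Made-up differential expression results.
log2fc = [2.5, -3.1, 0.2, -0.4, 1.8]
pval = [0.0001, 0.002, 0.6, 0.3, 0.04]

# y-axis of a volcano plot: -log10 of the p-value.
neg_log_p = [-math.log10(p) for p in pval]


def classify(lfc, p, lfc_cut=1.0, p_cut=0.05):
    """Label a gene as up, down or not significant (ns)."""
    if p < p_cut and lfc >= lfc_cut:
        return "up"
    if p < p_cut and lfc <= -lfc_cut:
        return "down"
    return "ns"


status = [classify(lfc, p) for lfc, p in zip(log2fc, pval)]
colors = {"up": "tab:red", "down": "tab:blue", "ns": "lightgrey"}

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(log2fc, neg_log_p, c=[colors[s] for s in status])
ax.axhline(-math.log10(0.05), linestyle="--", color="grey")   # significance line
ax.set_xlabel("log2 fold change")
ax.set_ylabel("-log10(p-value)")
```

In a Jupyter notebook the `matplotlib.use("Agg")` line can be dropped so the figure shows up under the cell.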

MA plot
Yin, T., Cook, D. & Lawrence, M. ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol 13, R77 (2012). https://doi.org/10.1186/gb-2012-13-8-r77
• DE analysis results
• Up & down regulation
• y: log fold change
• x: normalized mean
• Color: statistical significance
• Available in DESeq2

Genome-wide Variation: Manhattan plot
• A type of scatter plot
• Shows genome wide location
of loci features, e.g. variants
https://www.researchgate.net/publication/272083683_Genome-wide_association_study_of_clinically_defined_gout_identifies_multiple_risk_loci_and_its_association_with_clinical_subtypes/figures?lo=1

Network
• Connections show
interaction, e.g. between
genes
• Can be drawn by e.g.
Cytoscape
https://www.researchgate.net/publication/321256870_Differential_gene_expression_in_heterophils_isolated_from_commercial_hybrid_and_Thai_indigenous_broiler_chickens_under_quercetin_supplementation/figures?lo=1
Figure labels: Node, Edge

Download data for input
1. On Galaxy, go to
“Shared Data” > “History” > “2021-08-15_DE_analysis”
2. Download the featureCounts result files (Histories 27, 29, 31, 33)
• Save as
“20210819-SRR1210078_WT_rep1-featureCounts.txt”
“20210819-SRR1210079_WT_rep2-featureCounts.txt”
“20210819-SRR1210084_C24_rep1-featureCounts.txt”
“20210819-SRR1210085_C24_rep2-featureCounts.txt”
3. Upload to the Jupyter notebook

ggplot2 basics (hands-on)
Basics
*Plot element layers & settings can be
added by chaining function() with +
Plot anatomy: facet panels, labels, breaks, *layers of geometric objects, *legend

Further ggplot2 examples (Extended)
From the R Graph Gallery:
• Histogram
• https://www.r-graph-gallery.com/220-basic-ggplot2-histogram.html
• Scatterplot
• https://www.r-graph-gallery.com/274-map-a-variable-to-ggplot2-scatterplot.html
• Boxplot
• https://www.r-graph-gallery.com/boxplot.html

Python visualization basics (Extended)
• Matplotlib gallery
• https://matplotlib.org/stable/gallery/index.html
• Seaborn gallery
• https://seaborn.pydata.org/examples/index.html
• Python graph gallery
• https://www.python-graph-gallery.com/

Data visualization
general principles

Is it worth a figure?
• Key result?
• Also depends on journal and
current trend
Vale, 2015; Accelerating scientific publication in biology

Data visualization is where art meets science
BIG DATA V.01
(https://in.pinterest.com/pin/817614507335199087/)

Adjusting a figure
The most basic plot output before
aesthetic tuning is NOT a done deal

Gestalt principles of visual perception
Matplotlib 2.x by Example, 2017

Plot as visual aid: Perception
• Jakob’s Law in UX web design: “People expect your website to work
the same way as the other websites they're using”
• sO thIs iS odd
• Highlight what is important (you ARE distracted)
• Same color for the same variable
What Are Data Visualization Style Guidelines? by Amy Cesal
https://medium.com/nightingale/style-guidelines-92ebe166addc

Reader-friendliness
• Aim: Be intuitive rather than fancy
• Proper contrast (not like this)
• Be color weakness-friendly (this may not be)
• Proper contrast in hues
• Palette generator:
https://color.adobe.com/create/color-accessibility
Proof tool on Adobe Illustrator
More resources:
• https://www.color-blindness.com/coblis-color-blindness-simulator/
• https://helpx.adobe.com/creative-cloud/adobe-color-accessibility-tools.html
• https://creativepro.com/viewing-color-blind-previews-of-pages/

Proper color contrast for categorical data
https://thenode.biologists.com/data-visualization-with-flying-colors/research/
http://bconnelly.net/posts/creating_colorblind-friendly_figures/

Smooth colormap for quantitative data
• The ”scalebar” for heatmap
• Visual “uniformity” preferred
Visually “amplified” expected value changes
Smooth visual gradient

Post-processing
• Export plot output from code in vector graphics, e.g. SVG, PDF
• Editability: you can open with Illustrator, Inkscape, etc
• Full resolution: export to at least 300 dpi for rasterized graphics, e.g. PNG
• Only make necessary & scientifically allowed (i.e. not misleading)
changes
• E.g. Edit font size, add * to indicate statistical significance where necessary
• Always aim to be clean & intuitive
“Make up till you cannot be recognized is disguise” – Dayo Wong
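The export step can be sketched in Python with Matplotlib's `savefig`. The toy line plot is made up, and in-memory buffers are used here only so the sketch runs anywhere; in practice you would pass a file name such as "figure.svg" or "figure.png" (hypothetical names).

```python
import io

import matplotlib
matplotlib.use("Agg")          # draw without opening a window
import matplotlib.pyplot as plt

# A made-up toy plot to have something to export.
fig, ax = plt.subplots(figsize=(3, 3))
ax.plot([0, 1, 2], [0, 1, 4], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Vector graphics: stays editable in Illustrator, Inkscape, etc.
svg_buffer = io.BytesIO()
fig.savefig(svg_buffer, format="svg")

# Rasterized graphics: set the resolution explicitly, e.g. 300 dpi.
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png", dpi=300)

print(len(svg_buffer.getvalue()), len(png_buffer.getvalue()))
```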

Choose your weapon
• Excel: know formulas & functions well for simple handling
• Bash: quick string / file view & manipulation; scripting, e.g. when
looping through hundreds of files or tonnes of lines
• Python / R: More advanced data analytics & visualization
• At least be good enough in one that you can do everything with it
• Use what you are confident with

Google is your friend!!
• StackOverflow (various StackExchange sites)
• Quora/Reddit/CSDN/Qiita (whatever)
• 🔍 {tool I use} {my purpose}
• 🔍 {the error message}
• 🔍 {tool idk how to use} manual
• 🔍 {tool with hard-to-read official manual} tutorial
• 🔍 {tool} cheatsheet

Books on Data &
Visualization
Now Matplotlib 3.4.2

More free online resources
• RegExr: Learn, Build, & Test RegEx
• https://regexr.com/
• regex101: build, test, and debug regex
• https://regex101.com/
• R cheatsheet
• https://www.rstudio.com/resources/cheatsheets/
• R Tidyverse tutorial by DataCamp
• https://www.datacamp.com/community/tutorials/tidyverse-tutorial-r
• Data Visualisation: the Good, the Bad and the Ugly (1) by Mina Pêcheux
• https://minapecheux.com/website/2018/07/17/data-visualization-the-good-the-bad-and-the-ugly-1/

More free online resources (Extended)
• BASH programming
• https://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html
• Advanced Bash-Scripting Guide by Mendel Cooper
• https://tldp.org/LDP/abs/html/
• Bash scripting cheatsheet
• https://devhints.io/bash
• Learnpython.org
• https://www.learnpython.org/
