Data Exploration and Visualisation with R ∗
Yanchang Zhao
http://www.RDataMining.com
R and Data Mining Course
Beijing University of Posts and Telecommunications,
Beijing, China
July 2019
∗ Chapter 3: Data Exploration, in R and Data Mining: Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
Contents
Introduction
Have a Look at Data
Explore Individual Variables
Explore Multiple Variables
More Explorations
Save Charts to Files
Further Readings and Online Resources
Data Exploration and Visualisation with R
Data Exploration and Visualisation
Summary and stats
Various charts like pie charts and histograms
Exploration of multiple variables
Level plot, contour plot and 3D plot
Saving charts into files
Quiz: What’s the Name of This Flower?
Oleg Yunakov [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons.
The Iris Dataset
The iris dataset [Frank and Asuncion, 2010] consists of 50
samples from each of three classes of iris flowers. There are five
attributes in the dataset:
sepal length in cm,
sepal width in cm,
petal length in cm,
petal width in cm, and
class: Iris Setosa, Iris Versicolour, and Iris Virginica.
A detailed description of the dataset can be found at the UCI
Machine Learning Repository †.
† https://archive.ics.uci.edu/ml/datasets/Iris
Size and Variable Names of Data
# number of rows
nrow(iris)
## [1] 150
# number of columns
ncol(iris)
## [1] 5
# dimensionality
dim(iris)
## [1] 150 5
# column names
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid...
## [5] "Species"
Structure of Data
Below we have a look at the structure of the dataset with str().
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
## $ Species : Factor w/ 3 levels "setosa","versicolor",....
150 observations (records, or rows) and 5 variables (or
columns)
The first four variables are numeric.
The last one, Species, is categorical (called a “factor” in R) and
has three levels.
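As a small complement to str(), the sketch below checks the column types and the factor levels directly. It uses only base R; the variable names col_classes and species_levels are introduced here for illustration and do not come from the slides.

```r
# Class of every column: four numeric columns and one factor
col_classes <- sapply(iris, class)
print(col_classes)

# Levels of the factor Species
species_levels <- levels(iris$Species)
print(species_levels)
```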
Attributes of Data
attributes(iris)
## $names
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid...
## [5] "Species"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 ...
## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 ...
## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 ...
## [46] 46 47 48 49 50 51 52 53 54 55 56 57 58 ...
## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 ...
## [76] 76 77 78 79 80 81 82 83 84 85 86 87 88 ...
## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 1...
## [106] 106 107 108 109 110 111 112 113 114 115 116 117 118 1...
## [121] 121 122 123 124 125 126 127 128 129 130 131 132 133 1...
## [136] 136 137 138 139 140 141 142 143 144 145 146 147 148 1...
First/Last Rows of Data
iris[1:3, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
head(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
tail(iris, 3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Spe...
## 148 6.5 3.0 5.2 2.0 virgi...
## 149 6.2 3.4 5.4 2.3 virgi...
## 150 5.9 3.0 5.1 1.8 virgi...
A Single Column
The first 10 values of Sepal.Length
iris[1:10, "Sepal.Length"]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris$Sepal.Length[1:10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
Summary of Data
Function summary()
numeric variables: minimum, maximum, mean, median, and
the first (25%) and third (75%) quartiles
categorical variables (i.e., factors): frequency of every level
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Wid...
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0....
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0....
## Median :5.800 Median :3.000 Median :4.350 Median :1....
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1....
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1....
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2....
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library(Hmisc)
# describe(iris) # check all columns
describe(iris[, c(1, 5)]) # check columns 1 and 5
## iris[, c(1, 5)]
##
## 2 Variables 150 Observations
## -----------------------------------------------------------...
## Sepal.Length
## n missing distinct Info Mean Gmd ...
## 150 0 35 0.998 5.843 0.9462 4....
## .10 .25 .50 .75 .90 .95
## 4.800 5.100 5.800 6.400 6.900 7.255
##
## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9
## -----------------------------------------------------------...
## Species
## n missing distinct
## 150 0 3
##
## Value setosa versicolor virginica
## Frequency 50 50 50
## Proportion 0.333 0.333 0.333
## -----------------------------------------------------------...
Mean, Median, Range and Quartiles
Mean, median and range: mean(), median(), range()
Quartiles and percentiles: quantile()
range(iris$Sepal.Length)
## [1] 4.3 7.9
quantile(iris$Sepal.Length)
## 0% 25% 50% 75% 100%
## 4.3 5.1 5.8 6.4 7.9
quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
## 10% 30% 65%
## 4.80 5.27 6.20
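The slide names mean() and median() without calling them. A minimal sketch in base R follows; computing the interquartile range from quantile() is an addition for illustration, and the names x, q and iqr are not from the slides.

```r
x <- iris$Sepal.Length

mean(x)    # arithmetic mean
median(x)  # 50% quantile

# Interquartile range: distance between the 25% and 75% quantiles
q <- quantile(x, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])
iqr
```

The built-in IQR() function returns the same value as the manual calculation.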
Variance and Histogram
var(iris$Sepal.Length)
## [1] 0.6856935
hist(iris$Sepal.Length)
[Figure: histogram of iris$Sepal.Length; x-axis: iris$Sepal.Length, y-axis: Frequency]
Density
library(magrittr) ## for pipe operations
iris$Sepal.Length %>% density() %>%
plot(main='Density of Sepal.Length')
[Figure: density of Sepal.Length; N = 150, bandwidth = 0.2736; y-axis: Density]
Pie Chart
Frequency of factors: table()
library(dplyr)
iris2 <- iris %>% sample_n(50)
iris2$Species %>% table() %>% pie()
# add percentages
tab <- iris2$Species %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), '\n', percentages, '%')
pie(tab, labels=txt)
[Figure: two pie charts of Species in iris2; the second is labelled setosa 38%, versicolor 36%, virginica 26%]
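Because sample_n() draws a random sample, the percentages on this slide change from run to run. The sketch below shows two ways to get repeatable numbers in base R: fix the seed before sampling (the seed value 42 is arbitrary), or tabulate the full dataset. The names idx, iris_sample, tab_sample, tab_full and pct_full are illustrative only.

```r
# Option 1: reproducible random sample of 50 rows (seed value is arbitrary)
set.seed(42)
idx <- sample(nrow(iris), 50)
iris_sample <- iris[idx, ]
tab_sample <- table(iris_sample$Species)
tab_sample

# Option 2: frequencies and percentages on the full dataset
tab_full <- table(iris$Species)
pct_full <- round(prop.table(tab_full) * 100, 1)
pct_full
```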
Bar Chart
iris2$Species %>% table() %>% barplot()
# add colors and percentages
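# barplot() returns the x-coordinates of the bar midpoints
bb <- iris2$Species %>% table() %>%
  barplot(axisnames=FALSE, main='Species', ylab='Frequency',
          col=c('pink', 'lightblue', 'lightgreen'))
# put each label at half the bar height; tab and txt come from the pie chart code
text(bb, tab/2, labels=txt, cex=1.5)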
[Figure: two bar charts of Species frequency in iris2; the second is coloured and labelled setosa 38%, versicolor 36%, virginica 26%]
Correlation
Covariance and correlation: cov() and cor()
cov(iris$Sepal.Length, iris$Petal.Length)
## [1] 1.274315
cor(iris$Sepal.Length, iris$Petal.Length)
## [1] 0.8717538
cov(iris[, 1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
# cor(iris[,1:4])
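Correlation is covariance rescaled by the two standard deviations: cor(x, y) = cov(x, y) / (sd(x) * sd(y)). The sketch below checks this on the same pair of columns and then runs the commented-out correlation matrix. The names x, y and r_manual are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Pearson correlation rebuilt from covariance and standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual
cor(x, y)

# Correlation matrix of the four numeric columns
round(cor(iris[, 1:4]), 2)
```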
Aggregation
Stats of Sepal.Length for every Species with aggregate()
aggregate(Sepal.Length ~ Species, summary, data = iris)
## Species Sepal.Length.Min. Sepal.Length.1st Qu.
## 1 setosa 4.300 4.800
## 2 versicolor 4.900 5.600
## 3 virginica 4.900 6.225
## Sepal.Length.Median Sepal.Length.Mean Sepal.Length.3rd Qu.
## 1 5.000 5.006 5.200
## 2 5.900 5.936 6.300
## 3 6.500 6.588 6.900
## Sepal.Length.Max.
## 1 5.800
## 2 7.000
## 3 7.900
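aggregate() accepts any summary function, not only summary(). A minimal sketch with mean() follows, plus the same group means via tapply(); the variable name means is introduced here for illustration.

```r
# Mean Sepal.Length per species; FUN can be any summary function
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means

# The same group means as a named vector, via tapply()
tapply(iris$Sepal.Length, iris$Species, mean)
```

The group means agree with the Sepal.Length.Mean column of the summary table above.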
Boxplot
The bar in the middle is the median.
The box shows the interquartile range (IQR), i.e., the range
between the first (25%) and third (75%) quartiles.
boxplot(Sepal.Length ~ Species, data = iris)
[Figure: box plots of Sepal.Length for setosa, versicolor and virginica]
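The numbers behind a box plot can be inspected with boxplot.stats(). The sketch below does this for one species and compares it with quantile(). Note that the box edges ("hinges") used by boxplot.stats() can differ slightly from the quartiles returned by quantile(), and that with the default setting, points more than 1.5 times the box length away from the box are reported as outliers. The names x, q, iqr and bs are illustrative.

```r
x <- iris$Sepal.Length[iris$Species == "virginica"]

# Quartiles and interquartile range
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])
iqr

# $stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# $out:   points drawn individually as outliers
bs <- boxplot.stats(x)
bs$stats
bs$out
```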
Scatter Plot
with(iris, plot(Sepal.Length, Sepal.Width, col = Species,
pch = as.numeric(Species)))
[Figure: scatter plot; x-axis: Sepal.Length, y-axis: Sepal.Width]
Scatter Plot with Jitter
Function jitter(): add a small amount of noise to the data
with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width),
col=Species,pch=as.numeric(Species)))
[Figure: scatter plot with jitter; x-axis: jitter(Sepal.Length), y-axis: jitter(Sepal.Width)]
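Jitter helps because several flowers share exactly the same measurements, so their points are drawn on top of each other. The sketch below counts such overlapping points and shows that jitter() only returns a noisy copy of the data. The seed value and the names xy, n_overlap and jittered are assumptions made for illustration.

```r
# Rows whose (Sepal.Length, Sepal.Width) pair already occurred in an earlier row
xy <- iris[, c("Sepal.Length", "Sepal.Width")]
n_overlap <- sum(duplicated(xy))
n_overlap

# jitter() returns a noisy copy; the original data are not modified
set.seed(1)  # arbitrary seed, only to make the noise repeatable
jittered <- jitter(iris$Sepal.Length)
max(abs(jittered - iris$Sepal.Length))  # size of the largest shift
```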
A Matrix of Scatter Plots
pairs(iris)
[Figure: scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species]
3D Scatter Plot
library(scatterplot3d)
scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
[Figure: 3D scatter plot; axes: iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width]
Interactive 3D Scatter Plot
Package rgl supports interactive 3D scatter plot with plot3d().
library(rgl)
plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
Heat Map
Calculate the distance (dissimilarity) between different flowers in the iris data
with dist() and then plot it with a heat map
dist.matrix <- as.matrix(dist(iris[, 1:4]))
heatmap(dist.matrix)
[Figure: heat map of dist.matrix; rows and columns are labelled with observation numbers]
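By default dist() computes the Euclidean distance between rows. The sketch below recomputes one entry of the distance matrix by hand to make that concrete. The names a, b, d_manual, dist_mat and d_matrix are illustrative.

```r
# Euclidean distance between the first two flowers, computed by hand
a <- unlist(iris[1, 1:4])
b <- unlist(iris[2, 1:4])
d_manual <- sqrt(sum((a - b)^2))

# The same entry taken from the distance matrix
dist_mat <- as.matrix(dist(iris[, 1:4]))
d_matrix <- dist_mat[1, 2]

c(manual = d_manual, matrix = d_matrix)
```

The first two flowers differ only in sepal length (5.1 vs 4.9) and sepal width (3.5 vs 3.0), so the distance is sqrt(0.2^2 + 0.5^2).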
Level Plot
Function rainbow() creates a vector of contiguous colors and rev() reverses a vector;
a palette built this way can be passed to levelplot() through its col.regions argument.
library(lattice)
levelplot(Petal.Width ~ Sepal.Length * Sepal.Width,
data=iris, cuts=8)
[Figure: level plot; x-axis: Sepal.Length, y-axis: Sepal.Width, colour key from 0.0 to 2.5]
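The slide mentions rainbow() and rev() but the call itself uses the default colours. A hedged sketch of how a reversed rainbow palette could be supplied follows. The palette length of 9 and the names pal and p are choices made here for illustration, not taken from the slides.

```r
library(lattice)

# A reversed rainbow palette; the length 9 is an arbitrary choice
pal <- rev(rainbow(9))

p <- levelplot(Petal.Width ~ Sepal.Length * Sepal.Width,
               data = iris, cuts = 8, col.regions = pal)
print(p)  # lattice plots are drawn when printed
```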
Contour
contour() and filled.contour() in package graphics
contourplot() in package lattice
filled.contour(volcano, color.palette=terrain.colors, asp=1,
               plot.axes=contour(volcano, add=TRUE))
[Figure: filled contour plot of volcano with contour lines labelled from 100 to 190]
3D Surface
persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")
[Figure: 3D surface of volcano; axes: volcano, Y, Z]
Parallel Coordinates
Visualising multiple dimensions
library(MASS)
parcoord(iris[1:4], col = iris$Species)
[Figure: parallel coordinates plot; axes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
Parallel Coordinates with Package lattice
library(lattice)
parallelplot(~iris[1:4] | Species, data = iris)
[Figure: parallel coordinates plots in three panels (setosa, versicolor, virginica); variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, scaled from Min to Max]
Visualisation with Package ggplot2
library(ggplot2)
qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)
[Figure: scatter plots in three facets (setosa, versicolor, virginica); x-axis: Sepal.Length, y-axis: Sepal.Width]
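qplot() is marked as deprecated in recent ggplot2 releases. A sketch of what should be an equivalent plot built with ggplot() follows, assuming ggplot2 is installed; the variable name p is introduced here for illustration.

```r
library(ggplot2)

# Same scatter plot as the qplot() call, one facet row per species
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_grid(Species ~ .)
print(p)
```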
Save Charts to Files
Save charts to PDF and PS files: pdf() and postscript()
BMP, JPEG, PNG and TIFF files: bmp(), jpeg(), png() and
tiff()
Close files (or graphics devices) with graphics.off() or
dev.off() after plotting
# save as a PDF file
pdf("myPlot.pdf")
x <- 1:50
plot(x, log(x))
graphics.off()
# Save as a postscript file
postscript("myPlot2.ps")
x <- -20:20
plot(x, x^2)
graphics.off()
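The same open, plot, close pattern can be sketched with an explicit page size and dev.off(). The temporary file and the name out_file are assumptions made so that the example leaves the working directory untouched.

```r
# Write to a temporary file so the working directory is left untouched
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 6, height = 4)  # size in inches
x <- 1:50
plot(x, log(x))
dev.off()  # closes only the current device; graphics.off() closes all of them

file.exists(out_file)
```

The file is only complete once the device has been closed, so forgetting dev.off() or graphics.off() is a common cause of empty or unreadable output files.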
Save ggplot Charts to Files
ggsave(): by default, saves the last plot that you displayed. It
also guesses the type of graphics device from the file extension.
ggsave("myPlot3.png")
ggsave("myPlot4.pdf")
ggsave("myPlot5.jpg")
ggsave("myPlot6.bmp")
ggsave("myPlot7.ps")
ggsave("myPlot8.eps")
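Relying on the last displayed plot can be fragile in scripts. A minimal sketch that passes the plot object to ggsave() explicitly and sets the output size follows, assuming ggplot2 is installed. The file name myPlot_explicit.pdf and the names p and out_file are hypothetical.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

# Pass the plot explicitly and set the size, instead of relying on the last plot
out_file <- file.path(tempdir(), "myPlot_explicit.pdf")  # hypothetical file name
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```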
Further Readings
Examples of ggplot2 plotting:
https://ggplot2.tidyverse.org/
Package iplots: interactive scatter plot, histogram, bar plot, and parallel
coordinates plot
http://rosuda.org/software/iPlots/
Package googleVis: interactive charts with the Google Visualisation API
http://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html
Package ggvis: interactive grammar of graphics
http://ggvis.rstudio.com/
Package rCharts: interactive JavaScript visualisations from R
https://ramnathv.github.io/rCharts/
Online Resources
Book titled R and Data Mining: Examples and Case Studies
http://www.rdatamining.com/docs/RDataMining-book.pdf
R Reference Card for Data Mining
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
Free online courses and documents
http://www.rdatamining.com/resources/
RDataMining Group on LinkedIn (27,000+ members)
http://group.rdatamining.com
Twitter (3,300+ followers)
@RDataMining
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
How to Cite This Work
Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
BibTeX
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-12-396963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}
References
Frank, A. and Asuncion, A. (2010).
UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
http://archive.ics.uci.edu/ml.

More Related Content

PDF
RDataMining slides-clustering-with-r
PDF
Data Exploration and Visualization with R
PDF
Data Clustering with R
PDF
Clustering and Visualisation using R programming
PDF
Regression and Classification with R
PDF
An Introduction to Data Mining with R
PDF
RDataMining slides-time-series-analysis
PDF
Time Series Analysis and Mining with R
RDataMining slides-clustering-with-r
Data Exploration and Visualization with R
Data Clustering with R
Clustering and Visualisation using R programming
Regression and Classification with R
An Introduction to Data Mining with R
RDataMining slides-time-series-analysis
Time Series Analysis and Mining with R

What's hot (20)

PDF
R Workshop for Beginners
PDF
RDataMining slides-regression-classification
PDF
R learning by examples
PPTX
Datamining with R
PDF
Data manipulation on r
PDF
R programming intro with examples
PDF
Data handling in r
PPTX
R programming language
PDF
Rsplit apply combine
PDF
Dplyr and Plyr
PDF
Table of Useful R commands.
PDF
R code for data manipulation
PPTX
R Language Introduction
PDF
Data manipulation with dplyr
PPT
Jarrar: Games
PDF
Data Manipulation Using R (& dplyr)
PDF
Grouping & Summarizing Data in R
PPTX
Sqlserver 2008 r2
PDF
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
PPTX
An Interactive Introduction To R (Programming Language For Statistics)
R Workshop for Beginners
RDataMining slides-regression-classification
R learning by examples
Datamining with R
Data manipulation on r
R programming intro with examples
Data handling in r
R programming language
Rsplit apply combine
Dplyr and Plyr
Table of Useful R commands.
R code for data manipulation
R Language Introduction
Data manipulation with dplyr
Jarrar: Games
Data Manipulation Using R (& dplyr)
Grouping & Summarizing Data in R
Sqlserver 2008 r2
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2
An Interactive Introduction To R (Programming Language For Statistics)
Ad

Similar to RDataMining slides-data-exploration-visualisation (20)

PPTX
R part iii
DOCX
Summerization notes for descriptive statistics using r
PPTX
Descriptive Statistics in R.pptx
PDF
[1062BPY12001] Data analysis with R / April 19
PDF
Graphics in R
PDF
Data Visualization using base graphics
PDF
01_introduction_lab.pdf
PDF
Intro to ggplot2 - Sheffield R Users Group, Feb 2015
PDF
Irisdataanalysiswithr 140801203600-phpapp02
PDF
Iris data analysis example in R
PPTX
Create a Powerpoint using R software and ReporteRs package
PPTX
Create a PowerPoint document from template using R software and ReporteRs pac...
PPTX
Introduction to Data Visualization for Agriculture and Allied Sciences using ...
PPTX
Iris - Most loved dataset
PDF
Case Study: Prediction on Iris Dataset Using KNN Algorithm
PPTX
Exploratory Data Analysis
PPTX
visualisasi data praktik pakai excel, py
PPTX
r studio presentation.pptx
PPTX
r studio presentation.pptx
PPTX
Data Exploration in R.pptx
R part iii
Summerization notes for descriptive statistics using r
Descriptive Statistics in R.pptx
[1062BPY12001] Data analysis with R / April 19
Graphics in R
Data Visualization using base graphics
01_introduction_lab.pdf
Intro to ggplot2 - Sheffield R Users Group, Feb 2015
Irisdataanalysiswithr 140801203600-phpapp02
Iris data analysis example in R
Create a Powerpoint using R software and ReporteRs package
Create a PowerPoint document from template using R software and ReporteRs pac...
Introduction to Data Visualization for Agriculture and Allied Sciences using ...
Iris - Most loved dataset
Case Study: Prediction on Iris Dataset Using KNN Algorithm
Exploratory Data Analysis
visualisasi data praktik pakai excel, py
r studio presentation.pptx
r studio presentation.pptx
Data Exploration in R.pptx
Ad

More from Yanchang Zhao (10)

PDF
RDataMining slides-text-mining-with-r
PDF
RDataMining slides-r-programming
PDF
RDataMining slides-network-analysis-with-r
PDF
RDataMining slides-association-rule-mining-with-r
PDF
RDataMining-reference-card
PDF
Text Mining with R -- an Analysis of Twitter Data
PDF
Association Rule Mining with R
PDF
Introduction to Data Mining with R and Data Import/Export in R
PDF
Time series-mining-slides
PDF
R Reference Card for Data Mining
RDataMining slides-text-mining-with-r
RDataMining slides-r-programming
RDataMining slides-network-analysis-with-r
RDataMining slides-association-rule-mining-with-r
RDataMining-reference-card
Text Mining with R -- an Analysis of Twitter Data
Association Rule Mining with R
Introduction to Data Mining with R and Data Import/Export in R
Time series-mining-slides
R Reference Card for Data Mining

Recently uploaded (20)

PPTX
sap open course for s4hana steps from ECC to s4
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Encapsulation theory and applications.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPT
Teaching material agriculture food technology
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PPTX
Spectroscopy.pptx food analysis technology
PPTX
Big Data Technologies - Introduction.pptx
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Machine learning based COVID-19 study performance prediction
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
sap open course for s4hana steps from ECC to s4
Network Security Unit 5.pdf for BCA BBA.
Encapsulation theory and applications.pdf
20250228 LYD VKU AI Blended-Learning.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Teaching material agriculture food technology
Review of recent advances in non-invasive hemoglobin estimation
Spectral efficient network and resource selection model in 5G networks
Reach Out and Touch Someone: Haptics and Empathic Computing
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Spectroscopy.pptx food analysis technology
Big Data Technologies - Introduction.pptx
NewMind AI Weekly Chronicles - August'25-Week II
The Rise and Fall of 3GPP – Time for a Sabbatical?
MIND Revenue Release Quarter 2 2025 Press Release
Machine learning based COVID-19 study performance prediction
Unlocking AI with Model Context Protocol (MCP)
Advanced methodologies resolving dimensionality complications for autism neur...

RDataMining slides-data-exploration-visualisation

  • 1. Data Exploration and Visualisation with R ∗ Yanchang Zhao http://www.RDataMining.com R and Data Mining Course Beijing University of Posts and Telecommunications, Beijing, China July 2019 ∗ Chapter 3: Data Exploration, in R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 1 / 45
  • 2. Contents Introduction Have a Look at Data Explore Individual Variables Explore Multiple Variables More Explorations Save Charts to Files Further Readings and Online Resources 2 / 45
  • 3. Data Exploration and Visualisation with R Data Exploration and Visualisation Summary and stats Various charts like pie charts and histograms Exploration of multiple variables Level plot, contour plot and 3D plot Saving charts into files 3 / 45
  • 4. Quiz: What’s the Name of This Flower? Oleg Yunakov [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons. 4 / 45
  • 5. The Iris Dataset The iris dataset [Frank and Asuncion, 2010] consists of 50 samples from each of three classes of iris flowers. There are five attributes in the dataset: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm, and class: Iris Setosa, Iris Versicolour, and Iris Virginica. Detailed desription of the dataset can be found at the UCI Machine Learning Repository †. † https://archive.ics.uci.edu/ml/datasets/Iris 5 / 45
  • 6. Contents Introduction Have a Look at Data Explore Individual Variables Explore Multiple Variables More Explorations Save Charts to Files Further Readings and Online Resources 6 / 45
  • 7. Size and Variables Names of Data # number of rows nrow(iris) ## [1] 150 # number of columns ncol(iris) ## [1] 5 # dimensionality dim(iris) ## [1] 150 5 # column names names(iris) ## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid... ## [5] "Species" 7 / 45
  • 8. Structure of Data Below we have a look at the structure of the dataset with str(). str(iris) ## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0... ## $ Species : Factor w/ 3 levels "setosa","versicolor",.... 150 observations (records, or rows) and 5 variables (or columns) The first four variables are numeric. The last one, Species, is categoric (called “factor” in R) and has three levels of values. 8 / 45
  • 9. Attributes of Data attributes(iris) ## $names ## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Wid... ## [5] "Species" ## ## $class ## [1] "data.frame" ## ## $row.names ## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ## [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 ... ## [31] 31 32 33 34 35 36 37 38 39 40 41 42 43 ... ## [46] 46 47 48 49 50 51 52 53 54 55 56 57 58 ... ## [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 ... ## [76] 76 77 78 79 80 81 82 83 84 85 86 87 88 ... ## [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 1... ## [106] 106 107 108 109 110 111 112 113 114 115 116 117 118 1... ## [121] 121 122 123 124 125 126 127 128 129 130 131 132 133 1... ## [136] 136 137 138 139 140 141 142 143 144 145 146 147 148 1... 9 / 45
  • 10. First/Last Rows of Data iris[1:3, ] ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa head(iris, 3) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa tail(iris, 3) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Spe... ## 148 6.5 3.0 5.2 2.0 virgi... ## 149 6.2 3.4 5.4 2.3 virgi... ## 150 5.9 3.0 5.1 1.8 virgi... 10 / 45
  • 11. A Single Column The first 10 values of Sepal.Length iris[1:10, "Sepal.Length"] ## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 iris$Sepal.Length[1:10] ## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 11 / 45
  • 12. Contents Introduction Have a Look at Data Explore Individual Variables Explore Multiple Variables More Explorations Save Charts to Files Further Readings and Online Resources 12 / 45
  • 13. Summary of Data Function summary() numeric variables: minimum, maximum, mean, median, and the first (25%) and third (75%) quartiles categorical variables (i.e., factors): frequency of every level summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Wid... ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.... ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.... ## Median :5.800 Median :3.000 Median :4.350 Median :1.... ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.... ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.... ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.... ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ## 13 / 45
  • 14. library(Hmisc) # describe(iris) # check all columns describe(iris[, c(1, 5)]) # check columns 1 and 5 ## iris[, c(1, 5)] ## ## 2 Variables 150 Observations ## -----------------------------------------------------------... ## Sepal.Length ## n missing distinct Info Mean Gmd ... ## 150 0 35 0.998 5.843 0.9462 4.... ## .10 .25 .50 .75 .90 .95 ## 4.800 5.100 5.800 6.400 6.900 7.255 ## ## lowest : 4.3 4.4 4.5 4.6 4.7, highest: 7.3 7.4 7.6 7.7 7.9 ## -----------------------------------------------------------... ## Species ## n missing distinct ## 150 0 3 ## ## Value setosa versicolor virginica ## Frequency 50 50 50 ## Proportion 0.333 0.333 0.333 ## -----------------------------------------------------------... 14 / 45
  • 15. Mean, Median, Range and Quartiles Mean, median and range: mean(), median(), range() Quartiles and percentiles: quantile() range(iris$Sepal.Length) ## [1] 4.3 7.9 quantile(iris$Sepal.Length) ## 0% 25% 50% 75% 100% ## 4.3 5.1 5.8 6.4 7.9 quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65)) ## 10% 30% 65% ## 4.80 5.27 6.20 15 / 45
  • 16. Variance and Histogram var(iris$Sepal.Length) ## [1] 0.6856935 hist(iris$Sepal.Length) Histogram of iris$Sepal.Length iris$Sepal.Length Frequency 4 5 6 7 8 051015202530 16 / 45
  • 17. Density library(magrittr) ## for pipe operations iris$Sepal.Length %>% density() %>% plot(main='Density of Sepal.Length') 4 5 6 7 8 0.00.10.20.30.4 Density of Sepal.Length N = 150 Bandwidth = 0.2736 Density 17 / 45
  • 18. Pie Chart Frequency of factors: table() library(dplyr) iris2 <- iris %>% sample_n(50) iris2$Species %>% table() %>% pie() # add percentages tab <- iris2$Species %>% table() precentages <- tab %>% prop.table() %>% round(3) * 100 txt <- paste0(names(tab), 'n', precentages, '%') pie(tab, labels=txt) setosa versicolor virginica setosa 38% versicolor 36% virginica 26% 18 / 45
  • 19. Bar Chart iris2$Species %>% table() %>% barplot() # add colors and percentages bb <- iris2$Species %>% table() %>% barplot(axisnames=F, main='Species', ylab='Frequency', col=c('pink', 'lightblue', 'lightgreen')) text(bb, tab/2, labels=txt, cex=1.5) setosa versicolor virginica 051015 Species Frequency 051015 setosa 38% versicolor 36% virginica 26% 19 / 45
  • 20. Contents Introduction Have a Look at Data Explore Individual Variables Explore Multiple Variables More Explorations Save Charts to Files Further Readings and Online Resources 20 / 45
  • 21. Correlation Covariance and correlation: cov() and cor() cov(iris$Sepal.Length, iris$Petal.Length) ## [1] 1.274315 cor(iris$Sepal.Length, iris$Petal.Length) ## [1] 0.8717538 cov(iris[, 1:4]) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707 ## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394 ## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094 ## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063 # cor(iris[,1:4]) 21 / 45
  • 22. Aggreation Stats of Sepal.Length for every Species with aggregate() aggregate(Sepal.Length ~ Species, summary, data = iris) ## Species Sepal.Length.Min. Sepal.Length.1st Qu. ## 1 setosa 4.300 4.800 ## 2 versicolor 4.900 5.600 ## 3 virginica 4.900 6.225 ## Sepal.Length.Median Sepal.Length.Mean Sepal.Length.3rd Qu. ## 1 5.000 5.006 5.200 ## 2 5.900 5.936 6.300 ## 3 6.500 6.588 6.900 ## Sepal.Length.Max. ## 1 5.800 ## 2 7.000 ## 3 7.900 22 / 45
  • 23. Boxplot The bar in the middle is median. The box shows the interquartile range (IQR), i.e., range between the 75% and 25% observation. boxplot(Sepal.Length ~ Species, data = iris) setosa versicolor virginica 4.55.05.56.06.57.07.58.0 23 / 45
  • 24. Scatter Plot with(iris, plot(Sepal.Length, Sepal.Width, col = Species, pch = as.numeric(Species))) 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 2.02.53.03.54.0 Sepal.Length Sepal.Width 24 / 45
  • 25. Scatter Plot with Jitter Function jitter(): add a small amount of noise to the data with(iris, plot(jitter(Sepal.Length), jitter(Sepal.Width), col=Species,pch=as.numeric(Species))) 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 2.02.53.03.54.0 jitter(Sepal.Length) jitter(Sepal.Width) 25 / 45
  • 26. A Matrix of Scatter Plots pairs(iris) Sepal.Length2.03.04.00.51.52.5 4.5 5.5 6.5 7.5 2.0 3.0 4.0 Sepal.Width Petal.Length 1 2 3 4 5 6 7 0.5 1.5 2.5 Petal.Width 4.55.56.57.51234567 1.0 2.0 3.0 1.02.03.0 Species 26 / 45
  • 27. Contents Introduction Have a Look at Data Explore Individual Variables Explore Multiple Variables More Explorations Save Charts to Files Further Readings and Online Resources 27 / 45
  • 28. 3D Scatter plot library(scatterplot3d) scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) 0.0 0.5 1.0 1.5 2.0 2.5 2.02.53.03.54.04.5 4 5 6 7 8 iris$Petal.Width iris$Sepal.Length iris$Sepal.Width 28 / 45
  • 29. Interactive 3D Scatter Plot Package rgl supports interactive 3D scatter plot with plot3d(). library(rgl) plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width) 29 / 45
  • 30. Heat Map Calculate the similarity between different flowers in the iris data with dist() and then plot it with a heat map dist.matrix <- as.matrix(dist(iris[, 1:4])) heatmap(dist.matrix) 422314943394132463674831634154561921322425274417333749112247202631303510385411250284082918111910612313211813110811013613010312610114412114561999458658081826383936860709054107855667627291899796100955276665755598869987586797492641091371051251411461421401131041381171161491291331151351121111487853518777841501471241341271281397173120122114102143 422314943394132463674831634154561921322425274417333749112247202631303510385411250284082918111910612313211813110811013613010312610114412114561999458658081826383936860709054107855667627291899796100955276665755598869987586797492641091371051251411461421401131041381171161491291331151351121111487853518777841501471241341271281397173120122114102143 30 / 45
  • 31. Level Plot
    Function rainbow() creates a vector of contiguous colors. rev() reverses a vector.
    library(lattice)
    levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, data=iris, cuts=8)
    [Figure: level plot of Petal.Width over Sepal.Length (x-axis) and Sepal.Width (y-axis)]
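The slide mentions rainbow() and rev(), but the call shown does not use them. One plausible way to combine them, offered only as a sketch, is to pass a reversed rainbow palette through the col.regions argument of levelplot(). The number of colours and the variable names below are assumptions.

```r
library(lattice)

n_cols <- 9                          # assumed number of colours, chosen arbitrarily
palette_vec <- rev(rainbow(n_cols))  # rainbow() builds the palette, rev() reverses it

# col.regions supplies the colours used to fill the levels.
p <- levelplot(Petal.Width ~ Sepal.Length * Sepal.Width, data = iris,
               cuts = 8, col.regions = palette_vec)
print(p)  # lattice plots are drawn when the object is printed
```

Reversing the palette only changes which end of the colour range is mapped to low values of Petal.Width.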
  • 32. Contour
    contour() and filled.contour() in package graphics
    contourplot() in package lattice
    filled.contour(volcano, color.palette=terrain.colors, asp=1,
                   plot.axes=contour(volcano, add=TRUE))
    [Figure: filled contour plot of the volcano data with contour lines overlaid]
  • 33. 3D Surface
    persp(volcano, theta = 25, phi = 30, expand = 0.5, col = "lightblue")
    [Figure: perspective plot of the volcano surface; axes volcano, Y and Z]
  • 34. Parallel Coordinates
    Visualising multiple dimensions
    library(MASS)
    parcoord(iris[1:4], col = iris$Species)
    [Figure: parallel coordinates plot over Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]
  • 35. Parallel Coordinates with Package lattice
    library(lattice)
    parallelplot(~iris[1:4] | Species, data = iris)
    [Figure: parallel coordinates plots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, scaled from Min to Max, in one panel each for setosa, versicolor and virginica]
  • 36. Visualisation with Package ggplot2
    library(ggplot2)
    qplot(Sepal.Length, Sepal.Width, data = iris, facets = Species ~ .)
    [Figure: scatter plots of Sepal.Width against Sepal.Length, one row of panels each for setosa, versicolor and virginica]
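qplot() is deprecated in recent ggplot2 releases. A roughly equivalent plot can be sketched with the full ggplot() interface as follows. This is an alternative formulation rather than the code from the original slides, and the object name p is an assumption.

```r
library(ggplot2)

# Same variables as the qplot() call: points, one row of panels per species.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  facet_grid(Species ~ .)

print(p)  # a ggplot object is drawn when printed
```

Keeping the plot in an object also makes it easy to add layers later or to pass it to a saving function.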
  • 38. Save Charts to Files
    Save charts to PDF and PS files: pdf() and postscript()
    BMP, JPEG, PNG and TIFF files: bmp(), jpeg(), png() and tiff()
    Close files (or graphics devices) with graphics.off() or dev.off() after plotting
    # save as a PDF file
    pdf("myPlot.pdf")
    x <- 1:50
    plot(x, log(x))
    graphics.off()
    # save as a postscript file
    postscript("myPlot2.ps")
    x <- -20:20
    plot(x, x^2)
    graphics.off()
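The two closing functions differ: graphics.off() shuts down every open graphics device, while dev.off() closes a single one. A minimal sketch that sets the page size explicitly and closes only the device it opened is shown below. The output path under tempdir() and the variable names are assumptions for illustration.

```r
# Hypothetical output location inside the session's temporary directory.
out_file <- file.path(tempdir(), "myPlot.pdf")

# For pdf(), width and height are given in inches.
pdf(out_file, width = 7, height = 5)
dev_id <- dev.cur()        # remember which device was just opened

x <- 1:50
plot(x, log(x), type = "l")

dev.off(dev_id)            # close only this device; others stay open

cat("file written:", file.exists(out_file), "\n")
```

The file is only complete once its device has been closed, so a missing dev.off() is a common cause of empty or unreadable output files.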
  • 39. Save ggplot Charts to Files
    ggsave(): by default, saves the last plot that you displayed. It also guesses the type of graphics device from the file extension.
    ggsave("myPlot3.png")
    ggsave("myPlot4.pdf")
    ggsave("myPlot5.jpg")
    ggsave("myPlot6.bmp")
    ggsave("myPlot7.ps")
    ggsave("myPlot8.eps")
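Relying on the last displayed plot can be fragile in scripts. A sketch that passes the plot object and the output size to ggsave() explicitly is given below. The plot itself, the file name and the dimensions are illustrative assumptions.

```r
library(ggplot2)

# An example plot stored in an object rather than only displayed.
p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_point()

# Hypothetical output path; the .pdf extension selects the PDF device.
out_file <- file.path(tempdir(), "iris_sepal.pdf")

# plot = p saves this specific plot instead of the last one displayed.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

cat("file written:", file.exists(out_file), "\n")
```

Setting width and height explicitly keeps the saved figure the same size regardless of the current plotting window.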
  • 41. Further Readings
    Examples of ggplot2 plotting: https://ggplot2.tidyverse.org/
    Package iplots: interactive scatter plot, histogram, bar plot, and parallel coordinates plot
    http://rosuda.org/software/iPlots/
    Package googleVis: interactive charts with the Google Visualisation API
    http://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html
    Package ggvis: interactive grammar of graphics
    http://ggvis.rstudio.com/
    Package rCharts: interactive javascript visualisations from R
    https://ramnathv.github.io/rCharts/
  • 42. Online Resources
    Book titled R and Data Mining: Examples and Case Studies
    http://www.rdatamining.com/docs/RDataMining-book.pdf
    R Reference Card for Data Mining
    http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
    Free online courses and documents
    http://www.rdatamining.com/resources/
    RDataMining Group on LinkedIn (27,000+ members)
    http://group.rdatamining.com
    Twitter (3,300+ followers)
    @RDataMining
  • 44. How to Cite This Work
    Citation
    Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
    BibTeX
    @BOOK{Zhao2012R,
      title = {R and Data Mining: Examples and Case Studies},
      publisher = {Academic Press, Elsevier},
      year = {2012},
      author = {Yanchang Zhao},
      pages = {256},
      month = {December},
      isbn = {978-0-12-396963-7},
      keywords = {R, data mining},
      url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
    }
  • 45. References I
    Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.