Basics of R Programming
Yanchang Zhao
http://www.RDataMining.com
R and Data Mining Course
Beijing University of Posts and Telecommunications,
Beijing, China
July 2019
Quiz
Have you used R before?
Are you familiar with data mining and machine learning
techniques and algorithms?
Have you used R for data mining and analytics in your
study/research/work?
Contents
Introduction to R
RStudio
Pipe Operations
Data Objects
Control Flow
Parallel Computing
Functions
Data Import and Export
Online Resources
What is R?
R ∗ is a free software environment for statistical computing
and graphics.
R can be easily extended with 14,000+ packages available on
CRAN† (as of July 2019).
Many other packages provided on Bioconductor‡, R-Forge§,
GitHub¶, etc.
R manuals on CRAN
An Introduction to R
The R Language Definition
R Data Import/Export
. . .
∗ http://www.r-project.org/
† http://cran.r-project.org/
‡ http://www.bioconductor.org/
§ http://r-forge.r-project.org/
¶ https://github.com/
R manuals: http://cran.r-project.org/manuals.html
Why R?
R is widely used in both academia and industry.
R is one of the most popular tools for data science and
analytics: ranked #1 from 2011 to 2016, but sadly behind
Python since 2017 :-( ∗∗.
The CRAN Task Views †† provide collections of packages for
different tasks.
Machine learning & statistical learning
Cluster analysis & finite mixture models
Time series analysis
Multivariate statistics
Analysis of spatial data
. . .
∗∗ The KDnuggets polls on Top Analytics, Data Science software:
https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
†† http://cran.r-project.org/web/views/
RStudio‡‡
An integrated development environment (IDE) for R
Runs on various operating systems like Windows, Mac OS X
and Linux
Suggestion: always use an RStudio project, with subfolders:
code: source code
data: raw data, cleaned data
figures: charts and graphs
docs: documents and reports
models: analytics models
‡‡ https://www.rstudio.com/products/rstudio/
RStudio Keyboard Shortcuts
Run current line or selection: Ctrl + enter
Comment / uncomment selection: Ctrl + Shift + C
Clear console: Ctrl + L
Reindent selection: Ctrl + I
Writing Reports and Papers
Sweave + LaTeX: for academic publications
beamer + LaTeX: for presentations
knitr + R Markdown: generating reports and slides in HTML,
PDF and Word formats
Notebooks: R notebook, Jupyter notebook
Pipe Operations
Load library magrittr for pipe operations
Avoid nested function calls
Make code easy to understand
Supported by dplyr and ggplot2
library(magrittr) ## for pipe operations
## traditional way
b <- fun3(fun2(fun1(a), b), d)
## the above can be rewritten to
b <- a %>% fun1() %>% fun2(b) %>% fun3(d)
Quiz: Why not use 'c' in the above example?
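A minimal sketch of the same rewrite with real functions: sqrt(), sum() and round() stand in for fun1, fun2 and fun3, and the input vector is an illustrative assumption, not taken from the slides.

```r
library(magrittr)  # provides the %>% pipe operator

## illustrative input (assumed for this sketch)
x <- c(1, 4, 9, 16)

## traditional way: nested calls are read inside-out
nested <- round(sum(sqrt(x)), digits = 1)

## piped way: steps are read left to right
piped <- x %>% sqrt() %>% sum() %>% round(digits = 1)

## both forms compute the same value
print(identical(nested, piped))
```

Each step receives the result of the previous step as its first argument, so extra arguments such as digits are simply added inside the call.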
Data Types and Structures
Data types
Integer
Numeric
Character
Factor
Logical
Date
Data structures
Vector
Matrix
Data frame
List
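As a rough illustration, class() can be used to check which of the types and structures above an object belongs to. The example values and the names values, types and structures are assumptions made for this sketch.

```r
## one example value per data type listed above (illustrative values)
values <- list(1L, 1.5, "abc", factor("a"), TRUE, Sys.Date())
types <- sapply(values, class)
print(types)

## one example object per data structure listed above
structures <- list(
  vector     = 1:3,
  matrix     = matrix(1:4, nrow = 2),
  data_frame = data.frame(a = 1:2),
  list       = list(1, "a")
)
## class() may return several entries (e.g. for a matrix), so keep the first
print(sapply(structures, function(s) class(s)[1]))
```

Note that a plain vector reports the type of its elements (here "integer") rather than "vector".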
Vector
## integer vector
x <- 1:10
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
## numeric vector, generated randomly from a uniform distribution
y <- runif(5)
y
## [1] 0.95724678 0.02629283 0.49250477 0.07112317 0.93636358
## character vector
(z <- c("abc", "d", "ef", "g"))
## [1] "abc" "d" "ef" "g"
Matrix
## create a matrix with 4 rows, from a vector of 1:20
m <- matrix(1:20, nrow = 4, byrow = TRUE)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## matrix subtraction
m - diag(nrow = 4, ncol = 5)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    2    3    4    5
## [2,]    6    6    8    9   10
## [3,]   11   12   12   14   15
## [4,]   16   17   18   18   20
Data Frame
library(magrittr)
age <- c(45, 22, 61, 14, 37)
gender <- c("Female", "Male", "Male", "Female", "Male")
height <- c(1.68, 1.85, 1.8, 1.66, 1.72)
married <- c(T, F, T, F, F)
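## stringsAsFactors = TRUE keeps gender as a factor (the default before R 4.0)
df <- data.frame(age, gender, height, married,
stringsAsFactors = TRUE) %>% print()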
## age gender height married
## 1 45 Female 1.68 TRUE
## 2 22 Male 1.85 FALSE
## 3 61 Male 1.80 TRUE
## 4 14 Female 1.66 FALSE
## 5 37 Male 1.72 FALSE
str(df)
## 'data.frame': 5 obs. of 4 variables:
## $ age : num 45 22 61 14 37
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2
## $ height : num 1.68 1.85 1.8 1.66 1.72
## $ married: logi TRUE FALSE TRUE FALSE FALSE
Data Slicing
df$age
## [1] 45 22 61 14 37
df[, 1]
## [1] 45 22 61 14 37
df[1, ]
## age gender height married
## 1 45 Female 1.68 TRUE
df[1, 1]
## [1] 45
df$gender[1]
## [1] Female
## Levels: Female Male
Data Subsetting and Sorting
df %>% subset(gender == "Male")
## age gender height married
## 2 22 Male 1.85 FALSE
## 3 61 Male 1.80 TRUE
## 5 37 Male 1.72 FALSE
idx <- order(df$age) %>% print()
## [1] 4 2 5 1 3
df[idx, ]
## age gender height married
## 4 14 Female 1.66 FALSE
## 2 22 Male 1.85 FALSE
## 5 37 Male 1.72 FALSE
## 1 45 Female 1.68 TRUE
## 3 61 Male 1.80 TRUE
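The same ideas extend to combined conditions and descending order. The sketch below rebuilds the data frame from the Data Frame slide so that it runs on its own; the names males_30plus and by_height and the age threshold are illustrative assumptions.

```r
## rebuild the example data frame so the sketch is self-contained
df <- data.frame(
  age     = c(45, 22, 61, 14, 37),
  gender  = c("Female", "Male", "Male", "Female", "Male"),
  height  = c(1.68, 1.85, 1.8, 1.66, 1.72),
  married = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

## subset on two conditions at once
males_30plus <- subset(df, gender == "Male" & age >= 30)
print(males_30plus)

## sort by height, tallest first (negate the key for descending order)
by_height <- df[order(-df$height), ]
print(by_height)
```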
List
x <- 1:10
y <- c("abc", "d", "ef", "g")
ls <- list(x, y) %>% print()
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [1] "abc" "d" "ef" "g"
## retrieve an element in a list
ls[[2]]
## [1] "abc" "d" "ef" "g"
ls[[2]][1]
## [1] "abc"
Character
x <- c("apple", "orange", "pear", "banana")
## search for a pattern
grep(pattern = "an", x)
## [1] 2 4
## search for a pattern and return matched elements
grep(pattern = "an", x, value = T)
## [1] "orange" "banana"
## replace a pattern
gsub(pattern = "an", replacement = "**", x)
## [1] "apple" "or**ge" "pear" "b****a"
Date
library(lubridate)
x <- ymd("2019-07-08")
class(x)
## [1] "Date"
year(x)
## [1] 2019
# month(x)
day(x)
## [1] 8
weekdays(x)
## [1] "Monday"
Date parsing functions: ymd(), ydm(), mdy(), myd(), dmy(),
dym(), yq() in package lubridate
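A short sketch of how the parsing functions differ only in the expected order of year, month and day. The input strings are illustrative assumptions, and parsing the month name "July" assumes an English locale.

```r
library(lubridate)

## the same date written in three different orders
d1 <- ymd("2019-07-08")    # year, month, day
d2 <- dmy("08/07/2019")    # day, month, year
d3 <- mdy("July 8, 2019")  # month, day, year

## all three parse to the same Date object
print(c(d1, d2, d3))
```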
Conditional Control
if . . . else . . .
score <- 4
if (score >= 3) {
print("pass")
} else {
print("fail")
}
## [1] "pass"
ifelse()
score <- 1:5
ifelse(score >= 3, "pass", "fail")
## [1] "fail" "fail" "pass" "pass" "pass"
Loop Control
for, while, repeat
break, next
for (i in 1:5) {
print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
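The other loop constructs listed above can be sketched in the same way. The counters and limits below are illustrative assumptions.

```r
## while: loop as long as the condition holds
i <- 0
while (i < 3) {
  i <- i + 1
}

## repeat: loop forever until break is called
n <- 0
repeat {
  n <- n + 1
  if (n >= 5) break
}

## next: skip the rest of the current iteration (here: even numbers)
odds <- c()
for (k in 1:6) {
  if (k %% 2 == 0) next
  odds <- c(odds, k)
}
print(odds)
```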
Apply Functions
apply(): apply a function to margins of an array or matrix
lapply(): apply a function to every item in a list or vector
and return a list
sapply(): similar to lapply, but return a vector or matrix
vapply(): similar to sapply, but with a pre-specified type of
return value
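A minimal sketch of the four functions side by side. The matrix, the squaring function and the variable names are illustrative assumptions.

```r
## a 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)

## apply(): margin 1 = rows, margin 2 = columns
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

## lapply(): always returns a list
squares_list <- lapply(1:3, function(v) v^2)

## sapply(): simplifies the result to a vector where possible
squares_vec <- sapply(1:3, function(v) v^2)

## vapply(): like sapply(), but each result must be one numeric value
squares_chk <- vapply(1:3, function(v) v^2, FUN.VALUE = numeric(1))

print(row_sums)
print(col_sums)
print(squares_vec)
```

vapply() stops with an error if a result does not match FUN.VALUE, which makes it a safer choice inside functions and scripts.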
Loop vs lapply
## for loop
x <- 1:10
y <- rep(NA, 10)
for (i in seq_along(x)) {
y[i] <- log(x[i])
}
y
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
## [7] 1.9459101 2.0794415 2.1972246 2.3025851
## apply a function (log) to every element of x
tmp <- lapply(x, log)
y <- do.call("c", tmp) %>% print()
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
## [7] 1.9459101 2.0794415 2.1972246 2.3025851
## same as above
sapply(x, log)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79...
## [7] 1.9459101 2.0794415 2.1972246 2.3025851
Parallel Computing
## on Linux or Mac machines
library(parallel)
## parentheses are needed: %>% binds tighter than -
n.cores <- (detectCores() - 1) %>% print()
tmp <- mclapply(x, log, mc.cores=n.cores)
y <- do.call("c", tmp)
## on Windows machines
library(parallel)
## set up cluster
cluster <- makeCluster(n.cores)
## run jobs in parallel
tmp <- parLapply(cluster, x, log)
## stop cluster
stopCluster(cluster)
# collect results
y <- do.call("c", tmp)
Parallel Computing (cont.)
On Windows machines, libraries and global variables used by a
function that runs in parallel have to be explicitly exported to all
nodes.
## on Windows machines
library(parallel)
## set up cluster
cluster <- makeCluster(n.cores)
## load required libraries, if any, on all nodes
tmp <- clusterEvalQ(cluster, library(igraph))
## export required variables, if any, to all nodes
clusterExport(cluster, "myvar")
## run jobs in parallel
tmp <- parLapply(cluster, x, myfunc)
## stop cluster
stopCluster(cluster)
# collect results
y <- do.call("c", tmp)
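A self-contained sketch of the export step. Here myvar and myfunc are hypothetical stand-ins for the undefined names in the slide, and the cluster size of 2 is an assumption chosen so the example runs on small machines.

```r
library(parallel)

## hypothetical global variable and function (not defined in the slides)
myvar <- 10
myfunc <- function(v) v * myvar

## set up a small cluster (2 workers assumed)
cluster <- makeCluster(2)
## myfunc reads myvar, so myvar must exist on every node
clusterExport(cluster, "myvar")
## run jobs in parallel
tmp <- parLapply(cluster, 1:4, myfunc)
## stop cluster
stopCluster(cluster)
## collect results
y <- do.call("c", tmp)
print(y)
```

Without the clusterExport() call, the workers would fail because myvar is not found in their own sessions.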
Functions
Define your own function: calculate the arithmetic average of a
numeric vector
average <- function(x) {
y <- sum(x)
n <- length(x)
z <- y/n
return(z)
}
## calculate the average of 1:10
average(1:10)
## [1] 5.5
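One possible extension, sketched under assumptions: an input check and an na.rm argument with a default value. Neither is part of the original function, and the name average2 is made up for this example.

```r
## hypothetical variant of average() with a default argument
average2 <- function(x, na.rm = FALSE) {
  ## fail early on non-numeric input
  if (!is.numeric(x)) stop("x must be a numeric vector")
  ## optionally drop missing values before averaging
  if (na.rm) x <- x[!is.na(x)]
  sum(x) / length(x)
}

print(average2(1:10))
print(average2(c(1, 2, NA, 3), na.rm = TRUE))
```

The value of the last expression is returned automatically, so an explicit return() is optional.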
Data Import and Export
Read data from and write data to
R native formats (incl. Rdata and RDS)
CSV files
EXCEL files
ODBC databases
SAS databases
R Data Import/Export:
http://cran.r-project.org/doc/manuals/R-data.pdf
Chapter 2: Data Import and Export, in book R and Data Mining:
Examples and Case Studies.
http://www.rdatamining.com/docs/RDataMining-book.pdf
Save and Load R Objects
save(): save R objects into a .Rdata file
load(): read R objects from a .Rdata file
rm(): remove objects from R
a <- 1:10
save(a, file = "../data/dumData.Rdata")
rm(a)
a
## Error in eval(expr, envir, enclos): object ’a’ not found
load("../data/dumData.Rdata")
a
## [1] 1 2 3 4 5 6 7 8 9 10
Save and Load R Objects - More Functions
save.image():
save current workspace to a file
It saves everything!
readRDS():
read a single R object from a .rds file
saveRDS():
save a single R object to a file
Advantage of readRDS() and saveRDS():
You can restore the data under a different object name.
Advantage of load() and save():
You can save multiple R objects to one file.
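Both advantages can be seen in a small sketch. Temporary files are used so nothing is written to the project folders; the object names are illustrative assumptions.

```r
## saveRDS()/readRDS(): one object, restored under any name
rds.file <- tempfile(fileext = ".rds")
a <- 1:10
saveRDS(a, rds.file)
b <- readRDS(rds.file)  # restored as b instead of a
print(identical(a, b))

## save()/load(): several objects in one file, restored under their own names
rdata.file <- tempfile(fileext = ".Rdata")
x1 <- 1:3
x2 <- c("a", "b", "c")
save(x1, x2, file = rdata.file)
rm(x1, x2)
loaded <- load(rdata.file)  # returns the names of the restored objects
print(loaded)
```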
Import from and Export to .CSV Files
write.csv(): write an R object to a .CSV file
read.csv(): read an R object from a .CSV file
# create a data frame
var1 <- 1:5
var2 <- (1:5)/10
var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
df1 <- data.frame(var1, var2, var3)
names(df1) <- c("VarInt", "VarReal", "VarChar")
# save to a csv file
write.csv(df1, "../data/dummmyData.csv", row.names = FALSE)
# read from a csv file
df2 <- read.csv("../data/dummmyData.csv")
print(df2)
## VarInt VarReal VarChar
## 1 1 0.1 R
## 2 2 0.2 and
## 3 3 0.3 Data Mining
## 4 4 0.4 Examples
## 5 5 0.5 Case Studies
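A round-trip sketch that checks the data survives export and import unchanged. It writes to a temporary file rather than ../data, and setting stringsAsFactors = FALSE on both sides is an assumption made so the result is the same across R versions.

```r
## small data frame with an integer and a character column
df1 <- data.frame(VarInt = 1:3,
                  VarChar = c("R", "and", "Data Mining"),
                  stringsAsFactors = FALSE)

## write to a temporary csv file, without row names
csv.file <- tempfile(fileext = ".csv")
write.csv(df1, csv.file, row.names = FALSE)

## read it back, keeping text columns as character
df2 <- read.csv(csv.file, stringsAsFactors = FALSE)
str(df2)
```

Leaving row.names at its default of TRUE would add an extra column of row names to the file, which then shows up as a new variable when the file is read back.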
Import from and Export to EXCEL Files
Package openxlsx: read, write and edit XLSX files
library(openxlsx)
xlsx.file <- "../data/dummmyData.xlsx"
write.xlsx(df2, xlsx.file, sheetName = "sheet1", row.names = F)
df3 <- read.xlsx(xlsx.file, sheet = "sheet1")
df3
## VarInt VarReal VarChar
## 1 1 0.1 R
## 2 2 0.2 and
## 3 3 0.3 Data Mining
## 4 4 0.4 Examples
## 5 5 0.5 Case Studies
Read from Databases
Package RODBC: provides connection to ODBC databases.
Function odbcConnect(): sets up a connection to database
sqlQuery(): sends an SQL query to the database
odbcClose() closes the connection.
library(RODBC)
db <- odbcConnect(dsn = "servername", uid = "userid",
pwd = "******")
sql <- "SELECT * FROM lib.table WHERE ..."
# or read query from file
sql <- readChar("myQuery.sql", nchars=99999)
myData <- sqlQuery(db, sql, errors=TRUE)
odbcClose(db)
Functions sqlFetch(), sqlSave() and sqlUpdate(): read, write
or update a table in an ODBC database
Import Data from SAS
Package foreign provides function read.ssd() for importing SAS
datasets (.sas7bdat files) into R.
library(foreign) # for importing SAS data
# the path of SAS on your computer
sashome <- "C:/Program Files/SAS/SASFoundation/9.4"
filepath <- "./data"
# filename should be no more than 8 characters, without extension
fileName <- "dumData"
# read data from a SAS dataset
a <- read.ssd(file.path(filepath), fileName,
sascmd=file.path(sashome, "sas.exe"))
Alternatives:
function read.xport(): read a file in SAS Transport
(XPORT) format
RStudio : Environment Panel : Import Dataset from
SPSS/SAS/Stata
Online Resources
Book titled R and Data Mining: Examples and Case Studies
http://www.rdatamining.com/docs/RDataMining-book.pdf
R Reference Card for Data Mining
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
Free online courses and documents
http://www.rdatamining.com/resources/
RDataMining Group on LinkedIn (27,000+ members)
http://group.rdatamining.com
Twitter (3,300+ followers)
@RDataMining
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
How to Cite This Work
Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
BibTex
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-12-396963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}
44 / 44

More Related Content

PPTX
R Programming Language
PDF
R programming groundup-basic-section-i
PPTX
R Programming Tutorial for Beginners - -TIB Academy
PDF
R programming & Machine Learning
PPTX
Workshop presentation hands on r programming
PPTX
R language
PDF
Introduction to R
PPT
R-programming-training-in-mumbai
R Programming Language
R programming groundup-basic-section-i
R Programming Tutorial for Beginners - -TIB Academy
R programming & Machine Learning
Workshop presentation hands on r programming
R language
Introduction to R
R-programming-training-in-mumbai

What's hot (20)

PPTX
R programming Fundamentals
PDF
Introduction to R Programming
PDF
R basics
 
PDF
Data Analysis with R (combined slides)
ODP
Introduction to the language R
PPT
R tutorial for a windows environment
PPTX
Programming in R
PDF
R programming for data science
PPTX
Introduction To R Language
PPT
Best corporate-r-programming-training-in-mumbai
PDF
RDataMining slides-network-analysis-with-r
PDF
R - the language
PDF
R programming language: conceptual overview
PPTX
2. R-basics, Vectors, Arrays, Matrices, Factors
PDF
Functional Programming in R
PDF
RDataMining slides-regression-classification
PDF
Introduction to Data Mining with R and Data Import/Export in R
PDF
R tutorial
PPTX
R programming language
PPTX
R language introduction
R programming Fundamentals
Introduction to R Programming
R basics
 
Data Analysis with R (combined slides)
Introduction to the language R
R tutorial for a windows environment
Programming in R
R programming for data science
Introduction To R Language
Best corporate-r-programming-training-in-mumbai
RDataMining slides-network-analysis-with-r
R - the language
R programming language: conceptual overview
2. R-basics, Vectors, Arrays, Matrices, Factors
Functional Programming in R
RDataMining slides-regression-classification
Introduction to Data Mining with R and Data Import/Export in R
R tutorial
R programming language
R language introduction
Ad

Similar to RDataMining slides-r-programming (20)

PDF
R basics
PPT
R Programming for Statistical Applications
PPT
R-programming with example representation.ppt
PPT
Advanced Data Analytics with R Programming.ppt
PPT
Basics of R-Programming with example.ppt
PPT
Basocs of statistics with R-Programming.ppt
PPT
R-Programming.ppt it is based on R programming language
PDF
R-Language-Lab-Manual-lab-1.pdf
PDF
R-Language-Lab-Manual-lab-1.pdf
PDF
R-Language-Lab-Manual-lab-1.pdf
PPTX
DATA MINING USING R (1).pptx
PDF
Introduction to R programming
PDF
Introduction to R Short course Fall 2016
PPT
Basics of R
PPT
PPTX
Coding and Cookies: R basics
PPT
How to obtain and install R.ppt
PPTX
R language tutorial
PPT
Introduction to R for Data Science Technology
PPT
introduction to R with example, Data science
R basics
R Programming for Statistical Applications
R-programming with example representation.ppt
Advanced Data Analytics with R Programming.ppt
Basics of R-Programming with example.ppt
Basocs of statistics with R-Programming.ppt
R-Programming.ppt it is based on R programming language
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
DATA MINING USING R (1).pptx
Introduction to R programming
Introduction to R Short course Fall 2016
Basics of R
Coding and Cookies: R basics
How to obtain and install R.ppt
R language tutorial
Introduction to R for Data Science Technology
introduction to R with example, Data science
Ad

More from Yanchang Zhao (15)

PDF
RDataMining slides-time-series-analysis
PDF
RDataMining slides-text-mining-with-r
PDF
RDataMining slides-data-exploration-visualisation
PDF
RDataMining slides-clustering-with-r
PDF
RDataMining slides-association-rule-mining-with-r
PDF
RDataMining-reference-card
PDF
Text Mining with R -- an Analysis of Twitter Data
PDF
Association Rule Mining with R
PDF
Time Series Analysis and Mining with R
PDF
Regression and Classification with R
PDF
Data Clustering with R
PDF
Data Exploration and Visualization with R
PDF
An Introduction to Data Mining with R
PDF
Time series-mining-slides
PDF
R Reference Card for Data Mining
RDataMining slides-time-series-analysis
RDataMining slides-text-mining-with-r
RDataMining slides-data-exploration-visualisation
RDataMining slides-clustering-with-r
RDataMining slides-association-rule-mining-with-r
RDataMining-reference-card
Text Mining with R -- an Analysis of Twitter Data
Association Rule Mining with R
Time Series Analysis and Mining with R
Regression and Classification with R
Data Clustering with R
Data Exploration and Visualization with R
An Introduction to Data Mining with R
Time series-mining-slides
R Reference Card for Data Mining

Recently uploaded (20)

PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Big Data Technologies - Introduction.pptx
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Cloud computing and distributed systems.
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
A Presentation on Artificial Intelligence
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPT
Teaching material agriculture food technology
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Assigned Numbers - 2025 - Bluetooth® Document
Review of recent advances in non-invasive hemoglobin estimation
Empathic Computing: Creating Shared Understanding
Big Data Technologies - Introduction.pptx
Per capita expenditure prediction using model stacking based on satellite ima...
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
20250228 LYD VKU AI Blended-Learning.pptx
Cloud computing and distributed systems.
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Encapsulation_ Review paper, used for researhc scholars
The AUB Centre for AI in Media Proposal.docx
Chapter 3 Spatial Domain Image Processing.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Dropbox Q2 2025 Financial Results & Investor Presentation
MIND Revenue Release Quarter 2 2025 Press Release
A Presentation on Artificial Intelligence
Mobile App Security Testing_ A Comprehensive Guide.pdf
Teaching material agriculture food technology
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Assigned Numbers - 2025 - Bluetooth® Document

RDataMining slides-r-programming

  • 1. Basics of R Programming Yanchang Zhao http://www.RDataMining.com R and Data Mining Course Beijing University of Posts and Telecommunications, Beijing, China July 2019 1 / 44
  • 2. Quiz Have you used R before? 2 / 44
  • 3. Quiz Have you used R before? Are you familiar with data mining and machine learning techniques and algorithms? 2 / 44
  • 4. Quiz Have you used R before? Are you familiar with data mining and machine learning techniques and algorithms? Have you used R for data mining and analytics in your study/research/work? 2 / 44
  • 5. Contents Introduction to R RStudio Pipe Operations Data Objects Control Flow Parallel Computing Functions Data Import and Export Online Resources 3 / 44
  • 6. What is R? R ∗ is a free software environment for statistical computing and graphics. R can be easily extended with 14,000+ packages available on CRAN† (as of July 2019). Many other packages provided on Bioconductor‡, R-Forge§, GitHub¶, etc. R manuals on CRAN An Introduction to R The R Language Definition R Data Import/Export . . . ∗ http://www.r-project.org/ † http://cran.r-project.org/ ‡ http://www.bioconductor.org/ § http://r-forge.r-project.org/ ¶ https://github.com/ http://cran.r-project.org/manuals.html 4 / 44
  • 7. Why R? R is widely used in both academia and industry. R is one of the most popular tools for data science and analytics, ranked #1 from 2011 to 2016, but sadly overtaken by Python since 2017, :-( ∗∗. The CRAN Task Views †† provide collections of packages for different tasks. Machine learning & atatistical learning Cluster analysis & finite mixture models Time series analysis Multivariate statistics Analysis of spatial data . . . ∗∗ The KDnuggets polls on Top Analytics, Data Science software https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html †† http://cran.r-project.org/web/views/ 5 / 44
  • 8. Contents Introduction to R RStudio Pipe Operations Data Objects Control Flow Parallel Computing Functions Data Import and Export Online Resources 6 / 44
  • 9. RStudio‡‡ An integrated development environment (IDE) for R Runs on various operating systems like Windows, Mac OS X and Linux Suggestion: always using an RStudio project, with subfolders code: source code data: raw data, cleaned data figures: charts and graphs docs: documents and reports models: analytics models ‡‡ https://www.rstudio.com/products/rstudio/ 7 / 44
  • 11. RStudio Keyboard Shortcuts Run current line or selection: Ctrl + enter Comment / uncomment selection: Ctrl + Shift + C Clear console: Ctrl + L Reindent selection: Ctrl + I 9 / 44
  • 12. Writing Reports and Papers Sweave + LaTex: for academic publications beamer + LaTex: for presentations knitr + R Markdown: generating reports and slides in HTML, PDF and WORD formats Notebooks: R notebook, Jupiter notebook 10 / 44
  • 13. Contents Introduction to R RStudio Pipe Operations Data Objects Control Flow Parallel Computing Functions Data Import and Export Online Resources 11 / 44
  • 14. Pipe Operations Load library magrittr for pipe operations Avoid nested function calls Make code easy to understand Supported by dplyr and ggplot2 library(magrittr) ## for pipe operations ## traditional way b <- fun3(fun2(fun1(a), b), d) ## the above can be rewritten to b <- a %>% fun1() %>% fun2(b) %>% fun3(d) 12 / 44
  • 15. Pipe Operations Load library magrittr for pipe operations Avoid nested function calls Make code easy to understand Supported by dplyr and ggplot2 library(magrittr) ## for pipe operations ## traditional way b <- fun3(fun2(fun1(a), b), d) ## the above can be rewritten to b <- a %>% fun1() %>% fun2(b) %>% fun3(d) Quiz: Why not use ’c’ in above example? 12 / 44
  • 16. Contents Introduction to R RStudio Pipe Operations Data Objects Control Flow Parallel Computing Functions Data Import and Export Online Resources 13 / 44
  • 17. Data Types and Structures Data types Integer Numeric Character Factor Logical Date Data structures Vector Matrix Data frame List 14 / 44
  • 18. Vector ## integer vector x <- 1:10 print(x) ## [1] 1 2 3 4 5 6 7 8 9 10 ## numeric vector, generated randomly from a uniform distribution y <- runif(5) y ## [1] 0.95724678 0.02629283 0.49250477 0.07112317 0.93636358 ## character vector (z <- c("abc", "d", "ef", "g")) ## [1] "abc" "d" "ef" "g" 15 / 44
  • 19. Matrix ## create a matrix with 4 rows, from a vector of 1:20 m <- matrix(1:20, nrow = 4, byrow = T) m ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 2 3 4 5 ## [2,] 6 7 8 9 10 ## [3,] 11 12 13 14 15 ## [4,] 16 17 18 19 20 ## matrix subtraction m - diag(nrow = 4, ncol = 5) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0 2 3 4 5 ## [2,] 6 6 8 9 10 ## [3,] 11 12 12 14 15 ## [4,] 16 17 18 18 20 16 / 44
  • 20. Data Frame library(magrittr) age <- c(45, 22, 61, 14, 37) gender <- c("Female", "Male", "Male", "Female", "Male") height <- c(1.68, 1.85, 1.8, 1.66, 1.72) married <- c(T, F, T, F, F) df <- data.frame(age, gender, height, married) %>% print() ## age gender height married ## 1 45 Female 1.68 TRUE ## 2 22 Male 1.85 FALSE ## 3 61 Male 1.80 TRUE ## 4 14 Female 1.66 FALSE ## 5 37 Male 1.72 FALSE str(df) ## 'data.frame': 5 obs. of 4 variables: ## $ age : num 45 22 61 14 37 ## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 ## $ height : num 1.68 1.85 1.8 1.66 1.72 ## $ married: logi TRUE FALSE TRUE FALSE FALSE 17 / 44
  • 21. Data Slicing df$age ## [1] 45 22 61 14 37 df[, 1] ## [1] 45 22 61 14 37 df[1, ] ## age gender height married ## 1 45 Female 1.68 TRUE df[1, 1] ## [1] 45 df$gender[1] ## [1] Female ## Levels: Female Male 18 / 44
  • 22. Data Subsetting and Sorting df %>% subset(gender == "Male") ## age gender height married ## 2 22 Male 1.85 FALSE ## 3 61 Male 1.80 TRUE ## 5 37 Male 1.72 FALSE idx <- order(df$age) %>% print() ## [1] 4 2 5 1 3 df[idx, ] ## age gender height married ## 4 14 Female 1.66 FALSE ## 2 22 Male 1.85 FALSE ## 5 37 Male 1.72 FALSE ## 1 45 Female 1.68 TRUE ## 3 61 Male 1.80 TRUE 19 / 44
  • 23. List x <- 1:10 y <- c("abc", "d", "ef", "g") ls <- list(x, y) %>% print() ## [[1]] ## [1] 1 2 3 4 5 6 7 8 9 10 ## ## [[2]] ## [1] "abc" "d" "ef" "g" ## retrieve an element in a list ls[[2]] ## [1] "abc" "d" "ef" "g" ls[[2]][1] ## [1] "abc" 20 / 44
  • 24. Character x <- c("apple", "orange", "pear", "banana") ## search for a pattern grep(pattern = "an", x) ## [1] 2 4 ## search for a pattern and return matched elements grep(pattern = "an", x, value = T) ## [1] "orange" "banana" ## replace a pattern gsub(pattern = "an", replacement = "**", x) ## [1] "apple" "or**ge" "pear" "b****a" 21 / 44
  • 25. Date library(lubridate) x <- ymd("2019-07-08") class(x) ## [1] "Date" year(x) ## [1] 2019 # month(x) day(x) ## [1] 8 weekdays(x) ## [1] "Monday" Date parsing functions: ymd(), ydm(), mdy(), myd(), dmy(), dym(), yq() in package lubridate 22 / 44
  • 26. Contents Introduction to R RStudio Pipe Operations Data Objects Control Flow Parallel Computing Functions Data Import and Export Online Resources 23 / 44
  • 27. Conditional Control if . . . else . . . score <- 4 if (score >= 3) { print("pass") } else { print("fail") } ## [1] "pass" ifelse() score <- 1:5 ifelse(score >= 3, "pass", "fail") ## [1] "fail" "fail" "pass" "pass" "pass" 24 / 44
  • 28. Loop Control for, while, repeat break, next for (i in 1:5) { print(i^2) } ## [1] 1 ## [1] 4 ## [1] 9 ## [1] 16 ## [1] 25 25 / 44
  • 29. Apply Functions apply(): apply a function to margins of an array or matrix lapply(): apply a function to every item in a list or vector and return a list sapply(): similar to lapply, but return a vector or matrix vapply(): similar to sapply, but as a pre-specified type of return value 26 / 44
  • 30. Loop vs lapply ## for loop x <- 1:10 y <- rep(NA, 10) for (i in 1:length(x)) { y[i] <- log(x[i]) } y ## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79... ## [7] 1.9459101 2.0794415 2.1972246 2.3025851 ## apply a function (log) to every element of x tmp <- lapply(x, log) y <- do.call("c", tmp) %>% print() ## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79... ## [7] 1.9459101 2.0794415 2.1972246 2.3025851 ## same as above sapply(x, log) ## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.79... ## [7] 1.9459101 2.0794415 2.1972246 2.3025851 27 / 44
  • 31. Contents Introduction to R RStudio Pipe Operations Data Objects Control Flow Parallel Computing Functions Data Import and Export Online Resources 28 / 44
  • 32. Parallel Computing ## on Linux or Mac machines library(parallel) n.cores <- detectCores() - 1 %>% print() tmp <- mclapply(x, log, mc.cores=n.cores) y <- do.call("c", tmp) ## on Windows machines library(parallel) ## set up cluster cluster <- makeCluster(n.cores) ## run jobs in parallel tmp <- parLapply(cluster, x, log) ## stop cluster stopCluster(cluster) # collect results y <- do.call("c", tmp) 29 / 44
  • 33. Parallel Computing (cont.) On Windows machines, libraries and global variables used by a function that runs in parallel have to be explicitly exported to all nodes. ## on Windows machines library(parallel) ## set up cluster cluster <- makeCluster(n.cores) ## load required libraries, if any, on all nodes tmp <- clusterEvalQ(cluster, library(igraph)) ## export required variables, if any, to all nodes clusterExport(cluster, "myvar") ## run jobs in parallel tmp <- parLapply(cluster, x, myfunc) ## stop cluster stopCluster(cluster) ## collect results y <- do.call("c", tmp) 30 / 44
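The slides leave myvar and myfunc undefined. The sketch below fills them in with placeholder definitions (assumptions for illustration only) so that the cluster workflow can be run end to end; it assumes at least two worker processes can be started on the machine.

```r
library(parallel)

## placeholder global variable and function, for illustration only
myvar <- 10
myfunc <- function(v) v * myvar      # relies on the global variable myvar

x <- 1:4
cluster <- makeCluster(2)            # start two worker processes
clusterExport(cluster, "myvar")      # copy myvar to every worker
tmp <- parLapply(cluster, x, myfunc) # run myfunc on each element of x
stopCluster(cluster)                 # always release the workers

y <- do.call("c", tmp)
y
## [1] 10 20 30 40
```

Without the clusterExport() call, the workers would not find myvar, because each worker starts with its own empty global environment.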
  • 36. Functions Define your own function: calculate the arithmetic average of a numeric vector average <- function(x) { y <- sum(x) n <- length(x) z <- y/n return(z) } ## calculate the average of 1:10 average(1:10) ## [1] 5.5 32 / 44
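Building on the average() example above, a function can also take arguments with default values and check its inputs. The helper wavg() below is a hypothetical extension, not part of the original slides.

```r
## weighted average; w defaults to equal weights (hypothetical helper)
wavg <- function(x, w = rep(1, length(x))) {
  stopifnot(is.numeric(x), length(w) == length(x))  # basic input checks
  sum(x * w) / sum(w)   # the value of the last expression is returned
}

wavg(1:10)                        # equal weights: same as average(1:10)
## [1] 5.5
wavg(c(1, 2, 3), w = c(3, 2, 1))  # (3 + 4 + 3) / 6
## [1] 1.666667
```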
  • 38. Data Import and Export Read data from and write data to: R native formats (incl. Rdata and RDS); CSV files; Excel files; ODBC databases; SAS databases. R Data Import/Export: http://cran.r-project.org/doc/manuals/R-data.pdf Chapter 2: Data Import and Export, in book R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf 34 / 44
  • 39. Save and Load R Objects save(): save R objects into a .Rdata file load(): read R objects from a .Rdata file rm(): remove objects from R a <- 1:10 save(a, file = "../data/dumData.Rdata") rm(a) a ## Error in eval(expr, envir, enclos): object 'a' not found load("../data/dumData.Rdata") a ## [1] 1 2 3 4 5 6 7 8 9 10 35 / 44
  • 40. Save and Load R Objects - More Functions save.image(): save current workspace to a file It saves everything! readRDS(): read a single R object from a .rds file saveRDS(): save a single R object to a file Advantage of readRDS() and saveRDS(): You can restore the data under a different object name. Advantage of load() and save(): You can save multiple R objects to one file. 36 / 44
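The difference between the two pairs of functions can be seen in a short sketch. It writes to temporary files created with tempfile() so that it does not depend on the ../data directory used elsewhere in the slides.

```r
## saveRDS()/readRDS(): one object, restored under any name
a <- 1:10
rds.file <- tempfile(fileext = ".rds")    # temporary file, for illustration
saveRDS(a, rds.file)
b <- readRDS(rds.file)                    # restored under the new name b
identical(a, b)
## [1] TRUE

## save()/load(): several objects in one file, names are preserved
x1 <- "R"
x2 <- c(TRUE, FALSE)
rdata.file <- tempfile(fileext = ".RData")
save(x1, x2, file = rdata.file)
rm(x1, x2)
load(rdata.file)                          # recreates x1 and x2
```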
  • 41. Import from and Export to .CSV Files write.csv(): write a data frame to a .CSV file read.csv(): read a .CSV file into a data frame # create a data frame var1 <- 1:5 var2 <- (1:5)/10 var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies") df1 <- data.frame(var1, var2, var3) names(df1) <- c("VarInt", "VarReal", "VarChar") # save to a csv file write.csv(df1, "../data/dummmyData.csv", row.names = FALSE) # read from a csv file df2 <- read.csv("../data/dummmyData.csv") print(df2) ## VarInt VarReal VarChar ## 1 1 0.1 R ## 2 2 0.2 and ## 3 3 0.3 Data Mining ## 4 4 0.4 Examples ## 5 5 0.5 Case Studies 37 / 44
  • 42. Import from and Export to Excel Files Package openxlsx: read, write and edit XLSX files library(openxlsx) xlsx.file <- "../data/dummmyData.xlsx" write.xlsx(df2, xlsx.file, sheetName = "sheet1", rowNames = FALSE) df3 <- read.xlsx(xlsx.file, sheet = "sheet1") df3 ## VarInt VarReal VarChar ## 1 1 0.1 R ## 2 2 0.2 and ## 3 3 0.3 Data Mining ## 4 4 0.4 Examples ## 5 5 0.5 Case Studies 38 / 44
  • 44. Read from Databases Package RODBC: provides connection to ODBC databases. Function odbcConnect(): sets up a connection to database sqlQuery(): sends an SQL query to the database odbcClose() closes the connection. library(RODBC) db <- odbcConnect(dsn = "servername", uid = "userid", pwd = "******") sql <- "SELECT * FROM lib.table WHERE ..." # or read query from file sql <- readChar("myQuery.sql", nchars=99999) myData <- sqlQuery(db, sql, errors=TRUE) odbcClose(db) Functions sqlFetch(), sqlSave() and sqlUpdate(): read, write or update a table in an ODBC database 39 / 44
  • 46. Import Data from SAS Package foreign provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R. library(foreign) # for importing SAS data # the path of SAS on your computer sashome <- "C:/Program Files/SAS/SASFoundation/9.4" filepath <- "./data" # filename should be no more than 8 characters, without extension fileName <- "dumData" # read data from a SAS dataset a <- read.ssd(file.path(filepath), fileName, sascmd=file.path(sashome, "sas.exe")) Alternatives: function read.xport(): read a file in SAS Transport (XPORT) format RStudio : Environment Panel : Import Dataset from SPSS/SAS/Stata 40 / 44
  • 48. Online Resources Book titled R and Data Mining: Examples and Case Studies http://www.rdatamining.com/docs/RDataMining-book.pdf R Reference Card for Data Mining http://www.rdatamining.com/docs/RDataMining-reference-card.pdf Free online courses and documents http://www.rdatamining.com/resources/ RDataMining Group on LinkedIn (27,000+ members) http://group.rdatamining.com Twitter (3,300+ followers) @RDataMining 42 / 44
  • 50. How to Cite This Work Citation Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN 978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256 pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf. BibTeX @BOOK{Zhao2012R, title = {R and Data Mining: Examples and Case Studies}, publisher = {Academic Press, Elsevier}, year = {2012}, author = {Yanchang Zhao}, pages = {256}, month = {December}, isbn = {978-0-12-396963-7}, keywords = {R, data mining}, url = {http://www.rdatamining.com/docs/RDataMining-book.pdf} } 44 / 44