Next generation programming in R
Florian Uhlitz
uhlitz@hu-berlin.de
uhlitz.github.io
magrittr: %>%
readr: load data
tidyr: reshape data
dplyr: manipulate data
Stefan Milton Bache,
University of Southern Denmark
Hadley Wickham,
Rice University, RStudio
Recent developments in the R environment
Toolbox for data wrangling in R
magrittr: %>%
data wrangling: load (readr) %>% reshape (tidyr) %>% manipulate (dplyr)
data analysis: model (base, broom), visualise (ggplot2), report (rmarkdown)
adapted from H. Wickham
magrittr
In a pipe, the result of the left-hand statement is handed over as the first argument to the function on the right-hand side (similar to the Unix pipe operator |):
f(x, y)        is the same as   x %>% f(y)
f(x, y, z)     is the same as   x %>% f(y, z)
f2(f1(x), y)   is the same as   f1(x) %>% f2(y)
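The rewriting rules above can be checked with a small sketch. The functions f, f1 and f2 are hypothetical and exist only for illustration; subtraction is used so that the position of each argument is visible in the result.

```r
library(magrittr)

# Hypothetical functions, not part of any package.
f  <- function(a, b, c = 0) a - b - c
f1 <- function(a) a * 10
f2 <- function(a, b) a - b

x <- 10
y <- 3
z <- 2

two_args   <- x %>% f(y)        # same as f(x, y)
three_args <- x %>% f(y, z)     # same as f(x, y, z)
nested     <- f1(x) %>% f2(y)   # same as f2(f1(x), y)

# Each piped call gives the same value as its plain counterpart.
stopifnot(
  identical(two_args, f(x, y)),
  identical(three_args, f(x, y, z)),
  identical(nested, f2(f1(x), y))
)
```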
magrittr
nested functions vs. chain of functions
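As a minimal sketch of the contrast, the same computation can be written once as nested calls and once as a chain. The input vector is made up for illustration; only base R functions and magrittr's %>% are used.

```r
library(magrittr)

values <- c(1, 4, 9)  # illustrative input

# Nested functions: read from the inside out.
nested <- round(sqrt(sum(values)), 1)

# Chain of functions: read from left to right, in the order of execution.
chained <- values %>%
  sum() %>%
  sqrt() %>%
  round(1)

stopifnot(identical(nested, chained))
```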
readr, readxl, haven
readr::read_csv()
readr::read_tsv()
readr::read_log()
readr::read_delim()
readr::read_fwf()
readr::read_table()
readxl::read_excel()
haven::read_sas()
haven::read_spss()
haven::read_stata()
tidyr
gather() spread()
separate() unite()
Reshaping
adapted from rstudio.com/resources/cheatsheets/
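The four reshaping verbs might be used as in the sketch below. The tables cases and dates are invented for illustration and are not the data sets shown on the slide.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
cases <- data.frame(
  country = c("FR", "DE"),
  `2011` = c(7000, 5800),
  `2012` = c(6900, 6000),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# gather(): collapse the year columns into key/value pairs (wide to long).
long <- gather(cases, key = "year", value = "n", -country)

# spread(): the inverse, one column per distinct key (long to wide).
wide <- spread(long, key = year, value = n)

# separate(): split one column into several; unite(): paste them back together.
dates <- data.frame(date = c("2015-01-15", "2016-03-02"), stringsAsFactors = FALSE)
parts <- separate(dates, date, into = c("y", "m", "d"), sep = "-")
back  <- unite(parts, "date", y, m, d, sep = "-")
```

gather() followed by spread() on the same key and value columns returns the original layout, and unite() undoes separate() when the same separator is used.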
dplyr
Subsetting
filter(x > 1): keep the rows where x is greater than 1 (x = 1, 2, 3 becomes x = 2, 3)
select(B, C, E): keep columns B, C and E (columns A B C D E become B C E)
adapted from rstudio.com/resources/cheatsheets/
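A minimal sketch of both subsetting verbs, assuming a made-up data frame whose column names follow the slide (A to E plus x):

```r
library(dplyr)

# Hypothetical table; only the column names are taken from the slide.
df <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12, E = 13:15, x = 1:3)

rows <- df %>% filter(x > 1)     # subset observations: rows where x is 2 or 3
cols <- df %>% select(B, C, E)   # subset variables: three of the six columns
```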
dplyr
Transforming: mutate(z = x + y)
x = 1, 2, 3 and y = 4, 5, 6 gain a new column z = 5, 7, 9
Summarising: summarise(A = sum(x), B = sum(y))
x = 1, 2, 3 and y = 4, 5, 6 collapse to a single row with A = 6, B = 15
group_by() %>% mutate()
group_by() %>% summarise()
adapted from rstudio.com/resources/cheatsheets/
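The slide's numbers can be reproduced with the sketch below. The grouping column g and the derived column share are assumptions added here to illustrate the group_by() variants; they do not appear on the slide.

```r
library(dplyr)

# x and y as on the slide; g is a hypothetical grouping column.
df <- data.frame(g = c("a", "a", "b"), x = 1:3, y = 4:6)

transformed <- df %>% mutate(z = x + y)                    # adds column z
summarised  <- df %>% summarise(A = sum(x), B = sum(y))    # one row in total

# With group_by(), the same verbs operate within each group.
grouped_mutate <- df %>%
  group_by(g) %>%
  mutate(share = x / sum(x))          # sum(x) is computed per group

grouped_summary <- df %>%
  group_by(g) %>%
  summarise(A = sum(x), B = sum(y))   # one row per group
```

mutate() keeps the number of rows, while summarise() returns one row per group (or a single row for ungrouped data).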
What's tidy data?
KEEP CALM AND TIDY UP
»Happy families are all alike; every unhappy family is unhappy in its own way.«
Leo Tolstoy
Anna Karenina principle
»Tidy data sets are all alike; every messy data set is messy in its own way.«
Hadley Wickham
Tidy data principle
Tidy data definition
Wickham, H. (2014). Tidy Data. Journal of Statistical Software
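library(readxl)    # read_excel()
library(magrittr)  # %>% and set_colnames()
library(dplyr)     # slice()
library(tidyr)     # fill(), separate()
library(readr)     # write_tsv()
# mynames: character vector of column names, defined elsewhere
read_excel("untidy_data.xlsx") %>%
  set_colnames(mynames) %>%
  slice(1:36) %>%
  fill(group, condition) %>%
  separate(group, into = c("Gene", "Mutation", "clone"), sep = "_") %>%
  write_tsv("tidy_data.tsv")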
read_excel
read_excel %>% set_colnames
read_excel %>% set_colnames %>% tail
read_excel %>% set_colnames %>% slice
read_excel %>% set_colnames %>% slice %>% fill
read_excel %>% set_colnames %>% slice %>% fill %>% select
read_excel %>% set_colnames %>% slice %>% fill %>% select %>% distinct
read_excel %>% set_colnames %>% slice %>% fill %>% select %>% distinct %>% separate
Caution!
readr, tidyr & dplyr do “clever” stuff
(heuristics such as guessing a column's class by looking at the first 1000 entries).
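One way to avoid depending on such guesses is to state the column types explicitly. The sketch below assumes a small made-up CSV file written to a temporary path; the column names id and score are hypothetical.

```r
library(readr)

# Hypothetical file: 'id' consists of digits but should be kept as text.
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "007,1.5", "042,2"), path)

# col_types overrides the guessing heuristic for the listed columns.
explicit <- read_csv(
  path,
  col_types = cols(
    id    = col_character(),
    score = col_double()
  )
)
```

With col_types given, the result no longer depends on which rows readr inspects, and the leading zeros in id are preserved because the column is read as text.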
read_excel %>% set_colnames %>% slice %>% fill %>% select %>% distinct %>% separate %>% unite
read_tsv
read_tsv %>% gather(key, value, -variable)
read_tsv %>% gather %>% spread(key, value)
read_tsv %>% gather
read_tsv %>% gather %>% filter
read_tsv %>% gather %>% filter %>% group_by
read_tsv %>% gather %>% filter %>% group_by %>% summarise %>% arrange
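The chain built up step by step above could look as follows once the arguments are filled in. The input file, its column names (variable, s1 to s3), the filter condition and the summary statistic are all assumptions made for this sketch; the slides do not show them.

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical tab-separated input: one row per variable, one column per sample.
path <- tempfile(fileext = ".tsv")
writeLines(c("variable\ts1\ts2\ts3",
             "geneA\t1\t2\t3",
             "geneB\t4\t5\t6"), path)

result <- read_tsv(path, col_types = cols(variable = col_character(),
                                          .default = col_double())) %>%
  gather(key, value, -variable) %>%        # wide to long: columns key and value
  filter(key != "s3") %>%                  # drop one sample
  group_by(variable) %>%                   # one group per variable
  summarise(mean_value = mean(value)) %>%  # collapse each group to one row
  arrange(desc(mean_value))                # highest mean first
```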
Data Wrangling
with dplyr and tidyr
Cheat Sheet
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • info@rstudio.com • 844-448-1212 • rstudio.com
Syntax - Helpful conventions for wrangling
dplyr::tbl_df(iris)
Converts data to tbl class. tbl’s are easier to examine than
data frames. R displays only the data that fits onscreen:
  Source: local data frame [150 x 5]
     Sepal.Length Sepal.Width Petal.Length
  1           5.1         3.5          1.4
  2           4.9         3.0          1.4
  3           4.7         3.2          1.3
  4           4.6         3.1          1.5
  5           5.0         3.6          1.4
  ..          ...         ...          ...
  Variables not shown: Petal.Width (dbl), Species (fctr)
dplyr::glimpse(iris)
Information dense summary of tbl data.
utils::View(iris)
View data set in spreadsheet-like display (note capital V).
dplyr::%>%
Passes object on left-hand side as first argument (or .
argument) of function on right-hand side.
"Piping" with %>% makes code more readable, e.g.
iris %>%
group_by(Species) %>%
summarise(avg = mean(Sepal.Width)) %>%
arrange(avg)
x %>% f(y) is the same as f(x, y)
y %>% f(x, ., z) is the same as f(x, y, z)
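The two equivalences can be verified with a short sketch. The function f is hypothetical and simply pastes its arguments together, so the position of each argument shows up in the result.

```r
library(magrittr)

# Hypothetical three-argument function.
f <- function(a, b, c = "") paste(a, b, c)

first_arg   <- "x" %>% f("y")          # same as f("x", "y")
placeholder <- "y" %>% f("x", ., "z")  # same as f("x", "y", "z")

stopifnot(
  identical(first_arg, f("x", "y")),
  identical(placeholder, f("x", "y", "z"))
)
```

The dot marks where the left-hand side is placed; without a dot among the arguments, the left-hand side becomes the first argument.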
Tidy Data - A foundation for wrangling in R
In a tidy data set: each variable is saved in its own column & each observation is saved in its own row.
Tidy data complements R’s vectorized
operations. R will automatically preserve
observations as you manipulate variables.
No other format works as intuitively with R.
Reshaping Data - Change the layout of a data set
tidyr::gather(cases, "year", "n", 2:4)
Gather columns into rows.
tidyr::spread(pollution, size, amount)
Spread rows into columns.
tidyr::separate(storms, date, c("y", "m", "d"))
Separate one column into several.
tidyr::unite(data, col, ..., sep)
Unite several columns into one.
dplyr::data_frame(a = 1:3, b = 4:6)
Combine vectors into data frame (optimized).
dplyr::arrange(mtcars, mpg)
Order rows by values of a column (low to high).
dplyr::arrange(mtcars, desc(mpg))
Order rows by values of a column (high to low).
dplyr::rename(tb, y = year)
Rename the columns of a data frame.
Subset Observations (Rows)
dplyr::filter(iris, Sepal.Length > 7)
Extract rows that meet logical criteria.
dplyr::distinct(iris)
Remove duplicate rows.
dplyr::sample_frac(iris, 0.5, replace = TRUE)
Randomly select fraction of rows.
dplyr::sample_n(iris, 10, replace = TRUE)
Randomly select n rows.
dplyr::slice(iris, 10:15)
Select rows by position.
dplyr::top_n(storms, 2, date)
Select and order top n entries (by group if grouped data).
Logic in R - ?Comparison, ?base::Logic
<     Less than                    !=                        Not equal to
>     Greater than                 %in%                      Group membership
==    Equal to                     is.na                     Is NA
<=    Less than or equal to        !is.na                    Is not NA
>=    Greater than or equal to     &, |, !, xor, any, all    Boolean operators
Subset Variables (Columns)
dplyr::select(iris, Sepal.Width, Petal.Length, Species)
Select columns by name or helper function.
Helper functions for select - ?select
select(iris, contains("."))
Select columns whose name contains a character string.
select(iris, ends_with("Length"))
Select columns whose name ends with a character string.
select(iris, everything())
Select every column.
select(iris, matches(".t."))
Select columns whose name matches a regular expression.
select(iris, num_range("x", 1:5))
Select columns named x1, x2, x3, x4, x5.
select(iris, one_of(c("Species", "Genus")))
Select columns whose names are in a group of names.
select(iris, starts_with("Sepal"))
Select columns whose name starts with a character string.
select(iris, Sepal.Length:Petal.Width)
Select all columns between Sepal.Length and Petal.Width (inclusive).
select(iris, -Species)
Select all columns except Species.
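A few of these helpers applied to the built-in iris data set, as a minimal sketch (the variable names on the left are chosen here for illustration):

```r
library(dplyr)

sepal_cols  <- names(select(iris, starts_with("Sepal")))
length_cols <- names(select(iris, ends_with("Length")))
range_cols  <- names(select(iris, Sepal.Length:Petal.Width))  # inclusive range
no_species  <- names(select(iris, -Species))                  # drop one column

# The range from Sepal.Length to Petal.Width covers every column except Species.
stopifnot(identical(range_cols, no_species))
```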
Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0 • tidyr 0.2.0 • Updated: 1/15
devtools::install_github("rstudio/EDAWR") for data sets
rstudio.com/resources/cheatsheets/