Data Wrangling
@JennyBryan
@jennybc


Big Data Borat:
80% time spent prepare data
20% time spent complain
about need for prepare data.
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling
atomic
vector
list
data cleaning
data wrangling
descriptive stats
inferential stats
reporting
programming
difficulty
better exp. design simpler stats
better data model simpler analysis
https://cran.r-project.org/package=purrr
https://github.com/hadley/purrr
+ dplyr
+ tidyr
+ tibble
+ broom
Hadley Wickham
Lionel Henry
Lessons from my fall 2016 teaching:
https://jennybc.github.io/purrr-tutorial/
repurrrsive package (non-boring examples):
https://github.com/jennybc/repurrrsive
I am the Annie Leibovitz of lego mini-figures:
https://github.com/jennybc/lego-rstats
x[[i]]
x[i]
x
from
http://r4ds.had.co.nz/vectors.html#lists-of-condiments
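A minimal sketch of the difference between the two subsetting operators. The list `x` and its contents are hypothetical and not from the talk; only base R is used.

```r
# A small list with mixed element types (hypothetical example data).
x <- list(a = 1:3, b = "salt", c = list(TRUE, 2.5))

# Single brackets keep the container: the result is still a list,
# here a list of length 1 that holds the element named "a".
x_sub <- x["a"]

# Double brackets pull out the element itself: here an integer vector.
x_elem <- x[["a"]]

# Double brackets can be chained to drill into nested lists.
x_deep <- x[["c"]][[2]]
```

A rule of thumb that follows from this: `x[i]` is for making a smaller list, `x[[i]]` is for getting at what is inside.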
http://legogradstudent.tumblr.com
#rstats lists via lego
atomic vectors
logical factor
integer, double
vectors of same length? DATA FRAME!
vectors don’t have to be atomic
works for lists too! LOVE THE LIST COLUMN!
this is a data frame!
atomic
vector
list
column
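A small sketch of a data frame that holds both atomic columns and a list-column, assuming the tibble package is installed. The names and titles are made up for illustration.

```r
library(tibble)

# Hypothetical toy data: each person has zero or more titles.
chars <- tibble(
  name   = c("Ada", "Bo", "Cy"),                            # atomic vector column
  titles = list(c("Captain", "Dr."), character(0), "Sir")   # list-column
)

# Still one row per person; the list-column keeps each person's
# titles together with the rest of that person's data.
n_titles <- lengths(chars$titles)
```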
An API Of Ice And Fire | https://anapioficeandfire.com
{
"url": "http://www.anapioficeandfire.com/api/characters/1303",
"id": 1303,
"name": "Daenerys Targaryen",
"gender": "Female",
"culture": "Valyrian",
"born": "In 284 AC, at Dragonstone",
"died": "",
"alive": true,
"titles": [
"Queen of the Andals and the Rhoynar and the First Men, Lord of the Seven Kingdoms",
"Khaleesi of the Great Grass Sea",
"Breaker of Shackles/Chains",
"Queen of Meereen",
"Princess of Dragonstone"
],
"aliases": [
"Dany",
"Daenerys Stormborn",
titles
#> # A tibble: 29 × 2
#>    name               titles
#>    <chr>              <list>
#>  1 Theon Greyjoy      <chr [3]>
#>  2 Tyrion Lannister   <chr [2]>
#>  3 Victarion Greyjoy  <chr [2]>
#>  4 Will               <list [0]>
#>  5 Areo Hotah         <chr [1]>
#>  6 Chett              <list [0]>
#>  7 Cressen            <chr [1]>
#>  8 Arianne Martell    <chr [1]>
#>  9 Daenerys Targaryen <chr [5]>
#> 10 Davos Seaworth     <chr [4]>
#> # ... with 19 more rows
Why would you do this to yourself?
The list is forced on you by the problem.
•String processing, e.g., regex
•JSON or XML
•Split-Apply-Combine
But why lists in a data frame?
All the usual reasons!
• Keep multiple vectors intact and “in sync”
• Use existing toolkit for filter, select, ….
What happens in the
data frame
Stays in the data frame
you have a list-column
congratulations!
🎉
1 inspect
2 query
3 modify
4 simplify
inspect
my_list[1:3]
my_list[[2]]
View()
str(my_list, max.level = 1)
str(my_list[[i]], list.len = 10)
listviewer::jsonedit()
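The inspection calls above can be tried on a small stand-in. This is a sketch only: `my_list` here is a hypothetical nested list playing the role of parsed JSON, and only base R is used.

```r
# Hypothetical nested list standing in for parsed JSON records.
my_list <- list(
  list(name = "Ada", titles = c("Captain", "Dr."), alive = TRUE),
  list(name = "Bo",  titles = character(0),        alive = FALSE)
)

# Top-level overview only: one line per element, no deeper nesting shown.
str(my_list, max.level = 1)

# Look inside a single element, capping how many components are printed.
str(my_list[[1]], list.len = 10)

# Quick structural facts that are useful to know before iterating.
n_elements      <- length(my_list)
component_names <- names(my_list[[1]])
```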
1 inspect
2 query
3 modify
4 simplify
map(.x, .f, ...)
purrr::
map(.x, .f, ...)
for every element of .x
apply .f
return results like so
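To make "for every element of .x, apply .f" concrete, here is a sketch that writes the same computation twice, once as an explicit loop and once with `map()`. It assumes purrr is installed; the input `x` is hypothetical.

```r
library(purrr)

x <- list(1:3, 4:6, 7:9)

# The loop that map() stands in for: allocate, iterate, store.
out_loop <- vector("list", length(x))
for (i in seq_along(x)) {
  out_loop[[i]] <- sum(x[[i]])
}

# The same computation with map(): the result is always a list
# with one element per element of the input.
out_map <- map(x, sum)
```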
.x = minis
map(minis, antennate)
.x = minis
map(minis, "pants")
.y = hair
.x = minis
map2(minis, hair, enhair)
.y = weapons
.x = minis
map2(minis, weapons, arm)
minis %>%
map2(hair, enhair) %>%
map2(weapons, arm)
df <- tibble(pants, torso, head)
embody <- function(pants, torso, head)
insert(insert(pants, torso), head)
pmap(df, embody)
map_df(minis, `[`,
       c("pants", "torso", "head"))
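The lego pictures translate to ordinary data like this. A sketch assuming purrr is installed; the vectors and the helper `add3()` are hypothetical, chosen only to show how inputs are paired up.

```r
library(purrr)

# Hypothetical parallel inputs of equal length.
base     <- c(2, 3, 4)
exponent <- c(2, 2, 3)

# map2() walks two inputs in step and calls .f on each pair.
powers <- map2_dbl(base, exponent, ~ .x ^ .y)

# pmap() generalises to any number of inputs held in one list
# (a data frame works too). List names are matched to argument names.
args <- list(x = c(1, 2, 3), y = c(10, 20, 30), z = c(100, 200, 300))
add3 <- function(x, y, z) x + y + z
totals <- pmap_dbl(args, add3)
```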
map(got_chars, "name")
#> [[1]]
#> [1] "Theon Greyjoy"
#>
#> [[2]]
#> [1] "Tyrion Lannister"
#>
#> [[3]]
#> [1] "Victarion Greyjoy"
query
map_chr(got_chars, "name")
#>  [1] "Theon Greyjoy"      "Tyrion Lannister"   "Victarion Greyjoy"
#>  [4] "Will"               "Areo Hotah"         "Chett"
#>  [7] "Cressen"            "Arianne Martell"    "Daenerys Targaryen"
#> [10] "Davos Seaworth"     "Arya Stark"         "Arys Oakheart"
#> [13] "Asha Greyjoy"       "Barristan Selmy"    "Varamyr"
#> [16] "Brandon Stark"      "Brienne of Tarth"   "Catelyn Stark"
#> [19] "Cersei Lannister"   "Eddard Stark"       "Jaime Lannister"
#> [22] "Jon Connington"     "Jon Snow"           "Aeron Greyjoy"
#> [25] "Kevan Lannister"    "Melisandre"         "Merrett Frey"
#> [28] "Quentyn Martell"    "Sansa Stark"
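The same pattern on a list small enough to read in full. This is a sketch that assumes purrr is installed; `chars` is a hypothetical stand-in for `got_chars`, not the real data.

```r
library(purrr)

# Hypothetical stand-in for got_chars: a list of records.
chars <- list(
  list(name = "Ada", id = 1L),
  list(name = "Bo",  id = 2L)
)

# A string as .f extracts the element with that name from each record.
as_list <- map(chars, "name")       # a list of length-1 character vectors
as_chr  <- map_chr(chars, "name")   # simplified to a character vector
as_int  <- map_int(chars, "id")     # simplified to an integer vector
```

Choosing the suffix is how the caller states which type the result is expected to have.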
simplify
map_df(got_chars, `[`,
       c("name", "culture", "gender", "born"))
#> # A tibble: 29 × 4
#>    name               culture  gender born
#>    <chr>              <chr>    <chr>  <chr>
#>  1 Theon Greyjoy      Ironborn Male   In 278 AC or 279 AC, at Pyke
#>  2 Tyrion Lannister            Male   In 273 AC, at Casterly Rock
#>  3 Victarion Greyjoy  Ironborn Male   In 268 AC or before, at Pyke
#>  4 Will                        Male
#>  5 Areo Hotah         Norvoshi Male   In 257 AC or before, at Norvos
#>  6 Chett                       Male   At Hag's Mire
#>  7 Cressen                     Male   In 219 AC or 220 AC
#>  8 Arianne Martell    Dornish  Female In 276 AC, at Sunspear
#>  9 Daenerys Targaryen Valyrian Female In 284 AC, at Dragonstone
#> 10 Davos Seaworth     Westeros Male   In 260 AC or before, at King's Landing
#> # ... with 19 more rows
simplify
got_chars %>% {
  tibble(name = map_chr(., "name"),
         houses = map(., "allegiances"))
} %>%
  filter(lengths(houses) > 1) %>%
  unnest(houses)
#> # A tibble: 15 × 2
#> name houses
#> <chr> <chr>
#> 1 Davos Seaworth House Baratheon of Dragonstone
#> 2 Davos Seaworth House Seaworth of Cape Wrath
#> 3 Asha Greyjoy House Greyjoy of Pyke
#> 4 Asha Greyjoy House Ironmaker
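A self-contained sketch of the same simplification on made-up data, assuming tibble and tidyr (version 1.0 or later, for the `cols` argument) are installed. The data frame `allegiances` and its values are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical data: each person belongs to one or more houses.
allegiances <- tibble(
  name   = c("Ada", "Bo", "Cy"),
  houses = list(c("North", "East"), "South", c("West", "North", "East"))
)

# Keep only people with more than one house ...
multi <- allegiances[lengths(allegiances$houses) > 1, ]

# ... then give every house its own row; name is repeated as needed.
long <- unnest(multi, cols = houses)
```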
simplify
@JennyBryan
@jennybc

http://stat545.com
@STAT545

data frame nested data frame
gap_nested <- gapminder %>%
group_by(country, continent) %>%
nest()
gap_nested
#> # A tibble: 142 × 3
#> country continent data
#> <fctr> <fctr> <list>
#> 1 Afghanistan Asia <tibble [12 × 4]>
#> 2 Albania Europe <tibble [12 × 4]>
#> 3 Algeria Africa <tibble [12 × 4]>
#> 4 Angola Africa <tibble [12 × 4]>
#> 5 Argentina Americas <tibble [12 × 4]>
#> 6 Australia Oceania <tibble [12 × 4]>
#> 7 Austria Europe <tibble [12 × 4]>
#> 8 Bahrain Asia <tibble [12 × 4]>
#> 9 Bangladesh Asia <tibble [12 × 4]>
#> 10 Belgium Europe <tibble [12 × 4]>
#> # ... with 132 more rows
modify
gap_nested %>%
mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x))) %>%
filter(continent == "Oceania") %>%
mutate(coefs = map(fit, coef))
#> # A tibble: 2 × 5
#> country continent data fit coefs
#> <fctr> <fctr> <list> <list> <list>
#> 1 Australia Oceania <tibble [12 × 4]> <S3: lm> <dbl [2]>
#> 2 New Zealand Oceania <tibble [12 × 4]> <S3: lm> <dbl [2]>
simplify
gap_nested %>%
…
mutate(intercept = map_dbl(coefs, 1),
slope = map_dbl(coefs, 2)) %>%
select(country, continent,
intercept, slope)
#> # A tibble: 2 × 4
#> country continent intercept slope
#> <fctr> <fctr> <dbl> <dbl>
#> 1 Australia Oceania -376.1163 0.2277238
#> 2 New Zealand Oceania -307.6996 0.1928210
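The nest, fit, simplify pattern can be tried without the gapminder package. The sketch below swaps in `mtcars`, which ships with R, and fits one model per cylinder count; it assumes dplyr, tidyr and purrr are installed. The model formula is chosen only for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                      # one row per group, data in a list-column
  ungroup() %>%
  mutate(
    # modify: add a list-column of fitted models, one per row
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),
    # simplify: pull one number out of each model into an atomic column
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )
```

The data, the models and the extracted slopes stay in the same row, so filtering or sorting by `cyl` keeps them in sync.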
How are you doing such things today?
for loops
apply(), [slvmt]apply(), split(), by()
with plyr: [adl][adl_]ply()
with dplyr: df %>% group_by() %>% do()
maybe you don’t, because you don’t know how 😔
map(.x, .f, ...)
.x is a vector
lists are vectors
data frames are lists
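Because a data frame is a list of columns, `map()` over a data frame iterates over its columns. A short sketch, assuming purrr is installed; the data frame `df` is hypothetical.

```r
library(purrr)

df <- data.frame(
  a = 1:3,
  b = c(2.5, 3.5, 4.5),
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# .x is the data frame, so .f is applied once per column.
col_types <- map_chr(df, class)

# Subsetting with single brackets keeps a data frame, so this maps
# over just the two numeric columns.
col_means <- map_dbl(df[c("a", "b")], mean)
```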
map(.x, .f, ...)
.f is function to apply
name & position shortcuts
concise ~ formula syntax
“return results like so”
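The different ways of writing `.f` side by side, as a sketch that assumes purrr is installed. The list `recs` is hypothetical and deliberately unnamed so that the position shortcut has something to do.

```r
library(purrr)

# Hypothetical records: element 1 is a label, element 2 is a numeric vector.
recs <- list(
  list("Ada", c(1, 2, 3)),
  list("Bo",  c(10, 20))
)

# Position shortcut: an integer as .f extracts the element at that position.
labels <- map_chr(recs, 1)

# An ordinary function ...
n_fun <- map_int(recs, function(r) length(r[[2]]))

# ... and the equivalent formula shorthand, where .x is the current element.
n_formula <- map_int(recs, ~ length(.x[[2]]))
```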
map_lgl(.x, .f, ...)
map_chr(.x, .f, ...)
map_int(.x, .f, ...)
map_dbl(.x, .f, ...)
map(.x, .f, ...)
can be thought of as
map_list(.x, .f, ...)
map_df(.x, .f, ...)
walk(.x, .f, ...)
can be thought of as
map_nothing(.x, .f, ...)
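Two behaviours worth seeing once, sketched under the assumption that purrr is installed: `walk()` is for side effects and hands its input back, and the typed variants raise an error when `.f` returns the wrong type. The list `x` is hypothetical.

```r
library(purrr)

x <- list(a = 1:3, b = 4:6)

# walk() calls .f for its side effect (here, a message) and
# returns .x invisibly, so it can sit in the middle of a pipeline.
returned <- walk(x, ~ message("length: ", length(.x)))

# A typed variant refuses a result of the wrong type instead of
# coercing it; tryCatch() turns the error into a flag for inspection.
outcome <- tryCatch(
  map_int(x, ~ "not a number"),
  error = function(e) "error"
)
```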
map2(.x, .y, .f, ...)
f(.x[[i]], .y[[i]], ...)
pmap(.l, .f, ...)
f(tuple of i-th elements of the vectors in .l, ...)
friends don’t let friends
use do.call()
workflow
1 do something easy with the iterative machine
2 do the real, hard thing with one representative unit
3 insert logic from 2 into template from 1
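The three workflow steps can be sketched as follows, assuming purrr is installed. The records and the task (pulling a year out of a free-text field) are hypothetical, picked only to have something slightly fiddly for step 2.

```r
library(purrr)

# Hypothetical records with a free-text "born" field.
recs <- list(
  list(name = "Ada", born = "In 1815, at London"),
  list(name = "Bo",  born = "In 1990, at Oslo")
)

# 1. Do something easy with the iterative machine, to get the template right.
n_fields <- map_int(recs, length)

# 2. Do the real, harder thing with one representative unit.
one      <- recs[[1]]
year_one <- as.integer(sub("^In ([0-9]{4}).*$", "\\1", one$born))

# 3. Insert the logic from step 2 into the template from step 1.
years <- map_int(recs, ~ as.integer(sub("^In ([0-9]{4}).*$", "\\1", .x$born)))
```

Working on `recs[[1]]` first keeps debugging separate from iteration: if step 3 fails, the problem is in the data, not in the logic.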
