Manipulating string data with a pattern in R
Speaker: CHANG, Lun-Hsien
Affiliation: Genetic Epidemiology, QIMR Berghofer Medical Research Institute
Meeting: R user group meeting #9
Time: 1:10-2:30 PM, 20190828
Place: Level 7, Bancroft building, QIMR, Brisbane, Australia
Outline
Download the R script from my Google Drive:
20190828_R-user-group_string-manipulation.R
What is it like to manipulate strings?
What are special characters?
How to specify a pattern?
Scenarios in which you will handle strings
● Manipulating output from an R object
● Subsetting files through their names or paths
● Subsetting groups
Summary
Manipulating string data is like hand sewing
My string data
R functions
Patterns
What are special characters?
Special characters are characters that carry a special meaning in a pattern. They are
interpreted unless they are escaped.
\ ^ $ . | ? * + ( ) [ ] { }
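To see what "interpreted" means in practice, here is a minimal sketch in base R. The vector x is made up for illustration and is not from the talk. An unescaped dot stands for any single character, while an escaped dot matches only a literal dot.

```r
# Hypothetical example strings (not from the talk)
x <- c("a.b", "axb", "ab")

# Unescaped: "." is interpreted as "any single character"
unescaped <- grepl(pattern = "a.b", x = x)

# Escaped: "\\." is read as a literal dot
escaped <- grepl(pattern = "a\\.b", x = x)

print(unescaped)
print(escaped)
```

The unescaped pattern also accepts "axb", which is usually not what you want when you are looking for a file extension or a decimal point.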
When specifying a pattern in R:
(1) Escape special characters with double backslashes (\\)
(2) Use the OR operator (pipe, |) to chain multiple patterns
patterns <- "\\(|factor\\(|\\)"
If you want to match the string 1+1=2, the correct pattern is "1\\+1=2"
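The 1+1=2 case can be checked with a short sketch in base R. The test strings are made up for illustration. It compares the unescaped pattern, the escaped pattern, and grepl(fixed = TRUE), which switches off pattern interpretation altogether.

```r
# Hypothetical example strings (not from the talk)
x <- c("1+1=2", "11=2", "111=2")

# Unescaped: "+" means "one or more of the preceding character",
# so the pattern describes "11=2", "111=2", ... but not the literal "1+1=2"
unescaped <- grepl(pattern = "1+1=2", x = x)

# Escaped: "\\+" is a literal plus sign
escaped <- grepl(pattern = "1\\+1=2", x = x)

# fixed = TRUE treats the whole pattern as a literal string, so no escaping is needed
literal <- grepl(pattern = "1+1=2", x = x, fixed = TRUE)

print(unescaped)
print(escaped)
print(literal)
```

fixed = TRUE is convenient when the whole pattern is literal text. Escaping is still needed when literal characters are mixed with operators such as | or ^.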
Specifying patterns in R
● ^prefix Looks for strings that start with this prefix
● suffix$ Looks for strings that end with this suffix
● .* Looks for any character, repeated any number of times (like * in a Linux shell)
● \\ Prevents special characters from being interpreted
● | Matches multiple patterns (e.g. pattern 1 or pattern 2 or ...)
Position in the string: begin (^prefix), between (.*), end (suffix$)
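The operators listed above can be tried on a small vector. This is a minimal sketch in base R with made-up strings, not code from the talk.

```r
# Hypothetical example strings (not from the talk)
x <- c("prefix_middle_suffix", "prefix_only", "only_suffix", "neither")

# ^ anchors the pattern at the beginning of the string
starts <- grep(pattern = "^prefix", x = x, value = TRUE)

# $ anchors the pattern at the end of the string
ends <- grep(pattern = "suffix$", x = x, value = TRUE)

# .* fills the gap between the beginning and the end
begin.to.end <- grep(pattern = "^prefix.*suffix$", x = x, value = TRUE)

# | keeps strings that match either alternative
either <- grep(pattern = "^prefix|suffix$", x = x, value = TRUE)

print(starts)
print(ends)
print(begin.to.end)
print(either)
```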
Specifying patterns in R
● ^prefix My target string begins with prefix
● suffix$ My target string ends with suffix
● .* Means any character, repeated any number of times (like * in a Linux shell)
● \\ Prevents special characters from being interpreted
● | Matches pattern 1 or pattern 2 or ...
Is there an AND operator? Not as such: neither & nor && works inside a pattern.
https://stackoverflow.com/questions/13187414/r-grep-is-there-an-and-operator
Position in the string: begin (^prefix), between (.*), end (suffix$)
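Although a pattern has no AND operator, an AND condition can still be expressed around it. The sketch below shows three common workarounds in base R. The strings are made up for illustration, and which option reads best is a matter of taste.

```r
# Hypothetical example strings (not from the talk)
x <- c("QLD_age61+_females", "QLD_age61+_males", "WA_age61+_females")

# Option 1: run grepl() once per condition and combine the logical vectors with &
# (here & works on R logical vectors, not inside the pattern)
and.1 <- x[grepl("^QLD", x) & grepl("_females$", x)]

# Option 2: when the order of the parts is known, join them with .*
and.2 <- grep(pattern = "^QLD.*_females$", x = x, value = TRUE)

# Option 3: order-independent lookaheads, which need perl = TRUE
and.3 <- grep(pattern = "^(?=.*QLD)(?=.*_females)", x = x, value = TRUE, perl = TRUE)

print(and.1)
print(and.2)
print(and.3)
```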
What my coefficients look like
linear.model.summary[["coefficients"]]
               Estimate Std. Error    t value     Pr(>|t|)
(Intercept)   46.458333   1.842243 25.2183502 1.921811e-63
factor(race)2 11.541667   3.286129  3.5122376 5.515272e-04
factor(race)3  1.741667   2.732488  0.6373922 5.246133e-01
factor(race)4  7.596839   1.988870  3.8196768 1.792682e-04
What I would like my output to look like
coefficients.dataFrame
  Predictor  Estimate       SE    t.value      p.value
1 Intercept 46.458333 1.842243 25.2183502 1.921811e-63
2     race2 11.541667 3.286129  3.5122376 5.515272e-04
3     race3  1.741667 2.732488  0.6373922 5.246133e-01
4     race4  7.596839 1.988870  3.8196768 1.792682e-04
Replace patterns in the Predictor column with nothing using `gsub()`
# Remove the unwanted strings "(", "factor(" and ")" from a column with gsub()
patterns <- "\\(|factor\\(|\\)"
temp1 <- coefficients.dataFrame
temp1$Predictor <- gsub(x = temp1$Predictor
                        ,pattern = patterns
                        ,replacement = "")
Find the full code under the heading Scenario 1 in the R script
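For readers without the original data, here is a self-contained sketch of the same clean-up. It assumes the built-in mtcars data and a model with factor(cyl) as a stand-in for the talk's factor(race) model. The object names fit, coef.matrix and coef.df are illustrative only.

```r
# Stand-in model on built-in data (an assumption; the talk used factor(race))
fit <- lm(mpg ~ factor(cyl), data = mtcars)
coef.matrix <- summary(fit)[["coefficients"]]

# Turn the coefficient matrix into a data frame with a Predictor column
coef.df <- as.data.frame(coef.matrix)
colnames(coef.df) <- c("Estimate", "SE", "t.value", "p.value")
coef.df <- cbind(Predictor = rownames(coef.matrix), coef.df, stringsAsFactors = FALSE)
rownames(coef.df) <- NULL

# Before cleaning: "(Intercept)", "factor(cyl)6", "factor(cyl)8"
print(coef.df$Predictor)

# Remove "(", "factor(" and ")" with escaped special characters chained by |
patterns <- "\\(|factor\\(|\\)"
coef.df$Predictor <- gsub(pattern = patterns, replacement = "", x = coef.df$Predictor)

# After cleaning: "Intercept", "cyl6", "cyl8"
print(coef.df$Predictor)
```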
Replace patterns in the Predictor column with nothing using `str_replace_all()`
# Remove the unwanted strings "(", "factor(" and ")" from a column with stringr::str_replace_all()
patterns <- "\\(|factor\\(|\\)"
temp2 <- coefficients.dataFrame
temp2$Predictor <- stringr::str_replace_all(string = temp2$Predictor
                                            ,pattern = patterns
                                            ,replacement = "")
What the files in my folder look like
TSV files that I am interested in importing (.tsv: tab-separated values)
Getting full paths of TSV files with list.files() or Sys.glob()
# Subset TSV files (positive filtering) with list.files()
patterns <- "harmonised-data.*\\.tsv$"
tsv.files <- list.files(path = source.files.path
                        ,pattern = patterns
                        ,full.names = TRUE) # length(tsv.files) 220
# Subset TSV files with Sys.glob()
patterns <- "harmonised-data*.tsv"
tsv.files <- Sys.glob(file.path(source.files.path, patterns)) # length(tsv.files) 220
Find the full code under the heading Scenario 2 in the R script
Patterns in list.files() versus Sys.glob()
patterns <- "harmonised-data.*\\.tsv$"
list.files(pattern = ) takes an optional regular expression, in which the literal dot
must be escaped
patterns <- "harmonised-data*.tsv"
Sys.glob(patterns) expands wildcards (*) in file paths, as a Unix shell does
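The difference between the two pattern styles can be tried without the original folder. The sketch below creates a few empty files in a temporary directory. The directory and file names are made up for illustration. It also uses utils::glob2rx(), which translates a wildcard into a regular expression.

```r
# Throwaway directory with made-up file names (assumptions for illustration)
demo.dir <- file.path(tempdir(), "string-demo")
dir.create(demo.dir, showWarnings = FALSE)
invisible(file.create(file.path(demo.dir, c("harmonised-data_A.tsv",
                                            "harmonised-data_B.tsv",
                                            "harmonised-data_A.tsv.log",
                                            "notes.txt"))))

# Regular expression: ".*" is "anything", "\\." is a literal dot, "$" is the end
by.regex <- list.files(path = demo.dir
                       ,pattern = "harmonised-data.*\\.tsv$"
                       ,full.names = TRUE)

# Wildcard: "*" alone is "anything"; the dot is literal and no anchor is needed
by.glob <- Sys.glob(file.path(demo.dir, "harmonised-data*.tsv"))

# glob2rx() converts the wildcard into an equivalent regular expression
by.glob2rx <- list.files(path = demo.dir
                         ,pattern = utils::glob2rx("harmonised-data*.tsv")
                         ,full.names = TRUE)

print(basename(by.regex))
print(basename(by.glob))
print(basename(by.glob2rx))
```

All three calls should return the two .tsv files and skip the .tsv.log file, because both the $ anchor and the wildcard require the name to end in .tsv.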
Getting full paths of non-TSV files with grep()
# Subset non-TSV files (negative filtering)
patterns <- "harmonised-data.*\\.tsv$"
non.tsv.files <- grep(x = all.files
                      ,pattern = patterns
                      ,value = TRUE
                      ,invert = TRUE) # length(non.tsv.files) 163
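A minimal sketch of positive versus negative filtering with the same pattern follows. The file names are invented for illustration; in the talk, all.files comes from the R script.

```r
# Hypothetical file names (not from the talk)
all.files <- c("harmonised-data_A.tsv", "harmonised-data_B.tsv",
               "harmonised-data_A.log", "README.md")
patterns <- "harmonised-data.*\\.tsv$"

# Positive filtering: keep the elements that match
tsv.files <- grep(x = all.files, pattern = patterns, value = TRUE)

# Negative filtering: invert = TRUE keeps the elements that do NOT match
non.tsv.files <- grep(x = all.files, pattern = patterns, value = TRUE, invert = TRUE)

print(tsv.files)
print(non.tsv.files)
```

Because invert = TRUE returns exactly the complement, the two results always add up to the full input vector.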
Suppose your data are stratified by state, age group and sex. How do you subset groups?
States: NSW, ACT, VIC, QLD, SA, WA, TAS, NT
Age groups: 4-20, 21-40, 41-60, 61+
Sex: males, females, both sexes together
Total number of groups: 96 (8*4*3)
Creating all groups with data.table::CJ()
# Create subgroups
group.1 <- c("NSW","ACT","VIC","QLD","SA","WA","TAS","NT") # length(group.1) 8
group.2 <- paste0("age",c("4-20","21-40","41-60","61+"))
group.3 <- c("males","females","bothSexes")
# Create all combinations from the 3 vectors
## data.table::CJ() creates a cross join data table
all.groups.subgroups <- data.table::CJ(group.1, group.2, group.3, sorted = FALSE)[
  , paste(group.1, group.2, group.3, sep = "_")] # length(all.groups.subgroups) 96
Find the full code under the heading Scenario 3 in the R script
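If data.table is not installed, the same 96 labels can be built with base R. This sketch swaps in expand.grid() for CJ(); it is an alternative technique, not the talk's method, and the order of the resulting labels differs from the CJ() version. The column names state, age and sex are illustrative.

```r
# Same three vectors as in the talk
group.1 <- c("NSW","ACT","VIC","QLD","SA","WA","TAS","NT")
group.2 <- paste0("age", c("4-20","21-40","41-60","61+"))
group.3 <- c("males","females","bothSexes")

# expand.grid() returns one row per combination of its inputs
combos <- expand.grid(state = group.1, age = group.2, sex = group.3,
                      stringsAsFactors = FALSE)

# Paste the three columns into one label per group
all.groups.base <- paste(combos$state, combos$age, combos$sex, sep = "_")

print(length(all.groups.base))
print(head(all.groups.base, 3))
```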
Subsetting males with grep()
# Subset males
males <- grep(x = all.groups.subgroups, pattern = "_males$", value = TRUE) # length(males) 32
Subsetting females aged 61 and over from eastern states
library(magrittr) # provides the %>% pipe
# Specify patterns
pattern.1 <- "^NSW|^QLD|^VIC|^ACT|^TAS"
pattern.2 <- "_females$"
pattern.3 <- "61\\+"
# Subset data from females 61+ in eastern states
eastern.states.females.61plus <- grep(x = all.groups.subgroups, pattern = pattern.1, value = TRUE) %>%
  grep(., pattern = pattern.2, value = TRUE) %>%
  grep(., pattern = pattern.3, value = TRUE) # length(eastern.states.females.61plus) 5
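The plus sign in 61+ is a special character, so it is worth checking what the pattern does with and without escaping. In the sketch below, the labels age61-80 and age611 are hypothetical and do not exist in the talk's groups; they are there only to show where an unescaped pattern would match too much.

```r
# Hypothetical labels; only the first one follows the talk's naming scheme
x <- c("QLD_age61+_females", "QLD_age61-80_females", "QLD_age611_females")

# Unescaped: "61+" means "6 followed by one or more 1", so every label matches
loose <- grep(pattern = "61+", x = x, value = TRUE)

# Escaped: "61\\+" requires a literal plus sign after 61
strict <- grep(pattern = "61\\+", x = x, value = TRUE)

print(loose)
print(strict)
```

With the talk's four age groups, both patterns select the same labels, because 61+ is the only group containing 61. The escaped form simply stays correct if other age groups are added later.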
My string data: R objects, file paths, vectors
R functions: gsub(pattern = ), str_replace_all(pattern = ), list.files(pattern = ), grep(pattern = ), Sys.glob()
Patterns: ^ $ .* \\ |
Summary
Removing unwanted strings with gsub() and stringr::str_replace_all()
Selecting files with list.files(), Sys.glob() and grep(invert = TRUE)
Subsetting groups with grep()
