Merge Multiple Files into a Single Data Frame Using R
Yogesh Khandelwal
Problem Description
• The zip file contains 332 comma-separated-value (CSV) files
containing pollution monitoring data for fine particulate
matter (PM) air pollution at 332 locations in the United States.
Each file contains data from a single monitor and the ID
number for each monitor is contained in the file name. For
example, data for monitor 200 is contained in the file
"200.csv".
• Data Source: http://spark-public.s3.amazonaws.com/compdata/data/specdata.zip
Variable Name
Variables in file
• Date: the date of observation in YYYY-MM-DD (year-month-day) format.
Data type: factor (chr in R 4.0 and later, where read.csv() no longer
converts strings to factors by default)
• sulfate: the level of sulfate PM in the air on that date, measured in
micrograms per cubic meter. Data type: num
• nitrate: the level of nitrate PM in the air on that date, measured in
micrograms per cubic meter. Data type: num
• ID: the monitor (location) ID. Data type: int
Before we start we should know
• Functions in R
• How to merge data files
Functions in R
Functions are created using the function() directive and are
stored as R objects just like anything else. In particular, they are R
objects of class “function”.
f <- function(<arguments>) {
## Do something interesting
}
• Functions in R are “first class objects”, which means that they can
be treated much like any other R object. Importantly:
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function
inside of another function.
• The return value of a function is the last expression in the function
body to be evaluated.
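The properties listed above can be illustrated with a small sketch. The names apply_twice, make_adder, and add_one are hypothetical and chosen only for this example.

```r
# A function passed as an argument to another function.
apply_twice <- function(f, x) {
  f(f(x))
}

# A function defined inside another function (nested).
make_adder <- function(n) {
  add_n <- function(x) {
    x + n  # last expression evaluated, so this is the return value
  }
  add_n    # the outer function returns the inner function itself
}

add_one <- make_adder(1)
class(add_one)           # "function"
apply_twice(add_one, 5)  # 7
```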
Functions (contd.)
• For example:
Function name
Function definition
Function call
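The example from the original slide is not preserved in this text. As a hypothetical stand-in, the sketch below labels the three parts; column_mean and readings are made-up names.

```r
# Function name: column_mean (hypothetical, for illustration only)
# Function definition: everything from function(...) to the closing brace
column_mean <- function(df, column) {
  mean(df[[column]])
}

# Function call: supply a data frame and a column name
readings <- data.frame(sulfate = c(2, 4, 6))
column_mean(readings, "sulfate")  # 4
```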
Our objective
• How can we merge a number of files into a single data
frame?
• How can we apply the same function to different files
in an efficient way?
How to merge two different files?
• A number of options are available, such as:
1. Use the merge() function
2. Use rbind(), cbind(), etc.
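A minimal sketch of how the three functions differ, using small made-up data frames (sites, readings, and more_readings are illustrative names, not part of the data set):

```r
# Two small tables sharing a key column "ID" (made-up values).
sites    <- data.frame(ID = c(1, 2, 3), state = c("NY", "CA", "TX"))
readings <- data.frame(ID = c(1, 2, 4), sulfate = c(3.1, 2.4, 5.0))

# merge(): join on the common column; by default only matching IDs are kept.
joined <- merge(sites, readings, by = "ID")
joined$ID          # 1 2

# rbind(): stack rows; both data frames need the same column names.
more_readings <- data.frame(ID = 5, sulfate = 1.7)
stacked <- rbind(readings, more_readings)
nrow(stacked)      # 4

# cbind(): place columns side by side; both need the same number of rows.
wide <- cbind(readings, nitrate = c(0.5, 0.9, 1.2))
names(wide)        # "ID" "sulfate" "nitrate"
```

Since every monitor file has the same columns, stacking rows with rbind() is the natural fit for this data set; merge() is for joining tables on a shared key.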
How to merge a number of files into a single
data frame
• Approach 1
# List every file in the directory, with the directory path included
files <- list.files("specdata", full.names = TRUE)
dat <- NULL
# Read each file and append its rows to dat
for (i in seq_along(files))
{
  dat <- rbind(dat, read.csv(files[i]))
}
• We can then run various commands on the merged data frame as needed, for example:
1. str(dat)
2. head(dat)
3. tail(dat), etc.
Note: full.names is a logical value. If TRUE, the directory path is prepended to the file names to give a relative file path. If FALSE,
the file names (rather than paths) are returned.
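The effect of full.names can be seen on a throwaway directory. This is a sketch with made-up files; demo_dir and the pattern argument are additions for this example, not part of the original slides.

```r
# Build a temporary directory with two tiny CSV files (made-up data).
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1, nitrate = 0.5),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, nitrate = 0.9),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

# full.names = FALSE (the default): bare file names only. read.csv() would
# look for these in the working directory, not in demo_dir.
bare <- list.files(demo_dir, pattern = "\\.csv$")

# full.names = TRUE: directory path prepended, safe to pass to read.csv().
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
first <- read.csv(paths[1])
```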
How to handle missing values in R?
• In R, NA is used to represent any value that is 'not available' or 'missing' (in
the statistical sense).
• Missing values play an important role in statistics and data analysis. Often,
missing values must not be ignored, but rather they should be carefully
studied to see if there's an underlying pattern or cause for their
missingness.
• For example:
> x <- c(1, 2, NA, 4)
> y <- c(NA, 2, 3, 1)
> x + y
[1] NA  4 NA  5
• Multiple options are available in R to handle NA values, such as:
• is.na()
• Setting na.rm = TRUE as a function argument
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 2.333333
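A short sketch of the two options in use, plus complete.cases() for data frames. complete.cases() is not mentioned on the slides and is included here as a related base R helper; df holds made-up values.

```r
x <- c(1, 2, NA, 4)

is.na(x)                  # FALSE FALSE  TRUE FALSE
sum(is.na(x))             # number of missing values: 1
x[!is.na(x)]              # keep only the observed values: 1 2 4
mean(x[!is.na(x)])        # same result as mean(x, na.rm = TRUE)

# For a data frame, complete.cases() flags rows with no NA in any column.
df <- data.frame(sulfate = c(3.1, NA, 2.4), nitrate = c(0.5, 0.9, NA))
complete_rows <- df[complete.cases(df), ]  # only the first row remains
```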
Apply what we learned to our dataset
Function definition
Function call
pollutantmean('specdata','nitrate',1:10)
[1] 0.7976266
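The function definition shown on the slide is not preserved in this text. The sketch below is one possible way to write a function with the same call signature, not the author's code. It assumes the files are named with zero-padded monitor IDs (such as "001.csv") and that each file has a column named after the pollutant. It is exercised on made-up files, so it does not reproduce the value above.

```r
# Sketch of a pollutantmean()-style function. Assumptions (not from the
# slides): files are named with zero-padded IDs such as "001.csv", and each
# file has a column named after the pollutant.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  dat <- do.call(rbind, lapply(paths, read.csv))  # merge the selected files
  mean(dat[[pollutant]], na.rm = TRUE)            # ignore missing readings
}

# Try it on two small made-up files rather than the real data set.
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(nitrate = c(1, NA), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(nitrate = c(2, 3), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

demo_mean <- pollutantmean(demo_dir, "nitrate", 1:2)  # mean of 1, 2, 3
```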
Thank You!!

Editor's Notes

  • #17: lapply() applies a given function to each element of a list, so there are several function calls. do.call() applies a given function to the list as a whole, so there is only one function call.
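The difference described in this note can be sketched on a few made-up CSV files. demo_dir and the file contents are assumptions for illustration only.

```r
# Made-up CSV files standing in for the monitor data.
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(ID = i, sulfate = c(i, i + 0.5)),
            file.path(demo_dir, sprintf("%03d.csv", i)), row.names = FALSE)
}
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply(): read.csv() is called once per file; the result is a list of
# data frames.
tables <- lapply(files, read.csv)
length(tables)   # 3

# do.call(): rbind() is called once, with all three data frames as its
# arguments.
dat <- do.call(rbind, tables)
dim(dat)         # 6 rows, 2 columns
```

Compared with growing a data frame inside a for loop, this form binds all the pieces in a single rbind() call instead of re-copying the accumulated rows on every iteration.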