How to merge data in R using R merge, dplyr or data.table
Last Updated: 25 Apr, 2025
Merging data is a common task in data analysis and data manipulation. It lets you combine information from different sources based on shared keys, creating richer datasets for exploration and modeling. Choosing the right merge method lets you balance speed, flexibility and ease of use.
Different Methods to Merge Data
We will explore the three most common methods used in the R programming language to merge data.
1. Using merge() function
The merge() function in base R combines two data frames based on common columns. It can perform inner, left, right and full joins. To merge more than two data frames, call merge() repeatedly on the intermediate results.
Syntax:
merged_df <- merge(x, y, by = "common_column", ...)
- 'x' and 'y' are the data frames that you want to merge.
- 'by' specifies the common columns on which the merge will be performed.
- Additional arguments like 'all.x', 'all.y' and 'all' control the type of join that is performed.
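When the key column has a different name in each data frame, 'by' alone is not enough. A minimal sketch, assuming two hypothetical data frames `employees` and `salaries` (they are not part of the example in this article), uses the 'by.x' and 'by.y' arguments of merge():

```r
# Hypothetical data frames whose key columns have different names
employees <- data.frame(emp_id = c(1, 2, 3),
                        name = c("A", "B", "C"))
salaries <- data.frame(id = c(2, 3, 4),
                       salary = c(5000, 4000, 6000))

# by.x names the key in the first data frame, by.y the key in the second
merged <- merge(employees, salaries, by.x = "emp_id", by.y = "id")
print(merged)
```

The key column in the result keeps the name it has in the first data frame (here, emp_id).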
Example:
Consider two data frames df1 and df2
R
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))
Various types of joins using the 'merge()' function
There are four types of joins that can be done using the merge() function.
1. Inner join (default behavior):
R
inner_join <- merge(df1, df2, by = "ID")
print(inner_join)
Output:
ID Name Age Occupation Salary
1 2 B 30 Engineer 5000
2 3 C 35 Teacher 4000
3 4 D 40 Doctor 6000
The resulting inner_join dataframe will only include the common rows where 'ID' is present in both df1 and df2.
2. Left join ('all.x = TRUE'):
R
left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_join)
Output:
ID Name Age Occupation Salary
1 1 A 25 <NA> NA
2 2 B 30 Engineer 5000
3 3 C 35 Teacher 4000
4 4 D 40 Doctor 6000
The resulting left_join data frame will include all rows from df1 and the matching rows from df2. Rows of df1 with no match in df2 will have NA in the columns that come from df2 (here, Occupation and Salary).
3. Right join ('all.y=TRUE'):
R
right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)
Output:
ID Name Age Occupation Salary
1 2 B 30 Engineer 5000
2 3 C 35 Teacher 4000
3 4 D 40 Doctor 6000
4 5 <NA> NA Lawyer 7000
The resulting right_join data frame will include all rows from df2 and the matching rows from df1. Rows of df2 with no match in df1 will have NA in the columns that come from df1 (here, Name and Age).
4. Full outer join ('all = TRUE'):
R
full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)
Output:
ID Name Age Occupation Salary
1 1 A 25 <NA> NA
2 2 B 30 Engineer 5000
3 3 C 35 Teacher 4000
4 4 D 40 Doctor 6000
5 5 <NA> NA Lawyer 7000
The resulting full_join data frame will include all rows from both df1 and df2. Rows with no match in the other data frame will have NA in the columns that come from it.
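The four joins differ only in which unmatched rows they keep. As a quick check, the following sketch rebuilds the example data frames and compares the row counts. The names `row_counts`, `full`, `only_in_df1` and `only_in_df2` are introduced here purely for illustration:

```r
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))

# Number of rows produced by each type of join
row_counts <- c(
  inner = nrow(merge(df1, df2, by = "ID")),
  left  = nrow(merge(df1, df2, by = "ID", all.x = TRUE)),
  right = nrow(merge(df1, df2, by = "ID", all.y = TRUE)),
  full  = nrow(merge(df1, df2, by = "ID", all = TRUE))
)
print(row_counts)

# In a full join, IDs present in only one input have NA
# in the columns contributed by the other input
full <- merge(df1, df2, by = "ID", all = TRUE)
only_in_df1 <- full$ID[is.na(full$Occupation)]
only_in_df2 <- full$ID[is.na(full$Name)]
```

Checking for NA in a column from the other data frame is a simple way to find keys that failed to match, provided that column contains no NA values of its own.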
2. Using 'dplyr' package:
The 'dplyr' package provides a set of functions for data manipulation, including merging data frames. dplyr provides a separate function for each type of join: inner_join(), left_join(), right_join() and full_join().
Syntax:
merged_df <- inner_join(x, y, by = "common_column")
- 'x' and 'y' are the data frames to be merged.
- 'by' specifies the common columns on which the merge is to be performed.
- The type of join is chosen by the function name: inner_join(), left_join(), right_join() or full_join().
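If the key columns are named differently in the two data frames, dplyr's join functions accept a named character vector for 'by'. The sketch below assumes two hypothetical data frames, `employees` and `salaries`, that are not part of this article's example:

```r
library(dplyr)

# Hypothetical data frames whose key columns have different names
employees <- data.frame(emp_id = c(1, 2, 3),
                        name = c("A", "B", "C"))
salaries <- data.frame(id = c(2, 3, 4),
                       salary = c(5000, 4000, 6000))

# The name on the left of "=" is the key in x, the value on the right is the key in y
joined <- left_join(employees, salaries, by = c("emp_id" = "id"))
print(joined)
```

Every row of `employees` is kept, and the employee without a salary record gets NA in the salary column.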
Example:
Install the dplyr package and create two data frames, df1 and df2.
R
install.packages("dplyr")
library(dplyr)
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(20, 30, 40, 50))
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(2000, 4000, 6000, 7000))
Various types of joins using the join functions from dplyr
We can perform four types of join using the join functions from dplyr.
1. Inner join:
R
inner_join <- inner_join(df1, df2, by = "ID")
print(inner_join)
Output:
ID Name Age Occupation Salary
1 2 B 30 Engineer 2000
2 3 C 40 Teacher 4000
3 4 D 50 Doctor 6000
The resulting inner_join data frame will only include the common rows where 'ID' is present in both df1 and df2.
2. Left join:
R
left_join <- left_join(df1, df2, by = "ID")
print(left_join)
Output:
ID Name Age Occupation Salary
1 1 A 20 <NA> NA
2 2 B 30 Engineer 2000
3 3 C 40 Teacher 4000
4 4 D 50 Doctor 6000
The resulting left_join data frame will include all rows from df1 and the matching rows from df2. Rows of df1 with no match in df2 will have NA in the columns that come from df2.
3. Right join:
R
right_join <- right_join(df1, df2, by = "ID")
print(right_join)
Output:
ID Name Age Occupation Salary
1 2 B 30 Engineer 2000
2 3 C 40 Teacher 4000
3 4 D 50 Doctor 6000
4 5 <NA> NA Lawyer 7000
The resulting right_join data frame will include all rows from df2 and the matching rows from df1. Rows of df2 with no match in df1 will have NA in the columns that come from df1.
4. Full outer join:
R
full_join <- full_join(df1, df2, by = "ID")
print(full_join)
Output:
ID Name Age Occupation Salary
1 1 A 20 <NA> NA
2 2 B 30 Engineer 2000
3 3 C 40 Teacher 4000
4 4 D 50 Doctor 6000
5 5 <NA> NA Lawyer 7000
The resulting full_join data frame will include all rows from both df1 and df2. Rows with no match in the other data frame will have NA in the columns that come from it.
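One practical difference from base R is row order: merge() sorts the result by the key columns by default, while dplyr's joins keep the row order of the first data frame. A small sketch with hypothetical data frames `people` and `pay`, whose keys are deliberately unsorted, illustrates this:

```r
library(dplyr)

# Hypothetical inputs; the IDs in `people` are deliberately out of order
people <- data.frame(ID = c(3, 1, 2),
                     Name = c("C", "A", "B"))
pay <- data.frame(ID = c(1, 2, 3),
                  Salary = c(1000, 2000, 3000))

base_result <- merge(people, pay, by = "ID")        # sorted by ID
dplyr_result <- inner_join(people, pay, by = "ID")  # keeps the order of `people`

print(base_result$ID)
print(dplyr_result$ID)
```

If the original row order matters, either use the dplyr joins or pass sort = FALSE to merge() and check the resulting order yourself.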
3. Using data.table package:
The data.table package offers an efficient and fast approach to data manipulation. It provides its own merge() method for data tables, which is similar to the one in base R but optimized for speed.
Syntax:
merged_dt <- merge(x, y, by = "common_column", ...)
- 'x' and 'y' are the data tables that are to be merged.
- 'by' specifies the common columns on which the merge will be performed.
- Additional arguments like 'all.x', 'all.y' and 'all' control the type of join.
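Because merge() is a generic function, calling it on data.table objects uses the data.table method and returns a data.table rather than a plain data frame. The sketch below uses hypothetical tables `dt1` and `dt2` to show this. According to the data.table documentation, the default sort = TRUE also sets the key of the result to the 'by' columns, which the last line inspects:

```r
library(data.table)

# Hypothetical data tables; the IDs in dt1 are deliberately out of order
dt1 <- data.table(ID = c(3, 1, 2),
                  Name = c("C", "A", "B"))
dt2 <- data.table(ID = c(1, 2, 3),
                  Salary = c(1000, 2000, 3000))

result <- merge(dt1, dt2, by = "ID")

print(is.data.table(result))  # the result is still a data.table
print(result$ID)              # rows come back ordered by ID
print(key(result))            # inspect the key set by the merge
```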
Example:
Install the data.table package and create two data tables, df1 and df2.
R
install.packages("data.table")
library(data.table)
df1 <- data.table(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(20, 30, 40, 50))
df2 <- data.table(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(2000, 4000, 6000, 7000))
Various types of merges using the merge() function from data.table package
We can perform four types of join using the merge() function from data.table.
1. Inner join (default behaviour):
R
inner_join <- merge(df1, df2, by = "ID")
print(inner_join)
Output:
ID Name Age Occupation Salary
1 2 B 30 Engineer 2000
2 3 C 40 Teacher 4000
3 4 D 50 Doctor 6000
The resulting inner_join data table will only include the common rows where 'ID' is present in both df1 and df2.
2. Left join ('all.x = TRUE'):
R
left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_join)
Output:
ID Name Age Occupation Salary
1 1 A 20 <NA> NA
2 2 B 30 Engineer 2000
3 3 C 40 Teacher 4000
4 4 D 50 Doctor 6000
The resulting left_join data table will include all rows from df1 and the matching rows from df2. Rows of df1 with no match in df2 will have NA in the columns that come from df2.
3. Right join ('all.y = TRUE'):
R
right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)
Output:
ID Name Age Occupation Salary
1 2 B 30 Engineer 2000
2 3 C 40 Teacher 4000
3 4 D 50 Doctor 6000
4 5 <NA> NA Lawyer 7000
The resulting right_join data table will include all rows from df2 and the matching rows from df1. Rows of df2 with no match in df1 will have NA in the columns that come from df1.
4. Full outer join ('all = TRUE'):
R
full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)
Output:
ID Name Age Occupation Salary
1 1 A 20 <NA> NA
2 2 B 30 Engineer 2000
3 3 C 40 Teacher 4000
4 4 D 50 Doctor 6000
5 5 <NA> NA Lawyer 7000
The resulting full_join data table will include all rows from both df1 and df2. Rows with no match in the other data table will have NA in the columns that come from it.
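Besides merge(), data.table also supports joins through its bracket syntax, X[Y, on = ...]. This is offered only as an alternative sketch and is not the approach used in the examples above. The tables `dt1` and `dt2` and the result names are hypothetical:

```r
library(data.table)

dt1 <- data.table(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(20, 30, 40, 50))
dt2 <- data.table(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(2000, 4000, 6000, 7000))

# dt1[dt2, on = "ID"] looks up each row of dt2 in dt1,
# so every row of dt2 is kept (a right join from dt1's point of view)
right_rows <- dt1[dt2, on = "ID"]

# nomatch = NULL drops the rows of dt2 that have no match in dt1 (an inner join)
inner_rows <- dt1[dt2, on = "ID", nomatch = NULL]

print(right_rows)
print(inner_rows)
```

The bracket form avoids a separate function call and can be combined with other data.table operations, while merge() reads more like the base R code shown earlier.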
In this article, we explored three approaches to merging data frames in R: base R's merge(), dplyr's join functions and data.table's fast merge().