How to use XGBoost algorithm for regression in R?
Last Updated: 19 Jul, 2024
XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm based on gradient boosting that is widely used for classification and regression tasks. In this article, we will explain how to use XGBoost for regression in R.
What is XGBoost?
XGBoost stands for "Extreme Gradient Boosting." It is a machine learning technique that combines the predictions of many simple models, typically decision trees, built sequentially so that each new tree corrects the errors of the previous ones. XGBoost is well known for its accuracy and speed, which makes it a favorite among data scientists and machine learning practitioners, and its ability to handle large datasets efficiently has led to broad adoption across many industries.
What is Regression?
Regression is a class of machine learning problem in which the objective is to predict a continuous variable. For example, we might want to predict the price of a house based on variables such as its location, size, and number of rooms. Because they can forecast outcomes and quantify relationships between variables, regression models are essential tools in fields such as biology, engineering, and economics.
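As a minimal illustration of regression using base R (with the built-in `mtcars` dataset standing in for the housing example), a linear model predicts a continuous outcome from a predictor:

```r
# Fit a simple regression: predict fuel efficiency (mpg) from car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)

# Predict a continuous value for a hypothetical car weighing 3,000 lbs (wt = 3.0)
predict(fit, newdata = data.frame(wt = 3.0))
```

XGBoost tackles the same kind of problem, but with an ensemble of trees instead of a single linear fit.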
Now we will walk through, step by step, how to use the XGBoost algorithm for regression in R.
Step 1: Data Preparation
Before we can build the model, we need data. We will use the Boston Housing dataset, a publicly available dataset that describes Boston-area housing, including crime rates, average room counts, and other neighborhood characteristics.
R
# Install necessary packages
install.packages("xgboost")
install.packages("caret")
install.packages("dplyr")
# Load packages
library(xgboost)
library(caret)
library(dplyr)
# Load the dataset directly from the URL
url <- "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data <- read.csv(url)
# Display the first few rows of the dataset
head(data)
Output:
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
The dataset includes a number of columns, including:
- crim: Crime rate per capita.
- rm: Average number of rooms per dwelling.
- medv: Median value of owner-occupied homes (this is our target variable).
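Before modeling, it is worth sanity-checking the data frame. The sketch below uses a tiny hand-built stand-in for the loaded `data`; on the real dataset the same calls should report 506 rows, 14 columns, and no missing values:

```r
# Stand-in for the first rows of the Boston Housing data loaded above
data <- data.frame(crim = c(0.00632, 0.02731),
                   rm   = c(6.575, 6.421),
                   medv = c(24.0, 21.6))

dim(data)          # number of rows and columns
sum(is.na(data))   # count of missing values (worth checking before training)
summary(data$medv) # distribution of the target variable
```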
Step 2: Model Building and Training
The data will be divided into a test set and a training set. The model is trained on the training set, and its performance is assessed on the test set.
R
# Split the data into training and test sets
set.seed(123)
trainIndex <- createDataPartition(data$medv, p = 0.8,
                                  list = FALSE,
                                  times = 1)
trainData <- data[trainIndex, ]
testData  <- data[-trainIndex, ]
# Separate the features and target variable
train_x <- as.matrix(trainData %>% select(-medv))
train_y <- trainData$medv
test_x <- as.matrix(testData %>% select(-medv))
test_y <- testData$medv
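If caret is not available, an equivalent 80/20 split can be sketched with base R's `sample()` (note that `createDataPartition` additionally stratifies on the target, which plain sampling does not):

```r
set.seed(123)
n <- 506                                    # rows in the Boston Housing data
idx <- sample(seq_len(n), size = floor(0.8 * n))

length(idx)        # 404 training rows (80%)
n - length(idx)    # 102 test rows (20%)
```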
Step 3: Training the XGBoost Model
Now we train the XGBoost model on the training matrix, using 100 boosting rounds and the squared-error objective for regression.
R
# Train the model
xgb_model <- xgboost(data = train_x,
                     label = train_y,
                     nrounds = 100,
                     objective = "reg:squarederror")
Output:
[1] train-rmse:17.080285
[2] train-rmse:12.306821
[3] train-rmse:8.964714
[4] train-rmse:6.596005
[5] train-rmse:4.915064
[6] train-rmse:3.733244
...
[98] train-rmse:0.021423
[99] train-rmse:0.020608
[100] train-rmse:0.020092
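The training RMSE above falls to almost zero by round 100, which usually signals overfitting to the training set. A common safeguard is to monitor error on held-out data and stop boosting early; the sketch below uses `xgb.train()` with `early_stopping_rounds` on synthetic data (a stand-in for the article's `train_x`/`train_y`):

```r
library(xgboost)

# Synthetic regression data standing in for the housing features
set.seed(123)
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 3)) + rnorm(500)

# Hold out the last 100 rows to monitor validation error
dtrain <- xgb.DMatrix(X[1:400, ], label = y[1:400])
dvalid <- xgb.DMatrix(X[401:500, ], label = y[401:500])

model <- xgb.train(params = list(objective = "reg:squarederror", eta = 0.1),
                   data = dtrain,
                   nrounds = 500,
                   watchlist = list(train = dtrain, valid = dvalid),
                   early_stopping_rounds = 10,
                   verbose = 0)

model$best_iteration  # boosting stops once validation RMSE stops improving
```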
Step 4: Model Evaluation
To see how well our model performs, we'll use the test set and calculate the Root Mean Square Error (RMSE).
R
# Predict the values for the test set
preds <- predict(xgb_model, test_x)
# Calculate RMSE
rmse <- sqrt(mean((preds - test_y)^2))
print(paste("RMSE:", rmse))
Output:
[1] "RMSE: 3.30851858196889"
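Beyond the RMSE, it is often useful to see which features drove the predictions. The sketch below builds a small self-contained model as a stand-in for the `xgb_model` trained above (here only `rm` carries signal) and reports gain-based feature importance:

```r
library(xgboost)

# Toy stand-in for the trained housing model: only "rm" carries signal
set.seed(42)
train_x <- matrix(runif(200 * 3), ncol = 3,
                  dimnames = list(NULL, c("crim", "rm", "lstat")))
train_y <- 5 * train_x[, "rm"] + rnorm(200, sd = 0.1)

xgb_model <- xgboost(data = train_x, label = train_y,
                     nrounds = 20, objective = "reg:squarederror",
                     verbose = 0)

# Gain-based importance: how much each feature improved the model's splits
importance <- xgb.importance(model = xgb_model)
print(importance)
```

On the real housing model, the same call ranks the original columns (`rm`, `lstat`, and so on) by their contribution to the fitted trees.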
Conclusion
We have covered the fundamentals of applying the XGBoost algorithm to regression tasks in R, from understanding what XGBoost is and why it is effective, through preparing the data, to building and evaluating a model. With these steps you can build reliable models that predict continuous outcomes. Remember that the key to mastering machine learning is consistent practice and experimentation with different datasets and parameters; from here you can explore XGBoost's tuning options and other machine learning techniques.