How to use XGBoost algorithm for regression in R?
Last Updated: 19 Jul, 2024
XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm based on gradient boosting that is widely used for classification and regression tasks. In this article, we will explain how to use XGBoost for regression in R.
What is XGBoost?
XGBoost stands for "Extreme Gradient Boosting." It is a machine learning technique that combines the predictions of many simple models, typically decision trees, built sequentially so that each new tree corrects the errors of the previous ones. XGBoost is well known for its accuracy and speed, which makes it a favorite among data scientists and machine learning practitioners, and its ability to handle large datasets efficiently has led to broad adoption across many industries.
What is Regression?
Regression is a class of machine learning problem in which the objective is to predict a continuous variable. For example, we might want to predict the price of a house based on variables such as its location, size, and number of rooms. Because they can forecast outcomes and quantify relationships between variables, regression models are essential tools in fields such as biology, engineering, and economics.
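As a minimal illustration of regression using base R (with the built-in `mtcars` dataset standing in for the housing example), a linear model predicts a continuous outcome from a predictor:

```r
# Fit a simple regression: predict fuel efficiency (mpg) from car weight (wt)
fit <- lm(mpg ~ wt, data = mtcars)

# Predict a continuous value for a hypothetical car weighing 3,000 lbs (wt = 3.0)
predict(fit, newdata = data.frame(wt = 3.0))
```

XGBoost tackles the same kind of problem, but with an ensemble of trees instead of a single linear fit.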
Now we will walk through, step by step, how to use the XGBoost algorithm for regression in R.
Step 1: Data Preparation
Before we can build the model, we need data. We will use the Boston Housing dataset, a publicly available dataset that describes Boston-area housing, including crime rates, average room counts, and other neighborhood characteristics.
R
# Install necessary packages
install.packages("xgboost")
install.packages("caret")
install.packages("dplyr")
# Load packages
library(xgboost)
library(caret)
library(dplyr)
# Load the dataset directly from the URL
url <- "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data <- read.csv(url)
# Display the first few rows of the dataset
head(data)
Output:
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
The dataset includes a number of columns, including:
- crim: Crime rate per capita.
- rm: Average number of rooms per dwelling.
- medv: Median value of owner-occupied homes (this is our target variable).
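Before modeling, it is worth sanity-checking the data frame. The sketch below uses a tiny hand-built stand-in for the loaded `data`; on the real dataset the same calls should report 506 rows, 14 columns, and no missing values:

```r
# Stand-in for the first rows of the Boston Housing data loaded above
data <- data.frame(crim = c(0.00632, 0.02731),
                   rm   = c(6.575, 6.421),
                   medv = c(24.0, 21.6))

dim(data)          # number of rows and columns
sum(is.na(data))   # count of missing values (worth checking before training)
summary(data$medv) # distribution of the target variable
```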
Step 2: Model Building and Training
The data will be divided into a test set and a training set. The model is trained on the training set, and its performance is assessed on the test set.
R
# Split the data into training and test sets
set.seed(123)
trainIndex <- createDataPartition(data$medv, p = 0.8,
                                  list = FALSE,
                                  times = 1)
trainData <- data[trainIndex, ]
testData  <- data[-trainIndex, ]
# Separate the features and target variable
train_x <- as.matrix(trainData %>% select(-medv))
train_y <- trainData$medv
test_x <- as.matrix(testData %>% select(-medv))
test_y <- testData$medv
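If caret is not available, an equivalent 80/20 split can be sketched with base R's `sample()` (note that `createDataPartition` additionally stratifies on the target, which plain sampling does not):

```r
set.seed(123)
n <- 506                                    # rows in the Boston Housing data
idx <- sample(seq_len(n), size = floor(0.8 * n))

length(idx)        # 404 training rows (80%)
n - length(idx)    # 102 test rows (20%)
```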
Step 3: Training the XGBoost Model
Now we train the XGBoost model on the training matrix, using 100 boosting rounds and the squared-error objective for regression.
R
# Train the model
xgb_model <- xgboost(data = train_x,
                     label = train_y,
                     nrounds = 100,
                     objective = "reg:squarederror")
Output:
[1] train-rmse:17.080285
[2] train-rmse:12.306821
[3] train-rmse:8.964714
[4] train-rmse:6.596005
[5] train-rmse:4.915064
[6] train-rmse:3.733244
...
[98] train-rmse:0.021423
[99] train-rmse:0.020608
[100] train-rmse:0.020092
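The training RMSE above falls to almost zero by round 100, which usually signals overfitting to the training set. A common safeguard is to monitor error on held-out data and stop boosting early; the sketch below uses `xgb.train()` with `early_stopping_rounds` on synthetic data (a stand-in for the article's `train_x`/`train_y`):

```r
library(xgboost)

# Synthetic regression data standing in for the housing features
set.seed(123)
X <- matrix(rnorm(500 * 5), ncol = 5)
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 3)) + rnorm(500)

# Hold out the last 100 rows to monitor validation error
dtrain <- xgb.DMatrix(X[1:400, ], label = y[1:400])
dvalid <- xgb.DMatrix(X[401:500, ], label = y[401:500])

model <- xgb.train(params = list(objective = "reg:squarederror", eta = 0.1),
                   data = dtrain,
                   nrounds = 500,
                   watchlist = list(train = dtrain, valid = dvalid),
                   early_stopping_rounds = 10,
                   verbose = 0)

model$best_iteration  # boosting stops once validation RMSE stops improving
```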
Step 4: Model Evaluation
To see how well our model performs, we'll use the test set and calculate the Root Mean Square Error (RMSE).
R
# Predict the values for the test set
preds <- predict(xgb_model, test_x)
# Calculate RMSE
rmse <- sqrt(mean((preds - test_y)^2))
print(paste("RMSE:", rmse))
Output:
[1] "RMSE: 3.30851858196889"
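Beyond the RMSE, it is often useful to see which features drove the predictions. The sketch below builds a small self-contained model as a stand-in for the `xgb_model` trained above (here only `rm` carries signal) and reports gain-based feature importance:

```r
library(xgboost)

# Toy stand-in for the trained housing model: only "rm" carries signal
set.seed(42)
train_x <- matrix(runif(200 * 3), ncol = 3,
                  dimnames = list(NULL, c("crim", "rm", "lstat")))
train_y <- 5 * train_x[, "rm"] + rnorm(200, sd = 0.1)

xgb_model <- xgboost(data = train_x, label = train_y,
                     nrounds = 20, objective = "reg:squarederror",
                     verbose = 0)

# Gain-based importance: how much each feature improved the model's splits
importance <- xgb.importance(model = xgb_model)
print(importance)
```

On the real housing model, the same call ranks the original columns (`rm`, `lstat`, and so on) by their contribution to the fitted trees.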
Conclusion
We have covered the fundamentals of applying the XGBoost algorithm to regression tasks in R, from understanding what XGBoost is and why it is effective, through preparing the data, to building and evaluating a model. With these steps you can build reliable models that predict continuous outcomes. Remember that the key to mastering machine learning is consistent practice and experimentation with different datasets and parameters; from here you can explore XGBoost's tuning options and other machine learning techniques.