
K-fold Cross Validation in R Programming

Last Updated : 12 Jun, 2025

K-Fold Cross Validation is a resampling method used to assess the performance of machine learning models. Because the model is evaluated on several different subsets of the data, the resulting performance estimate is more robust and less dependent on any single train/test split.

Steps for K-fold Cross Validation in R 

  • The dataset is randomly split into K equal-sized folds.
  • For each iteration: One fold is used for validation. The remaining K-1 folds are used for training.
  • This process repeats K times, and performance metrics are averaged to get the final evaluation.
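The steps above can be sketched directly in base R. This is a minimal illustration, not the caret workflow used later in the article; the data frame, model (a simple `lm`) and RMSE metric are all placeholder choices.

```r
# Minimal base-R sketch of the K-fold loop described above.
set.seed(42)
df <- data.frame(x = rnorm(100), y = rnorm(100))  # toy data for illustration
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels 1..k

rmse_per_fold <- sapply(1:k, function(i) {
  train <- df[folds != i, ]   # K-1 folds for training
  valid <- df[folds == i, ]   # held-out fold for validation
  fit  <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$y - pred)^2))
})

mean(rmse_per_fold)  # final estimate: metric averaged over the K folds
```

In practice the caret package automates exactly this loop, as the next sections show.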

Implementing the K-fold Technique on Classification

Here a Naive Bayes classifier, a probabilistic model, is used to predict the class label of the target variable.

Step 1: Load Libraries and Dataset

R
# caret's "nb" method depends on the klaR package; install the packages if needed
install.packages(c("tidyverse", "caret", "ISLR", "klaR"))

library(tidyverse)  # For data manipulation
library(caret)      # For model training and cross-validation
library(ISLR)       # For the Smarket dataset

# Load the dataset and keep only complete rows
dataset <- Smarket[complete.cases(Smarket), ]

Step 2: Explore the Dataset

The dataset has 1,250 rows and 9 columns. Direction (target variable) is categorical with balanced classes: Up (648) and Down (602).

R
glimpse(dataset)
table(dataset$Direction)  # Check class balance

Output:

Rows: 1,250
Columns: 9

$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...

Down   Up 
602  648 

The independent variables are of type <dbl> (numeric), while the target variable is a <fct> (factor), which is ideal for classification. It has two balanced classes: Down and Up (about 1:1 ratio), which helps prevent model bias. If classes are imbalanced, techniques like Down Sampling, Up Sampling, or hybrid methods like SMOTE and ROSE can be used.
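caret ships `upSample()` and `downSample()` for the rebalancing techniques just mentioned. A small, purely illustrative sketch (Smarket is already close to balanced, so this only demonstrates the mechanics): `upSample()` resamples the minority class (Down, 602 rows) with replacement until it matches the majority class (Up, 648 rows).

```r
library(caret)
library(ISLR)

# Illustrative only: rebalance Direction by upsampling the minority class
balanced <- upSample(x = Smarket[, setdiff(names(Smarket), "Direction")],
                     y = Smarket$Direction,
                     yname = "Direction")
table(balanced$Direction)  # both classes now have 648 rows
```

`downSample()` has the same interface but discards majority-class rows instead; SMOTE and ROSE come from separate packages of the same names.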

Step 3: Build Model with 10-Fold CV

In this step, the trainControl() function sets the number of folds (the K parameter), and the model is then trained following the K-fold procedure described above.

R
set.seed(123)

train_control <- trainControl(method = "cv", number = 10)

model <- train(Direction ~ ., data = dataset,
               trControl = train_control,
               method = "nb")

Step 4: Evaluate Model Performance

After training and validation, print the model to see the cross-validated accuracy of the model.

R
print(model)

Output: 

Naive Bayes

1250 samples
8 predictor
2 classes: 'Down', 'Up'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1124, 1125, 1126, 1125, 1125, 1126, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.9528118  0.9051776
   TRUE      0.9672186  0.9342359

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.

Output Summary:

  • Model: Naive Bayes
  • Classes: Down, Up
  • Best Accuracy: 96.72%

Implementing the K-fold Technique on Regression

Regression models predict a continuous target variable, such as the price of a commodity or the sales of a firm. Below are the steps for applying K-fold cross-validation to a regression model.

Step 1: Load Libraries and Dataset

R
# Install the packages if needed
install.packages(c("tidyverse", "caret", "datarium"))

library(tidyverse)  # For data manipulation
library(caret)      # For model training and cross-validation
library(datarium)   # For the marketing dataset

data("marketing", package = "datarium")

Step 2: Inspect the Dataset

The dataset includes continuous predictors (youtube, facebook, newspaper) and a continuous target (sales).

R
head(marketing)

Output: 

[Figure: head of the marketing dataset, showing the first rows of the youtube, facebook, newspaper and sales columns]

Step 3: Train Model with 10-Fold CV

The K parameter is again set in the trainControl() function, and the model is trained following the K-fold cross-validation procedure described above.

R
set.seed(125)

train_control <- trainControl(method = "cv", number = 10)

model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

Step 4: Evaluate Model Performance

Below is the code to print the final score and overall summary of the model. 

R
print(model)

Output:

Linear Regression

200 samples
3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:

  RMSE      Rsquared   MAE
  2.027409  0.9041909  1.539866

Tuning parameter 'intercept' was held constant at a value of TRUE

Output Summary:

  • Model: Linear Regression
  • RMSE: 2.03, R²: 0.90, MAE: 1.54
  • Intercept: TRUE (fixed)
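caret also keeps the metrics for each individual fold in the fitted object's `resample` element. The sketch below repeats the Step 3 setup so it runs on its own, then inspects the fold-by-fold results; the averages it prints should match the summary above.

```r
library(caret)
library(datarium)

data("marketing", package = "datarium")
set.seed(125)

# Same setup as Step 3: linear regression with 10-fold CV
train_control <- trainControl(method = "cv", number = 10)
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

model$resample  # RMSE, Rsquared and MAE for each of the 10 folds
colMeans(model$resample[, c("RMSE", "Rsquared", "MAE")])
```

Looking at the per-fold spread, not just the average, is a quick way to see how sensitive the model is to the particular split.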

Advantages of K-fold Cross-Validation

  • Computationally cheaper than exhaustive methods such as leave-one-out cross-validation, since the model is trained only K times.
  • A very effective method to estimate the prediction error and the accuracy of a model.

Disadvantages of K-fold Cross-Validation

  • A low value of K gives a more biased estimate of performance, while a high value of K increases both the variability of the performance metrics and the computation time. Choosing K appropriately is therefore important; K = 5 or K = 10 is generally a good default.
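One common way to smooth out the fold-to-fold variability mentioned above is repeated K-fold cross-validation, which caret supports via the "repeatedcv" method: the whole K-fold procedure is re-run with several different random fold assignments and the results are averaged. A minimal sketch (assuming caret is loaded; the resulting control object is passed to train() exactly as in the steps above):

```r
library(caret)

# 10-fold CV repeated 3 times: 30 model fits in total, averaged at the end
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```

The trade-off is computation time, which grows linearly with the number of repeats.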
