
K-fold Cross Validation in R Programming

Last Updated : 12 Jun, 2025

K-Fold Cross Validation is a resampling method used to assess the performance of machine learning models. Because the model is evaluated on several different subsets of the data, the resulting performance estimate is more robust and less dependent on any single train/test split.

Steps for K-fold Cross Validation in R 

  • The dataset is randomly split into K equal-sized folds.
  • For each iteration: One fold is used for validation. The remaining K-1 folds are used for training.
  • This process repeats K times, and performance metrics are averaged to get the final evaluation.
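The steps above can be sketched directly in base R. This is a minimal illustration, not the caret workflow used later in the article; the data frame, model (a simple `lm`) and RMSE metric are all placeholder choices.

```r
# Minimal base-R sketch of the K-fold loop described above.
set.seed(42)
df <- data.frame(x = rnorm(100), y = rnorm(100))  # toy data for illustration
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels 1..k

rmse_per_fold <- sapply(1:k, function(i) {
  train <- df[folds != i, ]   # K-1 folds for training
  valid <- df[folds == i, ]   # held-out fold for validation
  fit  <- lm(y ~ x, data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$y - pred)^2))
})

mean(rmse_per_fold)  # final estimate: metric averaged over the K folds
```

In practice the caret package automates exactly this loop, as the next sections show.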

Implementing the K-fold Technique on Classification

Here a Naive Bayes classifier, a probabilistic model, is used to predict the class label of the target variable.

Step 1: Load Libraries and Dataset

R
# caret's "nb" method depends on the klaR package; install the packages if needed
install.packages(c("tidyverse", "caret", "ISLR", "klaR"))

library(tidyverse)  # For data manipulation
library(caret)      # For model training and cross-validation
library(ISLR)       # For the Smarket dataset

# Load the dataset and keep only complete rows
dataset <- Smarket[complete.cases(Smarket), ]

Step 2: Explore the Dataset

The dataset has 1,250 rows and 9 columns. Direction (target variable) is categorical with balanced classes: Up (648) and Down (602).

R
glimpse(dataset)
table(dataset$Direction)  # Check class balance

Output:

Rows: 1,250
Columns: 9

$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ...
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1...
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0...
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -...
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, ...
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, ...
$ Volume    <dbl> 1.19130, 1.29650, 1.41120, 1.27600, 1.20570, 1.34910, 1.44500, 1.40780, 1.16400, 1.23260, 1.30900, 1.25800, 1.09800, 1.05310, ...
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.303, 0.287, -0.498, -0.189, 0.680, 0.701, -0.562, 0.546, -1.747, 0...
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up, Down, Up, Down, Up, Down, Down, Down, Down, Up, Down, Down, Up...

Down   Up 
602  648 

The independent variables are of type <dbl> (numeric), while the target variable is a <fct> (factor), which is ideal for classification. It has two balanced classes: Down and Up (about 1:1 ratio), which helps prevent model bias. If classes are imbalanced, techniques like Down Sampling, Up Sampling, or hybrid methods like SMOTE and ROSE can be used.
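caret ships `upSample()` and `downSample()` for the rebalancing techniques just mentioned. A small, purely illustrative sketch (Smarket is already close to balanced, so this only demonstrates the mechanics): `upSample()` resamples the minority class (Down, 602 rows) with replacement until it matches the majority class (Up, 648 rows).

```r
library(caret)
library(ISLR)

# Illustrative only: rebalance Direction by upsampling the minority class
balanced <- upSample(x = Smarket[, setdiff(names(Smarket), "Direction")],
                     y = Smarket$Direction,
                     yname = "Direction")
table(balanced$Direction)  # both classes now have 648 rows
```

`downSample()` has the same interface but discards majority-class rows instead; SMOTE and ROSE come from separate packages of the same names.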

Step 3: Build Model with 10-Fold CV

In this step, the trainControl() function sets the number of folds (the K parameter), and the model is then trained following the K-fold procedure described above.

R
set.seed(123)

train_control <- trainControl(method = "cv", number = 10)

model <- train(Direction ~ ., data = dataset,
               trControl = train_control,
               method = "nb")

Step 4: Evaluate Model Performance

After training and validation, print the model to see the cross-validated accuracy of the model.

R
print(model)

Output: 

Naive Bayes

1250 samples
8 predictor
2 classes: 'Down', 'Up'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 1124, 1125, 1126, 1125, 1125, 1126, ...
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.9528118  0.9051776
   TRUE      0.9672186  0.9342359

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust = 1.

Output Summary:

  • Model: Naive Bayes
  • Classes: Down, Up
  • Best Accuracy: 96.72%

Implementing the K-fold Technique on Regression

Regression models predict a continuous target variable, such as the price of a commodity or the sales of a firm. Below are the steps for applying K-fold cross-validation to a regression model.

Step 1: Load Libraries and Dataset

R
# Install the packages if needed
install.packages(c("tidyverse", "caret", "datarium"))

library(tidyverse)  # For data manipulation
library(caret)      # For model training and cross-validation
library(datarium)   # For the marketing dataset

data("marketing", package = "datarium")

Step 2: Inspect the Dataset

The dataset includes continuous predictors (youtube, facebook, newspaper) and a continuous target (sales).

R
head(marketing)

Output: 

[Figure: head of the marketing dataset, showing the first rows of the youtube, facebook, newspaper and sales columns]

Step 3: Train Model with 10-Fold CV

The K parameter is again set in the trainControl() function, and the model is trained following the K-fold cross-validation procedure described above.

R
set.seed(125)

train_control <- trainControl(method = "cv", number = 10)

model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

Step 4: Evaluate Model Performance

Below is the code to print the final score and overall summary of the model. 

R
print(model)

Output:

Linear Regression

200 samples
3 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 180, 180, 179, 180, 180, ...
Resampling results:

  RMSE      Rsquared   MAE
  2.027409  0.9041909  1.539866

Tuning parameter 'intercept' was held constant at a value of TRUE

Output Summary:

  • Model: Linear Regression
  • RMSE: 2.03, R²: 0.90, MAE: 1.54
  • Intercept: TRUE (fixed)
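caret also keeps the metrics for each individual fold in the fitted object's `resample` element. The sketch below repeats the Step 3 setup so it runs on its own, then inspects the fold-by-fold results; the averages it prints should match the summary above.

```r
library(caret)
library(datarium)

data("marketing", package = "datarium")
set.seed(125)

# Same setup as Step 3: linear regression with 10-fold CV
train_control <- trainControl(method = "cv", number = 10)
model <- train(sales ~ ., data = marketing,
               method = "lm",
               trControl = train_control)

model$resample  # RMSE, Rsquared and MAE for each of the 10 folds
colMeans(model$resample[, c("RMSE", "Rsquared", "MAE")])
```

Looking at the per-fold spread, not just the average, is a quick way to see how sensitive the model is to the particular split.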

Advantages of K-fold Cross-Validation

  • Computationally cheaper than exhaustive methods such as leave-one-out cross-validation, since the model is trained only K times.
  • A very effective method to estimate the prediction error and the accuracy of a model.

Disadvantages of K-fold Cross-Validation

  • A low value of K gives a more biased estimate of performance, while a high value of K increases both the variability of the performance metrics and the computation time. Choosing K appropriately is therefore important; K = 5 or K = 10 is generally a good default.
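One common way to smooth out the fold-to-fold variability mentioned above is repeated K-fold cross-validation, which caret supports via the "repeatedcv" method: the whole K-fold procedure is re-run with several different random fold assignments and the results are averaged. A minimal sketch (assuming caret is loaded; the resulting control object is passed to train() exactly as in the steps above):

```r
library(caret)

# 10-fold CV repeated 3 times: 30 model fits in total, averaged at the end
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```

The trade-off is computation time, which grows linearly with the number of repeats.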
