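Feature scaling is a preprocessing technique that brings the numeric features of a dataset onto a comparable scale. It does not remove data points; it transforms the values of each feature so that differences in units or magnitude do not distort what the model learns. This can improve the accuracy and training stability of many machine learning models, particularly distance-based and gradient-based ones. Feature scaling is widely used in many fields, including business analytics and clinical data science.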
Feature Scaling Using R Programming Language
Feature scaling in R refers to the process of standardizing or normalizing the range of independent variables or features in a dataset. It ensures that each feature contributes equally to the model, preventing features with larger values from disproportionately influencing the results.
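To make the point about larger values concrete, here is a minimal base R sketch. The three people and their values are made up for illustration, and age_share() is a hypothetical helper, not part of any package. It measures how much of the squared Euclidean distance between the first two rows comes from Age, before and after scaling.

```r
# Three hypothetical people; the values are made up for illustration
people <- data.frame(Age = c(25, 60, 40),
                     Salary = c(50000, 50500, 52000))

# Hypothetical helper: share of the squared Euclidean distance between
# rows 1 and 2 that comes from the Age column
age_share <- function(d) {
  sq_diff <- (unlist(d[1, ]) - unlist(d[2, ]))^2
  unname(sq_diff["Age"] / sum(sq_diff))
}

raw_share <- age_share(people)                           # unscaled data
scaled_share <- age_share(as.data.frame(scale(people)))  # standardized data

raw_share     # below 1%: the salary gap swamps the 35-year age gap
scaled_share  # well above 50% once both columns are on a comparable scale
```

On the raw data the 35-year age difference barely registers next to a salary difference of 500, simply because salaries are recorded in larger numbers.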
There are two main feature scaling techniques.
1. Standardization
Standardization (also called z-score scaling) rescales each feature so that it has a mean of zero and a standard deviation of one. Each value x is replaced by (x - mean) / sd, where the mean and standard deviation are computed from that feature. For example, if you had a dataset with two variables (age and height), you would calculate the mean and standard deviation of each variable and then transform each column using its own values.
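As a small worked example of the idea, the sketch below standardizes a single made-up vector in base R. The vector x is an assumption chosen for easy arithmetic. Note that R's sd() uses the sample (n - 1) denominator.

```r
# A made-up vector, chosen only for easy arithmetic
x <- c(2, 4, 6, 8, 10)

# Subtract the mean, then divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

z        # standardized values, symmetric around 0
mean(z)  # 0 (up to floating-point error)
sd(z)    # 1
```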
2. Normalization
Normalization (also called min-max scaling) rescales each feature to a fixed range, usually 0 to 1. Each value x is replaced by (x - min) / (max - min), so the smallest value of the feature maps to 0 and the largest maps to 1. Because it depends only on the minimum and maximum, it is sensitive to outliers: a single extreme value compresses all the other values into a narrow part of the range.
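A minimal base R sketch of min-max normalization, (x - min) / (max - min), on made-up vectors. min_max() is a hypothetical helper written for this illustration. The second vector shows how one extreme value squeezes the remaining values toward 0.

```r
# Hypothetical helper for min-max normalization to the 0-1 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Evenly spread, made-up values
x <- c(2, 4, 6, 8, 10)
x_scaled <- min_max(x)
x_scaled  # values: 0, 0.25, 0.5, 0.75, 1

# Same values, but the last one replaced by an outlier
x_outlier <- c(2, 4, 6, 8, 100)
outlier_scaled <- min_max(x_outlier)
outlier_scaled  # the first four values all fall below 0.07
```

The helper divides by zero if every value in x is identical, so a constant column would need separate handling.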
Creating a Dataset to apply feature scaling in R
First, we create a data frame to apply the feature scaling techniques to. We will then explore different methods and libraries for scaling it.
R
age <- c(19,20,21,22,23,24,24,26,27)
salary <- c(10000,20000,30000,40000,
50000,60000,70000,80000,90000)
df <- data.frame( "Age" = age,
"Salary" = salary,
stringsAsFactors = FALSE)
df
Once the dataset is created, we can start implementing feature scaling.
We know the formulas for both standardization and normalization. Let's apply them one by one.
1. For Standardization
We are manually standardizing the dataset df using z-scores. Each column is transformed with the formula (x - mean(x)) / sd(x), and the result is saved as a new data frame, scaled_data.
R
scaled_data <- as.data.frame(sapply(df, function(x)
(x-mean(x))/sd(x)))
scaled_data
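As a quick sanity check, the sketch below rebuilds the same data so that it runs on its own, confirms that each standardized column has mean 0 and standard deviation 1, and compares the manual result with base R's scale(), which performs the same centering and scaling.

```r
# Rebuild the article's data so this block is self-contained
df <- data.frame(Age = c(19, 20, 21, 22, 23, 24, 24, 26, 27),
                 Salary = seq(10000, 90000, by = 10000))
scaled_data <- as.data.frame(sapply(df, function(x) (x - mean(x)) / sd(x)))

# Each column should have mean ~0 and standard deviation 1
col_means <- colMeans(scaled_data)
col_sds <- sapply(scaled_data, sd)
round(col_means, 10)
col_sds

# scale() returns a matrix with extra attributes, so compare values only
same_as_scale <- all.equal(as.matrix(scaled_data), scale(df),
                           check.attributes = FALSE)
same_as_scale
```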
2. For Normalization
We are manually normalizing the dataset df to a 0–1 range using the min-max formula. Each column is transformed using the expression (x - min(x)) / (max(x) - min(x)), and the result is stored as a new data frame, scaled_data2.
R
scaled_data2 <- as.data.frame(sapply(df, function(x)
(x-min(x))/(max(x)-min(x))))
scaled_data2
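The sketch below, again self-contained, checks that every normalized column spans exactly 0 to 1, and shows that the transformation can be reversed if the original minimum and maximum of each column are kept. The names mins, maxs and restored are introduced here for illustration.

```r
# Rebuild the article's data so this block is self-contained
df <- data.frame(Age = c(19, 20, 21, 22, 23, 24, 24, 26, 27),
                 Salary = seq(10000, 90000, by = 10000))

# Keep the per-column minimum and maximum for the reverse step
mins <- sapply(df, min)
maxs <- sapply(df, max)

scaled_data2 <- as.data.frame(sapply(df, function(x)
  (x - min(x)) / (max(x) - min(x))))

# Row 1 holds each column's minimum (0), row 2 its maximum (1)
sapply(scaled_data2, range)

# Reverse the scaling: x = scaled * (max - min) + min
restored <- as.data.frame(Map(function(col, lo, hi) col * (hi - lo) + lo,
                              scaled_data2, mins, maxs))
all.equal(restored, df, check.attributes = FALSE)
```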
Using Caret Library
Let's import the library caret and then apply the Standardization and Normalisation.
1. Standardization Using Caret Library
We are standardizing the dataset df by centering and scaling its numeric features using the caret package. First, we create a preprocessing model with preProcess(), then apply it to the data using predict().
R
install.packages("caret")
library(caret)
data1.pre <- preProcess(df, method=c("center", "scale"))
data1<- predict(data1.pre, df)
data1
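Because preProcess() estimates the centering and scaling values and predict() applies them, the values learned from one dataset can be reused on another, for example on new or test data. The base R sketch below mimics that two-step idea without caret. The data frame new_df and its values are hypothetical.

```r
# Rebuild the article's data so this block is self-contained
df <- data.frame(Age = c(19, 20, 21, 22, 23, 24, 24, 26, 27),
                 Salary = seq(10000, 90000, by = 10000))

# Step 1: learn the centering and scaling values from the original data
centers <- sapply(df, mean)
scales <- sapply(df, sd)

# Step 2: apply those same values to new, hypothetical observations
new_df <- data.frame(Age = c(22, 30), Salary = c(50000, 100000))
new_scaled <- as.data.frame(scale(new_df, center = centers, scale = scales))
new_scaled
```

A salary of 50000 equals the mean salary in df, so it maps to 0. The new rows are scaled relative to the original data rather than to each other.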
2. Normalisation Using Caret Library
We are normalizing the dataset df to a 0–1 range using the caret package. First, we create a preprocessing model with preProcess(method = "range"), then apply it to the data using predict().
R
library("caret")
data2.pre <- preProcess(df, method="range")
data2 <- predict(data2.pre, df)
data2
Using Dplyr Library
Let's import the library dplyr and then apply the Standardization and Normalisation.
1. Standardization Using Dplyr Library
We are standardizing the "Salary" column in the dataset df using the scale() function. After loading the dplyr package, we use mutate() to replace the "Salary" column with its z-scores and store the result in data_s. Because scale() returns a one-column matrix, as.numeric() converts it back to a plain numeric vector.
R
install.packages("dplyr")
library(dplyr)
data_s <- df %>%
  mutate(Salary = as.numeric(scale(Salary)))
data_s
2. Normalisation Using Dplyr Library
We are normalizing all columns in the dataset df to a 0–1 range using the min-max formula. With the dplyr package, we use mutate() with across(everything(), ...) to apply (x - min) / (max - min) to every column and save the result in data_n.
R
library(dplyr)
data_n <- df %>%
  mutate(across(everything(), ~ (.x - min(.x)) / (max(.x) - min(.x))))
data_n
Using BBmisc package
BBmisc is an R package whose normalize() function can perform both standardization and normalization.
1. Standardization Using BBmisc package
We are standardizing the entire dataset df using the BBmisc package. The normalize() function with method = "standardize" applies z-score normalization to all numeric columns and stores the result in df_standardized.
R
install.packages("BBmisc")
library(BBmisc)
df_standardized <- BBmisc::normalize(df, method = "standardize")
df_standardized
2. Normalization Using BBmisc package
We are normalizing the dataset df to a 0–1 range using the BBmisc package. The normalize() function with method = "range" scales all numeric columns and stores the result in df_normalized.
R
library(BBmisc)
df_normalized <- BBmisc::normalize(df, method = "range")
df_normalized
In this article, we explored various methods for performing feature scaling in R using different libraries and techniques.