R Programming & Machine Learning
Presentation By :
Aman Bhalla
(+91 8700246920)
(amanbhalla017@gmail.com)
AGENDA
• Introduction to R
• Packages Covered
• Datasets Covered
• Basics of R
• Looping in R
• Data Analysis in R
• Machine Learning Algorithms
Introduction to R
• A programming language and free software environment for statistical computing.
• Widely used among statisticians and data miners for developing statistical software and performing data analysis. R itself is a language, not a GUI; RStudio is a popular graphical front end for it.
• Highly extensible through user-submitted packages for specific functions.
RStudio panes : R Script, R Console, Global Environment, and the Plots, Packages & Help tab.
R differs from RStudio. R can be used without RStudio, but RStudio cannot be used without R, so R must be installed first.
Shortcuts Used (RStudio)
• Ctrl + L : Clears the R console.
• Alt + - : Inserts the assignment operator (<-).
• Ctrl + Shift + M : Inserts the pipe operator (%>%).
• Ctrl + Shift + N : Opens a new R script.
• Ctrl + O : Opens an existing R script.
• Ctrl + S : Saves the current R script.
• Ctrl + Q : Quits the current R session.
Packages Covered
• Amelia : A program for missing data.
• animation : A Gallery of animations in Statistics & Utilities to create animations.
• car : Companion to Applied Regression.
• caret : Classification and Regression Training.
• caTools : Tools : Moving window statistics, GIF, Base64, ROC AUC etc.
• class : Functions for Classification.
• corrplot : Visualization of a Correlation Matrix.
• cowplot : Streamlined Plot Theme and Plot Annotations for ‘ggplot2’.
• dplyr : A Grammar of Data Manipulation.
• e1071 : Misc Functions of the Department of Statistics, Probability Theory.
• ggplot2 : Create Elegant Data Visualisations using the Grammar of Graphics.
• ggplot2movies : Movies Data
• ggrepel : Automatically position non-overlapping text labels with ggplot2.
• hflights : Flights that departed Houston in 2011.
• leaflet : Create Interactive Web Maps with the JavaScript ‘Leaflet’ library.
• magrittr : A Forward-Pipe Operator for R.
• Metrics : Evaluation Metrics for Machine Learning.
• mlr : Machine Learning in R.
• partykit : A Toolkit for Recursive Partytioning.
• plyr : Tools for Splitting, Applying and Combining Data.
• randomForest : Breiman & Cutler’s Random Forests for classification & regression.
• rpart : Recursive Partitioning and Regression Trees.
• scales : Scale Functions for Visualization.
• tibble : Simple Data Frames.
• tidyr : Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions.
• VIM : Visualization and Imputation of Missing values.
Datasets Covered
• Built-in datasets such as iris, mtcars and state.x77 are used extensively, along with the following datasets :
1. Bike Data : A 121 x 9 dataset to understand basic indexing in R.
2. Future 500 : A 500 x 10 dataset to understand data visualization.
3. Movies Data : A 58788 x 24 dataset that ships with the ggplot2movies package.
4. Weather (Australia) : A 144187 x 24 dataset to explore the tools of the dplyr package.
5. Flights Data : A 227496 x 21 dataset that ships with the hflights package.
6. Meteorology Data : Weather datasets of 4 US cities to understand looping in R.
7. Big Mart Sales Data : A dataset to understand Exploratory Data Analysis in R.
8. Basketball, Wine Quality Data : Datasets to understand the Decision Tree mechanism.
9. Titanic, Home Credit Data : Datasets downloaded from Kaggle to understand ML basics.
10. Income Data : A dataset to further practice Supervised Machine Learning algorithms.
11. Social Network Ads Data : A 400 x 5 dataset to understand the k-NN algorithm.
12. Uber Data : A dataset showing month-wise details of customers boarding cabs and their locations.
Basics
• weekend <- c("Sat", "Sun") : Saves a variable named weekend in the environment containing 2 character values : Sat, Sun.
• data <- read.csv(file.choose(), header = TRUE, sep = ",", na.strings = c(" ")) : Standard code for reading a CSV file in R.
• class(data) : States the class of the dataset.
• str(data) : Displays the structure and class of every variable within the data.
• summary(data) : Shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each numeric variable.
• df <- as.data.frame(data) : Stores data as a data frame.
• matx <- as.matrix(data) : Stores data as a matrix.
• getwd() : Displays the working directory.
• rownames(data) : Shows all the row names of the data.
• colnames(data) : Shows all the column names of the data.
• nrow(data) : Shows the number of rows in the dataset.
• ncol(data) : Shows the number of columns in the dataset.
• length(data) : Shows the length of the dataset (for a data frame, the number of variables).
• install.packages("packagename") : Installs a package in R.
• library(package) : Loads an installed package so that its functions can be used.
• names(data) : Displays all the variable names of the dataset.
• dim(data) : Shows the dimensions of the data (number of rows and columns).
• sum(is.na(data)) : Shows the total number of NA values present in the dataset.
• attach(df) : Attaches the dataset df to the search path so its columns can be referenced by name.
• detach(df) : Detaches the dataset df.
• head(data) : Prints the first 6 rows of the dataset.
• tail(data) : Prints the last 6 rows of the dataset.
• print(tibble, n = 20) : Prints the first 20 rows of a tibble.
• data$variable <- as.character(data$variable) : Converts the variable to character type.
• data[1,] : Prints the 1st row of the dataset.
• data[,3] : Prints the 3rd column of the dataset.
• table(data$variable) : Tabular view of the variable (frequency distribution).
• round(mean(data$variable, na.rm = TRUE), 2) : Rounds the mean of the variable (ignoring NA values) to 2 decimals.
• sort(prop.table(table(data$variable)), decreasing = TRUE) : Tabular view of the proportion of each value of the variable, in descending order. prop.table() needs the frequency table, not the raw variable.
• complete.cases(variable) or !is.na(variable) : Logical output (TRUE/FALSE); shows TRUE for non-NA values.
• list1 <- list(mtcars, iris) : Stores a variable named list1, a list of the 2 mentioned datasets, in the global environment.
• which(iris$Species == "setosa") : Displays all the row numbers where the given criterion is satisfied.
• rm(list = ls()) : Clears all the stored datasets and variables from the global environment.
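The commands above can be tried on the built-in mtcars dataset, which avoids the need for an external CSV file. This is only a minimal sketch; the object names (cars, cyl_share and so on) are chosen for illustration and are not part of the original slides.

```r
# Inspect a built-in data frame with the basic commands listed above.
cars <- mtcars

dims      <- dim(cars)          # number of rows and columns
n_missing <- sum(is.na(cars))   # total NA values in the dataset
first_row <- cars[1, ]          # first row (still a data frame)
third_col <- cars[, 3]          # third column (disp) as a numeric vector

# Frequency table of cylinders, then proportions in descending order.
cyl_freq  <- table(cars$cyl)
cyl_share <- sort(prop.table(cyl_freq), decreasing = TRUE)

print(dims)
print(round(cyl_share, 2))
```

Note that prop.table() is applied to the output of table(), not to the raw column.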
Looping
A <- c("what","is","truth")
• COMMAND :
if ("Truth" %in% A) {
  print("Truth is found")
} else {
  print("Not Found")
}
• OUTPUT :
[1] "Not Found" (matching is case-sensitive, so "Truth" does not match "truth")
A <- c(1:10)
• COMMAND :
for (i in 1:3) {
  print(A)
}
• OUTPUT :
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1  2  3  4  5  6  7  8  9 10
apply(mtcars, 1, mean)
• OUTPUT :
Displays the mean of all variables row-wise (one mean per car).
lapply(weather, t) : Here weather is a list holding the Meteorology datasets (dataset 6 under Datasets Covered).
• OUTPUT :
Displays the transpose of each dataset, returned as a list.
lapply(weather, "[", 1) : Accesses the 1st column of every dataset in the list.
• Creating a Function :
missingvalue <- function(x) {
  return(sum(is.na(x)))
}
sapply(weather, missingvalue)
• OUTPUT :
Displays the count of missing values per dataset as a named vector.
• Alternative Code for the same task :
sapply(weather, function(x) sum(is.na(x)))
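Since the Meteorology files are not included here, the apply-family calls above can be sketched on a small made-up list of data frames. Everything in weather_demo is hypothetical stand-in data, and count_na is just an illustrative name.

```r
# A small stand-in for the 'weather' list: two data frames with some NAs.
# (Hypothetical values; the original datasets are not reproduced here.)
weather_demo <- list(
  city_a = data.frame(temp = c(21, NA, 25), rain = c(0, 3, NA)),
  city_b = data.frame(temp = c(15, 17, 16), rain = c(NA, 1, 2))
)

count_na <- function(x) sum(is.na(x))

na_per_city <- sapply(weather_demo, count_na)   # named vector, one count per city
first_cols  <- lapply(weather_demo, "[", 1)     # list of one-column data frames

# Row-wise means for one city, ignoring missing values.
row_means <- apply(weather_demo$city_b, 1, mean, na.rm = TRUE)

print(na_per_city)
print(row_means)
```

sapply() simplifies its result to a vector where it can, while lapply() always returns a list.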
Data Analysis
1. The first step in dealing with a typical business problem is forming a hypothesis, followed by Exploratory Data Analysis (Steps 2 to 6).
2. Receive the data - Extract all the variables/features from the data.
3. Understand the data in detail - Find patterns, outliers, anomalies and missing values.
4. Perform Univariate Analysis - Single variable analysis.
5. Conduct Multivariate Analysis - Categorical Vs. Categorical, Numerical Vs. Categorical, Numerical Vs. Numerical variables.
6. Impute missing values and treat outliers and anomalies (if any).
7. Apply Feature Engineering - Variable transformation.
8. Scale the data - Bring the variables onto a comparable scale (for example, mean 0 and standard deviation 1).
9. Apply an ML algorithm, make predictions and test the accuracy.
• Pipe Operator :
%>% : Passes the object on the left-hand side as the first argument of the function on the right-hand side.
E.g. : iris %>% names() : Returns all the variable names of the iris dataset.
• Tibble :
iris <- as_tibble(iris) : Converts the data to a tibble, which is easier to examine than a plain data frame. as_tibble() replaces the deprecated tbl_df().
• Select() :
iris %>% select(Sepal.Length, Petal.Length, Species) : Select columns by name
• Filter() :
filter(mtcars, cyl>4 & gear>4) : Extract rows that meet logical criteria.
• Rename() :
iris <- iris %>% rename(Sepal_Length = Sepal.Length) : Renaming a Variable.
• Select_If() :
iris.num <- iris %>% select_if(is.numeric) : Keeps only the columns that satisfy the condition (here, the numeric columns).
• Match Operator ( %in% ) :
8 %in% c(1,2,9,5,3,6,7,4,5) : OUTPUT : FALSE
iris %>% filter(Species %in% c("setosa", "virginica")) %>% summary() : Summarises the rows where the condition is TRUE.
• Helper Functions :
select(iris, contains(".")) : Select columns whose name contains a character string.
select(iris, starts_with("Sepal")) : Select columns whose name starts with a character string.
select(iris, ends_with("Length")) : Select columns whose name ends with a character string.
select(iris, everything()) : Select every column.
select(iris, Sepal.Length:Petal.Width) : Select all columns between Sepal.Length and Petal.Width (inclusive).
select(iris, -Species) : Select all columns except Species.
iris <- iris %>% select(Sepal_Length = Sepal.Length) : Renames the variable while selecting it. Unlike rename(), this keeps only the selected column.
• Arrange() :
hflights %>% arrange(DepDelay) : Arranges the data in ascending order of DepDelay.
hflights %>% arrange(desc(DepDelay)) : Arranges the data in descending order of DepDelay.
• Group_By() :
iris %>% group_by(Species) : Groups rows that share the same value of Species.
• Summarise() :
iris %>% group_by(Species) %>% summarise(Count = n()) : Summarises the count of each Species in the data.
hflights %>% group_by(Month) %>% summarise(dist = sum(Distance)) %>% summarise(Minimum_Distance = min(dist), Maximum_Distance = max(dist), Mean_Distance = mean(dist), Median_Distance = median(dist)) : Summarises the min., max., mean and median of the total monthly distances.
• Tally() :
hflights %>% group_by(Month) %>% tally(sort = TRUE) : Counts the rows in each group and sorts the counts in descending order.
• Mutate() :
mutate(iris, Sepal = Sepal.Length + Sepal.Width) : Computes and appends one or more new columns.
• Transmute() :
transmute(iris, sepal = Sepal.Length + Sepal.Width) : Computes one or more new columns and drops the original columns.
• Slice() :
slice(hflights, 100:106) : Slices rows by position.
• Distinct() :
iris %>% select(Species) %>% distinct() : Removes duplicate rows.
• If_Else() :
df <- data.frame(x = c(1, NA, 6, 5))
df <- df %>% mutate(New_Variable = if_else(x < 5, x + 1, x + 2, missing = 0)) : The last argument (missing) is the value used where the condition is NA.
• Union() :
union(y, z) : Rows that appear in either or both y and z. E.g. : union(mtcars[1:16,], mtcars[17:32,])
• Intersect() :
intersect(y, z) : Rows that appear in both y and z. E.g. : intersect(mtcars[1:16,], mtcars[16:32,])
• Between() :
hflights %>% filter(between(DepTime, 600, 605)) : Displays rows where the variable (DepTime) lies between the specified values, inclusive.
• Count() :
hflights %>% count(Month, sort = TRUE) : A shortcut that groups the data and creates a frequency table in one step.
• Bind_Rows() :
bind_rows(y, z) : Appends z to y as new rows. E.g. : bind_rows(mtcars[1:16,], mtcars[17:32,])
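The individual dplyr verbs above are usually chained with the pipe. The following is a minimal sketch on the built-in iris data, assuming the dplyr package is installed; the Sepal.Area column and the species_summary name are invented for illustration.

```r
library(dplyr)

# Chain several verbs: filter rows, add a column, group, summarise, sort.
species_summary <- iris %>%
  filter(Species %in% c("setosa", "virginica")) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%   # illustrative new column
  group_by(Species) %>%
  summarise(Count = n(), Mean_Area = mean(Sepal.Area)) %>%
  arrange(desc(Mean_Area))

print(species_summary)
```

Each step receives the result of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations are applied.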
• User Input :
number <- as.integer(readline(prompt = "Enter the number: ")) : prompt is an argument of readline(), not a separate function.
• Scatter Plot (Along with Smooth Line) :
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(color = "red", size = 5, shape = 7, alpha = 0.5) +
labs(title = "Scatter-Plot", subtitle = "Mpg Vs. Hp", x = "mpg", y = "hp") +
geom_smooth(fill = NA, size = 1.5, method = lm, color = "blue") + theme_bw()
• Box Plot :
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot(aes(fill = factor(cyl)), alpha = 0.75) +
scale_fill_discrete(name = "Cyl") + labs(title = "Box-Plot", x = "Cyl", y = "Mpg") + theme_classic()
• Bar Plot :
iris %>% group_by(Species) %>% summarise(Count = n()) %>% ggplot(aes(Species, Count)) +
geom_bar(stat = "identity", fill = "green") + labs(title = "Bar-Plot", x = "Species", y = "Count") +
geom_label(aes(Species, Count, label = Count)) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
• Histogram :
ggplot(movies, aes(rating)) + geom_histogram(aes(fill = after_stat(count)), binwidth = 0.1) + ggtitle("Histogram") +
xlab("Ratings") + ylab("Count") + theme_minimal()
• Density Plot :
ggplot(movies, aes(rating)) + geom_density(color = "red") + labs(title = "Density Plot",
subtitle = "Movies Data", x = "Ratings", y = "Density", caption = "Source : Movies Data") + theme_grey()
• Heat Map (Bin2d Map) :
ggplot(movies, aes(x = year, y = rating)) + geom_bin2d() + labs(title = "Heat Map", subtitle = "Year Vs. Rating",
x = "Year", y = "Rating", caption = "Source : Movies Data") + theme_classic()
• Violin Plot :
iris %>% ggplot(aes(Species, Sepal.Length)) + geom_violin(fill = "red", alpha = 0.75) + labs(title = "Violin-Plot",
subtitle = "Species Vs. Sepal.Length", caption = "Source : Iris Data")
• Correlation Plot :
corrplot(cor(mtcars), method = "circle") : Works only with data in which every variable is numeric.
• Cowplot :
plots <- plot_grid(A, B, nrow = 1) : Here A, B and C are 3 previously stored plots; this places A and B side by side.
plot_grid(plots, C, ncol = 1) : Stacks C below, so all 3 plots are drawn together.
• GG Repel :
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(color = "red", size = 5) +
geom_text_repel(aes(label = rownames(mtcars)), color = "blue") + theme_minimal()
• Faceting & Flipping :
Facet divides a plot into sub-plots.
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point(aes(color = Species, size = Species)) +
facet_wrap(~Species)
ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(aes(color = factor(cyl)), size = 5) +
facet_grid(. ~ cyl) + coord_flip() : A constant size belongs outside aes().
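• Tidyr :
• future500$Revenue <- gsub("$", "=", future500$Revenue, fixed = TRUE) : Replaces the $ symbol with the = symbol in all the Revenue entries. gsub() is base R; fixed = TRUE makes $ a literal character instead of the regex end-of-string anchor.
• future500 <- separate(future500, State, into = c("State", "City"), sep = ",") : Separates "State" into 2 separate columns.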
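The difference between a regex match and a literal match on $ is easy to miss, so here is a small base R sketch. The revenue vector is hypothetical and merely stands in for future500$Revenue.

```r
# Hypothetical revenue strings standing in for future500$Revenue.
revenue <- c("$1,200", "$950", "$10,500")

as_regex   <- gsub("$", "=", revenue)                 # "$" means end of string
as_literal <- gsub("$", "=", revenue, fixed = TRUE)   # "$" means the dollar sign

# Typical clean-up: strip "$" and "," and convert to numeric.
# Inside a character class [...] the "$" is treated literally.
revenue_num <- as.numeric(gsub("[$,]", "", revenue))

print(as_regex)
print(as_literal)
print(revenue_num)
```

Without fixed = TRUE (or escaping as "\\$"), the replacement is appended to the end of each string and the dollar sign stays in place.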
• Leaflet :
• leaflet() %>% addTiles() %>% addMarkers(lat = 40.7223, lng = -73.9887)
• OUTPUT : Displays map view of the location basis latitude and longitude entered.
• VIM :
• aggr(future500) : Calculate or plot the amount of missing/imputed values in each variable of the
dataset.
• Substr() Command :
• substr(iris$Species, 1, 3) : Extracts the first 3 characters of each value in the Species variable.
• Ifelse() Command :
• iris$Species_Num <- ifelse(iris$Species == "setosa", 1, ifelse(iris$Species == "versicolor", 2, 3))
• Revalue() Command : [plyr Package]
• iris$Species <- revalue(iris$Species, c("setosa" = "1", "versicolor" = "2", "virginica" = "3"))
• Recode() Command : [car Package]
• iris$Species <- recode(iris$Species, "c('setosa')='Type-1'")
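Nested ifelse() calls become hard to read as the number of levels grows. One base R alternative, sketched below, is a named lookup vector; this is offered as an illustration and is not the method used in the slides. The names lookup, code_ifelse and code_lookup are made up for the example.

```r
species <- iris$Species

# Nested ifelse(), as in the slide.
code_ifelse <- ifelse(species == "setosa", 1,
                      ifelse(species == "versicolor", 2, 3))

# Equivalent lookup with a named vector (base R, no extra package).
lookup      <- c(setosa = 1, versicolor = 2, virginica = 3)
code_lookup <- unname(lookup[as.character(species)])

print(table(code_lookup))
```

as.character() matters here: indexing with a factor directly would use its integer codes rather than its labels.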
• Writing a CSV File :
• write.csv(iris, file = "Iris Dataset.csv", row.names = F)
• Strsplit() Command :
• strsplit(future500$State, ",") : Splits each entry wherever it finds the "," separator. The variable must be of character type.
• Regression :
• summary(lm(target_variable ~ ., data = dataset)) : The target variable must be of numeric class. The higher the Adjusted R^2, the better the model fits.
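As a concrete case of the regression call above, the sketch below fits a model on mtcars and pulls the R-squared values out of the summary object. The choice of mpg as target and wt and hp as predictors is arbitrary and only for illustration.

```r
# Fit mpg on weight and horsepower (an illustrative choice of predictors).
fit     <- lm(mpg ~ wt + hp, data = mtcars)
fit_sum <- summary(fit)

r2     <- fit_sum$r.squared
adj_r2 <- fit_sum$adj.r.squared   # penalises additional predictors

cat("R-squared:", round(r2, 3), " Adjusted R-squared:", round(adj_r2, 3), "\n")
```

Adjusted R-squared is the safer figure when comparing models with different numbers of predictors, because plain R-squared never decreases when a variable is added.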
• Impute() : [mlr Package]
• imputeddata <- impute(future500, classes = list(factor = imputeMode(), numeric = imputeMedian()))
• future500 <- imputeddata$data
• Na.Roughfix : [randomForest Package]
• future500 <- na.roughfix(future500) : Imputes missing values with the median (numeric variables) or the mode (factors).
• Skewness and Kurtosis : [e1071 Package]
• skewness(iris$Sepal.Length) : Gives the skewness of the numeric variable.
• kurtosis(iris$Sepal.Width) : Gives the kurtosis of the numeric variable.
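To show what these two statistics measure, here is a base R sketch of the plain moment-based definitions. The function names skew_moment and kurt_excess are invented for this example; packaged implementations offer several estimator types with small-sample corrections, so their output may differ slightly from these values.

```r
# Moment-based definitions (no small-sample correction).
skew_moment <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / m2^1.5
}

kurt_excess <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m4 <- mean((x - mean(x))^4)   # fourth central moment
  m4 / m2^2 - 3                 # 0 for a normal distribution
}

print(skew_moment(iris$Sepal.Length))
print(kurt_excess(iris$Sepal.Width))
```

A positive skewness indicates a longer right tail, and a positive excess kurtosis indicates heavier tails than a normal distribution.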
• Splitting Iris Data : [caret Package]
index <- createDataPartition(iris$Species, p = 0.5, list = FALSE)
train <- iris[index,]
test <- iris[-index,]
• Splitting Iris Data : [caTools Package]
split.data <- sample.split(iris$Species, SplitRatio = 0.75)
training_set <- subset(iris, split.data == TRUE)
test_set <- subset(iris, split.data == FALSE)
• Train Dataset : Data containing the predictor variables and the target variable.
• Test Dataset : Data over which the model is tested for accuracy.
• combined <- bind_rows(train, test) : Dataset on which Feature Engineering is performed.
• train_new and test_new are 2 datasets extracted from combined, with the same number of rows as train and test respectively, no missing values (except the target variable in test_new) and the additional engineered features.
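For comparison with the two package-based splits above, the same idea can be sketched in base R with sample(). Unlike createDataPartition() and sample.split(), this simple version does not stratify by class, so treat it as an illustration rather than a drop-in replacement.

```r
set.seed(123)                                   # reproducible split
n         <- nrow(iris)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train <- iris[train_idx, ]    # 75% of the rows
test  <- iris[-train_idx, ]   # the remaining rows

cat("Train rows:", nrow(train), " Test rows:", nrow(test), "\n")
```

Negative indexing with -train_idx guarantees that no row ends up in both sets.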
Machine Learning Algorithms
• Setting Seed :
• set.seed(123) : Makes random sampling reproducible; any integer can be used.
Run this line every time before fitting a model or drawing a sample.
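A quick way to see what the seed does is to draw the same sample twice. This is a minimal base R sketch; the variable names are illustrative.

```r
set.seed(123)
first_draw <- sample(1:100, 5)

set.seed(123)                    # resetting the seed replays the same sequence
second_draw <- sample(1:100, 5)

print(identical(first_draw, second_draw))
```

Without the second set.seed() call, the second draw would continue the random sequence and would in general differ from the first.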
• Decision Tree : [rpart Package]
• model_dtree <- rpart(target_variable ~ ., data = train_new, method = "class", control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4)) : method is "class" for classification or "anova" for regression.
• Visualizing Decision Tree : [partykit Package]
• plot(as.party(model_dtree))
• Predicting Decision Tree Outcomes :
• predict_dtree <- predict(model_dtree, test_new, type = "class")
• Creating Confusion Matrix : [caret Package]
• DtreeCM <- confusionMatrix(predict_dtree, actual_data$target_variable)
• Checking Accuracy : [scales Package]
• percent(as.numeric(DtreeCM$overall[1]))
• Altering the 3 parameters minsplit, minbucket & maxdepth of a decision tree is an example of hyperparameter tuning.
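The confusion matrix and accuracy figure used above can be reproduced by hand with table(). The sketch below uses small made-up label vectors, purely to show the arithmetic; it does not replicate the extra statistics that a dedicated confusion-matrix function reports.

```r
# Hypothetical actual and predicted labels, for illustration only.
actual    <- factor(c("yes", "no", "yes", "yes", "no", "no", "yes", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "yes", "yes", "no"),
                    levels = c("no", "yes"))

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```

Giving both factors the same levels keeps the table square even if one class is never predicted.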
• Random Forest : [randomForest Package]
• model_rf <- randomForest(factor(target_variable) ~ ., data = train_new) : method = "rf" is not a randomForest() argument and is not needed here.
• Predicting Random Forest Outcomes :
• predict_rf <- predict(model_rf, test_new, type = "response")
• Creating Confusion Matrix :
• DtreeRF <- confusionMatrix(predict_rf, actual_data$target_variable)
• Checking Accuracy :
• percent(as.numeric(DtreeRF$overall[1]))
• C Forest :
• model_cf <- cforest(as.factor(target_variable)~., data = train_new)
• Predicting C Forest Outcomes :
• predict_cf <- predict(model_cf, test_new, type = "response", OOB = T)
• Creating Confusion Matrix :
• DtreeCF <- confusionMatrix(predict_cf, actual_data$target_variable)
• Checking Accuracy :
• percent(as.numeric(DtreeCF$overall[1]))
• Linear Regression :
• model_lm <- lm(target_variable~., data = train_new)
• Visualizing Linear Regression Model :
• par(mfrow=c(2,2)) ; plot(model_lm)
• Predicting Linear Regression Outcomes :
• predict_lm <- predict(model_lm, test_new, type = "response")
• Checking Root Mean Square Error (RMSE) : [Metrics Package]
• rmse(actual_data$target_variable, predict_lm) : The lower the RMSE, the better the model.
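RMSE is simple enough to compute by hand, which helps when checking what the packaged function returns. The sketch below defines a hypothetical helper, rmse_manual, and applies it to a small model on mtcars. The error here is measured on the training data for brevity, whereas the slides evaluate on test data.

```r
# Root mean square error: square the errors, average them, take the root.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

fit        <- lm(mpg ~ wt, data = mtcars)   # illustrative model
pred       <- predict(fit, mtcars)
rmse_train <- rmse_manual(mtcars$mpg, pred)

print(rmse_train)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.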
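• Logistic Regression : [Used for Classification Problem]
• model_glm <- glm(target_variable ~ ., data = train_new, family = binomial) : family = binomial is required; without it glm() fits an ordinary linear model.
• Visualizing Logistic Regression Model :
• par(mfrow = c(2,2)) ; plot(model_glm)
• Predicting Logistic Regression Outcomes :
• predict_glm <- predict(model_glm, test_new, type = "response") : Returns predicted probabilities, not classes.
• Converting Probabilities to Classes :
• class_glm <- factor(ifelse(predict_glm > 0.5, 1, 0), levels = c(0, 1)) : Assumes a 0/1 target and a 0.5 cut-off.
• Checking Accuracy & Kappa Value :
• confusionMatrix(class_glm, factor(actual_data$target_variable, levels = c(0, 1))) : Predictions go first, actual values second. The higher the Kappa, the better the model.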
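The full path from fitted probabilities to class labels can be sketched on mtcars, using the 0/1 column am as the target. The predictors and the 0.5 cut-off are arbitrary choices for illustration, and the model is scored on its own training data only to keep the example short.

```r
# Logistic regression needs family = binomial.
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# type = "response" returns probabilities between 0 and 1.
prob <- predict(fit_glm, mtcars, type = "response")

# Turn probabilities into class labels with a 0.5 cut-off (an assumption).
pred_class <- ifelse(prob > 0.5, 1, 0)

cm       <- table(Predicted = pred_class, Actual = mtcars$am)
accuracy <- mean(pred_class == mtcars$am)

print(cm)
print(accuracy)
```

The cut-off is a modelling decision: lowering it catches more positives at the cost of more false alarms.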
• k-NN Algorithm : [class Package]
• The k-NN model works only with numerical predictors, so all categorical columns are removed while building the model.
• model.knn <- knn(train = training_set[,-5], test = test_set[,-5], cl = training_set[,5], k = 10)
• The 5th column of the iris dataset (Species) is a factor, so it is excluded from the predictors and used as the class label (cl).
• Tuning k-NN : [e1071 Package]
• summary(tune.knn(x = training_set[,-5], y = training_set[,5], k = 1:20))
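Here is a self-contained sketch of the k-NN call above on iris, assuming the class package is available (it normally ships with R as a recommended package). The split size, k = 5, and scaling on the full data are simplifications made for brevity, not choices from the slides.

```r
library(class)

set.seed(123)
idx      <- sample(seq_len(nrow(iris)), size = 100)   # 100 training rows
features <- scale(iris[, 1:4])   # k-NN is distance-based, so scale the predictors
                                 # (scaled on all rows here purely for brevity)

pred_knn <- knn(train = features[idx, ], test = features[-idx, ],
                cl = iris$Species[idx], k = 5)

accuracy <- mean(pred_knn == iris$Species[-idx])
print(accuracy)
```

knn() has no separate training step: it classifies each test row directly from its k nearest training rows.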
• k-Means Clustering :
• Forming Cluster :
• sepalclusters <- kmeans(iris[,1:2],3,nstart = 20)
• table(sepalclusters$cluster, iris$Species)
• Visualizing Clusters via Animation :
• kmeans.ani(iris[,1:2],centers = 3)
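The clustering step can be run without the animation package, since kmeans() is part of base R's stats package. This sketch repeats the call from the slide with a seed so the result is reproducible; cross_tab and within_ss are illustrative names.

```r
set.seed(123)   # k-means starts from random centres

# Cluster on sepal length and width only, with 20 random restarts.
clusters <- kmeans(iris[, 1:2], centers = 3, nstart = 20)

# Compare cluster membership against the known species.
cross_tab <- table(Cluster = clusters$cluster, Species = iris$Species)
within_ss <- clusters$tot.withinss   # total within-cluster sum of squares

print(cross_tab)
print(clusters$centers)
```

Cluster numbers are arbitrary labels, so cluster 1 need not correspond to the first species in the table.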

More Related Content

What's hot (20)

PDF
Introduction to Pandas and Time Series Analysis [PyCon DE]
Alexander Hendorf
 
PDF
서포트 벡터머신
Sunggon Song
 
PPT
3.7 outlier analysis
Krish_ver2
 
PPTX
Machine Learning with R
Barbara Fusinska
 
PPTX
Feature selection concepts and methods
Reza Ramezani
 
PPTX
Output primitives computer graphics c version
Marwa Al-Rikaby
 
PPT
Chapter 3 Image Processing: Basic Transformation
Varun Ojha
 
PPTX
Data preprocessing
Gajanand Sharma
 
PPT
Python command line_14_12_2020
Sugnan M
 
PPT
Data Preprocessing
Object-Frontier Software Pvt. Ltd
 
PPTX
Ml3 logistic regression-and_classification_error_metrics
ankit_ppt
 
PDF
Convolutional Neural Networks : Popular Architectures
ananth
 
PDF
Bayesian learning
Vignesh Saravanan
 
PDF
Machine Learning in R
Alexandros Karatzoglou
 
PPT
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
PDF
3 Data Structure in R
Dr Nisha Arora
 
PDF
Linear models for classification
Sung Yub Kim
 
PDF
R normal distribution
Learnbay Datascience
 
PDF
Linear discriminant analysis
Bangalore
 
PDF
Machine Learning Performance metrics for classification
Kuppusamy P
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Alexander Hendorf
 
서포트 벡터머신
Sunggon Song
 
3.7 outlier analysis
Krish_ver2
 
Machine Learning with R
Barbara Fusinska
 
Feature selection concepts and methods
Reza Ramezani
 
Output primitives computer graphics c version
Marwa Al-Rikaby
 
Chapter 3 Image Processing: Basic Transformation
Varun Ojha
 
Data preprocessing
Gajanand Sharma
 
Python command line_14_12_2020
Sugnan M
 
Ml3 logistic regression-and_classification_error_metrics
ankit_ppt
 
Convolutional Neural Networks : Popular Architectures
ananth
 
Bayesian learning
Vignesh Saravanan
 
Machine Learning in R
Alexandros Karatzoglou
 
Mining Frequent Patterns, Association and Correlations
Justin Cletus
 
3 Data Structure in R
Dr Nisha Arora
 
Linear models for classification
Sung Yub Kim
 
R normal distribution
Learnbay Datascience
 
Linear discriminant analysis
Bangalore
 
Machine Learning Performance metrics for classification
Kuppusamy P
 

Similar to R programming & Machine Learning (20)

PDF
Practical data science_public
Long Nguyen
 
PPTX
Data Exploration in R.pptx
Ramakrishna Reddy Bijjam
 
PDF
Introduction to r studio on aws 2020 05_06
Barry DeCicco
 
PPTX
R language introduction
Shashwat Shriparv
 
PPT
How to obtain and install R.ppt
rajalakshmi5921
 
PPT
Introduction to R for Data Science Technology
gufranqureshi506
 
PDF
Introduction to R programming
Alberto Labarga
 
PPT
Basics of R-Progranmming with instata.ppt
geethar79
 
PPT
17641.ppt
AhmedAbdalla903058
 
PPT
Slides on introduction to R by ArinBasu MD
SonaCharles2
 
PPT
17641.ppt
vikassingh569137
 
PPT
R for Statistical Computing
Mohammed El Rafie Tarabay
 
PDF
Basic and logical implementation of r language
Md. Mahedi Mahfuj
 
PDF
R basics
Sagun Baijal
 
PPT
Advanced Data Analytics with R Programming.ppt
Anshika865276
 
PPTX
R programming language
Alberto Minetti
 
PPTX
Aggregate.pptx
Ramakrishna Reddy Bijjam
 
PPTX
Introduction to R
Stacy Irwin
 
PDF
R Programming Reference Card
Maurice Dawson
 
PDF
R - the language
Mike Martinez
 
Practical data science_public
Long Nguyen
 
Data Exploration in R.pptx
Ramakrishna Reddy Bijjam
 
Introduction to r studio on aws 2020 05_06
Barry DeCicco
 
R language introduction
Shashwat Shriparv
 
How to obtain and install R.ppt
rajalakshmi5921
 
Introduction to R for Data Science Technology
gufranqureshi506
 
Introduction to R programming
Alberto Labarga
 
Basics of R-Progranmming with instata.ppt
geethar79
 
Slides on introduction to R by ArinBasu MD
SonaCharles2
 
17641.ppt
vikassingh569137
 
R for Statistical Computing
Mohammed El Rafie Tarabay
 
Basic and logical implementation of r language
Md. Mahedi Mahfuj
 
R basics
Sagun Baijal
 
Advanced Data Analytics with R Programming.ppt
Anshika865276
 
R programming language
Alberto Minetti
 
Aggregate.pptx
Ramakrishna Reddy Bijjam
 
Introduction to R
Stacy Irwin
 
R Programming Reference Card
Maurice Dawson
 
R - the language
Mike Martinez
 
Ad

Recently uploaded (20)

PPTX
Presentation.pptx hhgihyugyygyijguuffddfffffff
abhiruppal2007
 
PDF
TESDA License NC II PC Operations TESDA, Office Productivity
MELJUN CORTES
 
PPTX
Daily, Weekly, Monthly Report MTC March 2025.pptx
PanjiDewaPamungkas1
 
PPT
Reliability Monitoring of Aircrfat commerce
Rizk2
 
PDF
Microsoft Power BI - Advanced Certificate for Business Intelligence using Pow...
Prasenjit Debnath
 
PDF
ilide.info-tg-understanding-culture-society-and-politics-pr_127f984d2904c57ec...
jed P
 
PPTX
727325165-Unit-1-Data-Analytics-PPT-1.pptx
revathi148366
 
PDF
Business Automation Solution with Excel 1.1.pdf
Vivek Kedia
 
PPTX
Artificial intelligence Presentation1.pptx
SaritaMahajan5
 
PDF
Blood pressure (3).pdfbdbsbsbhshshshhdhdhshshs
hernandezemma379
 
PDF
IT GOVERNANCE 4-2 - Information System Security (1).pdf
mdirfanuddin1322
 
PPTX
Monitoring Improvement ( Pomalaa Branch).pptx
fajarkunee
 
PPTX
Indigo dyeing Presentation (2).pptx as dye
shreeroop1335
 
PPTX
Krezentios memories in college data.pptx
notknown9
 
PPTX
Module-2_3-1eentzyssssssssssssssssssssss.pptx
ShahidHussain66691
 
PDF
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
PPTX
MENU-DRIVEN PROGRAM ON ARUNACHAL PRADESH.pptx
manvi200807
 
PDF
5- Global Demography Concepts _ Population Pyramids .pdf
pkhadka824
 
PDF
A Web Repository System for Data Mining in Drug Discovery
IJDKP
 
Presentation.pptx hhgihyugyygyijguuffddfffffff
abhiruppal2007
 
TESDA License NC II PC Operations TESDA, Office Productivity
MELJUN CORTES
 
Daily, Weekly, Monthly Report MTC March 2025.pptx
PanjiDewaPamungkas1
 
Reliability Monitoring of Aircrfat commerce
Rizk2
 
Microsoft Power BI - Advanced Certificate for Business Intelligence using Pow...
Prasenjit Debnath
 
ilide.info-tg-understanding-culture-society-and-politics-pr_127f984d2904c57ec...
jed P
 
727325165-Unit-1-Data-Analytics-PPT-1.pptx
revathi148366
 
Business Automation Solution with Excel 1.1.pdf
Vivek Kedia
 
Artificial intelligence Presentation1.pptx
SaritaMahajan5
 
Blood pressure (3).pdfbdbsbsbhshshshhdhdhshshs
hernandezemma379
 
IT GOVERNANCE 4-2 - Information System Security (1).pdf
mdirfanuddin1322
 
Monitoring Improvement ( Pomalaa Branch).pptx
fajarkunee
 
Indigo dyeing Presentation (2).pptx as dye
shreeroop1335
 
Krezentios memories in college data.pptx
notknown9
 
Module-2_3-1eentzyssssssssssssssssssssss.pptx
ShahidHussain66691
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
MENU-DRIVEN PROGRAM ON ARUNACHAL PRADESH.pptx
manvi200807
 
5- Global Demography Concepts _ Population Pyramids .pdf
pkhadka824
 
A Web Repository System for Data Mining in Drug Discovery
IJDKP
 
Ad

R programming & Machine Learning

  • 2. AGENDA • Introduction to R • Packages Covered • Datasets Covered • Basics of R • Looping in R • Data Analysis in R • Machine Learning Algorithms
  • 3. Introduction to R • A Programming Language & free software environment for statistical computing. • Most popular Graphical User Interface(GUI), Widely used among Statisticians & Data Miners for developing statistical software and data analysis. • Highly Extensible through the use of user- submitted packages for specific functions.
  • 4. R Script R Console Global Environment R differs from RStudio. One can use R without using RStudio, but can't use RStudio without using R, so R comes first. Plots, Packages & Help Tab
  • 5. Shortcuts Used • Ctrl + L : Clears R Console. • Alt + - : Assigns a name to a variable. • Ctrl + Shift + M : Assigns Pipe Operator (%>%) • Ctrl + Shift + N : Opens a new R Script. • Ctrl + O : Opening an existing R Script. • Ctrl + S : Saving the current R Script. • Ctrl + Q : Quits the current R Session.
  • 7. • Amelia : A program for missing data. • animation : A Gallery of animations in Statistics & Utilities to create animations. • car : Companion to Applied Regression. • caret : Classification and Regression Training. • caTools : Tools : Moving window statistics, GIF, Base64, ROC AUC etc. • class : Functions for Classification. • corrplot : Visualization of a Correlation Matrix. • cowplot : Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. • dplyr : A Grammar of Data Manipulation. • e1071 : Misc Functions of the Department of Statistics, Probability Theory. • ggplot2 : Create Elegant Data Visualisations using the Grammar of Graphics. • ggplot2movies : Movies Data • ggrepel : Automatically position non-overlapping text labels with ggplot2.
  • 8. • hflights : Flights that departed Houston in 2011. • leaflet : Create Interactive Web Maps with the JavaScript ‘Leaflet’ library. • magrittr : A Forward-Pipe Operator for R. • Metrics : Evaluation Metrics for Machine Learning. • mlr : Machine Learning in R. • partykit : A Toolkit for Recursive Partytioning. • plyr : Tools for Splitting, Applying and Combining Data. • randomForest : Breiman & Cutler’s Random Forests for classification & regression. • rpart : Recursive Partitioning and Regression Trees. • scales : Scale Functions for Visualization. • tibble : Simple Data Frames. • tidyr : Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions. • VIM : Visualization and Imputation of Missing values.
  • 10. • In-Built datasets like iris, mtcars, state.x77 are also used extensively apart from the following mentioned datasets : 1. Bike Data : A 121 x 9 dataset to understand basic indexing in R. 2. Future 500 : A 500 x 10 dataset to understand data visualization. 3. Movies Data : A 58788 x 24 dataset that comes under ggplot2movies package. 4. Weather (Australia) : A 144187 x 24 dataset to explore tools of dplyr package. 5. Flights Data : A 227496 x 21 dataset that comes under hflights package. 6. Meterology Data : Weather datasets of 4 cities in US to understand looping in R. 7. Big Mart Sales Data : A dataset to understand Exploratory Data Analysis in R. 8. Basketball, Wine Quality Data : Datasets to understand Decision Tree mechanism. 9. Titanic, Home Credit Data : Datasets downloaded from Kaggle to understand ML Basics. 10. Income Data : A dataset to further practice Supervised Machine Learning Algorithms. 11. Social Network Ads Data : A 400 x 5 dataset to understand k-NN Algorithm. 12. Uber Data : A dataset showing Month-wise details of customers boarding cabs & their location.
  • 12. • weekend <- c(“Sat", “Sun") : Saves a variable named weekend in the environment containing 2 character values : Sat, Sun • data <- read.csv(file.choose(),header = T,sep = ",",na.strings = c(“ “)) : Standard Code for Reading csv file in R. • class(data) : States the class type of the dataset. • str(data) : Displays the structure/class of all the variables within the data. • summary(data) : Shows the minimum, maximum, mean, median, 1st & 3rd Quartiles respectively. • df <- as.data.frame(data) : Stores data into a data frame. • matx <- as.matrix(data) : Stores data into a matrix. • getwd() : Displays the working directory. • rownames(data) : Shows all the row names of the data. • colnames(data) : Shows all the column names of the data. • nrow(data) : Shows the count of number of rows in the dataset. • ncol(data) : Counts the number of columns in the dataset.
  • 13. • length(data) : Shows the length (variables count) of the dataset. • install.packages(“packagename”) : Installing a package in R • library(package) : Activate a package to further perform functions. • names(data) : Displays all the variables of the dataset. • dim(data) : Shows the dimensions of the data (No. of rows & columns). • sum(is.na(data)) : Shows the total NA values present in the dataset. • attach(df) : Attaching a dataset named df in R. • detach(df) : Detaching a dataset named df. • head(data) : Prints first 6 rows of the dataset. • tail(data) : Prints last 6 rows of the dataset. • print(tibble, n=20) : Prints first 20 rows of the dataset.
  • 14. • data$variable<-as.character(data$variable) : Saving the variable as character type. • data[1,] : Prints 1st row of the dataset. • data[,3] : Prints 3rd column of the dataset. • table(data$variable) : Tabular view of the variable (Frequency Distribution). • round(mean(data$variable,na.rm = T),2) : Rounds off the mean of variable(removing NA values) upto 2 decimals. • sort(prop.table(data$variable),decreasing=T) : Tabular view of proportions of variable totals in descending order. • complete.cases(variable) : ! is.na(variable) : Boolean Output (True/False) of Non-NA Values(Shows True for Non-NA Values). • list1 <- list(mtcars, iris) : Stores a variable named list1 containing list of 2 mentioned datasets in the global environment. • which(iris$Species == "setosa") : Displays all the row numbers where the given criteria is satisfied. • rm(list = ls()) : Clears all the stored datasets and variables from the global environment.
  • 16. A <- c("what","is","truth") • COMMAND : if("Truth" %in% A){ print("Truth is found") }else{ print("Not Found") } • OUTPUT : Not Found A <- c(1:10) • COMMAND : for(i in 1:3){ print(A) } • OUTPUT : 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 apply(mtcars, 1, mean) • OUTPUT : Displays mean of all variables row-wise. lapply(weather, t) [6th dataset (9th Slide)] • OUTPUT : Displays transpose of data in the form of a list. lapply(weather, "[",1) : Access 1st column of all datasets in the list. • Creating a Function : missingvalue <- function(x){ return(sum(is.na(x))) } sapply(weather, missingvalue) • OUTPUT : Displays count of missing values in the form of a table. • Alternative Code for the same task : sapply(weather, function(x) sum(is.na(x)))
  • 18.
    1. The first step in dealing with a typical business problem is to form a hypothesis, then perform Exploratory Data Analysis (Steps 2 to 6).
    2. Data will be received - Extract all the Variables/Features from the data.
    3. Understand the Data in Detail - Find Patterns, Outliers, Anomalies & Missing Values.
    4. Perform Univariate Analysis - Single Variable Analysis.
    5. Conduct Multivariate Analysis - Categorical Vs. Categorical, Numerical Vs. Categorical, Numerical Vs. Numerical Variables.
    6. Missing Values Imputation, Treatment of Outliers & Anomalies (if any).
    7. Apply Feature Engineering - Variable Transformation.
    8. Perform Scaling of the Data - Bringing the variables onto a comparable scale (e.g. mean 0 and standard deviation 1).
    9. Apply ML Algorithm, make predictions & test the Accuracy.
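Steps 6 and 8 of the workflow can be sketched in base R. The example below is only an illustration: it assumes the built-in airquality dataset, median imputation and z-score scaling, none of which are prescribed by the slides.

```r
# Work on a copy of the built-in airquality data, which contains NA values
aq <- airquality
na_before <- sum(is.na(aq))

# Step 6 (sketch): replace missing values in each numeric column by its median
for (col in names(aq)) {
  if (is.numeric(aq[[col]])) {
    missing_rows <- is.na(aq[[col]])
    aq[[col]][missing_rows] <- median(aq[[col]], na.rm = TRUE)
  }
}
na_after <- sum(is.na(aq))

# Step 8 (sketch): centre each column to mean 0 and scale to standard deviation 1
aq_scaled <- as.data.frame(scale(aq))

print(c(before = na_before, after = na_after))
print(round(colMeans(aq_scaled), 10))
```

Scaling changes the location and spread of a variable, not the shape of its distribution, so a skewed variable remains skewed after this step.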
  • 19.
    • Pipe Operator : %>% : Passes the object on the left hand side as the first argument of the function on the right hand side.
      E.G : iris %>% names() : Lists all the variable names of the iris dataset.
    • Tibble : iris <- tbl_df(iris) : Converts data to the tibble class, which is easier to examine than a data frame. (tbl_df() is deprecated in recent dplyr versions; as_tibble(iris) does the same job.)
    • Select() : iris %>% select(Sepal.Length, Petal.Length, Species) : Selects columns by name.
    • Filter() : filter(mtcars, cyl>4 & gear>4) : Extracts rows that meet logical criteria.
    • Rename() : iris <- iris %>% rename(Sepal_Length = Sepal.Length) : Renames a variable and keeps all other columns.
  • 20.
    • Select_If() : iris.num <- iris %>% select_if(is.numeric) : Selects the columns that satisfy a condition (here, numeric columns).
    • Match Operator ( %in% ) : 8 %in% c(1,2,9,5,3,6,7,4,5) : OUTPUT : FALSE
      iris %>% filter(Species %in% c("setosa", "virginica")) %>% summary() : Summarises the rows where the condition is TRUE.
    • Helper Functions :
      select(iris, contains(".")) : Select columns whose name contains a character string.
      select(iris, starts_with("Sepal")) : Select columns whose name starts with a character string.
      select(iris, ends_with("Length")) : Select columns whose name ends with a character string.
      select(iris, everything()) : Select every column.
      select(iris, Sepal.Length:Petal.Width) : Select all columns between Sepal.Length and Petal.Width (inclusive).
      select(iris, -Species) : Select all columns except Species.
      iris <- iris %>% select(Sepal_Length = Sepal.Length) : Selects a column and renames it in one step. Unlike rename(), this drops every column that is not selected.
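A small sketch of the selection helpers on iris, assuming the dplyr package is installed (version 1.0.0 or later for where()). It also shows the difference between renaming with rename() and renaming inside select(). Object names are illustrative.

```r
library(dplyr)

# Columns whose name starts with "Sepal"
sepal_cols <- iris %>% select(starts_with("Sepal")) %>% names()

# Numeric columns only; where() is the current equivalent of select_if()
numeric_iris <- iris %>% select(where(is.numeric))

# rename() keeps all 5 columns
renamed <- iris %>% rename(Sepal_Length = Sepal.Length)

# select() with a new name keeps only the selected column
selected <- iris %>% select(Sepal_Length = Sepal.Length)

print(sepal_cols)
print(dim(renamed))
print(dim(selected))
```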
  • 21.
    • Arrange() :
      hflights %>% arrange(DepDelay) : Arranges the data in ascending order of DepDelay.
      hflights %>% arrange(desc(DepDelay)) : Arranges the data in descending order of DepDelay.
    • Group_By() : iris %>% group_by(Species) : Groups rows that share the same value of Species.
    • Summarise() :
      iris %>% group_by(Species) %>% summarise(Count=n()) : Counts the rows for each Species.
      hflights %>% group_by(Month) %>%
        summarise(dist = sum(Distance)) %>%
        summarise(Minimum_Distance = min(dist), Maximum_Distance = max(dist),
                  Mean_Distance = mean(dist), Median_Distance = median(dist))
      : Summarises the min., max., mean and median of total monthly distances.
    • Tally() : hflights %>% group_by(Month) %>% tally(sort = T) : Counts the rows in each group and returns the counts in a column named n, sorted in descending order.
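The grouping verbs can be tried on mtcars when hflights is not installed. This is a sketch assuming the dplyr package; mtcars and the column name Mean_Mpg are substitutions made for illustration.

```r
library(dplyr)

# Count rows and average mpg per number of cylinders, largest group first
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(Count = n(), Mean_Mpg = mean(mpg)) %>%
  arrange(desc(Count))
print(by_cyl)

# count() groups and tallies in one step
counted <- mtcars %>% count(cyl, sort = TRUE)
print(counted)
```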
  • 22.
    • Mutate() : mutate(iris, Sepal = Sepal.Length + Sepal.Width) : Computes and appends one or more new columns.
    • Transmute() : transmute(iris, sepal = Sepal.Length + Sepal.Width) : Computes one or more new columns and drops the original columns.
    • Slice() : slice(hflights, 100:106) : Selects rows by position.
    • Distinct() : iris %>% select(Species) %>% distinct() : Removes duplicate rows.
    • If_Else() :
      df <- data.frame(x=c(1,NA,6,5))
      df <- df %>% mutate(New_Variable = if_else(x<5, x+1, x+2, 0))
      : The last argument (missing) is the value used where the condition is NA.
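The if_else() example can be run as follows, assuming the dplyr package is installed. The fourth argument is written by name (missing) for readability.

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 6, 5))

# x < 5  -> x + 1
# x >= 5 -> x + 2
# x NA   -> 0 (the `missing` argument)
df <- df %>%
  mutate(New_Variable = if_else(x < 5, x + 1, x + 2, missing = 0))

print(df)
# New_Variable is 2, 0, 8, 7
```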
  • 23.
    • Union() : union(y, z) : union(mtcars[1:16,], mtcars[17:32,]) : Rows that appear in either or both y and z.
    • Intersect() : intersect(y, z) : intersect(mtcars[1:16,], mtcars[16:32,]) : Rows that appear in both y and z.
    • Between() : hflights %>% filter(between(DepTime,600,605)) : Displays rows where the value of DepTime lies between the specified values (both ends included).
    • Count() : hflights %>% count(Month, sort = T) : A shortcut that does the grouping and creates the frequency table in one step.
    • Bind_Rows() : bind_rows(y, z) : bind_rows(mtcars[1:16,], mtcars[17:32,]) : Appends z to y as new rows.
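The set operations are easier to follow on tiny inputs. The sketch below assumes the dplyr package; y and z are made-up one-column data frames, not the mtcars slices from the slide.

```r
library(dplyr)

y <- data.frame(id = 1:3)
z <- data.frame(id = 3:5)

u <- dplyr::union(y, z)       # rows in y, in z, or in both (duplicates removed)
i <- dplyr::intersect(y, z)   # rows present in both y and z
b <- bind_rows(y, z)          # z appended to y, duplicates kept

# between() includes both end points
in_range <- between(c(599, 600, 605, 606), 600, 605)

print(u)
print(i)
print(b)
print(in_range)
```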
  • 24.
    • User Input : number <- as.integer(readline(prompt = "Enter the number: "))
    • Scatter Plot (Along with Smooth Line) :
      ggplot(mtcars, aes(x=mpg, y=hp)) +
        geom_point(color='red', size=5, shape=7, alpha=0.5) +
        labs(title="Scatter-Plot", subtitle="Mpg Vs. Hp", x="mpg", y="hp") +
        geom_smooth(fill = NA, size = 1.5, method = lm, color= "blue") +
        theme_bw()
    • Box Plot :
      ggplot(mtcars, aes(x=factor(cyl), y=mpg)) +
        geom_boxplot(aes(fill=factor(cyl)), alpha = 0.75) +
        scale_fill_discrete(name = "Cyl") +
        labs(title = "Box-Plot", x="Cyl", y="Mpg") +
        theme_classic()
    • Bar Plot :
      iris %>% group_by(Species) %>% summarise(Count=n()) %>%
        ggplot(aes(Species, Count)) +
        geom_bar(stat="identity", fill = "green") +
        labs(title = "Bar-Plot", x="Species", y="Count") +
        geom_label(aes(Species, Count, label = Count)) +
        theme(axis.text.x = element_text(angle=45, hjust = 1))
    • Histogram :
      ggplot(movies, aes(rating)) +
        geom_histogram(aes(fill= ..count..), binwidth = 0.1) +
        ggtitle("Histogram") + xlab("Ratings") + ylab("Count") +
        theme_minimal()
  • 25.
    • Density Plot :
      ggplot(movies, aes(rating)) +
        geom_density(color = "red") +
        labs(title="Density Plot", subtitle="Movies Data", x="Ratings", y="Density", caption="Source : Movies Data") +
        theme_grey()
    • Heat Map (Bin2d Map) :
      ggplot(movies, aes(x=year, y=rating)) +
        geom_bin2d() +
        labs(title = "Heat Map", subtitle = "Year Vs. Rating", x="Year", y="Rating", caption="Source : Movies Data") +
        theme_classic()
    • Violin Plot :
      iris %>% ggplot(aes(Species, Sepal.Length)) +
        geom_violin(fill = "red", alpha = 0.75) +
        labs(title = "Violin-Plot", subtitle = "Species Vs. Sepal.Length", caption = "Source : Iris Data")
    • Correlation Plot :
      corrplot(cor(mtcars), method = "circle") : Works only with data in which all variables are numeric.
    • Cowplot :
      plots <- plot_grid(A, B, nrow = 1) : Here, A, B and C are 3 stored plots.
      plot_grid(plots, C, ncol = 1) : This command draws all 3 plots together.
  • 26.
    • GG Repel :
      ggplot(mtcars, aes(x=mpg, y=hp)) +
        geom_point(color='red', size=5) +
        geom_text_repel(aes(label = rownames(mtcars)), color='blue') +
        theme_minimal()
    • Faceting & Flipping :
      Facet divides a plot into sub-plots.
      ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
        geom_point(aes(color=Species, size=Species)) +
        facet_wrap(~Species)
      ggplot(mtcars, aes(x = mpg, y = hp)) +
        geom_point(aes(color=factor(cyl)), size=5) +
        facet_grid(.~cyl) +
        coord_flip()
    • Tidyr :
      future500$Revenue <- gsub("$", "=", future500$Revenue, fixed = TRUE) : Replaces the $ symbol with the = symbol in all the Revenue entries. gsub() is a base R function; fixed = TRUE is needed because $ otherwise means "end of string" in a regular expression.
      future500 <- separate(future500, State, into=c("State", "City"), sep = ",") : Separates "State" into 2 separate columns.
    • Leaflet :
      leaflet() %>% addTiles() %>% addMarkers(lat = 40.7223, lng = -73.9887)
      OUTPUT : Displays a map view of the location for the latitude and longitude entered.
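The behaviour of $ inside gsub() is worth checking on a small vector. The sketch below uses made-up revenue strings (an assumption; the future500 data is not reproduced here) and compares the regular-expression and literal interpretations.

```r
revenue <- c("$1,200", "$950")

# As a regular expression, "$" matches the end of the string,
# so "=" is appended and the dollar sign stays in place
anchored <- gsub("$", "=", revenue)

# fixed = TRUE treats "$" as a literal character
literal <- gsub("$", "=", revenue, fixed = TRUE)

# Escaping the dollar sign gives the same literal match
escaped <- gsub("\\$", "=", revenue)

print(anchored)   # "$1,200=" "$950="
print(literal)    # "=1,200"  "=950"
print(escaped)    # "=1,200"  "=950"
```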
  • 27.
    • VIM :
      aggr(future500) : Calculates or plots the amount of missing/imputed values in each variable of the dataset.
    • Substr() Command :
      substr(iris$Species, 1, 3) : Extracts the first 3 characters of each value in the Species variable.
    • Ifelse() Command :
      iris$Species_Num <- ifelse(iris$Species=="setosa", 1, ifelse(iris$Species=="versicolor", 2, 3))
    • Revalue() Command : [plyr Package]
      iris$Species <- revalue(iris$Species, c("setosa"="1","versicolor"="2","virginica"="3"))
    • Recode() Command : [car Package]
      iris$Species <- recode(iris$Species, "c('setosa')='Type-1'")
    • Writing a CSV File :
      write.csv(iris, file = "Iris Dataset.csv", row.names = F)
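The nested ifelse() recoding can be compared with a named lookup vector, which is a base R alternative and not the method shown on the slide. The names species_num, lookup and prefix are illustrative.

```r
# Nested ifelse(): setosa -> 1, versicolor -> 2, everything else -> 3
species_num <- ifelse(iris$Species == "setosa", 1,
                      ifelse(iris$Species == "versicolor", 2, 3))

# Alternative sketch: a named lookup vector indexed by the species name
lookup <- c(setosa = 1, versicolor = 2, virginica = 3)
species_num2 <- unname(lookup[as.character(iris$Species)])

# First 3 characters of each species name
prefix <- substr(as.character(iris$Species), 1, 3)

print(table(species_num))
print(unique(prefix))
```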
  • 28.
    • Strsplit() Command :
      strsplit(future500$State, ",") : Splits each value wherever the , separator occurs. The variable must be of character type.
    • Regression :
      summary(lm(target_variable~., data=dataset)) : The target variable must be of numeric class. When comparing models, the higher the Adjusted R^2, the better the model.
    • Impute() : [mlr Package]
      imputeddata <- impute(future500, classes = list(factor = imputeMode(), numeric = imputeMedian()))
      future500 <- imputeddata$data
    • Na.Roughfix : [randomForest Package]
      future500 <- na.roughfix(future500) : Imputes missing values by the median (numeric variables) or the mode (factor variables).
    • Skewness and Kurtosis :
      skewness(iris$Sepal.Length) : Gives the skewness of the numeric variable.
      kurtosis(iris$Sepal.Width) : Gives the kurtosis of the numeric variable.
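Adjusted R^2 can be read from the model summary and checked against its definition, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors. The sketch uses mtcars with two predictors chosen only for illustration.

```r
# Linear model with 2 predictors on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Adjusted R^2 as reported by summary()
adj_r2 <- fit_summary$adj.r.squared

# The same quantity computed from its definition
n <- nrow(mtcars)   # number of observations
p <- 2              # number of predictors (wt and hp)
adj_r2_manual <- 1 - (1 - fit_summary$r.squared) * (n - 1) / (n - p - 1)

print(c(reported = adj_r2, manual = adj_r2_manual))
```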
  • 29.
    • Splitting Iris Data : [caret Package]
      index = createDataPartition(iris$Species, p=0.5, list = F)
      train <- iris[index,]
      test <- iris[-index,]
    • Splitting Iris Data : [caTools Package]
      split.data <- sample.split(iris$Species, SplitRatio = 0.75)
      training_set <- subset(iris, split.data==T)
      test_set <- subset(iris, split.data==F)
    • Train Dataset - Data having the predictor variables and the target variable.
    • Test Dataset - The model is tested on this data for accuracy.
    • combined <- bind_rows(train, test) : Dataset on which Feature Engineering is performed.
    • train_new and test_new are 2 datasets extracted from combined, having the same dimensions as the train and test data, with no missing values (except the Target Variable in test_new) & additional features.
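When neither caret nor caTools is available, a stratified split can be sketched in base R. This is a hand-written substitute, not the implementation used by createDataPartition() or sample.split(); the 75% ratio mirrors the caTools example.

```r
set.seed(123)   # makes the random split reproducible

# Row numbers grouped by species, then 75% sampled within each group
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = floor(0.75 * length(rows)))
}))

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

print(table(train$Species))
print(table(test$Species))
```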
  • 31.
    • Setting Seed :
      set.seed(123) : Makes the random sample reproducible; any number can be used. This line is written every time before running a model or before sampling.
    • Decision Tree :
      model_dtree <- rpart(target_variable~., data = train_new, method = "class", control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4)) [Method can be class or anova]
    • Visualizing Decision Tree :
      plot(as.party(model_dtree))
    • Predicting Decision Tree Outcomes :
      predict_dtree <- predict(model_dtree, test_new, type = "class")
    • Creating Confusion Matrix :
      DtreeCM <- confusionMatrix(predict_dtree, actual_data$target_variable)
    • Checking Accuracy :
      percent(as.numeric(DtreeCM$overall[1]))
    • Altering the 3 parameters minsplit, minbucket & maxdepth of a Decision Tree : Hyperparameter Tuning
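The decision tree steps can be run end to end on iris. This sketch assumes the rpart package (shipped with standard R installations) and builds the confusion matrix with table() instead of caret's confusionMatrix(); the split and the control values are illustrative choices for a 150-row dataset.

```r
library(rpart)

set.seed(123)                          # reproducible sampling
idx <- sample(nrow(iris), 100)         # 100 rows for training, 50 for testing
train_new <- iris[idx, ]
test_new  <- iris[-idx, ]

# Classification tree with explicit control parameters
model_dtree <- rpart(Species ~ ., data = train_new, method = "class",
                     control = rpart.control(minsplit = 20, minbucket = 7,
                                             maxdepth = 4))

# Predicted class for each test row
predict_dtree <- predict(model_dtree, test_new, type = "class")

# Confusion matrix: predictions in rows, actual classes in columns
cm <- table(Predicted = predict_dtree, Actual = test_new$Species)

# Accuracy = correctly classified rows / all rows
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```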
  • 32.
    • Random Forest :
      model_rf <- randomForest(factor(target_variable) ~., data = train_new) [method = "rf" belongs to caret's train(), not to randomForest(), so it is dropped here]
    • Predicting Random Forest Outcomes :
      predict_rf <- predict(model_rf, test_new, type = "response")
    • Creating Confusion Matrix :
      DtreeRF <- confusionMatrix(predict_rf, actual_data$target_variable)
    • Checking Accuracy :
      percent(as.numeric(DtreeRF$overall[1]))
    • C Forest :
      model_cf <- cforest(as.factor(target_variable)~., data = train_new)
    • Predicting C Forest Outcomes :
      predict_cf <- predict(model_cf, test_new, type = "response", OOB = T)
    • Creating Confusion Matrix :
      DtreeCF <- confusionMatrix(predict_cf, actual_data$target_variable)
    • Checking Accuracy :
      percent(as.numeric(DtreeCF$overall[1]))
  • 33.
    • Linear Regression :
      model_lm <- lm(target_variable~., data = train_new)
    • Visualizing Linear Regression Model :
      par(mfrow=c(2,2)) ; plot(model_lm)
    • Predicting Linear Regression Outcomes :
      predict_lm <- predict(model_lm, test_new, type = "response")
    • Checking Root Mean Square Error (RMSE) :
      rmse(actual_data$target_variable, predict_lm) : Lower the RMSE, better the model.
    • Logistic Regression : [Used for Classification Problem]
      model_glm <- glm(target_variable~., data = train_new, family = binomial) : family = binomial is required; without it glm() fits an ordinary (gaussian) linear model.
    • Visualizing Logistic Regression Model :
      par(mfrow=c(2,2)) ; plot(model_glm)
    • Predicting Logistic Regression Outcomes :
      predict_glm <- predict(model_glm, test_new, type = "response") : Returns predicted probabilities.
      predict_class <- factor(ifelse(predict_glm > 0.5, 1, 0)) : Converts the probabilities to classes (assuming a 0/1 target and a 0.5 cut-off).
    • Checking Accuracy & Kappa Value :
      confusionMatrix(predict_class, factor(actual_data$target_variable)) : Predictions come first, actual values second. Higher the Kappa, better the model.
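Both model types can be tried in base R on mtcars. The sketch below is an illustration under stated assumptions: am (0/1) as the classification target, wt and hp as predictors, a 0.5 cut-off, and RMSE computed by hand instead of through an rmse() helper. Predictions are made on the training data only to keep the example short.

```r
# Linear regression and RMSE computed from its definition
model_lm   <- lm(mpg ~ wt + hp, data = mtcars)
predict_lm <- predict(model_lm, mtcars)
rmse_value <- sqrt(mean((mtcars$mpg - predict_lm)^2))

# Logistic regression: family = binomial makes glm() fit a logit model
model_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# type = "response" returns probabilities between 0 and 1
prob <- predict(model_glm, mtcars, type = "response")

# Convert probabilities to classes with a 0.5 cut-off
pred_class <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
actual     <- factor(mtcars$am, levels = c(0, 1))

# Confusion matrix and accuracy
cm <- table(Predicted = pred_class, Actual = actual)
accuracy <- sum(diag(cm)) / sum(cm)

print(rmse_value)
print(cm)
print(accuracy)
```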
  • 34.
    • k-NN Algorithm :
      The k-NN model runs only on numerical variables, therefore all categorical columns are removed while building the model.
      model.knn <- knn(train = training_set[,-5], test = test_set[,-5], cl = training_set[,5], k = 10)
      The 5th column of the Iris dataset, being of factor type, is excluded from the predictors and supplied as the class labels (cl).
    • Tuning k-NN :
      summary(tune.knn(x=training_set[,-5], y=training_set[,5], k=1:20))
    • k-Means Clustering :
    • Forming Cluster :
      sepalclusters <- kmeans(iris[,1:2], 3, nstart = 20)
      table(sepalclusters$cluster, iris$Species)
    • Visualizing Clusters via Animation :
      kmeans.ani(iris[,1:2], centers = 3)
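A runnable sketch of both algorithms on iris, assuming the class package (shipped with standard R installations) for knn(). The random 100/50 split is an illustrative choice, and no particular accuracy is claimed.

```r
library(class)

set.seed(123)                           # reproducible split and clustering
idx <- sample(nrow(iris), 100)
training_set <- iris[idx, ]
test_set     <- iris[-idx, ]

# k-NN: numeric columns as predictors, the 5th column (Species) as labels
model.knn <- knn(train = training_set[, -5], test = test_set[, -5],
                 cl = training_set[, 5], k = 10)
knn_accuracy <- mean(model.knn == test_set$Species)

# k-means on the two sepal columns with 3 clusters and 20 random starts
sepalclusters <- kmeans(iris[, 1:2], centers = 3, nstart = 20)
cluster_table <- table(sepalclusters$cluster, iris$Species)

print(knn_accuracy)
print(cluster_table)
```

Cluster numbers from kmeans() are arbitrary labels, so the table shows how clusters line up with species rather than a direct accuracy.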