Basics of R programming
for analytics
Course code – PGP 207
PGP MCB 2023-25
Term: II
What is R
• R is a statistical programming environment.
• A statistical programming environment is one where you can both write code and do data analysis.
• It differs from SPSS, SAS, and other statistical packages.
• You can use it for more than just data analysis.
• R stores everything in the form of objects.
• You can combine R with other writing environments, such as LaTeX and Markdown, to write reports.
Why Use R?
• It is a great resource for data analysis, data visualization, data science and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering and data reduction)
• It makes it easy to draw graphs such as pie charts, histograms, box plots and scatter plots
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be used to solve different problems
Obtaining R
• The best way to obtain R is to visit the CRAN website
• http://cran.r-project.org
• You will need Internet access to download the files
• Installation of R depends on the platform you have:
• Select the appropriate binary version
• A binary version is the precompiled (machine-code) build that installs R directly
Appearance of CRAN
Obtaining additional R Packages
• To work with R you will need additional packages
• These packages are a combination of data and functions
• The packages are kept in package repositories
• To use a package you have to install it and then load it
• Installing: use install.packages("name of the package", repos = "", dependencies = TRUE), where repos takes the URL of the repository and can be left out to use the default
• Loading: use library(name of the package) or require(name of the package) [use either]
Using R with an IDE
• It is always a good idea to use R with an integrated development environment (IDE)
• An IDE helps you write code and view the output at the same time
• You can also browse the objects, data, and graphs in the IDE
• The IDE used in this set of exercises is RStudio
• RStudio is free and open source, and you can download it from http://rstudio.com
• Download the RStudio Desktop version for use in these modules
• Install R first and then RStudio
Download Page of RStudio
Your Setup to Get Started
Source window: used to edit a script and run it.
Console window: used to load a particular package or run a particular command.
Workspace window: stores all the variables used during execution of commands, shown under the Environment tab.
Plots and Files window: the Files tab is used to track the working directory; the Plots tab shows all the graphical output.
What can we put in [>] and take out [<] from R?
• From Spreadsheets [ > ]
• Source Code Files [ > ]
• From other Software [ > ]
• Text Based Data [ > ] [ < ]
• Tables of Data [ > ] [ < ]
• Images [ < ]
• Dump Files [ < ]
Assignment 1
Find the answers to log2(2^5) and log(exp(1)*exp(1)).
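As a quick check of Assignment 1, both expressions can be typed straight into the console. The sketch below uses only base R; the variable names a and b are chosen here for illustration. Because log2() is the base-2 logarithm and log() is the natural logarithm, the results should come out as 5 and 2, up to floating-point rounding.

```r
# log2(2^5): the base-2 logarithm undoes the power of two
a <- log2(2^5)

# log(exp(1) * exp(1)): natural logarithm of e squared
b <- log(exp(1) * exp(1))

print(a)
print(b)
```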
Data frame in RStudio
ID <- c(1, 2, 3, 4, 5)
Name <- c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal")
English <- c(45, 65, 72, 80, 57)
Hindi <- c(65, 78, 56, 45, 48)
Science <- c(45, 55, 68, 74, 63)
So_Science <- c(58, 69, 63, 77, 52)
Math <- c(88, 63, 59, 70, 76)
Stu_marks <- data.frame(Name, English, Hindi, Science, So_Science, Math)
View(Stu_marks)
# extracting single column from given dataframe
Stu_marks$Math
Stu_marks$Hindi
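Create a new data frame with the columns:
Name
Computer_app
EVS
Then merge it with the first data frame by entering the command:
New_df_name <- merge(df1, df2, by = "Name")   # "Name" must be a column in both df1 and df2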
View(New_df_name)
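A minimal sketch of the merge step, assuming both data frames share a key column called Name. The Computer_app and EVS marks below are hypothetical values made up for illustration, and df1 keeps only one subject column to stay short.

```r
# Marks already recorded for each student (subset of the earlier example)
df1 <- data.frame(
  Name = c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal"),
  Math = c(88, 63, 59, 70, 76)
)

# New data frame; the Computer_app and EVS marks are hypothetical values
df2 <- data.frame(
  Name = c("Komal", "Hardik", "Chaitali", "Kaushik", "Ramesh"),
  Computer_app = c(71, 64, 80, 58, 90),
  EVS = c(66, 75, 69, 82, 54)
)

# merge() matches rows on the shared key column, regardless of row order
merged <- merge(df1, df2, by = "Name")
print(merged)
```

Because merge() matches on the key rather than on row position, the students can appear in a different order in each data frame.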
Packages in R
1. A package is a collection of R functions, compiled code and sample data.
2. Packages are stored under a directory called library in the R environment.
By default, R installs a set of packages.
To see the packages installed in R, enter the following command in the console window:
> library()
> fraction (firstVar/secondVar)
Introduction to R script
An R script is a plain text file in which you can store your R code.
A script lets you show your work to others and also reproduce and modify the results.
How to find the working directory?
In the console window write:
> getwd()
The current working directory is shown in the output.
How to set the current working directory?
> setwd("path/to/folder")   # replace with the path to your own folder
How to read and store a "csv" file in R?
Type the following command in the console window:
file_name <- read.csv("file_name.csv")
To view the file, enter the command:
View(file_name)
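The read step can be tried without any existing file by first writing a small CSV to a temporary folder. This is only a sketch: the data values, the file name scores_example.csv and the variable names are assumptions made for illustration.

```r
# Build a small data frame (hypothetical values) to write out
scores <- data.frame(id = 1:3, score = c(45, 65, 72))

# Write it to a temporary CSV file so the example is self-contained
csv_path <- file.path(tempdir(), "scores_example.csv")
write.csv(scores, csv_path, row.names = FALSE)

# Read it back; header = TRUE is the default for read.csv()
scores_in <- read.csv(csv_path)
str(scores_in)
```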
How to create a data frame in R?
> names <- c("Rohit", "Dhoni", "Virat", "Hardik", "KL Rahul", "Bumrah")
> played <- c(45, 49, 47, 47, 40, 25)
> won <- c(22, 21, 14, 9, 9, 8)
> lost <- c(12, 13, 14, 8, 19, 6)
> y <- c(2008, 2004, 2007, 2009, 2010, 2010)
> cricket_players <- data.frame(names, played, won, lost, y)
> View(cricket_players)
You can access parts of the data frame with the following commands:
> cricket_players$names
> cricket_players$won
Suppose we want to find the ratio of games won to games played:
> ratio <- cricket_players$won / cricket_players$played
The ratio is stored in a new column called "victory":
> cricket_players$victory <- ratio
To reduce the number of significant digits displayed in the victory column:
> options(digits = 2)
> View(cricket_players)
> mean(cricket_players$played)
> plot(factor(cricket_players$names), cricket_players$played)   # names must be a factor to be plotted on the x-axis
Inputting a Source File
A source file contains all the code that you need to run your analyses. It is used to input data and commands to R. You ask R to run your code by typing:
source("file.R")
Remember to save the code with the extension ".R".
Code to read data from console to R
mylar <- scan("", what = numeric())
▪ Reads directly from console
▪ Saves the numbers to a variable
Code to read data from text files
• Use read.csv() to read comma-separated value (csv) files
• You need to indicate whether the file has a header row
• Here we set the variable names manually
mydata <- read.csv("DOB.csv", header = TRUE, sep = ",")
names(mydata) <- c("Id", "Time", "DOB")
SUGGESTED TEXT BOOKS
• Grolemund, Garrett, Hands-On Programming with R: Write Your Own Functions and Simulations, Mumbai: Shroff Publishers & Distributors
• Chambers, John M., Software for Data Analysis: Programming with R, USA: Springer
E-Resources
• https://www.tutorialspoint.com/r/index.htm
• https://www.w3schools.com/r/r_intro.asp
• https://www.javatpoint.com/r-tutorial
Comments in R
Comments can be used to explain R code and to make it more readable.
They can also be used to prevent execution when testing alternative code.
Comments start with a #. When executing code, R ignores everything from the # to the end of the line.
Example: This example uses a comment before a line of code:
# This is a comment
"Hello World"
Example: This example uses a comment at the end of a line of code:
"Hello World" # This is a comment
Comments do not have to be text that explains the code; they can also be used to prevent R from executing code:
# "Good morning!"
"Good night!"
Reserved Words in R
Reserved words in R programming are a set of words that have
special meaning and cannot be used as an identifier (variable
name, function name etc.)
Reserved words in R
if           else      repeat       while          function
for          in        next         break          TRUE
FALSE        NULL      Inf          NaN            NA
NA_integer_  NA_real_  NA_complex_  NA_character_  ...
Identifiers in R
Variables in R
Variables are used to store data whose value can be changed as needed. The unique name given to a variable (or to a function or object) is its identifier.
Rules for writing Identifiers in R
1.Identifiers can be a combination of letters, digits, period (.) and
underscore (_).
2.It must start with a letter or a period. If it starts with a period, it
cannot be followed by a digit.
3.Reserved words in R cannot be used as identifiers.
Valid identifiers in R
Total, sum, fine.with.dot, Number5, this_is_acceptable
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
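The rules can be checked in the console with base R's make.names(), which rewrites any string into a syntactically valid name; a candidate that comes back unchanged was already valid. This is a sketch: the candidates mirror the examples in the lists above, with ".0ne" standing for a period followed by a digit, and the variable names are chosen here for illustration.

```r
# Candidate names modelled on the valid and invalid examples
candidates <- c("Total", "sum", "fine.with.dot", "Number5",
                "this_is_acceptable", "tot@l", "5um", "_fine", "TRUE", ".0ne")

# A name is syntactically valid when make.names() leaves it unchanged
is_valid <- make.names(candidates) == candidates
print(data.frame(name = candidates, valid = is_valid))
```

Note that reserved words such as TRUE are also reported as not valid, because make.names() appends a dot to them.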
Constants in R
Constants, as the name suggests, are entities whose value cannot
be altered. Basic types of constant are numeric constants and
character constants.
Data cleaning in R
Here we use the Excel file "Data cleaning in R", assumed to be loaded into a data frame named Data_cleaning_in_R.
To view the first 5 observations, the command is:
head(Data_cleaning_in_R, 5)
Handling missing values in R
mean(Data_cleaning_in_R$Test1)
mean(Data_cleaning_in_R$Test2)
mean(Data_cleaning_in_R$Test3)
mean(Data_cleaning_in_R$Test1, na.rm = TRUE)   # ignore missing values (NA) when computing the mean
summary(Data_cleaning_in_R)
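The effect of missing values can be seen on a small made-up data set. The sketch below assumes a data frame called marks with hypothetical test scores; it is not the course's Excel file.

```r
# Hypothetical test scores with missing values (NA)
marks <- data.frame(
  Test1 = c(45, NA, 72, 80),
  Test2 = c(65, 78, NA, 45)
)

mean(marks$Test1)                 # NA, because one value is missing
mean(marks$Test1, na.rm = TRUE)   # mean of the non-missing values only

missing_per_column <- colSums(is.na(marks))   # count of NA values in each column
print(missing_per_column)

complete <- na.omit(marks)        # keep only rows with no missing values
nrow(complete)
```

Using na.rm = TRUE keeps every row and skips the NA values one column at a time, whereas na.omit() drops a whole row as soon as any column in it is missing.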
Importing an Excel file
To install the "xlsx" package:
install.packages("xlsx")
library("xlsx")
Reading an Excel file
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
class(file_name)
typeof(file_name)
To access the top two rows of a data frame:
head(dataframe, 2)
tail(dataframe, 2)   # bottom two rows
str(dataframe)       # structure of the data frame
Matrix in R
mat<- matrix(c(1,2,3,4,5,6),nrow = 2, ncol = 3)
mat
mat[1,2]
mat[,2]
mat[1,]
mat[2,]
stringmatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple",
"pear", "melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- cbind(stringmatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix
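The indexing results above depend on how matrix() lays out its values. By default R fills a matrix column by column; the byrow argument switches to row-by-row filling. A short sketch, where mat_by_row is a name chosen here for illustration:

```r
# Default: values fill the matrix column by column
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
mat[1, 2]   # element in row 1, column 2
mat[, 2]    # whole second column

# byrow = TRUE fills the matrix row by row instead
mat_by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
mat_by_row[1, 2]

dim(mat)    # number of rows and columns
```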
Data Visualization
A histogram is a visual representation of the distribution of a dataset.
It is used to plot the frequency of score occurrences in a continuous dataset.
We work on the movies dataset, stored in the file moviesData.csv.
The script used here is myPlot.R
To plot a histogram, type the following command:
hist(movies$runtime)
To add labels and colour to the histogram, we pass more arguments to hist():
hist(movies$runtime, main = "Distribution of movies' length", xlab = "Runtime of movies", xlim = c(0,300), col = "blue", breaks = 4)
Pie chart
It is a circular chart
Divided into wedge-like sectors, illustrating proportion.
The total value of the pie chart is always 100 percent.
In the movies dataset, we make a pie chart of the column "genre". To do that, we first build a frequency table of the column:
genreCount <- table(movies$genre)
View(genreCount)
pie(genreCount, main = "Proportion of movies' genre", border = "blue", col = "orange")
Bar Chart
A bar chart represents data in rectangular bars with length of the bar
proportional to the value of the variable.
R uses the function barplot to create bar charts
We plot a bar chart of the column imdb_rating from the movies dataset. For the sake of simplicity, we take only the first 20 observations.
moviesSub <- movies[1:20,]
barplot(moviesSub$imdb_rating,
ylab = "IMDB Rating",
xlab = "Movies",
col = "blue",
ylim = c(0,10),
main = "Movies', IMDB Rating")
Output of Bar Chart
Continuing from the previous slide, we add the movie names on the x-axis:
barplot(moviesSub$imdb_rating,
ylab = "IMDB Rating",
xlab = "Movies",
col = "blue",
ylim = c(0,10),
main = "Movies', IMDB Rating",
names.arg = moviesSub$title)
In the output, not all names are visible. To fix this, we draw the names perpendicular to the x-axis:
barplot(moviesSub$imdb_rating,
ylab = "IMDB Rating",
xlab = "Movies",
col = "blue",
ylim = c(0,10),
main = "Movies', IMDB Rating",
names.arg = moviesSub$title,
las = 2)
Let us analyse the relation between "imdb_rating" and "audience_score". For this we draw a scatter plot using the plot function.
A scatter plot is a graph in which the values of two variables are plotted along two axes.
The pattern of the resulting points reveals the correlation.
plot(x = movies$imdb_rating,
y = movies$audience_score,
main = "IMDB Ratings vs Audience Score",
xlab = "IMDB Rating",
ylab = "Audience Score",
xlim = c(0,10),
ylim = c(0,100),
col = "blue")
Now, we will see the correlation between the imdb_rating and
audience_score:
cor(movies$imdb_rating, movies$audience_score)
O/P
0.8651485
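By default, cor() returns Pearson's correlation coefficient. The sketch below recomputes it from the definition and compares the result with cor(). Since moviesData.csv is not bundled with R, the built-in mtcars data set is used as a stand-in; the choice of wt and mpg is an assumption made only for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson correlation from its definition: sum of cross-deviations divided by
# the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)   # default method is "pearson"

print(c(manual = r_manual, builtin = r_builtin))
```

A value close to 1 or -1 indicates a strong linear relation, and the sign gives its direction.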
Box Plot
Boxplots are created in R by using the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
•x is a vector or a formula.
•data is the data frame.
•notch is a logical value. Set as TRUE to draw a notch.
•varwidth is a logical value. Set as TRUE to draw the width of each box proportional to the square root of the sample size.
•names are the group labels which will be printed under each boxplot.
•main is used to give a title to the graph.
boxplot(mtcars$mpg)
boxplot(mtcars$mpg, main = "Mileage Data Boxplot", ylab = "Miles Per Gallon(mpg)", xlab = "No. of Cylinders", col = "orange")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
Introduction to ggplot2
Visualization is an important tool for insight generation
It is used to understand the data structure, identify outliers and find
patterns
There are two methods of data visualization in R:
• Basic graphics
• Grammar of graphics (popularly known as ggplot2)
Basic graphics
The following code plots a "sin" curve:
x <- seq(-pi, pi, 0.1)   # assumed range; x and y are not defined on the original slide
y <- sin(x)
plot(x, y, main = "Plotting sin curve", ylab = "sin(x)")
Now, we will learn how to change the type of the curve
plot(x,y, main = "Plotting sin curve", ylab = "sin(x)", type = "l",
col = "blue")
To plot the “cosine” and “sin” curve on the same plot
plot(x, sin(x),
main = "Two Graphs in one plot",
ylab = "",
type = "l",
col = "blue")
lines(x, cos(x),
col = "red")
Here, we use "legend" to differentiate between the two graphs:
plot(x, sin(x), main = "Two Graphs in one plot", ylab = "", type = "l", col = "blue")
lines(x, cos(x), col = "red")
legend("topleft", c("sin(x)", "cos(x)"), fill = c("blue", "red"))
ggplot2 graphics
ggplot2 package was created by Hadley Wickham in 2005
It offers a powerful graphics language for creating elegant and complex plots
We will use “movies” dataset for exploring “ggplot2” package
library(ggplot2)
View(movies)
Now, we want to draw scatter plot between the “critics_score” and
“audience_score”:
A ggplot2 plot is built from three components:
1. Data
2. Aesthetics
3. Geometry
ggplot(data = movies, mapping = aes(x=critics_score,
y=audience_score))+ geom_point()
There is positive correlation between critics_score and audience_score
How do we save the ggplot2 graph to the current working directory? Use the ggsave() function:
ggsave("scatter_plot.png")
Aesthetic mapping in ggplot2
We will learn:
1. What is aesthetic
2. How to create plots using aesthetic
3. Tuning parameters in aesthetic
What is Aesthetic
 Aesthetic is a visual property of the objects in a plot
 It includes lines, points, symbols, colors and positions
 It is used to add customization to our plots
# Load ggplot2
library(ggplot2)
# Clear R workspace
rm(list = ls() )
# Declare a variable to read and store moviesData
movies <- read.csv("moviesData.csv")
# View movies data frame
View(movies)
# Plot critics_score and audience_score
ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score)) +
geom_point()
Now we assign a unique color to each value of the "genre" column:
ggplot(data = movies,mapping = aes(x = critics_score, y =
audience_score, color = genre)) + geom_point()
How to draw “Bar chart” using ggplot function
The following code shows the type of the column "mpaa_rating" and its distinct values:
str(movies$mpaa_rating)
levels(factor(movies$mpaa_rating))   # factor() is needed when the column was read as character
ggplot(data = movies, mapping = aes(x = mpaa_rating)) + geom_bar()
We will learn how to add labels to this bar chart:
ggplot(data = movies, mapping = aes(x = mpaa_rating, fill = genre)) + geom_bar() + labs(y = "Rating counts", title = "Count of mpaa rating")
Now we draw a histogram for the variable "runtime":
# Histogram for "runtime"
ggplot(data = movies, mapping = aes(x = runtime)) + geom_histogram() + labs(x = "Runtime of Movies", title = "Distribution of Runtime")
Data manipulation using dplyr package
“dplyr” is a package for data manipulation, written and maintained by
Hadley Wickham
It comprises many functions that perform mostly used data
manipulation operations
# Clear R workspace
rm(list = ls())
# Declare a variable to list and store movies data
movies<- read.csv("moviesData.csv")
View(movies)
Now we install and load the "dplyr" package:
install.packages("dplyr")
library(dplyr)
Key functions in “dplyr” package
filter() – to select cases based on their values
arrange() – to reorder the cases
select() – to select variables based on their names
mutate() – to add new variables that are functions of existing variables
summarise() – to condense multiple values to a single value
All these functions can be combined with the group_by() function, which allows us to perform any operation by group.
# Clear R workspace
rm(list = ls())
# Declare a variable to list and store movies data
movies<- read.csv("moviesData.csv")
View(movies)
# using the "filter" function, we filter the column "genre" by "Comedy"
movies
moviesComedy <- filter(movies, genre == "Comedy")
View(moviesComedy)
moviesComedyDr <- filter(movies, genre == "Comedy" | genre == "Drama")
View(moviesComedyDr)
irisspecies <- filter(iris, Species == "setosa")   # species names in iris are lowercase
View(irisspecies)
irisspecies <- filter(iris, Species == "setosa" | Petal.Length >= 1.5)
View(irisspecies)
# filter the movies data by genre "Comedy" having "imdb_rating" greater than or equal to 7.5
moviesComedyIm <- filter(movies, genre == "Comedy" & imdb_rating >= 7.5)
View(moviesComedyIm)
# using the "arrange" function, arrange by imdb_rating in ascending order
moviesImA <- arrange(movies, imdb_rating)
View(moviesImA)
install.packages("dplyr")
library(dplyr)
data(iris)
View(iris)
iris_pet_arr <- arrange(iris, Petal.Length)
View(iris_pet_arr)
# using the "arrange" function, arrange by imdb_rating in descending order
moviesImD <- arrange(movies, desc(imdb_rating))
View(moviesImD)
# Arrange by "genre" in alphabetical order and, within each genre, by "imdb_rating" in ascending order
moviesGeIm <- arrange(movies, genre, imdb_rating)
View(moviesGeIm)
More functions in “dplyr” package
1. Select
2. Rename
3. Mutate
Here we use the script myVis.R, stored in the myVis folder together with moviesData; set the myVis folder as the working directory.
Before using the above functions, install the package "dplyr".
# using select function from dplyr package
moviesTGI <- select(movies, title, genre, imdb_rating)
View(moviesTGI)
Let us select the three columns “thtr_rel_year”, “thtr_rel_month” and
“thtr_rel_day” along with the “title” column
For that enter the following cmd in the console window:
moviesTHT <- select(movies, title, starts_with("thtr"))
View(moviesTHT)
Let us change the name of the column “thtr_rel_year” using “rename”
function
moviesR <- rename(movies, rel_year = "thtr_rel_year")
View(moviesR)
Suppose we want to add a new variable (column) to the movies dataset. For that we use the "mutate" function:
moviesLess <- select(movies, title:audience_score)
View(moviesLess)
# use of the mutate function
moviesMu <- mutate(moviesLess, criAud = critics_score - audience_score)
View(moviesMu)
Pipe operator
We will learn about:
1. Summarise and group_by functions
2. Operations in summarise functions
3. Pipe operator
Make a folder named "pipeops" in the myproject folder and set "pipeops" as the working directory
Summarise function
1. The summarise function reduces a data frame to a single row.
2. It gives summaries such as the mean or median of the variables in the data frame.
3. We often use summarise together with the group_by function.
# use of summarise function
summarise(movies, mean(imdb_rating))
When we use the group_by function, the data frame is divided into groups, and summarise then returns one row per group.
We group by the "genre" variable using the group_by function:
# use of group_by function
group_Movies <- group_by(movies, genre)
# using summarise function on the above cmd
summarise(group_Movies, mean(imdb_rating))
Now we use the filter, group_by and summarise functions to compute the mean imdb_rating of drama movies for each mpaa_rating.
dramaMov <- filter(movies, genre == "Drama")
gr_dramaMov <- group_by(dramaMov, mpaa_rating)
summarise(gr_dramaMov, mean(imdb_rating))
Pipe operator
The pipe operator is written as
%>%
It saves us from creating unnecessary intermediate data frames
We can read a pipe as a series of imperative statements
If we want to find the cosine of the sine of pi, we can write
pi %>% sin() %>% cos()
We will learn how to do the same above analysis using pipe operator
movies %>% filter(genre =="Drama") %>% group_by(mpaa_rating) %>%
summarise(mean(imdb_rating))
Let us find the difference between "critics_score" and "audience_score" in the movies data frame and show it as a box plot. Using the pipe operator, we combine functions from the "ggplot2" and "dplyr" packages:
movies %>% mutate(diff = audience_score - critics_score) %>% ggplot(mapping = aes(x = genre, y = diff)) + geom_boxplot()
Now we count the number of movies in each combination of genre and mpaa_rating:
movies %>% group_by(genre, mpaa_rating) %>% summarise(num = n())
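To see what the pipe saves, the same analysis can be written once with nested calls and once with %>%. The sketch below assumes the dplyr package is installed and uses the built-in mtcars data set in place of the movies data; the column choices (am, cyl, mpg) and the name mean_mpg are illustrative assumptions.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- summarise(group_by(filter(mtcars, am == 1), cyl),
                    mean_mpg = mean(mpg))

# Same steps with the pipe: read from top to bottom
piped <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

print(piped)
```

Both forms should give the same table, with one row per value of cyl; the piped version avoids naming intermediate data frames and lists the steps in the order they run.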
Conditional statements
We will learn:
1. Conditional statements
2. If, else and else if statements
Conditional statements run a block of code only when a logical condition is TRUE.
The if, else if and else statements are the basic conditional statements in R.
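A minimal sketch of an if / else if / else chain in base R. The function name grade_for and the cut-off marks (75 and 40) are assumptions made for illustration, not part of the course material.

```r
# Classify a score into a grade; the cut-offs are illustrative assumptions
grade_for <- function(score) {
  if (score >= 75) {
    "Distinction"
  } else if (score >= 40) {
    "Pass"
  } else {
    "Fail"
  }
}

grade_for(88)
grade_for(57)
grade_for(25)

# ifelse() is the vectorised form, applied to a whole vector at once
results <- ifelse(c(88, 57, 25) >= 40, "Pass", "Fail")
print(results)
```

The conditions are tested from top to bottom and only the first matching branch runs, so the order of the tests matters.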
Statistical function for data analysis
Data Set
A data set is a collection of data, often presented in a table.
There is a popular built-in data set in R called "mtcars" (Motor Trend Car
Road Tests), which is retrieved from the 1974 Motor Trend US Magazine.
In the examples below (and for the next chapters), we will use the
mtcars data set, for statistical purposes:
To get in-built data set in R
data()
data(mtcars)
View(mtcars)
head(mtcars,6)
head(mtcars)
nrow(mtcars)
ncol(mtcars)
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the
mtcars data set:
# Use the question mark to get information about the data set
?mtcars
Get Information
Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
# create a variable holding the mtcars data set for better organization
Data_Cars <- mtcars
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Sort Variable Values
To sort the values, use the sort() function:
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
Analyzing the Data
Now that we have some information about the data set, we can start to
analyze it with some statistical numbers.
For example, we can use the summary() function to get a statistical
summary of the data:
Data_Cars <- mtcars
summary(Data_Cars)
sd(mtcars$cyl)
Statistical functions in R
Mean, Median, and Mode
In statistics, there are often three values that interest us:
• Mean - The average value
• Median - The middle value
• Mode - The most common value
Data_Cars <- mtcars
mean(Data_Cars$wt)
Median
The median value is the value in the middle, after you have sorted all
the values.
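If we take a look at the sorted values of the wt variable (from the mtcars data set), we will see that there are two numbers in the middle; the median is their average:
Data_Cars <- mtcars
median(Data_Cars$wt)
mean(marks$Test1)                  # returns NA if Test1 contains missing values
mean(marks$Test1, na.rm = TRUE)    # ignores the missing values
d1 <- na.omit(old_filename)        # drops every row that contains a missing value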
Mode
The mode is the value that appears the greatest number of times.
R does not have a built-in function to calculate the statistical mode. However, we can write our own expression to find it.
If we take a look at the values of the wt variable (from the mtcars data set), we will see that the value 3.440 appears most often:
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
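The one-line expression above returns the mode as a character string. A small helper function is one way to get the value back in its original type. This is a sketch, and get_mode is a hypothetical name, not a base R function; when several values tie, it returns the one that appears first in the data.

```r
# Return the most frequent value of a vector, ignoring missing values
get_mode <- function(x) {
  x <- x[!is.na(x)]
  distinct <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, distinct))     # how often each distinct value occurs
  distinct[which.max(counts)]                # first value with the highest count
}

wt_mode <- get_mode(mtcars$wt)
print(wt_mode)

get_mode(c("a", "b", "b", NA))   # also works for character vectors
```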
http://www.sthda.com/english/wiki/ggplot2-essentials#:~:text=There%20are%20two%20major%20functions,a%20plot%20piece%20by%20piece.
This website gives the details of the ggplot2 package.
https://bookdown.org/jeffreytmonroe/business_analytics_with_r7/basics.html
https://www.geeksforgeeks.org/packages-in-r-programming/?ref=lbp
https://www.modernstatisticswithr.com/datachapter.html
https://www.w3schools.com/r/r_stat_data_set.asp
https://www.geeksforgeeks.org/r-keywords/?ref=lbp

More Related Content

PDF
FULL R PROGRAMMING METERIAL_2.pdf
PDF
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
PDF
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
PPTX
Introduction To Programming In R for data analyst
PDF
R-Language-Lab-Manual-lab-1.pdf
PDF
R-Language-Lab-Manual-lab-1.pdf
PDF
R-Language-Lab-Manual-lab-1.pdf
PPTX
Intro to data science module 1 r
FULL R PROGRAMMING METERIAL_2.pdf
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
محاضرة برنامج التحليل الكمي R program د.هديل القفيدي
Introduction To Programming In R for data analyst
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
Intro to data science module 1 r

Similar to Basics of R programming for analytics [Autosaved] (1).pdf (20)

PPT
Basics of R
PPTX
Unit I - 1R introduction to R program.pptx
PPTX
PPTX
Getting Started with R
PPTX
Introduction to R programming Language.pptx
PDF
Lecture1_R.pdf
PPTX
Data Science With R Programming Unit - II Part-1.pptx
PPTX
Data science with R Unit - II Part-1.pptx
PPT
Modeling in R Programming Language for Beginers.ppt
PPT
Lecture1_R.ppt
PPT
Lecture1_R.ppt
PPT
Lecture1 r
DOCX
Introduction to r
PPT
Inroduction to r
PPT
Lecture1_R Programming Introduction1.ppt
PDF
R Programming - part 1.pdf
PPTX
Introduction to R for Learning Analytics Researchers
PPT
R_Language_study_forstudents_R_Material.ppt
PPT
Brief introduction to R Lecturenotes1_R .ppt
PPT
Introduction to R for Data Science Technology
Basics of R
Unit I - 1R introduction to R program.pptx
Getting Started with R
Introduction to R programming Language.pptx
Lecture1_R.pdf
Data Science With R Programming Unit - II Part-1.pptx
Data science with R Unit - II Part-1.pptx
Modeling in R Programming Language for Beginers.ppt
Lecture1_R.ppt
Lecture1_R.ppt
Lecture1 r
Introduction to r
Inroduction to r
Lecture1_R Programming Introduction1.ppt
R Programming - part 1.pdf
Introduction to R for Learning Analytics Researchers
R_Language_study_forstudents_R_Material.ppt
Brief introduction to R Lecturenotes1_R .ppt
Introduction to R for Data Science Technology
Ad

More from suanshu15 (13)

PDF
Funding Report-Week analysis market and industry
PDF
Negotiation-in-International-Business-1.pdf
PDF
Foundation of Data Sciences_PGP_Term II.pdf
PDF
Marketing Management.pdf help students t
PDF
Org. Behaviour in mba student to help st
PDF
MS Pitch Deck for investors prasapective
PPTX
Hynuday car case study details in Indias
PDF
Businesses communication barriers corporate
PDF
Muskan & Sejal IBE.pdf
PDF
Rural immersion presentation 13th jan 2024.pdf
PPTX
Split Tone Fashion Presentation_20231107_120811_0000.pptx
PDF
Hospitality besed
PDF
Rural immersion_20240113_125642_0000.pdf
Funding Report-Week analysis market and industry
Negotiation-in-International-Business-1.pdf
Foundation of Data Sciences_PGP_Term II.pdf
Marketing Management.pdf help students t
Org. Behaviour in mba student to help st
MS Pitch Deck for investors prasapective
Hynuday car case study details in Indias
Businesses communication barriers corporate
Muskan & Sejal IBE.pdf
Rural immersion presentation 13th jan 2024.pdf
Split Tone Fashion Presentation_20231107_120811_0000.pptx
Hospitality besed
Rural immersion_20240113_125642_0000.pdf
Ad

Recently uploaded (20)

PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
Supervised vs unsupervised machine learning algorithms
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
SAP 2 completion done . PRESENTATION.pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PDF
Mega Projects Data Mega Projects Data
PDF
annual-report-2024-2025 original latest.
PDF
[EN] Industrial Machine Downtime Prediction
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
1_Introduction to advance data techniques.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
STERILIZATION AND DISINFECTION-1.ppthhhbx
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Clinical guidelines as a resource for EBP(1).pdf
Supervised vs unsupervised machine learning algorithms
Reliability_Chapter_ presentation 1221.5784
SAP 2 completion done . PRESENTATION.pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Introduction-to-Cloud-ComputingFinal.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Mega Projects Data Mega Projects Data
annual-report-2024-2025 original latest.
[EN] Industrial Machine Downtime Prediction

Basics of R programming for analytics [Autosaved] (1).pdf

  • 1. Basics of R programming for analytics Course code – PGP 207 PGP MCB 2023-25 Term: II
  • 2. What is R  R is a statistical programming environment  Statistical Programming Environment = Where you can both write code and do data analysis  Different from SPSS or SAS or other Statistical Packages  You can use for more than just data analyses  R stores everything in the form of objects  You can combine R with other writing environments such as LaTeX and Markdown to write reports
  • 3. Why Use R? •It is a great resource for data analysis, data visualization, data science and machine learning •It provides many statistical techniques (such as statistical tests, classification, clustering and data reduction) •It is easy to draw graphs in R, like pie charts, histograms, box plot, scatter plot, etc •It works on different platforms (Windows, Mac, Linux) •It is open-source and free •It has a large community support •It has many packages (libraries of functions) that can be used to solve different problems
  • 4. Obtaining R  The best way to obtain R is to visit the CRAN Website  http://cran.r-project.org  You will need Internet access to download the files  Installation of R depends on the platform you have:  Select the appropriate binary version  A binary version = is the machine coded version that will directly install R
  • 6. Obtaining additional R Packages  For Working with R you will need additional packages  These packages are combination of data and functions  The packages are kept in package repositories  To Use packages you will have to install and then call them  Installing: use install.packages(“name of the package”, repos = “”, dep = T)  To Use Packages, use library(name of the package), also require(name of the package) [Use either]
  • 7. Using R with an IDE  Always a good idea to use R with an integrated development environment (IDE)  Integrated Development Environment will help you to write codes, and view the outputs at the same time  You can also browse the objects, data, and graphs in the IDE  The IDE used in these set of exercises is RStudio  RStudio is free and open, and you can download from http://rstudio.com  Download the RStudio Desktop version for your use in these modules  Install R First and then RStudio
  • 9. Your Set up to get Started Source window: used to edit a script and run it. Console window: used to run a particular packages or to run particular command. Workspace window: it stores all the variables used during execution of command under the environment tab Plots and File window: the file tab is used to track the working directories The plot tabs show all the graphical output
  • 10. What can we put in [>] and take out [<] from R?  From Spreadsheets [ > ]  Source Code Files [ > ]  From other Software [ > ]  Text Based Data [ > ] [ < ]  Tables of Data [ > ] [ < ]  Images [ < ]  Dump Files [ < ]
  • 11. Assignment 1 Find the answers to log2(2^5) and log(exp(1)*exp(1)).
  • 12. Data frame in RStudio ID <- c(1,2,3,4,5) Name <- c("Ramesh", "Kaushik", "Chaitali", "Hardik", "Komal") English <- c(45,65,72,80,57) Hindi <- c(65,78,56,45,48) Science <- c(45,55,68,74,63) So_Science <- c(58,69,63,77,52) Math <- c(88,63,59,70,76) Stu_marks <- data.frame(Name, English, Hindi, Science, So_Science, Math) View(Stu_marks) # extracting a single column from the data frame Stu_marks$Math Stu_marks$Hindi
  • 13. Create a new data frame with the columns: names, Computer_app, EVS. To merge two data frames on a shared column, enter the command: New_df_name <- merge(df1, df2, by = "names") View(New_df_name)
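A minimal sketch of the merge step, assuming two small data frames that share a `names` column. The values and the object names `df1`, `df2` and `new_df` are hypothetical and only illustrate how `merge()` matches rows.

```r
# Hypothetical data: two data frames sharing a key column called "names"
df1 <- data.frame(names = c("Ramesh", "Kaushik", "Chaitali"),
                  Computer_app = c(70, 82, 91))
df2 <- data.frame(names = c("Ramesh", "Kaushik", "Chaitali"),
                  EVS = c(64, 58, 77))

# merge() joins the rows whose values in the "by" column match
new_df <- merge(df1, df2, by = "names")
print(new_df)
```

The key column comes first in the result, followed by the remaining columns of each input.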
  • 14. Packages in R 1. A package is a collection of R functions, compiled code and sample data. 2. Packages are stored under a directory called library in the R environment. By default, R installs a set of packages. To list the packages installed in R, enter this command in the console window: > library() > fraction (firstVar/secondVar)
  • 15. Introduction to R script An R script is a plain text file in which you can store your R code. A script lets you show your work to others and reproduce or modify the results. How to find the working directory? In the console window write: > getwd() The current working directory is shown in the output. How to set the current working directory? > setwd("path/to/folder") How to read and store a "csv" file in R? Type the following command in the console window: file_name <- read.csv("file_name.csv") To view the file enter the command: View(file_name)
  • 16. How to create a data frame in R? > names <- c("Rohit", "Dhoni", "Virat", "Hardik", "KL Rahul", "Bumrah") > played <- c(45,49,47,47,40,25) > won <- c(22,21,14,9,9,8) > lost <- c(12,13,14,8,19,6) > y <- c(2008, 2004, 2007, 2009, 2010, 2010) > cricket_players <- data.frame(names, played, won, lost, y) > View(cricket_players) You can access parts of the data frame with the following commands: > cricket_players$names > cricket_players$won
  • 17. Suppose we want to find the ratio of games won to games played: > ratio <- cricket_players$won/cricket_players$played The ratio is stored in a new column called "victory": > cricket_players$victory <- ratio To reduce the number of significant digits displayed in the victory column: > options(digits=2) > View(cricket_players) > mean(cricket_players$played) > plot(factor(cricket_players$names), cricket_players$played)
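The ratio step can be sketched as follows, reusing the player data from the previous slide. Note one swapped-in choice: `round()` is used here instead of `options(digits = 2)`, because `options()` only changes how numbers are printed for the whole session, while `round()` changes the stored values of this one column.

```r
names  <- c("Rohit", "Dhoni", "Virat", "Hardik", "KL Rahul", "Bumrah")
played <- c(45, 49, 47, 47, 40, 25)
won    <- c(22, 21, 14, 9, 9, 8)
cricket_players <- data.frame(names, played, won)

# Element-wise division gives one win ratio per player;
# round() keeps two decimal places in the stored column
cricket_players$victory <- round(cricket_players$won / cricket_players$played, 2)
print(cricket_players)

# Average number of games played
mean(cricket_players$played)
```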
  • 18. Inputting a Source File A source file contains all the code you need to run your analyses. It is used to pass data and commands to R. You ask R to run the code by typing: source("file.R") Remember to save the code with the extension ".R".
  • 19. Code to read data from console to R mylar <- scan("", what = numeric()) ▪ Reads directly from the console ▪ Saves the numbers to a variable
  • 20. Code to read data from text files • read.csv() reads comma-separated value (csv) files • You need to indicate whether the file has a header • Here we have set the variable names manually mydata <- read.csv("DOB.csv", header = TRUE, sep = ",") names(mydata) <- c("Id", "Time", "DOB")
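A self-contained sketch of the same steps. The file contents are hypothetical and are written to a temporary file (`csv_path` is an illustrative name) so that the example runs without an existing DOB.csv.

```r
# Hypothetical file contents, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,09:30,1999-01-15",
             "2,14:05,2001-07-02"), csv_path)

# header = TRUE treats the first line as column names;
# sep must be a single character, and "," is the default for read.csv()
mydata <- read.csv(csv_path, header = TRUE, sep = ",")

# Replace the column names manually
names(mydata) <- c("Id", "Time", "DOB")
str(mydata)
```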
  • 21. SUGGESTED TEXT BOOKS • Grolemund, Garrett, Hands-On Programming with R: Write Your Own Functions and Simulations, Mumbai: Shroff Publishers & Distributors • Chambers, John M., Software for Data Analysis: Programming with R, USA: Springer E-Resources • https://www.tutorialspoint.com/r/index.htm • https://www.w3schools.com/r/r_intro.asp • https://www.javatpoint.com/r-tutorial
  • 22. Comments in R Comments can be used to explain R code and to make it more readable. They can also be used to prevent execution when testing alternative code. Comments start with a #. When executing code, R ignores everything from the # to the end of the line. Example: This example uses a comment before a line of code: # This is a comment "Hello World" Example: This example uses a comment at the end of a line of code: "Hello World" # This is a comment Comments do not have to be text that explains the code; they can also be used to prevent R from executing code: # "Good morning!" "Good night!"
  • 23. Reserved Words in R Reserved words in R programming are a set of words that have special meaning and cannot be used as identifiers (variable names, function names etc.). Reserved words in R: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_, ...
  • 24. Identifiers in R Variables in R Variables are used to store data, whose value can be changed according to our need. Unique name given to variable (function and objects as well) is identifier. Rules for writing Identifiers in R 1.Identifiers can be a combination of letters, digits, period (.) and underscore (_). 2.It must start with a letter or a period. If it starts with a period, it cannot be followed by a digit. 3.Reserved words in R cannot be used as identifiers.
  • 25. Valid identifiers in R Total, sum, fine.with.dot, Number5, this_is_acceptable Invalid identifiers in R tot@l, 5um, _fine, TRUE, .0ne Constants in R Constants, as the name suggests, are entities whose value cannot be altered. Basic types of constant are numeric constants and character constants.
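One way to check the identifier rules is sketched below. The helper name `is_valid_identifier` is hypothetical; it relies on base R's `make.names()`, which rewrites a string into a syntactically valid name, so a string that comes back unchanged was already valid. The invalid list includes `.0ne`, a name that starts with a period followed by a digit.

```r
# Hypothetical helper: TRUE when the string is already a valid R name
is_valid_identifier <- function(x) make.names(x) == x

valid   <- c("Total", "sum", "fine.with.dot", "Number5", "this_is_acceptable")
invalid <- c("tot@l", "5um", "_fine", "TRUE", ".0ne")

is_valid_identifier(valid)    # all TRUE
is_valid_identifier(invalid)  # all FALSE
```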
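  • 26. Data cleaning in R Here we are using the Excel file "Data cleaning in R", assumed here to be imported as the data frame Data_cleaning_in_R (an object name cannot contain spaces). To view the first few observations (six by default) the command is head(Data_cleaning_in_R) Handling missing values in R: mean() returns NA when a column contains missing values, unless na.rm = TRUE is given mean(Data_cleaning_in_R$Test1) mean(Data_cleaning_in_R$Test2) mean(Data_cleaning_in_R$Test3) mean(Data_cleaning_in_R$Test1, na.rm = TRUE) summary(Data_cleaning_in_R)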
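The effect of `na.rm` can be seen on a small example. The data frame `marks` and its values are hypothetical stand-ins for the Excel data.

```r
# Hypothetical marks with one missing value in Test1
marks <- data.frame(Test1 = c(12, 15, NA, 18),
                    Test2 = c(14, 11, 16, 19))

mean(marks$Test1)                 # NA, because one value is missing
mean(marks$Test1, na.rm = TRUE)   # 15, the mean of 12, 15 and 18
colSums(is.na(marks))             # number of missing values per column
```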
  • 27. Importing an Excel file To install the "xlsx" package: install.packages("xlsx") library("xlsx") Reading an Excel file # Read the first worksheet in the file input.xlsx. data <- read.xlsx("input.xlsx", sheetIndex = 1) print(data)
  • 28. class(file_name) typeof(file_name) To access the top two rows of a data frame: head(dataframe, 2) To access the last two rows: tail(dataframe, 2) To display the structure: str(dataframe)
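R is case-sensitive, so these functions must be written in lowercase. A short sketch using the built-in mtcars data set as a stand-in (the variable name `df` is illustrative):

```r
df <- mtcars      # built-in data set used as a stand-in

class(df)         # "data.frame"
typeof(df)        # "list": a data frame is stored as a list of columns
head(df, 2)       # first two rows
tail(df, 2)       # last two rows
str(df)           # compact description of every column
```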
  • 29. Matrix in R mat<- matrix(c(1,2,3,4,5,6),nrow = 2, ncol = 3) mat mat[1,2] mat[,2] mat[1,] mat[2,] stringmatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3) newmatrix <- cbind(stringmatrix, c("strawberry", "blueberry", "raspberry")) # Print the new matrix newmatrix
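The indexing results above depend on how `matrix()` fills its cells. By default the values go in column by column, as this sketch shows; `by_row` is an illustrative name for the same values filled row by row.

```r
mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
# Filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

mat[1, 2]   # row 1, column 2: 3
mat[, 2]    # whole second column: 3 4
mat[1, ]    # whole first row: 1 3 5

# byrow = TRUE fills row by row instead
by_row <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
by_row[1, 2]  # 2
```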
  • 30. Data Visualization A histogram is a visual representation of the distribution of a dataset. It is used to plot the frequency of score occurrences in a continuous dataset. We work on the movies dataset, stored in the file moviesData.csv. The script used here is myPlot.R. To plot a histogram, type the following command: > hist(movies$runtime) To add labels and colour to the histogram, pass more arguments: hist(movies$runtime, main = "Distribution of movies' length", xlab = "Runtime of movies", xlim = c(0,300), col = "Blue", breaks = 4)
  • 31. Pie chart A pie chart is a circular chart divided into wedge-like sectors, illustrating proportion. The total value of the pie chart is always 100 percent. In the movies data set, we make a pie chart of the column "genre"; for that we first build a frequency table of the column. genreCount <- table(movies$genre) View(genreCount) pie(genreCount, main = "Proportion of movies' genre", border = "blue", col = "orange")
  • 32. Bar Chart A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts. We plot a bar chart of the column imdb_rating from the movies dataset; for the sake of simplicity we take only the first 20 observations. moviesSub <- movies[1:20,] barplot(moviesSub$imdb_rating, ylab = "IMDB Rating", xlab = "Movies", col = "blue", ylim = c(0,10), main = "Movies', IMDB Rating")
  • 33. Output of Bar Chart
  • 34. In continuation of the previous slide, we will add the movie names on the x-axis barplot(moviesSub$imdb_rating, ylab = "IMDB Rating", xlab = "Movies", col = "blue", ylim = c(0,10), main = "Movies', IMDB Rating", names.arg = moviesSub$title) In the output, not all names are visible; to fix this, we will print the names perpendicular to the x-axis.
  • 35. barplot(moviesSub$imdb_rating, ylab = "IMDB Rating", xlab = "Movies", col = "blue", ylim = c(0,10), main = "Movies', IMDB Rating", names.arg = moviesSub$title, las = 2)
  • 37. Let us analyse the relation between "imdb_rating" and "audience_score". For this we draw a scatter plot using the plot function. A scatter plot is a graph in which the values of two variables are plotted along two axes. The pattern of the resulting points reveals the correlation. plot(x = movies$imdb_rating, y = movies$audience_score, main = "IMDB Ratings vs Audience Score", xlab = "IMDB Rating", ylab = "Audience Score", xlim = c(0,10), ylim = c(0,100), col = "blue")
  • 39. Now, we will see the correlation between the imdb_rating and audience_score: cor(movies$imdb_rating, movies$audience_score) O/P 0.8651485
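A runnable sketch of `cor()`, using the built-in mtcars data set as a stand-in for the movies data. It also shows one behaviour worth knowing: `cor()` returns NA when either vector contains a missing value, unless `use = "complete.obs"` is given. The vectors `x` and `y` are hypothetical.

```r
# Correlation between car weight and mileage in mtcars
r <- cor(mtcars$wt, mtcars$mpg)
r   # negative: heavier cars tend to have lower mileage

# Hypothetical vectors with one missing value
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, 10)
cor(x, y)                         # NA
cor(x, y, use = "complete.obs")   # uses only the complete pairs
```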
  • 40. Box Plot Boxplots are created in R by using the boxplot() function. Syntax The basic syntax to create a boxplot in R is − boxplot(x, data, notch, varwidth, names, main) Following is the description of the parameters used − •x is a vector or a formula. •data is the data frame. •notch is a logical value. Set as TRUE to draw a notch. •varwidth is a logical value. Set as true to draw width of the box proportionate to the sample size. •names are the group labels which will be printed under each boxplot. •main is used to give a title to the graph.
  • 41. boxplot(mtcars$mpg) boxplot(mtcars$mpg, main="Mileage Data Boxplot", ylab="Miles Per Gallon(mpg)", xlab="No. of Cylinders", col="orange") boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon", main = "Mileage Data")
  • 43. Introduction to ggplot2 Visualization is an important tool for insight generation. It is used to understand the data structure, identify outliers and find patterns. There are two methods of data visualization in R: • Basic graphics • Grammar of graphics (popularly known as ggplot2) Basic graphics: the following code plots a "sin" curve, assuming x is a numeric vector and y <- sin(x): plot(x, y, main = "Plotting sin curve", ylab = "sin(x)") Now, we will learn how to change the type of the curve: plot(x, y, main = "Plotting sin curve", ylab = "sin(x)", type = "l", col = "blue")
  • 44. To plot the "cosine" and "sin" curves on the same plot: plot(x, sin(x), main = "Two Graphs in one plot", ylab = "", type = "l", col = "blue") lines(x, cos(x), col = "red") Here, we will use "legend" to differentiate between the two graphs: plot(x, sin(x), main = "Two Graphs in one plot", ylab = "", type = "l", col = "blue") lines(x, cos(x), col = "red") legend("topleft", c("sin(x)","cos(x)"), fill = c("blue", "red"))
  • 46. ggplot2 graphics The ggplot2 package was created by Hadley Wickham in 2005. It offers a powerful graphics language for creating elegant and complex plots. We will use the "movies" dataset to explore the "ggplot2" package library(ggplot2) View(movies) Now, we want to draw a scatter plot of "critics_score" against "audience_score". A ggplot2 plot is built from three components: 1. Data 2. Aesthetics 3. Geometric object (geom)
  • 47. ggplot(data = movies, mapping = aes(x=critics_score, y=audience_score))+ geom_point()
  • 48. There is a positive correlation between critics_score and audience_score. How do we save the ggplot2 graph in our current working directory? Use the ggsave() function: ggsave("scatter_plot.png")
  • 49. Aesthetic mapping in ggplot2 We will learn: 1. What is aesthetic 2. How to create plots using aesthetic 3. Turning parameters in aesthetic
  • 50. What is Aesthetic  Aesthetic is a visual property of the objects in a plot  It includes lines, points, symbols, colors and positions  It is used to add customization to our plots # Load ggplot2 library(ggplot2) # Clear R workspace rm(list = ls() ) # Declare a variable to read and store moviesData movies <- read.csv("moviesData.csv") # View movies data frame View(movies) # Plot critics_score and audience_score ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score)) + geom_point()
  • 51. Now, we will assign a unique color to each value of the "genre" column ggplot(data = movies, mapping = aes(x = critics_score, y = audience_score, color = genre)) + geom_point() How to draw a bar chart using the ggplot function The following code shows the type of the column "mpaa_rating" and its levels: str(movies$mpaa_rating) levels(movies$mpaa_rating) ggplot(data = movies, mapping = aes(x = mpaa_rating)) + geom_bar() We will learn how to add labels to this bar chart:
  • 52. ggplot(data = movies, mapping = aes(x = mpaa_rating, fill = genre)) + geom_bar() + labs(y = "Rating counts", title = "Count of mpaa rating") Now we will draw a histogram for the variable "runtime" # Histogram for "runtime" ggplot(data = movies, mapping = aes(x = runtime)) + geom_histogram() + labs(x = "Runtime of Movies", title = "Distribution of Runtime")
  • 53. Data manipulation using dplyr package “dplyr” is a package for data manipulation, written and maintained by Hadley Wickham It comprises many functions that perform mostly used data manipulation operations # Clear R workspace rm(list = ls()) # Declare a variable to list and store movies data movies<- read.csv("moviesData.csv") View(movies)
  • 54. Now we will install the "dplyr" package install.packages("dplyr") library(dplyr) Key functions in the "dplyr" package: filter() - to select cases based on their values; arrange() - to reorder the cases; select() - to select variables based on their names; mutate() - to add new variables that are functions of existing variables; summarise() - to condense multiple values to a single value. All these functions can be combined with the group_by() function, which allows us to perform any operation by group.
  • 55. # Clear R workspace rm(list = ls()) # Declare a variable to read and store movies data movies <- read.csv("moviesData.csv") View(movies) # using the "filter" function we will filter the column "genre" for comedy movies moviesComedy <- filter(movies, genre == "Comedy") View(moviesComedy) moviesComedyDr <- filter(movies, genre == "Comedy" | genre == "Drama") View(moviesComedyDr)
  • 56. irisspecies <- filter(iris, Species == "setosa") View(irisspecies) irisspecies <- filter(iris, Species == "setosa" | Petal.Length >= 1.5) View(irisspecies)
  • 57. # filter the movies data by genre "Comedy" having "imdb_rating" greater than or equal to 7.5 moviesComedyIm <- filter(movies, genre == "Comedy" & imdb_rating >=7.5) View(moviesComedyIm) # using "arrange" function arranging the imdb_rating by ascending order moviesImA <- arrange(movies, imdb_rating) View(moviesImA)
  • 59. # using "arrange" function arranging the imdb_rating by descending order moviesImD <- arrange(movies,desc(imdb_rating)) View(moviesImD) # Arrange the two columns "genre" by alphabetical order and "imdb_rating" by ascending order moviesGeIm <- arrange(movies, genre, imdb_rating) View(moviesGeIm)
  • 60. More functions in the "dplyr" package 1. select 2. rename 3. mutate Here, we are using the myVis.R script, stored in the myVis folder that also contains moviesData; set the myVis folder as the working directory. Before using the above functions, install the "dplyr" package.
  • 61. # using select function from dplyr package moviesTGI <- select(movies, title, genre, imdb_rating) View(moviesTGI) Let us select the three columns “thtr_rel_year”, “thtr_rel_month” and “thtr_rel_day” along with the “title” column For that enter the following cmd in the console window: moviesTHT <- select(movies, title, starts_with("thtr")) View(moviesTHT)
  • 62. Let us change the name of the column “thtr_rel_year” using “rename” function moviesR <- rename(movies, rel_year = "thtr_rel_year") View(moviesR) Suppose we want to add a new variable (column) in movies dataset for that we will use “mutate” function moviesLess <- select(movies, title:audience_score) View(moviesLess) # use of Mutate function moviesMu <- mutate(moviesLess, criAud = critics_score- audience_score) View(moviesMu)
  • 63. Pipe operator We will learn about: 1. Summarise and group_by functions 2. Operations in summarise functions 3. Pipe operator Make a folder named "pipeops" in the myproject folder and set "pipeops" as the working directory
  • 64. Summarise function 1. Summarise function reduces a dataframe into a single row. 2. It gives summaries like mean, median etc., of the variable available in the dataframe 3. We use summarise along with the group_by function # use of summarise function summarise(movies, mean(imdb_rating)) 1. When we use group_by function, the data frame is divided into groups. We group the “genre” variable using group_by function
  • 65. # use of group_by function group_Movies <- group_by(movies, genre) # using the summarise function on the grouped data summarise(group_Movies, mean(imdb_rating)) Now, we use the filter, group_by and summarise functions to find the mean IMDB rating of drama movies for each mpaa_rating. dramaMov <- filter(movies, genre == "Drama") gr_dramaMov <- group_by(dramaMov, mpaa_rating) summarise(gr_dramaMov, mean(imdb_rating))
  • 66. Pipe operator The pipe operator is written %>%. It saves us from creating unnecessary intermediate data frames. We can read a pipe as a series of imperative statements. If we want to find the cosine of the sine of pi, we can write pi %>% sin() %>% cos() We will now repeat the analysis above using the pipe operator: movies %>% filter(genre == "Drama") %>% group_by(mpaa_rating) %>% summarise(mean(imdb_rating))
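A sketch of the pipe in action, assuming the dplyr package is installed. The built-in mtcars data set stands in for the movies data, and the names `result` and `mean_mpg` are illustrative.

```r
library(dplyr)

# A nested call and a piped call compute the same value
cos(sin(pi))
pi %>% sin() %>% cos()

# filter -> group_by -> summarise in one chain,
# with no intermediate data frames
result <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg))
print(result)
```

Each step receives the output of the previous step as its first argument, which is why the data frame is named only once at the start of the chain.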
  • 67. Let us find the difference between "critics_score" and "audience_score" in the movies data frame. We will use a box plot for this; using the pipe operator we combine functions from the "ggplot2" and "dplyr" packages movies %>% mutate(diff = audience_score - critics_score) %>% ggplot(mapping = aes(x = genre, y = diff)) + geom_boxplot() Now, we are going to count the movies in each combination of genre and mpaa_rating movies %>% group_by(genre, mpaa_rating) %>% summarise(num = n())
  • 68. Conditional statements We will learn: 1. Conditional statements 2. if, else and else if statements Conditional statements run a block of code only when a logical condition is TRUE. if, else and else if are the basic conditional statements.
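A minimal sketch of if, else if and else working together. The function `rating_label` and its thresholds are hypothetical, chosen only to illustrate the control flow.

```r
# Hypothetical helper that labels an IMDB-style rating
rating_label <- function(rating) {
  if (rating >= 7.5) {
    "high"
  } else if (rating >= 5) {
    "medium"
  } else {
    "low"
  }
}

rating_label(8.1)   # "high"
rating_label(6)     # "medium"
rating_label(3.2)   # "low"
```

The conditions are checked from top to bottom, and only the first branch whose condition is TRUE is executed.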
  • 69. Statistical function for data analysis Data Set A data set is a collection of data, often presented in a table. There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road Tests), which is retrieved from the 1974 Motor Trend US Magazine. In the examples below (and for the next chapters), we will use the mtcars data set, for statistical purposes:
  • 70. To get in-built data set in R data() data(mtcars) View(mtcars) head(mtcars,6) head(mtcars) nrow(mtcars) ncol(mtcars) Example # Print the mtcars data set mtcars Information About the Data Set You can use the question mark (?) to get information about the mtcars data set: # Use the question mark to get information about the data set ?mtcars
  • 71. Get Information Use the dim() function to find the dimensions of the data set, and the names() function to view the names of the variables: Example Data_Cars <- mtcars # create a variable of the mtcars data set for better organization # Use dim() to find the dimension of the data set dim(Data_Cars) # Use names() to find the names of the variables from the data set names(Data_Cars)
  • 72. Sort Variable Values To sort the values, use the sort() function: Example Data_Cars <- mtcars sort(Data_Cars$cyl) Analyzing the Data Now that we have some information about the data set, we can start to analyze it with some statistical numbers. For example, we can use the summary() function to get a statistical summary of the data: Data_Cars <- mtcars summary(Data_Cars) sd(mtcars$cyl)
  • 73. Statistical functions in R Mean, Median, and Mode In statistics, there are often three values that interest us: •Mean - The average value •Median - The middle value •Mode - The most common value Data_Cars <- mtcars mean(Data_Cars$wt)
  • 74. Median The median value is the value in the middle, after you have sorted all the values. If we take a look at the values of the wt variable (from the mtcars data set), we will see that there are two numbers in the middle: Data_Cars <- mtcars median(Data_Cars$wt) mean(marks$Test1) mean(marks$Test1, na.rm = TRUE) d1 <- na.omit(old_filename)
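The "two numbers in the middle" can be checked directly. With an even number of observations, `median()` returns the average of the two middle values of the sorted data; the names `wt_sorted`, `n` and `middle` are illustrative.

```r
wt_sorted <- sort(mtcars$wt)
n <- length(wt_sorted)                  # 32 observations, an even count

# The two middle values of the sorted data
middle <- wt_sorted[c(n / 2, n / 2 + 1)]
middle

mean(middle)          # average of the two middle values
median(mtcars$wt)     # same value
```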
  • 75. Mode The mode is the value that appears the greatest number of times. R does not have a built-in function to calculate the statistical mode. However, we can write our own code to find it. If we take a look at the values of the wt variable (from the mtcars data set), we will see that the value 3.440 appears most often: Data_Cars <- mtcars names(sort(-table(Data_Cars$wt)))[1]
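The one-liner above returns the mode as a character string. A small reusable function is sketched below as an alternative; the name `stat_mode` is hypothetical, and the function returns the mode in its original type. When several values tie, it returns the one that occurs first in the data.

```r
# Hypothetical helper: most frequent value of a vector
stat_mode <- function(x) {
  ux <- unique(x)                       # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))      # how often each distinct value occurs
  ux[which.max(counts)]                 # first value with the highest count
}

stat_mode(mtcars$wt)        # most common weight in mtcars
stat_mode(c(1, 2, 2, 3))    # 2
```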
  • 76. http://www.sthda.com/english/wiki/ggplot2-essentials This website gives the details of the ggplot2 package. https://bookdown.org/jeffreytmonroe/business_analytics_with_r7/basics.html https://www.geeksforgeeks.org/packages-in-r-programming/?ref=lbp https://www.modernstatisticswithr.com/datachapter.html https://www.w3schools.com/r/r_stat_data_set.asp https://www.geeksforgeeks.org/r-keywords/?ref=lbp