Introduction to R for Data Science
Lecturers
dipl. ing Branko Kovač
Data Analyst at CUBE / Data Science Mentor at Springboard
Data Science Serbia
branko.kovac@gmail.com
dr Goran S. Milovanović
Data Scientist at DiploFoundation
Data Science Serbia
goran.s.milovanovic@gmail.com
goranm@diplomacy.edu
Multiple Linear Regression in R
• Dummy coding of categorical predictors
• Multiple regression
• Nested models and Partial F-test
• Partial and Part Correlation
• Multicollinearity
• {lattice} plots
• Prediction, Confidence Intervals, Residuals
• Influential Cases and the Influence Plot
Intro to R for Data Science
Session 7: Multiple Linear Regression in R
########################################################
# Introduction to R for Data Science
# SESSION 7 :: 9 June, 2016
# Multiple Linear Regression in R
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
#### load packages
library(datasets)
library(broom)
library(ggplot2)
library(lattice)
#### load data
data(iris)
str(iris)
Multiple Regression in R
• Problems with simple linear regression: iris dataset
#### simple linear regression: Sepal Length vs Petal Length
# Predictor vs Criterion {ggplot2}
ggplot(data = iris,
       aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(size = 2, colour = "black") +
  geom_point(size = 1, colour = "white") +
  # colour is set outside aes() so that the regression line is really drawn in black
  geom_smooth(method = "lm", colour = "black") +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length") +
  theme(legend.position = "none")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# And now for something completely different (but in R)...
#### Problems with linear regression in iris
# Predictor vs Criterion {ggplot2} - group separation
ggplot(data = iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# Predictor vs Criterion {ggplot2} - separate regression lines
ggplot(data = iris,
       aes(x = Sepal.Length,
           y = Petal.Length,
           colour = Species)) +
  geom_smooth(method = "lm") +
  geom_point(size = 2) +
  ggtitle("Sepal Length vs Petal Length") +
  xlab("Sepal Length") + ylab("Petal Length")
Multiple Regression in R
• Problems with simple linear regression: iris dataset
### better... {lattice}
xyplot(Petal.Length ~ Sepal.Length | Species, # {lattice} xyplot
       data = iris,
       xlab = "Sepal Length", ylab = "Petal Length"
)
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# Petal Length and Sepal Length: Conditional Densities
densityplot(~ Petal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Petal Length", ylab = "Density",
            main = "P(Petal Length|Species)",
            col.line = 'red'
)
densityplot(~ Sepal.Length | Species, # {lattice} densityplot
            data = iris,
            plot.points = FALSE,
            xlab = "Sepal Length", ylab = "Density",
            main = "P(Sepal Length|Species)",
            col.line = 'blue'
)
Multiple Regression in R
• Problems with simple linear regression: iris dataset
# Linear regression in subgroups
species <- unique(iris$Species)
w1 <- which(iris$Species == species[1]) # setosa
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w1, ])
tidy(reg)
w2 <- which(iris$Species == species[2]) # versicolor
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w2, ])
tidy(reg)
w3 <- which(iris$Species == species[3]) # virginica
reg <- lm(Petal.Length ~ Sepal.Length, data = iris[w3, ])
tidy(reg)
Multiple Regression in R
• Simple linear regressions in sub-groups
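The three subgroup fits can also be produced in one pass. The following is a minimal sketch in base R only (no {broom}); the names bySpecies, fits, subgroupCoefs and pooledCoefs are illustrative and do not come from the slides.

```r
# Reload the default iris data (this resets any earlier changes to iris$Species)
data(iris)

# Split the data frame into one data frame per species
bySpecies <- split(iris, iris$Species)

# Fit Petal.Length ~ Sepal.Length within each species
fits <- lapply(bySpecies, function(d) lm(Petal.Length ~ Sepal.Length, data = d))

# One column per species: intercept and slope
subgroupCoefs <- sapply(fits, coef)
subgroupCoefs

# Pooled fit over all 150 rows, for comparison with the within-species slopes
pooledCoefs <- coef(lm(Petal.Length ~ Sepal.Length, data = iris))
pooledCoefs
```

Putting the pooled slope next to the three within-species slopes makes the problem with the single regression line visible in numbers rather than only in the plots.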
#### Dummy Coding: Species in the iris dataset
is.factor(iris$Species)
levels(iris$Species)
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg)
glance(reg)
# Never forget what the regression coefficient for a dummy variable means:
# it tells us about the effect of moving from the baseline (reference) level to the respective level!
# Here: baseline = setosa (cmp. levels(iris$Species) vs. the output of tidy(reg))
# NOTE: watch for the order of levels!
levels(iris$Species) # Levels: setosa versicolor virginica
iris$Species <- factor(iris$Species,
                       levels = c("versicolor",
                                  "virginica",
                                  "setosa"))
levels(iris$Species)
# baseline is now: versicolor
reg <- lm(Petal.Length ~ Species, data = iris)
tidy(reg) # The regression coefficients (!): figure out what has happened!
Multiple Regression in R
• Dummy coding of categorical predictors
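To make the meaning of the dummy coefficients concrete, here is a small sketch in base R. It assumes the default level order of Species (setosa first); the names X and groupMeans are illustrative, not from the slides.

```r
# Reload the default iris data so that setosa is the baseline again
data(iris)

# The design matrix lm() builds for a factor: intercept plus two 0/1 dummy columns
X <- model.matrix(~ Species, data = iris)
unique(X)

reg <- lm(Petal.Length ~ Species, data = iris)

# Mean Petal.Length per species
groupMeans <- tapply(iris$Petal.Length, iris$Species, mean)

# Intercept = mean of the baseline level (setosa);
# each dummy coefficient = that level's mean minus the baseline mean
coef(reg)
groupMeans - groupMeans["setosa"]
```

With a single categorical predictor, the coefficients are therefore just differences between group means, which is why changing the baseline changes every coefficient while leaving the fitted values alone.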
### another way to do dummy coding
rm(iris); data(iris) # ...just to fix the order of Species back to default
levels(iris$Species)
contrasts(iris$Species) <- contr.treatment(3, base = 1)
contrasts(iris$Species) # this is probably what you remember from your stats class...
iris$Species <- factor(iris$Species,
                       levels = c("virginica", "versicolor", "setosa"))
levels(iris$Species)
contrasts(iris$Species) <- contr.treatment(3, base = 1)
# baseline is now: virginica
contrasts(iris$Species) # consider carefully what you need to do
Multiple Regression in R
• Dummy coding of categorical predictors
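A third option, not used in the slides, is base R's relevel(), which moves one level to the front and so makes it the baseline. The sketch below works on a copy (irisRef, an illustrative name) and checks that the choice of baseline changes the coefficients but not the fitted values.

```r
# Reload the default iris data and work on a copy
data(iris)
irisRef <- iris
irisRef$Species <- relevel(irisRef$Species, ref = "virginica")

levels(irisRef$Species)     # virginica now comes first, i.e. it is the baseline
contrasts(irisRef$Species)  # treatment contrasts relative to virginica

regDefault   <- lm(Petal.Length ~ Species, data = iris)     # baseline: setosa
regReleveled <- lm(Petal.Length ~ Species, data = irisRef)  # baseline: virginica

# Different coefficients...
coef(regDefault)
coef(regReleveled)

# ...but the same model: the fitted values agree
all.equal(fitted(regDefault), fitted(regReleveled))
```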
### Petal.Length ~ Species (Dummy Coding) + Sepal.Length
rm(iris); data(iris) # ...just to fix the order of Species back to default
reg <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)
# BTW: since is.factor(iris$Species) == TRUE, R does the dummy coding in lm() for you
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
# compare w. Simple Linear Regression
reg <- lm(Petal.Length ~ Sepal.Length, data = iris)
regSum <- summary(reg)
regSum$r.squared
regSum$coefficients
Multiple Regression in R
• Multiple regression with dummy-coded categorical predictors
### Comparing nested models
reg1 <- lm(Petal.Length ~ Sepal.Length, data = iris)
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = iris) # reg1 is nested under reg2
# terminology: reg2 is a "full model"
# this terminology will be used quite often in Logistic Regression
# NOTE: Nested models
# There is a set of coefficients for the nested model (reg1) such that it
# can be expressed in terms of the full model (reg2); in our case it is simple
# HOME: - figure it out.
anova(reg1, reg2) # partial F-test; Species certainly has an effect beyond Sepal.Length
# NOTE: for partial F-test, see:
# http://pages.stern.nyu.edu/~gsimon/B902301Page/CLASS02_24FEB10/PartialFtest.pdf
Multiple Regression in R
• Comparison of nested models
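What anova() reports for two nested models can be reproduced from the residual sums of squares. The sketch below uses the standard partial F statistic, F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2), where df1 and df2 are the residual degrees of freedom of the nested and the full model; the names rss1, rss2, dfDiff, Fstat and pValue are illustrative.

```r
data(iris)
reg1 <- lm(Petal.Length ~ Sepal.Length, data = iris)            # nested model
reg2 <- lm(Petal.Length ~ Species + Sepal.Length, data = iris)  # full model

# Residual sums of squares of both models
rss1 <- sum(resid(reg1)^2)
rss2 <- sum(resid(reg2)^2)

# Number of extra parameters in the full model (two Species dummies)
dfDiff <- df.residual(reg1) - df.residual(reg2)

# Partial F statistic and its p-value
Fstat  <- ((rss1 - rss2) / dfDiff) / (rss2 / df.residual(reg2))
pValue <- pf(Fstat, dfDiff, df.residual(reg2), lower.tail = FALSE)
Fstat
pValue

# Compare with the second row of the anova() table
anova(reg1, reg2)
```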
#### Multiple Regression - by the book
# Following: http://www.r-tutor.com/elementary-statistics/multiple-linear-regression
# (that's from your reading list, to remind you...)
data(stackloss)
str(stackloss)
# Data set description
# URL: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/stackloss.html
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)
# let's see:
summary(stacklossModel)
glance(stacklossModel) # {broom}
tidy(stacklossModel) # {broom}
# predict new data
obs <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(stacklossModel, obs)
Multiple Regression in R
• By the book: two or three continuous predictors…
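The prediction for a new observation is just the linear combination of the fitted coefficients with the new predictor values. The sketch below checks this by hand and also shows the interval argument of predict() and the residuals, both listed in the session outline; byHand, ciMean and piNew are illustrative names.

```r
data(stackloss)
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)
obs <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)

# Point prediction by hand: intercept + sum of coefficient * predictor value
byHand <- sum(coef(stacklossModel) * c(1, 72, 20, 85))
byHand
predict(stacklossModel, obs)

# Confidence interval for the mean response at obs
ciMean <- predict(stacklossModel, obs, interval = "confidence", level = .95)
# Prediction interval for a single new observation at obs (always wider)
piNew  <- predict(stacklossModel, obs, interval = "prediction", level = .95)
ciMean
piNew

# Residuals: observed minus fitted values
head(resid(stacklossModel))
```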
# confidence intervals
confint(stacklossModel, level = .95) # 95% CI
confint(stacklossModel, level = .99) # 99% CI
# 95% CI for Acid.Conc. only
confint(stacklossModel, "Acid.Conc.", level = .95)
# default regression plots in R
plot(stacklossModel)
Multiple Regression in R
• By the book: two or three continuous predictors…
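For an lm fit, the intervals returned by confint() are the usual t-based ones: estimate ± t critical value × standard error, with the residual degrees of freedom. A minimal sketch that rebuilds the 95% intervals by hand (est, tCrit and ciByHand are illustrative names):

```r
data(stackloss)
stacklossModel <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                     data = stackloss)

# Coefficient table: estimates and standard errors
est <- coef(summary(stacklossModel))

# Two-sided 95% critical value of the t distribution on the residual df
tCrit <- qt(0.975, df.residual(stacklossModel))

ciByHand <- cbind(lower = est[, "Estimate"] - tCrit * est[, "Std. Error"],
                  upper = est[, "Estimate"] + tCrit * est[, "Std. Error"])
ciByHand

# Should agree with confint()
confint(stacklossModel, level = .95)
```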
# multicollinearity
library(car) # John Fox's car package
VIF <- vif(stacklossModel)
VIF
sqrt(VIF)
# Variance Inflation Factor (VIF)
# The increase in the ***variance*** of a regression coeff. due to collinearity
# NOTE: sqrt(VIF) = how much larger the ***SE*** of a reg. coeff. is vs. what it would be
# if there were no correlations with the other predictors in the model
# NOTE: lower_bound(VIF) = 1; no upper bound; VIF > 2 --> (Concerned == TRUE)
Tolerance <- 1/VIF # obviously, tolerance and VIF are redundant
Tolerance
# NOTE: you can inspect multicollinearity in the multiple regression model
# by conducting a Principal Component Analysis over the predictors;
# when the time is right.
Multiple Regression in R
• Assumptions: multicollinearity
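The VIF of a predictor can be computed without {car}: regress that predictor on all the other predictors and take 1 / (1 - R²). The sketch below does this in base R for the three stackloss predictors; predictors, vifByHand and toleranceByHand are illustrative names, and the result is meant to be compared with the output of vif(stacklossModel).

```r
data(stackloss)
predictors <- stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")]

vifByHand <- sapply(names(predictors), function(p) {
  others <- setdiff(names(predictors), p)
  # Regress predictor p on the remaining predictors
  f  <- as.formula(paste(p, "~", paste(others, collapse = " + ")))
  r2 <- summary(lm(f, data = predictors))$r.squared
  1 / (1 - r2)   # VIF; the tolerance is its reciprocal, 1 - r2
})
vifByHand

# Tolerance is the reciprocal of VIF
toleranceByHand <- 1 / vifByHand
toleranceByHand
```

For purely numeric predictors the same values appear on the diagonal of the inverse of the predictors' correlation matrix, diag(solve(cor(predictors))), which is a convenient cross-check.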
#### R for partial and part (semi-partial) correlations
library(ppcor) # a good one; there are many ways to do this in R
#### partial correlation in R
dataSet <- iris
str(dataSet)
dataSet$Species <- NULL
irisPCor <- pcor(dataSet, method = "pearson")
irisPCor$estimate # partial correlations
irisPCor$p.value # results of significance tests
irisPCor$statistic # t-test on n-2-k degrees of freedom; k = num. of variables conditioned on
# partial correlation between x and y while controlling for z
partialCor <- pcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                        dataSet$Sepal.Width,
                        method = "pearson")
partialCor$estimate
partialCor$p.value
partialCor$statistic
Multiple Regression in R
• Partial Correlation in R
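A partial correlation can be reproduced in base R by removing z from both variables and correlating the residuals. This is a minimal sketch under that definition, using the same three iris columns as the pcor.test() call above; rx, ry, partialByHand and partialFormula are illustrative names.

```r
data(iris)
x <- iris$Sepal.Length
y <- iris$Petal.Length
z <- iris$Sepal.Width

# Remove the linear effect of z from both x and y
rx <- resid(lm(x ~ z))
ry <- resid(lm(y ~ z))

# Partial correlation of x and y, controlling for z
partialByHand <- cor(rx, ry)
partialByHand

# The same value from the closed-form expression in the pairwise correlations
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
partialFormula <- (rxy - rxz * ryz) / sqrt((1 - rxz^2) * (1 - ryz^2))
partialFormula
```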
#### semi-partial correlation in R
# NOTE: ... Semi-partial correlation is the correlation of two variables
# with variation from a third or more other variables removed only
# from the ***second variable***
# NOTE: The first variable <- rows, the second variable <- columns
# cf. ppcor: An R Package for a Fast Calculation to Semi-partial Correlation Coefficients (2015)
# Seongho Kim, Biostatistics Core, Karmanos Cancer Institute, Wayne State University
# URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4681537/
irisSPCor <- spcor(dataSet, method = "pearson")
irisSPCor$estimate
irisSPCor$p.value
irisSPCor$statistic
partCor <- spcor.test(dataSet$Sepal.Length, dataSet$Petal.Length,
                      dataSet$Sepal.Width,
                      method = "pearson")
# NOTE: this is a correlation of dataSet$Sepal.Length w. dataSet$Petal.Length
# when the variance of dataSet$Petal.Length (2nd variable) due to dataSet$Sepal.Width
# is removed!
partCor$estimate
partCor$p.value
Multiple Regression in R
• Part (semi-partial) Correlation in R
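Following the description in the notes above (z is removed only from the second variable), the semi-partial correlation can be sketched in base R by residualizing y alone. The names semiPartialByHand and semiPartialFormula are illustrative; the result is meant to be compared with partCor$estimate.

```r
data(iris)
x <- iris$Sepal.Length   # first variable, left untouched
y <- iris$Petal.Length   # second variable, z is removed from it
z <- iris$Sepal.Width

# Remove the linear effect of z from y only
ry <- resid(lm(y ~ z))

# Semi-partial (part) correlation of x with y, with z removed from y
semiPartialByHand <- cor(x, ry)
semiPartialByHand

# The same value from the closed-form expression in the pairwise correlations
rxy <- cor(x, y); rxz <- cor(x, z); ryz <- cor(y, z)
semiPartialFormula <- (rxy - rxz * ryz) / sqrt(1 - ryz^2)
semiPartialFormula
```

The only difference from the partial correlation is the denominator: there both variables are residualized, here only one is.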
# NOTE: In multiple regression, this is the semi-partial (or part) correlation
# that you need to inspect:
# assume a model with X1, X2, X3 as predictors, and Y as a criterion
# You need a semi-partial of X1 and Y following the removal of X2 and X3 from Y
# It goes like this: in Step 1, you perform a multiple regression Y ~ X2 + X3;
# In Step 2, you take the residuals of Y, call them RY; in Step 3, you regress (correlate)
# RY ~ X1: the correlation coefficient that you get from Step 3 is the part correlation
# that you're looking for.
Multiple Regression in R
• NOTE on semi-partial (part) correlation in multiple regression…
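The three steps can be sketched in base R as follows. The choice of variables is an assumption made for illustration: Y = Petal.Length, X1 = Sepal.Length, X2 = Sepal.Width, X3 = Petal.Width. The second half of the block is not from the slides: many textbooks define the part correlation the other way round, removing X2 and X3 from X1 rather than from Y, and under that convention its square equals the gain in R² when X1 is added to the model. Both versions are shown so they can be compared.

```r
data(iris)

# Steps 1-3 as described in the note: remove X2 and X3 from Y, correlate with X1
RY <- resid(lm(Petal.Length ~ Sepal.Width + Petal.Width, data = iris))
partFromY <- cor(RY, iris$Sepal.Length)

# Alternative convention (assumption, not from the slides):
# remove X2 and X3 from X1, then correlate with the untouched Y
RX1 <- resid(lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris))
partFromX1 <- cor(iris$Petal.Length, RX1)

# Under the alternative convention, the squared part correlation is the
# increase in R^2 obtained by adding X1 to a model that already has X2 and X3
r2Full    <- summary(lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
                        data = iris))$r.squared
r2Reduced <- summary(lm(Petal.Length ~ Sepal.Width + Petal.Width,
                        data = iris))$r.squared

c(partFromY = partFromY, partFromX1 = partFromX1,
  partFromX1Squared = partFromX1^2, deltaR2 = r2Full - r2Reduced)
```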