DA 5230 – Statistical & Machine Learning
Lecture 4 – Linear Regression
Maninda Edirisooriya
manindaw@uom.lk
What is Regression? (Reminder)
• A Supervised ML problem type, where we have labelled data
• Some of the labelled data is used for training the model
• The rest of the data is used to test the model (predicting the labels)
• The goal is to predict a continuous variable (the dependent variable, Y)
• From one or more independent variables (X values)
What is Regression? (Reminder)
• Regression can be achieved with various types of ML algorithms
• All of them try to explain the given dataset (the training set) using
a function defined by the model
• Parameters of the function are set so that the function can give
predictions for unseen data with the least amount of error
• In other words, during training, a Regression model tries to
set its parameters to approximate the true relationship (i.e. to
minimize the errors)
• Mathematically, f̂(X) tries to approximate f(X) by minimizing the
error ε, which is the difference between Ŷ and Y
What is Regression? (Reminder)
[Figure: scatterplot of data points with a fitted regression line. For a data point at X1, the vertical gap between the actual value Y1 and the predicted value Ŷ1 on the line is the error ε.]
Linear Regression
• A specific type of Regression
• Using a straight line as the Regression model
• The model is of the form y = mx + c
• Where m and c are model parameters
• The training process should find values for m and c so that the errors are
minimized
• But there is no single error to be minimized; there is an error for every data
point
• We minimize the Mean Squared Error of the residuals,
MSE = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)², where n is the number of data points
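As a minimal sketch (not part of the original slides, assuming NumPy and a small synthetic dataset), m and c can be fitted by least squares and the MSE computed as follows:

```python
import numpy as np

# Hypothetical data: a noisy straight line
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = 2.5 * X + 1.0 + rng.normal(0, 1.0, size=50)

# Fit y = mx + c by least squares (degree-1 polynomial fit)
m, c = np.polyfit(X, Y, deg=1)

# MSE = (1/n) * sum((Yi - Yhat_i)^2)
Y_hat = m * X + c
mse = np.mean((Y - Y_hat) ** 2)
print(f"m = {m:.3f}, c = {c:.3f}, MSE = {mse:.3f}")
```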
Linear Regression - Example
• Problem: An investment company is looking
for companies to invest in. They are
interested in modeling how the profit of
a company varies with its R&D cost
• Analysis: We do an Exploratory Data
Analysis and draw a scatterplot to see
the relationship between the R&D cost
(X) and the profit (Y)
• As the relationship looks linear, we may
use linear regression
Linear Regression - Example
• With linear regression we fit the data to a straight line by finding m
and c from the data points in the training dataset
• We can then evaluate/test the model using the test dataset
• Either using the Mean Squared Error (MSE)
• Or using the R-squared (R²) measure, as sketched below
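A sketch of this workflow with scikit-learn, using hypothetical R&D cost and profit figures (the variable names and numbers here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: R&D cost (X) vs profit (Y)
rng = np.random.default_rng(1)
rnd_cost = rng.uniform(10, 200, size=100).reshape(-1, 1)
profit = 0.8 * rnd_cost.ravel() + 20 + rng.normal(0, 10, size=100)

# Hold out a test set, fit the line on the training set
X_train, X_test, y_train, y_test = train_test_split(
    rnd_cost, profit, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the test set with MSE and R-squared
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```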
Multiple Linear Regression
• When there is more than one independent variable the model cannot be
shown in a 2-dimensional scatterplot
• For example, when there are 2 independent variables X1 and X2 the graph has to be 3
dimensional
• With more variables than that, the data points cannot be visualized at all
• Still, for Linear Regression the model generalizes to a multidimensional
hyperplane
• For example, when there are 2 independent variables X1 and X2 the model can be
represented with a flat 2D plane instead of a straight line
• This generic form of Linear Regression is known as Multiple Linear
Regression, while the case with only one independent variable is called
Simple Linear Regression
Multiple Linear Regression
That model hyperplane can be represented as,
Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn
Where,
β0 : the intercept (like c in Y = mX + c)
βi : the coefficient of variable Xi (like m in Y = mX + c), for each
variable index i
Just as we find values for m and c in Y = mX + c, we have to find βi for all i
(where 0 ≤ i ≤ n) in Multiple Linear Regression
Linear Regression - Assumptions
There are several assumptions that should hold true for applying Linear
Regression to a dataset
• Linearity
• Homoscedasticity of Residuals or Equal Variances
• No Autocorrelation in residuals
• Residuals are distributed Normally
• No Multicollinearity of data (covered below)
Linearity
• As we are going to model the data with a linear model, each
independent variable should be linearly correlated with the
dependent variable
• A scatterplot can be used to view the relationship between each X
variable and the Y variable
• If there is no linearity, you can try applying a non-linear function to
such a variable to make the relationship linear (see the sketch after this list)
• Use Polynomials of X (described later in Polynomial Regression)
• Use exponentials/logarithms of X
• …
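A small illustration of such a transformation, assuming a hypothetical exponential X-Y relationship (the data here is synthetic):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical non-linear relationship: Y grows exponentially with X
rng = np.random.default_rng(2)
X = rng.uniform(1, 5, size=200)
Y = np.exp(1.2 * X) * rng.lognormal(0, 0.1, size=200)

# Y itself is poorly described by a straight line in X,
# but log(Y) is approximately linear in X
r_xy, _ = pearsonr(X, Y)
r_xlog, _ = pearsonr(X, np.log(Y))
print("corr(X, Y)    :", r_xy)
print("corr(X, log Y):", r_xlog)
```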
Homoscedasticity or Equal Variances of Residuals
• The variance of the residuals should be the same across all values of
any given independent variable
• The opposite of this phenomenon is known as Heteroscedasticity, where
the variance changes with the value of that variable
• Residuals can be plotted against the X values to check for a consistent
spread, or a statistical test like the Breusch-Pagan test or
White's test can be used (see the sketch below)
• When Heteroscedasticity is present, Weighted Least Squares
Regression can be used instead of ordinary Linear Regression,
down-weighting the data points with higher variance
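A sketch of the Breusch-Pagan test with statsmodels, on hypothetical data whose noise grows with X (the data-generating numbers are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical heteroscedastic data: noise grows with X
rng = np.random.default_rng(3)
X = rng.uniform(1, 10, size=200)
Y = 3.0 * X + rng.normal(0, 0.5 * X)  # residual variance increases with X

# Fit OLS, then test the residuals for heteroscedasticity
exog = sm.add_constant(X)
results = sm.OLS(Y, exog).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, exog)
print("Breusch-Pagan p-value:", lm_pvalue)  # small p => heteroscedasticity
```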
No Autocorrelation in Residuals
• Errors/residuals of the data points should be independent of each other
• Otherwise, autocorrelation would underestimate the standard error, which
would cause other issues like Heteroscedasticity and less precise variance estimates
• This issue is often found in time series data, as there can be correlation
between the errors of adjacent data points
• Autocorrelation between residuals one timestep apart can be measured with the
Durbin-Watson statistic d, where t is the timestep
d = Σₜ₌₂ᵀ (eₜ − eₜ₋₁)² / Σₜ₌₁ᵀ eₜ²
d < 1.8 ⇒ Positive Autocorrelation ; d > 2.2 ⇒ Negative Autocorrelation
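A sketch with statsmodels' durbin_watson on hypothetical AR(1) errors (the series and coefficients are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical time series with autocorrelated errors
rng = np.random.default_rng(4)
t = np.arange(200, dtype=float)
e = np.zeros(200)
for i in range(1, 200):           # AR(1) errors: e_t depends on e_{t-1}
    e[i] = 0.7 * e[i - 1] + rng.normal(0, 1)
Y = 1.5 * t + e

results = sm.OLS(Y, sm.add_constant(t)).fit()
d = durbin_watson(results.resid)
print("Durbin-Watson d:", d)      # d < 1.8 suggests positive autocorrelation
```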
Residuals are Distributed Normally
• The residuals of the model should be normally
distributed
• Normality can be checked visually in a histogram, a KDE plot or a
QQ plot
• Statistical tests like the Shapiro-Wilk test or the Anderson-Darling test can
also be used, as sketched below
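A sketch of both tests with SciPy, on hypothetical residuals (synthetic data, for illustration only):

```python
import numpy as np
from scipy.stats import shapiro, anderson

# Hypothetical residuals from a fitted model
rng = np.random.default_rng(5)
residuals = rng.normal(0, 1, size=100)

# Shapiro-Wilk: a large p-value means normality is not rejected
stat, p_value = shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)

# Anderson-Darling: compare the statistic against the critical values
result = anderson(residuals, dist="norm")
print("A-D statistic:", result.statistic)
print("Critical values:", result.critical_values)
```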
No Multicollinearity of Data
• Each independent variable should have little or no correlation
with the other independent variables
• If two variables are correlated, one carries similar information to the
other, which adds a redundant variable to the model; this makes the
model more complex without adding new information
• Pairwise correlation can be viewed using a Pair plot or a Heatmap
• The Variance Inflation Factor (VIF) can be used to measure how strongly
each independent variable is explained by the others (R² is explained later)
VIF = 1 / (1 − R²)
• VIF = 1 ⇒ No collinearity ; VIF > 10 ⇒ Problematic, in general
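A sketch with statsmodels' variance_inflation_factor, on hypothetical features where X3 is nearly a copy of X1 (the feature construction is an assumption for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical features: X3 is almost a copy of X1 (strong collinearity)
rng = np.random.default_rng(6)
X1 = rng.normal(size=200)
X2 = rng.normal(size=200)
X3 = X1 + rng.normal(0, 0.05, size=200)
exog = sm.add_constant(np.column_stack([X1, X2, X3]))

# VIF = 1 / (1 - R^2) of each variable regressed on the others
for i, name in enumerate(["const", "X1", "X2", "X3"]):
    print(name, variance_inflation_factor(exog, i))
# Expect VIF near 1 for X2 but very large for X1 and X3
```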
Training
• During training (fitting the model to data) we try to assign the best possible
values to βi for each i, where,
• Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + ε or,
• Y = Xᵀβ + ε in vector form, where X = [1, X1, X2, … Xn]ᵀ
• The values of βi are selected so that the Mean Squared Error
(MSE) is minimized over the training data, where,
• MSE = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)², where n is the number of data points or,
• MSE = (1/n) Σᵢ₌₁ⁿ eᵢ² = (1/n) eᵀe in vector form, where eᵢ = Yᵢ − Ŷᵢ and
e is an n x 1 column vector
Training
• There are 2 main types of techniques used in linear regression
• Closed-form solutions – give the exact solution that minimizes the residuals
• E.g.: Ordinary Least Squares (OLS)
• These can lead to sub-optimal models when the linear regression assumptions
(linearity, homoscedasticity and normality of residuals) are violated
• Noise in the data can cause overfitting, where the model contains unnecessary
complexity by fitting the noise
• Easy to compute when the number of parameters is small (e.g.: Simple Linear
Regression) but expensive when the number of parameters is large
• Faster as there are no iterations
• Iterative methods – compute towards the solution step by step with numerical
methods
• E.g.: Gradient Descent method – will be discussed later (a sketch of both
approaches follows)
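A minimal sketch contrasting the two approaches in NumPy, assuming synthetic data (the closed form shown is the OLS normal equation; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

# Hypothetical data with an intercept column and two independent variables
rng = np.random.default_rng(7)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -3.0])
Y = X @ beta_true + rng.normal(0, 0.5, size=n)

# Closed form (OLS normal equation): beta = (X^T X)^{-1} X^T Y,
# solved here with a numerically stable least-squares routine
beta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Iterative alternative: gradient descent on the MSE
beta_gd = np.zeros(3)
lr = 0.1
for _ in range(1000):
    grad = (2 / n) * X.T @ (X @ beta_gd - Y)  # gradient of MSE w.r.t. beta
    beta_gd -= lr * grad

print("OLS:", beta_ols.round(3))
print("GD :", beta_gd.round(3))
```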
Testing/Evaluating
• Once the model is created in the training phase, it has to be evaluated
for the accuracy of its predictions using the test dataset
• There are 2 main evaluation measures for the accuracy of a
linear regression model
• Mean Squared Error (MSE) method – find the MSE on the test dataset, which
measures the inaccuracy level of the model
• R-squared (R²) method – find the R-squared measure for the model on the test
dataset, which measures the accuracy of the model (the better the
model, the closer R² is to 1)
R-squared (R²) Measure
• Mean = Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ
• Residual Sum of Squares = SSres = Σᵢ (Yᵢ − fᵢ)² = Σᵢ eᵢ²
• Total Sum of Squares = SStot = Σᵢ (Yᵢ − Ȳ)²
R² = 1 − SSres / SStot
• In a perfect model, Yᵢ = fᵢ ⇒ SSres = 0 ⇒ R² = 1
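These definitions translate directly into a few lines of NumPy (a sketch; the actual and predicted values below are made-up numbers):

```python
import numpy as np

def r_squared(y, y_pred):
    """R^2 = 1 - SSres/SStot, computed from the definitions above."""
    ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# Hypothetical actual vs predicted values
y = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])
print("R^2:", r_squared(y, y_pred))  # close to 1 for a good fit
```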
Polynomial Regression
• In Simple Linear Regression we learned above that we have to assume
the X variable is linearly correlated with the Y variable
• But in nature that is not the only possible X-Y relationship; there
can be various non-linear relationships as well
Polynomial Regression
• In such cases, although X itself does not have a linear
relationship with Y, X² may have one
• Or X³ may have a linear relationship with Y
• Or both X² and X³ may have linear relationships with Y
• In such cases we can engineer (create) new features by raising the
independent variable to some power (e.g.: X²)
• Generally we start with power 2 and increase the power iteratively
• Then we can train and test the result as a Multiple Linear Regression problem
• This is known as Polynomial Regression
Polynomial Regression - Example
• Suppose we have X as the independent variable and Y as the dependent
variable
• Simple Linear Regression would be, Y = β0 + β1*X
• This is a polynomial of X to the power 1
• Let’s move to the 2nd order polynomial (quadratic form)
Y = β0 + β1*X + β2*X²
• Depending on the performance results of the tests, we can increase
the power to 3 too (cubic form)
Y = β0 + β1*X + β2*X² + β3*X³
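A sketch of this with scikit-learn's PolynomialFeatures, on hypothetical quadratic data (the coefficients used to generate the data are assumptions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical quadratic relationship between X and Y
rng = np.random.default_rng(8)
X = rng.uniform(-3, 3, size=100).reshape(-1, 1)
Y = 1.0 + 2.0 * X.ravel() + 0.5 * X.ravel() ** 2 + rng.normal(0, 0.3, size=100)

# Degree-2 polynomial regression = linear regression on [X, X^2]
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, Y)
print("R^2 on training data:", model.score(X, Y))
```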
Multivariate Polynomial Regression
• Let’s consider the case where there is more than one independent
variable
• When we use a polynomial of degree n (e.g.: n=2 for quadratic),
not only are the individual Xi variables raised to powers up to n, but there
are also new features that are products of the engineered
variables
• We can iteratively add/remove such newly engineered features
depending on their significance, using an algorithm, as the number of
new feature combinations grows rapidly with the degree of
the polynomial (we will discuss such algorithms later in this subject
module)
Multivariate Polynomial Regression - Example
• Suppose we have X1 and X2 as independent variables and Y as the
dependent variable
• Multiple Linear Regression would use, Y = β0 + β1*X1 + β2*X2
• This is a polynomial of the X variables to the power 1
• Let’s move to the 2nd order polynomial
Y = β0 + β1*X1 + β2*X2 + β3*X1² + β4*X2² + β5*X1*X2
• Features in the cubic form would be,
X1, X2, X1², X2², X1³, X2³, X1*X2, X1²*X2, X1*X2²
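scikit-learn can generate exactly these feature combinations; a sketch (the variable values are placeholders):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two independent variables X1 and X2 (hypothetical values)
X = np.array([[2.0, 3.0]])

# Degree 2 generates X1, X2, X1^2, X1*X2 and X2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(X)
print(poly.get_feature_names_out(["X1", "X2"]))
# ['X1' 'X2' 'X1^2' 'X1 X2' 'X2^2']
```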
Two Hour Homework
• Officially we have two more hours of work after the end of the lectures
• Therefore, for this week’s extra hours you have homework
• After today’s tutorial, figure out all the sections you found difficult in it
• Try to complete it yourself, referring to the Internet when needed
• Try applying linear regression to the different types of ML problems you have and
get familiar with them
• Try to verify that the linear regression assumptions are satisfied in each of
the problems, and feature engineer with EDA done iteratively
• You need to know linear regression well for the further ML and SL topics ahead
• Good Luck!
Questions?