University of Gondar
College of Medicine and Health Science
Department of Epidemiology and
Biostatistics
Linear regression
Lemma Derseh (BSc., MPH)
Scatter Plots and Correlation
 Before trying to fit any model, it is better to examine a scatter plot of the data
 A scatter plot (or scatter diagram) is used to show the
relationship between two variables
 If a scatter plot shows some sort of linear relationship, we
can use correlation analysis to measure the strength of the linear
relationship between the two variables
o Correlation is only concerned with the strength of the linear
relationship and its direction
o We consider the two variables equally; as a result no causal
effect is implied
Scatter Plot Examples
[Four scatter plots: two panels showing linear relationships, two showing curvilinear relationships]
Scatter Plot Examples
[Four scatter plots: two panels showing strong relationships, two showing weak relationships]
Scatter Plot Examples
[Two scatter plots showing no relationship at all]
Correlation Coefficient
 The population correlation coefficient ρ (rho) measures
the strength of the association between the variables
 The sample correlation coefficient r is an estimate of ρ
and is used to measure the strength of the linear
relationship in the sample observations
Features of ρ and r
 Unit free
 Range between -1 and 1
 The closer to -1, the stronger the negative linear
relationship
 The closer to 1, the stronger the positive linear relationship
 The closer to 0, the weaker the linear relationship
Examples of Approximate r Values
[Five scatter plots illustrating r = -1, r = -0.6, r = 0, r = +0.3, r = +1]
Calculating the Correlation Coefficient
Sample correlation coefficient:

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$

or the algebraic equivalent:

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the ‘independent’ variable
y = Value of the ‘dependent’ variable
Example

Child Height (cm), x   Child Weight (kg), y   xy      x²       y²
35                     8                      280     1225     64
49                     9                      441     2401     81
27                     7                      189     729      49
33                     6                      198     1089     36
60                     13                     780     3600     169
21                     7                      147     441      49
45                     11                     495     2025     121
51                     12                     612     2601     144
Σx = 321               Σy = 73                Σxy = 3142   Σx² = 14111   Σy² = 713
Calculation Example

[Scatter plot: child height (cm) on the vertical axis against child weight (kg) on the horizontal axis]

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (321)(73)}{\sqrt{\left[8(14111) - (321)^2\right]\left[8(713) - (73)^2\right]}} = 0.886$$
r = 0.886 → relatively strong positive
linear association between x and y
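To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides) that reproduces r for the height/weight data using the algebraic formula and cross-checks it against NumPy's built-in:

```python
# Minimal sketch (not from the slides): r for the height/weight example,
# using the algebraic form of the sample correlation coefficient.
import numpy as np

x = np.array([35, 49, 27, 33, 60, 21, 45, 51])  # child height (cm)
y = np.array([8, 9, 7, 6, 13, 7, 11, 12])       # child weight (kg)
n = len(x)

# r = [nΣxy - (Σx)(Σy)] / sqrt{[nΣx² - (Σx)²][nΣy² - (Σy)²]}
num = n * (x * y).sum() - x.sum() * y.sum()
den = np.sqrt((n * (x**2).sum() - x.sum()**2) * (n * (y**2).sum() - y.sum()**2))
print(round(num / den, 3))                 # 0.886
print(round(np.corrcoef(x, y)[0, 1], 3))   # 0.886 (NumPy cross-check)
```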
SPSS output
SPSS Correlation Output: Analyze / Correlate / Bivariate / Pearson / OK
Correlation between child height and weight

Correlations
                                      Child weight   Child height
Child weight   Pearson Correlation    1              0.886
               Sig. (2-tailed)                       0.003
               N                      8              8
Child height   Pearson Correlation    0.886          1
               Sig. (2-tailed)        0.003
               N                      8              8
Significance Test for Correlation
 Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
 Test statistic (with n - 2 degrees of freedom):

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Here, the degrees of freedom are taken to be n - 2 because any two
points can always be joined exactly by a straight line, so two
observations are used up before there is any scatter to assess
Example:
Is there evidence of a linear relationship between child
height and weight at the 0.05 level of significance?
H0: ρ = 0 (No correlation)
H1: ρ ≠ 0 (correlation exists)
 = 0.05 , df = 8 - 2 = 6
4.68
2
8
.886
1
.886
2
n
r
1
r
t
2
2







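A quick sketch of the same test (not from the slides; SciPy assumed available), including the two-tailed p-value that SPSS reports as Sig.:

```python
# Minimal sketch (not from the slides): t-test for H0: ρ = 0.
import numpy as np
from scipy import stats

r, n = 0.886, 8
t = r / np.sqrt((1 - r**2) / (n - 2))   # test statistic
p = 2 * stats.t.sf(abs(t), df=n - 2)    # two-tailed p-value
print(round(t, 2), round(p, 3))         # 4.68 0.003 (matches SPSS's Sig.)
```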
Introduction to Regression Analysis
 Regression analysis is used to:
Predict the value of a dependent variable based on the
value of at least one independent variable
Explain the impact of changes in an independent
variable on the dependent variable
 Dependent variable: the variable we wish to explain. In
linear regression it is always a continuous variable
 Independent variable: the variable used to explain the
dependent variable. In linear regression it could have any
measurement scale.
Simple Linear Regression Model
 Only one independent variable, x
 Relationship between x and y is described
by a linear function
 Changes in y are assumed to be caused by
changes in x
Population Linear Regression
The population regression model:

$$y = \underbrace{\beta_0 + \beta_1 x}_{\text{linear component}} + \underbrace{\varepsilon}_{\text{random error component}}$$

where y is the dependent variable, x the independent variable, β₀ the population y-intercept, β₁ the population slope coefficient, and ε the random error term (residual).
Linear Regression Assumptions
 The relationship between the two variables, x and y is Linear
 Independent observations
 Error values are Normally distributed for any given value of x
 The probability distribution of the errors has Equal variance
 Fixed independent variables (not random = non-stochastic = given
values = deterministic); the only randomness in the values of Y
comes from the error term ε
 No autocorrelation of the errors (has some similarities with the 2nd)
 No outlier distortion
Assumptions viewed pictorially
LINE (Linear, Independent, Normal and Equal variance) assumptions

$$\mu_{y|x} = \alpha + \beta x, \qquad y \sim N\!\left(\mu_{y|x},\, \sigma^2_{y|x}\right)$$

[Figure: identical normal distributions of errors, all centered on the regression line]
Population Linear Regression
[Figure: scatter plot with the population line y = β₀ + β₁x + ε, intercept β₀ and slope β₁; for a given xᵢ, the observed value of y differs from the predicted value on the line by the random error εᵢ]
Estimated Regression Model
The sample regression line provides an estimate of the
population regression line:

$$\hat{y}_i = b_0 + b_1 x$$

where ŷ is the estimated (or predicted) y value, b₀ the estimate of the regression intercept, b₁ the estimate of the regression slope, and x the independent variable.
The individual random error terms ei have a mean of zero
Least Squares Criterion
 b0 and b1 are obtained by finding the values that
minimize the sum of the squared residuals
$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
The Least Squares Equation
 After some calculus (taking the derivatives and
setting them equal to zero), we can find the
following:
$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$
Interpretation of the Slope and the Intercept
 b0 is the estimated average value of y when the
value of x is zero (provided that x = 0 lies within the
data range considered).
 Otherwise it represents the portion of the variability of
the dependent variable left unexplained by the
independent variables considered
 b1 is the estimated change in the average value of y
as a result of a one-unit change in x
Example: Simple Linear Regression
 A researcher wishes to examine the relationship between
the average daily diet taken by a cohort of 20 sample
children and the weight they gained in one month (both
measured in kg). The content of the food is the same for
all of them.
 Dependent variable (y) = weight gained in one month
measured in kilogram
 Independent variable (x) = average weight of diet taken per
day by a child measured in Kilogram
Sample Data for child weight Model
Weight gained (y)   Diet (x)   Weight gained (y)   Diet (x)
0.4 0.65 0.86 1.1
0.46 0.66 0.89 1.12
0.55 0.63 0.91 1.20
0.56 0.73 0.93 1.32
0.65 0.78 0.96 1.33
0.67 0.76 0.98 1.35
0.78 0.72 1.02 1.42
0.79 0.84 1.04 1.1
0.80 0.87 1.08 1.5
0.83 0.97 1.11 1.3
Estimation using the computational formula
$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

From the data we have:
Σx = 20.35, Σy = 16.27, Σxy = 17.58, Σx² = 22.30

$$b_1 = \frac{17.58 - (20.35)(16.27)/20}{22.30 - 414.12/20} = 0.643$$

$$b_0 = \bar{y} - b_1 \bar{x} = 0.8135 - 0.643 \times 1.0175 \approx 0.160$$
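The same estimates can be verified from the summary sums; a minimal sketch, not part of the slides:

```python
# Minimal sketch (not from the slides): b1 and b0 from the summary sums.
n = 20
sum_x, sum_y, sum_xy, sum_x2 = 20.35, 16.27, 17.58, 22.30

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)
b0 = sum_y / n - b1 * sum_x / n
print(round(b1, 3), round(b0, 3))
# 0.643 0.159 (the slides report b0 = 0.160; the tiny gap comes from
# rounding in the sums, since SPSS works from the raw data)
```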
Regression Using SPSS
Analyze/ regression/linear….
Coefficients
Model        B      Std. Error   Beta    t      Sig.
(Constant)   0.160  .077                 2.065  .054
foodweight   0.643  .073         .900    8.772  .000
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficient)
Weight gained = 0.16 +0.643(food weight)
Interpretation of the Intercept, b0
 Here, no child had 0 kilograms of food per day, so for
food amounts within the range observed, 0.16 kg is the
portion of the weight gained not explained by food.
 Whereas b1 = 0.643 tells us that a child's weight gain
increases, on average, by 0.643 kg for each additional
kilogram of food taken per day
Weight gained = 0.16 + 0.643(food weight)
Explained and Unexplained Variation
Total variation is made up of two parts:
$$SST = SSR + SSE$$

(Total sum of squares = Sum of squares regression + Sum of squares error)

$$SST = \sum (y - \bar{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2$$

where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value
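As a numerical check that SST = SSR + SSE holds for the child-weight data, a short sketch (not from the slides) that fits the line and decomposes the variation:

```python
# Minimal sketch (not from the slides): decomposing SST into SSR + SSE
# for the 20-child diet / weight-gain data.
import numpy as np

diet = np.array([0.65, 0.66, 0.63, 0.73, 0.78, 0.76, 0.72, 0.84, 0.87, 0.97,
                 1.10, 1.12, 1.20, 1.32, 1.33, 1.35, 1.42, 1.10, 1.50, 1.30])
gain = np.array([0.40, 0.46, 0.55, 0.56, 0.65, 0.67, 0.78, 0.79, 0.80, 0.83,
                 0.86, 0.89, 0.91, 0.93, 0.96, 0.98, 1.02, 1.04, 1.08, 1.11])

b1, b0 = np.polyfit(diet, gain, 1)    # least-squares slope and intercept
y_hat = b0 + b1 * diet
sst = ((gain - gain.mean())**2).sum()
ssr = ((y_hat - gain.mean())**2).sum()
sse = ((gain - y_hat)**2).sum()
print(round(sst, 3), round(ssr, 3), round(sse, 3))  # 0.812 0.658 0.154
```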
Explained and Unexplained …
[Figure: for an observation (xᵢ, yᵢ), the total deviation yᵢ - ȳ (SST component) splits into the explained part ŷᵢ - ȳ (SSR component) and the unexplained part yᵢ - ŷᵢ (SSE component)]
 The coefficient of determination is the portion of the total
variation in the dependent variable that is explained by
variation in the independent variable
 The coefficient of determination is also called R-squared and
is denoted as R2
Coefficient of Determination, R2

$$R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$$

where $0 \le R^2 \le 1$
Coefficient of Determination, R2
In the single independent variable case, the coefficient
of determination is
Where:
R2 = Coefficient of determination
r = Simple correlation coefficient
$$R^2 = r^2$$
Coefficient of Determination, R2 cont…
 The F-test tests the statistical significance of the
regression of the dependent variable on the
independent variable: H0: β = 0
 However, the reliability of the regression equation is
very commonly measured by the correlation
coefficient R.
 Equivalently, one can check the statistical
significance of R or R² using the F-test and reach
exactly the same F-value as the test of the model coefficients
SPSS output

Model summary
Model   R      R Square   Adjusted R Square
1       0.900  0.810      0.800

ANOVA
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   0.658            1    0.658         76.948   .000
Residual     0.154            18   0.009
Total        0.812            19

$$R^2 = \frac{SSR}{SST} = \frac{0.658}{0.812} = 0.81$$

81% of the variation in children's weight increment is explained by variation in the food weight they took
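The whole table can be reproduced in Python; a minimal statsmodels sketch (not part of the original slides), reusing the data from the sample-data slide:

```python
# Minimal sketch (not from the slides): reproducing the SPSS simple
# regression with statsmodels.
import numpy as np
import statsmodels.api as sm

diet = np.array([0.65, 0.66, 0.63, 0.73, 0.78, 0.76, 0.72, 0.84, 0.87, 0.97,
                 1.10, 1.12, 1.20, 1.32, 1.33, 1.35, 1.42, 1.10, 1.50, 1.30])
gain = np.array([0.40, 0.46, 0.55, 0.56, 0.65, 0.67, 0.78, 0.79, 0.80, 0.83,
                 0.86, 0.89, 0.91, 0.93, 0.96, 0.98, 1.02, 1.04, 1.08, 1.11])

fit = sm.OLS(gain, sm.add_constant(diet)).fit()
print(fit.params.round(3))      # [0.16  0.643]  (intercept, slope)
print(round(fit.rsquared, 3))   # 0.81
print(round(fit.fvalue, 1))     # 76.9
```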
SPSS output

Model summary
R      R Square   Adjusted R Square   Std. Error of the Estimate
0.900  0.810      0.800               0.09248
(The standard error of the estimate is the standard deviation of the errors: the root of the 'mean square error', SƐ)

Coefficients
Model        B      Std. Error   Beta    t      Sig.
(Constant)   0.160  0.077                2.065  .054
foodweight   0.643  0.073        0.900   8.772  .000

ANOVA
Model        Sum of Squares   df   Mean Square   F        Sig.
Regression   0.658            1    0.658         76.948   0.000
Residual     0.154            18   0.009
Total        0.812            19
Inference about the Slope: t-Test
 t test for a population slope
Is there a linear relationship between x and y?
 Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1  0 (linear relationship does exist)
 Test statistic (d.f. = n - 2):

$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$
Where:
b1 = Sample regression slope (coefficient)
β1 = Hypothesized slope, usually 0
sb1 = Estimator of the standard error of the slope
Inference about the Slope: t-Test
Estimated regression equation: Weight gained = 0.16 + 0.643(food weight)
The slope of this model is 0.643. Does the weight of food
taken per day affect children's weight?
We have to test it statistically
Inferences about the Slope: t-Test Example

$$t = \frac{b_1 - 0}{s_{b_1}} = 8.772$$

Coefficients
Model         B      Std. Error   Beta    t      Sig.
(Constant)    0.160  0.077                2.065  .054
Food weight   0.643  0.073        0.900   8.772  0.000

The calculated t-statistic, 8.772, is greater than the
tabulated value, t(0.025,18) = 2.101
Decision: Reject Ho
Conclusion: There is sufficient evidence that food
weight taken per day affects children's weight
Confidence Interval estimation

Coefficients
Model         B      Std. Error   Beta    t      Sig.    95% CI for B (Lower, Upper)
(Constant)    0.160  0.077                2.065  .054    (-0.003, 0.322)
Food weight   0.643  0.073        0.900   8.772  0.000   (0.489, 0.796)

Confidence interval estimate of the slope:

$$b_1 \pm t_{\alpha/2}\, s_{b_1}, \qquad \text{df} = n - 2 = 18, \quad t_{(0.025,\,18)} = 2.101$$

The 95% confidence interval for the slope is (0.489, 0.796).
Note also that this 95% confidence interval does not include 0.
Note the relationship between all the figures highlighted in the blue circles on the original slide
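A small sketch of the interval computation (not from the slides; b1 and its standard error taken from the table above):

```python
# Minimal sketch (not from the slides): 95% CI for the slope.
from scipy import stats

b1, se_b1, df = 0.643, 0.073, 18
t_crit = stats.t.ppf(0.975, df)                  # 2.101
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(t_crit, 3), round(lo, 3), round(hi, 3))
# 2.101 0.49 0.796, matching SPSS's (0.489, 0.796) up to rounding of the SE
```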
Multiple linear regression
Multiple Linear Regression (MLR) is a
statistical method for estimating the
relationship between a dependent variable
and two or more independent (or predictor)
variables.
Function: Ypred = a + b1X1 + b2X2 + … + bnXn
Multiple Linear Regression
Simply, MLR is a method for studying the
relationship between a dependent variable
and two or more independent variables.
Purposes:
Prediction
Explanation
Theory building
Variations
[Diagram: the total variation in Y splits into the variation predictable from the combination of independent variables and the unpredictable variation]
Assumptions of the Linear regression Model
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model
(no omitted variables)*
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual
variance)
7. No multicollinearity*
8. No autocorrelation of the errors
9. No outlier distortion
(Most of them, except the 4th and 7th, are mentioned in the simple
linear regression model assumptions)
Multiple Coefficient of Determination, R2
o In multiple regression, the corresponding correlation
coefficient is called the Multiple Correlation Coefficient
 Since there is more than one independent variable, the multiple
correlation coefficient R is the correlation between the
observed y and predicted y values, whereas r (simple
correlation) is the correlation between x and y
 Unlike the situation for simple correlation, 0 ≤ R ≤ 1, because
it would be impossible to have a negative correlation between
the observed and the least-squares predicted values
 The square of a multiple correlation coefficient is of course the
corresponding coefficient of determination
Intercorrelation or collinearity
 If the two independent variables are uncorrelated, we
can uniquely partition the amount of variance in Y due
to X1 and X2 and bias is avoided.
 Small inter-correlations between the independent
variables will not greatly bias the b coefficients.
 However, large inter-correlations will bias the b
coefficients and for this reason other mathematical
procedures are needed
Multiple regression
%fat age Sex
9.5 23.0 0.0
27.9 23.0 1.0
7.8 27.0 0.0
17.8 27.0 0.0
31.4 39.0 1.0
25.9 41.0 1.0
27.4 45.0 0.0
25.2 49.0 1.0
31.1 50.0 1.0
34.7 53.0 1.0
42.0 53.0 1.0
42.0 54.0 1.0
29.1 54.0 1.0
32.5 56.0 1.0
30.3 57.0 1.0
21.0 57.0 1.0
33.0 58.0 1.0
33.8 58.0 1.0
41.1 60.0 1.0
34.5 61.0 1.0
Example:
Regress the percentage of body fat on age and sex
SPSS result on the next slide!
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change
1       .729a  .532       .506                6.5656                       .532              20.440     1     18    .000
2       .794b  .631       .587                5.9986                       .099              4.564      1     17    .047
a. Predictors: (Constant), sex; b. Predictors: (Constant), sex, age

ANOVA
Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   881.128          1    881.128       20.440   .000a
   Residual     775.932          18   43.107
   Total        1657.060         19
2  Regression   1045.346         2    522.673       14.525   .000b
   Residual     611.714          17   35.983
   Total        1657.060         19
a. Predictors: (Constant), sex; b. Predictors: (Constant), sex, age; c. Dependent Variable: %age of body fat

Coefficients
Model           B       Std. Error   Beta   t      Sig.   95% CI for B (Lower, Upper)
1  (Constant)   15.625  3.283               4.760  .000   (8.728, 22.522)
   sex          16.594  3.670        .729   4.521  .000   (8.883, 24.305)
2  (Constant)   6.209   5.331               1.165  .260   (-5.039, 17.457)
   sex          10.130  4.517        .445   2.243  .039   (.600, 19.659)
   age          .309    .145         .424   2.136  .047   (.004, .614)
a. Dependent Variable: %age of body fat relative to body
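Assuming the data table above, a short statsmodels sketch (not part of the original slides) reproduces Model 2 of the SPSS output:

```python
# Minimal sketch (not from the slides): the two-predictor model
# %fat = b0 + b1*sex + b2*age, as in Model 2 of the SPSS output.
import numpy as np
import statsmodels.api as sm

fat = np.array([9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7,
                42.0, 42.0, 29.1, 32.5, 30.3, 21.0, 33.0, 33.8, 41.1, 34.5])
age = np.array([23, 23, 27, 27, 39, 41, 45, 49, 50, 53,
                53, 54, 54, 56, 57, 57, 58, 58, 60, 61], dtype=float)
sex = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
                1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

X = sm.add_constant(np.column_stack([sex, age]))
fit = sm.OLS(fat, X).fit()
print(fit.params.round(3))     # [6.209 10.13  0.309]  (constant, sex, age)
print(round(fit.rsquared, 3))  # 0.631
```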