
Lecture 6: Linear Regression and
Correlation
Nigel Rozario, MS
Jie Zhou, MS
H. James Norton, PhD
10/17/2013
Introduction
Example
Points on this line (y = 5x + 2):
(0, 2)
(1, 7)
(2, 12)
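As a quick check (my addition, not on the original slide), a least-squares fit through these three points recovers the line exactly; a minimal Python sketch using numpy:

import numpy as np

# The three points from the example.
x = np.array([0, 1, 2])
y = np.array([2, 7, 12])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, 1)
print(round(slope, 6), round(intercept, 6))  # 5.0 and 2.0, i.e., y = 5x + 2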
Linear Regression
• Linear regression is an approach to modeling the relationship between a scalar response variable y and one or more explanatory variables denoted X.
• Linear regression has many practical uses:
  - prediction and forecasting
  - quantifying the strength of the relationship between y and the predictors Xj
Assumptions
• L: Linear (in parameters)
• I: Independent errors
• N: Normality of the error terms
• E: Equal variances: the variances of the error terms are equal
• X: The regressors xi are assumed to be error-free, that is, they are not contaminated with measurement errors.
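The slides do not show how to check these assumptions; one common approach, sketched below in Python (using the blood-pressure data that appears later in this lecture), is to examine the residuals after fitting:

import numpy as np
from scipy import stats

# Data from the sbp/age example later in the lecture.
age = np.array([18.0, 33.0, 27.0, 58.0, 20.0, 30.0])
sbp = np.array([120.0, 130.0, 134.0, 148.0, 110.0, 137.0])

# Fit sbp = b0 + b1*age by least squares and form the residuals.
b1, b0 = np.polyfit(age, sbp, 1)
residuals = sbp - (b0 + b1 * age)

# N: the Shapiro-Wilk test checks normality of the residuals.
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p:.3f}")  # a large p gives no evidence against normality

# L and E: inspect (or plot) residuals against age; curvature suggests
# nonlinearity, and a fan shape suggests unequal variances.
for a, r in zip(age, residuals):
    print(a, round(r, 2))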
Correlation
Pearson's product moment correlation coefficient
• Assumptions: the x and y values follow a bivariate normal distribution. See the sketch below.
• For ordinal data, or data that are not normally distributed, use Spearman's correlation.
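Both coefficients are easy to compute; a sketch (my addition, not from the slides) with scipy, using the age and blood-pressure data from the example later in this lecture:

from scipy import stats

age = [18, 33, 27, 58, 20, 30]
sbp = [120, 130, 134, 148, 110, 137]

# Pearson's r assumes bivariate normality.
r, p_r = stats.pearsonr(age, sbp)
print(round(r, 3), round(p_r, 4))   # r ≈ 0.846, p ≈ 0.0339 (so R² ≈ 0.715)

# Spearman's rho uses only ranks, so it suits ordinal or non-normal data.
rho, p_rho = stats.spearmanr(age, sbp)
print(round(rho, 3))                # ≈ 0.771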
Caution!
Method of Least Squares
• Let’s use an example..
Revised SAT ranking
UNC-Chapel Hill journalism professor Phil Meyer used statistical techniques (least-squares regression) to adjust for the different SAT participation rates of the 50 states and the District of Columbia. In essence, the technique adjusts the data to reflect what the SAT scores would likely be if the same percentage of students in every state took the test.
Data spreadsheet

State            Raw_Score   Taking_Test   Orig_Rank   Adjusted_Rank   Adjusted_Score
New Hampshire      921          0.75           28             1              993
Iowa              1093          0.05            1             2              990
North Dakota      1073          0.06            2             3              981
Kansas            1039          0.1             4             4              981
Illinois          1006          0.16           10             5              978
Minnesota         1023          0.12            7             6              976
Montana            982          0.22           19             7              974
Connecticut        897          0.81           33             8              974
North Carolina     844          0.57           48            49              898
South Carolina     832          0.58           51            50              887
Oregon             922          0.54           27             9              972
Massachusetts      896          0.79           35            10              971
Wisconsin         1023          0.11            7            11              971
Colorado           859          0.29           23            12              969
Tennessee         1015          0.12            9            13              968
Nebraska          1024          0.1             6            14              966
Maryland           904          0.64           32            15              965
Washington         913          0.49           31            16              957
New Jersey         886          0.74           39            17              957
Vermont            890          0.68           37            18              955
Model: MODEL1
Dependent Variable: Raw_Score (the outcome variable, Y)

Number of Observations Read    51
Number of Observations Used    51

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       166799           166799     129.90   <.0001
Error             49        62917       1284.02261
Corrected Total   50       229716

Pr > F is the p-value of the model; it tests whether R² is different from 0.

Root MSE            35.83326   R-Square   0.7261
Dependent Mean     941.74510   Adj R-Sq   0.7205
Coeff Var            3.80499

Root MSE (the root mean squared error) is the SD of the regression residuals; the closer to zero, the better the fit. R² shows the amount of variance of Y explained by X.

Parameter Estimates

Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1       1020.60998           8.54729       119.41    <.0001
Taking_Test    1       -220.51036          19.34725       -11.40    <.0001

Taking_Test is the predictor variable (X). The two-tailed p-values test the hypothesis that each coefficient is different from 0.

Expected Score = 1020.61 - 220.51 x (Taking_Test)
• Expected Score (North Carolina) = 1020.61 - 220.51 x (Taking_Test)
  = 1020.61 - 220.51 x (0.57)
  = 894.9193
• Residual (or error) = Raw Score - Expected Score
  = 844 - 894.9193
  ≈ -50.9193
The percentage of students who took the test only partly explains each state's SAT score.
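The original analysis was run in SAS PROC REG; as a sketch, the same prediction and residual for North Carolina can be reproduced in a few lines of Python:

# Coefficients from the SAS Parameter Estimates table above.
b0, b1 = 1020.60998, -220.51036

# North Carolina: Raw_Score = 844, Taking_Test = 0.57.
raw_score, taking_test = 844, 0.57

expected = b0 + b1 * taking_test   # ≈ 894.92
residual = raw_score - expected    # ≈ -50.92
print(round(expected, 2), round(residual, 2))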
Another Example

Sbp   Age
120   18
130   33
134   27
148   58
110   20
137   30
The REG Procedure
Model: MODEL1
Dependent Variable: sbp

Number of Observations Read    6
Number of Observations Used    6

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1      635.54712      635.54712     10.04    0.0339
Error              4      253.28622       63.32155
Corrected Total    5      888.83333

Root MSE            7.95748   R-Square   0.7150
Dependent Mean    129.83333   Adj R-Sq   0.6438
Coeff Var           6.12900

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       105.59968           8.31054         12.71    0.0002
age          1         0.78173           0.24675          3.17    0.0339
Expected sbp = 105.60 + 0.78 x (age)
When age = 30, expected sbp = 105.60 + 0.78 x (30) = 129
Residual = observed sbp - expected sbp = 137 - 129 = 8
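As a sketch (numpy standing in for SAS PROC REG), the coefficients, prediction, and residual above can be reproduced from the six data points:

import numpy as np

age = np.array([18.0, 33.0, 27.0, 58.0, 20.0, 30.0])
sbp = np.array([120.0, 130.0, 134.0, 148.0, 110.0, 137.0])

# np.polyfit returns the slope first, then the intercept.
b1, b0 = np.polyfit(age, sbp, 1)
print(round(b0, 5), round(b1, 5))  # 105.59968 and 0.78173, matching the SAS output

# Prediction and residual for the age-30 subject; the slide rounds the
# coefficients first, which is why it reports 129 and a residual of 8.
expected = b0 + b1 * 30
print(round(expected, 2), round(137 - expected, 2))  # ≈ 129.05 and ≈ 7.95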
Multiple Linear Regression
Data (first 10 observations)

age   bmi     sbp
28    24.33   111
26    25.09   101
31    26.61   120
18    32.26   158
50    22.71   125
42    36.48   166
20    25.18   114
29    21.91   143
35    29.41   111
47    27.28   133
R² shows the amount of variance of SBP explained by Age and BMI. The two-tailed p-values test the hypothesis that each coefficient is different from 0.
Reference: Biostatistics: A Guide to Design, Analysis and Discovery, 2nd ed. (Forthofer, Lee, Hernandez)
Pearson Correlation Coefficients, N = 50
Prob > |r| under H0: Rho = 0

        bmi       age
sbp   0.45226   0.39324
      0.0010    0.0047
Predicted SBP = 76.08 + 0.33 x Age + 1.3 x BMI
When Age = 28 and BMI = 24.33:
Predicted SBP = 76.08 + 0.33 x (28) + 1.3 x (24.33)
             = 76.08 + 9.24 + 31.63
             = 116.95
Residual = Observed SBP - Predicted SBP
         = 111 - 116.95
         = -5.95
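The same arithmetic as a sketch in Python (the coefficients come from the slide's fitted model; the fit itself was done in SAS):

# Coefficients from the fitted multiple regression on the slide.
b0, b_age, b_bmi = 76.08, 0.33, 1.3

# First observation: Age 28, BMI 24.33, observed SBP 111.
age, bmi, observed = 28, 24.33, 111

predicted = b0 + b_age * age + b_bmi * bmi  # 76.08 + 9.24 + 31.63 ≈ 116.95
residual = observed - predicted             # ≈ -5.95
print(round(predicted, 2), round(residual, 2))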
Conclusion
• Simple linear regression: one covariate x.
• Multiple linear regression: several covariates X.
• In the first (SAT) example, other factors might also influence the scores:
  - the percentage of parents with a college education
  - per-student spending on education in each state
• As more covariates are added, R² always goes up (see the sketch below). This brings up another statistics topic: goodness-of-fit (GOF) testing.
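A small simulation (my illustration, not from the lecture) makes the R² point concrete: adding even a pure-noise covariate cannot lower R², though Adj R-Sq can drop:

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                # unrelated to y by construction
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

def r_squared(X, y):
    # Least-squares fit with an intercept column, then R² = 1 - SSE/SST.
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_one = r_squared(x1[:, None], y)
r2_two = r_squared(np.column_stack([x1, noise]), y)
print(r2_one <= r2_two)  # True: R² never decreases when a covariate is added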
Questions or Comments?