Lecture 6: Linear Regression and Correlation
Nigel Rozario, MS; Jie Zhou, MS; H. James Norton, PhD
10/17/2013

Introduction
Example: the points (0, 2), (1, 7), and (2, 12) all lie on a single straight line.

Linear Regression
• Linear regression is an approach to modeling the relationship between a scalar variable y and one or more predictor variables denoted X.
• Linear regression has many practical uses:
  - Prediction and forecasting
  - Quantifying the strength of the relationship between y and the Xj

Assumptions (LINE-X)
• L: Linear (in the parameters)
• I: Independent errors
• N: Normality of the errors
• E: Equal variances of the errors
• X: The regressors xi are assumed to be error-free, that is, not contaminated with measurement error.

Correlation
• Pearson's product-moment correlation coefficient assumes the x and y values follow a bivariate normal distribution.
• Caution! For ordinal data, or data not normally distributed, use Spearman's correlation instead.

Method of Least Squares
Let's use an example: a revised SAT ranking. UNC-Chapel Hill journalism professor Phil Meyer used statistical techniques (least-squares regression) to adjust for the different SAT participation rates of the 50 states and the District of Columbia. In essence, the technique adjusts the data to reflect what the SAT scores would likely be if the same percentage of students in every state took the test.
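As a minimal sketch of least squares in action, an off-the-shelf fit to the three introductory points recovers the line they lie on. (NumPy is used here for illustration only; it is not part of the lecture, which uses SAS.)

```python
import numpy as np

# The three example points from the introduction: (0, 2), (1, 7), (2, 12).
x = np.array([0.0, 1.0, 2.0])
y = np.array([2.0, 7.0, 12.0])

# Ordinary least squares minimizes the sum of squared residuals;
# since these points are exactly collinear, the fit is exact.
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)  # the fitted line is y = 2 + 5x
```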
Data spreadsheet (first 20 of the 51 rows)

State            Raw_Score   Taking_Test   Orig_Rank   Adjusted_Rank   Adjusted_Score
New Hampshire        921        0.75           28             1              993
Iowa                1093        0.05            1             2              990
North Dakota        1073        0.06            2             3              981
Kansas              1039        0.10            4             4              981
Illinois            1006        0.16           10             5              978
Minnesota           1023        0.12            7             6              976
Montana              982        0.22           19             7              974
Connecticut          897        0.81           33             8              974
North Carolina       844        0.57           48            49              898
South Carolina       832        0.58           51            50              887
Oregon               922        0.54           27             9              972
Massachusetts        896        0.79           35            10              971
Wisconsin           1023        0.11            7            11              971
Colorado             859        0.29           23            12              969
Tennessee           1015        0.12            9            13              968
Nebraska            1024        0.10            6            14              966
Maryland             904        0.64           32            15              965
Washington           913        0.49           31            16              957
New Jersey           886        0.74           39            17              957
Vermont              890        0.68           37            18              955

SAS output
Model: MODEL1
Dependent Variable: Raw_Score   (the outcome variable, Y)
Number of Observations Read: 51
Number of Observations Used: 51

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       166799         166799       129.90   <.0001
Error             49        62917     1284.02261
Corrected Total   50       229716

Root MSE           35.83326    R-Square   0.7261
Dependent Mean    941.74510    Adj R-Sq   0.7205
Coeff Var           3.80499

Notes on the output:
• Pr > F is the p-value of the overall model; it tests whether R² is different from 0.
• Root MSE (root mean squared error) is the standard deviation of the regression; the closer to zero, the better the fit.
• R² is the proportion of the variance of Y explained by X.

Parameter Estimates
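The summary statistics in the output are tied together by simple identities; a quick sketch of the arithmetic, using only the sums of squares printed in the ANOVA table above:

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_model, ss_error, ss_total = 166799, 62917, 229716
df_error, df_total = 49, 50

# R^2 is the share of the total variation explained by the model.
r_squared = ss_model / ss_total                              # ~ 0.7261

# Adjusted R^2 penalizes for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * df_total / df_error    # ~ 0.7205

# Root MSE is the standard deviation of the residuals.
root_mse = math.sqrt(ss_error / df_error)                    # ~ 35.83

print(round(r_squared, 4), round(adj_r_squared, 4), round(root_mse, 2))
```

The three printed values match the R-Square, Adj R-Sq, and Root MSE lines of the SAS output to rounding.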
Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept      1       1020.60998          8.54729        119.41    <.0001
Taking_Test    1       -220.51036         19.34725        -11.40    <.0001

Taking_Test is the predictor variable (X). The two-tailed p-values test the hypothesis that each coefficient is different from 0.

Fitted model: Expected Score = 1020.61 - 220.51 x Taking_Test

• Expected Score (North Carolina) = 1020.61 - 220.51 x 0.57 = 894.92
• Residual (or error) = Raw Score - Expected Score = 844 - 894.92 = -50.92

The percentage of students who took the test only partly explains each state's SAT score.

Another Example

Age   Sbp
18    120
33    130
27    134
58    148
20    110
30    137

The REG Procedure
Model: MODEL1
Dependent Variable: sbp
Number of Observations Read: 6
Number of Observations Used: 6

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1      635.54712      635.54712      10.04    0.0339
Error              4      253.28622       63.32155
Corrected Total    5      888.83333

Root MSE           7.95748    R-Square   0.7150
Dependent Mean   129.83333    Adj R-Sq   0.6438
Coeff Var          6.12900

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       105.59968           8.31054        12.71     0.0002
age          1         0.78173           0.24675         3.17     0.0339

Expected sbp = 105.60 + 0.78 x age
When age = 30: expected sbp = 105.60 + 0.78 x 30 = 129
Residual = observed sbp - expected sbp = 137 - 129 = 8

Multiple Linear Regression

Data (first 10 of the 50 observations):

age   bmi     sbp
28    24.33   111
26    25.09   101
31    26.61   120
18    32.26   158
50    22.71   125
42    36.48   166
20    25.18   114
29    21.91   143
35    29.41   111
47    27.28   133

Pearson Correlation Coefficients, N = 50 (Prob > |r| under H0: Rho = 0)
        sbp       p-value
bmi     0.45226   0.0010
age     0.39324   0.0047

Here R² gives the proportion of the variance of SBP explained by Age and BMI together, and the two-tailed p-values test the hypothesis that each coefficient is different from 0.

Fitted model: Predicted SBP = 76.08 + 0.33 x Age + 1.3 x BMI
When Age = 28 and BMI = 24.33:
Predicted SBP = 76.08 + 0.33 x 28 + 1.3 x 24.33 = 76.08 + 9.24 + 31.63 = 116.95

Reference: Biostatistics: A Guide to Design, Analysis and Discovery, 2nd ed. (Forthofer, Lee, Hernandez)
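The coefficients in the blood-pressure example can be checked with any least-squares routine; a sketch using the six (age, sbp) observations above (NumPy is used here for illustration only):

```python
import numpy as np

# The six (age, sbp) observations from the example above.
age = np.array([18.0, 33.0, 27.0, 58.0, 20.0, 30.0])
sbp = np.array([120.0, 130.0, 134.0, 148.0, 110.0, 137.0])

# Least-squares fit of sbp = b0 + b1 * age.
slope, intercept = np.polyfit(age, sbp, deg=1)
print(round(intercept, 5), round(slope, 5))  # 105.59968 0.78173, as in the SAS table

# Residual for the age-30 subject, using the unrounded coefficients.
predicted = intercept + slope * 30
print(round(137 - predicted, 2))  # ~ 7.95 (the slides get 8 from the rounded coefficients)
```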
Residual = Observed SBP - Predicted SBP = 111 - 116.95 = -5.95

Conclusion
• Simple linear regression: one covariate x.
• Multiple linear regression: several covariates X.
• In the first (SAT) example, other factors might also influence the scores:
  - the percentage of parents with a college education
  - each state's education spending per student
• Adding more covariates, R² never decreases. This brings up another statistics topic: goodness-of-fit (GOF) testing.

Questions or Comments?
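The last bullet, that R² cannot decrease when covariates are added, can be illustrated on the ten observations printed earlier. This is only a sketch on the printed subset (the full study has N = 50, so these R² values will not match the slide output), but the monotonicity holds for any nested ordinary-least-squares fits:

```python
import numpy as np

# First 10 observations from the slides (illustrative subset only).
age = np.array([28, 26, 31, 18, 50, 42, 20, 29, 35, 47], dtype=float)
bmi = np.array([24.33, 25.09, 26.61, 32.26, 22.71, 36.48,
                25.18, 21.91, 29.41, 27.28])
sbp = np.array([111, 101, 120, 158, 125, 166, 114, 143, 111, 133], dtype=float)

def r_squared(columns, y):
    """R^2 from an OLS fit of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_age  = r_squared([age], sbp)        # age alone
r2_both = r_squared([age, bmi], sbp)   # age and bmi

# Adding BMI can only raise R^2 or leave it unchanged, never lower it.
print(round(r2_age, 3), round(r2_both, 3))
```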