Directions: Complete the following exercises using the code discussed during computer lab. Save your work in an R script as well as a Word document containing the necessary output and comments. Be sure to use notes in the script to justify any computations. If you have any questions, do not hesitate to ask.
Simple Linear Regression
1. > pressure.lm <- lm(pressure ~ temperature, data = pressure)
> summary(pressure.lm)
Call:
lm(formula = pressure ~ temperature, data = pressure)
Residuals:
Min 1Q Median 3Q Max
-158.08 -117.06 -32.84 72.30 409.43
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -147.8989 66.5529 -2.222 0.040124 *
temperature 1.5124 0.3158 4.788 0.000171 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 150.8 on 17 degrees of freedom
Multiple R-squared: 0.5742, Adjusted R-squared: 0.5492
F-statistic: 22.93 on 1 and 17 DF, p-value: 0.000171
> anova(pressure.lm)
Analysis of Variance Table
Response: pressure
Df Sum Sq Mean Sq F value Pr(>F)
temperature 1 521530 521530 22.93 0.000171 ***
Residuals 17 386665 22745
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The temperature coefficient is positive (1.5124), so any significant relationship between temperature and pressure is a direct one. Since the p-value (0.000171) is less than 0.05, temperature is indeed significant in the model. With a single predictor, the t statistic and the F statistic are related by t^2 = F: here 4.788^2 ≈ 22.92, which matches F = 22.93 up to rounding.
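The t^2 = F relationship can be checked directly from the fitted model. This is a minimal sketch using R's built-in `pressure` data; the object names `t.slope` and `f.stat` are chosen here for illustration only.

```r
# Fit the same model as above on R's built-in `pressure` data set.
pressure.lm <- lm(pressure ~ temperature, data = pressure)

# t statistic for the slope, taken from the coefficient table.
t.slope <- summary(pressure.lm)$coefficients["temperature", "t value"]

# F statistic from the ANOVA table (the first row is the temperature term).
f.stat <- anova(pressure.lm)[["F value"]][1]

# With one predictor, the squared slope t statistic equals the overall F.
all.equal(t.slope^2, f.stat)
```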
2. Linearity: The scatterplot shows a clear violation of the linearity assumption. Pressure appears to increase exponentially with temperature. The standardized residual plot reinforces this observation.
Equal Variance: Because the relationship is not linear, it is difficult to judge whether the observations have equal variance.
Normality: The Normal Quantile plot departs from a straight line at the tails of the data set. A Shapiro-Wilk test on the standardized residuals rejects normality at the 0.05 level.
Shapiro-Wilk normality test
data: rstandard(pressure.lm)
W = 0.8832, **p-value = 0.02438**
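The checks above could be produced along the following lines. This is a sketch rather than the exact lab code: the plot layout and axis labels are assumptions, and `std.res` and `sw` are names introduced here.

```r
pressure.lm <- lm(pressure ~ temperature, data = pressure)
std.res <- rstandard(pressure.lm)  # standardized residuals

# Linearity: scatterplot of the raw data with the fitted line overlaid.
plot(pressure ~ temperature, data = pressure)
abline(pressure.lm)

# Standardized residuals against fitted values; curvature signals nonlinearity.
plot(fitted(pressure.lm), std.res,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Normality: normal quantile plot plus the Shapiro-Wilk test.
qqnorm(std.res)
qqline(std.res)
sw <- shapiro.test(std.res)
sw
```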
3. Using a Box-Cox transformation, the optimal transformation is either $(y^{\lambda} - 1)/\lambda$ with λ = 0.01 or, since λ is so close to 0, $\ln(y)$.
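One way to locate λ is `boxcox()` from the MASS package. The sketch below is an assumption about how the search might be set up: the grid of λ values and the names `bc`, `lambda.hat`, and `pressure.log.lm` are illustrative, and the estimate depends on the grid used.

```r
library(MASS)  # provides boxcox()

pressure.lm <- lm(pressure ~ temperature, data = pressure)

# Profile the log-likelihood over a grid of lambda values (grid is an assumption).
bc <- boxcox(pressure.lm, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)

# The lambda with the highest profile log-likelihood on that grid.
lambda.hat <- bc$x[which.max(bc$y)]
lambda.hat

# Refit on the log scale, the limiting Box-Cox case as lambda approaches 0.
pressure.log.lm <- lm(log(pressure) ~ temperature, data = pressure)
summary(pressure.log.lm)$r.squared
```

If the log-scale fit is adopted, the diagnostics from the previous exercise should be rerun on `pressure.log.lm` before interpreting it.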
Multiple Linear Regression
1.
                 Fertility Agriculture Examination Education Catholic Infant.Mortality
Fertility            1.000       0.353      -0.646    -0.664    0.464            0.417
Agriculture          0.353       1.000      -0.687    -0.640    0.401           -0.061
Examination         -0.646      -0.687       1.000     0.698   -0.573           -0.114
Education           -0.664      -0.640       0.698     1.000   -0.154           -0.099
Catholic             0.464       0.401      -0.573    -0.154    1.000            0.175
Infant.Mortality     0.417      -0.061      -0.114    -0.099    0.175            1.000
*Related Variables:*
Fertility, Agriculture
Fertility, Examination
Fertility, Infant Mortality
Agriculture, Examination
Agriculture, Education
Examination, Education
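The correlation matrix above can be reproduced from R's built-in `swiss` data. A minimal sketch; rounding to three decimals is assumed to match the table, and `swiss.cor` is a name chosen here.

```r
# Pairwise Pearson correlations among all six variables in `swiss`.
swiss.cor <- round(cor(swiss), 3)
swiss.cor

# A scatterplot matrix gives a visual check of the same relationships.
pairs(swiss)
```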
2.
Call:
lm(formula = Fertility ~ Agriculture + Examination + Education +
Catholic + Infant.Mortality, data = swiss)
Residuals:
Min 1Q Median 3Q Max
-15.2743 -5.2617 0.5032 4.1198 15.3213
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
Agriculture -0.17211 0.07030 -2.448 0.01873 *
Examination -0.25801 0.25388 -1.016 0.31546
Education -0.87094 0.18303 -4.758 2.43e-05 ***
Catholic 0.10412 0.03526 2.953 0.00519 **
Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
All of the predictors are significant at the 0.05 level except Examination (p = 0.315).
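Rather than reading the table by eye, the non-significant predictors can be pulled out of the coefficient matrix. This is a small sketch; the 0.05 cutoff and the name `p.values` are assumptions for illustration.

```r
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Education +
                 Catholic + Infant.Mortality, data = swiss)

# p-values for the predictors, dropping the intercept row.
p.values <- summary(swiss.lm)$coefficients[-1, "Pr(>|t|)"]

# Names of predictors that are not significant at the 0.05 level.
names(p.values)[p.values >= 0.05]
```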
3. Residual Plot: The residuals show no systematic pattern, so the plot raises no concern about the model fit.
Normal Q-Q Plot: The points follow the diagonal line quite closely, indicating that the residuals probably satisfy the normality assumption.
Scale-Location: The points are randomly scattered, which indicates that the homoscedasticity assumption is probably met.
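The three plots discussed above are among the standard diagnostic plots for an `lm` object. A sketch of how they might be drawn side by side; the panel layout is an assumption.

```r
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Education +
                 Catholic + Infant.Mortality, data = swiss)

# which = 1:3 selects Residuals vs Fitted, Normal Q-Q, and Scale-Location.
op <- par(mfrow = c(1, 3))
plot(swiss.lm, which = 1:3)
par(op)  # restore the previous plotting layout
```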
4.
> swiss.step.b <- step(swiss.lm, direction = 'backward')
Start: AIC=190.69
Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality
Df Sum of Sq RSS AIC
- Examination 1 53.03 2158.1 189.86
<none> 2105.0 190.69
- Agriculture 1 307.72 2412.8 195.10
- Infant.Mortality 1 408.75 2513.8 197.03
- Catholic 1 447.71 2552.8 197.75
- Education 1 1162.56 3267.6 209.36
Step: AIC=189.86
Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
Df Sum of Sq RSS AIC
<none> 2158.1 189.86
- Agriculture 1 264.18 2422.2 193.29
- Infant.Mortality 1 409.81 2567.9 196.03
- Catholic 1 956.57 3114.6 205.10
- Education 1 2249.97 4408.0 221.43
Call:
lm(formula = Fertility ~ Agriculture + Education + Catholic +
Infant.Mortality, data = swiss)
Residuals:
Min 1Q Median 3Q Max
-14.6765 -6.0522 0.7514 3.1664 16.1422
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
Agriculture -0.15462 0.06819 -2.267 0.02857 *
Education -0.98026 0.14814 -6.617 5.14e-08 ***
Catholic 0.12467 0.02889 4.315 9.50e-05 ***
Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.168 on 42 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
Backward selection drops the Examination variable, which lowers the AIC from 190.69 to 189.86. All of the remaining predictors are significant at the 0.05 level.
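The AIC values in the `step()` trace can be reproduced by hand. For a linear model, `step()` relies on `extractAIC()`, which computes n log(RSS/n) + 2p, where p counts the estimated coefficients including the intercept. A sketch, with `aic.by.hand` as an illustrative name:

```r
swiss.lm <- lm(Fertility ~ Agriculture + Examination + Education +
                 Catholic + Infant.Mortality, data = swiss)

n   <- nrow(swiss)             # number of observations
rss <- deviance(swiss.lm)      # residual sum of squares
p   <- length(coef(swiss.lm))  # coefficients, including the intercept

# AIC on the scale used in the step() output.
aic.by.hand <- n * log(rss / n) + 2 * p
aic.by.hand
extractAIC(swiss.lm)[2]        # same quantity, computed by R

# Rerun the backward search quietly and inspect the selected formula.
swiss.step.b <- step(swiss.lm, direction = "backward", trace = 0)
formula(swiss.step.b)
```

Note that this scale differs from `AIC()` by an additive constant, so only differences between models are comparable.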
Residual Plot: The residuals show no systematic pattern, so the plot raises no concern about the model fit.
Normal Q-Q Plot: The points no longer follow the diagonal line as closely, so the normality assumption may now be violated.
Scale-Location: The points are randomly scattered, which indicates that the homoscedasticity assumption is probably met.
5. By Mallows' Cp, the two best models are the full model with all 5 variables and the model with the 4 variables Agriculture, Education, Catholic, and Infant.Mortality. Because a simpler model is preferred when the fit is comparable, the best choice is the model with four explanatory variables. This is the same model that backward selection identified.
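Mallows' Cp can be computed by hand for every subset of predictors as Cp = RSS_p / s^2 - n + 2p, where s^2 is the residual mean square of the full model and p counts the coefficients including the intercept. The sketch below uses only base R and is not necessarily how the lab computed it; `subsets`, `sigma2`, and `cp` are names introduced here.

```r
predictors <- setdiff(names(swiss), "Fertility")
n <- nrow(swiss)

# Residual mean square of the full model, used as the variance estimate.
full <- lm(Fertility ~ ., data = swiss)
sigma2 <- deviance(full) / df.residual(full)

# Every non-empty subset of the five predictors (31 in total).
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)

# Mallows' Cp for each candidate model.
cp <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "Fertility"), data = swiss)
  deviance(fit) / sigma2 - n + 2 * length(coef(fit))
})
names(cp) <- sapply(subsets, paste, collapse = " + ")

# Models with the smallest Cp; a good model has Cp close to its p.
head(sort(cp), 3)
```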
Principal Component Analysis
1.
             Unemployed Armed.Forces Population  Year Employed
Unemployed        1.000       -0.177      0.687 0.668    0.502
Armed.Forces     -0.177        1.000      0.364 0.417    0.457
Population        0.687        0.364      1.000 0.994    0.960
Year              0.668        0.417      0.994 1.000    0.971
Employed          0.502        0.457      0.960 0.971    1.000
2.
             Comp.1  Comp.2
Unemployed   0.3633  0.5988
Armed.Forces 0.2269 -0.7911
Population   0.5261  0.0435
Year         0.5291 -0.0024
Employed     0.5097 -0.1171
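The first component loads positively on all five variables, most heavily on Population, Year, and Employed, so it can be read as a standardized measure of GNP. The second component mainly contrasts Unemployed (0.5988) with Armed.Forces (-0.7911) and is difficult to interpret.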
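The loadings can be obtained with `princomp()` on the five columns of R's built-in `longley` data. This sketch assumes the analysis was run on the correlation matrix (`cor = TRUE`), which is consistent with the loadings reported above; `vars` and `longley.pca` are illustrative names.

```r
vars <- c("Unemployed", "Armed.Forces", "Population", "Year", "Employed")

# PCA on the correlation matrix, so each variable is standardized first.
longley.pca <- princomp(longley[, vars], cor = TRUE)

# Loadings for the first two components.
round(unclass(loadings(longley.pca))[, 1:2], 4)
```

The sign of each component is arbitrary, so a whole column may come out with its signs flipped relative to the table.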
3.
Component 1: **71.23%** variance explained
Component 2: **23.67%** variance explained
Cumulative variance: **94.89%**
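These percentages are the eigenvalues of the correlation matrix divided by their sum, which equals the number of variables. A short sketch of that calculation, assuming the same five `longley` columns; `eig` and `prop.var` are names chosen here.

```r
vars <- c("Unemployed", "Armed.Forces", "Population", "Year", "Employed")

# Eigenvalues of the correlation matrix are the component variances.
eig <- eigen(cor(longley[, vars]))$values

# Share of total variance carried by each component.
prop.var <- eig / sum(eig)

round(100 * prop.var[1:2], 2)         # percent explained by components 1 and 2
round(100 * cumsum(prop.var)[2], 2)   # cumulative percent for the first two
```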
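Original content statement: This article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com for removal.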