Correlation
properties (1) the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
properties (2) the sign of the correlation coefficient indicates the direction of association
properties (3)
‣ the correlation coefficient is always between -1 (perfect negative linear association) and 1 (perfect positive linear association)
‣ R = 0 indicates no linear relationship
properties (4) the correlation coefficient is unitless, and is not affected by changes in the center or scale of either variable (such as unit conversions)
properties (5) the correlation of X with Y is the same as the correlation of Y with X
properties (6) the correlation coefficient is sensitive to outliers
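The properties above can be sketched with a from-scratch Pearson correlation; the data values below are illustrative toy numbers, not from the slides:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation: average product of z-scores (dividing by n - 1)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

# property (5): symmetry -- pearson_r(x, y) equals pearson_r(y, x)
# property (4): unitless -- rescaling x (e.g. miles -> km) leaves r unchanged
x_km = [xi * 1.60934 for xi in x]
```

Checking `pearson_r(x_km, y)` against `r`, and `pearson_r(y, x)` against `pearson_r(x, y)`, confirms properties (4) and (5) numerically.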
residuals
least squares line
estimating the regression parameters: slope
estimating the regression parameters: intercept
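The standard least squares estimates (slope from the ratio of the cross-deviation sum to the x-deviation sum, intercept forcing the line through the point of averages) can be sketched as follows; the data values are illustrative:

```python
from statistics import mean

def least_squares(x, y):
    """Least squares estimates:
    b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    b0 = ybar - b1 * xbar  (the line passes through (xbar, ybar))
    """
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Equivalently, b1 = r × (s_y / s_x), which makes explicit how the slope inherits its sign from the correlation coefficient.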
Recap
prediction & extrapolation
prediction
Extrapolation
conditions for linear regression
(1) linearity
‣ relationship between the explanatory and the response variable should be linear
‣ methods for fitting a model to non-linear relationships exist
‣ check using a scatterplot of the data, or a residuals plot
(2) nearly normal residuals
‣ residuals should be nearly normally distributed, centered at 0
‣ may not be satisfied if there are unusual observations that don’t follow the trend of the rest of the data
‣ check using a histogram or normal probability plot of residuals
(3) constant variability
‣ variability of points around the least squares line should be roughly constant
‣ implies that the variability of residuals around the 0 line should be roughly constant as well
‣ also called homoscedasticity
‣ check using a residuals plot
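The residuals used in all three checks are just the observed minus fitted values; a minimal sketch on toy data, assuming a least squares fit (b0, b1) has already been obtained:

```python
# Toy data; b0, b1 are the least squares estimates for this data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

# residual = observed y - predicted y
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Least squares residuals always sum to (essentially) zero.
# Plotting residuals vs. x checks linearity and constant variability;
# a histogram of residuals checks near-normality.
```

With only a handful of points this is purely illustrative; the conditions themselves are judged from the plots, not from any single number.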
R²
‣ strength of the fit of a linear model is most commonly evaluated using R²
‣ calculated as the square of the correlation coefficient
‣ tells us what percent of variability in the response variable is explained by the model
‣ the remainder of the variability is explained by variables not included in the model
‣ always between 0 and 1
outliers in regression
‣ outliers are points that fall away from the cloud of points
‣ outliers that fall horizontally away from the center of the cloud but don’t influence the slope of the regression line are called leverage points
‣ outliers that actually influence the slope of the regression line are called influential points
‣ usually high leverage points
‣ to determine if a point is influential, visualize the regression line with and without the point, and ask: does the slope of the line change considerably?
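The "with and without" check above can be done numerically as well as visually; a sketch on made-up data with a single high leverage point that does not follow the trend:

```python
from statistics import mean

def slope(x, y):
    """Least squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    xbar, ybar = mean(x), mean(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

x = [1, 2, 3, 4, 5]
y = [1.1, 1.9, 3.2, 3.9, 5.1]   # roughly on the line y = x

# add a high leverage point far to the right that breaks the trend
x_out, y_out = x + [15], y + [0.0]

s_without = slope(x, y)          # close to 1
s_with = slope(x_out, y_out)     # slope flips sign -> the point is influential
```

A high leverage point that did follow the trend (e.g. (15, 15) here) would barely change the slope, so leverage alone does not make a point influential.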
inference for linear regression
results
testing for the slope – hypotheses
testing for the slope – mechanics
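The usual mechanics for this test are T = (b1 − 0) / SE(b1) with df = n − 2, where SE(b1) is estimated from the residuals; a sketch on toy data (values illustrative, formulas standard):

```python
import math
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# least squares fit
xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# SE(b1) = sqrt(SSE / (n - 2)) / sqrt(sum((x - xbar)^2))
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t_stat = (b1 - 0) / se_b1   # H0: beta1 = 0
df = n - 2
```

The resulting T is compared against a t distribution with n − 2 degrees of freedom to get the p-value.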
confidence interval for the slope
point estimate ± margin of error
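Applied to the slope, "point estimate ± margin of error" becomes b1 ± t*_{df} × SE(b1); a minimal sketch, where the slope, its standard error, and the t* critical value (for df = 3 at 95% confidence, from a t-table) are example inputs:

```python
b1 = 0.6        # example slope estimate (point estimate)
se_b1 = 0.2828  # example standard error of the slope
t_star = 3.182  # t* for a 95% CI with df = n - 2 = 3

margin = t_star * se_b1
ci = (b1 - margin, b1 + margin)   # (lower bound, upper bound)
```

If the interval contains 0, the data are consistent with no linear relationship at that confidence level.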
variability partitioning
‣ So far: the t-test as a way to evaluate the strength of evidence for a hypothesis test for the slope of the relationship between x and y.
‣ Alternative: consider the variability in y explained by x, compared to the unexplained variability.
‣ Partitioning the variability in y into explained and unexplained variability requires analysis of variance (ANOVA).
sum of squares
ANOVA
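The ANOVA partition of the sums of squares (SST = SSR + SSE, total = explained + unexplained) can be verified directly; toy data and its least squares fit, values illustrative:

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6   # least squares fit for this data

ybar = mean(y)
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total variability
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # explained (regression)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained (error)

# the partition: sst == ssr + sse (up to floating point error)
```

The ANOVA F statistic is then built from these pieces as (SSR / 1) / (SSE / (n − 2)) for simple regression.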
revisiting R²
‣ R² is the proportion of variability in y explained by the model
‣ large → linear relationship between x and y exists
‣ small → evidence provided by the data may not be convincing
‣ Two ways to calculate R²:
(1) using correlation: square of the correlation coefficient
(2) from the definition: proportion of explained to total variability
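Both routes give the same number, which a small sketch on toy data can confirm (values illustrative):

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6   # least squares fit for this data

ybar = mean(y)
yhat = [b0 + b1 * xi for xi in x]
sst = sum((yi - ybar) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))

# (2) from the definition: explained / total = 1 - unexplained / total
r2_def = 1 - sse / sst

# (1) square of the correlation coefficient,
# r = Sxy / sqrt(Sxx * Syy) using sums of squared deviations
sxx = sum((xi - mean(x)) ** 2 for xi in x)
sxy = sum((xi - mean(x)) * (yi - ybar) for xi, yi in zip(x, y))
r = sxy / (sxx * sst) ** 0.5
r2_corr = r ** 2
```

For simple linear regression the two definitions agree exactly; with multiple predictors only the explained-to-total definition carries over.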