
Duke@coursera Data Analysis and Statistical Inference, Unit 6: Introduction to Linear Regression

Author: 统计学家 · Published 2019-04-10 16:49:44 · Column: 机器学习与统计学

Correlation

properties (1) the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables

properties (2) the sign of the correlation coefficient indicates the direction of association

properties (3) ‣ the correlation coefficient is always between -1 (perfect negative linear association) and 1 (perfect positive linear association) ‣ R = 0 indicates no linear relationship

properties (4) the correlation coefficient is unitless, and is not affected by changes in the center or scale of either variable (such as unit conversions)

properties (5) the correlation of X with Y is the same as of Y with X

properties (6) the correlation coefficient is sensitive to outliers
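Several of these properties can be checked numerically. A minimal sketch in Python, with made-up data, verifying symmetry (property 5) and invariance to unit conversions (property 4):

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson correlation coefficient R of two equal-length samples."""
    mx, my = mean(x), mean(y)
    n = len(x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * stdev(x) * stdev(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = correlation(x, y)

# property (5): correlation of X with Y equals correlation of Y with X
assert abs(correlation(y, x) - r) < 1e-12

# property (4): unitless -- a unit conversion of x leaves R unchanged
x_f = [9 / 5 * v + 32 for v in x]
assert abs(correlation(x_f, y) - r) < 1e-12
```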

residuals

least squares line

estimating the regression parameters: slope

estimating the regression parameters: intercept
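The course's point estimates are b1 = R × (sy / sx) for the slope and b0 = ȳ − b1 × x̄ for the intercept (the least squares line always passes through the point of averages). A minimal sketch with made-up data:

```python
from statistics import mean, stdev

def least_squares(x, y):
    """Slope and intercept of the least squares line y-hat = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    n = len(x)
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx     # slope: correlation times the ratio of standard deviations
    b0 = my - b1 * mx    # intercept: the line passes through (x-bar, y-bar)
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0])
# perfectly linear data y = 2x, so the fit recovers b0 = 0 and b1 = 2
```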

Recap

prediction & extrapolation

prediction

extrapolation
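Prediction plugs a new x into the fitted line, ŷ = b0 + b1 × x; extrapolation is doing the same outside the range of x values used to fit the model, where there is no data to support the linear trend. A sketch (the coefficients and observed range here are made up):

```python
def predict(b0, b1, x_new):
    """Point prediction from the fitted line: y-hat = b0 + b1 * x_new."""
    return b0 + b1 * x_new

# hypothetical fitted line, estimated from data observed only for x in [1, 5]
b0, b1 = 1.0, 0.5

y_interp = predict(b0, b1, 3.0)    # prediction within the observed range of x
y_extrap = predict(b0, b1, 50.0)   # extrapolation: no data support the linear
                                   # trend this far out, so treat with caution
```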

conditions for linear regression

(1) linearity

‣ relationship between the explanatory and the response variable should be linear

‣ methods for fitting a model to non-linear relationships exist

‣ check using a scatterplot of the data, or a residuals plot

(2) nearly normal residuals

‣ residuals should be nearly normally distributed, centered at 0

‣ may not be satisfied if there are unusual observations that don’t follow the trend of the rest of the data

‣ check using a histogram or normal probability plot of residuals

(3) constant variability

‣ variability of points around the least squares line should be roughly constant

‣ implies that the variability of residuals around the 0 line should be roughly constant as well

‣ also called homoscedasticity

‣ check using a residuals plot
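The numeric side of these checks can be sketched as follows (plots omitted; the data are made up). Least squares residuals always average to zero, so the diagnostics above are really about their shape and spread:

```python
from statistics import mean

def fit_and_residuals(x, y):
    """Least squares fit of y-hat = b0 + b1 * x; returns residuals e_i = y_i - y-hat_i."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

res = fit_and_residuals([1, 2, 3, 4, 5], [2.2, 3.9, 6.1, 8.0, 9.9])

# residuals from a least squares fit average to zero by construction; a histogram
# would check near-normality, and a residuals plot would check constant spread
assert abs(mean(res)) < 1e-9
```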

R2

‣ strength of the fit of a linear model is most commonly evaluated using R2

‣ calculated as the square of the correlation coefficient

‣ tells us what percent of variability in the response variable is explained by the model

‣ the remainder of the variability is explained by variables not included in the model

‣ always between 0 and 1

outliers in regression

‣ outliers are points that fall away from the cloud of points

‣ outliers that fall horizontally away from the center of the cloud but don’t influence the slope of the regression line are called leverage points

‣ outliers that actually influence the slope of the regression line are called influential points

‣ usually high leverage points

‣ to determine if a point is influential, visualize the regression line with and without the point, and ask: Does the slope of the line change considerably?
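That with/without comparison can be sketched numerically, using made-up data with one high-leverage point that falls off the trend:

```python
from statistics import mean

def slope(x, y):
    """Least squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

x = [1, 2, 3, 4, 5, 12]
y = [1.1, 2.0, 2.9, 4.1, 5.0, 0.0]   # last point: high leverage, far off the trend

with_point = slope(x, y)
without_point = slope(x[:-1], y[:-1])
# the slope changes considerably (it even flips sign), so the point is influential
```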

inference for linear regression

results

testing for the slope – hypotheses

testing for the slope – mechanics

confidence interval for the slope

point estimate ± margin of error
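Mechanics for the slope: under H0: β1 = 0 the test statistic is T = (b1 − 0) / SE(b1) with df = n − 2, and the confidence interval is b1 ± t* × SE(b1), i.e. point estimate ± margin of error. A sketch with made-up data (t* = 2.776 is the 95% critical value for df = 4):

```python
from statistics import mean

# made-up data for illustration
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.1, 2.8, 4.3, 4.9, 6.2]

n = len(x)
mx, my = mean(x), mean(y)
sxx = sum((a - mx) ** 2 for a in x)
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
b0 = my - b1 * mx

# standard error of the slope, from the residual standard error with df = n - 2
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se_b1 = (sse / (n - 2) / sxx) ** 0.5

t_stat = (b1 - 0) / se_b1        # H0: beta1 = 0, HA: beta1 != 0, df = n - 2
t_star = 2.776                   # 95% critical value of t with df = n - 2 = 4
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)   # point estimate +/- margin of error
```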

variability partitioning

‣ So far: t-test as a way to evaluate the strength of evidence for a hypothesis test for the slope of the relationship between x and y.

‣ Alternative: consider the variability in y explained by x, compared to the unexplained variability.

‣ Partitioning the variability in y into explained and unexplained variability requires analysis of variance (ANOVA).

sum of squares

ANOVA

revisiting R2

‣ R2 is the proportion of variability in y explained by the model: ‣ large → linear relationship between x and y exists ‣ small → evidence provided by the data may not be convincing

‣ Two ways to calculate R2:

(1) using correlation: square of the correlation coefficient

(2) from the definition: proportion of explained to total variability
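Both calculations can be checked against the ANOVA partition SST = SSR + SSE (total = explained + unexplained), again with made-up data:

```python
from statistics import mean

# made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

mx, my = mean(x), mean(y)
sxx = sum((a - mx) ** 2 for a in x)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
b1 = sxy / sxx
b0 = my - b1 * mx

sst = sum((yi - my) ** 2 for yi in y)                          # total variability in y
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained (residual)
ssr = sst - sse                                                # explained by the model

r = sxy / (sxx * sst) ** 0.5
# (1) square of the correlation coefficient == (2) explained / total variability
assert abs(r ** 2 - ssr / sst) < 1e-9
```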

Originally published 2015-05-11 on the WeChat public account 机器学习与统计学.