
Duke@Coursera Data Analysis and Statistical Inference, Unit 1 Part 2: Introduction to Data

Author: 统计学家 (Statistician)
Published: 2019-04-10 17:07:49
Column: 机器学习与统计学 (Machine Learning and Statistics)

measures of center

shape

measures of spread

‣ range: (max - min)

‣ variance

‣ standard deviation

‣ interquartile range

standard deviation

the square root of the variance: roughly the average deviation around the mean, in the same units as the data

interquartile range

range of the middle 50% of the data: the distance between the first quartile (25th percentile) and the third quartile (75th percentile), IQR = Q3 - Q1
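
The four measures of spread listed above can be sketched with Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration; note also that quartile conventions differ between software packages, so the Q1 and Q3 returned here (the module's default "exclusive" method) may not match other tools exactly.

```python
import statistics

# Hypothetical sample, chosen only for illustration (not from the course)
data = [2, 4, 4, 5, 7, 9, 11, 14]

data_range = max(data) - min(data)      # range: max - min
variance = statistics.variance(data)    # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)        # square root of the sample variance

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                           # IQR = Q3 - Q1

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("IQR:", iqr)
```

The standard deviation is in the same units as the data, while the variance is in squared units, which is one reason the standard deviation is usually the one reported.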

robust statistics:

‣ define robust statistics

‣ robust measures of center & spread

robust statistics

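we define robust statistics as measures on which extreme observations have little effect; for example, the median and IQR are robust, while the mean, standard deviation, and range are not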
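
As a rough illustration of this definition, the sketch below appends one extreme observation to a small sample and compares how much each summary moves. The values and the helper name `summarize` are hypothetical, introduced only for this example.

```python
import statistics

# Hypothetical values, for illustration only
values = [10, 12, 13, 15, 16, 18, 20]
with_outlier = values + [500]   # append one extreme observation

def summarize(xs):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "sd": statistics.stdev(xs),
        "iqr": q3 - q1,
    }

before = summarize(values)
after = summarize(with_outlier)
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

The median and IQR shift only slightly, while the mean and standard deviation change substantially, which is what "robust" is meant to capture.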

transforming data

‣ define transformations

‣ review when it might be useful/necessary to transform data

Transformations

‣ a transformation is a rescaling of the data using a function

‣ when data are very strongly skewed, we sometimes transform them so they are easier to model

(natural) log transformation

often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive

log transformation

to make the relationship between the variables more linear, and hence easier to model with simple methods
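
A minimal sketch of this straightening effect, assuming a hypothetical response `y` that grows exponentially with `x`. The data and the helper `pearson` are illustrative and not part of the course material; the correlation is computed from its definition so the example needs only the standard library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y grows exponentially with x (all y values are positive)
x = list(range(1, 11))
y = [3.0 * math.exp(0.5 * xi) for xi in x]

log_y = [math.log(yi) for yi in y]   # natural log of the response

r_raw = pearson(x, y)       # curved relationship, so r is below 1
r_log = pearson(x, log_y)   # log(y) = log(3) + 0.5 * x is exactly linear

print(f"r(x, y) = {r_raw:.3f}, r(x, log y) = {r_log:.3f}")
```

Real data would include noise, so the correlation after the transformation would be close to 1 rather than exactly 1.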

goals of transformations

‣ to see the data structure differently

‣ to reduce skew to assist in modeling

‣ to straighten a nonlinear relationship in a scatterplot
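
The skew-reduction goal can be sketched as follows. The values are hypothetical (strictly positive and right-skewed), and `skewness` is an illustrative helper that uses the simple moment-based definition rather than any particular package's estimator.

```python
import math

def skewness(xs):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical strictly positive, right-skewed values (illustration only)
raw = [1, 2, 2, 3, 4, 6, 9, 15, 40, 120, 900]
logged = [math.log(x) for x in raw]   # natural log; requires x > 0

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The log pulls in the few very large values far more than the small ones, so the transformed data are noticeably less skewed.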

exploring categorical variables

‣ describe the distribution of a single categorical variable

‣ evaluate the relationship between two categorical variables

‣ evaluate the relationship between a categorical and a numerical variable

How are bar plots different from histograms?

‣ bar plots for categorical variables, histograms for numerical variables

‣ the x-axis on a histogram is a number line, and the order of the bars is not interchangeable

contingency table

relative frequencies
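
A contingency table and its row-wise relative frequencies might be computed as in the sketch below. The records, group labels, and outcome labels are all hypothetical, used only to show the mechanics.

```python
from collections import Counter

# Hypothetical records of (group, outcome), for illustration only
records = [
    ("A", "yes"), ("A", "yes"), ("A", "no"),
    ("B", "yes"), ("B", "no"), ("B", "no"), ("B", "no"),
]

counts = Counter(records)                            # cell counts of the table
row_totals = Counter(group for group, _ in records)  # one total per row

groups = sorted(row_totals)
outcomes = sorted({outcome for _, outcome in records})

# Relative frequencies within each row (each row sums to 1)
rel_freq = {
    g: {o: counts[(g, o)] / row_totals[g] for o in outcomes}
    for g in groups
}

for g in groups:
    print(g, [counts[(g, o)] for o in outcomes], rel_freq[g])
```

Comparing the rows of relative frequencies, rather than the raw counts, is what makes groups of different sizes comparable.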

segmented bar plot

relative frequency segmented bar plot

mosaic plot

side-by-side box plots

introduction to inference

‣ case study: gender discrimination

‣ introduction to inference via simulation

recap: hypothesis testing framework

‣ start with a null hypothesis (H0) that represents the status quo

‣ set an alternative hypothesis (HA) that represents the research question, i.e. what we’re testing for

‣ conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or theoretical methods

‣ if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis

‣ if they do, then reject the null hypothesis in favor of the alternative

making a decision

‣ results from the simulations look like the data → the difference in the proportions of promoted files between males and females was due to chance (promotion and gender are independent)

‣ results from the simulations do not look like the data → the difference in the proportions of promoted files between males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent)

Summary

‣ set a null and an alternative hypothesis

‣ simulate the experiment assuming that the null hypothesis is true

‣ evaluate the probability of observing an outcome at least as extreme as the one observed in the original data

‣ and if this probability is low, reject the null hypothesis in favor ofthe alternative
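
The steps above can be sketched as a randomization (shuffling) simulation. The notes do not give the study's counts, so the numbers below (24 files per gender, 21 male and 14 female files promoted) are assumed purely for illustration, as are the seed and the number of simulations. The test here is one-sided.

```python
import random

# Assumed counts for illustration; the notes above do not give the data.
n_male, n_female = 24, 24
promoted_male, promoted_female = 21, 14

observed_diff = promoted_male / n_male - promoted_female / n_female

# Under H0 (promotion is independent of gender) the gender labels are
# exchangeable: pool all decisions and deal them out at random.
n_promoted = promoted_male + promoted_female
n_total = n_male + n_female
decisions = [1] * n_promoted + [0] * (n_total - n_promoted)

rng = random.Random(42)   # fixed seed so the sketch is reproducible
n_sims = 10_000
at_least_as_extreme = 0
for _ in range(n_sims):
    rng.shuffle(decisions)
    sim_male = sum(decisions[:n_male])     # promotions dealt to "male" files
    sim_female = sum(decisions[n_male:])   # promotions dealt to "female" files
    sim_diff = sim_male / n_male - sim_female / n_female
    # small tolerance guards against floating-point ties
    if sim_diff >= observed_diff - 1e-12:
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_sims
print(f"observed difference: {observed_diff:.3f}")
print(f"simulated p-value: {p_value:.4f}")
```

How low the probability must be before rejecting the null hypothesis is not stated in these notes; 0.05 is a common convention, but that threshold is an assumption here.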

Originally published: 2015-05-04, shared from the WeChat official account 机器学习与统计学 (Machine Learning and Statistics).
