
Duke@Coursera Data Analysis and Statistical Inference, unit 1 part 2: introduction to data

measures of center

shape

measures of spread

‣ range: (max - min)

‣ variance

‣ standard deviation

‣ inter-quartile range

standard deviation

roughly the average deviation around the mean; it has the same units as the data

interquartile range

range of the middle 50% of the data: the distance between the first quartile (25th percentile) and the third quartile (75th percentile), IQR = Q3 - Q1
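A minimal sketch of the four measures of spread listed above, using Python's standard `statistics` module. The sample values are hypothetical, and the quartiles depend on the interpolation rule chosen (here the module's "inclusive" method), so other software may report a slightly different IQR.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample, not from the course

data_range = max(data) - min(data)        # range: max - min
variance = statistics.variance(data)      # sample variance (n - 1 denominator)
std_dev = statistics.stdev(data)          # square root of the sample variance

# Quartiles via linear interpolation; "inclusive" is one of several conventions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                             # IQR = Q3 - Q1

print(data_range, variance, round(std_dev, 3), iqr)
```

Note that the standard deviation and IQR are in the same units as the data, while the variance is in squared units.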

robust statistics

‣ define robust statistics

‣ robust measures of center & spread

robust statistics

we define robust statistics as measures on which extreme observations have little effect
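To illustrate the definition, the sketch below replaces one value in a hypothetical sample with an extreme observation and compares the summaries before and after. The data and the helper name `summarize` are assumptions made for this example.

```python
import statistics

base = [10, 12, 13, 15, 16, 18, 20]   # hypothetical sample
with_outlier = base[:-1] + [200]      # largest value replaced by an extreme one

def summarize(values):
    """Return center and spread measures for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

before = summarize(base)
after = summarize(with_outlier)

for name in ("mean", "median", "sd", "iqr"):
    print(name, round(before[name], 2), "->", round(after[name], 2))
```

In this sample the median and IQR are unchanged by the extreme value, while the mean and standard deviation shift substantially, which is why the median and IQR are the robust choices for center and spread.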

transforming data

‣ define transformations

‣ review when it might be useful/necessary to transform data

Transformations

‣ a transformation is a rescaling of the data using a function

‣ when data are very strongly skewed, we sometimes transform them so they are easier to model

(natural) log transformation

often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive
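A small sketch of this situation: a hypothetical, all-positive sample where most values sit near zero and a few are very large. Skew is measured here with a simple moment-based coefficient, which is an assumption of this example and is not defined in these notes.

```python
import math
import statistics

values = [1, 1, 2, 2, 3, 4, 6, 10, 40, 150]   # hypothetical, all positive
logged = [math.log(v) for v in values]         # natural log transformation

def skewness(xs):
    """Moment-based skewness: mean of cubed standardized deviations."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

skew_raw = skewness(values)
skew_log = skewness(logged)

print(round(skew_raw, 2), round(skew_log, 2))
```

The logged values are still right-skewed in this sample, but much less so, because the log pulls in the large values far more than the small ones.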

log transformation

to make the relationship between the variables more linear, and hence easier to model with simple methods

goals of transformations

‣ to see the data structure differently

‣ to reduce skew to assist in modeling

‣ to straighten a nonlinear relationship in a scatterplot
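The last goal can be sketched with made-up data that follow an exponential curve: the raw relationship is curved, but the relationship between x and log(y) is a straight line. The constants 3.0 and 0.5 and the helper `pearson_r` are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = list(range(10))
y = [3.0 * math.exp(0.5 * xi) for xi in x]   # hypothetical exponential relationship
log_y = [math.log(yi) for yi in y]           # log(y) = log(3) + 0.5 * x, a straight line

r_raw = pearson_r(x, y)       # clearly below 1: the scatterplot is curved
r_log = pearson_r(x, log_y)   # essentially 1: the transformed points are linear

print(round(r_raw, 3), round(r_log, 3))
```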

exploring categorical variables

‣ describe the distribution of a single categorical variable

‣ evaluate the relationship between two categorical variables

‣ evaluate the relationship between a categorical and a numerical variable

How are bar plots different than histograms?

‣ bar plots are for categorical variables, histograms are for numerical variables

‣ the x-axis of a histogram is a number line, so the order of the bars is not interchangeable

contingency table

relative frequencies
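A minimal sketch of building a contingency table for two categorical variables and converting it to row-wise relative frequencies. The group labels, outcomes, and counts are hypothetical.

```python
from collections import Counter

# Hypothetical (group, outcome) observations.
records = (
    [("A", "yes")] * 3 + [("A", "no")] * 1 +
    [("B", "yes")] * 2 + [("B", "no")] * 4
)

counts = Counter(records)
groups = sorted({g for g, _ in records})
outcomes = sorted({o for _, o in records})

# Contingency table: rows are groups, columns are outcomes.
table = {g: {o: counts[(g, o)] for o in outcomes} for g in groups}

# Row-wise relative frequencies: each row sums to 1.
row_rel = {
    g: {o: table[g][o] / sum(table[g].values()) for o in outcomes}
    for g in groups
}

for g in groups:
    print(g, table[g], row_rel[g])
```

Comparing rows of relative frequencies, rather than raw counts, is what makes groups of different sizes comparable; a relative frequency segmented bar plot displays exactly these row proportions.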

segmented bar plot

relative frequency segmented bar plot

mosaic plot

side-by-side box plots

introduction to inference

‣ case study: gender discrimination

‣ introduction to inference via simulation

recap: hypothesis testing framework

‣ start with a null hypothesis (H0) that represents the status quo

‣ set an alternative hypothesis (HA) that represents the research question, i.e. what we’re testing for

‣ conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or theoretical methods

‣ if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis

‣ if they do, then reject the null hypothesis in favor of the alternative

making a decision

‣ results from the simulations look like the data → the difference between the proportions of promoted files between males and females was due to chance (promotion and gender are independent)

‣ results from the simulations do not look like the data → the difference between the proportions of promoted files between males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent)

Summary

‣ set a null and an alternative hypothesis

‣ simulate the experiment assuming that the null hypothesis is true

‣ evaluate the probability of observing an outcome at least as extreme as the one observed in the original data

‣ and if this probability is low, reject the null hypothesis in favor ofthe alternative
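The steps in this summary can be sketched as a randomization test. These notes do not give the counts from the case study, so the numbers below (24 male and 24 female files, 21 and 14 promoted) are assumed for illustration, as are the number of simulations and the random seed.

```python
import random

# Assumed illustrative counts; the notes themselves do not list them.
n_male, n_female = 24, 24
promoted_male, promoted_female = 21, 14

observed_diff = promoted_male / n_male - promoted_female / n_female

# Under H0 (promotion independent of gender) the promotion outcomes are fixed
# and only their assignment to "male" or "female" files is random.
total_promoted = promoted_male + promoted_female
outcomes = [1] * total_promoted + [0] * (n_male + n_female - total_promoted)

def simulate_diff(rng):
    """Shuffle outcomes, deal them to the two groups, return the difference."""
    shuffled = outcomes[:]
    rng.shuffle(shuffled)
    male, female = shuffled[:n_male], shuffled[n_male:]
    return sum(male) / n_male - sum(female) / n_female

rng = random.Random(0)   # fixed seed so the run is repeatable
n_sims = 10_000
sim_diffs = [simulate_diff(rng) for _ in range(n_sims)]

# Share of simulations at least as extreme as the observed difference
# (small tolerance guards against floating-point rounding).
p_value = sum(d >= observed_diff - 1e-12 for d in sim_diffs) / n_sims

print(round(observed_diff, 3), p_value)
```

With these assumed counts, only a small share of the shuffles produce a difference as large as the observed one, so the simulations do not look like the data and the null hypothesis would be rejected in favor of the alternative.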
