measures of center
shape
measures of spread
‣ range: (max - min)
‣ variance
‣ standard deviation
‣ inter-quartile range
standard deviation
roughly the average deviation around the mean; it has the same units as the data
interquartile range
range of the middle 50% of the data: the distance between the first quartile (25th percentile) and the third quartile (75th percentile), IQR = Q3 - Q1
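The four measures of spread listed above can be computed with Python's standard library. This is a minimal sketch on a made-up sample (the numbers are hypothetical, not from the lecture). It assumes the sample (n - 1) versions of variance and standard deviation, and the default "exclusive" quartile method of `statistics.quantiles`; other software may use a slightly different quartile convention and report a slightly different IQR.

```python
import statistics

# Hypothetical sample, made up for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

# Range: distance between the largest and smallest observation.
data_range = max(data) - min(data)

# Sample variance and standard deviation (both divide by n - 1).
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Quartiles: n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"range    = {data_range}")
print(f"variance = {variance:.2f}")
print(f"std dev  = {std_dev:.2f}")
print(f"IQR      = {iqr:.2f}")
```

The standard deviation is the square root of the variance, which is why it is reported in the same units as the data.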
robust statistics:
‣ define robust statistics
‣ robust measures of center & spread
robust statistics
we define robust statistics as measures on which extreme observations have little effect
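One way to see this definition in action is to replace a single observation with an extreme value and compare how each statistic reacts. The sketch below uses a made-up sample and a hypothetical helper `iqr`; both are assumptions for illustration, not part of the lecture.

```python
import statistics


def iqr(values):
    """Interquartile range using the default quartile method (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical sample, and the same sample with its largest value
# replaced by an extreme observation.
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean[:-1] + [180]

summaries = [
    ("mean", statistics.mean),      # not robust
    ("stdev", statistics.stdev),    # not robust
    ("median", statistics.median),  # robust measure of center
    ("IQR", iqr),                   # robust measure of spread
]

for name, stat in summaries:
    print(f"{name:>6}: {stat(clean):7.2f} -> {stat(with_outlier):7.2f}")
```

In this sample the median and IQR do not move at all, while the mean and standard deviation change substantially.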
transforming data
‣ define transformations
‣ review when it might be useful/necessary to transform data
Transformations
‣ a transformation is a rescaling of the data using a function
‣ when data are very strongly skewed, we sometimes transform them so they are easier to model
(natural) log transformation
often applied when much of the data cluster near zero (relative to the larger values in the data set) and all observations are positive
log transformation
also used to make the relationship between two variables more linear, and hence easier to model with simple methods
goals of transformations
‣ to see the data structure differently
‣ to reduce skew to assist in modeling
‣ to straighten a nonlinear relationship in a scatterplot
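As a rough illustration of the skew-reduction goal, the sketch below applies a natural log to a made-up, strongly right-skewed sample of positive values and compares a simple moment-based skewness before and after. The data and the helper `skewness` are assumptions for illustration; the lecture does not prescribe a particular skewness measure.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: all positive, most values near zero relative
# to a few very large ones.
raw = [1, 2, 2, 3, 4, 5, 8, 15, 40, 120]

# The natural log is only defined for positive values.
logged = [math.log(v) for v in raw]

print(f"skewness before log: {skewness(raw):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution, which is what makes the transformed data easier to model.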
exploring categorical variables
‣ describe the distribution of a single categorical variable
‣ evaluate the relationship between two categorical variables
‣ evaluate the relationship between a categorical and a numerical variable
How are bar plots different from histograms?
‣ bar plots are for categorical variables, histograms are for numerical variables
‣ the x-axis on a histogram is a number line, so the order of its bars cannot be changed; the bars of a bar plot can be reordered
contingency table
relative frequencies
segmented bar plot
relative frequency segmented bar plot
mosaic plot
side-by-side box plots
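The contingency table and relative frequencies named above can be sketched in a few lines. The records below are made up for illustration (the variable names `group` and `response` and all counts are hypothetical); the sketch tabulates two categorical variables and then converts each row to relative frequencies, which is what a relative frequency segmented bar plot displays.

```python
from collections import Counter

# Hypothetical records: (group, response) pairs, made up for illustration.
records = (
    [("treatment", "improved")] * 30
    + [("treatment", "not improved")] * 10
    + [("control", "improved")] * 18
    + [("control", "not improved")] * 22
)

# Contingency table: counts for every combination of the two variables.
counts = Counter(records)
groups = sorted({group for group, _ in records})
responses = sorted({response for _, response in records})

# Row relative frequencies: each count divided by its row total.
row_totals = {g: sum(counts[(g, r)] for r in responses) for g in groups}
row_props = {
    g: {r: counts[(g, r)] / row_totals[g] for r in responses} for g in groups
}

for g in groups:
    cells = ", ".join(
        f"{r}: {counts[(g, r)]} ({row_props[g][r]:.0%})" for r in responses
    )
    print(f"{g:>9} | {cells} | total {row_totals[g]}")
```

If the row proportions differ noticeably between groups, that suggests the two categorical variables may be related.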
introduction to inference
‣ case study: gender discrimination
‣ introduction to inference via simulation
recap: hypothesis testing framework
‣ start with a null hypothesis (H0) that represents the status quo
‣ set an alternative hypothesis (HA) that represents the research question, i.e. what we’re testing for
‣ conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation or theoretical methods
‣ if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, stick with the null hypothesis
‣ if they do, then reject the null hypothesis in favor of the alternative
making a decision
‣ results from the simulations look like the data → the difference in the proportions of promoted files between males and females was due to chance (promotion and gender are independent)
‣ results from the simulations do not look like the data → the difference in the proportions of promoted files between males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent)
Summary
‣ set a null and an alternative hypothesis
‣ simulate the experiment assuming that the null hypothesis is true
‣ evaluate the probability of observing an outcome at least as extreme as the one observed in the original data
‣ if this probability is low, reject the null hypothesis in favor of the alternative
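The steps in this summary can be sketched as a small shuffling simulation. The counts below (24 files per group, 21 and 14 promoted) are illustrative assumptions, since this section does not state the study's numbers, and the number of simulations and the random seed are arbitrary choices. Under the null hypothesis promotion is independent of gender, so the "promoted" labels are shuffled and dealt out at random.

```python
import random

random.seed(1)  # arbitrary seed so the sketch is reproducible

# Assumed illustrative counts (not given in this section).
n_male, n_female = 24, 24
promoted_male, promoted_female = 21, 14

# Observed difference in promotion proportions (male - female).
observed_diff = promoted_male / n_male - promoted_female / n_female

# One entry per file: 1 = promoted, 0 = not promoted.
total_promoted = promoted_male + promoted_female
files = [1] * total_promoted + [0] * (n_male + n_female - total_promoted)

n_sims = 5000
at_least_as_extreme = 0
for _ in range(n_sims):
    # Simulate the null hypothesis: assign files to groups at random.
    random.shuffle(files)
    sim_male = sum(files[:n_male])
    sim_female = total_promoted - sim_male
    sim_diff = sim_male / n_male - sim_female / n_female
    # Small tolerance guards against floating-point ties.
    if sim_diff >= observed_diff - 1e-12:
        at_least_as_extreme += 1

# Share of simulations at least as extreme as the observed data.
p_value = at_least_as_extreme / n_sims

print(f"observed difference: {observed_diff:.3f}")
print(f"simulated p-value:   {p_value:.4f}")
```

A small simulated probability means the shuffled results rarely look like the observed data, which is the situation in which the null hypothesis is rejected in favor of the alternative.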