data basics
‣ observations, variables, and data matrices
‣ types of variables
‣ relationships between variables
all variables
1、numerical (quantitative): take on numerical values; it is sensible to add, subtract, take averages, etc. with these values
Ⅰ continuous: take on any of an infinite number of values within a given range
Ⅱ discrete: take on one of a specific set of numeric values
2、categorical (qualitative): take on a limited number of distinct categories; categories can be identified with numbers, but it is not sensible to do arithmetic operations with them
Ⅰ regular categorical
Ⅱ ordinal: levels have an inherent ordering
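A minimal sketch of these variable types, using made-up survey values (every name and number below is hypothetical). It illustrates that averaging suits numerical variables, while a categorical variable stored as numbers is better summarized by counts or the mode, and an ordinal variable can be sorted by its level order.

```python
from statistics import mean, mode

# Hypothetical survey data; all names and values are made up for illustration.
heights_cm = [162.5, 170.1, 158.3, 181.0]        # numerical, continuous
num_siblings = [0, 2, 1, 3]                      # numerical, discrete
zip_codes = [27708, 10001, 94103, 27708]         # categorical, stored as numbers
satisfaction = ["low", "high", "medium", "low"]  # categorical, ordinal

# Averages are meaningful for numerical variables.
avg_height = mean(heights_cm)
avg_siblings = mean(num_siblings)

# Averaging zip codes would be meaningless; take the most common category instead.
most_common_zip = mode(zip_codes)

# An ordinal variable has an explicit level ordering that sorting can use.
LEVELS = ["low", "medium", "high"]
sorted_satisfaction = sorted(satisfaction, key=LEVELS.index)
```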
relationships between variables
‣ Two variables that show some connection with one another are called associated (dependent)
‣ Association can be further described as positive or negative
‣ If two variables are not associated, they are said to be independent
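One way to put a number on the direction of an association is the Pearson correlation coefficient, sketched below on hypothetical data. This is an illustration, not part of the notes: the coefficient only captures linear association, so a value near zero does not by itself show independence.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations for five students.
hours_studied = [1, 2, 3, 4, 5]
hours_of_tv = [5, 4, 3, 2, 1]
exam_score = [52, 60, 63, 71, 80]

r_pos = pearson_r(hours_studied, exam_score)  # positive association
r_neg = pearson_r(hours_of_tv, exam_score)    # negative association
```

A positive sign means the two variables tend to increase together; a negative sign means one tends to decrease as the other increases.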
observational studies & experiments
‣ define observational studies and experiments
‣ correlation vs. causation
Studies
1、observational
‣ collect data in a way that does not directly interfere with how the data arise (“observe”)
‣ can only establish an association
‣ retrospective: uses past data
‣ prospective: data are collected throughout the study
2、experiment
‣ randomly assign subjects to treatments
‣ can establish causal connections
confounding variables
extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between them.
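The effect of a confounder can be illustrated with a small simulation. In this sketch (all variable names, coefficients, and the seed are assumptions made for illustration), temperature drives both ice cream sales and sunburns, which never influence each other directly. The two still look strongly associated until the confounder's contribution is removed.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical confounder and two variables that both depend on it.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [2 * t + rng.gauss(0, 1) for t in temperature]
sunburns = [2 * t + rng.gauss(0, 1) for t in temperature]

# Raw association: looks like a strong relationship.
r_raw = pearson_r(ice_cream, sunburns)

# Subtract the confounder's contribution (known here by construction).
ice_resid = [x - 2 * t for x, t in zip(ice_cream, temperature)]
sun_resid = [y - 2 * t for y, t in zip(sunburns, temperature)]
r_adjusted = pearson_r(ice_resid, sun_resid)
```

In real data the confounder's effect is not known in advance, which is why randomized experiments are used to establish causation.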
sampling & sources of bias
‣ census vs. sample
‣ sources of bias
‣ sampling methods
Census
‣ Some individuals are hard to locate or measure, and these people may be different from the rest of the population.
‣ Populations rarely stand still
a few sources of sampling bias
‣ Convenience sample: Individuals who are easily accessible are more likely to be included in the sample
‣ Non-response: Occurs when only a (non-random) fraction of the randomly sampled people respond to a survey, so that the sample is no longer representative of the population
‣ Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue
sampling methods
simple random sample (SRS): each case is equally likely to be selected
stratified sample: divide the population into homogeneous strata, then randomly sample from within each stratum
cluster sample: divide the population into clusters, randomly sample a few clusters, then randomly sample from within these clusters
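The three sampling methods above can be sketched with Python's `random` module. The population, the `dorm` grouping variable, the helper names, and the seed are all hypothetical. The same grouping is used as both strata and clusters only to keep the example short.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 12 students spread over dorms A, B, C.
population = [{"id": i, "dorm": "ABC"[i % 3]} for i in range(12)]

def group_by(units, key):
    """Split units into groups according to the value stored under `key`."""
    groups = {}
    for unit in units:
        groups.setdefault(unit[key], []).append(unit)
    return groups

def simple_random_sample(units, k):
    """Every case is equally likely to be selected."""
    return rng.sample(units, k)

def stratified_sample(units, key, k_per_stratum):
    """Sample randomly from within every stratum."""
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters, k_per_cluster):
    """Pick a few clusters at random, then sample within the chosen ones."""
    groups = group_by(units, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(groups[name], k_per_cluster))
    return sample

srs = simple_random_sample(population, 4)
strat = stratified_sample(population, "dorm", 2)
clust = cluster_sample(population, "dorm", 2, 2)
```

The stratified sample always contains every dorm, while the cluster sample contains only the dorms that happened to be drawn.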
experimental design
‣ principles of experimental design
‣ experimental design terminology
principles of experimental design:
(1) control: compare treatment of interest to a control group
(2) randomize: randomly assign subjects to treatments
(3) replicate: collect a sufficiently large sample, or replicate the entire study
(4) block: block for variables known or suspected to affect the outcome
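As a rough sketch of how control, randomization, and blocking fit together, the function below shuffles subjects within each block and then alternates treatment and control, so both groups are balanced on the blocking variable. The subjects, the `athlete` blocking variable, the function name, and the seed are hypothetical.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical subjects; "athlete" is the blocking variable.
subjects = [{"id": i, "athlete": i < 6} for i in range(12)]

def blocked_assignment(units, block_key, treatments=("treatment", "control")):
    """Randomly assign units to treatments separately within each block."""
    blocks = {}
    for unit in units:
        blocks.setdefault(unit[block_key], []).append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]   # copy so the input list is left untouched
        rng.shuffle(shuffled)   # randomize the order within the block
        for position, unit in enumerate(shuffled):
            # Alternate through the treatments to keep group sizes equal.
            assignment[unit["id"]] = treatments[position % len(treatments)]
    return assignment

assignment = blocked_assignment(subjects, "athlete")
```

With six athletes and six non-athletes, each block contributes three subjects to the treatment group and three to the control group.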
experimental terminology
placebo: fake treatment, often used as the control group for medical studies
placebo effect: showing change despite being on the placebo
blinding: experimental units don’t know which group they’re in
double-blind: both the experimental units and the researchers don’t know the group assignment
visualizing numerical data
‣ scatterplots for paired data
‣ other visualizations for describing distributions of numerical variables
evaluating the relationship
direction (positive, negative), shape (linear, curved),
strength (strong, weak), outliers
histogram
‣ provides a view of the data density
‣ especially useful for describing the shape of the distribution
Skewness
distributions are skewed to the side of the long tail
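One common numerical check for skewness, sketched here on hypothetical data, is the moment-based skewness coefficient: positive for a long right tail, negative for a long left tail, and zero for a symmetric sample. This particular measure is an assumption of the sketch, not something the notes prescribe.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical samples.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]  # long tail to the right
left_skewed = [-v for v in right_skewed]        # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

skew_right = sample_skewness(right_skewed)
skew_left = sample_skewness(left_skewed)
skew_sym = sample_skewness(symmetric)

# In the right-skewed sample the long tail pulls the mean above the median.
mean_above_median = mean(right_skewed) > median(right_skewed)
```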
Modality
histogram & bin width
The chosen bin width can alter the story the histogram is telling.
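A small sketch of this effect, using hypothetical data with two clumps and a helper function named here only for illustration: a narrow bin width shows the gap between the clumps, while a very wide one merges everything into a single bar.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin [left, left + bin_width), keyed by left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data with two clumps (around 2 and around 10).
data = [1, 2, 2, 3, 8, 9, 9, 10, 11, 12]

narrow = histogram_counts(data, 2)   # {0: 1, 2: 3, 8: 3, 10: 2, 12: 1}
wide = histogram_counts(data, 20)    # {0: 10}
```

With a bin width of 2, the bins starting at 4 and 6 are empty, so the two groups are visible; with a bin width of 20 that structure disappears.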
Dotplot
‣ useful when individual values are of interest
‣ can get busy as the sample size increases
box plot
useful for highlighting outliers, median, IQR
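The quantities a box plot displays can be computed directly. The sketch below uses hypothetical data and assumes the common 1.5 × IQR convention for flagging outliers, which the notes do not state. Quartile conventions also differ between tools; this one uses the `inclusive` method of `statistics.quantiles`.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, and points outside the 1.5 * IQR fences."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical sample with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(data)
```

For this sample the quartiles are 4, 6, and 8, so the IQR is 4, the fences sit at -2 and 14, and the value 30 is flagged as an outlier.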
intensity map
Useful for highlighting the spatial distribution.