Duke@Coursera Data Analysis and Statistical Inference, Unit 1 Part 1: Introduction to Data

data basics

‣ observations, variables, and data matrices

‣ types of variables

‣ relationships between variables

all variables

1. numerical (quantitative): take on numerical values; it is sensible to add, subtract, take averages, etc. with these values

Ⅰ continuous: take on any of an infinite number of values within a given range

Ⅱ discrete: take on one of a specific set of numeric values

2. categorical (qualitative): take on a limited number of distinct categories; categories can be identified with numbers, but it is not sensible to do arithmetic operations on them

Ⅰ regular categorical

Ⅱ ordinal: levels have an inherent ordering
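
To make the distinction concrete, here is a minimal sketch in Python. All variable names and values are invented for illustration; none of them come from the course. It shows that averaging makes sense for numerical variables, while a categorical variable coded with numbers is better summarized by counts, and an ordinal variable can be sorted by its levels.

```python
from collections import Counter
from statistics import mean

# Hypothetical data, invented for illustration only.
heights_cm = [160.5, 172.0, 181.3, 168.2]          # numerical, continuous
num_siblings = [0, 2, 1, 3]                        # numerical, discrete
area_codes = [919, 212, 919, 415]                  # categorical, coded with numbers
satisfaction = ["low", "high", "medium", "high"]   # categorical, ordinal

# Averages are meaningful for numerical variables.
avg_height = mean(heights_cm)
avg_siblings = mean(num_siblings)

# Averaging area codes would be meaningless; count the categories instead.
area_code_counts = Counter(area_codes)

# Ordinal levels have an inherent ordering, so they can be sorted by it.
level_order = {"low": 0, "medium": 1, "high": 2}
sorted_satisfaction = sorted(satisfaction, key=level_order.get)

print(avg_height, avg_siblings)
print(area_code_counts)
print(sorted_satisfaction)
```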

relationships between variables

‣ Two variables that show some connection with one another are called associated (dependent)

‣ Association can be further described as positive or negative

‣ If two variables are not associated, they are said to be independent
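
One common way to check the direction of a linear association is the sign of the Pearson correlation coefficient. The sketch below is an illustration, not part of the course notes; the helper `pearson_r` and the data are hypothetical. Note that a coefficient near zero only rules out a linear association, so on its own it does not prove independence.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 61, 70, 78]
absences = [9, 7, 6, 3, 1]

r_positive = pearson_r(hours_studied, exam_score)  # scores rise with hours
r_negative = pearson_r(absences, exam_score)       # scores fall with absences

print(r_positive > 0, r_negative < 0)
```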

observational studies & experiments

‣ define observational studies and experiments

‣ correlation vs. causation

Studies

1. observational

‣ collect data in a way that does not directly interfere with how the data arise (“observe”)

‣ can only establish an association

‣ retrospective: uses past data

‣ prospective: data are collected throughout the study

2. experiment

‣ randomly assign subjects to treatments

‣ can establish causal connections

confounding variables

extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between them.
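
A small simulation can illustrate the idea. This is a sketch under invented assumptions: a binary confounder `z` shifts both `x` and `y`, and `x` has no direct effect on `y`. The two variables look strongly associated overall, yet the association largely disappears once the confounder is held fixed.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical setup: z is the confounder; x and y each depend on z plus
# independent noise, so x never influences y directly.
z = [rng.randint(0, 1) for _ in range(2000)]
x = [5 * zi + rng.gauss(0, 1) for zi in z]
y = [5 * zi + rng.gauss(0, 1) for zi in z]

r_overall = pearson_r(x, y)

def r_within(level):
    """Correlation of x and y among cases where the confounder equals level."""
    pairs = [(xi, yi) for xi, yi, zi in zip(x, y, z) if zi == level]
    return pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

r_z0 = r_within(0)
r_z1 = r_within(1)

print(round(r_overall, 2), round(r_z0, 2), round(r_z1, 2))
```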

sampling & sources of bias

‣ census vs. sample

‣ sources of bias

‣ sampling methods

Census (drawbacks compared with sampling)

‣ Some individuals are hard to locate or measure, and these people may be different from the rest of the population.

‣ Populations rarely stand still

a few sources of sampling bias

Convenience sample: Individuals who are easily accessible are more likely to be included in the sample

Non-response: If only a (non-random) fraction of the randomly sampled people respond to a survey, the sample is no longer representative of the population

Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue

sampling methods

simple random sample (SRS): each case is equally likely to be selected

stratified sample: divide the population into homogeneous strata, then randomly sample from within each stratum

cluster sample: divide the population into clusters, randomly sample a few clusters, then randomly sample from within these clusters
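
The three methods above can be sketched with Python's standard `random` module. The population, the `region` field, and the function names below are hypothetical. For brevity the same grouping serves as both strata and clusters, whereas in practice strata are chosen to be homogeneous and clusters are usually natural groupings such as towns.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 12 people, 4 in each of three regions.
regions = ["north"] * 4 + ["south"] * 4 + ["east"] * 4
population = [{"id": i, "region": r} for i, r in enumerate(regions)]

def group_by(units, key):
    """Group units into a dict keyed by units[key]."""
    groups = {}
    for unit in units:
        groups.setdefault(unit[key], []).append(unit)
    return groups

def simple_random_sample(units, n, rng):
    """Every case is equally likely to be selected."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum, rng):
    """Randomly sample from within every stratum."""
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(units, key, n_clusters, n_per_cluster, rng):
    """Randomly pick a few clusters, then sample within only those."""
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample

srs = simple_random_sample(population, 5, rng)
strat = stratified_sample(population, "region", 2, rng)
clust = cluster_sample(population, "region", 2, 2, rng)
```

The stratified sample always covers every region, while the cluster sample covers only the regions that happened to be drawn.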

experimental design

‣ principles of experimental design

‣ experimental design terminology

principles of experimental design:

(1) control: compare treatment of interest to a control group

(2) randomize: randomly assign subjects to treatments

(3) replicate: collect a sufficiently large sample, or replicate the entire study

(4) block: block for variables known or suspected to affect the outcome
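
As a rough illustration of how control, randomization, and blocking fit together, the sketch below splits subjects into blocks and then randomly assigns half of each block to treatment and half to control. The subjects, the `level` blocking variable, and the function name are hypothetical.

```python
import random

def blocked_random_assignment(subjects, block_key, rng):
    """Within each block, shuffle subjects and split them between
    a treatment group and a control group."""
    blocks = {}
    for subject in subjects:
        blocks.setdefault(subject[block_key], []).append(subject)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]      # copy so the input list is left untouched
        rng.shuffle(shuffled)      # randomize the order within the block
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject["id"]] = "treatment"
        for subject in shuffled[half:]:
            assignment[subject["id"]] = "control"
    return assignment

# Hypothetical subjects: 4 professionals and 4 amateurs.
subjects = [{"id": i, "level": "pro" if i < 4 else "amateur"} for i in range(8)]

rng = random.Random(7)  # fixed seed so the run is repeatable
assignment = blocked_random_assignment(subjects, "level", rng)
print(assignment)
```

Because the split happens inside each block, both groups end up with the same mix of professionals and amateurs, so the blocking variable cannot explain a difference in outcomes.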

experimental terminology

placebo: fake treatment, often used as the control group for medical studies

placebo effect: showing change despite being on the placebo

blinding: experimental units don’t know which group they’re in

double-blind: both the experimental units and the researchers don’t know the group assignment

visualizing numerical data

‣ scatterplots for paired data

‣ other visualizations for describing distributions of numerical variables

evaluating the relationship

‣ direction (positive, negative)

‣ shape (linear, curved)

‣ strength (strong, weak)

‣ outliers

histogram

‣ provides a view of the data density

‣ especially useful for describing the shape of the distribution

Skewness

distributions are skewed to the side of the long tail
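
One way to put a number on this is the moment coefficient of skewness, whose sign follows the long tail: positive for a long right tail, negative for a long left tail. This measure is not part of the course notes; the sketch below and its data are illustrative only.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical data, invented for illustration only.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 15]       # long tail to the right
left_skewed = [-v for v in right_skewed]       # mirror image: tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)   # long right tail gives a positive value
print(skewness(left_skewed) < 0)    # long left tail gives a negative value
```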

Modality

histogram & bin width

The chosen bin width can alter the story the histogram is telling.
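
A minimal sketch of this effect, using invented data and a hypothetical helper `histogram_counts`: with narrow bins the two clusters in the data are visible, while one wide bin merges them into a single bar.

```python
from collections import Counter

def histogram_counts(values, bin_width, start=0):
    """Count values per bin; the bin labelled b covers [b, b + bin_width)."""
    counts = Counter(int((v - start) // bin_width) for v in values)
    return {start + k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical data with two separate clusters.
values = [1, 2, 2, 3, 3, 4, 16, 17, 17, 18, 18, 19]

narrow = histogram_counts(values, bin_width=5)
wide = histogram_counts(values, bin_width=20)

print(narrow)  # two occupied bins with a gap between them
print(wide)    # a single bin that hides the gap
```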

Dotplot

‣ useful when individual values are of interest

‣ can get busy as the sample size increases

box plot

useful for highlighting outliers, median, IQR
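
The quantities a box plot displays can be computed directly. The sketch below uses Python's `statistics.quantiles` and flags outliers with the common 1.5 × IQR convention; that rule and the data are assumptions made for illustration, not something stated in these notes. Quartile definitions vary between tools, so other software may report slightly different values.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Return the median, IQR, and points beyond the 1.5 * IQR fences."""
    q1, median, q3 = quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"median": median, "iqr": iqr, "outliers": outliers}

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

summary = box_plot_summary(data)
print(summary)
```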

intensity map

Useful for highlighting the spatial distribution.
