
Duke@coursera Data Analysis and Statistical Inference unit1 part1 introduction to data

Author: 统计学家
Published 2019-04-10 17:07:09
Included in the column: 机器学习与统计学

data basics

‣ observations, variables, and data matrices

‣ types of variables

‣ relationships between variables

all variables

1. numerical (quantitative): take on numerical values; it is sensible to add, subtract, take averages, etc. with these values

Ⅰ. continuous: take on any of an infinite number of values within a given range

Ⅱ. discrete: take on one of a specific set of numeric values

2. categorical (qualitative): take on a limited number of distinct categories; categories can be identified with numbers, but it is not sensible to do arithmetic operations on them

Ⅰ. regular categorical

Ⅱ. ordinal: levels have an inherent ordering

relationships between variables

‣ Two variables that show some connection with one another are called associated (dependent)

‣ Association can be further described as positive or negative

‣ If two variables are not associated, they are said to be independent
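As a minimal sketch of positive versus negative association, the Pearson correlation coefficient can be computed by hand. Note that this coefficient is an assumption of the sketch (the notes above do not name a specific measure), it only captures linear association, and the data are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, invented for this example
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # tends to rise as hours rise
mistakes = [9, 8, 6, 5, 2]          # tends to fall as hours rise

r_pos = pearson_r(hours_studied, exam_score)
r_neg = pearson_r(hours_studied, mistakes)

# The sign of r gives the direction of the linear association
print("positive association:", r_pos > 0)
print("negative association:", r_neg < 0)
```

A coefficient close to zero only says there is little linear association; it does not by itself show that the two variables are independent.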

observational studies & experiments

‣ define observational studies and experiments

‣ correlation vs. causation

Studies

1. observational

‣ collect data in a way that does not directly interfere with how the data arise (“observe”)

‣ can only establish an association

‣ retrospective: uses past data

‣ prospective: data are collected throughout the study

2. experiment

‣ randomly assign subjects to treatments

‣ can establish causal connections

confounding variables

extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between them.

sampling & sources of bias

‣ census vs. sample

‣ sources of bias

‣ sampling methods

Census

‣ Some individuals are hard to locate or measure, and these people may be different from the rest of the population.

‣ Populations rarely stand still

a few sources of sampling bias

Convenience sample: Individuals who are easily accessible are more likely to be included in the sample

Non-response: Occurs when only a (non-random) fraction of the randomly sampled people respond to a survey, so that the sample is no longer representative of the population

Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue

sampling methods

simple random sample (SRS): each case is equally likely to be selected

stratified sample: divide the population into homogeneous strata, then randomly sample from within each stratum

cluster sample: divide the population into clusters, randomly sample a few clusters, then randomly sample from within these clusters
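The three sampling methods above can be sketched with Python's standard `random` module. The population, the `region` grouping, and the sample sizes are all hypothetical. For brevity the same grouping serves as both strata and clusters, which is a simplification: in practice strata are chosen to be internally homogeneous, while clusters need not be.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 12 people, each tagged with a region
regions = ["north"] * 4 + ["south"] * 4 + ["east"] * 4
population = [(f"person{i}", region) for i, region in enumerate(regions)]

def group_by_region(people):
    """Map each region to the list of people in it."""
    groups = {}
    for person in people:
        groups.setdefault(person[1], []).append(person)
    return groups

groups = group_by_region(population)

# Simple random sample: every case is equally likely to be selected
srs = rng.sample(population, 6)

# Stratified sample: randomly sample 2 people from within every stratum
stratified = [p for members in groups.values() for p in rng.sample(members, 2)]

# Cluster sample: randomly pick 2 clusters, then sample 3 people within each
chosen = rng.sample(sorted(groups), 2)
cluster = [p for name in chosen for p in rng.sample(groups[name], 3)]

print("SRS:       ", srs)
print("stratified:", stratified)
print("cluster:   ", cluster, "from clusters", chosen)
```

The stratified sample always contains every region, whereas the cluster sample contains only the regions that happened to be drawn.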

experimental design

‣ principles of experimental design

‣ experimental design terminology

principles of experimental design:

(1) control: compare treatment of interest to a control group

(2) randomize: randomly assign subjects to treatments

(3) replicate: collect a sufficiently large sample, or replicate the entire study

(4) block: block for variables known or suspected to affect the outcome
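A minimal sketch of how blocking and randomization can be combined: subjects are first grouped by a blocking variable, and within each block half are randomly assigned to treatment and half to control. The subject names, the `risk` blocking variable, and the helper `blocked_assignment` are hypothetical and exist only for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical subjects, each with a blocking variable (risk level)
subjects = [
    ("s1", "low"), ("s2", "low"), ("s3", "low"), ("s4", "low"),
    ("s5", "high"), ("s6", "high"), ("s7", "high"), ("s8", "high"),
]

def blocked_assignment(subjects, rng):
    """Within each block, randomly assign half to treatment and the rest to control."""
    blocks = {}
    for name, block in subjects:
        blocks.setdefault(block, []).append(name)

    assignment = {}
    for names in blocks.values():
        shuffled = names[:]      # copy so the input list is left untouched
        rng.shuffle(shuffled)    # randomize the order within the block
        half = len(shuffled) // 2
        for name in shuffled[:half]:
            assignment[name] = "treatment"
        for name in shuffled[half:]:
            assignment[name] = "control"
    return assignment

assignment = blocked_assignment(subjects, rng)
for name, block in subjects:
    print(name, block, assignment[name])
```

Because the split happens inside each block, both groups end up with the same number of low-risk and high-risk subjects, so the blocking variable cannot differ between treatment and control.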

experimental terminology

placebo: fake treatment, often used as the control group for medical studies

placebo effect: showing change despite being on the placebo

blinding: experimental units don’t know which group they’re in

double-blind: both the experimental units and the researchers don’t know the group assignment

visualizing numerical data

‣ scatterplots for paired data

‣ other visualizations for describing distributions of numerical variables

evaluating the relationship

direction (positive, negative), shape (linear, curved)

strength (strong, weak), outliers

histogram

‣ provides a view of the data density

‣ especially useful for describing the shape of the distribution

Skewness

distributions are skewed to the side of the long tail
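One rough way to see which side the long tail is on, sketched here with made-up data: a few extreme values in the tail tend to pull the mean toward them, while the median barely moves. Comparing the two is a rule of thumb assumed for this example, not a definition from the notes, and it can fail for unusual distributions.

```python
from statistics import mean, median

def tail_side(values):
    """Guess the side of the long tail by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric or unclear"

# Hypothetical data: one large value creates a long right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Mirror image of the same data: long left tail
left_skewed = [-v for v in right_skewed]

print(tail_side(right_skewed))  # mean 4.7 vs. median 3
print(tail_side(left_skewed))   # mean -4.7 vs. median -3
```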

Modality

histogram & bin width

The chosen bin width can alter the story the histogram is telling.
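To illustrate, here is a small sketch that bins the same made-up data with two different widths. The helper `histogram_counts` and the data are hypothetical; each bin is labelled by its left edge and empty bins are omitted.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin; each bin covers [edge, edge + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical data with two separate groups of values
values = [1, 2, 2, 3, 11, 12, 12, 13, 14]

wide = histogram_counts(values, 20)    # one bin swallows everything
narrow = histogram_counts(values, 5)   # the two groups become visible

print("bin width 20:", wide)
print("bin width 5: ", narrow)
```

With a width of 20 all nine values fall into a single bar, so the histogram hides the gap; with a width of 5 the values split into the bins starting at 0 and 10, with an empty bin between them.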

Dotplot

‣ useful when individual values are of interest

‣ can get busy as the sample size increases

box plot

useful for highlighting outliers, median, IQR
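The quantities a box plot displays can be computed directly. The sketch below uses made-up data and assumes the common 1.5 × IQR convention for flagging outliers, which the notes above do not state; the quartiles come from `statistics.quantiles` with its default method, and other quartile conventions give slightly different values.

```python
from statistics import median, quantiles

# Hypothetical data with one unusually large value
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]

q1, q2, q3 = quantiles(values, n=4)  # quartiles; q2 is the median
iqr = q3 - q1                        # interquartile range: the box length

# Assumed convention: points beyond 1.5 * IQR from the box are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("median:", q2)
print("IQR:", iqr)
print("outliers:", outliers)
```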

intensity map

Useful for highlighting the spatial distribution.

Originally published 2015-04-30. Shared from the WeChat official account 机器学习与统计学.
