R Statistics Basics and Worked Examples of Common Methods

三更两点

Published 2021-01-14 14:49:05


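Statistics

Mean

Syntax

mean(x, trim = 0, na.rm = FALSE, ...)
x - the input vector.
trim - the fraction (0 to 0.5) of observations to drop from each end of the sorted vector before the mean is computed.
na.rm - whether to remove missing values (NA) from the input vector before computing.

Example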

x <- c(17,8,6,4.12,11,8,54,-11,18,-7)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
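
The effect of the trim and na.rm arguments is easier to see side by side. The sketch below reuses the vector from the example; the names plain.mean, trimmed.mean, x.na, na.mean and clean.mean are illustrative only.

```r
x <- c(17, 8, 6, 4.12, 11, 8, 54, -11, 18, -7)

# Plain mean: every observation counts, including the outlier 54.
plain.mean <- mean(x)

# Trimmed mean: trim = 0.2 drops floor(10 * 0.2) = 2 values from each end
# of the sorted vector, then averages the remaining 6 values.
trimmed.mean <- mean(x, trim = 0.2)

# With a missing value, mean() returns NA unless na.rm = TRUE.
x.na <- c(x, NA)
na.mean <- mean(x.na)
clean.mean <- mean(x.na, na.rm = TRUE)

print(c(plain = plain.mean, trimmed = trimmed.mean, clean = clean.mean))
print(na.mean)
```

Trimming is a simple way to reduce the influence of extreme values such as 54 and -11 without discarding them by hand.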

Median

Syntax

median(x, na.rm = FALSE)
x - the input vector.
na.rm - whether to remove missing values (NA) from the input vector before computing.

Example

# Find the median.
median.result <- median(x)
print(median.result)

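Mode

The mode is the value that occurs most often in a data set. Unlike the mean and the median, the mode can be computed for both numeric and character data. R has no built-in function for the statistical mode, so the example defines one.

# Mode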
# Create the function.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.
result <- getmode(v)
print(result)

# Create the vector with characters.
charv <- c("baidu.com","tmall.com","yiibai.com","qq.com","yiibai.com")

# Calculate the mode using the user function.
result <- getmode(charv)
print(result)
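
One caveat: which.max() returns only the first maximum, so getmode() reports a single value even when several values tie. In the numeric vector above, 2 and 3 both occur four times. A minimal variant that returns every tied value could look like this; getmodes is a hypothetical helper name, not part of R.

```r
# Return all values that share the highest frequency.
getmodes <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))
  uniqv[counts == max(counts)]
}

v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)
modes.num <- getmodes(v)
print(modes.num)

charv <- c("baidu.com", "tmall.com", "yiibai.com", "qq.com", "yiibai.com")
modes.chr <- getmodes(charv)
print(modes.chr)
```

The values come back in order of first appearance, which keeps the result consistent with getmode() when there is no tie.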

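Linear Regression

Simple regression (one predictor)

y = ax + b
y is the response variable.
x is the predictor variable.
a and b are the model parameters (the coefficients).

Example: the relationship between height and weight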

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart.
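plot(y, x, col = "blue", main = "Height and Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight (kg)", ylab = "Height (cm)")
# Add the fitted line; lm(x ~ y) matches the plotted axes (height against weight).
abline(lm(x ~ y))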

# Save the file.
dev.off()


print(summary(relation))

# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <-  predict(relation,a)
print(result)
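
To connect the fitted model back to the equation y = ax + b, the prediction can be reproduced by hand from the coefficients. This is a small sketch using the same data; the names intercept, slope and manual are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # height (cm)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weight (kg)

relation <- lm(y ~ x)

# coef() returns the intercept (b) first, then the slope (a).
intercept <- coef(relation)[["(Intercept)"]]
slope <- coef(relation)[["x"]]

# Apply y = a * x + b by hand for a height of 170 cm.
manual <- slope * 170 + intercept

# predict() should give the same number.
auto <- predict(relation, data.frame(x = 170))

print(c(manual = manual, predict = unname(auto)))
```

Passing interval = "prediction" to predict() additionally returns lower and upper bounds for the predicted weight.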

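Multiple Regression

Basic usage of lm() for multiple regression

lm(y ~ x1+x2+x3..., data)
formula - that is, y ~ x1+x2+x3..., a symbolic description of the relationship between the response variable and the predictor variables.
data - the data frame to which the formula is applied.

Example

# Select the columns used by the model from mtcars.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)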

# Show the model.
print(model)

# Get the Intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ","\n")

a <- coef(model)[1]
print(a)

Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
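
The extracted coefficients define the regression equation mpg = a + Xdisp * disp + Xhp * hp + Xwt * wt. The sketch below applies that equation to one car and checks it against predict(). The car's values (disp = 221, hp = 102, wt = 2.91) are hypothetical inputs chosen for illustration.

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)

b <- coef(model)

# Hypothetical car used only to illustrate the calculation.
new.car <- data.frame(disp = 221, hp = 102, wt = 2.91)

# Apply the regression equation by hand.
manual <- b[["(Intercept)"]] +
  b[["disp"]] * new.car$disp +
  b[["hp"]] * new.car$hp +
  b[["wt"]] * new.car$wt

# predict() evaluates the same equation.
auto <- predict(model, new.car)

print(c(manual = manual, predict = unname(auto)))
```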

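Logistic Regression

Logistic regression is a regression model in which the response variable takes categorical values, such as True/False or 0/1.

Syntax

glm(formula, data, family)

formula - a symbolic description of the relationship between the variables.
data - the data set that supplies the values of those variables.
family - an R object that specifies the details of the model; for logistic regression its value is binomial.

Example

# Select some columns from mtcars.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
print(head(input))

# Fit the logistic regression model.
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)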

print(summary(am.data))
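
The summary lists coefficients on the log-odds scale, which is hard to read directly. As a sketch, the fitted model can be turned into predicted probabilities and compared with the observed values. The 0.5 cutoff is an assumption made for illustration, not something the model prescribes.

```r
input <- mtcars[, c("am", "cyl", "hp", "wt")]
am.data <- glm(am ~ cyl + hp + wt, data = input, family = binomial)

# type = "response" returns probabilities; the default returns log-odds.
prob <- predict(am.data, type = "response")
log.odds <- predict(am.data, type = "link")

# Classify with an assumed cutoff of 0.5.
pred <- ifelse(prob > 0.5, 1, 0)

# Cross-tabulate observed against predicted transmission type.
print(table(actual = input$am, predicted = pred))
```

The two prediction scales are related through the logistic function: plogis(log.odds) reproduces prob.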

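Normal Distribution

The density curve peaks at the mean and is symmetric about it: the half to the left of the mean mirrors the half to the right.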

# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)

# Choose the mean as 2.5 and standard deviation as 0.5.
y <- dnorm(x, mean = 2.5, sd = 0.5)

# Give the chart file a name.
png(file = "dnorm.png")

plot(x,y)

# Save the file.
dev.off()
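
Alongside dnorm(), R provides pnorm() for cumulative probabilities, qnorm() for quantiles and rnorm() for random draws. The sketch below uses the same mean and standard deviation to check the symmetry described above; the variable names are illustrative.

```r
mu <- 2.5
sigma <- 0.5

# Symmetry: the density is equal at the same distance on either side of the mean.
left <- dnorm(mu - 1, mean = mu, sd = sigma)
right <- dnorm(mu + 1, mean = mu, sd = sigma)

# Half of the probability mass lies below the mean.
p.below.mean <- pnorm(mu, mean = mu, sd = sigma)

# Probability of falling within one standard deviation of the mean (about 68%).
within.one.sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)

# The quantile function inverts pnorm().
q <- qnorm(0.975, mean = mu, sd = sigma)

# Random draws; set.seed() makes them reproducible.
set.seed(1)
draws <- rnorm(1000, mean = mu, sd = sigma)

print(c(left = left, right = right, p.below.mean = p.below.mean,
        within.one.sd = within.one.sd, q975 = q))
```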

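Decision Tree

A decision tree is a graph that represents choices and their outcomes in the form of a tree. The nodes represent events or choices, and the edges represent decision rules or conditions.
install.packages("party")
- ctree(formula, data)
  - formula - a formula describing the predictor variables and the response variable.
  - data - the name of the data set to use.
Decision tree example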

library(party)

# Create the input data frame.
input.dat <- readingSkills[c(1:105),]

# Give the chart file a name.
png(file = "decision_tree.png")

# Create the tree.
output.tree <- ctree(
  nativeSpeaker ~ age + shoeSize + score,
  data = input.dat)

# Plot the tree.
plot(output.tree)

# Save the file.
dev.off()

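Random Forest

Install the package
- install.packages("randomForest")
Syntax for creating a random forest in R

randomForest(formula, data)
formula - a formula describing the predictor variables and the response variable.
data - the name of the data set to use.

Example code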

library("party")
library("randomForest")

# Create the forest.
output.forest <- randomForest(nativeSpeaker ~ age + shoeSize + score, 
           data = readingSkills)

# View the forest results.
print(output.forest) 

# Importance of each predictor.
print(importance(output.forest,type = 2))

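Survival Analysis

Survival analysis deals with predicting the time at which a specific event occurs. The survival package is used for this kind of analysis. Its Surv() function builds a survival object from a time variable and an event indicator, and that object serves as the response in an R formula. The survfit() function then estimates the survival curve, which can be plotted.
install.packages("survival")

Surv(time, event)
survfit(formula)
time - the follow-up time until the event occurs.
event - the status indicating whether the expected event occurred.
formula - the relationship between the survival object and the predictor variables (use ~ 1 when there are no predictors).

Example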

# Load the library.
library("survival")

# Create the survival object. 
survfit(Surv(pbc$time,pbc$status == 2)~1)

# Give the chart file a name.
png(file = "survival.png")

# Plot the graph. 
plot(survfit(Surv(pbc$time,pbc$status == 2)~1))

# Save the file.
dev.off()
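
The example above fits a single curve with ~ 1. As a sketch of how a predictor enters the formula, the code below fits one curve per treatment group using the trt column of the same pbc data set and compares the groups with a log-rank test through survdiff(). Choosing trt as the grouping variable is an assumption made for illustration.

```r
library(survival)

# One Kaplan-Meier curve per treatment group; rows with a missing trt are dropped.
fit.by.trt <- survfit(Surv(time, status == 2) ~ trt, data = pbc)
print(fit.by.trt)

# Log-rank test of whether the survival curves differ between groups.
logrank <- survdiff(Surv(time, status == 2) ~ trt, data = pbc)
print(logrank)
```

Calling plot(fit.by.trt) draws both curves on one chart, in the same way as the single-curve plot above.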

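Chi-Square Test

The chi-square test is a statistical method for determining whether two categorical variables are significantly associated.
Syntax
- The function that performs the chi-square test is chisq.test():
chisq.test(data)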

# Load the library.
library("MASS")

# Build a contingency table of airbag type by car type.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test.
print(chisq.test(car.data))
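
To show what chisq.test() computes, the statistic can be rebuilt by hand from the observed and expected counts. This sketch swaps in the built-in mtcars data (transmission type against cylinder count) so that it runs without extra packages; that choice of data is an assumption for illustration only. The cell counts are small, so chisq.test() warns that the approximation may be inaccurate, and the warning is suppressed here.

```r
# Observed counts: transmission type (am) by number of cylinders (cyl).
observed <- table(mtcars$am, mtcars$cyl)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic, degrees of freedom and p-value.
stat <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p.value <- pchisq(stat, df = dof, lower.tail = FALSE)

# The built-in test should agree with the hand calculation.
builtin <- suppressWarnings(chisq.test(observed))

print(c(statistic = stat, df = dof, p.value = p.value))
print(builtin)
```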

Originally published: 2020/06/17