Original source: https://github.com/nanli-7/basketball_data_visualization
The main goal of the analysis is to use data visualization to examine the relationship between determinant factors and baseball players' performance. The dataset contains performance and biographical information for 1157 baseball players, with five variables in total:
avg and HR measure player performance (batting average and number of home runs).
Handedness, height, and weight are attributes of the player.
Because height and weight are positively correlated, only height and handedness are used as explanatory variables, with avg and HR as the response variables.
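# Load the plotting library and read the player data
library(ggplot2)
df <- read.csv("../Desktop/data_analysis_practice/basketball_data_visualization-master/baseball_data.csv", header = TRUE)
dim(df)       # number of rows and columns
colnames(df)  # variable names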
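The claim that height and weight are positively correlated can be checked directly once the data is loaded. The following is a minimal sketch, not code from the original analysis. It assumes the data frame has numeric columns named `height` and `weight` (the `weight` column name is an assumption, since it is not used elsewhere in this post), and `height_weight_cor` is a hypothetical helper name.

```r
# Sketch: Pearson correlation between height and weight.
# Assumes `data` has numeric columns `height` and `weight` (names assumed).
height_weight_cor <- function(data) {
  # use = "complete.obs" drops rows where either value is missing
  cor(data$height, data$weight, use = "complete.obs", method = "pearson")
}

# Usage, once df has been read in as above:
# height_weight_cor(df)
```

A value clearly above zero would support dropping weight in favor of height as an explanatory variable.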
ggplot(df,aes(x=height))+geom_histogram(binwidth = 0.5,fill="darkgreen")+theme_bw()
The plot above shows that player height is approximately normally distributed, with most players between 70 and 75 inches tall.
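The reading of the histogram can be backed up with a few numbers. The sketch below is an illustration rather than part of the original analysis; `height_summary` is a hypothetical helper that reports the mean, the standard deviation, the share of players between 70 and 75 inches, and the p-value of a Shapiro-Wilk test (`shapiro.test` from base R's stats package).

```r
# Sketch: numeric summary to accompany the height histogram.
# `height` is a numeric vector of heights in inches.
height_summary <- function(height) {
  height <- height[!is.na(height)]  # drop missing values
  list(
    mean = mean(height),
    sd = sd(height),
    # proportion of players with height in [70, 75]
    share_70_75 = mean(height >= 70 & height <= 75),
    # Shapiro-Wilk normality test; accepts between 3 and 5000 values
    shapiro_p = shapiro.test(height)$p.value
  )
}

# Usage, once df has been read in:
# height_summary(df$height)
```

Heights recorded in whole inches contain many ties, so the test's p-value should be read with caution; a Q-Q plot via `qqnorm(df$height)` is a useful visual complement.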
ggplot(df,aes(x=handedness,y=HR))+geom_boxplot(fill=c("red","darkgreen","blue"))+theme_bw()
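# Scatter plot of batting average against height, colored by handedness,
# with the mean batting average at each height overlaid as a line
ggplot(data = df, aes(x = height, y = avg)) +
  geom_jitter(aes(color = handedness), alpha = 0.7, size = 1.5) +
  scale_x_continuous(breaks = seq(65, 80, 1)) +
  scale_color_brewer(palette = "Set1") +
  # fun.y is deprecated since ggplot2 3.3.0; use fun instead
  stat_summary(fun = "mean",
               geom = "line",
               color = "orange2",
               size = 1.2) +
  theme_bw()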
The plot above shows that the mean batting average tends to decrease as height increases.
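The orange line in the plot is the mean of avg at each height, and the same values can be tabulated to inspect the trend numerically. This is a minimal sketch under the assumption that the data frame has `height` and `avg` columns as used in the plotting code; `mean_avg_by_height` is a hypothetical helper name.

```r
# Sketch: mean batting average for each distinct height.
# Assumes `data` has numeric columns `height` and `avg`.
mean_avg_by_height <- function(data) {
  # aggregate() returns one row per height, sorted by height;
  # rows with missing values are dropped by the formula interface
  aggregate(avg ~ height, data = data, FUN = mean)
}

# Usage, once df has been read in:
# means <- mean_avg_by_height(df)
# coef(lm(avg ~ height, data = means))  # a negative slope matches the downward trend
```

Heights with only a handful of players can produce noisy means, so the ends of the line deserve less weight than the middle.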
The most attractive figure in the original is probably the one below, but I could not find the code for it.