# Hands-On | Python Code and Bayesian Theory Tell You Who the Best Baseball Player Is

Link to Rasmus Bååth's video:

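"I don't know whether you know it, but baseball's appeal is decimal points. No other sport relies as totally on continuity, statistics, orderliness of these. Baseball fans pay more attention to numbers than CPAs." (sportswriter Jim Murray)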

• How to interpret batting averages from 2018 spring training
• How to compare the batting averages of two players

1. Data

2. Generative model

3. Priors
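The three ingredients listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the counts `at_bats` and `hits` are hypothetical, and `simulate_hits` is a helper name introduced here, not something from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible

# 1. Data: hypothetical counts, NOT taken from the article
at_bats, hits = 20, 5

# 2. Generative model: each at-bat is a hit with probability p,
#    so the number of hits is Binomial(at_bats, p)
def simulate_hits(p, n_at_bats):
    return rng.binomial(n_at_bats, p)

# 3. Prior: before seeing any data, every AVG in [0, 1] is equally likely
prior_draws = rng.uniform(0, 1, size=10_000)

# One simulated hit count per prior draw
simulated = simulate_hits(prior_draws, at_bats)
print(simulated[:10])
```

Comparing `simulated` with the observed `hits` is what turns prior draws into posterior draws.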

```python
import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0, 0.1  # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)  # 1,000 draws from Normal(mu, sigma)
plt.hist(s)
```

Fox Sports link:

https://www.foxsports.com/mlb/stats

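```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import requests
from bs4 import BeautifulSoup

plt.style.use('fivethirtyeight')
# Jupyter-only magics: remove these two lines when running as a plain script
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

def batting_stats(url, season):
    """Scrape a Fox Sports player-stats page into a DataFrame of strings."""
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    table = soup.find_all("table", {"class": "wisbb_standardTable tablesorter"})[0]
    # table_head was never defined in the original; this assumes the column
    # names live in the table's <thead> element.
    table_head = table.find('thead')
    # The number of trailing rows to skip depends on the page type
    if season == 'spring':
        row_height = len(table.find_all('tr')[:-1])
    else:
        row_height = len(table.find_all('tr')[:-2])
    result_df = pd.DataFrame(columns=[row.text.strip() for row in table_head.find_all('th')],
                             index=range(0, row_height))
    # Fill the frame cell by cell; rows without <td> cells are left empty
    row_marker = 0
    for row in table.find_all('tr')[:-1]:
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            result_df.iat[row_marker, column_marker] = column.text.strip()
            column_marker += 1
        row_marker += 1
    return result_df
```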

```python
# Dominic Smith's spring-training stats
ds_url_st = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=3"
dominic_smith_spring = batting_stats(ds_url_st, 'spring')
dominic_smith_spring.iloc[-1]
```
```python
n_draw = 20000
# Non-informative prior: every AVG between 0 and 1 is equally likely
prior_ni = pd.Series(np.random.uniform(0, 1, size=n_draw))
plt.figure(figsize=(8, 5))
plt.hist(prior_ni)
plt.title('Uniform distribution(0,1)')
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')
```

```python
def posterior(n_try, k_success, prior):
    """Rejection-sample the posterior: keep prior draws that reproduce the data."""
    # Simulate one hit count per prior draw from the binomial generative model
    hit = np.random.binomial(n_try, prior.values)
    # Keep only the draws whose simulated hit count equals the observed one
    post = prior[hit == k_success]
    plt.figure(figsize=(8, 5))
    plt.hist(post)
    plt.title('Posterior distribution')
    plt.xlabel('Posterior on AVG')
    plt.ylabel('Frequency')
    print('Number of draws left: %d, Posterior mean: %.3f, Posterior median: %.3f, '
          'Posterior 95%% quantile interval: %.3f-%.3f' %
          (len(post), post.mean(), post.median(), post.quantile(.025), post.quantile(.975)))

# Observed data: at-bats (AB) and hits (H) from the last row of the spring table
ds_n_trials = int(dominic_smith_spring['AB'].iloc[-1])
ds_k_success = int(dominic_smith_spring['H'].iloc[-1])
posterior(ds_n_trials, ds_k_success, prior_ni)
```

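• Credible interval: given the observed data, there is a 95% probability that the true AVG lies within the interval.
• Confidence interval: if we repeatedly computed confidence intervals from data of this kind, 95% of those intervals would contain the true AVG.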
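To make the distinction concrete, here is a minimal sketch that computes both kinds of interval for the same data. The counts are hypothetical, the credible interval assumes a uniform prior (so the posterior is a Beta distribution), and the confidence interval uses the normal-approximation (Wald) formula, which is just one of several frequentist choices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seeded only for reproducibility

n_at_bats, n_hits = 50, 15  # hypothetical counts, not from the article

# Bayesian: with a uniform prior the posterior is Beta(hits + 1, misses + 1);
# the 95% credible interval is read off the posterior draws.
post = rng.beta(n_hits + 1, n_at_bats - n_hits + 1, size=100_000)
cred_low, cred_high = np.quantile(post, [0.025, 0.975])

# Frequentist: normal-approximation (Wald) 95% confidence interval
p_hat = n_hits / n_at_bats
se = np.sqrt(p_hat * (1 - p_hat) / n_at_bats)
conf_low, conf_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"95% credible interval:   {cred_low:.3f}-{cred_high:.3f}")
print(f"95% confidence interval: {conf_low:.3f}-{conf_high:.3f}")
```

The two intervals can be numerically close, but only the credible interval is a probability statement about the AVG itself given this particular data set.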

`dominic_smith_spring.iloc[-2:]`

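The Beta distribution is a continuous probability distribution with two parameters, alpha and beta. One of its most common uses is to model uncertainty about the probability of success of an experiment.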

More on the Beta distribution:

https://www.statlect.com/probability-distributions/beta-distribution
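As a rough illustration of how alpha and beta encode uncertainty, the sketch below compares three Beta distributions that share the same mean but carry different amounts of evidence. The parameter values and the helper name `beta_summary` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # seeded only for reproducibility

def beta_summary(alpha, beta, n_draw=100_000):
    """Return the analytic mean and the sampled standard deviation of Beta(alpha, beta)."""
    draws = rng.beta(alpha, beta, size=n_draw)
    return alpha / (alpha + beta), draws.std()

# Hypothetical parameters: the mean stays at 0.25 while the evidence grows
for a, b in [(1, 3), (10, 30), (100, 300)]:
    mean, sd = beta_summary(a, b)
    print(f"Beta({a}, {b}): mean={mean:.3f}, sd={sd:.3f}")
```

Larger alpha and beta with the same ratio give a narrower distribution, which is why a prior built from many at-bats pulls the posterior harder than one built from a few.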

```python
n_draw = 20000
# Informative prior built from the row at position 3 of the spring table
prior_trials = int(dominic_smith_spring.iloc[3].AB)
prior_success = int(dominic_smith_spring.iloc[3].H)
# Beta(successes + 1, failures + 1) is a uniform Beta(1, 1) updated with those counts
prior_i = pd.Series(np.random.beta(prior_success + 1, prior_trials - prior_success + 1, size=n_draw))
plt.figure(figsize=(8, 5))
plt.hist(prior_i)
plt.title('Beta distribution(a=%d, b=%d)' % (prior_success + 1, prior_trials - prior_success + 1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

posterior(ds_n_trials, ds_k_success, prior_i)
```

```python
ds_url = "https://www.foxsports.com/mlb/dominic-smith-player-stats?seasonType=1"
dominic_smith_reg = batting_stats(ds_url, 'regular')
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
dominic_smith = pd.concat([dominic_smith_reg, dominic_smith_spring.iloc[[3]]], ignore_index=True)
dominic_smith
```
```python
# Pool at-bats and hits across all rows to build a stronger prior
ds_prior_trials = pd.to_numeric(dominic_smith.AB).sum()
ds_prior_success = pd.to_numeric(dominic_smith.H).sum()
n_draw = 20000
prior_i_02 = pd.Series(np.random.beta(ds_prior_success + 1, ds_prior_trials - ds_prior_success + 1, size=n_draw))
plt.figure(figsize=(8, 5))
plt.hist(prior_i_02)
plt.title('Beta distribution(a=%d, b=%d)' % (ds_prior_success + 1, ds_prior_trials - ds_prior_success + 1))
plt.xlabel('Prior on AVG')
plt.ylabel('Frequency')

posterior(ds_n_trials, ds_k_success, prior_i_02)
```

PyMC3 link:

https://github.com/pymc-devs/pymc3

HMC-NUTS link:

http://blog.fastforwardlabs.com/2017/01/30/the-algorithms-behind-probabilistic-programming.html

```python
gc_url_st = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=3"
gc_url_reg = "https://www.foxsports.com/mlb/gavin-cecchini-player-stats?seasonType=1"
gavin_cecchini_spring = batting_stats(gc_url_st, 'spring')
gavin_cecchini_reg = batting_stats(gc_url_reg, 'regular')
# Observed data: at-bats and hits from the row at position 1 of the spring table
gc_n_trials = int(gavin_cecchini_spring.iloc[1].AB)
gc_k_success = int(gavin_cecchini_spring.iloc[1].H)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the same two rows
gc_prior = pd.concat([gavin_cecchini_reg.iloc[[1]], gavin_cecchini_spring.iloc[[0]]])
gc_prior
```
```python
gc_prior_trials = pd.to_numeric(gc_prior.AB).sum()
gc_prior_success = pd.to_numeric(gc_prior.H).sum()

def observed_data_generator(n_try, observed_data):
    """Expand (at-bats, hits) into a 0/1 array: one 1 per hit, one 0 per hitless at-bat."""
    result = np.ones(observed_data)
    fails = n_try - observed_data
    result = np.append(result, np.zeros(fails))
    return result

ds_observed = observed_data_generator(ds_n_trials, ds_k_success)
gc_observed = observed_data_generator(gc_n_trials, gc_k_success)
```

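```python
import pymc3 as pm

with pm.Model() as model_a:
    # Beta priors built from each player's earlier at-bats and hits
    D_p = pm.Beta('DS_AVG', ds_prior_success + 1, ds_prior_trials - ds_prior_success + 1)
    G_p = pm.Beta('GC_AVG', gc_prior_success + 1, gc_prior_trials - gc_prior_success + 1)
    # Likelihood: each observed at-bat is a Bernoulli trial
    DS = pm.Bernoulli('DS', p=D_p, observed=ds_observed)
    GC = pm.Bernoulli('GC', p=G_p, observed=gc_observed)
    # Quantity of interest: difference between the two batting averages
    DvG = pm.Deterministic('DvG', D_p - G_p)
    start = pm.find_MAP()
    trace = pm.sample(10000, start=start)

# If your PyMC3 version rejects `varnames`, try `var_names` instead
pm.plot_posterior(trace, varnames=['DS_AVG', 'GC_AVG', 'DvG'], ref_val=0)
```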

http://www.sumsar.net/blog/2014/10/probable-points-and-credible-intervals-part-one/

`pm.summary(trace)`

https://github.com/tthustla/Bayesball/blob/master/Bayesball.ipynb

https://towardsdatascience.com/bayesball-bayesian-analysis-of-batting-average-102e0390c0e4
