
Implementing a Naive Bayes Classifier with TensorFlow

By flavorfan · Published 2021-02-27

A naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem together with strong (naive) assumptions about feature independence; it is also known as an "independent feature model". This post demonstrates a naive Bayes classifier implemented with TensorFlow, and uses the TensorFlow Probability (TFP) library to build a variant whose Gaussian distribution parameters are trainable.


1. Importing and inspecting the data

The iris dataset contains 50 samples from each of three species: Iris setosa, Iris virginica, and Iris versicolor. Four features were measured on each sample: the length and width of the sepals and petals. Our goal is to build a naive Bayes classifier that predicts the correct species from the sepal length and sepal width features alone (i.e., only 2 of the 4 features).

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection

# Load the dataset
iris = datasets.load_iris()

# Use only the first two features: sepal length and width
data = iris.data[:, :2]
targets = iris.target

# Randomly shuffle the data and make train and test splits
x_train, x_test, y_train, y_test = \
    model_selection.train_test_split(data, targets, test_size=0.2)

# Plot the training data
labels = {0: 'Setosa', 1: 'Versicolour', 2: 'Virginica'}
label_colours = ['blue', 'red', 'green']

def plot_data(x, y, labels, colours):
    for y_class in np.unique(y):
        index = np.where(y == y_class)
        plt.scatter(x[index, 0], x[index, 1],
                    label=labels[y_class], c=colours[y_class])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()
    
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()

2. Theoretical background of the Bayes classifier

The basic equation of a Bayes classifier is Bayes' rule:

$$P(Y = y_k \mid X_1, \ldots, X_d) = \frac{P(X_1, \ldots, X_d \mid Y = y_k)\,P(Y = y_k)}{P(X_1, \ldots, X_d)}$$

where $d$ is the number of features, $k$ indexes the classes, $P(Y)$ is the prior distribution over classes, and $P(X \mid Y)$ is the class-conditional distribution of the inputs.

The naive Bayes classifier assumes the features $X_i$ are conditionally independent given the class $Y$ (the "naive" assumption). In that case the class-conditional distribution factorizes as

$$P(X_1, \ldots, X_d \mid Y = y_k) = \prod_{i=1}^{d} P(X_i \mid Y = y_k)$$

Given the class prior and the class-conditional distributions, the naive Bayes prediction reduces to

$$\hat{y} = \arg\max_{k}\; P(Y = y_k) \prod_{i=1}^{d} P(X_i \mid Y = y_k)$$
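As a quick numeric illustration of this decision rule, here is a minimal sketch with made-up priors and per-feature likelihoods (purely illustrative values, not from the iris data):

# Toy naive Bayes decision rule with made-up numbers (illustrative only)
import numpy as np

priors = np.array([0.3, 0.3, 0.4])         # P(Y = y_k) for k = 0, 1, 2
likelihoods = np.array([[0.10, 0.50],      # P(X_i = x_i | Y = y_k):
                        [0.40, 0.20],      # rows index classes, columns features
                        [0.30, 0.30]])

joint = priors * likelihoods.prod(axis=1)  # P(Y = y_k) * prod_i P(X_i | Y = y_k)
posterior = joint / joint.sum()            # normalize over classes
print(posterior.argmax())                  # predicted class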

3. Implementation with the TensorFlow math API

3.1 Prior distribution

def get_prior(y):
    # Estimate the class prior P(Y) from the empirical label counts
    classes = np.unique(y)
    counts = np.zeros(len(classes))
    for c_k in classes:
        counts[c_k] = np.sum(np.where(y == c_k, 1, 0))
    probs = counts / np.sum(counts)
    print('The class priors are {}'.format(probs))
    priors = tfd.Categorical(probs=probs)
    return priors

prior = get_prior(y_train)
# ==> The class priors are [0.31666667 0.325      0.35833333]

Sanity-check the prior distribution by plotting it:

labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()

3.2 Class-conditional distributions

Next we define the class-conditional distributions $P(X_i \mid Y = y_k)$ for $i = 0, 1$ and $k = 0, 1, 2$. In our model we assume each of these is a univariate Gaussian,

$$P(X_i \mid Y = y_k) = \mathcal{N}(X_i \mid \mu_{ik}, \sigma_{ik})$$

with mean parameter $\mu_{ik}$ and standard deviation parameter $\sigma_{ik}$, twelve parameters in all. We again estimate these parameters by maximum likelihood; in this case the estimates are

$$\hat{\mu}_{ik} = \frac{\sum_{n} x_i^{(n)}\,\delta(y^{(n)} = y_k)}{\sum_{n} \delta(y^{(n)} = y_k)}, \qquad \hat{\sigma}_{ik}^2 = \frac{\sum_{n} \big(x_i^{(n)} - \hat{\mu}_{ik}\big)^2\,\delta(y^{(n)} = y_k)}{\sum_{n} \delta(y^{(n)} = y_k)}$$

where $\delta(y^{(n)} = y_k)$ is 1 when the condition holds and 0 otherwise.

def get_class_conditionals(x, y):
    def delta(a, b):
        # Indicator function: 1 if a == b, else 0
        return 1. if a == b else 0.

    classes = np.unique(y)
    n_sample = len(y)

    mu = np.zeros((len(classes), 2))
    sigma = np.zeros((len(classes), 2))

    for k in range(len(classes)):   # loop over every class
        for i in range(2):          # loop over both features
            dyk = [delta(y[r], k) for r in range(n_sample)]
            xin = x[:, i]
            # Maximum-likelihood estimates of the mean and variance
            mu_ki = np.dot(xin, dyk) / np.sum(dyk)
            sigma_squared_ki = np.dot((xin - mu_ki)**2, dyk) / np.sum(dyk)
            mu[k][i] = mu_ki
            sigma[k][i] = np.sqrt(sigma_squared_ki)

    # Batch of three independent bivariate Gaussians (diagonal covariance)
    dist = tfd.MultivariateNormalDiag(loc=tf.cast(mu, tf.float32),
                                      scale_diag=tf.cast(sigma, tf.float32))
    return dist

class_conditionals = get_class_conditionals(x_train, y_train)
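Since tfd.MultivariateNormalDiag keeps its parameters accessible, the fitted estimates can be inspected directly (the same accessors are used again in section 4):

print(class_conditionals.loc.numpy())       # (3, 2) array: per-class mean vectors
print(class_conditionals.stddev().numpy())  # (3, 2) array: per-class standard deviations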

The class-conditional densities can be visualized with a contour plot. The contours of each distribution correspond to a Gaussian with a diagonal covariance matrix, since the model assumes the features are independent within each class.

def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()

3.3 Prediction

With the prior and the class-conditional distributions in hand, we can predict the class of an unseen test input by computing the posterior

$$P(Y = y_k \mid x) = \frac{P(Y = y_k) \prod_{i} P(X_i = x_i \mid Y = y_k)}{\sum_{k'} P(Y = y_{k'}) \prod_{i} P(X_i = x_i \mid Y = y_{k'})}$$

and using argmax to take the class with the highest posterior probability as the prediction.

def predict_class(prior, class_conditionals, x):
    def predict_fn(myx):
        # Likelihood of the sample under each class-conditional Gaussian
        class_probs = class_conditionals.prob(tf.cast(myx, dtype=tf.float32))
        prior_probs = tf.cast(prior.probs, dtype=tf.float32)
        # Unnormalized posterior: P(X | Y) * P(Y)
        class_times_prior_probs = class_probs * prior_probs
        Q = tf.reduce_sum(class_times_prior_probs)      # normalizing constant
        P = tf.math.divide(class_times_prior_probs, Q)  # posterior P(Y | X)
        Y = tf.cast(tf.argmax(P), dtype=tf.float32)     # most probable class
        return Y
    x = tf.cast(x, tf.float32)
    y = tf.cast(tf.map_fn(predict_fn, x), tf.int32)
    return np.array(y)

predictions = predict_class(prior, class_conditionals, x_test)
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))

# ==> Test accuracy: 0.9000
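As a sanity check (a sketch, not part of the original pipeline), sklearn's GaussianNB, fitted on the same two features, should reach a comparable test accuracy, since it implements the same Gaussian naive Bayes model:

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(x_train, y_train)
print("sklearn GaussianNB test accuracy: {:.4f}".format(gnb.score(x_test, y_test)))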

Plot the model's decision regions:

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), 
             lambda x: predict_class(prior, class_conditionals, x), 
             1, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
             num_points=500)
plt.title("Training set with decision regions")
plt.show()

4. TFP trainable Gaussian distributions

Within each class, the independent features are assumed to follow Gaussian distributions. Besides computing their means and standard deviations with the closed-form statistics above, we can also learn these parameters from the data with TFP.

We use tfd.MultivariateNormalDiag() to create a trainable Gaussian distribution for the dataset's features. The distribution's parameters are estimated by minimizing the negative log-likelihood, which is equivalent to maximizing the log-likelihood.
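The key point is that passing tf.Variable objects as the distribution's parameters makes them appear in dist.trainable_variables, so gradients can flow through log_prob and into the variables. A minimal sketch (toy values, not the iris setup):

# Minimal sketch: a trainable Gaussian (toy values, not the iris model).
# Because loc and scale_diag are tf.Variables, the distribution tracks them
# and exposes them via trainable_variables.
loc = tf.Variable([0., 0.])
scale = tf.Variable([1., 1.])
toy_dist = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale)
print(toy_dist.trainable_variables)  # ==> the loc and scale variables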

4.1 Building a custom training loop

def learn_parameters(x, y, mus, scales, optimiser, epochs):
    @tf.function
    def nll(dist, x_train, y_train):
        # Per-sample log-likelihoods under each class's Gaussian, masked by
        # the one-hot labels so each sample only contributes its own class's term
        log_probs = dist.log_prob(x_train)
        # len() is not defined for symbolic tensors inside tf.function,
        # so get the number of classes via tf.shape instead
        L = tf.shape(tf.unique(y_train)[0])[0]
        y_train = tf.one_hot(indices=y_train, depth=L)
        return -tf.reduce_mean(log_probs * y_train)

    @tf.function
    def get_loss_and_grads(dist, x_train, y_train):
        with tf.GradientTape() as tape:
            tape.watch(dist.trainable_variables)
            loss = nll(dist, x_train, y_train)
        grads = tape.gradient(loss, dist.trainable_variables)
        return loss, grads

    nll_loss = []
    mu_values = []
    scales_values = []
    # Add a middle axis so x broadcasts against the distribution's batch of classes
    x = tf.cast(np.expand_dims(x, axis=1), tf.float32)
    dist = tfd.MultivariateNormalDiag(loc=mus, scale_diag=scales)
    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(dist, x, y)
        optimiser.apply_gradients(zip(grads, dist.trainable_variables))
        nll_loss.append(loss)
        mu_values.append(mus.numpy())
        scales_values.append(scales.numpy())
    nll_loss, mu_values, scales_values = \
        np.array(nll_loss), np.array(mu_values), np.array(scales_values)
    return (nll_loss, mu_values, scales_values, dist)

4.2 Training the model

# Assign initial values for the model's parameters
mus = tf.Variable([[1., 1.], [1., 1.], [1., 1.]])
scales = tf.Variable([[1., 1.], [1., 1.], [1., 1.]])
opt = tf.keras.optimizers.Adam(learning_rate=0.005)
epochs = 10000
nlls, mu_arr, scales_arr, class_conditionals = \
    learn_parameters(x_train, y_train, mus, scales, opt, epochs)

After training, plot the loss against the epoch number to make sure the optimization converged. For the same reason, we also plot the values of the model's parameters over training.

# Plot the loss and convergence of the standard deviation parameters
fig, ax = plt.subplots(1, 3, figsize=(15, 4))
ax[0].plot(nlls)
ax[0].set_title("Loss vs. epoch")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Negative log-likelihood")
for k in [0, 1, 2]:
    ax[1].plot(mu_arr[:, k, 0])
    ax[1].plot(mu_arr[:, k, 1])
ax[1].set_title("ML estimates for model's\nmeans vs. epoch")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Means")
for k in [0, 1, 2]:
    ax[2].plot(scales_arr[:, k, 0])
    ax[2].plot(scales_arr[:, k, 1])
ax[2].set_title("ML estimates for model's\nscales vs. epoch")
ax[2].set_xlabel("Epoch")
ax[2].set_ylabel("Scales")
plt.show()

Indeed, the training loop converges, and even though all of the model's parameters start from identical initial values, each settles at its own optimal estimate. The following code plots the three bivariate Gaussian distributions, one per class.

x0_range = x_train[:, 0].min(), x_train[:, 0].max()
x1_range = x_train[:, 1].min(), x_train[:, 1].max()
X0, X1 = get_meshgrid(x0_range, x1_range, num_points=300)  # reuse the helper from section 3.2
X_v = np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, axis=1)
Z = class_conditionals.prob(X_v)
Z = np.array(Z).T.reshape(3, *X0.shape)

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') is deprecated in newer matplotlib
for k in [0,1,2]:
    ax.plot_surface(X0, X1, Z[k,:,:], alpha=.8)
ax.set_xlabel('Sepal length (cm)')
ax.set_ylabel('Sepal width (cm)')
ax.set_zlabel('Probability');
ax.elev = 35
ax.azim = 45
print("Class conditional means:")
print(class_conditionals.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals.stddev().numpy())
# Class conditional means:
# [[5.002439  3.42439  ]
# [5.89      2.78     ]
# [6.5179486 2.9307692]]

# Class conditional standard deviations:
# [[0.34675348 0.37855995]
# [0.5309426  0.31796226]
# [0.5808357  0.28025055]]
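As a consistency check (a sketch; note that the name class_conditionals now refers to the trained distribution, so the closed-form version from section 3.2 is recomputed here), the gradient-based estimates should closely match the closed-form ML estimates:

# Compare gradient-trained parameters with the closed-form ML estimates
cc_closed_form = get_class_conditionals(x_train, y_train)
print(np.max(np.abs(cc_closed_form.loc.numpy() - class_conditionals.loc.numpy())))
print(np.max(np.abs(cc_closed_form.stddev().numpy() - class_conditionals.stddev().numpy())))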

5. Other notes

Classic machine learning frameworks such as TF output point estimates, which is problematic because uncertainty is everywhere in the real world. It comes in two flavors: aleatoric uncertainty, inherent in the data itself (e.g., measurement error or noise in the dataset labels), and epistemic uncertainty, the model's uncertainty about what it has learned. Models need to express uncertainty; deep learning models need "to know what they don't know." Probabilistic programming, or probabilistic modeling, is one answer, and for probabilistic programming the TensorFlow Probability (TFP) library is a solid choice.

Appendix & References

Probabilistic Deep Learning with TensorFlow 2, by Imperial College London

TensorFlow Probability

Original statement: This article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, contact cloudcommunity@tencent.com for removal.

