Modern machine learning methods, epitomized by deep learning, have achieved remarkable success in prediction and classification accuracy. A typical learning workflow first splits the available data into training, validation, and test sets: the model is fitted on the training set, hyperparameters are tuned on the validation set, and final performance is evaluated on the test set. But how will a model that performs well on the test set behave in a real application? If model A beats model B on the test set, does that mean A will work better in practice? In reality, many factors determine how well a model performs after deployment: when the distribution of real-world inputs differs from the training distribution, the model's predictions become unreliable, and the resulting error cannot even be evaluated.
Figure: examples of the 16 corruption types applied to ImageNet-C images.
Take the figure above as an example. Suppose we have a neural network that classifies cats versus dogs; the same dog photo can undergo any of the transformations shown. Taking Gaussian noise as an example: the larger the random-noise component in a test sample, the lower the model's classification accuracy on it. Returning to real industrial settings, take autonomous driving: factors that introduce sample uncertainty include time (day vs. night, the four seasons), weather (sun, clouds, rain, snow), and geography (urban vs. rural, mountains vs. plains). If the environment produces samples the model is uncertain about, the consequences of a wrong prediction can be severe.
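The noise-severity effect can be reproduced with a toy experiment. The sketch below is entirely synthetic (a made-up two-class problem and a trivial mean-pixel classifier, not any real cat/dog model): accuracy is perfect without corruption and degrades as the Gaussian noise scale grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images" of 4 pixels: class 0 has mean pixel +1, class 1 has mean -1.
n, d = 2000, 4
x = np.concatenate([rng.normal(+1.0, 0.3, (n, d)),
                    rng.normal(-1.0, 0.3, (n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def accuracy(x, y, sigma):
    """Classify by mean pixel sign after adding Gaussian noise of scale sigma."""
    noisy = x + rng.normal(0.0, sigma, x.shape)
    pred = (noisy.mean(axis=1) < 0).astype(float)
    return (pred == y).mean()

# Accuracy at increasing corruption severity.
accs = [accuracy(x, y, s) for s in (0.0, 2.0, 10.0)]
```

The same pattern holds for any fixed classifier: as the corruption overwhelms the signal, accuracy falls toward chance.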
Much prior work has tackled this problem. The keywords vary from paper to paper, but the core idea is similar: improve model robustness by characterizing a whole predictive distribution rather than a single predicted probability. "Baseline estimation" is the algorithmic framework I abstracted from my project work at Ant Group. It originally targeted baseline interval estimation for capacity scenarios in the operations domain, and as a deployment scenario it is fairly narrow; the concept itself, however, is not. It goes by many names in the literature, such as OOD detection and novelty detection, and these terms all describe essentially the same thing: model the distribution of normal data and flag anomalies via a recommended distribution threshold. So, for now, I will call it baseline estimation.
Before the survey begins, a remark based on my work experience at Baidu and Ant Group: characterizing an object's baseline distribution has broader uses than direct anomaly detection. In Ant's capacity work, for example, multi-metric baseline interval estimation for a risk object is applicable well beyond metric anomaly detection, which is only one of the scenarios it can serve. Baseline estimation also has many potential industrial applications: IT security, medical diagnosis, industrial monitoring and anomaly detection, image recognition, video surveillance, text mining, sensor networks, and more. Across these scenarios, the overall algorithmic architecture must adapt to different inputs (multivariate time-series monitoring data in industrial settings, text in text mining, images in medical diagnosis, etc.), provide methods matched to each input type, and support different outputs per scenario. As 致未来 [1] argues about the importance of methodological depth and breadth, building both the depth and the breadth of baseline-estimation algorithms is essential. This article gives a short summary of the related subfields and techniques; follow-up posts will examine each technique and scenario in more detail.
The related work mentioned in the preface is represented mainly by the following four directions: OOD detection, novelty detection, predictive-uncertainty estimation, and prediction with abstention.
Hendrycks et al. [2] frame in-distribution vs. out-of-distribution detection as: can we predict whether a test example is from a different distribution from the training data; can we predict if it is from within the same distribution
Lee et al. [7] describe OOD detection as follows: For detecting out-of-distribution (OOD) samples, recent works have utilized the confidence from the posterior distribution, such as the maximum value of posterior distribution from the classifier as a baseline method
Winkens et al. [8] define OOD detection as: Out-of-distribution detection can be performed by approximating a probability density p(x) of training inputs x, and detecting test-time OOD inputs using a threshold γ: if p(x) < γ then x is considered OOD.
Pimentel et al. [5] give a fairly complete review of novelty detection and distill its definition well: In the novelty detection approach to classification, "normal" patterns X are available for training, while "abnormal" ones are relatively few. A model of normality M(θ), where θ represents the free parameters of the model, is inferred and used to assign novelty scores z(x) to previously unseen test data x. Larger novelty scores z(x) correspond to increased "abnormality" with respect to the model of normality. A novelty threshold z(x) = k is defined such that x is classified "normal" if z(x) <= k, or "abnormal" otherwise.
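The density threshold of [8] and the novelty score of [5] are mirror images of each other: taking z(x) = -log p(x) turns "p(x) < γ" into "z(x) > k". A minimal 1-D sketch (the Gaussian density fit and the 1st-percentile threshold are my illustrative choices, not prescribed by either paper):

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(5.0, 1.0, 5000)      # "normal" training inputs x

# Approximate p(x) with a fitted Gaussian (a deliberately simple density model).
mu, sigma = train.mean(), train.std()
def p(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def z(x):
    return -np.log(p(x))                # novelty score: larger = more abnormal

# Thresholds: gamma on the density, or equivalently k = -log(gamma) on the score.
gamma = np.percentile(p(train), 1)
k = -np.log(gamma)

def is_ood(x):
    return p(x) < gamma                 # identical to the test z(x) > k
```

Any better density model (mixture, KDE, normalizing flow) slots into `p` without changing the thresholding logic.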
Lakshminarayanan et al. [9] assess predictive uncertainty in two ways: 1) examine calibration, a frequentist notion of uncertainty which measures the discrepancy between subjective forecasts and (empirical) long-run frequencies; 2) concerns generalization of the predictive uncertainty to domain shift.
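The calibration notion in point 1) is commonly quantified by the expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its empirical accuracy. The binning scheme below is a standard convention, not something taken from [9]:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: frequency-weighted gap between confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap
    return total

# A calibrated forecaster (90% confidence, 90% accuracy) has ECE ~ 0;
# an overconfident one (99% confidence, 50% accuracy) does not.
calibrated = ece(np.full(100, 0.9), np.r_[np.ones(90), np.zeros(10)])
overconfident = ece(np.full(100, 0.99), np.r_[np.ones(50), np.zeros(50)])
```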
Malinin et al. [10] describe predictive uncertainty at three levels: 1) Model uncertainty, or epistemic uncertainty, measures the uncertainty in estimating the model parameters given the training data - this measures how well the model is matched to the data; 2) Data uncertainty, or aleatoric uncertainty, is irreducible uncertainty which arises from the natural complexity of the data; 3) Distributional uncertainty arises due to mismatch between the training and test distributions - a situation which often arises for real world problems.
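This decomposition can be illustrated with an ensemble (a common proxy, not the prior-network method of [10] itself): the entropy of the averaged prediction is the total uncertainty, the average entropy of the members is the data (aleatoric) part, and their difference, the mutual information, isolates the model/distributional part. The ensemble probabilities below are made up:

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def decompose(member_probs):
    """member_probs: (n_members, n_classes) predictive distributions."""
    member_probs = np.asarray(member_probs)
    total = entropy(member_probs.mean(axis=0))   # entropy of the mean prediction
    aleatoric = entropy(member_probs).mean()     # mean entropy of the members
    epistemic = total - aleatoric                # mutual information
    return total, aleatoric, epistemic

# Members agree and are confident: everything is low.
agree = decompose([[0.99, 0.01], [0.98, 0.02]])
# Members confidently disagree: high epistemic uncertainty despite confident members.
disagree = decompose([[0.99, 0.01], [0.01, 0.99]])
```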
Neu et al. [4] add abstention to model prediction: instead of forcing the learner to output a prediction in {0, 1}, we allow the learner to abstain from prediction. This setup was intensively studied in statistical learning where the learner can output one of three values {0, 1, ∗} and the value ∗ corresponds to an abstention.
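The {0, 1, ∗} setup translates into a simple selective classifier: output a label only when confidence clears a threshold, otherwise abstain. The threshold 0.8 below is an arbitrary illustration, not a value from [4]:

```python
def predict_or_abstain(p1, threshold=0.8):
    """p1: predicted probability of class 1. Returns 0, 1, or '*' (abstain)."""
    if p1 >= threshold:
        return 1
    if p1 <= 1.0 - threshold:
        return 0
    return '*'   # too uncertain: abstain instead of guessing
```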
Techniques for estimating the distribution of normal data fall mainly into the following families.
Probability-based methods estimate the probability density function that generated the data and choose a threshold on the resulting distribution to define the baseline boundary. Common algorithms include:
Statistical hypothesis testing
Parametric estimation
Non-parametric estimation
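The parametric/non-parametric split comes down to whether a distributional form is assumed. A 1-D sketch of both (the 3σ rule and the percentile band are common conventions, and the capacity-metric framing is just an illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(100.0, 5.0, 10000)   # e.g. a capacity metric's normal history

# Parametric: assume a Gaussian, baseline interval = mean +/- 3 sigma.
mu, sigma = train.mean(), train.std()
param_low, param_high = mu - 3 * sigma, mu + 3 * sigma

# Non-parametric: no distributional assumption, use empirical percentiles.
nonparam_low, nonparam_high = np.percentile(train, [0.5, 99.5])

def in_baseline(x, low, high):
    return low <= x <= high
```

When the Gaussian assumption holds the two intervals agree closely; on skewed or multimodal data only the non-parametric band stays honest.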
Distance-based detection methods, as the name suggests, derive a boundary for normal data by measuring distances between data points. Common algorithms include nearest-neighbour [24] and clustering-based [26][27] approaches.
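As a minimal distance-based example, score each point by its mean distance to its k nearest training neighbours and threshold at a high quantile. This is a drastically simplified variant in the spirit of the kNN-graph approach [24], not its actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
train = rng.normal(0.0, 1.0, (500, 2))   # normal data: one Gaussian cluster

def knn_score(x, train, k=5):
    """Mean Euclidean distance from x to its k nearest training points."""
    d = np.linalg.norm(train - x, axis=1)
    return np.sort(d)[:k].mean()

# Boundary: the 99th percentile of the training points' own kNN scores.
scores = np.array([knn_score(p, train) for p in train])
threshold = np.percentile(scores, 99)

def is_outlier(x):
    return knn_score(x, train) > threshold
```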
Domain-based methods aim to fit an explicit boundary from the structure of the data rather than modeling class densities; where unseen data belong is decided by their position relative to each boundary. Common algorithms include one-class SVMs [29] and support vector data description [28].
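As a toy stand-in for SVDD-style boundaries (a crude simplification, not the actual optimization of [28] or [29]), the sketch below wraps each cluster of normal data in a sphere whose center is the cluster mean and whose radius is a distance quantile; a point is normal if it falls inside any sphere, echoing the multi-sphere idea of [28]:

```python
import numpy as np

rng = np.random.default_rng(4)
# Normal data drawn from two clusters (multi-distribution data, as in [28]).
a = rng.normal([0.0, 0.0], 0.5, (300, 2))
b = rng.normal([10.0, 10.0], 0.5, (300, 2))

def fit_spheres(clusters, q=99):
    """One (center, radius) sphere per cluster; radius = q-th distance percentile."""
    spheres = []
    for c in clusters:
        center = c.mean(axis=0)
        radius = np.percentile(np.linalg.norm(c - center, axis=1), q)
        spheres.append((center, radius))
    return spheres

spheres = fit_spheres([a, b])

def is_normal(x):
    # Normal if the point lies inside at least one boundary sphere.
    return any(np.linalg.norm(x - c) <= r for c, r in spheres)
```

Note how a point midway between the clusters is rejected, which a single global boundary around all the data would miss.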
When the data are high-dimensional, reconstruction-based methods are typically used: the raw data are reduced in dimension or otherwise reconstructed, and the reconstruction error is used to estimate the normal range. Common algorithms include PCA-based classifiers [30] and (variational) autoencoder approaches [33][34][35][36].
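A minimal reconstruction-based sketch using PCA, in the spirit of [30] (autoencoder variants [33]–[36] follow the same project-reconstruct-threshold pattern with a learned encoder). The synthetic subspace data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
# Normal data lies near a 1-D line inside 5-D space, plus small noise.
t = rng.normal(0.0, 1.0, (1000, 1))
direction = np.array([[1.0, 1.0, 1.0, 1.0, 1.0]]) / np.sqrt(5)
train = t @ direction + rng.normal(0.0, 0.05, (1000, 5))

# Fit PCA via SVD and keep the top principal component.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:1]                      # (1, 5) kept principal directions

def recon_error(x):
    """Project onto the kept components, reconstruct, and measure the error."""
    z = (x - mean) @ components.T
    x_hat = mean + z @ components
    return np.linalg.norm(x - x_hat)

# Baseline: the 99th percentile of reconstruction errors on normal data.
threshold = np.percentile([recon_error(p) for p in train], 99)

def is_anomaly(x):
    return recon_error(x) > threshold
```

Points on the learned subspace reconstruct almost perfectly; points off it incur large error regardless of their magnitude.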
The above covers the detection techniques with existing prior work; this article only describes the algorithms themselves briefly. Follow-up posts will describe each paper's scenario, method, and practical results in more detail.
[1] 致未来: https://zhuanlan.zhihu.com/p/338999381
[2] A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. https://arxiv.org/abs/1610.02136
[3] Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. https://arxiv.org/abs/1906.02530
[4] Fast Rates for Online Prediction with Abstention. https://arxiv.org/abs/2001.10623?context=cs.LG
[5] A review of novelty detection.
[6] Analyzing the Role of Model Uncertainty for Electronic Health Records. https://arxiv.org/abs/1906.03842
[7] A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. https://arxiv.org/abs/1807.03888
[8] Contrastive Training for Improved Out-of-Distribution Detection. https://arxiv.org/abs/2007.05566v1
[9] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. https://arxiv.org/abs/1612.01474
[10] Predictive Uncertainty Estimation via Prior Networks. https://arxiv.org/abs/1802.10501
[11] Procedures for detecting outlying observations in samples
[12] Detection of outliers in reference distributions: performance of Horn's algorithm
[13] Novelty detection with multivariate extreme value statistics
[14] https://en.wikipedia.org/wiki/Gumbel_distribution
[15] https://en.wikipedia.org/wiki/Fr%C3%A9chet_distribution
[16] https://en.wikipedia.org/wiki/Weibull_distribution
[17] Concepts for novelty detection and handling based on a case-based reasoning process scheme
[18] A new anomaly detection method based on hierarchical HMM
[19] Factorial switching Kalman filters for condition monitoring in neonatal intensive care
[20] Outlier detection in wireless sensor networks using Bayesian belief networks
[21] Multivariate density estimation with optimal marginal parzen density estimation and gaussianization
[22] State-of-the-art in Bayesian changepoint detection
[23] Self-nonself discrimination in a computer
[24] Outlier detection using k-nearest neighbour graph
[25] Hot: hypergraph-based outlier test for categorical data
[26] Applying the possibilistic c-means algorithm in kernel-induced spaces
[27] Findout: finding outliers in very large datasets
[28] Multi-sphere support vector data description for outliers detection on multi-distribution data
[29] Support vector method for novelty detection
[30] A Novel Anomaly Detection Scheme Based on Principal Component Classifier
[31] Enhancing the reliability of out-of-distribution image detection in neural networks
[32] Contrastive Training for Improved Out-of-Distribution Detection
[33] Variational Autoencoder based Anomaly Detection using Reconstruction Probability
[34] Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications
[35] Layer-constrained variational autoencoding kernel density estimation model for anomaly detection
[36] Variational Autoencoder With Optimizing Gaussian Mixture Model Priors