Modern machine learning methods, epitomized by deep learning, have achieved remarkable success in prediction and classification accuracy. A typical learning workflow first splits the available data into training, validation, and test sets: the model is fitted on the training set, hyperparameters are tuned on the validation set, and final performance is evaluated on the test set. But how will a model that performs well on the test set behave in a real application? If model A beats model B on the test set, does that mean A will work better in practice? In reality, many factors determine how well a model performs after deployment: when the distribution of real-world inputs differs from the training distribution, the model's predictions become unreliable, and the resulting error cannot even be evaluated.
Figure: examples of the 16 corruption types applied to ImageNet-C images.
Take the figure above as an example. Suppose we have a neural network that classifies cats versus dogs; the same dog photo can undergo any of the transformations shown. Taking Gaussian noise as an example: the larger the random-noise component in a test sample, the lower the model's classification accuracy on it. Returning to real industrial settings, take autonomous driving: factors that introduce sample uncertainty include time (day vs. night, the four seasons), weather (sun, clouds, rain, snow), and geography (urban vs. rural, mountains vs. plains). If the environment produces samples the model is uncertain about, the consequences of a wrong prediction can be severe.
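The noise-severity effect can be reproduced with a toy experiment. The sketch below is entirely synthetic (a made-up two-class problem and a trivial mean-pixel classifier, not any real cat/dog model): accuracy is perfect without corruption and degrades as the Gaussian noise scale grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images" of 4 pixels: class 0 has mean pixel +1, class 1 has mean -1.
n, d = 2000, 4
x = np.concatenate([rng.normal(+1.0, 0.3, (n, d)),
                    rng.normal(-1.0, 0.3, (n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def accuracy(x, y, sigma):
    """Classify by mean pixel sign after adding Gaussian noise of scale sigma."""
    noisy = x + rng.normal(0.0, sigma, x.shape)
    pred = (noisy.mean(axis=1) < 0).astype(float)
    return (pred == y).mean()

# Accuracy at increasing corruption severity.
accs = [accuracy(x, y, s) for s in (0.0, 2.0, 10.0)]
```

The same pattern holds for any fixed classifier: as the corruption overwhelms the signal, accuracy falls toward chance.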
Much prior work has tackled this problem. The keywords vary from paper to paper, but the core idea is similar: improve model robustness by characterizing a whole predictive distribution rather than a single predicted probability. "Baseline estimation" is the algorithmic framework I abstracted from my project work at Ant Group. It originally targeted baseline interval estimation for capacity scenarios in the operations domain, and as a deployment scenario it is fairly narrow; the concept itself, however, is not. It goes by many names in the literature, such as OOD detection and novelty detection, and these terms all describe essentially the same thing: model the distribution of normal data and flag anomalies via a recommended distribution threshold. So, for now, I will call it baseline estimation.
Before the survey begins, a remark based on my work experience at Baidu and Ant Group: characterizing an object's baseline distribution has broader uses than direct anomaly detection. In Ant's capacity work, for example, multi-metric baseline interval estimation for a risk object is applicable well beyond metric anomaly detection, which is only one of the scenarios it can serve. Baseline estimation also has many potential industrial applications: IT security, medical diagnosis, industrial monitoring and anomaly detection, image recognition, video surveillance, text mining, sensor networks, and more. Across these scenarios, the overall algorithmic architecture must adapt to different inputs (multivariate time-series monitoring data in industrial settings, text in text mining, images in medical diagnosis, etc.), provide methods matched to each input type, and support different outputs per scenario. As 致未来 [1] argues about the importance of methodological depth and breadth, building both the depth and the breadth of baseline-estimation algorithms is essential. This article gives a short summary of the related subfields and techniques; follow-up posts will examine each technique and scenario in more detail.
The related work mentioned in the preface is represented mainly by the following four directions: OOD detection, novelty detection, predictive-uncertainty estimation, and prediction with abstention.
Hendrycks et al. [2] frame in-distribution vs. out-of-distribution detection as: can we predict whether a test example is from a different distribution from the training data; can we predict if it is from within the same distribution
Lee et al. [7] describe OOD detection as follows: For detecting out-of-distribution (OOD) samples, recent works have utilized the confidence from the posterior distribution, such as the maximum value of posterior distribution from the classifier as a baseline method
Winkens et al. [8] define OOD detection as: Out-of-distribution detection can be performed by approximating a probability density p(x) of training inputs x, and detecting test-time OOD inputs using a threshold γ: if p(x) < γ then x is considered OOD.
Pimentel et al. [5] give a fairly complete review of novelty detection and distill its definition well: In the novelty detection approach to classification, "normal" patterns X are available for training, while "abnormal" ones are relatively few. A model of normality M(θ), where θ represents the free parameters of the model, is inferred and used to assign novelty scores z(x) to previously unseen test data x. Larger novelty scores z(x) correspond to increased "abnormality" with respect to the model of normality. A novelty threshold z(x) = k is defined such that x is classified "normal" if z(x) <= k, or "abnormal" otherwise.
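The density threshold of [8] and the novelty score of [5] are mirror images of each other: taking z(x) = -log p(x) turns "p(x) < γ" into "z(x) > k". A minimal 1-D sketch (the Gaussian density fit and the 1st-percentile threshold are my illustrative choices, not prescribed by either paper):

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(5.0, 1.0, 5000)      # "normal" training inputs x

# Approximate p(x) with a fitted Gaussian (a deliberately simple density model).
mu, sigma = train.mean(), train.std()
def p(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def z(x):
    return -np.log(p(x))                # novelty score: larger = more abnormal

# Thresholds: gamma on the density, or equivalently k = -log(gamma) on the score.
gamma = np.percentile(p(train), 1)
k = -np.log(gamma)

def is_ood(x):
    return p(x) < gamma                 # identical to the test z(x) > k
```

Any better density model (mixture, KDE, normalizing flow) slots into `p` without changing the thresholding logic.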
Lakshminarayanan et al. [9] assess predictive uncertainty in two ways: 1) examine calibration, a frequentist notion of uncertainty which measures the discrepancy between subjective forecasts and (empirical) long-run frequencies; 2) concerns generalization of the predictive uncertainty to domain shift.
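The calibration notion in point 1) is commonly quantified by the expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its empirical accuracy. The binning scheme below is a standard convention, not something taken from [9]:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: frequency-weighted gap between confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap
    return total

# A calibrated forecaster (90% confidence, 90% accuracy) has ECE ~ 0;
# an overconfident one (99% confidence, 50% accuracy) does not.
calibrated = ece(np.full(100, 0.9), np.r_[np.ones(90), np.zeros(10)])
overconfident = ece(np.full(100, 0.99), np.r_[np.ones(50), np.zeros(50)])
```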
Malinin et al. [10] describe predictive uncertainty at three levels: 1) Model uncertainty, or epistemic uncertainty, measures the uncertainty in estimating the model parameters given the training data - this measures how well the model is matched to the data; 2) Data uncertainty, or aleatoric uncertainty, is irreducible uncertainty which arises from the natural complexity of the data; 3) Distributional uncertainty arises due to mismatch between the training and test distributions - a situation which often arises for real world problems.
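This decomposition can be illustrated with an ensemble (a common proxy, not the prior-network method of [10] itself): the entropy of the averaged prediction is the total uncertainty, the average entropy of the members is the data (aleatoric) part, and their difference, the mutual information, isolates the model/distributional part. The ensemble probabilities below are made up:

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def decompose(member_probs):
    """member_probs: (n_members, n_classes) predictive distributions."""
    member_probs = np.asarray(member_probs)
    total = entropy(member_probs.mean(axis=0))   # entropy of the mean prediction
    aleatoric = entropy(member_probs).mean()     # mean entropy of the members
    epistemic = total - aleatoric                # mutual information
    return total, aleatoric, epistemic

# Members agree and are confident: everything is low.
agree = decompose([[0.99, 0.01], [0.98, 0.02]])
# Members confidently disagree: high epistemic uncertainty despite confident members.
disagree = decompose([[0.99, 0.01], [0.01, 0.99]])
```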
Neu et al. [4] add abstention to model prediction: instead of forcing the learner to output a prediction in {0, 1}, we allow the learner to abstain from prediction. This setup was intensively studied in statistical learning where the learner can output one of three values {0, 1, ∗} and the value ∗ corresponds to an abstention.
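The {0, 1, ∗} setup translates into a simple selective classifier: output a label only when confidence clears a threshold, otherwise abstain. The threshold 0.8 below is an arbitrary illustration, not a value from [4]:

```python
def predict_or_abstain(p1, threshold=0.8):
    """p1: predicted probability of class 1. Returns 0, 1, or '*' (abstain)."""
    if p1 >= threshold:
        return 1
    if p1 <= 1.0 - threshold:
        return 0
    return '*'   # too uncertain: abstain instead of guessing
```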
Techniques for estimating the distribution of normal data fall mainly into the following families.
Probability-based methods estimate the probability density function that generated the data and choose a threshold on the resulting distribution to define the baseline boundary. Common algorithms include:
Statistical hypothesis testing
Parametric estimation
Non-parametric estimation
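The parametric/non-parametric split comes down to whether a distributional form is assumed. A 1-D sketch of both (the 3σ rule and the percentile band are common conventions, and the capacity-metric framing is just an illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(100.0, 5.0, 10000)   # e.g. a capacity metric's normal history

# Parametric: assume a Gaussian, baseline interval = mean +/- 3 sigma.
mu, sigma = train.mean(), train.std()
param_low, param_high = mu - 3 * sigma, mu + 3 * sigma

# Non-parametric: no distributional assumption, use empirical percentiles.
nonparam_low, nonparam_high = np.percentile(train, [0.5, 99.5])

def in_baseline(x, low, high):
    return low <= x <= high
```

When the Gaussian assumption holds the two intervals agree closely; on skewed or multimodal data only the non-parametric band stays honest.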
Distance-based detection methods, as the name suggests, derive a boundary for normal data by measuring distances between data points. Common algorithms include nearest-neighbour [24] and clustering-based [26][27] approaches.
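As a minimal distance-based example, score each point by its mean distance to its k nearest training neighbours and threshold at a high quantile. This is a drastically simplified variant in the spirit of the kNN-graph approach [24], not its actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
train = rng.normal(0.0, 1.0, (500, 2))   # normal data: one Gaussian cluster

def knn_score(x, train, k=5):
    """Mean Euclidean distance from x to its k nearest training points."""
    d = np.linalg.norm(train - x, axis=1)
    return np.sort(d)[:k].mean()

# Boundary: the 99th percentile of the training points' own kNN scores.
scores = np.array([knn_score(p, train) for p in train])
threshold = np.percentile(scores, 99)

def is_outlier(x):
    return knn_score(x, train) > threshold
```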
Domain-based methods aim to fit an explicit boundary from the structure of the data rather than modeling class densities; where unseen data belong is decided by their position relative to each boundary. Common algorithms include one-class SVMs [29] and support vector data description [28].
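As a toy stand-in for SVDD-style boundaries (a crude simplification, not the actual optimization of [28] or [29]), the sketch below wraps each cluster of normal data in a sphere whose center is the cluster mean and whose radius is a distance quantile; a point is normal if it falls inside any sphere, echoing the multi-sphere idea of [28]:

```python
import numpy as np

rng = np.random.default_rng(4)
# Normal data drawn from two clusters (multi-distribution data, as in [28]).
a = rng.normal([0.0, 0.0], 0.5, (300, 2))
b = rng.normal([10.0, 10.0], 0.5, (300, 2))

def fit_spheres(clusters, q=99):
    """One (center, radius) sphere per cluster; radius = q-th distance percentile."""
    spheres = []
    for c in clusters:
        center = c.mean(axis=0)
        radius = np.percentile(np.linalg.norm(c - center, axis=1), q)
        spheres.append((center, radius))
    return spheres

spheres = fit_spheres([a, b])

def is_normal(x):
    # Normal if the point lies inside at least one boundary sphere.
    return any(np.linalg.norm(x - c) <= r for c, r in spheres)
```

Note how a point midway between the clusters is rejected, which a single global boundary around all the data would miss.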
When the data are high-dimensional, reconstruction-based methods are typically used: the raw data are reduced in dimension or otherwise reconstructed, and the reconstruction error is used to estimate the normal range. Common algorithms include PCA-based classifiers [30] and (variational) autoencoder approaches [33][34][35][36].
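A minimal reconstruction-based sketch using PCA, in the spirit of [30] (autoencoder variants [33]–[36] follow the same project-reconstruct-threshold pattern with a learned encoder). The synthetic subspace data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
# Normal data lies near a 1-D line inside 5-D space, plus small noise.
t = rng.normal(0.0, 1.0, (1000, 1))
direction = np.array([[1.0, 1.0, 1.0, 1.0, 1.0]]) / np.sqrt(5)
train = t @ direction + rng.normal(0.0, 0.05, (1000, 5))

# Fit PCA via SVD and keep the top principal component.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:1]                      # (1, 5) kept principal directions

def recon_error(x):
    """Project onto the kept components, reconstruct, and measure the error."""
    z = (x - mean) @ components.T
    x_hat = mean + z @ components
    return np.linalg.norm(x - x_hat)

# Baseline: the 99th percentile of reconstruction errors on normal data.
threshold = np.percentile([recon_error(p) for p in train], 99)

def is_anomaly(x):
    return recon_error(x) > threshold
```

Points on the learned subspace reconstruct almost perfectly; points off it incur large error regardless of their magnitude.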
The above covers the detection techniques with existing prior work; this article only describes the algorithms themselves briefly. Follow-up posts will describe each paper's scenario, method, and practical results in more detail.
[1] 致未来: https://zhuanlan.zhihu.com/p/338999381
[2] A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. https://arxiv.org/abs/1610.02136
[3] Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. https://arxiv.org/abs/1906.02530
[4] Fast Rates for Online Prediction with Abstention. https://arxiv.org/abs/2001.10623?context=cs.LG
[5] A review of novelty detection.
[6] Analyzing the Role of Model Uncertainty for Electronic Health Records. https://arxiv.org/abs/1906.03842
[7] A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. https://arxiv.org/abs/1807.03888
[8] Contrastive Training for Improved Out-of-Distribution Detection. https://arxiv.org/abs/2007.05566v1
[9] Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. https://arxiv.org/abs/1612.01474
[10] Predictive Uncertainty Estimation via Prior Networks. https://arxiv.org/abs/1802.10501
[11] Procedures for detecting outlying observations in samples
[12] Detection of outliers in reference distributions: performance of Horn's algorithm
[13] Novelty detection with multivariate extreme value statistics
[14] https://en.wikipedia.org/wiki/Gumbel_distribution
[15] https://en.wikipedia.org/wiki/Fr%C3%A9chet_distribution
[16] https://en.wikipedia.org/wiki/Weibull_distribution
[17] Concepts for novelty detection and handling based on a case-based reasoning process scheme
[18] A new anomaly detection method based on hierarchical HMM
[19] Factorial switching Kalman filters for condition monitoring in neonatal intensive care
[20] Outlier detection in wireless sensor networks using Bayesian belief networks
[21] Multivariate density estimation with optimal marginal parzen density estimation and gaussianization
[22] State-of-the-art in Bayesian changepoint detection
[23] Self-nonself discrimination in a computer
[24] Outlier detection using k-nearest neighbour graph
[25] Hot: hypergraph-based outlier test for categorical data
[26] Applying the possibilistic c-means algorithm in kernel-induced spaces
[27] Findout: finding outliers in very large datasets
[28] Multi-sphere support vector data description for outliers detection on multi-distribution data
[29] Support vector method for novelty detection
[30] A Novel Anomaly Detection Scheme Based on Principal Component Classifier
[31] Enhancing the reliability of out-of-distribution image detection in neural networks
[32] Contrastive Training for Improved Out-of-Distribution Detection
[33] Variational Autoencoder based Anomaly Detection using Reconstruction Probability
[34] Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications
[35] Layer-constrained variational autoencoding kernel density estimation model for anomaly detection
[36] Variational Autoencoder With Optimizing Gaussian Mixture Model Priors