Maximum Likelihood Estimation, Laplace's Law of Succession, and the m-Estimate in Bayesian Probability Estimation

Link to the original English article: http://www.temida.si/~bojan/probability_estimation.php

Original text:

Probability estimation

1 Introduction

Let us assume that in an experiment we have conducted n independent trials, of which r are successes. The remaining (n - r) trials are failures. Examples of such experiments are tossing a coin (e.g. success is a head), throwing a die (e.g. success is a 6 on top), shooting a free throw in basketball (success is a scored point), and so on.

In this report we will discuss the following question: how to estimate the probability of success in the next, (n+1)-th, trial. In addition, we would like to devote special attention to the cases in which the sample size is effectively small. In this context we will discuss and compare three methods: relative frequency, Laplace's law of succession, and the m-estimate.

What exactly do we mean by an effectively small sample size? Good (1965) states that when there are more than three successes and more than three failures in our example set, there is little difference between any of the described methods for many practical purposes. So, we will mostly be dealing with the cases in which either the number of successes or the number of failures is small (e.g. 0, 1, 2). Note that such situations can occur even when the actual sample size is large, if we divide the trial set into subsets that fulfill certain conditions in order to estimate conditional probabilities within those subsets.

2 Estimates for success in the next trial

2.1 Relative frequency

Relative frequency is sometimes also called the maximum likelihood estimate. The probability of success in the next trial is computed according to the following formula:

P = r / n

The estimation of probabilities can be regarded as a relatively simple task when the sample size is large enough. In such a case we would hardly require any theory. Bernoulli's theorem states that when n is large enough, the probability of success in the next trial can be reliably estimated by the relative frequency r/n. More formally, for arbitrarily small ε and δ, an n0 can be found such that for every n ≥ n0 the following inequality holds:

P(|r/n - p| < ε) > 1 - δ

However, after completing just one trial, which was a failure, the relative-frequency estimate of the probability of success in the next trial would be 0.

2.2 Laplace's law of succession

In order to alleviate such zero probability estimates, a modified formula was proposed:

P = (r + 1) / (n + 2)

In this formula a uniform prior probability is assumed (Good, 1965). In fact, the rationale behind adding 1 to the numerator and 2 to the denominator is the following: before performing the actual experiment, we assume that two trials have already taken place, one a success and one a failure.

2.3 Bayesian m-estimate

A more general Bayesian probability estimate is described in (Cestnik, 1990, 1991). To calculate a general Bayesian probability estimate on the basis of evidence, a prior probability distribution has to be assumed first. Then, given the evidence, the prior distribution is updated to a posterior one, from which the expectation can be taken as a point estimate of p. The task of determining a prior probability distribution has been identified as an intrinsic difficulty of the Bayesian approach. With pa denoting the prior probability of success and m a parameter of the method, the m-estimate is:

P = (r + pa * m) / (n + m)
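To make the behaviour of the three estimates on an effectively small sample concrete, here is a minimal Python sketch. The function names and the parameter choices p_a = 0.5 and m = 2 are illustrative assumptions, not part of the original article:

```python
def relative_frequency(r, n):
    """Maximum likelihood estimate: P = r / n (undefined when n = 0)."""
    return r / n


def laplace(r, n):
    """Laplace's law of succession: P = (r + 1) / (n + 2)."""
    return (r + 1) / (n + 2)


def m_estimate(r, n, p_a, m):
    """Bayesian m-estimate: P = (r + p_a * m) / (n + m)."""
    return (r + p_a * m) / (n + m)


# Effectively small sample: one trial, which was a failure (r = 0, n = 1).
r, n = 0, 1
print(relative_frequency(r, n))         # 0.0  -> the zero-probability problem
print(laplace(r, n))                    # 0.3333...
print(m_estimate(r, n, p_a=0.5, m=2))   # 0.3333... (with p_a = 0.5 and m = 2 the
                                        # m-estimate reduces to Laplace's formula)
```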

3 Estimation of conditional probabilities

The basic idea behind the proposed m-estimate (Cestnik, 1990, 1991) for estimating conditional probabilities is that the prior probability pa can be estimated from the unconditional sample. The remaining parameter m, which is related to the variance of the prior distribution, also has to be determined. The prior variance is given by the following formula:

Var(p) = pa * (1 - pa) / (m + 1)

The parameter m is thus inversely related to the variance of the prior distribution. It also governs the tradeoff between the relative frequency and the prior probability, as can be seen from the following equivalent form of the m-estimate:

P = n/(n + m) * (r/n) + m/(n + m) * pa
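As a sketch of the ideas in this section, the Python snippet below estimates the prior pa from an unconditional sample, checks that the weighted-average form gives the same value as the direct m-estimate, and derives m from an assumed prior variance. All sample counts, the variance value 0.08, and the helper names are made up for illustration:

```python
def m_estimate(r, n, p_a, m):
    """m-estimate in the direct form: P = (r + p_a * m) / (n + m)."""
    return (r + p_a * m) / (n + m)


def m_estimate_tradeoff(r, n, p_a, m):
    """Equivalent weighted-average form (for n > 0):
    P = n/(n+m) * (r/n) + m/(n+m) * p_a."""
    return n / (n + m) * (r / n) + m / (n + m) * p_a


def m_from_prior_variance(p_a, var_p):
    """Solve Var(p) = p_a * (1 - p_a) / (m + 1) for m."""
    return p_a * (1 - p_a) / var_p - 1


# Prior estimated from the unconditional sample: e.g. 40 successes in 100 trials.
p_a = 40 / 100
# Conditional subset fulfilling some condition: 1 success in 2 trials.
r, n = 1, 2
m = 2  # larger m pulls the estimate further towards the prior p_a
print(m_estimate(r, n, p_a, m))            # 0.45
print(m_estimate_tradeoff(r, n, p_a, m))   # 0.45 -- same value, tradeoff form
print(m_from_prior_variance(p_a, 0.08))    # 2.0  -- m implied by a prior variance of 0.08
```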

4 Literature

B. Cestnik: Estimating probabilities in machine learning. Ph.D. thesis, University of Ljubljana, Faculty of Computer and Information Science, 1991.

B. Cestnik: Estimating probabilities: A crucial task in machine learning. In: Carlucci Aiello, Luigia (ed.), ECAI 90. London: Pitman, 1990, pp. 147-149.

B. Cestnik, I. Bratko: On estimating probabilities in tree pruning. In: Kodratoff, Yves (ed.), Machine Learning - EWSL-91: European Working Session on Learning, Porto, Portugal, March 6-8, 1991, Proceedings (Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence, 482). Berlin: Springer-Verlag, 1991, pp. 138-150.

S. Džeroski, B. Cestnik, I. Petrovski: Using the m-estimate in rule induction. CIT Journal of Computing and Information Technology, 1993, vol. 1, pp. 37-46.

I. J. Good: The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge, Mass.: M.I.T. Press, 1965.


Also attached is the m-estimate paper: Estimating Probabilities: A Crucial Task in Machine Learning (Cestnik, 1990).
