# AI Tech Talk Highlights: How to Build a Simple ML Classifier That Detects Spam

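    # runs once on training data
    # (trainData and SPAM are assumed to be defined elsewhere)
    def train():
        global pA, pNotA  # priors are read later by classify()
        total = 0
        numSpam = 0
        for email in trainData:
            if email.label == SPAM:
                numSpam += 1
            total += 1
        pA = numSpam / float(total)               # P(A): prior probability of spam
        pNotA = (total - numSpam) / float(total)  # P(¬A): prior probability of not spam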

    # per-class word counts and word totals, filled in by processEmail()
    trainPositive, trainNegative = {}, {}
    positiveTotal, negativeTotal = 0, 0

    # runs once on training data
    def train():
        global pA, pNotA
        total = 0
        numSpam = 0
        for email in trainData:
            if email.label == SPAM:
                numSpam += 1
            total += 1
            processEmail(email.body, email.label)
        pA = numSpam / float(total)
        pNotA = (total - numSpam) / float(total)

    # counts the words in a specific email
    # (body is assumed to be a list of words, not a raw string)
    def processEmail(body, label):
        global positiveTotal, negativeTotal
        for word in body:
            if label == SPAM:
                trainPositive[word] = trainPositive.get(word, 0) + 1
                positiveTotal += 1
            else:
                trainNegative[word] = trainNegative.get(word, 0) + 1
                negativeTotal += 1

    # gives the conditional probability p(B_i | A_x)
    # note: raises KeyError for a word never seen in the given class
    def conditionalWord(word, spam):
        if spam:
            return trainPositive[word] / float(positiveTotal)
        return trainNegative[word] / float(negativeTotal)

    # gives the conditional probability p(B | A_x)
    def conditionalEmail(body, spam):
        result = 1.0
        for word in body:
            result *= conditionalWord(word, spam)
        return result

    # classifies a new email as spam or not spam
    def classify(email):
        isSpam = pA * conditionalEmail(email, True)       # proportional to P(A | B)
        notSpam = pNotA * conditionalEmail(email, False)  # proportional to P(¬A | B)
        return isSpam > notSpam

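    # gives the conditional probability p(B_i | A_x) with smoothing
    # (alpha is the smoothing constant; numWords is taken here to be the number
    # of distinct words seen in training; both are assumed to be defined elsewhere)
    def conditionalWord(word, spam):
        if spam:
            return (trainPositive.get(word, 0) + alpha) / float(positiveTotal + alpha * numWords)
        return (trainNegative.get(word, 0) + alpha) / float(negativeTotal + alpha * numWords)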
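To see the pieces working together, here is a minimal self-contained sketch of the smoothed classifier on a made-up toy dataset. It follows the same steps as the snippets above, but the structure is an assumption made for illustration: the model is passed around as a dictionary instead of being kept in globals, the snake_case function names are hypothetical, `ALPHA = 1.0` is an arbitrary choice, and `numWords` is taken to be the vocabulary size.

```python
from collections import Counter

ALPHA = 1.0  # smoothing constant (assumed value; the snippets above only name `alpha`)

def train(emails):
    """emails: list of (list_of_words, is_spam) pairs. Returns a model dict."""
    spam_counts, ham_counts = Counter(), Counter()
    num_spam = 0
    for words, is_spam in emails:
        if is_spam:
            num_spam += 1
            spam_counts.update(words)
        else:
            ham_counts.update(words)
    total = len(emails)
    vocab = set(spam_counts) | set(ham_counts)
    return {
        "pA": num_spam / total,
        "pNotA": (total - num_spam) / total,
        "counts": {True: spam_counts, False: ham_counts},
        "totals": {True: sum(spam_counts.values()),
                   False: sum(ham_counts.values())},
        "numWords": len(vocab),  # assumed meaning: number of distinct words
    }

def conditional_word(model, word, spam):
    """Smoothed p(B_i | A_x); a Counter returns 0 for unseen words."""
    count = model["counts"][spam][word]
    return (count + ALPHA) / (model["totals"][spam] + ALPHA * model["numWords"])

def conditional_email(model, words, spam):
    """p(B | A_x) as the product of the per-word probabilities."""
    result = 1.0
    for word in words:
        result *= conditional_word(model, word, spam)
    return result

def classify(model, words):
    """Return True if the spam score beats the not-spam score."""
    is_spam = model["pA"] * conditional_email(model, words, True)
    not_spam = model["pNotA"] * conditional_email(model, words, False)
    return is_spam > not_spam

# Toy data invented for this example only.
toy_data = [
    ("win cash prize now".split(), True),
    ("cheap prize offer now".split(), True),
    ("meeting agenda for monday".split(), False),
    ("lunch on monday".split(), False),
]
model = train(toy_data)
print(classify(model, "win a prize".split()))     # True
print(classify(model, "monday meeting".split()))  # False
```

Note that the word "a" never appears in the toy training data, yet classification still works, because smoothing gives unseen words a small nonzero probability instead of zeroing out the whole product.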

TF-IDF algorithm

N-Grams algorithm

Tokenization
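The three techniques named above are only listed here, so the following is a rough sketch of what each might look like rather than the article's own implementation. The function names are hypothetical, the tokenizer is a deliberately simple regular expression, and the TF-IDF weighting shown (relative term frequency times `log(N / document frequency)`) is just one common formulation among several.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    num_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            tok: (cnt / len(tokens)) * math.log(num_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

tokens = tokenize("Win CASH now!!!")
print(tokens)             # ['win', 'cash', 'now']
print(ngrams(tokens, 2))  # [('win', 'cash'), ('cash', 'now')]
```

Any of these could feed the classifier in place of the raw word list: n-gram tuples can be counted exactly like single words, and under this TF-IDF formulation a word that appears in every document gets weight zero, which downplays words that carry no signal.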

https://spamassassin.apache.org/publiccorpus/
