HMM-Based Chinese Part-of-Speech Tagging (POS Tagging)

Michael阿明

Published 2020-07-13 17:05:12

Included in the column: Michael阿明学习之路

The code in this post is based on Teacher Xu's (徐老师) code, to which I added some comments. Many thanks for that!

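1. Part-of-Speech Tagging

1.1 Concept

For background, see the expert introduction 中文词性标注简介 (An Introduction to Chinese Part-of-Speech Tagging).

1.2 Task

Given the annotated corpus corpus4pos_tagging.txt, train a model and use it to predict the part-of-speech tags of a given text.

An excerpt of the annotated corpus: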

19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  ——/w  一九九八年/t  新年/t  讲话/n  （/w  附/v  图片/n  １/m  张/q  ）/w   
19980101-01-001-003/m  （/w  一九九七年/t  十二月/t  三十一日/t  ）/w  
19980101-01-001-004/m  １２月/t  ３１日/t  ，/w  发表/v  １９９８年/t  新年/t  讲话/n  《/w  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n  》/w  。/w  （/w  新华社/nt  记者/n  兰/nr  红光/nr  摄/Vg  ）/w  
19980101-01-001-005/m  同胞/n  们/k  、/w  朋友/n  们/k  、/w  女士/n  们/k  、/w  先生/n  们/k  ：/w  
19980101-01-001-006/m  在/p  １９９８年/t  来临/v  之际/f  ，/w  我/r  十分/m  高兴/a  地/u  通过/p  [中央/n  人民/n  广播/vn  电台/n]nt  、/w  [中国/ns  国际/n  广播/vn  电台/n]nt  和/c  [中央/n  电视台/n]nt  ，/w  向/p  全国/n  各族/r  人民/n  ，/w  向/p  [香港/ns  特别/a  行政区/n]ns  同胞/n  、/w  澳门/ns  和/c  台湾/ns  同胞/n  、/w  海外/s  侨胞/n  ，/w  向/p  世界/n  各国/r  的/u  朋友/n  们/k  ，/w  致以/v  诚挚/a  的/u  问候/vn  和/c  良好/a  的/u  祝愿/vn  ！/w

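1.3 Preprocessing

Text processing, the corpusSplit function: strips whitespace, splits each line into word/tag tokens, removes the bracket characters that mark compounds, and stores the resulting sentences in a list.
Data splitting, the out function: distributes the sentences across 20 files (18 training sets, 1 development set, 1 test set). The training files are nested: train.i receives every sentence whose index modulo 20 is at most i, so train.0 is the smallest and train.17 the largest.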
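To make the bracket handling concrete, here is a minimal sketch of how a single token could be cleaned. The helper name `clean_token` is hypothetical and not part of corpusSplit.py; it only mirrors the rule that a leading `[` is dropped from the word and everything from `]` onward is dropped from the tag.

```python
import re

def clean_token(token):
    """Split 'word/tag' and drop compound brackets; return None if malformed."""
    parts = token.split("/")
    if len(parts) != 2:
        return None
    word, pos = parts
    if word.startswith("["):
        word = word.replace("[", "")   # as in the original: removes every '['
    pos = re.sub(r"\].*", "", pos)     # 'n]nt' -> 'n'
    if word == "" or pos == "":
        return None
    return word, pos

tokens = "[中央/n 人民/n 广播/vn 电台/n]nt".split()
cleaned = [clean_token(t) for t in tokens]
print(cleaned)
```

Note that the compound-level tag (nt here) is discarded, so the tagger only ever sees word-level tags.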

# corpusSplit.py
def corpusSplit(infile, sentenceList):  # split the corpus into sentences
    fdi = open(infile, 'r', encoding='utf-8')  # open the raw corpus
    fullStopDict = {"。": 1, "；": 1, "？": 1, "！": 1}
    for line in fdi:
        text = line.strip()  # strip leading and trailing whitespace
        if text == "":
            continue
        else:
            infs = text.split() # split the line into word/tag tokens
            sentence = []
            flag = True
            for s in infs:
                w_p = s.split("/")  # [word, tag]
                if len(w_p) == 2:
                    word = w_p[0]
                    if word.startswith("["):
                        word = word.replace("[", "")  # drop the "[" that opens a compound
                    pos = w_p[1]
                    pos = re.sub(r"\].*", "", pos)  # drop "]" and the compound tag that follows it
                    if word == "" or pos == "":
                        flag = False
                    else:
                        sentence.append(word + "/" + pos)
                    if word in fullStopDict:
                        if flag == True:
                            sentenceList.append(" ".join(sentence)) # join the tokens with single spaces
                        flag = True
                        sentence = []
                else:
                    flag = False
            if sentence != [] and flag == True:
                sentenceList.append(" ".join(sentence))
    fdi.close()

def out(sentenceList, out_dir): # write the sentences to 20 files, 18 of them training files
    fdo_train_list = []
    for i in range(18):
        fdo_train = open(out_dir + "/train.%d" % (i), "w", encoding='utf-8')
        fdo_train_list.append(fdo_train)
    fdo_dev = open(out_dir + "/dev.txt", "w", encoding='utf-8')
    fdo_test = open(out_dir + "/test.txt", "w", encoding='utf-8')
    for sindx in range(len(sentenceList)):
        if sindx % 20 < 18:
            for i in range(sindx % 20, 18): # higher-numbered files receive more sentences
                fdo_train_list[i].write(sentenceList[sindx] + "\n")
        elif sindx % 20 == 18:
            fdo_dev.write(sentenceList[sindx] + "\n")   # development set
        elif sindx % 20 == 19:
            fdo_test.write(sentenceList[sindx] + "\n")  # test set
    for i in range(18):
        fdo_train_list[i].close()   # close every file that was opened
    fdo_dev.close()
    fdo_test.close()

import sys
import re # regular-expression module
import random
'''
try:
    infile = sys.argv[1]
    out_dir = sys.argv[2]
except:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile out_dir\n")
    sys.exit(-1)
'''
# step 1: split the corpus into sentences
infile = "./data/corpus4pos_tagging.txt"
out_dir = "./data"
sentenceList = []
corpusSplit(infile, sentenceList)
# step 2: write the splits
out(sentenceList, out_dir)

Sample of the processed text:

19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/q ）/w
中国/ns 与/p 周边/n 国家/n 和/c 广大/b 发展中国家/l 的/u 友好/a 合作/vn 进一步/d 加强/v 。/w
但/c 前进/v 的/u 道路/n 不会/v 也/d 不/d 可能/v 一帆风顺/i ，/w 关键/n 是/v 世界/n 各国/r 人民/n 要/v 进一步/d 团结/a 起来/v ，/w 共同/d 推动/v 早日/d 建立/v 公正/a 合理/a 的/u 国际/n 政治/n 经济/n 新/a 秩序/n 。/w
我们/r 必须/d 进一步/d 深入/ad 学习/v 和/c 掌握/v 党/n 的/u 十五大/j 精神/n ，/w 统揽全局/l ，/w 精心/d 部署/v ，/w 狠抓/v 落实/v ，/w 团结/a 一致/a ，/w 艰苦奋斗/i ，/w 开拓/v 前进/v ，/w 为/p 夺取/v 今年/t 改革/v 开放/v 和/c 社会主义/n 现代化/vn 建设/vn 的/u 新/a 胜利/vn 而/c 奋斗/v 。/w
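
The nested layout of the training files is easy to check with a short sketch. It replays the index rule from the out function on a hypothetical corpus of 40 sentences (no files are written) and counts how many sentences each split would receive; the function name `split_sizes` is made up for this example.

```python
def split_sizes(n_sentences, n_train=18, period=20):
    """Count sentences per split using the same modulo rule as out()."""
    train = [0] * n_train
    dev = test = 0
    for idx in range(n_sentences):
        r = idx % period
        if r < n_train:
            for i in range(r, n_train):   # train.i receives remainders 0..i
                train[i] += 1
        elif r == n_train:
            dev += 1
        else:
            test += 1
    return train, dev, test

train, dev, test = split_sizes(40)
print(train)       # sizes grow from train.0 to train.17
print(dev, test)
```

In other words, train.i holds roughly (i + 1)/20 of the corpus, which is what makes the later accuracy-versus-corpus-size curves possible.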

1.4 Preliminary Statistics

# staForPosDistribution.py
import sys
def add2posDict(pos, pDict):
	if pos in pDict:
		pDict[pos] += 1
	else:
		pDict[pos]  = 1
def sta(infile, pDict):
	fdi = open(infile, 'r', encoding='utf-8')
	for line in fdi:
		infs = line.strip().split()
		posList = [s.split("/")[1] for s in infs]	# list of tags
		for pos in posList:
			add2posDict(pos, pDict)	# count each tag
			add2posDict("all", pDict)	# running total
	fdi.close()
def out(pDict):
	oList = list(pDict.items())
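	oList.sort(key=lambda infs:(infs[1]), reverse=True)	# sort by count, descending
	total = oList[0][1]	# the "all" entry has the largest count, so it sorts first
	for pos, num in oList:
		print("%s\t%.4f" % (pos, num/total))	# print each tag and its relative frequency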
try:
	infile = sys.argv[1]
except:
	sys.stderr.write("\tpython "+sys.argv[0]+" infile\n")
	sys.exit(-1)
pDict = {}
sta(infile, pDict)	# count tag frequencies in the given file
out(pDict)	# print the result

Run the following command to compute the statistics on the largest training set:

python staForPosDistribution.py ./data/train.17

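2. Maximum-Probability Model

2.1 Training

For each word, count its total number of occurrences, find its most frequent tag, and compute that tag's share of the word's occurrences.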

# trainByMaxProb.py
def staForWordToPosDict(infile, word2posDict):
	fdi = open(infile, 'r', encoding='utf-8')
	for line in fdi:
		infs = line.strip().split()
		for s in infs:
			w_p = s.split("/")
			if len(w_p) == 2:
				word = w_p[0]
				pos  = w_p[1]
				if word in word2posDict:
					if pos in word2posDict[word]:
						word2posDict[word][pos] += 1
					else:
						word2posDict[word][pos]  = 1
				else:
					word2posDict[word] = {pos:1}
				# nested dict {word: {pos: count}}
				# i.e. the word, its tags, and their frequencies in the text
	fdi.close()

def getMaxProbPos(posDict):
	total = sum(posDict.values())
	max_num  = -1
	max_pos  = ""
	for pos in posDict:
		if posDict[pos] > max_num:
			max_num = posDict[pos]
			max_pos = pos
	return max_pos, max_num/total

def out4model(word2posDict, model_file):
	wordNumList = [[word, sum(word2posDict[word].values())] for word in  word2posDict]
	# [[word, count]]: each word with its total count over all tags
	wordNumList.sort(key=lambda infs:(infs[1]), reverse=True)	# sort by count, descending
	fdo = open(model_file, "w", encoding='utf-8')
	for word, num in wordNumList:
		pos, prob = getMaxProbPos(word2posDict[word])	
		# a word may have several tags; take the most frequent one and its share
		if word != "" and pos != "":
			fdo.write("%s\t%d\t%s\t%f\n" % (word, num, pos, prob))	
			# fields written: word, count, most frequent tag, that tag's share
	fdo.close()

import sys
try:
	infile     = sys.argv[1]
	model_file = sys.argv[2]
except:
	sys.stderr.write("\tpython "+sys.argv[0]+" infile model_file\n")
	sys.exit(-1)
word2posDict = {}
staForWordToPosDict(infile, word2posDict)	# collect statistics from the training file
out4model(word2posDict, model_file)	# write the model file

python trainByMaxProb.py ./data/train.0 ./data/model.MaxProb.0

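Each line of the output model file model.MaxProb.0 holds four tab-separated fields: the word, its total count, its most frequent tag, and that tag's share of the word's occurrences.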
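
The same statistic can be sketched more compactly with the standard library. This is an illustrative alternative rather than the script above; the function name `most_likely_tags` and the two-sentence toy corpus are made up for the example.

```python
from collections import Counter, defaultdict

def most_likely_tags(lines):
    """Return {word: (total_count, best_tag, best_tag_share)}."""
    counts = defaultdict(Counter)
    for line in lines:
        for token in line.split():
            parts = token.split("/")
            if len(parts) == 2 and parts[0] and parts[1]:
                counts[parts[0]][parts[1]] += 1
    model = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        tag, n = tag_counts.most_common(1)[0]
        model[word] = (total, tag, n / total)
    return model

toy = ["希望/n 的/u 世纪/n", "希望/v 和平/n 希望/v"]
model = most_likely_tags(toy)
print(model["希望"])   # seen three times, twice as a verb
```

A word that never appears in training has no entry at all, which is why the prediction script falls back to the tag unknown.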

2.2 Prediction

# predictByMaxProb.py
def loadModel(model_file, word2posDict):	# load the trained model
	fdi = open(model_file, 'r', encoding='utf-8')
	for line in fdi:
		infs = line.strip().split()
		if len(infs) == 4:
			word = infs[0]
			pos  = infs[2]
			word2posDict[word] = pos	# keep each word's most probable tag
		else:
			sys.stderr.write("format error in "+model_file+"\n")
			sys.stderr.write(line)
			sys.exit(-1)
	fdi.close()

def getWords(infs):
	return [s.split("/")[0] for s in infs]

def predict(infile, word2posDict, outfile):
	fdi = open(infile, 'r', encoding='utf-8')
	fdo = open(outfile, 'w', encoding='utf-8')
	for line in fdi:
		infs = line.strip().split()
		# hide the gold tags: a closed-book exam
		words = getWords(infs)	# keep only the words of the input file
		results = []
		for word in words:
			if word in word2posDict:	# look up the word's most probable tag in the model
				results.append(word + "/" + word2posDict[word])
			else:
				results.append(word + "/unknown")
		fdo.write(" ".join(results)+"\n")	# write to the output file
	fdo.close()
	fdi.close()

import sys
try:
	infile     = sys.argv[1]
	model_file = sys.argv[2]
	outfile    = sys.argv[3]
except:
	sys.stderr.write("\tpython "+sys.argv[0]+" infile model_file outfile\n")
	sys.exit(-1)
word2posDict = {}
loadModel(model_file, word2posDict)	# load the trained model
predict(infile, word2posDict, outfile)	# write the predictions

Run the prediction:

python predictByMaxProb.py ./data/train.0 ./data/model.MaxProb.0 ./data/train.0.MaxProb.predict

An excerpt of the prediction file train.0.MaxProb.predict:

19980101-01-001-001/m 迈向/v 充满/v 希望/v 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/nr ）/w
中国/ns 与/p 周边/n 国家/n 和/c 广大/b 发展中国家/l 的/u 友好/a 合作/vn 进一步/d 加强/v 。/w
但/c 前进/v 的/u 道路/n 不会/v 也/d 不/d 可能/v 一帆风顺/i ，/w 关键/n 是/v 世界/n 各国/r 人民/n 要/v 进一步/d 团结/v 起来/v ，/w 共同/d 推动/v 早日/d 建立/v 公正/a 合理/a 的/u 国际/n 政治/n 经济/n 新/a 秩序/n 。/w
我们/r 必须/d 进一步/d 深入/v 学习/v 和/c 掌握/v 党/n 的/u 十五大/j 精神/n ，/w 统揽全局/l ，/w 精心/ad 部署/vn ，/w 狠抓/v 落实/v ，/w 团结/v 一致/a ，/w 艰苦奋斗/i ，/w 开拓/v 前进/v ，/w 为/p 夺取/v 今年/t 改革/vn 开放/v 和/c 社会主义/n 现代化/vn 建设/vn 的/u 新/a 胜利/vn 而/c 奋斗/v 。/w

2.3 Evaluation

# resultEval.py
import sys
def getPosList(infs):
	return [s.split("/")[1] for s in infs]
def add2staDict(pos, indx, staDict):
	if pos not in staDict:
		staDict[pos] = [pos, 0, 0, 0]
	staDict[pos][indx] += 1
def add2errDict(mykey, errDict):
	if mykey in errDict:
		errDict[mykey] += 1
	else:
		errDict[mykey]  = 1
def sta(label_file, predict_file, staDict, errDict):
	fdi1 = open(label_file, 'r', encoding='utf-8')
	fdi2 = open(predict_file, 'r', encoding='utf-8')
	while True:
		line1 = fdi1.readline()
		line2 = fdi2.readline()
		if line1 == "" and line2 == "":
			break
		elif line1 == "" or line2 == "":
			sys.stderr.write("the number of lines is not equal between %s and %s!\n" % (
				label_file, predict_file))
			sys.exit(-1)
		else:
			labelList = getPosList(line1.strip().split())	# gold tags
			predictList = getPosList(line2.strip().split())	# predicted tags
			if len(labelList) != len(predictList):
				sys.stderr.write("the number of words is not equal between %s and %s!\n" % (
					label_file, predict_file))
				sys.exit(-1)
			else:
				for i in range(len(labelList)):
					label = labelList[i]
					predict = predictList[i]
					add2staDict(label, 1, staDict)	# staDict[pos] = [pos, 0, 0, 0]
					add2staDict(predict, 2, staDict) # (tag, gold count, predicted count, count where gold == predicted)
					add2staDict("all", 1, staDict)
					add2staDict("all", 2, staDict)
					if label == predict:
						add2staDict(label, 3, staDict)
						add2staDict("all", 3, staDict)
					else:
						add2errDict("%s-->%s" % (label, predict), errDict)	# count each error type
						add2errDict("all-->all", errDict)
	fdi2.close()
	fdi1.close()

def out(staDict, errDict, outfile):
	staList = list(staDict.values())
	staList.sort(key=lambda infs:(infs[1]), reverse=True)
	errList = list(errDict.items())
	errList.sort(key=lambda infs:(infs[1]), reverse=True)
	fdo = open(outfile, 'w', encoding='utf-8')
	total = staList[0][1]
	for pos, nlabel, npredict, nright in staList:
		fdo.write("pos_%s\t%.4f\t%.4f\t%.4f\n" % (pos, 
			nlabel/total, 
			nright/(npredict if npredict > 0 else 100), 
			nright/(nlabel if nlabel > 0 else 100)))
		# fields written: tag, share of gold labels, precision, recall
	total = errList[0][1]
	for errKey, num in errList:
		fdo.write("err_%s\t%.4f\n" % (errKey, num/total))
	fdo.close()

try:
	label_file   = sys.argv[1]
	predict_file = sys.argv[2]
	outfile      = sys.argv[3]
except:
	sys.stderr.write("\tpython "+sys.argv[0]+" label_file predict_file outfile\n")
	sys.exit(-1)
staDict = {}
errDict = {}
sta(label_file, predict_file, staDict, errDict)	# collect the accuracy statistics
out(staDict, errDict, outfile)	# write the evaluation file

Run the evaluation:

python resultEval.py ./data/train.0 ./data/train.0.MaxProb.predict ./data/train.0.MaxProb.eval

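The evaluation file train.0.MaxProb.eval contains one pos_<tag> line per tag (share of gold labels, precision, recall), starting with pos_all, followed by err_<gold>--><predicted> lines that give each error type's share of all errors.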
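
For each tag, resultEval.py writes nright/npredict (precision) and nright/nlabel (recall). A minimal sketch of those two ratios on a made-up five-token example follows; the helper name `tag_precision_recall` is hypothetical.

```python
def tag_precision_recall(labels, predictions, tag):
    """Precision and recall of one tag over aligned gold/predicted tag lists."""
    n_label = sum(1 for l in labels if l == tag)
    n_pred = sum(1 for p in predictions if p == tag)
    n_right = sum(1 for l, p in zip(labels, predictions) if l == p == tag)
    precision = n_right / n_pred if n_pred else 0.0
    recall = n_right / n_label if n_label else 0.0
    return precision, recall

labels      = ["n", "v", "n", "a", "v"]
predictions = ["n", "v", "v", "a", "n"]
print(tag_precision_recall(labels, predictions, "n"))
print(tag_precision_recall(labels, predictions, "a"))
```

For the pseudo-tag all, every token counts as both labelled and predicted, so precision and recall coincide with plain accuracy.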

2.4 Visualizing the Results

A shell script runs the whole pipeline over all 18 training sets:

# Change PYTHON to the interpreter path on your machine.
# A variable is used because bash does not expand aliases in non-interactive scripts by default.
PYTHON=/usr/local/bin/python3.7
for ((i=0; i<=17; i++))
do
	# step 1: maximum-probability model
	# step 1.1: train the model
	"$PYTHON" trainByMaxProb.py ./data/train.${i} ./data/model.MaxProb.${i}
	# step 1.2: evaluate on the training set
	"$PYTHON" predictByMaxProb.py ./data/train.${i} ./data/model.MaxProb.${i} ./data/train.${i}.MaxProb.predict
	"$PYTHON" resultEval.py ./data/train.${i} ./data/train.${i}.MaxProb.predict ./data/train.${i}.MaxProb.eval
	# step 1.3: evaluate on the development set
	"$PYTHON" predictByMaxProb.py ./data/dev.txt ./data/model.MaxProb.${i} ./data/dev.${i}.MaxProb.predict
	"$PYTHON" resultEval.py ./data/dev.txt ./data/dev.${i}.MaxProb.predict ./data/dev.${i}.MaxProb.eval
	# step 1.4: evaluate on the test set
	"$PYTHON" predictByMaxProb.py ./data/test.txt ./data/model.MaxProb.${i} ./data/test.${i}.MaxProb.predict
	"$PYTHON" resultEval.py ./data/test.txt ./data/test.${i}.MaxProb.predict ./data/test.${i}.MaxProb.eval
done
echo "FINISH !!!"

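From each .eval file, read the third or fourth field of the first line (the pos_all row, where precision and recall both equal the overall accuracy) and plot accuracy against corpus size: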

# -*- coding:utf-8 -*-
# python3.7
# @Time: 2019/12/20 23:03
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: resultView.py

trainEval = []
devEval = []
testEval = []
for i in range(18):
    filename1 = "./data/train." + str(i) + ".MaxProb.eval"
    filename2 = "./data/dev." + str(i) + ".MaxProb.eval"
    filename3 = "./data/test." + str(i) + ".MaxProb.eval"
    with open(filename1, 'r', encoding='utf-8') as f1:
        trainEval.append(float(f1.readline().split()[2]))
    with open(filename2, 'r', encoding='utf-8') as f2:
        devEval.append(float(f2.readline().split()[2]))
    with open(filename3, 'r', encoding='utf-8') as f3:
        testEval.append(float(f3.readline().split()[2]))

import matplotlib.pyplot as plt

# plt.rcParams['font.family'] = 'sans-serif'	# avoid garbled Chinese characters
plt.rcParams['font.sans-serif'] = 'SimHei'	# a font with Chinese glyphs, so the labels are not garbled
plt.title("不同大小语料下的结果对比")
plt.xlabel("语料")
plt.ylabel("准确率")
plt.plot(trainEval, 'r-', devEval, 'b-', testEval, 'g-')
plt.legend(('train', 'dev', 'test'), loc='upper right')
plt.show()

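As the training corpus grows, accuracy on the development and test sets keeps improving: quickly at first, then levelling off as the model reaches a ceiling of about 90%.

3. Bigram Hidden Markov Model (BiHMM)

For an introduction to HMMs, see my blog post 隐马尔科夫模型（HMM）笔记 (Notes on Hidden Markov Models).

3.1 Training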
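
For reference, the decoding objective used in this section can be written as below. The notation is an assumption that follows the comments in the training script: $x_t$ is the word at position $t$, $y_t$ its tag, $y_0$ the `__start__` marker and $y_{T+1}$ the `__end__` marker, which emits nothing.

```latex
\hat{y}_{1:T}
  = \arg\max_{y_{1:T}}
    \left[ \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t) \right]
    P(y_{T+1} \mid y_T)
```

Taking logarithms turns the product into a sum of log transition and log emission probabilities, which is why the model file stores logs and the prediction script adds them.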

# -*- coding: UTF-8 -*-
# trainByBiHMM.py
def add2transDict(pos1, pos2, transDict):
    if pos1 in transDict:
        if pos2 in transDict[pos1]:
            transDict[pos1][pos2] += 1
        else:
            transDict[pos1][pos2] = 1
    else:
        transDict[pos1] = {pos2: 1}


def add2emitDict(pos, word, emitDict):
    if pos in emitDict:
        if word in emitDict[pos]:
            emitDict[pos][word] += 1
        else:
            emitDict[pos][word] = 1
    else:
        emitDict[pos] = {word: 1}


def sta(infile, transDict, emitDict):
    fdi = open(infile, 'r', encoding='utf-8')
    for line in fdi:
        infs = line.strip().split()
        wpList = [["__NONE__", "__start__"]] + [s.split("/") for s in infs] + [["__NONE__", "__end__"]]
        # boundary handling: add start and end markers to every sentence
        for i in range(1, len(wpList)):
            pre_pos = wpList[i - 1][1]  # previous tag (hidden state y_{t-1})
            cur_pos = wpList[i][1]  # current tag (hidden state y_t)
            word = wpList[i][0]  # current observation (emitted word) x_t
            if word == "" or cur_pos == "" or pre_pos == "":
                continue
            add2transDict(pre_pos, cur_pos, transDict)	# count the transition
            add2emitDict(cur_pos, word, emitDict)	# count the emission
        add2transDict("__end__", "__end__", transDict)
    fdi.close()


def getPosNumList(transDict):
    pnList = []
    for pos in transDict:  # {pre_pos: {cur_pos: count}}
        # if pos == "__start__" or pos == "__end__":
        #	continue
        num = sum(transDict[pos].values())
        pnList.append([pos, num])  # how often this tag occurs as the previous tag
    pnList.sort(key=lambda infs: (infs[1]), reverse=True)
    return pnList


def getTotalWordNum(emitDict):
    total_word_num = 0
    for pos in emitDict:
        total_word_num += sum(list(emitDict[pos].values()))
    return total_word_num


def out4model(transDict, emitDict, model_file):
    pnList = getPosNumList(transDict)

    # state set
    fdo = open(model_file, 'w', encoding='utf-8')
    total = sum([num for pos, num in pnList])  # total count over all tags
    for pos, num in pnList:
        fdo.write("pos_set\t%s\t%d\t%f\n" % (pos, num, num / total))
    # fields written: tag, tag count, relative frequency

    # transition probabilities
    total_word_num = getTotalWordNum(emitDict)  # {cur_pos: {word: count}}
    for pos1, num1 in pnList:  # previous tag and its count
        if pos1 == "__end__":
            continue
        #smoothing_factor = num1/total_word_num # smoothing scheme 1
        smoothing_factor = 1.0                  # smoothing scheme 2
        tmpList = []
        for pos2, _ in pnList:
            if pos2 == "__start__":
                continue
            if pos2 in transDict[pos1]:
                tmpList.append([pos2, transDict[pos1][pos2] + smoothing_factor])
            else:
                tmpList.append([pos2, smoothing_factor])
        denominator = sum([infs[1] for infs in tmpList])
        for pos2, numerator in tmpList:
            fdo.write("trans_prob\t%s\t%s\t%f\n" % (pos1, pos2, math.log(numerator/denominator)))
        
    # emission probabilities
    for pos, _ in pnList:
        if pos == "__start__" or pos == "__end__":
            continue
        wnList = list(emitDict[pos].items())
        wnList.sort(key=lambda infs: infs[1], reverse=True)
        num = sum([num for _, num in wnList])
        #smoothing_factor = num/total_word_num # smoothing scheme 1
        smoothing_factor = 1.0                 # smoothing scheme 2
        tmpList = []
        for word, num in wnList:
            tmpList.append([word, num+smoothing_factor])
        tmpList.append(["__NEW__", smoothing_factor])
        # smoothed probability for words never seen with this tag
        denominator = sum([infs[1] for infs in tmpList])
        for word, numerator in tmpList:
            fdo.write("emit_prob\t%s\t%s\t%f\n" % (pos, word, math.log(numerator/denominator)))    
    fdo.close()


import sys
import math

try:
    infile = sys.argv[1]
    model_file = sys.argv[2]
except:
    sys.stderr.write("\tpython " + sys.argv[0] + " infile model_file\n")
    sys.exit(-1)
transDict = {}  # transition counts
emitDict = {}  # emission counts
sta(infile, transDict, emitDict)
out4model(transDict, emitDict, model_file)

Run the training:

python trainByBiHMM.py ./data/train.0 ./data/model.BiHMM.0
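
To see what the counting step collects, here is a minimal sketch for a single tagged sentence. The function name `count_bigram_stats` is hypothetical, and unlike trainByBiHMM.py the sketch does not record a placeholder emission for `__end__` or the extra `__end__` to `__end__` transition.

```python
from collections import Counter

def count_bigram_stats(tagged_sentence):
    """Count tag-to-tag transitions and tag-to-word emissions for one sentence."""
    pairs = [tuple(tok.split("/")) for tok in tagged_sentence.split()]
    tags = ["__start__"] + [pos for _, pos in pairs] + ["__end__"]
    transitions = Counter(zip(tags, tags[1:]))           # (previous tag, current tag)
    emissions = Counter((pos, word) for word, pos in pairs)
    return transitions, emissions

trans, emit = count_bigram_stats("中国/ns 建设/v 高铁/n")
print(trans)
print(emit)
```

Summing such counters over every sentence and normalizing each row, after smoothing, gives the trans_prob and emit_prob lines of the model file.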

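An excerpt of the generated model file model.BiHMM.0 follows. The trans_prob and emit_prob values are natural logarithms; the very small __NEW__ value for Bg suggests that this excerpt was produced with smoothing scheme 1 rather than the add-one scheme 2 left active in the listing above.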

pos_set	n	77189	0.192527
pos_set	v	59762	0.149060
pos_set	w	54829	0.136756
pos_set	u	24474	0.061044
pos_set	m	19642	0.048992
pos_set	d	15820	0.039459
pos_set	__start__	15432	0.038491
pos_set	__end__	15432	0.038491
pos_set	vn	15115	0.037700
(lines omitted)
trans_prob	n	n	-1.710944
trans_prob	n	v	-1.924026
trans_prob	n	w	-1.346831
trans_prob	n	u	-2.400427
trans_prob	n	m	-4.080009
trans_prob	n	d	-2.913617
trans_prob	n	__end__	-4.992247
trans_prob	n	vn	-2.887937
(lines omitted)
emit_prob	n	人	-4.622626
emit_prob	n	经济	-4.715296
emit_prob	n	企业	-4.757801
emit_prob	n	记者	-4.804948
emit_prob	n	国家	-4.840039
emit_prob	n	问题	-4.980944
emit_prob	n	人民	-5.088500
emit_prob	n	全国	-5.099550
emit_prob	Bg	翠	-0.405465
emit_prob	Bg	__NEW__	-12.862278

3.2 Prediction

# -*- coding: UTF-8 -*-
# predictByBiHMM.py
def add2transDict(pos1, pos2, prob, transDict):
	if pos1 in transDict:
		transDict[pos1][pos2] = prob
	else:
		transDict[pos1] = {pos2:prob}
def add2emitDict(pos, word, prob, emitDict):
	if pos in emitDict:
		emitDict[pos][word] = prob
	else:
		emitDict[pos] = {word:prob}
def loadModel(infile, gPosList, transDict, emitDict):
	fdi = open(infile, 'r', encoding='utf-8')
	for line in fdi:
		infs = line.strip().split()
		if infs[0] == "pos_set":
			pos = infs[1]
			if pos != "__start__" and pos != "__end__":
				gPosList.append(pos)
		if infs[0] == "trans_prob":
			pos1 = infs[1]
			pos2 = infs[2]
			prob = float(infs[3])
			add2transDict(pos1, pos2, prob, transDict)
		if infs[0] == "emit_prob":
			pos = infs[1]
			word = infs[2]
			prob = float(infs[3])
			add2emitDict(pos, word, prob, emitDict)
	fdi.close()
	
def getWords(infs):
	return [s.split("/")[0] for s in infs]	# keep only the words
def getEmitProb(emitDict, pos, word):
	if word in emitDict[pos]:
		return emitDict[pos][word]
	else:
		return emitDict[pos]["__NEW__"]
def predict4one(words, gPosList, transDict, emitDict, results):
	if words == []:
		return
	prePosDictList = []
	for i in range(len(words)):	# iterate over the words, i.e. time step i
		prePosDict = {}
		for pos in gPosList:	# iterate over the tags, i.e. the states
			if i == 0:	# initial time step
				trans_prob = transDict["__start__"][pos]
				emit_prob  = getEmitProb(emitDict, pos, words[i])
				total_prob = trans_prob + emit_prob	# probabilities are stored as logs: log A + log B = log(AB)
				prePosDict[pos] = [total_prob, "__start__"]
			else:
				emit_prob = getEmitProb(emitDict, pos, words[i])
				max_total_prob = -10000000.0
				max_pre_pos    = ""
				for pre_pos in prePosDictList[i-1]:	# find the best predecessor in the previous column
					pre_prob   = prePosDictList[i-1][pre_pos][0]
					trans_prob = transDict[pre_pos][pos]
					total_prob = pre_prob + trans_prob + emit_prob
					if max_pre_pos == "" or total_prob > max_total_prob:
						max_total_prob = total_prob
						max_pre_pos = pre_pos
				prePosDict[pos] = [max_total_prob, max_pre_pos]
		prePosDictList.append(prePosDict)
	max_total_prob = -10000000.0
	max_pre_pos    = ""
	for pre_pos in prePosDictList[len(prePosDictList)-1]:	# last column
		pre_prob   = prePosDictList[len(prePosDictList)-1][pre_pos][0]
		trans_prob = transDict[pre_pos]["__end__"]
		total_prob = pre_prob + trans_prob	# __end__ emits nothing
		if max_pre_pos == "" or total_prob > max_total_prob:
			max_total_prob = total_prob
			max_pre_pos = pre_pos
	posList = [max_pre_pos]	# best path, built back to front
	indx = len(prePosDictList)-1
	max_pre_pos = prePosDictList[indx][max_pre_pos][1]
	indx -= 1
	while indx >= 0:
		posList.append(max_pre_pos)
		max_pre_pos = prePosDictList[indx][max_pre_pos][1]	# follow the back-pointer one step
		indx -= 1
	if len(posList) == len(words):
		posList.reverse()	# the path was recovered back to front, so reverse it
		for i in range(len(posList)):
			results.append(words[i]+"/"+posList[i])	# predicted result
	else:
		sys.stderr.write("error : the number of pos is not equal to the number of words!\n")
		sys.exit(-1)
def predict(infile, gPosList, transDict, emitDict, outfile):
	fdi = open(infile, 'r', encoding='utf-8')
	fdo = open(outfile, "w", encoding='utf-8')
	for line in fdi:
		infs = line.strip().split()
		# hide the gold tags: a closed-book exam
		words = getWords(infs)
		results = []
		predict4one(words, gPosList, transDict, emitDict, results)
		fdo.write(" ".join(results)+"\n")
	fdo.close()
	fdi.close()

import sys
import math
try:
	infile     = sys.argv[1]
	model_file = sys.argv[2]
	outfile    = sys.argv[3]
except:
	sys.stderr.write("\tpython "+sys.argv[0]+" infile model_file outfile\n")
	sys.exit(-1)
gPosList  = []
transDict = {}
emitDict  = {}
loadModel(model_file, gPosList, transDict, emitDict)
predict(infile, gPosList, transDict, emitDict, outfile)

Run the prediction:

python predictByBiHMM.py ./data/train.0 ./data/model.BiHMM.0 ./data/train.0.BiHMM.predict

An excerpt of the generated prediction file train.0.BiHMM.predict:

19980101-01-001-001/m 迈向/v 充满/v 希望/v 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/q ）/w
中国/ns 与/p 周边/n 国家/n 和/c 广大/b 发展中国家/l 的/u 友好/a 合作/vn 进一步/d 加强/v 。/w
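
The dynamic program in predict4one is the Viterbi algorithm. A compact, self-contained sketch on a toy two-tag model follows. All probabilities below are invented for illustration, and the function name `viterbi` and its dictionary layout are assumptions rather than the script's own interface.

```python
import math

def viterbi(words, tags, trans, emit):
    """Most probable tag sequence; trans and emit hold log-probabilities."""
    if not words:
        return []

    def emit_lp(tag, word):
        return emit[tag].get(word, emit[tag]["__NEW__"])

    # best[i][tag] = (log-prob of the best path ending in tag at position i, back-pointer)
    best = [{t: (trans["__start__"][t] + emit_lp(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + trans[p][t])
            column[t] = (best[-1][prev][0] + trans[prev][t] + emit_lp(t, word), prev)
        best.append(column)
    # Close the path with the transition into __end__, which emits nothing.
    last = max(tags, key=lambda t: best[-1][t][0] + trans[t]["__end__"])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]

lg = math.log
tags = ["n", "v"]
trans = {
    "__start__": {"n": lg(0.8), "v": lg(0.2)},
    "n": {"n": lg(0.3), "v": lg(0.5), "__end__": lg(0.2)},
    "v": {"n": lg(0.6), "v": lg(0.1), "__end__": lg(0.3)},
}
emit = {
    "n": {"中国": lg(0.5), "建设": lg(0.2), "高铁": lg(0.29), "__NEW__": lg(0.01)},
    "v": {"建设": lg(0.9), "__NEW__": lg(0.1)},
}
print(viterbi(["中国", "建设", "高铁"], tags, trans, emit))  # ['n', 'v', 'n']
```

Each column costs one pass over all tag pairs, so decoding a sentence of T words with K tags takes on the order of T × K × K steps.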

3.3 Evaluation

Run the evaluation:

python resultEval.py ./data/train.0 ./data/train.0.BiHMM.predict ./data/train.0.BiHMM.eval

An excerpt of the evaluation file train.0.BiHMM.eval:

(Prediction accuracy is about 95%.)

pos_all	1.0000	0.9541	0.9541
pos_n	0.2086	0.9790	0.9815
pos_v	0.1615	0.9331	0.8918
pos_w	0.1482	0.9907	0.9999
pos_u	0.0661	0.9901	0.9905
pos_m	0.0531	0.9855	0.9746
pos_d	0.0427	0.9442	0.9530
pos_vn	0.0408	0.7667	0.8178
pos_p	0.0366	0.8887	0.9410
pos_a	0.0298	0.9062	0.8951

3.4 Visualizing the Results

A shell script runs the batch (the full run took more than a day):

# Change PYTHON to the interpreter path on your machine.
# A variable is used because bash does not expand aliases in non-interactive scripts by default.
PYTHON=/usr/local/bin/python3.7
for ((i=0; i<=17; i++))
do
	# step 2: BiHMM model
	# step 2.1: train the model
	"$PYTHON" trainByBiHMM.py ./data/train.${i} ./data/model.BiHMM.${i}
	# step 2.2: evaluate on the training set
	"$PYTHON" predictByBiHMM.py ./data/train.${i} ./data/model.BiHMM.${i} ./data/train.${i}.BiHMM.predict
	"$PYTHON" resultEval.py ./data/train.${i} ./data/train.${i}.BiHMM.predict ./data/train.${i}.BiHMM.eval
	# step 2.3: evaluate on the development set
	"$PYTHON" predictByBiHMM.py ./data/dev.txt ./data/model.BiHMM.${i} ./data/dev.${i}.BiHMM.predict
	"$PYTHON" resultEval.py ./data/dev.txt ./data/dev.${i}.BiHMM.predict ./data/dev.${i}.BiHMM.eval
	# step 2.4: evaluate on the test set
	"$PYTHON" predictByBiHMM.py ./data/test.txt ./data/model.BiHMM.${i} ./data/test.${i}.BiHMM.predict
	"$PYTHON" resultEval.py ./data/test.txt ./data/test.${i}.BiHMM.predict ./data/test.${i}.BiHMM.eval
done
echo "FINISH !!!"

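As before, read the third or fourth field of the first line of every .eval file and plot accuracy against corpus size.

Compared with the roughly 90% accuracy of the maximum-probability model, the bigram HMM raises prediction accuracy to about 94.5%. Accuracy still improves as the corpus grows, and the rate of improvement likewise levels off.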

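4. Discussion

With a small amount of data, how does each model (maximum probability, bigram HMM, trigram HMM) perform, and where do the differences come from?

Answer: the maximum-probability model is less accurate than the BiHMM, for two reasons.
1. The maximum-probability model needs many parameters (number of words × about 40 tags), whereas the BiHMM's transition table has only about 40 × 40 entries. Trained on the same corpus, a model with fewer parameters is estimated more reliably.
2. The BiHMM uses context when predicting, which makes it more accurate.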

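As the corpus grows, what does each model's performance curve look like? What problem does more data solve?

Answer: with a small corpus the maximum-probability model has low accuracy and therefore plenty of room to improve, so its accuracy rises quickly as data is added. The BiHMM is already fairly accurate on a small corpus, so its accuracy rises more slowly.
More data addresses the sufficiency of the statistics: the more thoroughly events are counted, the closer the estimates come to the true probability distribution. With a small corpus the counts are insufficient and the estimated distribution may not match reality; as the corpus grows, the estimates approach the true distribution and accuracy improves.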

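Which limitations of the models cause performance to hit a ceiling?

Answer: the models are inherently weak on certain kinds of words. Take numerals (m): the model may have learned that 20200112 is a date, but a different date such as 20200113 is unknown to it and is not recognized as a date. Person names, place names, and the like cause the same problem, and the model tags them poorly whenever it meets them.
The other cause is lexical ambiguity. Compare 中国/建设/高铁 ("China builds high-speed rail") with 中国/建设/银行 ("China Construction Bank"): 建设 is a verb in the first and a verbal noun in the second, yet both are preceded by the noun 中国. Even a model that matches the true distribution very closely outputs only the most probable path, for example tagging 建设 as a verb in both cases. That is where the model hits its ceiling.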

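How should mis-tagged words be categorized?

Answer: collect the words the model does not recognize at prediction time and analyze them. Numerals that are dates, for instance, are almost never recognized, so one option is a regular expression for date formats: a numeral of the form XXXX-XX-XX is tagged as a date. Person names are another case: a word that begins with a surname could be grouped with the following few characters as a name. Harder cases remain, though, because a word that begins with a surname is not necessarily a name, for example 陈酒 (aged wine).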
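
A rule of this kind could look like the sketch below. The pattern, the helper name `guess_tag_for_unknown`, and the choice of `t` as the target tag (the corpus labels time expressions with t) are all assumptions for illustration; a bare 8-digit number is of course not guaranteed to be a date.

```python
import re

# Hypothetical pattern: 8 digits (20200113) or XXXX-XX-XX, ASCII or full-width digits.
DATE_LIKE = re.compile(r"^(?:[0-9０-９]{8}|[0-9０-９]{4}-[0-9０-９]{2}-[0-9０-９]{2})$")

def guess_tag_for_unknown(word, fallback="unknown"):
    """Return 't' (time) for date-shaped tokens, otherwise the fallback tag."""
    if DATE_LIKE.match(word):
        return "t"
    return fallback

for w in ["20200113", "2020-01-13", "陈酒"]:
    print(w, guess_tag_for_unknown(w))
```

Such a rule would only be consulted for words the model has never seen, so it cannot override tags learned from the corpus.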

How can decoding be made faster?

Answer: avoid nested for loops and reuse existing tools wherever possible, for example NumPy for matrix operations.
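
As a rough illustration of the NumPy route, the sketch below performs one column update of the Viterbi recursion with array operations instead of two nested loops over tags. It assumes the tags have been mapped to indices 0..K-1 and the log-probabilities stored in arrays; the function name `viterbi_step` and the toy numbers are invented for the example.

```python
import numpy as np

def viterbi_step(prev_scores, trans, emit_col):
    """One Viterbi column update without Python loops over tags.

    prev_scores: (K,)   best log-scores at the previous position
    trans:       (K, K) log P(tag_j | tag_i), rows index the previous tag
    emit_col:    (K,)   log P(current word | tag_j)
    """
    candidates = prev_scores[:, None] + trans     # (K, K): every previous -> current pair
    back_pointers = candidates.argmax(axis=0)     # best previous tag for each current tag
    scores = candidates.max(axis=0) + emit_col
    return scores, back_pointers

prev = np.log(np.array([0.4, 0.02]))
trans = np.log(np.array([[0.3, 0.5],
                         [0.6, 0.1]]))
emit_col = np.log(np.array([0.2, 0.9]))
scores, back = viterbi_step(prev, trans, emit_col)
print(np.exp(scores), back)
```

The broadcasted addition builds all K × K candidate scores at once, so the per-word work moves from interpreted Python into compiled array code.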

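Label bias and probability smoothing

Answer: a suitable smoothing algorithm has to be chosen. Events that never occurred in training still need a probability so that the model stays close to reality.
Crude method: add 1 to every count. Drawback: tags with very few observations receive a large emission probability for unseen words, which can make a path through them the most probable overall and so cause wrong predictions.
Example: suppose the tag Rg occurs only once in the text, with the word 斯 (as in 逝者如斯夫). Under add-one smoothing, 斯 gets a count of 1 + 1 = 2 and an unseen word gets 0 + 1 = 1, so whenever the current tag is Rg and the word is not 斯, the emission probability is 1/3. That is a very large probability, enough to beat every other path, and the predicted tags end up being Rg everywhere. Choosing a suitable smoothing algorithm therefore matters.
Variant of add-one: consider a transition pos1 (n, noun) --> pos2 (v, verb). When pos1 is a noun, pos2 can be any of about 40 tags, but some of these transitions have a count of 0. In that case add p to the count of every possible pos2, where

p = num(pos1)/num(total)

An analogy is poverty relief: giving every province 100 million is the crude add-one method, while allocating by share of the poor population, so that provinces with more people in need receive more, is the variant method, which works better.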
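
The difference between the two schemes can be made concrete with a small calculation. The sketch below computes the log-probability that a tag emits an unseen word, using the same denominator as the emission loop in out4model; the corpus size and the per-tag counts are invented round numbers, and the helper name `unseen_log_prob` is hypothetical.

```python
import math

def unseen_log_prob(tag_count, n_types, smoothing):
    """Log-probability that a tag emits an unseen word under additive smoothing.

    tag_count: tokens observed with this tag; n_types: distinct words seen with it.
    """
    # every seen word and the __NEW__ slot each receive +smoothing
    denom = tag_count + smoothing * (n_types + 1)
    return math.log(smoothing / denom)

total = 400_000   # hypothetical corpus size in tokens
tag_stats = {
    "rare":   {"tag_count": 1,      "n_types": 1},       # a tag seen once, like Rg
    "common": {"tag_count": 80_000, "n_types": 20_000},  # a frequent tag, like n
}

results = {}
for name, s in tag_stats.items():
    add_one = unseen_log_prob(smoothing=1.0, **s)
    proportional = unseen_log_prob(smoothing=s["tag_count"] / total, **s)
    results[name] = (add_one, proportional)
    print(name, round(add_one, 3), round(proportional, 3))
```

Under add-one the rare tag is far more willing to emit an unknown word than the common tag, which is the failure described above; with the proportional factor the two values end up close together, so the rare tag no longer wins by default.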

Originally published: 2019/12/20