# Document classification with jieba, TfidfVectorizer, and LogisticRegression

jieba ("结巴", literally "stutter") is a Chinese word-segmentation tool; official documentation: https://github.com/fxsjy/jieba. TfidfVectorizer is a bag-of-words vectorization model used to turn article text into feature vectors; official documentation: http://sklearn.apachecn.org/cn/0.19.0/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. LogisticRegression is a basic and widely used classification method.
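Before working through the Sohu news corpus below, here is a minimal, self-contained sketch of how two of the three tools fit together. The toy English sentences and the category names are made up for illustration (English avoids the segmentation step, which jieba handles for Chinese later in this article):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: four tiny "articles" with known categories.
docs = ['stock market price rises today',
        'team wins the football match',
        'market shares fall on bad news',
        'football player scores late goal']
labels = ['finance', 'sports', 'finance', 'sports']

# Text -> sparse TF-IDF feature matrix.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# TF-IDF features -> logistic regression classifier.
clf = LogisticRegression()
clf.fit(X, labels)

# A new document is vectorized with transform (not fit_transform),
# then classified.
print(clf.predict(tfidf.transform(['market price falls'])))
```

The same three-step pattern (vectorize, fit, transform-then-predict) is exactly what the rest of the article applies to the real corpus.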

## 0. Launch Jupyter


After running the command in PowerShell, the Jupyter page opens in the browser automatically; click the button on that page to create a new notebook.


## 1. Data preparation


```python
import pandas as pd

# The training-set filename is assumed here to mirror 'sohu_test.txt' below.
train_df = pd.read_csv('sohu_train.txt', sep='\t', header=None)
```


```python
for name, group in train_df.groupby(0):
    print(name, len(group))
```


```python
test_df = pd.read_csv('sohu_test.txt', sep='\t', header=None)
for name, group in test_df.groupby(0):
    print(name, len(group))
```


```python
with open('stopwords.txt', encoding='utf8') as file:
    stopWord_list = [k.strip() for k in file.readlines()]
```

## 2. Word segmentation

```python
import jieba
import time

train_df.columns = ['分类', '文章']
# A set makes the stopword membership test O(1) instead of scanning a list.
stopword_set = set(k.strip() for k in open('stopwords.txt', encoding='utf8').readlines() if k.strip() != '')
cutWords_list = []
i = 0
startTime = time.time()
for article in train_df['文章']:
    cutWords = [k for k in jieba.cut(article) if k not in stopword_set]
    i += 1
    if i % 1000 == 0:
        print('前%d篇文章分词共花费%.2f秒' % (i, time.time() - startTime))
    cutWords_list.append(cutWords)
```

```python
# encoding='utf8' avoids encode errors from the Windows default codec.
with open('cutWords_list.txt', 'w', encoding='utf8') as file:
    for cutWords in cutWords_list:
        file.write(' '.join(cutWords) + '\n')
```

```python
with open('cutWords_list.txt', encoding='utf8') as file:
    cutWords_list = [k.split() for k in file.readlines()]
```

## 3. The TfidfVectorizer model

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The first positional argument of TfidfVectorizer is `input`
# ('filename'/'file'/'content'), not the corpus; the corpus is passed
# to fit_transform in the next section instead.
tfidf = TfidfVectorizer(stop_words=stopWord_list, min_df=40, max_df=0.3)
```

## 4. Feature engineering


```python
# Fit on the space-joined segmented articles so jieba's tokens are kept,
# rather than on the raw unsegmented text; vocabulary_ only exists after
# fitting, so the size is printed afterwards.
X = tfidf.fit_transform(' '.join(k) for k in cutWords_list)
print('词表大小:', len(tfidf.vocabulary_))
print(X.shape)
```

## 5. Model training

### 5.1 Label encoding

```python
from sklearn.preprocessing import LabelEncoder

# The columns were renamed to ['分类', '文章'] above, so they are
# accessed by name rather than by integer position.
labelEncoder = LabelEncoder()
y = labelEncoder.fit_transform(train_df['分类'])
y.shape
```
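LabelEncoder simply assigns each distinct category name an integer code (in sorted order of the names), and `inverse_transform` maps codes back to names, which is what makes predictions readable later. A small sketch with made-up category names:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['体育', '财经', '体育', '娱乐'])
print(codes)                        # integer code for each sample
print(le.classes_)                  # distinct names, sorted
print(le.inverse_transform(codes))  # back to the original names
```

Note that the codes follow the sorted order of `classes_`, not the order in which the names first appear.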

### 5.2 Logistic regression

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)
logistic_model.score(test_X, test_y)
```

0.8754166666666666

### 5.3 Saving the model

```python
import pickle

with open('tfidf.model', 'wb') as file:
    save = {
        'labelEncoder': labelEncoder,
        'tfidfVectorizer': tfidf,
        'logistic_model': logistic_model
    }
    pickle.dump(save, file)
```

### 5.4 Cross-validation

```python
import pickle

with open('tfidf.model', 'rb') as file:
    tfidf_model = pickle.load(file)
tfidfVectorizer = tfidf_model['tfidfVectorizer']
labelEncoder = tfidf_model['labelEncoder']
logistic_model = tfidf_model['logistic_model']
```

```python
# Transform the segmented training articles with the loaded vectorizer,
# matching how it was fitted above.
X = tfidfVectorizer.transform(' '.join(k) for k in cutWords_list)
y = labelEncoder.transform(train_df['分类'])
```

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

logistic_model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
cv_split = ShuffleSplit(n_splits=5, test_size=0.3)
score_ndarray = cross_val_score(logistic_model, X, y, cv=cv_split)
print(score_ndarray)
print(score_ndarray.mean())
```

[0.86819444 0.87430556 0.86861111 0.87       0.87430556]
0.8710833333333333

## 6. Model evaluation

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
import pandas as pd

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2)
logistic_model = LogisticRegressionCV(multi_class='multinomial', solver='lbfgs')
logistic_model.fit(train_X, train_y)
predict_y = logistic_model.predict(test_X)
pd.DataFrame(confusion_matrix(test_y, predict_y),
             columns=labelEncoder.classes_,
             index=labelEncoder.classes_)
```


```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def eval_model(y_true, y_pred, labels):
    # Per-class precision, recall, F1 and support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # Support-weighted overall averages
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        'Label': labels,
        'Precision': p,
        'Recall': r,
        'F1': f1,
        'Support': s
    })
    res2 = pd.DataFrame({
        'Label': ['总体'],
        'Precision': [tot_p],
        'Recall': [tot_r],
        'F1': [tot_f1],
        'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]

predict_y = logistic_model.predict(test_X)
eval_model(test_y, predict_y, labelEncoder.classes_)
```


## 7. Model testing

```python
import jieba

# Segment the test articles the same way as the training set before
# vectorizing, so they pass through the same pipeline.
test_cutWords = [' '.join(k for k in jieba.cut(article) if k not in stopword_set)
                 for article in test_df[1]]
test_X = tfidfVectorizer.transform(test_cutWords)
test_y = labelEncoder.transform(test_df[0])
predict_y = logistic_model.predict(test_X)
eval_model(test_y, predict_y, labelEncoder.classes_)
```
