# HanLP 关键词提取算法分析详解

l 参考论文：《TextRank: Bringing Order into Texts》

l TextRank算法提取关键词的Java实现

l TextRank算法自动摘要的Java实现这篇文章中作者大概解释了一下TextRank公式

1. 论文

In this paper, we introduce the TextRank graphbased ranking model for graphs extracted from natural

language texts

TextRank是一个非监督学习算法，它将文本中构造成一个图，将文本中感兴趣的东西(比如分词)当成一个个顶点，然后应用TextRank算法来抽取文本中的一些信息。

Such keywords may constitute useful entries for building an automatic index for a document collection, can be used to classify a text, or may serve as a concise summary for a given document.

TextRank通过不断地迭代来提取关键词，每一轮迭代，算法给图中的顶点打分。直到满足某个条件（比如说迭代次数克到200次，或者设置的某个参数达到一个阈值）为止。

For loosely connected graphs, with the number of edges proportional with the number of vertices,

undirected graphs tend to have more gradual convergence curves.

It may be therefore useful to indicate and incorporate into the model the “strength”

of the connection between two vertices $V_i$ and $V_j$ as a weight $w_{ij}$ added to the corresponding edge that connects the two vertices.

2. 源码实现

2.1 关键词提取流程

The vertices added to the graph can be restricted with syntactic filters, which select only lexical units of a certain part of speech. One can for instance consider only nouns and verbs for addition to the graph, and consequently draw potential edges based only on relations that can be established between nouns and verbs.

it is the application that dictates the type of relations that are used to draw connections between any two such vertices,

2.2 根据窗口大小确定词的邻接点

2.3 得分(score)的更新算法

m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));代码的解读：

m.get(key)如果是第一次进入for (String element : value)，则是拿到公式前半部分1-d的结果；如果是已经在for (String element : value)进行了迭代，for循环相当于求和：Σvj∈In(vi)Σvj∈In(vi)

for (String element : value) {

int size = words.get(element).size();

m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));

}

m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)))

m.get(key)为1-d，因为在外层for循环中，m.put(key, 1 - d)已经公式的前半分部(1-d)存储了。

score.get(element) == null ? 0 : score.get(element)这个是获取上一轮迭代的结果。对于初始第一轮迭代而言，score.get(element)为0.8807971，这个值是每个顶点的得分初始值：

//依据TF来设置初值,  words 代表的是 一张 无向图

for (Map.Entry<String, Set<String>> entry : words.entrySet()) {

score.put(entry.getKey(), sigMoid(entry.getValue().size()));//无向图的每个顶点 得分值 初始化

}

score.get(element)相当于公式中的WS(Vj)WS(Vj)

In('理')={'确实'，'说'}

if (max_diff <= min_diff)

break;

for (String w : wordList) {

if (!words.containsKey(w)) {

//排除了 wordList 中的重复term, 对每个已去重的term, 用 TreeSet<String> 保存该term的邻接顶点

words.put(w, new TreeSet<String>());

}

// 复杂度O(n-1)

if (que.size() >= 5) {

//窗口的大小为5,是写死的. 对于一个term_A而言, 它的前4个term、后4个term 都属于term_A的邻接点

que.poll();

}

for (String qWord : que) {

if (w.equals(qWord)) {

continue;

}

//既然是邻居,那么关系是相互的,遍历一遍即可

}

que.offer(w);

}

Map<String, Float> score = new HashMap<String, Float>();//保存最终每个关键词的得分

//依据TF来设置初值,  words 代表的是 一张 无向图

for (Map.Entry<String, Set<String>> entry : words.entrySet()) {

score.put(entry.getKey(), sigMoid(entry.getValue().size()));//无向图的每个顶点 得分值 初始化

}

for (int i = 0; i < max_iter; ++i) {

//....

for (Map.Entry<String, Set<String>> entry : words.entrySet()) {

//...

for (String element : value) {

WS(vi)=(1−d)+d∗ΣVj∈In(Vi)wjiΣVk∈Out(Vj)wjk∗WS(Vj)

WS(vi)=(1−d)+d∗ΣVj∈In(Vi)wjiΣVk∈Out(Vj)wjk∗WS(Vj)

[理, 确实, 说]

155 篇文章27 人订阅

0 条评论

## 相关文章

2556

1682

### cf932E. Team Work(第二类斯特灵数 组合数)

$$m^n = \sum_{i = 0}^m C_{n}^i S(n, i) i!$$

1074

### Python Algorithms - C7 Greedy

Python算法设计篇(7) Chapter 7: Greed is good? Prove it!

1062

3015

1453

1.8K6

1584

1.4K4

1103