基于词向量的文本查重

AI拉呱

发布于 2021-01-14 10:25:41

1.1K00

代码可运行

运行总次数：0

代码可运行

基于词向量的文本查重

import gensim
import numpy as np
import jieba
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
# stop_text = open('stop_list.txt', 'r')
# stop_word = []
# for line in stop_text:
#     stop_word.append(line.strip())
TaggededDocument = gensim.models.doc2vec.TaggedDocument

def get_corpus():

    with open("corpus_seg.txt", 'r') as doc:
        docs = doc.readlines()
    train_docs = []
    for i, text in enumerate(docs):
        word_list = text.split(' ')
        length = len(word_list)
        word_list[length - 1] = word_list[length - 1].strip()
        document = TaggededDocument(word_list, tags=[i])
        train_docs.append(document)
    return train_docs

def train(x_train, size=200, epoch_num=1):
    model_dm = Doc2Vec(x_train, min_count=1, window=3, size=size, sample=1e-3, negative=5, workers=4)
    model_dm.

本文参与腾讯云自媒体同步曝光计划，分享自作者个人站点/博客。

原始发表：2019/04/30 ，如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自作者个人站点/博客前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

登录后参与评论

0 条评论

热度

基于词向量的文本查重

基于词向量的文本查重

基于词向量的文本查重

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐