Using jieba in Python to Turn Text Data into a Word-Frequency List and Finally Generate a Word Cloud

IT不难

Published 2022-03-12 00:57:19

This article is part of the column: IT不难技术家园

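Preface

I have been running my own order-taking system for more than half a year. It has accumulated nearly ten thousand records of development requests. I wanted to analyze them myself to see what most of the requests ask for, and whether any new business opportunities can be dug out of them.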

Exporting the title data from the database

select task_title from task_requirements where UNIX_TIMESTAMP(task_addtime) > UNIX_TIMESTAMP('2022-03-10');

Save the result to r.txt

Processing the text with Python

As a programmer, my first instinct was to do the processing myself, so I knocked out a Python script to handle the data.

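Code structure

tree 
|____TextAnalytics       //project directory
| |____output            //output data
| |____setting.py        //configuration file
| |____README.md         //documentation
| |____common.py         //main module
| |____lib               //word lists
| | |____停用词.csv       //stop words
| | |____无效词.csv       //invalid words
| | |____保留词.csv       //keep words
| |____main.py           //entry point
| |____data              //data to be processed
| | |____r.txt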
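
The functions in this post refer to three path variables (saveFilePath, voidFilePath and stopFilePath) without showing where they are defined. The sketch below is one way setting.py might define them. It is hypothetical: the real setting.py is not shown, and both the LIB_DIR name and the mapping of each variable to a word-list file are assumptions inferred from the file names in the tree above.

```python
# setting.py: hypothetical sketch, the real file is not shown in this post.
import os

# Directory holding the word lists, relative to the project root.
# Assumption: relative paths, like the 'data/...' path used by the cleaning code.
LIB_DIR = 'lib'

# Assumed mapping of the variable names used in this post to the files in lib/.
stopFilePath = os.path.join(LIB_DIR, '停用词.csv')  # stop words
voidFilePath = os.path.join(LIB_DIR, '无效词.csv')  # invalid words
saveFilePath = os.path.join(LIB_DIR, '保留词.csv')  # keep words
```

With a file like this, common.py could pull the names in with `from setting import stopFilePath, voidFilePath, saveFilePath`, provided the script is started from the project directory.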

Main functions

Loading and cleaning the data

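import re
import jieba

# saveFilePath and voidFilePath are not defined in this snippet;
# they are presumably provided by the project configuration (setting.py).

def _cleanSourceText(sFile):
    '''
    Clean the raw text file and return the titles worth keeping.
    '''
    sourceFile = 'data/{}'.format(sFile)
    # Keyword lists: keep words and invalid words, one per line, blank lines ignored
    with open(saveFilePath, encoding='utf-8') as f:
        savewords = {line.strip() for line in f if line.strip()}
    with open(voidFilePath, encoding='utf-8') as f:
        voidwords = {line.strip() for line in f if line.strip()}

    # Matches everything except Chinese characters, ASCII letters and digits.
    # Only the leading ^ negates the class; a ^ anywhere else is a literal caret.
    pattern = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9]')

    kept = []
    with open(sourceFile, 'r', encoding='utf-8') as sf:
        for line in sf:
            # Strip punctuation, whitespace and other symbols from the title
            string = pattern.sub('', line)
            tag = 0
            seg = jieba.cut(string, cut_all=False)
            # Keep only titles that contain a keyword
            for word in seg:
                # An invalid word rejects the whole title
                if word in voidwords:
                    tag = 0
                    break

                # A keep word (or an empty keep list) marks the title as wanted
                if not savewords or word in savewords:
                    tag = 1

            # Keep titles that are 7 to 13 characters long
            if tag == 1 and 6 < len(string) < 14:
                kept.append(string)

        print('Source data file {} processed!'.format(sFile))

    # One title per line, each terminated by a newline
    return ''.join(s + '\n' for s in kept)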
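
To make the filtering rules easier to check in isolation, here is a minimal sketch that separates them from file reading and from jieba. The names normalize and keep_title are hypothetical helpers introduced only for illustration, and the token lists are written by hand rather than produced by jieba. Note that inside a regex character class only a leading `^` negates; a `^` in any other position is a literal caret, so the pattern here lists the ranges without extra carets.

```python
import re

# Matches everything except CJK ideographs (U+4E00..U+9FA5), ASCII letters and digits.
NON_WORD = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9]')


def normalize(line):
    """Remove punctuation, whitespace and symbols from one title."""
    return NON_WORD.sub('', line)


def keep_title(tokens, savewords, voidwords):
    """Decide whether a tokenized title passes the keyword rules.

    Any invalid word rejects the title. Otherwise the title needs at least
    one keep word; an empty keep list accepts every non-empty title.
    """
    if any(t in voidwords for t in tokens):
        return False
    if not savewords:
        return bool(tokens)
    return any(t in savewords for t in tokens)


cleaned = normalize('【急】开发 APP^v2.0!\n')
print(cleaned)  # 急开发APPv20

print(keep_title(['小程序', '开发'], {'小程序'}, {'招聘'}))  # True
print(keep_title(['小程序', '招聘'], {'小程序'}, {'招聘'}))  # False
```

The length condition from the cleaning function (more than 6 and fewer than 14 characters) would be applied to the normalized string after this check.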

Tokenizing the text

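# Relies on the re and jieba modules being imported; stopFilePath is not
# defined in this snippet and presumably comes from setting.py.
def _parseText(text):
    '''
    Tokenize the text and count word frequencies.
    '''
    # Replace every non-word character with a space
    text = re.sub(r'[^\w]', ' ', text)

    words = jieba.lcut(text)  # jieba.lcut() returns the tokens as a list
    # Load the stop words
    with open(stopFilePath, encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}

    words_dict = {}  # maps each word to its frequency
    for word in words:
        # Skip stop words and single-character tokens
        if word in stopwords or len(word) == 1:
            continue
        # get() returns 0 for a word not seen yet, so the count either starts at 1 or is incremented
        words_dict[word] = words_dict.get(word, 0) + 1

    # Sort by frequency, highest first
    words_dict_sorted = sorted(words_dict.items(), key=lambda kv: kv[1], reverse=True)

    # Return a list of (word, count) tuples
    return words_dict_sorted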
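
The counting and sorting step can also be written with collections.Counter from the standard library. The sketch below is an alternative formulation, not the code used in this project: count_words is a hypothetical name, and the token list is written by hand to stand in for the output of jieba.lcut().

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count tokens, skipping stop words and single-character tokens.

    Returns a list of (word, count) tuples, most frequent first.
    """
    counts = Counter(
        t for t in tokens
        if len(t) > 1 and t not in stopwords
    )
    return counts.most_common()


# Hand-written tokens standing in for jieba.lcut() output
tokens = ['小程序', '开发', '的', '小程序', '商城', '开发', '小程序', ' ']
stopwords = {'商城'}

result = count_words(tokens, stopwords)
print(result)  # [('小程序', 3), ('开发', 2)]
```

most_common() sorts by count in descending order, which matches the sorted(..., reverse=True) call in _parseText, and the resulting list of (word, count) pairs has the same shape.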