Day 10. How to Build a Word Cloud from Mao Buyi's Songs

DataScience

Published 2020-06-10 16:12:54


This article is included in the column: A2Data

Python Word Clouds

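Today we will build a data visualization project.

We often need to extract the most frequent words from the data we analyze and display them as a word cloud. For example, some internet companies collect user profile data, or the keywords of each day's discussion topics, and present them as word clouds.

Or, if you like a particular singer and want to know which words appear most often in that singer's songs, a word cloud is a good tool for the job.

In today's hands-on project there are three goals:

1. Learn to use a word cloud tool and present the result visually;

2. Learn to write a Python crawler that scrapes data from web pages;

3. Learn to use XPath to locate and extract the elements you need.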

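How to make a word cloud

First, we need to understand what a word cloud is. A word cloud, also called a text cloud, counts the words that occur most frequently in a text, filters out unwanted common words (here, credit labels such as "作词" (lyrics by) and "作曲" (composed by)), and visualizes the important keywords so that the reader can grasp the main points of the text quickly. It also looks reasonably attractive.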
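
The counting-and-filtering idea described above can be sketched with the standard library alone. This is only an illustration of the concept, not how the WordCloud library works internally; the function name top_words and the token list are made up for the example, and the tokens are assumed to be already segmented (later in the article, jieba does the segmentation).

```python
from collections import Counter


def top_words(tokens, stop_words, n=2):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter(t for t in tokens if t and t not in stop_words)
    return counts.most_common(n)


# Hypothetical, already-segmented input
tokens = ["作词", "像", "我", "这样", "的", "人", "像", "我", "作曲", "我"]
# Words we do not want in the cloud, such as credit labels
stop_words = {"作词", "作曲", "的"}

result = top_words(tokens, stop_words)
print(result)
```

A word cloud then draws each surviving word at a size proportional to its count.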

Python has a word cloud library, WordCloud. After installing it with pip install wordcloud, you can create a word cloud object. The constructor is used as follows:

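wc = WordCloud(
    background_color='white',  # background color
    mask=background_image,  # mask image (a NumPy array) that defines the shape of the cloud
    font_path='./SimHei.ttf',  # font file; Chinese text needs a Chinese font, otherwise it renders as garbled boxes
    max_words=100,  # maximum number of words to display
    stopwords=STOPWORDS,  # stop words to exclude
    max_font_size=150,  # maximum font size
    width=2000,  # canvas width
    height=1200,  # canvas height
    random_state=30  # seed for the random number generator, so layout and colors are reproducible
)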

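WordCloud accepts more constructor parameters than these; the code above shows only the main ones.

Once the WordCloud object is created, call wordcloud = wc.generate(text) to generate the word cloud, where text is the text you want to analyze. Then call wordcloud.to_file("a.jpg") to save the resulting image directly to a file.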

You can also display it with Matplotlib, Python's plotting library, as follows:

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Note that we do not need the X and Y axes here; plt.axis("off") turns them off.

Now that we know how to use WordCloud, let's visualize the titles of the earlier lessons in this course as a word cloud. The code is as follows:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba

# Lesson titles; the trailing backslashes join the lines into one long string
f = '数据可视化概述\
数据可视化基础语法\
五种常见图形绘制\
五种扩展图形绘制\
可视化基础数据分析-NumPy入门指南(一)\
数据分析-NumPy入门指南(二)\
数据挖掘初探：亲和性分析-商品推荐\
利用pandas做数据处理(一)\
利用pandas做数据处理(二)\
十分钟掌握python操作excel秘诀\
数据采集-爬虫\
数据清洗\
数据集成与转换\
数据可视化：给毛不易的歌词做词云展示'


# Generate the word cloud
def create_word_cloud(f):
    print('Computing the word cloud from word frequencies')
    # Segment the Chinese text and join the words with spaces so WordCloud can split them
    text = ' '.join(jieba.cut(f, cut_all=False, HMM=True))
    wc = WordCloud(
        font_path=r"C:\Windows\Fonts\SimHei.ttf",  # raw string keeps the backslashes literal
        max_words=100,
        width=2000,
        height=1200,
    )
    wordcloud = wc.generate(text)
    # Save the word cloud image
    wordcloud.to_file("wordcloud.jpg")
    # Display the word cloud
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


create_word_cloud(f)

Output

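Making a word cloud from Mao Buyi's lyrics:

Suppose we want to make a word cloud from Mao Buyi's lyrics. How do we go about it? Let's first lay out the workflow of the whole project:

Preparation phase: we use a Python crawler to fetch the HTML, parse the song IDs and titles with XPath, then fetch the lyrics of each song through the NetEase Cloud Music API, and finally merge all the lyrics into a single variable.

Word cloud phase: we create a WordCloud object, feed it the collected lyrics, and visualize the result.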
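
Before running the full crawler, the XPath step of the preparation phase can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, as a stand-in for lxml. The markup is a hypothetical, simplified fragment assumed to resemble the hot-song list on an artist page, and parse_hot_songs is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real page is larger and needs a
# lenient HTML parser such as lxml's etree.HTML.
SAMPLE = """
<div>
  <ul id="hotsong-list">
    <li><a href="/song?id=1001">Song A</a></li>
    <li><a href="/song?id=1002">Song B</a></li>
  </ul>
  <a href="/other">Not a song</a>
</div>
"""


def parse_hot_songs(markup):
    """Return (song_id, song_name) pairs from the hot-song list."""
    root = ET.fromstring(markup)
    prefix = "/song?id="
    songs = []
    # Find every <a> below the element whose id is "hotsong-list"
    for link in root.findall(".//*[@id='hotsong-list']//a"):
        href = link.get("href", "")
        if href.startswith(prefix):
            # Drop the "/song?id=" prefix, leaving only the numeric ID
            songs.append((href[len(prefix):], link.text))
    return songs


songs = parse_hot_songs(SAMPLE)
print(songs)
```

Slicing by len(prefix) after checking startswith is a slightly more defensive version of cutting a fixed number of characters from the href, since links that are not song links are skipped instead of producing a bogus ID.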

# NetEase Cloud Music: generate a word cloud for an artist, given the artist ID
import re

import requests
import jieba
import matplotlib.pyplot as plt
from lxml import etree
from wordcloud import WordCloud

headers = {
    'Referer': 'http://music.163.com',
    'Host': 'music.163.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Chrome/10'
    }

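# Fetch the lyrics of one song
def get_song_lyric(headers, lyric_url):
    res = requests.get(lyric_url, headers=headers)
    data = res.json()
    if 'lrc' in data:
        lyric = data['lrc']['lyric']
        # Strip digits, colons, periods and square brackets,
        # which removes LRC timestamps such as [00:12.34]
        new_lyric = re.sub(r'[\d:.[\]]', '', lyric)
        return new_lyric
    else:
        # No lyrics in the response: print it for debugging and return an empty string
        print(data)
        return ''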

def remove_stop_words(f):
    # Credit labels and other words that should not appear in the cloud
    stop_words = {"作词", "作曲", "编曲", "Arranger", "录音", "混音", "人声", "Vocal", "弦乐", "Keyboard", "键盘", "编辑", "不易"}
    for stop_word in stop_words:
        f = f.replace(stop_word, '')
    return f

# Generate the word cloud
def create_word_cloud(f):
    print('Generating the word cloud from word frequencies!')
    f = remove_stop_words(f)
    # Segment the Chinese text and join the words with spaces
    cut_text = " ".join(jieba.cut(f, cut_all=False, HMM=True))
    wc = WordCloud(
        font_path=r"C:\Windows\Fonts\FZSTK.TTF",  # raw string keeps the backslashes literal
        max_words=100,
        width=2000,
        height=1200,
    )
    print(cut_text)
    wordcloud = wc.generate(cut_text)
    # Save the word cloud image
    wordcloud.to_file("wordcloud.jpg")
    # Display the word cloud
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


# Get the IDs and names of the top 50 hot songs on an artist's page
def get_songs(artist_id):
    page_url = 'https://music.163.com/artist?id=' + artist_id
    # Fetch the page HTML
    res = requests.get(page_url, headers=headers)
    # Parse the top 50 hot songs with XPath
    html = etree.HTML(res.text)
    href_xpath = "//*[@id='hotsong-list']//a/@href"
    name_xpath = "//*[@id='hotsong-list']//a/text()"
    hrefs = html.xpath(href_xpath)
    names = html.xpath(name_xpath)
    # Collect the hot song IDs and names
    song_ids = []
    song_names = []
    for href, name in zip(hrefs, names):
        # Each href looks like "/song?id=<ID>"; drop the 9-character prefix
        song_ids.append(href[9:])
        song_names.append(name)
        print(href, ' ', name)
    return song_ids, song_names


# Artist ID; Mao Buyi's is 12138269
artist_id = '12138269'
song_ids, song_names = get_songs(artist_id)
# All lyrics, merged into one string
all_word = ''
# Fetch the lyrics of each song
for song_id, song_name in zip(song_ids, song_names):
    # Lyric API URL
    lyric_url = 'http://music.163.com/api/song/lyric?os=pc&id=' + song_id + '&lv=-1&kv=-1&tv=-1'
    lyric = get_song_lyric(headers, lyric_url)
    all_word = all_word + ' ' + lyric
    print(song_name)
# Generate the word cloud from word frequencies
create_word_cloud(all_word)

Output
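
One detail worth checking is the lyric-cleaning regex in get_song_lyric. It removes every digit, colon, period and square bracket, which strips the timestamps but also deletes those characters wherever they occur in the lyrics themselves. Below is a minimal sketch of a narrower alternative, assuming the API returns LRC-style timestamps of the form [mm:ss.xx]. The sample string and both helper names are made up for illustration.

```python
import re

# Hypothetical LRC-style sample, not real API output
sample = "[00:00.00]作词 : 某人\n[00:12.34]第2首歌 3:00 见\n"


def strip_all(lyric):
    """Character-class approach: drop all digits, colons, periods and brackets."""
    return re.sub(r'[\d:.[\]]', '', lyric)


def strip_timestamps(lyric):
    """Narrower alternative: drop only [mm:ss] or [mm:ss.xx] tags."""
    return re.sub(r'\[\d+:\d+(?:\.\d+)?\]', '', lyric)


print(repr(strip_all(sample)))
print(repr(strip_timestamps(sample)))
```

For a word cloud the difference is usually minor, since stray digits and punctuation rarely dominate the frequencies, but the narrower pattern keeps numbers that are part of the lyrics.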