
Text Classification Techniques for Analyzing YouTube Data


Author | Rohit Agrawal

Source | Medium

Editor | 代码医生团队

Text classification is a classic problem that Natural Language Processing (NLP) aims to solve: analyzing the content of a piece of raw text and deciding which category it belongs to. It has a wide range of applications, such as sentiment analysis, topic labeling, spam detection, and intent detection.

Today I will take on a fairly simple task: classifying YouTube videos into different categories based on their titles and descriptions, using several techniques (Naive Bayes, Support Vector Machines, AdaBoost, and LSTM), and analyzing their performance. The categories were chosen to be (but are not limited to):

  • Travel blogs
  • Science and technology
  • Food
  • Manufacturing
  • History
  • Art and music

Collecting the Data

When working on a custom machine learning problem such as this, I find it very useful, if not simply satisfying, to collect my own data. For this problem, I need some metadata about videos belonging to different categories. You are welcome to collect the data manually and build the dataset yourself; I will be using the YouTube Data API v3, created by Google itself to interact with YouTube through code. Go to the Google Developer Console, create a sample project, and get started. The reason I chose this route is that I needed to collect thousands of samples, which I could not manage with other techniques.

Note: The YouTube API, like any other API offered by Google, works on a quota system. Depending on your plan, each email account is given a certain quota per day/month. On the free plan, I could only make around 2,000 requests to YouTube, which posed a bit of a problem, but I overcame it by using multiple email accounts.
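To keep a collection script running past a single account's quota, one option (not part of the original article) is to catch the quota error and switch to the next API key. The sketch below assumes the same google-api-python-client library used later in this post; the key list and rotation helper are hypothetical placeholders.

from apiclient.discovery import build
from googleapiclient.errors import HttpError

# Hypothetical list of API keys, one per Google account (placeholders)
api_keys = ['KEY_FROM_ACCOUNT_1', 'KEY_FROM_ACCOUNT_2']
current_key = 0
youtube_api = build('youtube', 'v3', developerKey=api_keys[current_key])

def execute_with_key_rotation(make_request):
    """Run a request; if the quota is exhausted (HTTP 403), move on to the next key."""
    global current_key, youtube_api
    while True:
        try:
            return make_request(youtube_api).execute()
        except HttpError as err:
            if err.resp.status == 403 and current_key + 1 < len(api_keys):
                current_key += 1
                youtube_api = build('youtube', 'v3', developerKey=api_keys[current_key])
            else:
                raise

# Example usage:
# res = execute_with_key_rotation(lambda yt: yt.search().list(q='travel vlogs', part='snippet', type='video', maxResults=50))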

The API documentation is quite straightforward. After using more than 8 email accounts to cover the required quota, I collected the following data and stored it in a .csv file. If you wish to use this dataset for your own project, you can download it here:

https://github.com/agrawal-rohit/Text-Classification-Analysis/blob/master/Collected_data_raw.csv

The raw collected data

from apiclient.discovery import build
import pandas as pd
 
# Data to be stored
category = []
no_of_samples = 1700
 
# Gathering Data using the Youtube API
api_key = "YOUR_API_KEY"    # replace with your own key from the Google Developer Console
youtube_api = build('youtube','v3', developerKey = api_key)
 
# Travel Data
tvl_titles = []
tvl_descriptions = []
tvl_ids = []
 
req = youtube_api.search().list(q='travel vlogs', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(tvl_titles)<no_of_samples):
    for i in range(len(res['items'])):
        tvl_titles.append(res['items'][i]['snippet']['title'])
        tvl_descriptions.append(res['items'][i]['snippet']['description'])
        tvl_ids.append(res['items'][i]['id']['videoId'])
        category.append('travel')
            
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
        req = youtube_api.search().list(q='travel vlogs', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)  # keep the same query so the pageToken stays valid
        res = req.execute()
    else:
        break
 
 
# Science Data
science_titles = []
science_descriptions = []
science_ids = []
 
next_page_token = None
req = youtube_api.search().list(q='robotics', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(science_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='robotics', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        science_titles.append(res['items'][i]['snippet']['title'])
        science_descriptions.append(res['items'][i]['snippet']['description'])
        science_ids.append(res['items'][i]['id']['videoId'])
        category.append('science and technology')
            
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
    
# Food Data
food_titles = []
food_descriptions = []
food_ids = []
 
next_page_token = None
req = youtube_api.search().list(q='delicious food', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(food_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='delicious food', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        food_titles.append(res['items'][i]['snippet']['title'])
        food_descriptions.append(res['items'][i]['snippet']['description'])
        food_ids.append(res['items'][i]['id']['videoId'])
        category.append('food')
            
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
 
# Manufacturing Data
manufacturing_titles = []
manufacturing_descriptions = []
manufacturing_ids = []
 
next_page_token = None
req = youtube_api.search().list(q='3d printing', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(manufacturing_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='3d printing', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        manufacturing_titles.append(res['items'][i]['snippet']['title'])
        manufacturing_descriptions.append(res['items'][i]['snippet']['description'])
        manufacturing_ids.append(res['items'][i]['id']['videoId'])
        category.append('manufacturing')
            
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
    
# History Data
history_titles = []
history_descriptions = []
history_ids = []
 
next_page_token = None
req = youtube_api.search().list(q='archaeology', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(history_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='archaeology', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        history_titles.append(res['items'][i]['snippet']['title'])
        history_descriptions.append(res['items'][i]['snippet']['description'])
        history_ids.append(res['items'][i]['id']['videoId'])
        category.append('history')
            
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
    
# Art and Music Data
art_titles = []
art_descriptions = []
art_ids = []
 
next_page_token = None
req = youtube_api.search().list(q='painting', part='snippet', type='video', maxResults = 50)
res = req.execute()
while(len(art_titles)<no_of_samples):
    if(next_page_token is not None):
        req = youtube_api.search().list(q='painting', part='snippet', type='video', maxResults = 50, pageToken=next_page_token)
        res = req.execute()
    for i in range(len(res['items'])):
        art_titles.append(res['items'][i]['snippet']['title'])
        art_descriptions.append(res['items'][i]['snippet']['description'])
        art_ids.append(res['items'][i]['id']['videoId'])
        category.append('art and music')
            
    if('nextPageToken' in res):
        next_page_token = res['nextPageToken']
    else:
        break
 
    
# Construct Dataset
final_titles = tvl_titles + science_titles + food_titles + manufacturing_titles + history_titles + art_titles
final_descriptions = tvl_descriptions + science_descriptions + food_descriptions + manufacturing_descriptions + history_descriptions + art_descriptions
final_ids = tvl_ids + science_ids + food_ids + manufacturing_ids + history_ids + art_ids
data = pd.DataFrame({'Video Id': final_ids, 'Title': final_titles, 'Description': final_descriptions, 'Category': category}) 
data.to_csv('Collected_data_raw.csv')

Note: You are free to explore a technique called web scraping, which is used to extract data from websites. Python has a nice library called BeautifulSoup for exactly this purpose. However, I found that when scraping data from YouTube search results, it only returned 25 results for a single search query.
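For reference, here is a minimal scraping sketch with requests and BeautifulSoup. It is not from the original article: modern YouTube result pages render most of their content with JavaScript, so a plain HTML scrape tends to surface very few items, which is consistent with the 25-result limitation mentioned above.

import requests
from bs4 import BeautifulSoup

# Fetch the static HTML of a YouTube search results page
html = requests.get('https://www.youtube.com/results', params={'search_query': 'travel vlogs'}).text
soup = BeautifulSoup(html, 'html.parser')

# Collect whatever titled links are present in the static markup (often only a handful)
titles = [a.get('title') for a in soup.find_all('a') if a.get('title')]
print(len(titles), titles[:5])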

Data Cleaning and Pre-processing

The first step of the pre-processing pipeline is to handle missing data. Since the missing values are supposed to be textual data, there is no sensible way to impute them, so the only option is to remove them. Fortunately, there are only 334 missing values out of 9,999 samples, so this will not affect model performance during training.

# Missing Values
num_missing_desc = data.isnull().sum()[2]    # No. of rows with missing descriptions (column index 2)
print('Number of missing values: ' + str(num_missing_desc))
data = data.dropna()

The 'Video Id' column is not really useful for predictive analysis, so it will not be included in the final training set and needs no pre-processing.

The two important columns here are Title and Description, but they are unprocessed raw text. To remove the noise, I will apply a very common approach for cleaning the text of both columns. The approach consists of the following steps:

  1. Convert to lowercase: this step is performed because capitalization does not affect the semantic importance of a word. For example, 'Travel' and 'travel' should be treated as the same word.
  2. Remove numeric values and punctuation: numeric values and the special characters used in punctuation ($, !, etc.) do not help in determining the correct class.
  3. Remove extra white space: so that each word is separated by a single space; otherwise problems may arise during tokenization.
  4. Tokenize into words: this refers to splitting a text string into a list of 'tokens', where each token is a word. For example, the sentence "I have huge biceps" becomes ['I', 'have', 'huge', 'biceps'].
  5. Remove non-alphabetic tokens and stop words: 'stop words' are words like 'and', 'the', 'a', and so on, which are important when learning how to construct sentences but of no use for predictive analytics.
  6. Lemmatization: lemmatization is a rather neat technique that reduces similar words to their base meaning. For example, the words 'flying' and 'flew' are both converted to their simplest form, 'fly'.

The dataset after text cleaning

import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
 
# Change to lowercase
data['Title'] = data['Title'].map(lambda x: x.lower())
data['Description'] = data['Description'].map(lambda x: x.lower())
 
# Remove numbers
data['Title'] = data['Title'].map(lambda x: re.sub(r'\d+', '', x))
data['Description'] = data['Description'].map(lambda x: re.sub(r'\d+', '', x))
 
# Remove Punctuation
data['Title']  = data['Title'].map(lambda x: x.translate(x.maketrans('', '', string.punctuation)))
data['Description']  = data['Description'].map(lambda x: x.translate(x.maketrans('', '', string.punctuation)))
 
# Remove white spaces
data['Title'] = data['Title'].map(lambda x: x.strip())
data['Description'] = data['Description'].map(lambda x: x.strip())
 
# Tokenize into words
data['Title'] = data['Title'].map(lambda x: word_tokenize(x))
data['Description'] = data['Description'].map(lambda x: word_tokenize(x))
 
# Remove non alphabetic tokens
data['Title'] = data['Title'].map(lambda x: [word for word in x if word.isalpha()])
data['Description'] = data['Description'].map(lambda x: [word for word in x if word.isalpha()])
# filter out stop words
stop_words = set(stopwords.words('english'))
data['Title'] = data['Title'].map(lambda x: [w for w in x if not w in stop_words])
data['Description'] = data['Description'].map(lambda x: [w for w in x if not w in stop_words])
 
# Word Lemmatization
lem = WordNetLemmatizer()
data['Title'] = data['Title'].map(lambda x: [lem.lemmatize(word,"v") for word in x])
data['Description'] = data['Description'].map(lambda x: [lem.lemmatize(word,"v") for word in x])
 
# Turn lists back to string
data['Title'] = data['Title'].map(lambda x: ' '.join(x))
data['Description'] = data['Description'].map(lambda x: ' '.join(x))

"Now that the text is clean, pop open the champagne and celebrate!"

Even though today's computers can solve world-class problems and play hyper-realistic video games, they are still machines that do not understand our language. So text data, no matter how clean, cannot be fed directly to a machine learning model. It needs to be converted into numeric features so that the computer can build a mathematical model as a solution. This makes up the data pre-processing step.

The Category column after label encoding

Since the output variable ('Category') is also categorical in nature, each class needs to be encoded as a number. This is called label encoding.

Finally, let's focus on the main piece of information in each sample: the raw text data. To extract features from the text and represent them in a numeric format, a very common approach is to vectorize them. The scikit-learn library contains the 'TfidfVectorizer' for exactly this purpose. TF-IDF (Term Frequency - Inverse Document Frequency) computes the frequency of each word within and across documents in order to identify how important each word is.
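To make the TF-IDF idea concrete, here is a tiny toy example (not part of the original article) that vectorizes three short "documents"; words that appear in many documents receive lower weights than words that are distinctive for a single document.

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ['travel vlog in japan', 'street food in japan', 'travel tips and travel hacks']
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)
print(toy_tfidf.get_feature_names())      # vocabulary learned from the corpus
print(toy_matrix.toarray().round(2))      # one row of TF-IDF weights per document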

# Encode classes
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(data.Category)
data.Category = le.transform(data.Category)
data.head(5)
 
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_title = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
tfidf_desc = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
labels = data.Category
features_title = tfidf_title.fit_transform(data.Title).toarray()
features_description = tfidf_desc.fit_transform(data.Description).toarray()
print('Title Features Shape: ' + str(features_title.shape))
print('Description Features Shape: ' + str(features_description.shape))

Data Analysis and Feature Exploration

As an additional step, I decided to plot the class distribution to check for an imbalanced number of samples.
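A minimal sketch of how this could be done (assuming matplotlib is available and the label encoder le from the previous step is in scope):

import matplotlib.pyplot as plt

# Map the encoded labels back to class names for readable tick labels
class_counts = data['Category'].map(lambda c: le.classes_[c]).value_counts()
class_counts.plot(kind='bar')
plt.title('Number of samples per class')
plt.ylabel('Count')
plt.tight_layout()
plt.show()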

I also wanted to check whether the features extracted with TF-IDF vectorization made sense, so I decided to find the most correlated unigrams and bigrams for each class, using both the title and the description features.

# Best 5 keywords for each class using Title Feaures
from sklearn.feature_selection import chi2
import numpy as np
N = 5
for current_class in list(le.classes_):
    current_class_id = le.transform([current_class])[0]
    features_chi2 = chi2(features_title, labels == current_class_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf_title.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(current_class))
    print("Most correlated unigrams:")
    print('-' *30)
    print('. {}'.format('\n. '.join(unigrams[-N:])))
    print("Most correlated bigrams:")
    print('-' *30)
    print('. {}'.format('\n. '.join(bigrams[-N:])))
    print("\n")
    
# Best 5 keywords for each class using Description Features
from sklearn.feature_selection import chi2
import numpy as np
N = 5
for current_class in list(le.classes_):
    current_class_id = le.transform([current_class])[0]
    features_chi2 = chi2(features_description, labels == current_class_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf_desc.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(current_class))
    print("Most correlated unigrams:")
    print('-' *30)
    print('. {}'.format('\n. '.join(unigrams[-N:])))
    print("Most correlated bigrams:")
    print('-' *30)
    print('. {}'.format('\n. '.join(bigrams[-N:])))
    print("\n")
# USING TITLE FEATURES
# 'art and music':
Most correlated unigrams:
------------------------------
. paint
. official
. music
. art
. theatre
Most correlated bigrams:
------------------------------
. capitol theatre
. musical theatre
. work theatre
. official music
. music video
 
 
# 'food':
Most correlated unigrams:
------------------------------
. foods
. eat
. snack
. cook
. food
Most correlated bigrams:
------------------------------
. healthy snack
. snack amp
. taste test
. kid try
. street food
 
 
# 'history':
Most correlated unigrams:
------------------------------
. discoveries
. archaeological
. archaeology
. history
. anthropology
Most correlated bigrams:
------------------------------
. history channel
. rap battle
. epic rap
. battle history
. archaeological discoveries
 
 
# 'manufacturing':
Most correlated unigrams:
------------------------------
. business
. printer
. process
. print
. manufacture
Most correlated bigrams:
------------------------------
. manufacture plant
. lean manufacture
. additive manufacture
. manufacture business
. manufacture process
 
 
# 'science and technology':
Most correlated unigrams:
------------------------------
. compute
. computers
. science
. computer
. technology
Most correlated bigrams:
------------------------------
. science amp
. amp technology
. primitive technology
. computer science
. science technology
 
 
# 'travel':
Most correlated unigrams:
------------------------------
. blogger
. vlog
. travellers
. blog
. travel
Most correlated bigrams:
------------------------------
. viewfinder travel
. travel blogger
. tip travel
. travel vlog
. travel blog
# USING DESCRIPTION FEATURES
# 'art and music':
Most correlated unigrams:
------------------------------
. official
. paint
. music
. art
. theatre
Most correlated bigrams:
------------------------------
. capitol theatre
. click listen
. production connexion
. official music
. music video
 
 
# 'food':
Most correlated unigrams:
------------------------------
. foods
. eat
. snack
. cook
. food
Most correlated bigrams:
------------------------------
. special offer
. hiho special
. come play
. sponsor series
. street food
 
 
# 'history':
Most correlated unigrams:
------------------------------
. discoveries
. archaeological
. history
. archaeology
. anthropology
Most correlated bigrams:
------------------------------
. episode epic
. epic rap
. battle history
. rap battle
. archaeological discoveries
 
 
# 'manufacturing':
Most correlated unigrams:
------------------------------
. factory
. printer
. process
. print
. manufacture
Most correlated bigrams:
------------------------------
. process make
. lean manufacture
. additive manufacture
. manufacture business
. manufacture process
 
 
# 'science and technology':
Most correlated unigrams:
------------------------------
. quantum
. computers
. science
. computer
. technology
Most correlated bigrams:
------------------------------
. quantum computers
. primitive technology
. quantum compute
. computer science
. science technology
 
 
# 'travel':
Most correlated unigrams:
------------------------------
. vlog
. travellers
. trip
. blog
. travel
Most correlated bigrams:
------------------------------
. tip travel
. start travel
. expedia viewfinder
. travel blogger
. travel blog

Modeling and Training

The four models I will analyze are:

  • Naive Bayes classifier
  • Support Vector Machine
  • AdaBoost classifier
  • LSTM

The dataset is split into training and test sets with an 80:20 split ratio. The features for the title and the description are computed independently and then concatenated to build the final feature matrix, which is used to train the classifiers (except the LSTM).

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import linear_model
from sklearn.ensemble import AdaBoostClassifier
 
# Splitting into train and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, 1:3], data['Category'], test_size=0.2, random_state = 0)
X_train_title_features = tfidf_title.transform(X_train['Title']).toarray()
X_train_desc_features = tfidf_desc.transform(X_train['Description']).toarray()
features = np.concatenate([X_train_title_features, X_train_desc_features], axis=1)
 
# Naive Bayes
nb = MultinomialNB().fit(features, y_train)
# SVM
svm = linear_model.SGDClassifier(loss='modified_huber',max_iter=1000, tol=1e-3).fit(features,y_train)
# AdaBoost
adaboost = AdaBoostClassifier(n_estimators=40,algorithm="SAMME").fit(features,y_train)

For the LSTM, the data pre-processing steps are quite different from those discussed above. The process is as follows:

  • Combine the title and description of each sample into a single sentence
  • Tokenize the combined sentences into padded sequences: convert each sentence into a list of tokens, assign each token a numeric id, then make every sequence the same length by padding the shorter sequences and truncating the longer ones
  • One-hot encode the 'Category' variable
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils.np_utils import to_categorical
 
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 20000
# Max number of words in each sample (title + description).
MAX_SEQUENCE_LENGTH = 50
# This is fixed.
EMBEDDING_DIM = 100
 
# Combining titles and descriptions into a single sentence
titles = data['Title'].values
descriptions = data['Description'].values
data_for_lstms = []
for i in range(len(titles)):
    temp_list = [titles[i], descriptions[i]]
    data_for_lstms.append(' '.join(temp_list))
 
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(data_for_lstms)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
 
# Convert the data to padded sequences
X = tokenizer.texts_to_sequences(data_for_lstms)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
 
# One-hot Encode labels
Y = pd.get_dummies(data['Category']).values
print('Shape of label tensor:', Y.shape)
 
# Splitting into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, random_state = 42)
 
# Define LSTM Model
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(6, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
 
# Training LSTM Model
epochs = 5
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,validation_split=0.1)
 
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show();
 
plt.title('Accuracy')
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='test')
plt.legend()
plt.show();

The learning curves of the LSTM are shown below:

LSTM loss curve

LSTM accuracy curve

Analyzing Performance

Below are the precision-recall curves for all the different classifiers. For additional metrics, check out the full code here:

https://github.com/agrawal-rohit/Text-Classification-Analysis/blob/master/Text%20Classification%20Analysis.ipynb
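The notebook above contains the full evaluation. As a rough illustration only (not from the original article), one-vs-rest precision-recall curves for a single classifier could be computed as sketched below, assuming numpy as np, the fitted vectorizers and label encoder from earlier, and X_test/y_test from the first train/test split made for the classical models (those variable names are reused later for the LSTM, so re-run that split if needed).

from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Build the same concatenated TF-IDF features for the test set
X_test_features = np.concatenate([tfidf_title.transform(X_test['Title']).toarray(),
                                  tfidf_desc.transform(X_test['Description']).toarray()], axis=1)

# Binarize the labels for one-vs-rest curves; 'modified_huber' loss lets the SGD-based SVM output probabilities
y_test_bin = label_binarize(y_test, classes=list(range(len(le.classes_))))
y_scores = svm.predict_proba(X_test_features)

for i, class_name in enumerate(le.classes_):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], y_scores[:, i])
    plt.plot(recall, precision, label=class_name)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()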

The ranking of the classifiers observed in this project is as follows:

LSTM > SVM > Naive Bayes > AdaBoost

LSTMs have shown outstanding performance on several tasks in natural language processing. The presence of multiple "gates" in an LSTM allows it to learn long-term dependencies in sequences.

SVMs are very strong classifiers that do their best to capture interactions among the extracted features, but the interactions they learn are not on par with those of the LSTM. The Naive Bayes classifier, on the other hand, treats the features as independent, so it performs slightly worse than the SVM because it does not take any interaction between features into account.

The AdaBoost classifier is quite sensitive to the choice of hyperparameters, and since the default model was used, it did not have optimal parameters, which could be the reason for its poor performance.
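As a possible follow-up (not covered in the article), a small grid search over AdaBoost's main hyperparameters could be tried; the parameter values below are arbitrary choices for illustration, and `features`/`y_train` are the concatenated TF-IDF features and labels built earlier.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier

param_grid = {
    'n_estimators': [40, 100, 200],
    'learning_rate': [0.1, 0.5, 1.0],
}
grid = GridSearchCV(AdaBoostClassifier(algorithm="SAMME"), param_grid, cv=3, scoring='accuracy')
grid.fit(features, y_train)   # same training features and labels used above
print(grid.best_params_, grid.best_score_)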

The complete code can be found on GitHub:

https://github.com/agrawal-rohit
