
Kaggle: Detect toxicity - Basic EDA -1

Author: 杨熹 | Published 2019-06-04 11:31:30
This Kaggle competition is:

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview


The Goal is:

to classify toxic comments, and especially to recognize unintended bias towards identities

a toxic comment is a comment that is rude, disrespectful, or otherwise likely to make someone leave a discussion


The challenge is:

some neutral comments that mention an identity term such as "gay" would be classified as toxic, e.g. "I am a gay woman".

The reason is that toxic comments mentioning these identities outnumber neutral comments mentioning the same identity, so the identity terms themselves become associated with toxicity.


Dataset

The dataset is labeled with the identities mentioned in each comment:

https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data

These columns are subtypes of toxicity and do not need to be predicted:

severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit

These columns correspond to identity attributes, representing the identities that are mentioned in the comment:

male, female, transgender, other_gender, heterosexual, homosexual_gay_or_lesbian, bisexual, other_sexual_orientation, christian, jewish, muslim, hindu, buddhist, atheist, other_religion, black, white, asian, latino, other_race_or_ethnicity, physical_disability, intellectual_or_learning_disability, psychiatric_or_mental_illness, other_disability

Additional columns:

toxicity_annotator_count and identity_annotator_count, plus metadata from Civil Comments: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, disagree. The rating label is the civility rating that Civil Comments users gave the comment.

Example:

Comment: Continue to stand strong LGBT community. Yes, indeed, you'll overcome and you have.
Toxicity Labels: all 0.0
Identity Mention Labels: homosexual_gay_or_lesbian: 0.8, bisexual: 0.6, transgender: 0.3 (all others 0.0)


1. Libs and Data:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

# list the files provided by the competition
print(os.listdir("../input"))

# load the train and test sets
train_df = pd.read_csv('../input/train.csv')
test_df = pd.read_csv('../input/test.csv')
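
Side note: train.csv is fairly large. If memory is tight, one option (a sketch, not part of the original post; the id and comment_text column names are assumed from the competition's data page) is to load only the columns this EDA actually uses via pandas' usecols parameter:

# optional: load only the columns needed for this EDA to save memory
# ('id' and 'comment_text' are assumed column names; the identity columns come from the data description)
cols = ['id', 'target', 'comment_text', 'male', 'female', 'black', 'white', 'muslim']
train_small_df = pd.read_csv('../input/train.csv', usecols=cols)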

2. Shape of data

train_len, test_len = len(train_df.index), len(test_df.index)
print(f'train size: {train_len}, test size: {test_len}')

train size: 1804874, test size: 97320

train_df.head()
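
To connect the table back to the example in the overview, here is a small sketch that prints one comment that has identity annotations (the comment_text column name is assumed from the data description):

# sketch: inspect one comment that has identity annotations
labeled = train_df[train_df['male'].notnull()]
row = labeled.iloc[0]
print(row['comment_text'])
print(row[['target', 'male', 'female', 'homosexual_gay_or_lesbian', 'muslim', 'black', 'white']])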

3. Count the amount of missing values

# percentage of missing values per column (keep only the columns that have any missing values)
miss_val_train_df = train_df.isnull().sum(axis=0) / train_len
miss_val_train_df = miss_val_train_df[miss_val_train_df > 0] * 100
miss_val_train_df
  • a large portion of the data doesn't have the identity tags
  • the missing percentage is the same for every identity column, i.e. a comment is either annotated for all identities or for none (see the quick check below)
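
A minimal check of that claim (a sketch using a handful of identity columns picked from the data description):

# sketch: verify that the identity columns are missing together, row by row
some_identities = ['male', 'female', 'black', 'white', 'muslim', 'christian']
missing_per_row = train_df[some_identities].isnull().sum(axis=1)
# expect only two values: 0 (fully annotated) or 6 (not annotated at all)
print(missing_per_row.value_counts())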

4. Visualization

Q1: which identity appears the most in the dataset?

According to the data description, we only care about the identities tagged in this dataset, so we make a list of them:

identities = ['male','female','transgender','other_gender','heterosexual','homosexual_gay_or_lesbian',
              'bisexual','other_sexual_orientation','christian','jewish','muslim','hindu','buddhist',
              'atheist','other_religion','black','white','asian','latino','other_race_or_ethnicity',
              'physical_disability','intellectual_or_learning_disability','psychiatric_or_mental_illness',
              'other_disability']
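
Before the stacked plot, a quick way to answer Q1 directly (a small sketch using the list above) is to count, for each identity, how many comments mention it at all:

# count comments that mention each identity (any non-zero value counts as a mention)
identity_mention_count = (train_df[identities].fillna(0) > 0).sum().sort_values(ascending=False)
print(identity_mention_count.head(10))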

From the stacked-bar diagram below we can also see the distribution of toxic and non-toxic comments for each identity:

# get the subset of the data that has identity annotations
train_labeled_df = train_df.loc[:, ['target'] + identities].dropna()
# let's define a toxic comment as one with a target score greater than or equal to 0.5
# in that case we can split the data into two dataframes and count toxic vs non-toxic comments per identity
toxic_df = train_labeled_df[train_labeled_df['target'] >= .5][identities]
non_toxic_df = train_labeled_df[train_labeled_df['target'] < .5][identities]

# at first, we only consider the identity tags in binary format: any non-zero value counts as 1
toxic_count = toxic_df.where(toxic_df == 0, other=1).sum()
non_toxic_count = non_toxic_df.where(non_toxic_df == 0, other=1).sum()

# now we can concat the two series together to get a toxic vs non-toxic count for each identity
toxic_vs_non_toxic = pd.concat([toxic_count, non_toxic_count], axis=1)
toxic_vs_non_toxic = toxic_vs_non_toxic.rename(index=str, columns={0: "toxic", 1: "non-toxic"})
# plot the stacked bars, sorted by toxic count, to (perhaps) see something interesting
toxic_vs_non_toxic.sort_values(by='toxic').plot(kind='bar', stacked=True, figsize=(30, 10), fontsize=20).legend(prop={'size': 20})
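
The stacked bars show absolute counts. As a quick follow-up (a sketch reusing the toxic_vs_non_toxic frame built above), the toxic share per identity already hints at the answer to the next question:

# toxic share per identity = toxic count / (toxic + non-toxic count)
toxic_share = toxic_vs_non_toxic['toxic'] / toxic_vs_non_toxic.sum(axis=1)
print(toxic_share.sort_values(ascending=False).head(10))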

Q2: which identities are most frequently associated with toxic comments?
  • weight each comment by its toxicity score (target)
  • also take into account the degree to which each identity is tagged in the comment (the identity columns are fractional scores, not just 0/1)
# first we multiply each identity score by the target score and sum over all comments
weighted_toxic = train_labeled_df.iloc[:, 1:].multiply(train_labeled_df.iloc[:, 0], axis="index").sum()
# change the identity values to 0/1 only and get the comment count per identity group
identity_label_count = train_labeled_df[identities].where(train_labeled_df[identities] == 0, other=1).sum()
# then divide the target-weighted value by the number of times each identity appears
weighted_toxic = weighted_toxic / identity_label_count
weighted_toxic = weighted_toxic.sort_values(ascending=False)
# plot the data with seaborn
plt.figure(figsize=(30, 20))
sns.set(font_scale=3)
ax = sns.barplot(x=weighted_toxic.values, y=weighted_toxic.index, alpha=0.8)
plt.ylabel('Demographics')
plt.xlabel('Weighted Toxicity')
plt.title('Weighted Analysis of Most Frequent Identities')
plt.show()
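
As a cross-check (a sketch, not part of the original analysis), an unweighted version of the same idea, namely the mean target among comments that mention an identity at all, should produce a similar ranking:

# mean toxicity score among comments that mention each identity (binary mention, unweighted)
mention_mask = train_labeled_df[identities] > 0
mean_toxicity = pd.Series(
    {ident: train_labeled_df.loc[mention_mask[ident], 'target'].mean()
     for ident in identities}
).sort_values(ascending=False)
print(mean_toxicity.head(10))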

Conclusion: race-based identities (black and white) and religion-based identities (Muslim and Jewish) are heavily associated with toxic comments.
