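I am trying to define a set of classes/labels for a multi-label classification with the classes 1, 2, 3 and 4, but unexpected entries show up in the classes array:
multilabel.classes_
array([' ', ',', '1', '2', '3', '4'], dtype=object)
I only want 1, 2, 3 and 4 as my labels, but I can't figure out how to remove the extra entries.
My code: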
import pandas as pd
import numpy as np
import os
import ast
import seaborn as sns #pip install seaborn
import matplotlib.pyplot as plt
import skmultilearn #pip install scikit-multilearn
from preprocessing.transcription_preprocessing import TranscriptionPreprocessor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
df = pd.read_csv(r'C:\Users\M94969\Desktop\datasets\prod500.csv')
# Define label variable
y = df['tags']
# Make multilabelbinarizer object
#multilabel = MultiLabelBinarizer()
#y = multilabel.fit_transform(y)
#multilabel.classes_
#pd.DataFrame(y,columns=multilabel.classes_)
labelbinarizer = LabelBinarizer()
fit = labelbinarizer.fit_transform(y)
labelbinarizer.classes_
pd.DataFrame(y,columns=labelbinarizer.classes_)
# Turn texts into sparse matrix
tfidf = TfidfVectorizer(analyzer='word', max_features=1000, max_df=0.50, ngram_range=(1,3))
X = tfidf.fit_transform(df['text'])
tfidf.vocabulary_
# Split data into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Build models
sgd = SGDClassifier()
lr = LogisticRegression(solver = 'lbfgs')
svc = LinearSVC()
def j_score(y_true, y_pred):
    jaccard = np.minimum(y_true, y_pred).sum(axis=1) / np.maximum(y_true, y_pred).sum(axis=1)
    return jaccard.mean() * 100
def print_score(y_pred, clf):
    print("Clf: ", clf.__class__.__name__)
    print('Jaccard score: {}'.format(j_score(y_test, y_pred)))
    print('----')
for classifier in [sgd, lr, svc]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print_score(y_pred, classifier)
Posted on 2021-09-16 06:54:05
When you run:
df['tags'].unique()
on your sample data, the output is:
array(['1', '3', '2', '1, 2'], dtype=object)
The multi-label assignment occurs in row 7 of the data:
df[df['tags']=='1, 2']
which results in:
            text  TV  Internet  Mobil  Fastnet  tags
7  TIL YOUSEE...   1         2      0        0  1, 2
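This row would also explain the stray `' '` and `','` classes in the question: `MultiLabelBinarizer` expects each sample to be an iterable of labels, so a plain string such as `'1, 2'` is iterated character by character. Below is a minimal sketch of one way around this, assuming the tags are comma-separated strings as in the sample output above. The small `tags` series is made up for illustration and stands in for `df['tags']`.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for df['tags']; the values mirror the sample output above.
tags = pd.Series(['1', '3', '2', '1, 2'])

# Fitting on the raw strings: each string is iterated character by character,
# so ',' and ' ' end up among the classes.
raw = MultiLabelBinarizer().fit(tags)
print(raw.classes_)

# Split each string into a list of labels before binarizing.
tag_lists = tags.apply(lambda s: [t.strip() for t in s.split(',')])

multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(tag_lists)
print(multilabel.classes_)
print(pd.DataFrame(y, columns=multilabel.classes_))
```

With the split in place, the row tagged `'1, 2'` gets a 1 in both the `'1'` and `'2'` columns instead of introducing extra classes.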
If you don't want this to be binarized, you can simply drop the row or assign a single label to it in your dataframe.
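Both options can be sketched in pandas roughly as follows. The small dataframe is a made-up stand-in for the real data, and keeping the first listed tag is only an illustrative choice of which single label to assign.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe.
df = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'tags': ['1', '3', '2', '1, 2'],
})

# Option 1: drop the rows that carry more than one tag.
single_only = df[~df['tags'].str.contains(',')].copy()
print(single_only['tags'].unique())

# Option 2: keep every row but assign a single label
# (here, arbitrarily, the first tag listed).
relabeled = df.copy()
relabeled['tags'] = relabeled['tags'].str.split(',').str[0].str.strip()
print(relabeled['tags'].unique())
```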
Alternatively, you can look at sklearn's LabelBinarizer to get labels that better fit your needs:
labelbinarizer = LabelBinarizer()
fit = labelbinarizer.fit_transform(y)
labelbinarizer.classes_
# array(['1', '1, 2', '2', '3'], dtype='<U4')
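Note that with `LabelBinarizer` the combination `'1, 2'` becomes a class of its own, so every row has exactly one active column. Here is a small sketch of what the transformed matrix looks like, using a made-up list that mirrors the sample tags rather than the real `df['tags']`.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical stand-in for df['tags'].
y = ['1', '3', '2', '1, 2']

labelbinarizer = LabelBinarizer()
encoded = labelbinarizer.fit_transform(y)

# One column per distinct tag string; '1, 2' is treated as a separate class.
print(pd.DataFrame(encoded, columns=labelbinarizer.classes_))
```

This is a multiclass (one label per row) encoding, so it fits the case where `'1, 2'` should count as its own category rather than as membership in both class 1 and class 2.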
https://stackoverflow.com/questions/69203485