When I try to deploy this model, I get the following error.
ValueError: X has 3 features, but LinearSVC is expecting 64852 features as input
A sample of the data is below.
data = [[3409, False, 'Lorum Ipsum'], [409, True, 'dolor sit amet consectetuer'], [7869, False, 'Aenean commodo ligula eget dolor']]
df = pd.DataFrame(data, columns=['id', 'booleanv', 'text'])
The code that creates the model is below.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
df = pd.read_csv('cleandata.csv')
# Split dataset into training and validation set
train_size = int(df.shape[0] * 0.8)
train_df = df[:train_size]
val_df = df[train_size:]
# split text and labels
X_train = train_df.text.to_numpy()
Y_train = train_df.booleanv.to_numpy()
X_test = val_df.text.to_numpy()
Y_test = val_df.booleanv.to_numpy()
tfidf = TfidfVectorizer(ngram_range=(1,1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
model1 = LinearSVC(random_state=0, tol=1e-5)
model1.fit(X_train_tf, Y_train)
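import pickle
# Persist the trained classifier and the fitted vectorizer
with open('classification.pickle', 'wb') as f:
    pickle.dump(model1, f)
with open('vectorizer.pickle', 'wb') as f:
    pickle.dump(tfidf, f)
X_train and X_test are both arrays. The input I provide through the API I created is in JSON format. I suspect I need to change my input somehow. Is that correct? If so, how can I do it?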
Posted on 2022-09-16 05:22:47
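To get predictions from the model, you need to apply the same transformation steps that you performed during training.
The ValueError you are seeing indicates that you are passing the raw data to the classifier without vectorizing it. Because the model was trained on a sparse matrix with 64852 features (the result of tfidf.fit_transform(X_train)), it expects vectorized input with the same number of features. Here is how to do that: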
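input_data = {
    'id': 1234,
    'booleanv': False,
    'text': 'your input text goes here'
}
# vectorize with the already fitted vectorizer (transform, not fit_transform)
input_vectorized = tfidf.transform([input_data['text']])
# get predictions from the trained classifier (model1 in the question)
predictions = model1.predict(input_vectorized)
Of course, this can be adapted to work with batches rather than a single input. It is also strongly recommended to use a Pipeline to assemble all the different steps.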
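As a rough sketch of the Pipeline suggestion, the vectorizer and the classifier can be wrapped in a single scikit-learn Pipeline, so that raw text goes in and predictions come out. The training texts, labels, and the JSON-like batch below are made-up placeholders and not the data from cleandata.csv; the names text_clf and loaded_clf are only illustrative.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the 'text' and 'booleanv' columns (hypothetical values).
train_texts = [
    "Lorum Ipsum",
    "dolor sit amet consectetuer",
    "Aenean commodo ligula eget dolor",
    "Lorum dolor amet",
]
train_labels = [False, True, False, True]

# The pipeline vectorizes raw text and then classifies it, so the same
# transformation is always applied at training and at prediction time.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1))),
    ("svc", LinearSVC(random_state=0, tol=1e-5)),
])
text_clf.fit(train_texts, train_labels)

# Only one object needs to be persisted instead of two separate pickles.
loaded_clf = pickle.loads(pickle.dumps(text_clf))

# A batch of JSON-like records, as an API might receive them (hypothetical payload).
batch = [
    {"id": 1234, "booleanv": False, "text": "your input text goes here"},
    {"id": 5678, "booleanv": True, "text": "dolor sit amet"},
]

# Pass only the text field; the pipeline handles vectorization internally.
predictions = loaded_clf.predict([record["text"] for record in batch])
print(predictions)
```

With this layout the API handler only has to extract the text field from each incoming record and call predict, which avoids the feature-count mismatch that comes from skipping the vectorizer.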
https://stackoverflow.com/questions/73607406