Scikit-LLM: Integrating Large Language Models into the Sklearn Workflow

Source: 企鹅号 (Tencent Penguin Account) - deephub

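We previously covered integrating Pandas with ChatGPT, which lets you operate on a DataFrame without knowing Pandas. Now Scikit-LLM has been open-sourced; it combines powerful language models such as ChatGPT with scikit-learn. Its aim is not to automate scikit-learn but to integrate language models into it, so that scikit-learn workflows can also handle text data.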

Installation

pip install scikit-llm

Since it integrates with OpenAI's models, an API key is required. Import SKLLMConfig from the Scikit-LLM library and set your OpenAI key:

# importing SKLLMConfig to configure OpenAI API (key and Name)

from skllm.config import SKLLMConfig

# Set your OpenAI API key

SKLLMConfig.set_openai_key("")

# Set your OpenAI organization (optional)

SKLLMConfig.set_openai_org("")

ZeroShotGPTClassifier

With ChatGPT integrated, text can be classified without any dedicated training. ZeroShotGPTClassifier is as simple to use as any other scikit-learn classifier.

# importing zeroshotgptclassifier module and classification dataset

from skllm import ZeroShotGPTClassifier

from skllm.datasets import get_classification_dataset

# load the demo classification dataset that ships with Scikit-LLM

X, y = get_classification_dataset()

# defining the model

clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

# fitting the data

clf.fit(X, y)

# predicting the data

labels = clf.predict(X)

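Scikit-LLM post-processes the result to ensure the response contains exactly one valid label. If the response lacks a valid label, it fills one in by choosing a label according to how often each label appears in the training data.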
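The fallback described above can be sketched in plain Python. This is only an illustration of the idea, not Scikit-LLM's actual implementation: the helper names `build_label_prior` and `validate_label` are hypothetical, and the matching rule (case-insensitive, quotes stripped) is an assumption.

```python
import random
from collections import Counter


def build_label_prior(y):
    """Return (classes, probabilities) estimated from the training labels."""
    counts = Counter(y)
    total = sum(counts.values())
    classes = list(counts)
    probs = [counts[c] / total for c in classes]
    return classes, probs


def validate_label(response, classes, probs, rng=random):
    """Keep the response if it is a known label; otherwise sample a
    fallback label weighted by its frequency in the training data."""
    cleaned = response.strip().strip("'\"").lower()
    for c in classes:
        if c.lower() == cleaned:
            return c
    # Hypothetical fallback: frequency-weighted random choice
    return rng.choices(classes, weights=probs, k=1)[0]


classes, probs = build_label_prior(["positive", "positive", "negative", "neutral"])
print(validate_label(" Positive ", classes, probs))      # a valid label is kept
print(validate_label("I am not sure", classes, probs))   # replaced by a sampled label
```

With a prior like this, labels that are more common in the training data are more likely to be chosen as the fallback.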

If you have no labeled data of your own, you only need to provide the list of candidate labels. The code looks like this:

# importing zeroshotgptclassifier module and classification dataset

from skllm import ZeroShotGPTClassifier

from skllm.datasets import get_classification_dataset

# load the demo classification dataset from Scikit-LLM, for prediction only

X, _ = get_classification_dataset()

# defining the model

clf = ZeroShotGPTClassifier()

# No training data is used, so pass only the candidate labels

clf.fit(None, ['positive', 'negative', 'neutral'])

# predicting the labels

labels = clf.predict(X)

MultiLabelZeroShotGPTClassifier

Multi-label classification works similarly.

# importing Multi-Label zeroshot module and classification dataset

from skllm import MultiLabelZeroShotGPTClassifier

from skllm.datasets import get_multilabel_classification_dataset

# load the demo multi-label classification dataset that ships with Scikit-LLM

X, y = get_multilabel_classification_dataset()

# defining the model

clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the model

clf.fit(X, y)

# making predictions

labels = clf.predict(X)

When creating an instance of the MultiLabelZeroShotGPTClassifier class, specify the maximum number of labels to assign to each sample (here: max_labels=3).
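To make the effect of max_labels concrete, here is a minimal sketch of how a list of predicted labels could be filtered and capped. It is an assumption about the general idea, not the library's code; `cap_labels` and the sample labels are hypothetical.

```python
def cap_labels(predicted, candidate_labels, max_labels=3):
    """Drop unknown or repeated labels, then keep at most max_labels."""
    allowed = set(candidate_labels)
    kept = []
    for label in predicted:
        if label in allowed and label not in kept:
            kept.append(label)
    return kept[:max_labels]


candidates = ["Quality", "Price", "Delivery", "Service", "Product Variety"]
raw = ["Price", "Quality", "Price", "Speed", "Delivery", "Service"]
print(cap_labels(raw, candidates, max_labels=3))
```

Here "Speed" is discarded because it is not a candidate label, the repeated "Price" is ignored, and the remaining labels are cut off after the third one.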

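What if the data has no labels? You can train the classifier without labeled data by providing a list of candidate labels. The type of y should be List[List[str]]. Below is a training example without labeled data: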

# getting classification dataset for prediction only

X, _ = get_multilabel_classification_dataset()

# Define all the labels that may be predicted

candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety",
]

# creating the model

clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# fitting the labels only

clf.fit(None, [candidate_labels])

# predicting the data

labels = clf.predict(X)

Text Vectorization

Text vectorization is the process of converting text into numbers. The GPTVectorizer module in Scikit-LLM converts a piece of text, however long, into a fixed-size vector.
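The "any length in, fixed size out" property is what lets a vectorizer sit in front of an ordinary classifier. The toy class below illustrates only that property, using a simple hashing trick as a stand-in. It is not how GPTVectorizer works (that relies on OpenAI embeddings), and `ToyHashingVectorizer` is a hypothetical name.

```python
import hashlib


class ToyHashingVectorizer:
    """Toy stand-in: maps each text to a count vector of fixed length."""

    def __init__(self, n_features=8):
        self.n_features = n_features

    def fit(self, X, y=None):
        # Nothing to learn; present only to mirror the fit/transform interface
        return self

    def transform(self, X):
        rows = []
        for text in X:
            vec = [0.0] * self.n_features
            for token in text.lower().split():
                digest = hashlib.md5(token.encode("utf-8")).hexdigest()
                vec[int(digest, 16) % self.n_features] += 1.0
            rows.append(vec)
        return rows

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


texts = ["short text", "a much longer piece of text " * 50]
vectors = ToyHashingVectorizer(n_features=8).fit_transform(texts)
print([len(v) for v in vectors])  # every row has the same length
```

Because every row has the same width regardless of input length, the output can be passed straight to a downstream estimator.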

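# Importing the necessary modules and classes

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import LabelEncoder

from xgboost import XGBClassifier

from skllm.preprocessing import GPTVectorizer

# X_train, X_test, y_train and y_test are assumed to be an existing train/test split of text data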

# Creating an instance of LabelEncoder class

le = LabelEncoder()

# Encoding the training labels 'y_train' using LabelEncoder

y_train_encoded = le.fit_transform(y_train)

# Encoding the test labels 'y_test' using LabelEncoder

y_test_encoded = le.transform(y_test)

# Defining the steps of the pipeline as a list of tuples

steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]

# Creating a pipeline with the defined steps

clf = Pipeline(steps)

# Fitting the pipeline on the training data 'X_train' and the encoded training labels 'y_train_encoded'

clf.fit(X_train, y_train_encoded)

# Predicting the labels for the test data 'X_test' using the trained pipeline

yh = clf.predict(X_test)

Text Summarization

GPT is very good at summarizing text. Scikit-LLM provides a module for this called GPTSummarizer.

# Importing the GPTSummarizer class from the skllm.preprocessing module

from skllm.preprocessing import GPTSummarizer

# Importing the get_summarization_dataset function

from skllm.datasets import get_summarization_dataset

# Calling the get_summarization_dataset function

X = get_summarization_dataset()

# Creating an instance of the GPTSummarizer

s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)

# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.

# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'

summaries = s.fit_transform(X)

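Note that the max_words hyperparameter is a soft limit on the number of words in the generated summary. It sets a rough target for summary length, but the summarizer may occasionally produce a slightly longer summary depending on the context and content of the input text.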
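If a hard upper bound is required, one option is to post-process the summaries yourself. The sketch below is not part of Scikit-LLM; `truncate_words` and `enforce_word_limit` are hypothetical helpers that simply cut each summary at max_words whitespace-separated words.

```python
def truncate_words(summary, max_words):
    """Return the summary unchanged if short enough, else its first max_words words."""
    words = summary.split()
    if len(words) <= max_words:
        return summary
    return " ".join(words[:max_words])


def enforce_word_limit(summaries, max_words=15):
    """Apply the hard word cap to every summary in a list."""
    return [truncate_words(s, max_words) for s in summaries]


example = ["one two three four five", "short"]
print(enforce_word_limit(example, max_words=3))
```

Cutting at a fixed word count can end a sentence abruptly, so this suits cases where a strict length budget matters more than polished wording.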

Summary

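ChatGPT's popularity has driven further progress in general-purpose models, and that progress has greatly changed our everyday work. Scikit-LLM brings LLMs into the scikit-learn workflow. If you are interested, the source code is here: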

https://github.com/iryna-kondr/scikit-llm

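Author: Fareed Khan

Published: 2023-05-27 09:46:24
Original link: https://kuaibao.qq.com/s/20230527A01TGJ00?refer=cp_1026
The Tencent Cloud Developer Community (腾讯云开发者社区) is one of the distribution channels for Tencent Content Open Platform accounts (企鹅号), and this content is reposted under the Tencent Content Open Platform Service Agreement.
In case of infringement, please contact cloudcommunity@tencent.com for removal.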
