LangChain Notes: Evaluation

Steve Wang
Published 2023-10-12 09:41:56
Part of the column: 从流域到海域

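Compiled and translated from the official DeepLearning.AI × LangChain course, lesson "Evaluation" (source code available).

How to evaluate an LLM-based application is a hard problem; this lesson introduces some approaches and tools.

"Moving from traditional development to prompt-based development of LLM applications means the way the whole workflow is evaluated has to be rethought. This lesson introduces many exciting concepts."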

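Evaluation

We start by building the QA chain introduced in the previous lesson.

The only difference is one extra parameter, chain_type_kwargs, which specifies the separator placed between documents.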
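To make the role of that separator concrete: assuming the chain uses a "stuff"-style strategy, all retrieved documents are placed into a single prompt, and the separator marks where one document ends and the next begins. The following is a minimal pure-Python sketch of that joining step, not LangChain's own code. The function name `stuff_documents`, the sample texts, and the separator string are all assumptions made for illustration.

```python
# Made-up stand-ins for retrieved documents; in a real chain these
# would come from the retriever.
retrieved_docs = [
    "Cozy Comfort Pullover Set, Stripe: has side pockets.",
    "Ultra-Lofty 850 Stretch Down Hooded Jacket: DownTek collection.",
]


def stuff_documents(docs, document_separator="\n\n"):
    """Join document texts into one context string for a single prompt."""
    return document_separator.join(docs)


# Assumed separator value; the article does not state which string is used.
SEPARATOR = "<<<<>>>>>"

context = stuff_documents(retrieved_docs, document_separator=SEPARATOR)
print(context)
```

A distinctive separator makes document boundaries easy to spot when reading the prompt in debug output.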

It helps to look at a few sample records from the data first.

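Hard-coded examples

The most obvious way to evaluate is to build the evaluation data by hand and then check whether the LLM's output agrees with the answers already given in that data. Hand-built evaluation data always comes with a cost problem, though.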
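A hand-written evaluation set can be as simple as a list of question/answer pairs. The sketch below uses the two short-answer pairs that appear in the results later in this article; the `query`/`answer` key names and the helper `validate_examples` are assumptions for illustration, not something the article specifies.

```python
# Hand-written evaluation examples: each one pairs a query with the answer
# a human expects. The "query"/"answer" key names are an assumed convention.
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]


def validate_examples(examples):
    """Raise ValueError if any example lacks a non-empty query or answer."""
    for i, ex in enumerate(examples):
        for key in ("query", "answer"):
            if not str(ex.get(key, "")).strip():
                raise ValueError(f"example {i} is missing {key!r}")
    return len(examples)


print(validate_examples(examples), "examples ready")
```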

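LLM-generated examples

Instead of writing examples by hand, an LLM can generate them. The course introduces QAGenerationChain, a chain that generates QA examples.

The hand-written and generated examples can then be combined for evaluation. As a first check, run the first query through the chain and inspect the reply.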
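Combining the two sources is plain list handling. In the sketch below, the `generated` list is a stand-in: its single entry is copied from the results listed later in this article and is only assumed to be one of the LLM-generated examples. The helper `combine_examples` is hypothetical; it also drops repeated queries, which is a design choice rather than anything the course prescribes.

```python
hand_written = [
    {"query": "Do the Cozy Comfort Pullover Set have side pockets?", "answer": "Yes"},
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]

# Stand-in for generated output; in the course such pairs come from
# QAGenerationChain rather than being typed in.
generated = [
    {
        "query": "What is the fabric composition of the EcoFlex 3L Storm Pants?",
        "answer": "The EcoFlex 3L Storm Pants are made of 100% nylon, exclusive of trim.",
    },
]


def combine_examples(*groups):
    """Concatenate example lists, keeping the first occurrence of each query."""
    seen = set()
    combined = []
    for group in groups:
        for ex in group:
            if ex["query"] not in seen:
                seen.add(ex["query"])
                combined.append(ex)
    return combined


all_examples = combine_examples(hand_written, generated)
print(len(all_examples))
```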

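Manual evaluation

LangChain provides a debug mode, turned on by setting `langchain.debug = True`.

Running the first query again, LangChain now prints information about every step of the process.

Debug mode is turned off by setting the debug flag back to False.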

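LLM-assisted evaluation

Because current LLMs are already fairly capable, an LLM can be used to assist with evaluation.

First, generate predictions for all of the examples built earlier.

There are 7 examples in total, so the chain runs 7 times.

LangChain provides QAEvalChain for evaluating QA scenarios: given the examples and the chain's predictions, it has an LLM grade each predicted answer against the reference answer.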
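The idea behind LLM-based grading can be sketched without any library: build a prompt containing the question, the reference answer, and the predicted answer, send it to a model, and parse the verdict. The code below is a minimal illustration of that pattern, not QAEvalChain's implementation. The prompt wording, the function names, and `fake_llm` (a stand-in that replaces a real model call) are all assumptions.

```python
def build_grading_prompt(question, real_answer, predicted_answer):
    """Assemble a grading prompt for a judge model (wording is illustrative)."""
    return (
        "You are grading an answer to a question.\n"
        f"QUESTION: {question}\n"
        f"REFERENCE ANSWER: {real_answer}\n"
        f"PREDICTED ANSWER: {predicted_answer}\n"
        "Reply with CORRECT or INCORRECT, judging meaning rather than wording.\n"
        "GRADE:"
    )


def grade(llm, question, real_answer, predicted_answer):
    """Ask the judge `llm` (any callable str -> str) and normalize its verdict."""
    reply = llm(build_grading_prompt(question, real_answer, predicted_answer))
    verdict = reply.strip().upper()
    # Test INCORRECT first: the word CORRECT is a substring of it.
    if verdict.startswith("INCORRECT"):
        return "INCORRECT"
    if verdict.startswith("CORRECT"):
        return "CORRECT"
    return "UNKNOWN"


def fake_llm(prompt):
    """Stand-in for a real model call; it always answers CORRECT."""
    return " CORRECT"


print(grade(
    fake_llm,
    "Do the Cozy Comfort Pullover Set have side pockets?",
    "Yes",
    "The Cozy Comfort Pullover Set, Stripe does have side pockets.",
))
```

Swapping `fake_llm` for a function that calls a real model is the only change needed to turn the sketch into a working judge.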

Here are the model's outputs alongside the grades assigned by the evaluation chain:
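A per-example report in the format that follows can be produced by a small formatting loop. The sketch below hard-codes two predictions and grades copied from that report in place of real chain output. The key names (`query`, `answer`, `result`, `text`) and the helper `format_report` are assumptions, and the keys actually returned may differ between LangChain versions.

```python
# Stand-ins for chain output: each prediction carries the query, the
# reference answer, and the model's result; each grade carries a verdict.
predictions = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
        "result": "The Cozy Comfort Pullover Set, Stripe does have side pockets.",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
        "result": "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.",
    },
]
graded_outputs = [{"text": "CORRECT"}, {"text": "CORRECT"}]


def format_report(predictions, graded_outputs):
    """Render one block of lines per example, separated by blank lines."""
    lines = []
    for i, (pred, graded) in enumerate(zip(predictions, graded_outputs)):
        lines.append(f"Example {i}:")
        lines.append("Question: " + pred["query"])
        lines.append("Real Answer: " + pred["answer"])
        lines.append("Predicted Answer: " + pred["result"])
        lines.append("Predicted Grade: " + graded["text"])
        lines.append("")
    return "\n".join(lines)


report = format_report(predictions, graded_outputs)
print(report)
```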

```
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set, Stripe does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of each pair of Women's Campside Oxfords?
Real Answer: The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz.
Predicted Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium Recycled Waterhog Dog Mat?
Real Answer: The dimensions of the small Recycled Waterhog Dog Mat are 18" x 28" and the dimensions of the medium Recycled Waterhog Dog Mat are 22.5" x 34.5".
Predicted Answer: The small Recycled Waterhog Dog Mat has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit?
Real Answer: The swimsuit features bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric, ensuring that it keeps its shape and resists snags. The swimsuit is also UPF 50+ rated, providing the highest rated sun protection possible by blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. Finally, it can be machine washed and line dried for best results.
Predicted Answer: The Infant and Toddler Girls' Coastal Chill Swimsuit is a two-piece swimsuit with bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and resists snags. The swimsuit has UPF 50+ rated fabric that provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. It is machine washable and should be line dried for best results.
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?
Real Answer: The body of the Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon with 18% Lycra® spandex for the body and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT

Example 6:
Question: What is the fabric composition of the EcoFlex 3L Storm Pants?
Real Answer: The EcoFlex 3L Storm Pants are made of 100% nylon, exclusive of trim.
Predicted Answer: The fabric composition of the EcoFlex 3L Storm Pants is 100% nylon, exclusive of trim.
Predicted Grade: CORRECT
```

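The video then explains why an LLM is used for evaluation:
![Image](https://img-blog.csdnimg.cn/73ac80581ea243d981b0db3ede2d5d8a.png)
In a natural language generation setting (such as the QA task above), the model's output can be arbitrary text, so correctness cannot be decided by exact string matching (equality), partial matching (substring containment), or regular expressions (more complex matching). Taking the figure above as an example, the real answer "Yes" and the model's output "The Cozy Comfort Pullover Set, Stripe does have side pockets." are completely different strings and cannot be judged equal by string matching. An LLM with semantic understanding, however, can recognize that they mean the same thing, which traditional string matching cannot do.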
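The failure of string-based checks on this example is easy to reproduce. The snippet below applies the three strategies named above to the exact pair of strings from the example; the variable names and the particular regular expression are illustrative choices.

```python
import re

real_answer = "Yes"
predicted = "The Cozy Comfort Pullover Set, Stripe does have side pockets."

# 1. Exact match: the strings must be identical.
exact = predicted == real_answer

# 2. Partial match: the reference must occur as a substring (case-insensitive).
substring = real_answer.lower() in predicted.lower()

# 3. Regex: look for "yes" as a whole word, ignoring case.
regex = re.search(r"\byes\b", predicted, flags=re.IGNORECASE) is not None

print(exact, substring, regex)  # prints: False False False
```

All three checks reject an answer that a human reader would accept, which is the gap a semantic grader is meant to close.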
### LangChain visual evaluation tool
LangChain provides a visual evaluation tool, `LangChainPlus` (since renamed LangSmith; it may require extra installation and configuration), which automatically records the history of runs made in a Python notebook.
![Image](https://img-blog.csdnimg.cn/89a584e6f74843a9af67e719ff185cbb.png)
You can click through the visualization to see the call chain, or click a node to see the details of that chain: inputs, outputs, latency, and additional metadata (the runtime environment), as shown below:
![Image](https://img-blog.csdnimg.cn/1bc61a5378934a248155957d17724f73.png)
Clicking an LLM Chain node shows the model input (the SYSTEM and HUMAN messages), the model output, and the output metadata.
![Image](https://img-blog.csdnimg.cn/da19b50c29d740cab5c498f25e688722.png)
![Image](https://img-blog.csdnimg.cn/a9034d980ba54ddbb6ae8a136b2fe937.png)
The "to Dataset" button in the upper right adds the current input and output as a pair to a dataset, as follows:
![Image](https://img-blog.csdnimg.cn/aac46bc18f6e4862bf6227e9ded7fb2c.png)
If no dataset exists yet, click "Create dataset" to create one:
![Image](https://img-blog.csdnimg.cn/26e015fa2877407a90d03822d723bf7f.png)
Create the dataset:
![Image](https://img-blog.csdnimg.cn/96c7b5798c68423a8427cd1376d9cf57.png)
Add the current QA chain's input and output to the newly created dataset:
![Image](https://img-blog.csdnimg.cn/827cf6901a9640478cc0b9888fa5f00d.png)
Originally published on the author's personal site/blog on 2023-08-17.
