How Do You Test AI Systems?

Ron Schmelzer, Contributor

COGNITIVE WORLD, Contributor Group

Everyone who has ever worked on an application development project knows that you don’t simply put code and content out in production, to your customers, employees, or stakeholders, without first testing it to make sure it’s not broken or dead on delivery. Quality Assurance (QA) is such a core part of any technology or business delivery that it’s one of the essential components of any development methodology. You build. You test. You deploy. And the best way to do all this is in an agile fashion, in small, iterative chunks, so you can respond to the continuously evolving needs of the customer. Surely AI projects are no different. There are iterative design, development, testing, and delivery phases, as we’ve discussed in our previous content on AI methodologies.

However, AI operationalization is different from traditional deployment in that you don’t just put machine learning models into “production”. This is because models are continuously evolving and learning and must be continuously managed. Furthermore, models can end up on a wide variety of endpoints with different performance and accuracy metrics. AI projects are unlike traditional projects even with regard to QA. Simply put, you don’t do QA for AI projects the way you do QA for other projects. This is because what we test, how we test, and when we test are significantly different for AI projects.

Testing and quality assurance in the training and inference phases of AI

Those experienced with machine learning model training know that testing is actually a core element of making AI projects work. You don’t simply develop an AI algorithm, throw training data at it, and call it a day. You have to verify that the model trained on that data does a good enough job of accurately classifying or regressing data, with sufficient generalization and without overfitting or underfitting. This is done using validation techniques and by setting aside a portion of the training data to be used during the validation phase. In essence, this is a sort of QA testing in which you make sure that the algorithm and data, together with the hyperparameter configuration and associated metadata, all work together to provide the predictive results you’re looking for. If you get it wrong in the validation phase, you’re supposed to go back, change the hyperparameters, and rebuild the model, perhaps with better training data if you have it. After this is done, you use other set-aside testing data to verify that the model indeed works as it is supposed to. While these are all aspects of testing and validation, they happen during the training phase of the AI project, before the AI model is put into operation.
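
The hold-out workflow described above can be sketched in a few lines. This is only an illustration, assuming scikit-learn is the tool of choice: the synthetic dataset, the logistic regression model, the 60/20/20 split, and the candidate values for the hyperparameter C are all assumptions made for the example, not part of the original article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real project data (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Set aside 20% as the final test set, then split the remainder
# 75/25 into training and validation data (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Validation phase: try several hyperparameter values and keep the one
# that scores best on the validation data.
best_c, best_val_acc = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_c, best_val_acc = c, val_acc

# Final check: the untouched test set is used once, after tuning is done.
final_model = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best C={best_c}, validation accuracy={best_val_acc:.3f}, test accuracy={test_acc:.3f}")
```

A large gap between validation and test accuracy is one sign that the tuning step has overfit to the validation data.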

Even in the training phase, we’re testing a few different things. First, we need to make sure the AI algorithm itself works. There’s no sense in tweaking hyperparameters and training the model if the algorithm is implemented wrong. However, in reality, there’s no reason for a poorly implemented algorithm, because most of these algorithms are already baked into the various AI libraries. If you need K-Means Clustering, different flavors of neural networks, Support Vector Machines, or K-Nearest Neighbors, you can simply call that library function in Python scikit-learn or whatever your tool of choice is, and it should work. There’s just one way to do the math! ML developers should not be coding those algorithms from scratch unless they have a really good reason to do so. That means if you’re not coding them from scratch, there’s very little to be tested as far as the actual code goes: assume that the algorithms have already passed their tests. In an AI project, QA will never be focused on the AI algorithm itself or the code, assuming it has all been implemented as it is supposed to be.

This leaves two things to be tested in the training phase for the AI model itself: the training data and the hyperparameter configuration data. In the latter case, we already addressed testing of hyperparameter settings through the use of validation methods, including K-fold cross-validation and other approaches. If you are doing any AI model training at all, then you should know how to do validation. This will help determine whether your hyperparameter settings are correct. Knock another activity off the QA task list.
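
As a rough illustration of K-fold cross-validation applied to a hyperparameter setting, the sketch below uses scikit-learn's GridSearchCV with a 5-fold split. The synthetic data, the K-Nearest Neighbors classifier, and the candidate values for n_neighbors are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for real training data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: each candidate setting is trained on four folds
# and scored on the remaining fold, five times over.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
mean_scores = search.cv_results_["mean_test_score"]  # one mean score per candidate
print(f"best n_neighbors={best_k}, mean cross-validated accuracy={search.best_score_:.3f}")
```

Averaging over folds makes the comparison between settings less dependent on one lucky or unlucky split.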

As such, all that remains for QA of the AI model is testing the data itself. But what does that mean? It means not just data quality, but also completeness. Does the training data adequately represent the reality of what you’re trying to generalize? Have you inadvertently included any informational or human-induced bias in your training data? Are you skipping over things that work in training but will fail during inference because the real-world data is more complex? QA for the AI model here has to do with making sure that the training data includes a representative sample of the real world and eliminates as much human bias as possible.
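
Some of these data questions can be turned into automated checks. The sketch below is one possible shape for such a check, using only the Python standard library. The function name audit_training_data, the example field names, the list of expected groups, and the 30% minimum share are all hypothetical; a real project would pick groups and thresholds that reflect its own real-world population.

```python
from collections import Counter


def audit_training_data(rows, group_key, expected_groups, min_share=0.05):
    """Return human-readable findings about completeness and representation."""
    findings = []

    # Completeness: count rows in which any field is missing (None).
    incomplete = sum(1 for row in rows if any(v is None for v in row.values()))
    if incomplete:
        findings.append(f"{incomplete} row(s) have missing values")

    # Representation: every group expected in the real world should make up
    # at least min_share of the training data.
    groups = Counter(row[group_key] for row in rows if row[group_key] is not None)
    total = sum(groups.values())
    for group in expected_groups:
        share = groups.get(group, 0) / total if total else 0.0
        if share < min_share:
            findings.append(f"group {group!r} is under-represented ({share:.0%})")

    return findings


# Hypothetical training rows, for illustration only.
rows = [
    {"region": "north", "age": 34, "label": 1},
    {"region": "north", "age": 51, "label": 0},
    {"region": "south", "age": None, "label": 1},
    {"region": "north", "age": 29, "label": 0},
]

findings = audit_training_data(rows, "region", ["north", "south", "east"], min_share=0.3)
for finding in findings:
    print(finding)
```

A check like this only catches bias that can be expressed as a measurable property of the data; it does not replace a human review of how the data was collected.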

Outside of the machine learning model, the other aspects of the AI system that need testing are actually external to the AI model. You need to test the code that puts the AI model into production: the operationalization component of the AI system. This can happen prior to the AI model being put into production, but then you’re not actually testing the AI model. Instead, you’re testing the systems that use the model. If the model itself is failing during this testing, then there is a problem with either the training data or the configuration somewhere. You should have picked that up when you were testing the training data and doing validation, as we discussed above.
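
Testing the code around the model looks much like ordinary software testing. In the sketch below, a stub replaces the trained model so that the test exercises only the surrounding code. StubModel, score_request, and the payload format are hypothetical names invented for this example.

```python
class StubModel:
    """Hypothetical stand-in for a trained model with a predict() method."""

    def predict(self, rows):
        # Deterministic fake behavior: predict 1 when the features sum to a positive number.
        return [1 if sum(row) > 0 else 0 for row in rows]


def score_request(model, payload, n_features=3):
    """Validate a request payload and return the model's prediction for it."""
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != n_features:
        raise ValueError(f"expected a list of {n_features} features")
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in features):
        raise ValueError("features must be numeric")
    return {"prediction": int(model.predict([features])[0])}


def test_score_request():
    model = StubModel()

    # Well-formed requests are passed through to the model.
    assert score_request(model, {"features": [1.0, 2.0, 3.0]}) == {"prediction": 1}
    assert score_request(model, {"features": [-1.0, -2.0, 0.5]}) == {"prediction": 0}

    # Malformed requests are rejected before they ever reach the model.
    for bad_payload in ({}, {"features": [1.0]}, {"features": ["a", "b", "c"]}):
        try:
            score_request(model, bad_payload)
        except ValueError:
            continue
        raise AssertionError(f"payload should have been rejected: {bad_payload}")


test_score_request()
print("all serving-code checks passed")
```

Because the stub's behavior is fixed, a failure here points at the serving code rather than at the training data or hyperparameters.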

To do QA for AI, you need to test in production

If you’ve followed along with what’s written above, then you know that a properly validated, well-generalizing system using representative training data and algorithms from an already-tested and proven source should produce the expected results. But what happens when you don’t get those expected results? Reality is obviously messy. Things happen in the real world that don’t happen in your test environment. We did everything we were supposed to do in the training phase and our model met expectations, yet it’s not passing in the “inference” phase when the model is operationalized. This means we need to have a QA approach to deal with models in production.

Problems that arise with models in the inference phase are almost always issues of data, or mismatches between the way the model was trained and real-world data. We know the algorithm works. We know that our training data and hyperparameters were configured to the best of our ability. That means that when models are failing, we have data or real-world mismatch problems. Is the input data bad? If the problem is bad data, fix it. Is the model not generalizing well? Is there some nuance of the data that needs to be added to further train the model? If the answer is the latter, that means we need to go through a whole new cycle of developing an AI model with new training data and hyperparameter configurations to achieve the right level of fit to that data. Regardless of the issue, organizations that operationalize AI models need a solid approach by which they can keep close tabs on how the AI models are performing and version-control which ones are in operation.
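
One simple way to keep tabs on a mismatch between training data and real-world data is to compare a feature's live values against its training values. The sketch below uses a crude standardized mean shift, not a formal statistical test, and relies only on the Python standard library. The function names, the sample values, and the 0.5 threshold are assumptions for illustration.

```python
import statistics


def mean_shift(train_values, live_values):
    """Distance of the live mean from the training mean, in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma


def drifted(train_values, live_values, threshold=0.5):
    """Flag a feature whose live mean has moved more than `threshold` deviations."""
    return mean_shift(train_values, live_values) > threshold


# Hypothetical values of one numeric feature, for illustration only.
train = [10, 12, 11, 13, 9, 10, 12, 11]
live_similar = [11, 10, 12, 11]
live_shifted = [15, 16, 14, 17]

print("similar batch drifted:", drifted(train, live_similar))
print("shifted batch drifted:", drifted(train, live_shifted))
```

A flagged feature is a prompt to investigate: the cause may be bad input data that needs fixing, or a real change in the world that calls for a new training cycle.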

This is resulting in the emergence of a new field of technology called “ML ops”, which focuses not on building or developing models, but rather on managing them in operation. ML ops is focused on model versioning, governance, security, iteration, and discovery: basically, everything that happens after the models are trained and developed and while they are out in production.
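
To make the versioning idea concrete, here is a toy in-memory registry that records which model version is in operation and allows a rollback. It is a sketch of the concept only, not how any particular ML ops product works; ModelVersion, ModelRegistry, and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: str
    training_data_id: str      # which dataset snapshot the model was trained on
    hyperparameters: dict
    validation_accuracy: float


@dataclass
class ModelRegistry:
    versions: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # promoted versions, oldest first

    def register(self, model_version):
        if model_version.version in self.versions:
            raise ValueError(f"version {model_version.version} already registered")
        self.versions[model_version.version] = model_version

    def promote(self, version):
        """Put a registered version into operation."""
        if version not in self.versions:
            raise KeyError(version)
        self.history.append(version)

    def rollback(self):
        """Return to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()

    def live(self):
        """Return the version currently in operation, or None."""
        return self.versions[self.history[-1]] if self.history else None


registry = ModelRegistry()
registry.register(ModelVersion("v1", "data-2020-03", {"C": 1.0}, 0.91))
registry.register(ModelVersion("v2", "data-2020-04", {"C": 0.1}, 0.93))
registry.promote("v1")
registry.promote("v2")
print("live:", registry.live().version)
registry.rollback()
print("live after rollback:", registry.live().version)
```

Recording the training data identifier and hyperparameters next to each version is what makes it possible to trace a production problem back to how that model was built.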

AI projects are really unique in that they revolve around data. Data is the one thing in testing that is guaranteed to continuously grow and change. As such, you need to consider AI projects as also continuously growing and changing. This should give you a new perspective on QA in the context of AI.

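This article was shared from the WeChat official account 软件测试培训 (iTestTrain). Author: 软件测试培训

For the original source and reprint information, see the details in the article. In case of infringement, contact yunjia_community@tencent.com for removal.

Originally published: 2020-04-30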
