How to Test AI Models: An Introduction Guide for QA

顾翔

Published 2020-05-21 16:05:11

Source:

https://dzone.com/articles/how-to-test-ai-models-an-introduction-guide-for-qa-1

Get some simple answers to frequently asked questions regarding quality assurance within machine learning.

by Anastasiya Kazeeva

The latest trends show that machine learning is one of the most rapidly developing fields in computer science. Unfortunately, it's still unclear to some customers who are neither data scientists nor ML developers how to handle it, but they do know that they need to incorporate AI into their products.

Here are the most frequently asked questions we get from customers regarding quality assurance within ML:

I want to run UAT; could you please provide your full regression test cases for AI?
OK, I have the model running in production; how can we assure it doesn't break when we update it?
How can I make sure it produces the right values I need?

What are some simple answers here?

A Brief Introduction to Machine Learning

To understand how ML works, let's take a closer look at the essence of an ML model.

What is the difference between classical algorithms/hardcoded functions and ML-based models?

From the black-box perspective, it's the same box with input and output. Fill the inputs in, get the outputs. What a beautiful thing!

From the white-box perspective, and specifically from how the system is built, it's a bit different. The core difference here is:

You write the function, or
The function is fitted by a specific algorithm based on your data.
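
To make the distinction concrete, here is a minimal sketch in Python. The temperature-conversion rule, the sample data, and the names `celsius_to_fahrenheit`, `fit_line`, and `fitted_model` are illustrative assumptions, not something taken from the original article. The first function is written by hand; the second is recovered from data by ordinary least squares.

```python
def celsius_to_fahrenheit(c):
    """Hand-written function: the rule is coded explicitly by a developer."""
    return 1.8 * c + 32.0


def fit_line(xs, ys):
    """Fit y = slope * x + intercept to the data by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: observations of the same relationship.
celsius = [-10.0, 0.0, 10.0, 20.0, 30.0]
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]

# The coefficients are not typed in by anyone; they come out of the data.
slope, intercept = fit_line(celsius, fahrenheit)


def fitted_model(c):
    """Fitted function: same input/output shape, but learned coefficients."""
    return slope * c + intercept
```

From the outside both functions look identical, which is the black-box view. The difference only shows up when you ask where `1.8` and `32.0` came from: with real, noisy data the fitted coefficients would be approximate, which is why model quality has to be measured rather than read off the code.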

You can verify the ETL part of the model coefficients, but you can't verify model quality as easily as other parameters. (Translator's note: ETL, Extract-Transform-Load, is the process or tooling for extracting, cleaning, and transforming data, usually as preparatory work for a data warehouse.)

So, What About QA?

The model review procedure is similar to code review but tailored for data science teams. I haven't seen a lot of QA engineers participating in this particular procedure. After the review come model quality assessment, improvement, and so on, and the assessment itself usually happens inside the data science team.

Traditional QA happens for integration cases. Here are five points indicating that you have reasonable quality assurance when dealing with machine learning models in production:

You have a service based on ML functions that is deployed in production. It's up and running, and you want to make sure it's not broken by an automatically deployed new version of the model. In this case, there's a pure black-box scenario: load the test dataset and verify that it has an acceptable output (for example, compare it to the pre-deployment stage results). Keep in mind: it's not about exact matching; it's about the best-suggested value. So, you need to be aware of the acceptable dispersion.
You want to verify that deployed ML functions process the data correctly (e.g., no +/- sign inversion). That's where the white-box approach works best: use unit and integration tests to confirm that input data is loaded into the model correctly, check for the right sign (+/-), and check the feature output. Wherever you use ETL, it's good to have white-box checks.
Production data can mutate; the same input produces a new expected output over time. For example, something changes user behavior and the quality of the model falls. The other case is dynamically changing data. If that risk is high, here are two approaches:
1. Simple but expensive approach: Retrain daily on the new dataset. In this case, you need to find the right balance for your service, since retraining is closely tied to your infrastructure cost.
2. Complex approach: Depends on how you collect the feedback. For binary classification, for example, you can calculate metrics: precision, recall, and F1 score. Write a service with dynamic model scoring based on these parameters. If the score falls below 0.6, it's an alert; if it falls below 0.5, it's a critical incident. (Translator's note: the F1 score is a statistical measure of a binary classifier's accuracy that accounts for both precision and recall. It is the harmonic mean of the two, with a maximum of 1 and a minimum of 0.)
Public beta tests work very well for certain cases. You assess your model quality on data that wasn't used previously. For instance, add 300 more users to generate data and process it. Ideally, the more new data you test on, the better. The original dataset is good, but a larger amount of high-quality data is always better. Note: Test data extrapolation is not a good fit here; your model should work well with real users, not on predicted or generated data.
Automatically ping the service to make sure it's alive (not specifically ML testing, but it shouldn't be forgotten). Use Pingdom. Yeah, this simple thing has saved face many times. There are a lot of more advanced DevOps solutions out there; however, for us, everything started with this solution, and we benefited a lot from it.
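
The dynamic scoring idea from the "complex approach" can be sketched as follows. This is a minimal illustration, not the author's service: the function names, the sample labels, and the choice to apply the 0.6 and 0.5 thresholds to the F1 score (the article does not say which of the three metrics they apply to) are all assumptions.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def alert_level(score, alert_below=0.6, critical_below=0.5):
    """Map a score to the alert levels described in the text."""
    if score < critical_below:
        return "critical"
    if score < alert_below:
        return "alert"
    return "ok"


# Hypothetical feedback collected from production: true labels vs. predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
level = alert_level(f1)
```

In a real service this calculation would run on a rolling window of recent feedback, and the result would be sent to whatever alerting channel the team already uses.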

Answers

These answers cover pretty much everything concerning QA participation. Now, let's answer the customers' questions we set out at the beginning of this article.

1. I want to run UAT; could you please provide your full regression test cases for AI?

Describe the black box to the customer, and provide them with test data and a service that can process and visualize the output.
Describe all the testing layers, whether you verify data and model features on ETL layers, and how you do it.
Produce a model quality report. Provide the customer with model quality metrics vs. standard values. Get these from your data scientist.
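
When describing the ETL testing layer, it can help to show the customer what a white-box check looks like. The sketch below assumes a hypothetical ETL step, `build_features`, that turns a raw transaction record into model features; the record fields and the refund rule are invented for illustration. The tests guard against the sign inversion mentioned earlier.

```python
def build_features(record):
    """Hypothetical ETL step: turn a raw transaction record into model features."""
    amount = abs(float(record["amount"]))
    is_refund = record["type"] == "refund"
    # Refunds must reach the model as negative amounts, purchases as positive.
    signed_amount = -amount if is_refund else amount
    return {"signed_amount": signed_amount, "is_refund": int(is_refund)}


def test_purchase_keeps_positive_sign():
    features = build_features({"amount": "25.00", "type": "purchase"})
    assert features["signed_amount"] == 25.0
    assert features["is_refund"] == 0


def test_refund_gets_negative_sign():
    features = build_features({"amount": "25.00", "type": "refund"})
    assert features["signed_amount"] == -25.0
    assert features["is_refund"] == 1


if __name__ == "__main__":
    test_purchase_keeps_positive_sign()
    test_refund_gets_negative_sign()
    print("ETL sign checks passed")
```

The test functions follow the `test_` naming convention, so a runner such as pytest would pick them up without changes.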

2. OK, I have the model running in production; how can we assure it doesn't break when we update it?

You need to have a QA review of any production push, just as for any other software.
Perform a black-box smoke test. Try various types of inputs based on the function.
Verify model metrics on the production server with a sample of test data. If needed, isolate part of the prod server so the users aren't affected by the test.
Of course, make sure your white-box tests are passing.
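
A black-box check after a model update can be as simple as comparing the new outputs with the pre-deployment results on the same test inputs, within a tolerance rather than by exact match. The following is a rough sketch under assumptions: `regression_check`, the tolerance of 0.05, and the sample scores are all hypothetical, and the right tolerance has to come from your data scientist.

```python
def regression_check(baseline, candidate, tolerance, max_violation_rate=0.0):
    """Compare new model outputs to baseline outputs on the same inputs.

    Returns (passed, violations), where violations lists the indices whose
    outputs moved by more than `tolerance`.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be non-empty and equal in length")
    violations = [
        i for i, (b, c) in enumerate(zip(baseline, candidate))
        if abs(b - c) > tolerance
    ]
    rate = len(violations) / len(baseline)
    return rate <= max_violation_rate, violations


# Hypothetical scores for the same four test inputs, before and after the update.
baseline_scores = [0.10, 0.52, 0.91, 0.33]
candidate_scores = [0.12, 0.50, 0.60, 0.35]

passed, violations = regression_check(baseline_scores, candidate_scores, tolerance=0.05)
```

Here the third output moved far more than the tolerance, so the strict check fails and points at index 2. Setting `max_violation_rate` above zero allows a small share of outputs to drift, which may suit models where a few changed predictions are expected after retraining.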

3. How can I make sure it produces the right values I need?

You should always be aware of the acceptable standard deviation for your model and data. Spend some time with your data scientist and dig deeper into the model type and the technical aspects of the algorithms.
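
One possible way to turn "acceptable standard deviation" into a check is to measure the spread of the residuals (actual value minus predicted value) on held-out data and compare it with an agreed limit. This is only a sketch: the function names, the sample values, and the limit are assumptions, and the standard library's `statistics.pstdev` is used for the population standard deviation.

```python
import statistics


def residual_std(y_true, y_pred):
    """Population standard deviation of the residuals (actual - predicted)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return statistics.pstdev(residuals)


def within_acceptable_spread(y_true, y_pred, max_std):
    """True if the residual spread does not exceed the agreed limit."""
    return residual_std(y_true, y_pred) <= max_std


# Hypothetical held-out values and the model's predictions for them.
actual = [10, 12, 14, 16]
predicted = [11, 11, 15, 15]

spread = residual_std(actual, predicted)
acceptable = within_acceptable_spread(actual, predicted, max_std=1.5)
```

Note that the standard deviation only captures spread. A model that is consistently off by the same amount would pass this check, so it is worth looking at the mean residual as well.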

Any other questions you have in mind? Let's try to figure them out and get the answers!

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2020-04-30. To report infringement, contact cloudcommunity@tencent.com for removal.
