How to Test Deep Learning

顾翔
Published 2020-05-20 17:15:22

One of the largest challenges when starting our company was learning how to use deep learning models in production-grade software. Whilst we had solved the challenge of proving our models were capable of solving the problem in a controlled environment (and with nice datasets), building a pipeline and testing suite was difficult, and we'd like to share what we've learnt.

In particular, our product utilized "online learning", in which each model is trained specifically for a client. This makes our models vulnerable to data dependencies as well as technical dependencies.

Whilst great literature now exists on why machine learning builds technical debt (see here), we wanted to focus on methods to alleviate that debt using testing.

The Testing Framework

Testing is the backbone of any good software project; it is the fundamental way to maintain code. Traditionally it can be broken down into 4 tiers, each abstracting and working on a larger scope than the previous. For deep learning we particularly focus on the first two (unit and integration).

Unit testing evaluates code at its component level, where actual outputs are compared with expected outputs. For deep learning this is interpreted as covering everything from the individual layers (e.g. linear and conv layers) to small modules (e.g. your custom string encoders). Whilst the former are normally well tested by the PyTorch and TF libraries, the latter present difficulties.
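As a hedged illustration of a component-level test for such a module (the CharEncoder below is a made-up stand-in for a custom string encoder, not from the original article):

import torch
from torch import nn


class CharEncoder(nn.Module):
    """Hypothetical custom string encoder: maps text to a tensor of char codes."""

    def forward(self, text: str) -> torch.Tensor:
        return torch.tensor([ord(c) for c in text], dtype=torch.long)


def test_char_encoder_output():
    # Actual output compared against an expected output, computed by hand.
    actual = CharEncoder()("ab")
    expected = torch.tensor([97, 98])
    assert torch.equal(actual, expected)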

Integration testing evaluates the orchestrated performance of components of code working together, checking that interfaces match and that the correct data is being sent between them. In machine learning, we consider this to cover tasks such as training and inference. Inherently these tasks involve interplay between your data processing, training mechanisms and deployment, and at each stage biases can be introduced.
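A minimal sketch of one such integration test, assuming pytest's tmp_path fixture and a toy model standing in for a real one:

import torch
from torch import nn


def test_train_save_load_roundtrip(tmp_path):
    # Checks that the training side and the inference side agree on the
    # model's interface and its serialised format.
    model = nn.Linear(4, 2)
    torch.save(model.state_dict(), tmp_path / "model.pt")

    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(tmp_path / "model.pt"))

    x = torch.randn(1, 4)
    assert torch.allclose(model(x), restored(x))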

Why it Fails for Deep Learning

Some of our pet peeves on why this framing fails:

Maths is boring. Calculating tensor multiplications is difficult, and can rarely be done by an engineer as a "back of the envelope" calculation. The maths is fiddly: even with fixed seed initialisation, the regular Xavier weight initialisation uses 32-bit floats, matrix multiplication involves a large series of calculations, and testing modules with batching involves hard-coding tensors with 3 dimensions. Whilst none of these tasks is insurmountable, they massively stifle development time and the creation of useful unit tests.

Models fail silently. Whilst you can test that a network behaves correctly with a specific input, the inputs of a neural network are rarely a finite set (with the exception of some limited discrete models). Networks work in larger orchestration and regularly change their inputs, outputs and gradients. In order to keep the PyTorch library versatile, many methods are lazy; I cannot recount how many times a bug has been caused by PyTorch's weak matrix operations, in which shapes don't need to match! However, this alludes to the larger problem…

Learning is an emergent property. Machine learning arises from the collaborative functioning of layers and does not belong to any single property of the system, so ultimately you can never be sure the model you produce has the exact properties you'd like. To this end, actually testing the quality of a model requires training, which would traditionally be considered our second tier of testing: integration. In addition, this form of training is computationally expensive and time-consuming, and usually not possible on local machines.

Multiple service orchestration is horrible. In scenarios where you have multiple services in your product, architecture versioning becomes important. A common scenario I've encountered is having one micro-service for training models and another for using the model in inference mode. Now imagine a researcher implements a change as small as swapping a ReLU for a LeakyReLU activation: this can throw any service dependent on a previous neural architecture out of sync. Whilst version control is a problem encountered by most software products and can normally be mocked out, the ML libraries are playing catch-up and such functionality doesn't exist.
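One hedged way to catch this kind of drift (our own sketch, not from the original article) is to store an architecture fingerprint alongside the checkpoint, so a mismatched service fails loudly. The sketch below hashes the module's repr, which changes when, say, a ReLU becomes a LeakyReLU:

import hashlib

import torch
from torch import nn


def architecture_fingerprint(model: nn.Module) -> str:
    # repr(model) lists layer types and sizes, so swapping an activation
    # changes the fingerprint.
    return hashlib.sha256(repr(model).encode()).hexdigest()[:12]


model = nn.Sequential(nn.Linear(8, 8), nn.ReLU())
torch.save(
    {"arch": architecture_fingerprint(model), "state_dict": model.state_dict()},
    "model.pt",
)
# The inference service recomputes the fingerprint of its own architecture
# and refuses to load the state_dict if the two values differ.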

How to make it work!

We prefer to consider testing of machine learning products at very different levels of abstraction. Below we list a basic checklist of ways you can test your models.

Fixed seed testing. We start with an obvious one because we'd be remiss not to mention it. For all your testing, fix the seed of your model. Be aware that with libraries such as PyTorch this involves extra steps if using the GPU, although we'd recommend you run all unit tests on the CPU and leave the GPU deployments to integration testing.
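A minimal seed-fixing helper, as a sketch (the extra GPU-determinism flags are deliberately left out, per the CPU-only recommendation above):

import random

import numpy as np
import torch


def fix_seed(seed: int = 42) -> None:
    # Pin every source of randomness our CPU unit tests touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)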

Named tensor dimensions. Whilst for machine learning researchers the consensus of tensor dimensions representing (Batch_Size x Features) is well established, for sequential modelling this can sometimes not be the case (look at PyTorch's default implementation for RNNs, for example [LINK]). In those cases, it becomes useful to specify exactly what the first dimension of your tensor refers to. There's a nice methodology presented here, but until it is incorporated as a default into the libraries, any new tensor created won't follow this convention!
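As a hedged sketch of what this buys you, PyTorch's (prototype) named tensor API lets the dimension names live on the tensor itself rather than in comments:

import torch

# Names document what each dimension means.
x = torch.zeros(32, 100, names=("batch", "features"))
seq = torch.zeros(10, 32, names=("time", "batch"))

# Operations between mismatched names now fail loudly instead of
# silently broadcasting:
# seq + x  # raises: ("time", "batch") cannot unify with ("batch", "features")

# Existing unnamed tensors can be annotated after the fact.
y = torch.zeros(32, 100).refine_names("batch", "features")
print(x.names, y.names)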

Setting constant weights. As mentioned, software engineers need to be able to calculate the outputs of a network component easily, which can be difficult with matrix calculations and the traditionally small values networks use (bounded between 0 and 1). To appease this issue, for all testing we pre-initialise our weights to a constant value (such as 0.002), which allows for simpler calculation of output vectors. We include the helper function below:

from torch.nn import init


def set_constant_weights(self, value: float):
    # Overwrite every parameter (weights and biases alike) with a constant
    # so that expected outputs can be computed by hand in unit tests.
    for weight in self.parameters():
        init.constant_(weight, value)
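As a quick sanity check of the idea (a sketch assuming a plain nn.Linear module; the 0.002 value matches the constant suggested above):

import torch
from torch import nn
from torch.nn import init


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def set_constant_weights(self, value: float):
        for weight in self.parameters():
            init.constant_(weight, value)

    def forward(self, x):
        return self.fc(x)


net = TinyNet()
net.set_constant_weights(0.002)
out = net(torch.ones(1, 4))
# Each output unit = 4 inputs * (1.0 * 0.002) + bias 0.002 = 0.01
assert torch.allclose(out, torch.full((1, 2), 0.01))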

Single batch sanity. A common sanity check for researchers is to test whether supervised/unsupervised models can overfit to a single batch. Conveniently, this is also computationally reasonable to run and trains easily, and so can be scheduled on Jenkins or other pipeline managers. We consider this a fundamental regression test and apply it daily to check we're not introducing bugs.
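A minimal sketch of such a check, with a toy classifier standing in for a real model:

import torch
from torch import nn


def test_overfits_single_batch():
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()

    # One fixed batch; a healthy model and training loop should drive
    # the loss on it close to zero.
    x = torch.randn(4, 8)
    y = torch.tensor([0, 1, 0, 1])

    for _ in range(200):
        optimiser.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimiser.step()

    assert loss.item() < 0.05, f"failed to overfit a single batch: {loss.item():.3f}"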

Regression test everything. Constantly evaluating performance is an expensive task; however, when performance drops, it is imperative to know why. Deep learning projects are incorporated into Continuous Delivery processes, in which code changes can be made frequently.
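One simple shape such a regression test can take (a sketch; the baseline file and its location are hypothetical, not from the original article):

import json
from pathlib import Path

# Hypothetical baseline metrics committed alongside the code.
BASELINE_FILE = Path("tests/baselines/metrics.json")


def check_no_regression(metric: str, value: float, tolerance: float = 0.01):
    # When this fires, the failing metric points at why performance dropped.
    baseline = json.loads(BASELINE_FILE.read_text())[metric]
    assert value >= baseline - tolerance, (
        f"{metric} regressed: {value:.4f} vs baseline {baseline:.4f}"
    )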

Make synthetic datasets. Performance is not only code dependent, but also data dependent, so don't attribute good performance without multiple tests. A particularly useful tool for maintaining large files alongside your code is DVC, a form of git for data. With its nice interface, it becomes easy to manage datasets and to begin training/testing pipelines.
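A sketch of the kind of synthetic dataset we mean (the generator below is illustrative; because the data's structure is known exactly, performance on it is attributable to the code rather than to quirks of production data):

import numpy as np


def make_synthetic_classification(n: int = 1000, dim: int = 8, seed: int = 0):
    # Linearly separable labels from a known weight vector: if a model
    # can't learn this, the bug is in the code, not the data.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, dim))
    w = rng.normal(size=dim)
    y = (x @ w > 0).astype(np.int64)
    return x, y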

Create interpretable tests. In general, we consider testing edge cases within the unit tests, when testing the input/output of a unit of code. However, with neural networks the new unit is the trained model, which can be considered the integration test.
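As a hedged example of what an interpretable test on a trained model might look like (the unambiguous inputs and class indices below are invented for illustration):

import torch


def test_obvious_cases(trained_model):
    # Human-interpretable cases: if these fail, the failure itself tells
    # you something about the model, not just about the code.
    easy_positive = torch.full((1, 8), 3.0)   # deep inside class 1's region
    easy_negative = torch.full((1, 8), -3.0)  # deep inside class 0's region
    assert trained_model(easy_positive).argmax(dim=-1).item() == 1
    assert trained_model(easy_negative).argmax(dim=-1).item() == 0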

Conclusion

In general, this is by no means an exhaustive list, and there are multiple other considerations for testing new features and architectures, such as considerations for staging. This is what we learnt on our first iteration, and we look forward to hearing from the community about other techniques.
