How to Test Deep Learning


One of the largest challenges when starting our company was learning how to use deep learning models in production-grade software. While we had solved the challenge of proving our models could solve the problem in a controlled environment (with nice datasets), building a pipeline and testing suite was difficult, and we'd like to share what we've learnt.


In particular, our product uses "online learning", in which each model is trained specifically for a client. This makes our models vulnerable to data dependencies as well as technical dependencies.


While great literature now exists on why machine learning builds technical debt (see here), we wanted to focus on methods to alleviate it using testing.


The Testing Framework

Testing is the backbone of any good software project; it is the fundamental way to maintain code. Traditionally it can be broken down into four tiers, each abstracting and working on a larger scope than the previous. For deep learning, we particularly focus on the first two (Unit and Integration).

Unit testing evaluates code at the component level, where actual outputs are compared with expected outputs. For deep learning, this covers everything from individual layers (e.g. linear or convolutional layers) to small modules (e.g. your custom string encoders). While the former are normally well tested by the PyTorch and TF libraries themselves, the latter present difficulties.
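A minimal sketch of what a layer-level unit test can look like: if we pin the weights to values we control, the expected output is easy to compute by hand (the specific layer and values here are illustrative, not from the original post).

```python
import torch
import torch.nn as nn

# Unit test for a single layer: with weights we control,
# the expected output can be computed by hand.
layer = nn.Linear(2, 1, bias=False)
with torch.no_grad():
    layer.weight.fill_(1.0)  # the output becomes the sum of the inputs

out = layer(torch.tensor([[2.0, 3.0]]))
assert torch.allclose(out, torch.tensor([[5.0]]))  # 2 + 3 = 5
```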


Integration testing evaluates the orchestrated performance of components of code working together, checking that interfaces match and the correct data is sent between them. In machine learning, we consider this to cover tasks such as training and inference. Inherently, these tasks involve interplay between your data processing, training mechanisms and deployment, and at each stage biases can be introduced.


Why it Fails for Deep Learning

Some of our pet peeves on why this framing fails:

Maths is boring. Calculating tensor multiplications is difficult, and can rarely be done by an engineer as a ''back of the envelope'' calculation. The maths is fiddly: even with fixed-seed initialisation, the regular Xavier weight initialisation uses 32-bit floats, matrix multiplication involves a long series of calculations, and testing modules with batching involves hard-coding tensors with three dimensions. While none of these tasks is insurmountable, they massively stifle development time and the creation of useful unit tests.

Models fail silently. While you can test that a network behaves correctly on a specific input, the inputs to a neural network are rarely a finite set (with the exception of some limited discrete models). Networks work in larger orchestrations and regularly change their inputs, outputs and gradients. To keep the PyTorch library versatile, many methods are lazy; I cannot recount how many times a bug has been caused by PyTorch's weak matrix operations, in which shapes don't need to match! However, this alludes to the larger problem…
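The shape-mismatch failure mode mentioned above is easy to reproduce: broadcasting silently "fixes" mismatched shapes rather than raising an error.

```python
import torch

# Subtracting a (3,) vector from a (3, 1) column does not error:
# broadcasting expands both operands to a (3, 3) matrix.
a = torch.ones(3, 1)
b = torch.ones(3)
diff = a - b
assert diff.shape == (3, 3)  # no exception, despite the likely bug
```

In a loss computation this kind of silent broadcast can wreck training while every individual call still "works", which is exactly why shape assertions in tests pay off.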


Learning is an emergent property. Machine learning arises from the collaborative functioning of layers and does not belong to any single property of the system, so ultimately you can never be sure the model produced has the exact properties you'd like. To this end, actually testing the quality of a model requires training, which would traditionally be considered our second tier of testing: integration. In addition, this form of training is computationally expensive and time-consuming, and usually not possible on local machines.


Multiple service orchestration is horrible. In scenarios where you have multiple services in your product, architecture versioning becomes important. A common scenario I've encountered is having one micro-service for training models and another for using the model in inference mode. Now imagine a researcher implements a change as small as swapping a ReLU for a LeakyReLU activation: this can throw any service dependent on a previous neural architecture out of sync. While version control is a problem encountered by most software products and can normally be mocked out, the ML libraries are playing catch-up and such functionality doesn't exist.
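One mitigation we can sketch (the helper names here are our own, not an ML-library feature) is to store a fingerprint of the architecture configuration alongside the weights, so an inference service fails loudly instead of silently loading weights from a mismatched architecture:

```python
import hashlib
import json
import os
import tempfile
import torch
import torch.nn as nn

def arch_hash(config: dict) -> str:
    # Deterministic fingerprint of the architecture description.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def save_versioned(model: nn.Module, config: dict, path: str) -> None:
    torch.save({"arch_hash": arch_hash(config), "state_dict": model.state_dict()}, path)

def load_versioned(model: nn.Module, config: dict, path: str) -> None:
    ckpt = torch.load(path)
    if ckpt["arch_hash"] != arch_hash(config):
        raise RuntimeError("checkpoint was produced by a different architecture")
    model.load_state_dict(ckpt["state_dict"])

# Usage: training service saves, inference service verifies before loading.
cfg = {"layers": [2, 2], "activation": "relu"}
path = os.path.join(tempfile.mkdtemp(), "model.pt")
save_versioned(nn.Linear(2, 2), cfg, path)
load_versioned(nn.Linear(2, 2), cfg, path)  # same config: loads fine
```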


How to Make it Work!

We prefer to consider testing of machine learning products at a very different level of abstraction. Below we list a basic checklist of ways you can test your models.



Fixed seed testing. We start with an obvious one because we'd be remiss not to mention it. For all your testing, fix the seed of your model. Be aware that in libraries such as PyTorch this involves extra steps if using the GPU, although we'd recommend you run all unit tests on the CPU and leave the GPU deployments to integration testing.
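A minimal CPU-only sketch of seed fixing (the helper name is our own); seeding Python, NumPy and PyTorch together makes the whole test deterministic:

```python
import random
import numpy as np
import torch

def fix_seeds(seed: int = 42) -> None:
    """Fix all relevant RNG seeds so unit tests are reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

# Two runs from the same seed produce identical tensors.
fix_seeds(0)
a = torch.randn(3)
fix_seeds(0)
b = torch.randn(3)
assert torch.equal(a, b)
```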


Named tensor dimensions. While among machine learning researchers the convention of tensor dimensions representing (Batch_Size x Features) is well established, for sequential modelling this is sometimes not the case (look at PyTorch's default implementation of RNNs, for example [LINK]). In those cases, it becomes useful to specify exactly what the first dimension of your tensor refers to. There's a nice methodology presented here, but until it is incorporated as a default into the libraries, any new tensor created won't follow this convention!
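In the meantime, even a tiny self-documenting assertion helper (our own convention, not a library feature) makes the intended dimension order explicit at test boundaries:

```python
import torch

def assert_dims(tensor: torch.Tensor, **named_dims: int) -> torch.Tensor:
    """Check that a tensor's shape matches the named dims, in order,
    e.g. assert_dims(x, batch=32, features=128)."""
    expected = tuple(named_dims.values())
    if tuple(tensor.shape) != expected:
        raise ValueError(
            f"expected {dict(named_dims)} -> {expected}, got {tuple(tensor.shape)}"
        )
    return tensor

x = torch.zeros(32, 128)
assert_dims(x, batch=32, features=128)  # passes, and documents the layout
```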


Setting constant weights. As mentioned, software engineers need to be able to calculate the outputs of a network component easily, which can be difficult with matrix calculations and the traditionally small values networks use (bounded between 0 and 1). To ease this issue, for all testing we pre-initialise our weights to a constant value (such as 0.002), which allows for simpler calculation of output vectors. We include the helper function below:


import torch.nn.init as init

def set_constant_weights(self, value: float):
    for weight in self.parameters():
        init.constant_(weight, value)
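For example, attached to a toy module (the module itself is illustrative), the helper makes the expected output trivial to compute: with a bias-free linear layer, each output is simply the constant times the sum of the inputs.

```python
import torch
import torch.nn as nn
import torch.nn.init as init

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2, bias=False)

    def forward(self, x):
        return self.fc(x)

    def set_constant_weights(self, value: float):
        for weight in self.parameters():
            init.constant_(weight, value)

net = TinyNet()
net.set_constant_weights(0.002)
out = net(torch.ones(1, 4))
# each output = 4 inputs * 0.002 = 0.008
assert torch.allclose(out, torch.full((1, 2), 0.008))
```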

Single batch sanity. A common sanity check for researchers is to test that supervised/unsupervised models can overfit a single batch. Conveniently, this is also computationally reasonable to run and train, and so can easily be scheduled on Jenkins or other pipeline managers. We consider this a fundamental regression test and apply it daily to check we're not introducing bugs.


Regression test everything. Constantly evaluating performance is an expensive task; however, when performance drops, it is imperative to know why. Deep learning projects are incorporated into Continuous Delivery processes, in which code changes can be made frequently, so each change needs to be checked against known-good performance.


Make synthetic datasets. Performance is not only code dependent but also data dependent, so don't attribute good performance without multiple tests. A particularly useful tool for maintaining large files alongside your code is DVC, a form of git for data. With its nice interface, it becomes easy to manage datasets and begin training/testing pipelines.
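A synthetic dataset can be as small as a seeded generator (this sketch and its names are ours): because the data is linearly separable by construction, any sane classifier should score near 100% on it, so a failure points at the code rather than the data.

```python
import torch
from torch.utils.data import TensorDataset

def make_synthetic_classification(n=256, features=8, seed=0):
    # Labels come from a fixed random hyperplane, so the task is
    # linearly separable and fully reproducible from the seed.
    g = torch.Generator().manual_seed(seed)
    w = torch.randn(features, generator=g)
    x = torch.randn(n, features, generator=g)
    y = (x @ w > 0).long()
    return TensorDataset(x, y)

ds = make_synthetic_classification()
assert len(ds) == 256
```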


Create interpretable tests. In general, we consider testing edge cases within the unit tests, when testing the input/output of a unit of code. With neural networks, however, the new unit is the trained model, which can be considered the integration test.



This is by no means an exhaustive list, and there are multiple other considerations for testing new features and architectures, such as considerations for staging. This is what we learnt on our first iteration, and we look forward to hearing from the community about other techniques.



This article was shared from the WeChat public account 软件测试培训 (iTestTrain).




