
Complete Guide to Machine Learning Project Structuring for Managers

Source: 企鹅号 - 壹企问

AI is now on everyone’s mind. Established companies are disrupting themselves and making a painfully slow shift towards becoming data-driven organizations, while startups need to implement clear and efficient data strategies to stay relevant.

Although the necessity of a data strategy is now widely accepted across small and large companies, a common challenge remains: how do you structure and manage a machine learning project?

This article offers a framework to help you manage a machine learning project. You will have to adapt it to your company’s specific needs, but it will point you in the right direction.

Why do I need an AI strategy?

We need to start with why. Why is it important to develop an AI strategy within the company?

The problem in machine learning projects is that there are a lot of ways to improve the performance of the model:

· Gather more data

· Train the algorithm for a longer time

· Change the architecture of the model

· Get a more diverse training set

However, pursuing the wrong strategy can result in a significant loss of time and money. You could spend six months collecting more training data, only to realize that it barely improved your model. Similarly, you could blindly train your model longer (and pay for the extra computation time) and see no improvement at all.

Hence the importance of a well-defined AI strategy. It will help make the team more efficient and increase the ROI of your AI projects.

Orthogonalization

The most effective machine learning practitioners have a clear view of what to tune in order to achieve a better result.

Orthogonalization refers to having a control with a very specific function.

For example, an office chair has a lever to raise and lower it, while the wheels on the chair make it move horizontally. In this case, the lever is a control with the function of raising and lowering the chair. The wheels form a control with the function of moving the chair horizontally.

Therefore, these controls are said to be orthogonal: rolling the chair on its wheels will not lower it, just as pulling the lever on the chair will not make it move backwards.

The same concept must be applied to machine learning projects. A single modification to the project should affect a single aspect of it. Otherwise, you will improve in one area but lower performance in another, and the project will get stuck.

How does this translate to AI projects?

First, we must consider the chain of assumptions in machine learning.

Chain of assumptions in machine learning

It is assumed that if the model performs well on the training set, then it will perform well on the dev set, then on the test set, and finally in the real world.

This is a fairly common list of assumptions across AI projects. Now, what if the model does not perform well in one of these situations?

· Training set: train a bigger network or change the optimization algorithm

· Dev set: use regularization or a bigger training set

· Test set: use a bigger dev set

· Real world: change the dev set distribution (more on that later) or change the cost function

The list above gives clear orthogonal controls to improve the model in very specific situations. Once your model performs well on one set, move on to improving it on the next set.
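The list of controls can be sketched as a small diagnostic that walks the chain and stops at the first broken link. This is only an illustration: the function name `next_control`, the `target_err` and `tolerance` thresholds, and the `real_world_ok` flag are hypothetical and would need to be chosen for your own project.

```python
def next_control(train_err, dev_err, test_err, real_world_ok,
                 target_err=0.05, tolerance=0.02):
    """Return the control to adjust for the first broken link in the chain.

    target_err and tolerance are illustrative thresholds, not recommendations.
    """
    if train_err > target_err:
        return "training set: train a bigger network or change the optimization algorithm"
    if dev_err - train_err > tolerance:
        return "dev set: use regularization or a bigger training set"
    if test_err - dev_err > tolerance:
        return "test set: use a bigger dev set"
    if not real_world_ok:
        return "real world: change the dev set distribution or the cost function"
    return "all links hold"


# Good on the training set, clearly worse on the dev set.
print(next_control(train_err=0.03, dev_err=0.10, test_err=0.11, real_world_ok=True))
```

Because each check looks at one link only, fixing one link tells you nothing about the later ones; rerun the diagnostic after every change.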

Now, how do you know if your model performs well?

Setting up a goal

As outlined above, you need a clear goal to determine if a model is performing well. Hence the importance of setting an evaluation metric, as well as satisficing and optimizing metrics.

Single number evaluation metric

Having a single evaluation metric allows for quicker assessment of an algorithm.

For example, it is common to use precision and recall for a classifier. However, there is a trade-off between these two metrics, so two models can each win on one of them. Instead, use the F1 score, which is the harmonic mean of precision and recall. With a single metric, it becomes much easier to compare the quality of different models, which speeds up iterations.
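As a minimal sketch of why a single number helps, the snippet below computes the F1 score as the harmonic mean of precision and recall and uses it to rank two classifiers. The precision and recall values are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for two classifiers.
models = {"A": (0.95, 0.90), "B": (0.98, 0.85)}

scores = {name: f1_score(p, r) for name, (p, r) in models.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"model {name}: F1 = {score:.3f}")
print("best model:", best)
```

Model B has the higher precision and model A the higher recall, so the two metrics alone do not settle the comparison; the F1 score does.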

Satisficing and optimizing metrics

Once you have a single evaluation metric, it is common to track other important metrics.

For example, you might want to build a classifier with an F1 score of at least 0.90 and a runtime of less than 200ms. In this case, the F1 score is the optimizing metric, while the runtime is the satisficing metric.

An optimizing metric will usually be the same as your evaluation metric, and you should only have one optimizing metric. The other metrics of interest are satisficing metrics: thresholds that a model only has to meet. Among the models that meet every satisficing threshold, choose the one that does best on the optimizing metric.
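One way to picture the two kinds of metrics is as a filter followed by a ranking. The sketch below assumes a hypothetical list of candidate models with made-up F1 scores and runtimes; it first drops the models that miss the 200ms runtime threshold, then picks the highest F1 score among the rest.

```python
# Hypothetical candidates; the numbers are for illustration only.
candidates = [
    {"name": "small", "f1": 0.91, "runtime_ms": 80},
    {"name": "medium", "f1": 0.93, "runtime_ms": 150},
    {"name": "large", "f1": 0.95, "runtime_ms": 450},
]


def select_model(candidates, max_runtime_ms=200):
    """Filter on the satisficing metric, then maximize the optimizing metric."""
    feasible = [m for m in candidates if m["runtime_ms"] < max_runtime_ms]
    if not feasible:
        return None  # no model meets the runtime threshold
    return max(feasible, key=lambda m: m["f1"])


chosen = select_model(candidates)
print(chosen["name"] if chosen else "no feasible model")
```

The "large" model has the best F1 score but is ruled out by the runtime threshold, so "medium" is selected.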

Training, development and test sets

The train, dev and test sets were mentioned above, but what are they exactly?

The training and development (or holdout) sets are both used while training a model. The training set is used to fit the model to the data, and the development set is used to make predictions and tweak the model.

The test set is a sample of real-life data, held back from training, on which you test the algorithm to see how it would perform.

Train/dev/test distributions

Once you have different datasets, you must make sure that their distribution is representative of the data you expect to get in the future.

For example, if you wish to create a model to label images from mobile uploads, it does not make sense to train the model on high-resolution images from the internet. Mobile uploads are likely to be lower in resolution, the pictures can be blurry, and the objects might not be perfectly centered. Therefore, the train/dev/test sets should contain that type of image.

Also, you want each set to come from the same distribution. For example, suppose you are building a model to predict client churn, and 6% of your dataset consists of churn instances. Then your train, dev, and test sets should each have roughly 6% of their data as churn instances.
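A stratified split is one way to keep the churn rate the same in every set. The sketch below uses only the standard library and a synthetic label list with 6% churn; the function name `stratified_split`, the 60/20/20 fractions, and the seed are assumptions for illustration.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split indices into train/dev/test so each class keeps the same proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, dev, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_dev = round(len(indices) * fractions[1])
        train += indices[:n_train]
        dev += indices[n_train:n_train + n_dev]
        test += indices[n_train + n_dev:]  # whatever remains goes to test
    return train, dev, test


def churn_rate(indices, labels):
    """Fraction of the given rows labeled as churn (label 1)."""
    return sum(labels[i] for i in indices) / len(indices)


# Synthetic data: 60 churn instances out of 1,000 rows (6%).
labels = [1] * 60 + [0] * 940
train, dev, test = stratified_split(labels)

for name, part in (("train", train), ("dev", dev), ("test", test)):
    print(name, len(part), f"{churn_rate(part, labels):.2%}")
```

Splitting each class separately is what keeps the rate stable; a plain random split of 1,000 rows could easily leave the dev or test set with noticeably more or less than 6% churn.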

Train/dev/test size

How large should each set be?

Traditionally, the split was 60/20/20 for the train/dev/test sets respectively. This is still valid if data is not very abundant.

However, when you have millions of instances, a more appropriate split is 98/1/1, because the model can still be validated on more than 10,000 data points.
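The arithmetic behind the two splits is easy to check. In this sketch the dataset sizes (5,000 and 2,000,000 rows) are hypothetical, and percentages are passed as integers so the counts stay exact.

```python
def split_sizes(n_examples, percentages):
    """Return (train, dev, test) counts for integer percentages summing to 100."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    n_train = n_examples * percentages[0] // 100
    n_dev = n_examples * percentages[1] // 100
    n_test = n_examples - n_train - n_dev  # remainder goes to the test set
    return n_train, n_dev, n_test


# Small dataset: the classic 60/20/20 split.
print(split_sizes(5_000, (60, 20, 20)))

# Large dataset: 98/1/1 still leaves 20,000 rows each for dev and test.
print(split_sizes(2_000_000, (98, 1, 1)))
```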

Comparing to human-level performance

Recently, we started seeing headlines where AI systems outperform humans or come very close to human performance.

Unfortunately, humans are very good at a lot of tasks and it is very hard to get AI systems close to our performance. Colossal amounts of data are required, and your model’s performance will eventually plateau, making it hard to improve.

Still, how do I improve a model?

If your model is overfitting, you must reduce variance by:

· Collecting more data

· Using regularization (L2, dropout, data augmentation)

· Changing the model

If your model is underfitting the data, you must reduce bias by:

· Training a bigger or more complex model

· Using a better optimization algorithm or training for a longer time

· Changing the model
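To decide which of the two lists applies, one common heuristic is to compare error gaps. The sketch below is an assumption layered on top of the article, not something it states: it treats human-level error as the baseline, calls the gap between training error and that baseline the bias, and calls the gap between dev error and training error the variance. The function name `diagnose` and the error values are hypothetical.

```python
def diagnose(human_err, train_err, dev_err):
    """Compare the bias gap and the variance gap and report the larger one."""
    bias_gap = train_err - human_err    # how far training error is from the baseline
    variance_gap = dev_err - train_err  # how much worse the model is on unseen data
    if variance_gap > bias_gap:
        return "overfitting: reduce variance"
    return "underfitting: reduce bias"


# Training error is near the baseline, dev error is much worse.
print(diagnose(human_err=0.01, train_err=0.02, dev_err=0.10))

# Training error is already far from the baseline.
print(diagnose(human_err=0.01, train_err=0.08, dev_err=0.10))
```

When the two gaps are similar, both lists may be worth trying; the comparison only suggests where to start.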

If none of the methods above has a significant impact, then getting data labeled by humans is the next step. Although costly and arduous, this step will bring your model as close as possible to human-level performance.

Last words

Building an AI system is an iterative process. It is important to build fast, test, and improve. Do not aim to build a very complex system at first, but do not build something too simple either.

I hope this helps you better manage and plan your AI project. The potential of AI is huge in many industries and seizing this opportunity is important. Having a clear AI strategy will help you surf the wave instead of being engulfed.

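Published: 2019-04-19 15:03:30
Original link: https://kuaibao.qq.com/s/20190419A0CBJQ00?refer=cp_1026
The Tencent Cloud Developer Community (腾讯云开发者社区) is one of the distribution channels for Tencent Content Open Platform accounts (企鹅号) and republishes this content under the Tencent Content Open Platform Service Agreement.
In case of infringement, please contact cloudcommunity@tencent.com for removal.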
