Google's 43 Golden Rules of Machine Learning (Handbook Edition + PDF)

abs_zero

Published 2018-07-25 09:47:42


Included in the column: AI派

Recommended reading time: 10 to 12 min. Topic: Google's 43 Golden Rules of Machine Learning (Handbook Edition + PDF)

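The earlier post, "Google's Rules of Machine Learning: Best Practices for ML Engineering", covered Google's practical machine learning experience in detail, and many readers asked whether a handbook or PDF version exists. This post lists the condensed rules one by one (the original post gives each rule in English and Chinese) and shares PDF files of the Chinese and English versions, with bookmarked tables of contents, at the end.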

Rule list

Rule #1: Don't be afraid to launch a product without machine learning.
Rule #2: First, design and implement metrics.
Rule #3: Choose machine learning over a complex heuristic.
Rule #4: Keep the first model simple and get the infrastructure right.
Rule #5: Test the infrastructure independently from the machine learning.
Rule #6: Be careful about dropped data when copying pipelines.
Rule #7: Turn heuristics into features, or handle them externally.
Rule #8: Know the freshness requirements of your system.
Rule #9: Detect problems before exporting models.
Rule #10: Watch for silent failures.
Rule #11: Give feature columns owners and documentation.
Rule #12: Don't overthink which objective you choose to directly optimize.
Rule #13: Choose a simple, observable and attributable metric for your first objective.
Rule #14: Starting with an interpretable model makes debugging easier.
Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.
Rule #16: Plan to launch and iterate.
Rule #17: Start with directly observed and reported features as opposed to learned features.
Rule #18: Explore with features of content that generalize across contexts.
Rule #19: Use very specific features when you can.
Rule #20: Combine and modify existing features to create new features in human-understandable ways.
Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
Rule #22: Clean up features you are no longer using.
Rule #23: You are not a typical end user.
Rule #24: Measure the delta between models.
Rule #25: When choosing models, utilitarian performance trumps predictive power.
Rule #26: Look for patterns in the measured errors, and create new features.
Rule #27: Try to quantify observed undesirable behavior.
Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.
Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
Rule #30: Importance-weight sampled data, don't arbitrarily drop it!
Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible.
Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
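Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data. In a filtering task, examples labeled negative are not shown to the user; for example, a filter might block 75% of the negative examples. If the next round of training examples is drawn only from what was shown to users, the training data is clearly biased.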
Rule #35: Beware of the inherent skew in ranking problems.
Rule #36: Avoid feedback loops with positional features.
Rule #37: Measure Training/Serving Skew.
Rule #38: Don't waste time on new features if unaligned objectives have become the issue.
Rule #39: Launch decisions are a proxy for long-term product goals.
Rule #40: Keep ensembles simple.
Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
Rule #42: Don't expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
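
Two of the rules above are concrete enough to sketch in a few lines. The Python below is only an illustration under stated assumptions, not code from the original handbook: it splits a toy log by date as in Rule #33, then downsamples negatives while attaching importance weights as in Rule #30. The function names, the dictionary fields (`day`, `label`, `weight`), the year 2018, and the keep probability of 0.3 are all hypothetical choices made for this example.

```python
import random
from datetime import date


def split_by_date(examples, cutoff):
    """Rule #33 sketch: train on data up to `cutoff`, test on later data."""
    train = [e for e in examples if e["day"] <= cutoff]
    test = [e for e in examples if e["day"] > cutoff]
    return train, test


def downsample_with_weights(examples, keep_prob, rng):
    """Rule #30 sketch: keep each negative with probability `keep_prob`
    and give every kept negative the weight 1 / keep_prob."""
    sampled = []
    for e in examples:
        if e["label"] == 1:
            # Positives are always kept, with unit weight.
            sampled.append({**e, "weight": 1.0})
        elif rng.random() < keep_prob:
            # Kept negatives stand in for the ones that were dropped.
            sampled.append({**e, "weight": 1.0 / keep_prob})
    return sampled


# Hypothetical toy log: one positive and nine negatives per day, January 1 to 10.
examples = [
    {"day": date(2018, 1, d), "label": 1 if i == 0 else 0}
    for d in range(1, 11)
    for i in range(10)
]

train, test = split_by_date(examples, cutoff=date(2018, 1, 5))
sampled_train = downsample_with_weights(train, keep_prob=0.3, rng=random.Random(0))
```

Weighting each kept negative by `1 / keep_prob` keeps the weighted negative count equal to the original count in expectation, which is the point of weighting sampled data instead of silently dropping it. Splitting on the date means the model is evaluated on data it could not have seen, which is closer to how it will behave once launched.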

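Document download

The document comes in a Chinese version and an English version, both with bookmarked tables of contents.

English version:

Chinese version: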


Author: 1or0, a contracted author at 脑洞大开 (www.naodongopen.com), focusing on machine learning research.

This article is part of the Tencent Cloud self-media syndication program and is shared from a WeChat official account.

Originally published: 2018-05-06. In case of infringement, contact cloudcommunity@tencent.com for removal.


This article is shared from the AI派 WeChat official account.
