Published 2018-07-25 09:47:42

Topic: Google's 43 Golden Rules of Machine Learning (Handbook Edition + PDF)

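An earlier post, "Google's Rules of Machine Learning: Best Practices for ML Engineering" (谷歌机器学习法则:ML工程的最佳实践), covered Google's practical experience with machine learning in detail, and many readers have since asked whether a handbook edition and a PDF version are available. This post lists the condensed rules one by one and shares the Chinese + English PDF (with a bookmarked table of contents) with everyone (see the end of the post).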


  • Rule #1: Don't be afraid to launch a product without machine learning.
  • Rule #2: First, design and implement metrics.
  • Rule #3: Choose machine learning over a complex heuristic.
  • Rule #4: Keep the first model simple and get the infrastructure right.
  • Rule #5: Test the infrastructure independently from the machine learning.
  • Rule #6: Be careful about dropped data when copying pipelines.
  • Rule #7: Turn heuristics into features, or handle them externally.
  • Rule #8: Know the freshness requirements of your system.
  • Rule #9: Detect problems before exporting (releasing) models.
  • Rule #10: Watch for silent failures.
  • Rule #11: Give feature columns owners and documentation.
  • Rule #12: Don't overthink which objective you choose to directly optimize.
  • Rule #13: Choose a simple, observable and attributable metric for your first objective.
  • Rule #14: Starting with an interpretable model makes debugging easier.
  • Rule #15: Separate spam filtering and quality ranking in a policy layer.
  • Rule #16: Plan to launch and iterate.
  • Rule #17: Start with directly observed and reported features as opposed to learned features.
  • Rule #18: Explore with features of content that generalize across contexts.
  • Rule #19: Use very specific features when you can.
  • Rule #20: Combine and modify existing features to create new features in human-understandable ways.
  • Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
  • Rule #22: Clean up features you are no longer using.
  • Rule #23: You are not a typical end user.
  • Rule #24: Measure the delta between models.
  • Rule #25: When choosing models, utilitarian performance trumps predictive power.
  • Rule #26: Look for patterns in the measured errors, and create new features.
  • Rule #27: Try to quantify observed undesirable behavior.
  • Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.
  • Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
  • Rule #30: Importance-weight sampled data, don't arbitrarily drop it!
  • Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
  • Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible.
  • Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
  • Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data. In a filtering task, examples marked as negative are never shown to the user; for example, the filter might block 75% of the examples it marks negative. If you then collect the next round of training examples only from what was shown to users, your training data is clearly biased.
  • Rule #35: Beware of the inherent skew in ranking problems.
  • Rule #36: Avoid feedback loops with positional features.
  • Rule #37: Measure training/serving skew.
  • Rule #38: Don't waste time on new features if unaligned objectives have become the issue.
  • Rule #39: Launch decisions are a proxy for long-term product goals.
  • Rule #40: Keep ensembles simple.
  • Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
  • Rule #42: Don't expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
  • Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
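
Two of the rules above are concrete enough to sketch in a few lines of code. The Python below is only an illustration under stated assumptions, not anything taken from Google's guide: the function names `downsample_with_weights` and `split_by_date`, the `(features, label)` tuple layout, and the `"day"` key are all hypothetical. The first function follows the idea of Rule #30: when negatives are downsampled, each kept negative carries a weight equal to the inverse of its sampling probability, so weighted totals stay roughly unbiased. The second follows Rule #33: train on data up to a cutoff day and test only on later days.

```python
import random
from datetime import date


def downsample_with_weights(examples, keep_prob_negative, rng):
    """Keep every positive example; keep each negative with the given probability.

    Kept negatives get weight 1 / keep_prob_negative (importance weighting),
    so the weighted number of negatives approximates the original count.
    `examples` is an iterable of (features, label) pairs with label in {0, 1};
    this layout is an assumption made for the sketch.
    """
    if not 0.0 < keep_prob_negative <= 1.0:
        raise ValueError("keep_prob_negative must be in (0, 1]")
    kept = []
    for features, label in examples:
        if label == 1:
            kept.append((features, label, 1.0))
        elif rng.random() < keep_prob_negative:
            kept.append((features, label, 1.0 / keep_prob_negative))
    return kept


def split_by_date(rows, last_training_day):
    """Split rows so that testing uses only data from after the training window.

    Rows dated up to and including `last_training_day` go to training;
    rows dated later go to testing. Each row is a dict with a "day" key
    holding a datetime.date (a hypothetical schema).
    """
    train = [r for r in rows if r["day"] <= last_training_day]
    test = [r for r in rows if r["day"] > last_training_day]
    return train, test


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    data = [({"id": i}, 1) for i in range(5)] + [({"id": i}, 0) for i in range(1000)]
    sampled = downsample_with_weights(data, keep_prob_negative=0.25, rng=rng)
    print("kept examples:", len(sampled))

    rows = [{"day": date(2018, 1, d)} for d in range(1, 11)]
    train, test = split_by_date(rows, last_training_day=date(2018, 1, 5))
    print("train days:", len(train), "test days:", len(test))
```

The third element of each tuple returned by `downsample_with_weights` would typically be passed to a trainer as a per-example weight; how that is done depends on the training library in use.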







