TensorFlow Wide & Deep Learning Tutorial

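In the previous TensorFlow Linear Model Tutorial, we trained a logistic regression model on the Census Income dataset to predict the probability that an individual's annual income exceeds 50,000 dollars. TensorFlow is also well suited to training deep neural networks, so you might wonder which one to choose. Why not both? Is it possible to combine the strengths of both in one model?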

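In this tutorial, we introduce how to use the tf.estimator API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It is useful for generic large-scale regression and classification problems with sparse input features (for example, categorical features with a large number of possible feature values). If you are interested in learning more about how Wide & Deep Learning works, please check out our research paper.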

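The figure above compares a wide model (logistic regression with sparse features and transformations), a deep model (a feed-forward neural network with an embedding layer and several hidden layers), and a Wide & Deep model (joint training of both). At a high level, only 3 steps are needed to configure a wide, deep, or Wide & Deep model using the tf.estimator API: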

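1. Select features for the wide part: choose the sparse base columns and crossed columns you want to use.

2. Select features for the deep part: choose the continuous columns, the embedding dimension for each categorical column, and the hidden layer sizes.

3. Put them all together in a Wide & Deep model (DNNLinearCombinedClassifier).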

And that's it! Let's go through a simple example.

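Setup

To try the code for this tutorial:

1. Install TensorFlow if you haven't already.

2. Download the tutorial code.

3. Install the pandas data analysis library. tf.estimator doesn't require pandas, but it does support it, and this tutorial uses pandas. To install pandas:

a. Get pip:

# Ubuntu/Linux 64-bit
$ sudo apt-get install python-pip python-dev

# Mac OS X
$ sudo easy_install pip
$ sudo easy_install --upgrade six

b. Use pip to install pandas:

$ sudo pip install pandas

If you have trouble installing pandas, consult the instructions on the pandas website.

4. Execute the tutorial code with the following command to train the Wide & Deep model described in this tutorial: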

$ python wide_n_deep_tutorial.py --model_type=wide_n_deep

Read on to find out how this code builds its Wide & Deep model.

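Define Base Feature Columns

First, let's define the base categorical and continuous feature columns that we'll use. These base columns are the building blocks used by both the wide part and the deep part of the model.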

import tensorflow as tf

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])

# To show an example of hashing:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

# Transformations.
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
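To make the two transformations above more concrete, here is a minimal pure-Python sketch of what bucketizing and hash bucketing compute conceptually. It is not TensorFlow's implementation: the helper names `bucket_index` and `hash_bucket` are hypothetical, and MD5 is only a stand-in for TensorFlow's internal hash function, so the bucket IDs will differ from the ones TensorFlow assigns.

```python
import bisect
import hashlib

# Same boundaries as age_buckets above: 10 boundaries give 11 buckets.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_index(value, boundaries):
    """Return the index of the bucket that contains value.

    Bucket 0 is (-inf, boundaries[0]), bucket i is
    [boundaries[i-1], boundaries[i]), and the last bucket is
    [boundaries[-1], +inf).
    """
    return bisect.bisect_right(boundaries, value)


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket ID in [0, hash_bucket_size).

    MD5 is an illustrative stand-in for TensorFlow's hash function.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


for age in (17, 18, 24, 25, 70):
    print(age, "->", bucket_index(age, AGE_BOUNDARIES))

# "Tech-support" is one of the occupation strings in the Census data.
print(hash_bucket("Tech-support", 1000))
```

Hashing avoids having to list the vocabulary in advance, at the cost of possible collisions when two different strings land in the same bucket.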

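The Wide Model: Linear Model with Crossed Feature Columns

The wide model is a linear model with a wide set of sparse and crossed feature columns: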

base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]

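Wide models with crossed feature columns can memorize sparse interactions between features effectively. That said, one limitation of crossed feature columns is that they do not generalize to feature combinations that have not appeared in the training data. Let's add a deep model with embeddings to fix that.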
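The memorization behavior of a crossed column can be sketched in plain Python. This is an illustration under stated assumptions, not TensorFlow's implementation: the helper `crossed_bucket` and its `_X_` separator are hypothetical, MD5 stands in for TensorFlow's hashing scheme, and a simple counter replaces the per-bucket weights a linear model would learn.

```python
import hashlib


def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of feature values to a single bucket ID.

    Joining with a separator and hashing with MD5 is an illustrative
    stand-in for the scheme TensorFlow uses internally.
    """
    key = "_X_".join(values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


# Toy training set: (education, occupation) pairs seen during training.
train_pairs = [
    ("Bachelors", "Exec-managerial"),
    ("HS-grad", "Craft-repair"),
    ("Bachelors", "Exec-managerial"),
]

# A linear model learns one weight per bucket. Here we only count how
# often each bucket was seen, to show which buckets received any signal.
seen_counts = {}
for pair in train_pairs:
    bucket = crossed_bucket(pair, 1000)
    seen_counts[bucket] = seen_counts.get(bucket, 0) + 1

seen = crossed_bucket(("Bachelors", "Exec-managerial"), 1000)
unseen = crossed_bucket(("Doctorate", "Craft-repair"), 1000)
print("seen pair count:", seen_counts.get(seen, 0))
print("unseen pair count:", seen_counts.get(unseen, 0))
```

Barring a hash collision, the unseen combination maps to a bucket that never received any training signal, which is the limitation described above.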

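The Deep Model: Neural Network with Embeddings

As shown in the figure above, the deep model is a feed-forward neural network. Each sparse, high-dimensional categorical feature is first converted into a low-dimensional, dense, real-valued vector, often referred to as an embedding vector. These low-dimensional dense embedding vectors are concatenated with the continuous features and then fed into the hidden layers of the neural network in the forward pass. The embedding values are initialized randomly and trained along with all other model parameters to minimize the training loss. If you are interested in learning more about embeddings, check out the TensorFlow tutorial on Vector Representations of Words or the Word Embedding article on Wikipedia.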
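The lookup-and-concatenate step can be sketched in a few lines of plain Python. This is a toy illustration, not the TensorFlow implementation: the vocabulary, the embedding size of 4, the initialization range, and the helper `build_dense_input` are all assumptions made for the example, and no training takes place.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

VOCAB = ["Tech-support", "Sales", "Craft-repair"]  # toy vocabulary
EMBEDDING_DIM = 4  # toy size chosen for this example

# One randomly initialized dense vector per category. In a real model
# these values are trainable parameters updated by the optimizer.
embedding_table = {
    word: [random.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for word in VOCAB
}


def build_dense_input(occupation, continuous_features):
    """Concatenate the occupation embedding with the continuous features."""
    return embedding_table[occupation] + list(continuous_features)


# age=39 and hours_per_week=40 are made-up example values.
dense_input = build_dense_input("Sales", [39.0, 40.0])
print(len(dense_input))  # 4 embedding values + 2 continuous values
```

The resulting flat vector is what the first hidden layer would receive for this example.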

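Another way to represent categorical columns to feed into a neural network is via a multi-hot representation. This is often appropriate for categorical columns with only a few possible values. For example, for the gender column, "Male" can be represented as [1, 0] and "Female" as [0, 1]. This is a fixed representation, whereas embeddings are more flexible and are calculated at training time.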
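A minimal sketch of such a fixed representation follows. The helper `multi_hot` is hypothetical, and the vocabulary order is chosen to match the example in the text; a different vocabulary order simply swaps the positions.

```python
def multi_hot(values, vocabulary):
    """Return a fixed 0/1 vector with a 1 for each vocabulary entry present."""
    present = set(values)
    return [1 if entry in present else 0 for entry in vocabulary]


# Vocabulary order chosen to match the example in the text.
GENDER_VOCAB = ["Male", "Female"]

print(multi_hot(["Male"], GENDER_VOCAB))    # [1, 0]
print(multi_hot(["Female"], GENDER_VOCAB))  # [0, 1]
```

Unlike an embedding, nothing in this vector is learned: its length equals the vocabulary size, which is why it suits columns with only a few possible values.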

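We'll configure the embeddings for the categorical columns using embedding_column, and concatenate them with the continuous columns. We also use indicator_column to create multi-hot representations of some categorical columns.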

deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    # To show an example of embedding
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
]

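The higher the dimension of the embedding, the more degrees of freedom the model has to learn the representations of the features. For simplicity, we set the dimension to 8 for all embedded feature columns here. Empirically, a more informed decision for the number of dimensions is to start with a value on the order of \(\log_2(n)\) or \(k\sqrt[4]{n}\), where \(n\) is the number of unique features in a feature column and \(k\) is a small constant (usually smaller than 10).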
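The rule of thumb above is easy to evaluate for a given column. The sketch below is only an illustration: the helper `embedding_dim_candidates` is hypothetical, n = 1000 is taken from the hash_bucket_size used earlier, and k = 2 is an arbitrary choice of small constant.

```python
import math


def embedding_dim_candidates(n, k):
    """Return the two starting points from the rule of thumb.

    The first value is log2(n); the second is k times the fourth root of n.
    """
    return math.log2(n), k * n ** 0.25


# n = 1000 matches hash_bucket_size above; k = 2 is an arbitrary small constant.
log_rule, root_rule = embedding_dim_candidates(1000, 2)
print(round(log_rule, 2), round(root_rule, 2))
```

Both values are starting points for experimentation rather than fixed answers; in practice you would round them to an integer and tune from there.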

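Through dense embeddings, deep models can generalize better and make predictions on feature pairs that were previously unseen in the training data. However, it is difficult to learn effective low-dimensional representations for feature columns when the underlying interaction matrix between two feature columns is sparse and high-rank. In such cases, the interaction between most feature pairs should be zero except for a few, but dense embeddings lead to nonzero predictions for all feature pairs and can therefore over-generalize. On the other hand, linear models with crossed features can memorize these "exception rules" effectively with fewer model parameters.

Now, let's see how to jointly train wide and deep models and allow them to complement each other's strengths and weaknesses.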

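Combining Wide and Deep Models into One

The wide and deep models are combined by summing their final output log odds to form the prediction, which is then fed to a logistic loss function. All the graph definition and variable allocation is handled for you under the hood, so you simply need to create a DNNLinearCombinedClassifier: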
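The combination rule described above can be written out for a single example. This is a minimal sketch under stated assumptions, not the code inside DNNLinearCombinedClassifier: the function names are hypothetical and the two log odds values are made up.

```python
import math


def sigmoid(x):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def wide_and_deep_loss(wide_logit, deep_logit, label):
    """Sum the two log odds, then apply the logistic (log) loss.

    label is 1 for the positive class and 0 for the negative class.
    """
    probability = sigmoid(wide_logit + deep_logit)
    loss = -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))
    return probability, loss


# Made-up log odds from each part for a single example.
probability, loss = wide_and_deep_loss(wide_logit=0.8, deep_logit=-0.3, label=1)
print(probability, loss)
```

Because a single loss is computed on the summed log odds, its gradient reaches the parameters of both parts, which is what makes the training joint.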

import tempfile
model_dir = tempfile.mkdtemp()
m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])

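Training and Evaluating the Model

Before we train the model, let's read in the Census dataset as we did in the TensorFlow Linear Model tutorial. The code for input data processing is provided here again for your convenience: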

import pandas as pd
import urllib.request

# Define the column names for the data sets.
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"
]

def maybe_download(train_data, test_data):
  """Maybe downloads training data and returns train and test file names."""
  if train_data:
    train_file_name = train_data
  else:
    train_file = tempfile.NamedTemporaryFile(delete=False)
    urllib.request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        train_file.name)  # pylint: disable=line-too-long
    train_file_name = train_file.name
    train_file.close()
    print("Training data is downloaded to %s" % train_file_name)

  if test_data:
    test_file_name = test_data
  else:
    test_file = tempfile.NamedTemporaryFile(delete=False)
    urllib.request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
        test_file.name)  # pylint: disable=line-too-long
    test_file_name = test_file.name
    test_file.close()
    print("Test data is downloaded to %s" % test_file_name)

  return train_file_name, test_file_name

def input_fn(data_file, num_epochs, shuffle):
  """Input builder function."""
  df_data = pd.read_csv(
      tf.gfile.Open(data_file),
      names=CSV_COLUMNS,
      skipinitialspace=True,
      engine="python",
      skiprows=1)
  # remove NaN elements
  df_data = df_data.dropna(how="any", axis=0)
  labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
  return tf.estimator.inputs.pandas_input_fn(
      x=df_data,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
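The label expression in input_fn uses a substring test rather than an equality test. One consequence, shown in the sketch below, is that it handles label strings with a trailing period, which is how the labels appear in the adult.test file. The helper name `to_label` is hypothetical.

```python
def to_label(income_bracket):
    """Return 1 if the income bracket string denotes income above 50K, else 0."""
    return int(">50K" in income_bracket)


# Label strings as they appear in the training file and in the test file.
samples = [">50K", "<=50K", ">50K.", "<=50K."]
print([to_label(s) for s in samples])  # [1, 0, 1, 0]
```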

After reading in the data, you can train and evaluate the model:

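# Download the data sets; empty arguments make maybe_download fetch them.
train_file_name, test_file_name = maybe_download("", "")
train_steps = 2000  # example value, not prescribed by this tutorial text

# set num_epochs to None to get infinite stream of data.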
m.train(
    input_fn=input_fn(train_file_name, num_epochs=None, shuffle=True),
    steps=train_steps)
# set steps to None to run evaluation until all data consumed.
results = m.evaluate(
    input_fn=input_fn(test_file_name, num_epochs=1, shuffle=False),
    steps=None)
print("model directory = %s" % model_dir)
for key in sorted(results):
  print("%s: %s" % (key, results[key]))

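The first line of the output should be something like accuracy: 0.84429705. We can see that the accuracy improved from about 83.6% with a wide-only linear model to about 84.4% with a Wide & Deep model. If you'd like to see a working end-to-end example, you can download our example code.

Note that this tutorial is just a quick example on a small dataset to get you familiar with the API. Wide & Deep Learning is even more powerful if you try it on a large dataset with many sparse feature columns that have a large number of possible feature values. Again, feel free to take a look at our research paper for more ideas about how to apply Wide & Deep Learning to real-world large-scale machine learning problems.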

This document is jointly maintained by members of the Tencent Cloud Developer Community. If you have any questions, contact cloudcommunity@tencent.com

Last updated: 2017-12-18