Adding a New Op

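If you'd like to create an op that isn't covered by the existing TensorFlow library, we recommend that you first try writing the op in Python as a composition of existing Python ops or functions. If that isn't possible, you can create a custom C++ op. There are several reasons why you might want to create a custom C++ op:

It's not easy or possible to express your operation as a composition of existing ops.
It's not efficient to express your operation as a composition of existing primitives.
You want to hand-fuse a composition of primitives that a future compiler would find difficult to fuse.

For example, imagine you want to implement something like "median pooling", similar to the "MaxPool" operator, but computing medians over sliding windows instead of maximum values. Doing this using a composition of operations may be possible (e.g., using ExtractImagePatches and TopK), but may not be as performance- or memory-efficient as a native operation where you can do something more clever in a single, fused operation. As always, it is typically worth first trying to express what you want using operator composition, and only adding a new operation if that proves to be difficult or inefficient.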
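
When prototyping an op like this, a slow reference implementation can be handy for checking results later. The following is a minimal sketch in plain Python, not TensorFlow code: median_pool_1d is a hypothetical helper name, and it covers only the 1-D case without padding.

```python
from statistics import median


def median_pool_1d(values, window, stride=1):
    """Median of each sliding window over a 1-D sequence (no padding)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # One output element per window position that fits entirely in the input.
    return [median(values[i:i + window])
            for i in range(0, len(values) - window + 1, stride)]


print(median_pool_1d([1, 9, 2, 8, 3], window=3))  # [2, 8, 3]
```

A native op would compute the same values, but in a single fused pass rather than by materializing every window.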

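To incorporate your custom op you'll need to:

1. Register the new op in a C++ file. Op registration defines an interface (specification) for the op's functionality, which is independent of the op's implementation. For example, op registration defines the op's name and the op's inputs and outputs. It also defines the shape function that is used for tensor shape inference.
2. Implement the op in C++. The implementation of an op is known as a kernel, and it is the concrete implementation of the specification you registered in Step 1. There can be multiple kernels for different input/output types or architectures (for example, CPUs, GPUs).
3. Create a Python wrapper (optional). This wrapper is the public API that's used to create the op in Python. A default wrapper is generated from the op registration, which can be used directly or added to.
4. Write a function to compute gradients for the op (optional).
5. Test the op. We usually do this in Python for convenience, but you can also test the op in C++. If you define gradients, you can verify them with the Python gradient checker. See relu_op_test.py for an example that tests the forward functions of Relu-like operators and their gradients.

Prerequisites:

Some familiarity with C++.
You must have installed the TensorFlow binary, or you must have downloaded the TensorFlow source and be able to build it.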

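Define the op's interface

You define the interface of an op by registering it with the TensorFlow system. In the registration, you specify the name of your op, its inputs (types and names) and outputs (types and names), as well as docstrings and any attrs the op might require.

To see how this works, suppose you'd like to create an op that takes a tensor of int32s and outputs a copy of the tensor, with all but the first element set to zero. To do this, create a file named zero_out.cc. Then add a call to the REGISTER_OP macro that defines the interface for your op: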

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

using namespace tensorflow;

REGISTER_OP("ZeroOut")
    .Input("to_zero: int32")
    .Output("zeroed: int32")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();
    });

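This ZeroOut op takes one tensor to_zero of 32-bit integers as input, and outputs a tensor zeroed of 32-bit integers. The op also uses a shape function to ensure that the output tensor has the same shape as the input tensor. For example, if the input is a tensor of shape [10, 20], then this shape function specifies that the output shape is also [10, 20].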
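
To make the intended behavior concrete before any C++ is written, here is a small pure-Python model of what ZeroOut computes on the flattened tensor. The name zero_out_reference is hypothetical and exists only for this illustration; it is not part of TensorFlow.

```python
def zero_out_reference(values):
    """Return a copy of a flat list with all but the first element set to 0."""
    output = [0] * len(values)   # same length as the input, like the shape function
    if values:                   # an empty input has no first element to keep
        output[0] = values[0]
    return output


print(zero_out_reference([5, 4, 3, 2, 1]))  # [5, 0, 0, 0, 0]
```

A model like this can double as the expected-value generator when you later test the compiled op.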

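A note on naming: The op name must be in CamelCase and it must be unique among all other ops that are registered in the binary.

Implement the kernel for the op

After you define the interface, provide one or more implementations of the op. To create one of these kernels, create a class that extends OpKernel and overrides the Compute method. The Compute method provides one context argument of type OpKernelContext*, from which you can access useful things like the input and output tensors.

Add your kernel to the file you created above. The kernel might look something like this: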

#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<int32>();

    // Create an output tensor
    Tensor* output_tensor = NULL;
    OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                     &output_tensor));
    auto output_flat = output_tensor->flat<int32>();

    // Set all but the first element of the output tensor to 0.
    const int N = input.size();
    for (int i = 1; i < N; i++) {
      output_flat(i) = 0;
    }

    // Preserve the first input value if possible.
    if (N > 0) output_flat(0) = input(0);
  }
};

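After implementing your kernel, you register it with the TensorFlow system. In the registration, you specify the constraints under which this kernel will run. For example, you might have one kernel made for CPUs, and a separate one for GPUs.

To do this for the ZeroOut op, add the following to zero_out.cc: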

REGISTER_KERNEL_BUILDER(Name("ZeroOut").Device(DEVICE_CPU), ZeroOutOp);

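Important: Instances of your OpKernel may be accessed concurrently. Your Compute method must be thread-safe. Guard any access to class members with a mutex. Or better yet, don't share state via class members! Consider using a ResourceMgr to keep track of op state.

Multi-threaded CPU kernels

To write a multi-threaded CPU kernel, the Shard function in work_sharder.h can be used. This function shards a computation function across the threads configured to be used for intra-op threading (see intra_op_parallelism_threads in config.proto).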
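
The idea behind sharding can be sketched in plain Python: split the index range into contiguous [start, limit) pieces and let each worker fill a disjoint slice of the output. This is only a conceptual model; shard_ranges and parallel_double are hypothetical names, and the real Shard function takes additional arguments that this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(total, num_shards):
    """Split [0, total) into contiguous (start, limit) ranges of near-equal size."""
    num_shards = max(1, min(num_shards, total))
    base, extra = divmod(total, num_shards)
    ranges, start = [], 0
    for i in range(num_shards):
        limit = start + base + (1 if i < extra else 0)
        ranges.append((start, limit))
        start = limit
    return ranges


def parallel_double(values, num_threads=4):
    """Double every element, with each worker writing a disjoint output slice."""
    output = [0] * len(values)

    def work(bounds):
        start, limit = bounds
        for i in range(start, limit):
            output[i] = 2 * values[i]

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(work, shard_ranges(len(values), num_threads)))
    return output


print(shard_ranges(10, 3))               # [(0, 4), (4, 7), (7, 10)]
print(parallel_double([1, 2, 3, 4, 5]))  # [2, 4, 6, 8, 10]
```

Because no two shards touch the same output index, the work function needs no locking, which is the property you want from a sharded Compute method as well.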

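GPU kernels

A GPU kernel is implemented in two parts: the OpKernel, and the CUDA kernel and its launch code.

Sometimes the OpKernel implementation is common between a CPU and GPU kernel, such as around inspecting inputs and allocating outputs. In that case, a suggested implementation is to:

1. Define the OpKernel templated on the Device and the primitive type of the tensor.

2. To do the actual computation of the output, the Compute function calls a templated functor struct.

3. The specialization of that functor for the CPUDevice is defined in the same file, but the specialization for the GPUDevice is defined in a .cu.cc file, since it will be compiled with the CUDA compiler.

Here is an example implementation.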

// kernel_example.h
#ifndef KERNEL_EXAMPLE_H_
#define KERNEL_EXAMPLE_H_

template <typename Device, typename T>
struct ExampleFunctor {
  void operator()(const Device& d, int size, const T* in, T* out);
};

#if GOOGLE_CUDA
// Partially specialize functor for GpuDevice.
template <typename T>
struct ExampleFunctor<Eigen::GpuDevice, T> {
  void operator()(const Eigen::GpuDevice& d, int size, const T* in, T* out);
};
#endif

#endif  // KERNEL_EXAMPLE_H_

// kernel_example.cc
#include "kernel_example.h"
#include "tensorflow/core/framework/op_kernel.h"

using namespace tensorflow;

using CPUDevice = Eigen::ThreadPoolDevice;
using GPUDevice = Eigen::GpuDevice;

// CPU specialization of actual computation.
template <typename T>
struct ExampleFunctor<CPUDevice, T> {
  void operator()(const CPUDevice& d, int size, const T* in, T* out) {
    for (int i = 0; i < size; ++i) {
      out[i] = 2 * in[i];
    }
  }
};

// OpKernel definition.
// template parameter <T> is the datatype of the tensors.
template <typename Device, typename T>
class ExampleOp : public OpKernel {
 public:
  explicit ExampleOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);

    // Create an output tensor
    Tensor* output_tensor = NULL;
    OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                     &output_tensor));

    // Do the computation.
    OP_REQUIRES(context, input_tensor.NumElements() <= tensorflow::kint32max,
                errors::InvalidArgument("Too many elements in tensor"));
    ExampleFunctor<Device, T>()(
        context->eigen_device<Device>(),
        static_cast<int>(input_tensor.NumElements()),
        input_tensor.flat<T>().data(),
        output_tensor->flat<T>().data());
  }
};

// Register the CPU kernels.
#define REGISTER_CPU(T)                                          \
  REGISTER_KERNEL_BUILDER(                                       \
      Name("Example").Device(DEVICE_CPU).TypeConstraint<T>("T"), \
      ExampleOp<CPUDevice, T>);
REGISTER_CPU(float);
REGISTER_CPU(int32);

// Register the GPU kernels.
#ifdef GOOGLE_CUDA
#define REGISTER_GPU(T)                                          \
  /* Declare explicit instantiations in kernel_example.cu.cc. */ \
  extern template struct ExampleFunctor<GPUDevice, T>;           \
  REGISTER_KERNEL_BUILDER(                                       \
      Name("Example").Device(DEVICE_GPU).TypeConstraint<T>("T"), \
      ExampleOp<GPUDevice, T>);
REGISTER_GPU(float);
REGISTER_GPU(int32);
#endif  // GOOGLE_CUDA

// kernel_example.cu.cc
#ifdef GOOGLE_CUDA
#define EIGEN_USE_GPU
#include "kernel_example.h"
#include "tensorflow/core/util/cuda_kernel_helper.h"

using namespace tensorflow;

using GPUDevice = Eigen::GpuDevice;

// Define the CUDA kernel.
template <typename T>
__global__ void ExampleCudaKernel(const int size, const T* in, T* out) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
       i += blockDim.x * gridDim.x) {
    out[i] = 2 * ldg(in + i);
  }
}

// Define the GPU implementation that launches the CUDA kernel.
template <typename T>
void ExampleFunctor<GPUDevice, T>::operator()(
    const GPUDevice& d, int size, const T* in, T* out) {
  // Launch the cuda kernel.
  //
  // See core/util/cuda_kernel_helper.h for example of computing
  // block count and thread_per_block count.
  int block_count = 1024;
  int thread_per_block = 20;
  ExampleCudaKernel<T>
      <<<block_count, thread_per_block, 0, d.stream()>>>(size, in, out);
}

// Explicitly instantiate functors for the types of OpKernels registered.
template struct ExampleFunctor<GPUDevice, float>;
template struct ExampleFunctor<GPUDevice, int32>;

#endif  // GOOGLE_CUDA
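
The loop inside ExampleCudaKernel is a grid-stride loop: each thread starts at its global index and advances by the total number of launched threads, so any launch configuration covers the whole array. The following pure-Python simulation is a sketch of that indexing only; grid_stride_indices is a hypothetical helper and nothing here runs on a GPU.

```python
def grid_stride_indices(size, block_count, threads_per_block):
    """For each simulated (block, thread), list the element indices it processes."""
    stride = threads_per_block * block_count         # blockDim.x * gridDim.x
    visits = {}
    for block in range(block_count):
        for thread in range(threads_per_block):
            first = block * threads_per_block + thread   # global thread index
            visits[(block, thread)] = list(range(first, size, stride))
    return visits


visits = grid_stride_indices(size=10, block_count=2, threads_per_block=2)
covered = sorted(i for indices in visits.values() for i in indices)
print(visits[(0, 0)])              # [0, 4, 8]
print(covered == list(range(10)))  # True
```

Every index is visited exactly once even though only four threads were launched for ten elements, which is why the kernel stays correct when block_count and thread_per_block are fixed constants.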

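Build the op library

Compile the op using your system compiler (TensorFlow binary installation)

You should be able to compile zero_out.cc with a C++ compiler such as g++ or clang available on your system. The binary PIP package installs the header files and the library that you need to compile your op in locations that are system specific. However, the TensorFlow Python library provides the get_include function to get the header directory, and the get_lib directory has a shared object to link against. Here are the outputs of these functions on an Ubuntu machine.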

$ python
>>> import tensorflow as tf
>>> tf.sysconfig.get_include()
'/usr/local/lib/python2.7/site-packages/tensorflow/include'
>>> tf.sysconfig.get_lib()
'/usr/local/lib/python2.7/site-packages/tensorflow'

Assuming you have g++ installed, here is the sequence of commands you can use to compile your op into a dynamic library.

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
TF_LIB=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())')
g++ -std=c++11 -shared zero_out.cc -o zero_out.so -fPIC -I$TF_INC -I$TF_INC/external/nsync/public -L$TF_LIB -ltensorflow_framework -O2

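On Mac OS X, the additional flag "-undefined dynamic_lookup" is required when building the .so file.

Note on gcc version >=5: gcc uses the new C++ ABI since version 5. The binary pip packages available on the TensorFlow website are built with gcc4, which uses the older ABI. If you compile your op library with gcc>=5, add -D_GLIBCXX_USE_CXX11_ABI=0 to the command line to make the library compatible with the older ABI. Furthermore, if you are using a TensorFlow package created from source, remember to add --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" to the bazel command that compiles the Python package.

Compile the op using bazel (TensorFlow source installation)

If you have the TensorFlow sources installed, you can make use of TensorFlow's build system to compile your op. Place a BUILD file with the following Bazel build rule in the tensorflow/core/user_ops directory.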

load("//tensorflow:tensorflow.bzl", "tf_custom_op_library")

tf_custom_op_library(
    name = "zero_out.so",
    srcs = ["zero_out.cc"],
)

Run the following command to build zero_out.so.

$ bazel build --config opt //tensorflow/core/user_ops:zero_out.so

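Note: Although you can create a shared library (a .so file) with the standard cc_library rule, we strongly recommend that you use the tf_custom_op_library macro. It adds some required dependencies, and performs checks to ensure that the shared library is compatible with TensorFlow's plugin loading mechanism.

Use the op in Python

The TensorFlow Python API provides the tf.load_op_library function to load the dynamic library and register the op with the TensorFlow framework. load_op_library returns a Python module that contains the Python wrappers for the op and the kernel. Thus, once you have built the op, you can do the following to run it from Python: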

import tensorflow as tf
zero_out_module = tf.load_op_library('./zero_out.so')
with tf.Session(''):
  zero_out_module.zero_out([[1, 2], [3, 4]]).eval()

# Prints
array([[1, 0], [0, 0]], dtype=int32)

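Keep in mind that the generated function will be given a snake_case name (to comply with PEP8). So, if your op is named ZeroOut in the C++ files, the Python function will be called zero_out.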
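
The renaming can be approximated with a one-line regular expression. This is a sketch under the assumption that an underscore is inserted wherever a lowercase letter is followed by an uppercase one; camel_to_snake is a hypothetical helper, and TensorFlow's own wrapper generator may treat some names differently.

```python
import re


def camel_to_snake(name):
    """Approximate the CamelCase -> snake_case renaming of generated wrappers."""
    # Insert "_" at every lowercase-to-uppercase boundary, then lowercase it all.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()


print(camel_to_snake("ZeroOut"))         # zero_out
print(camel_to_snake("StringToNumber"))  # string_to_number
```

If you are unsure what name a wrapper received, listing the attributes of the module returned by load_op_library with dir() is the reliable check.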

To make the op available as a regular function import-able from a Python module, it may be useful to have the load_op_library call in a Python source file as follows:

import tensorflow as tf

zero_out_module = tf.load_op_library('./zero_out.so')
zero_out = zero_out_module.zero_out

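Verify that the op works

A good way to verify that you've successfully implemented your op is to write a test for it. Create the file zero_out_op_test.py with the contents: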

import tensorflow as tf

class ZeroOutTest(tf.test.TestCase):
  def testZeroOut(self):
    zero_out_module = tf.load_op_library('./zero_out.so')
    with self.test_session():
      result = zero_out_module.zero_out([5, 4, 3, 2, 1])
      self.assertAllEqual(result.eval(), [5, 0, 0, 0, 0])

if __name__ == "__main__":
  tf.test.main()

Then run your test (assuming you have tensorflow installed):

$ python zero_out_op_test.py

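Building advanced features into your op

Now that you know how to build a basic (and somewhat restricted) op and implementation, we'll look at some of the more complicated things you will typically need to build into your op. This includes:

Conditional checks and validation
Op registration
- Attrs
- Attr types
- Polymorphism
- Inputs and outputs
- Backwards compatibility
GPU support
- Compiling the kernel for the GPU device
Implement the gradient in Python
Shape functions in C++

Conditional checks and validation

The example above assumed that the op applied to a tensor of any shape. What if it only applied to vectors? That means adding a check to the above OpKernel implementation.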

void Compute(OpKernelContext* context) override {
  // Grab the input tensor
  const Tensor& input_tensor = context->input(0);

  OP_REQUIRES(context, TensorShapeUtils::IsVector(input_tensor.shape()),
              errors::InvalidArgument("ZeroOut expects a 1-D vector."));
  // ...
}

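This asserts that the input is a vector, and returns having set the InvalidArgument status if it isn't. The OP_REQUIRES macro takes three arguments:

The context, which can either be an OpKernelContext or OpKernelConstruction pointer (see tensorflow/core/framework/op_kernel.h), for its SetStatus() method.
The condition. For example, there are functions for validating the shape of a tensor in tensorflow/core/framework/tensor_shape.h.
The error itself, which is represented by a Status object; see tensorflow/core/lib/core/status.h. A Status has both a type (frequently InvalidArgument, but see the list of types) and a message. Functions for constructing an error may be found in tensorflow/core/lib/core/errors.h.

Alternatively, if you want to test whether a Status object returned from some function is an error, and if so return it, use OP_REQUIRES_OK. Both of these macros return from the function on error.

Op registration

Attrs

Ops can have attrs, whose values are set when the op is added to a graph. These are used to configure the op, and their values can be accessed both within the kernel implementation and in the types of inputs and outputs in the op registration. Prefer using an input instead of an attr when possible, since inputs are more flexible. This is because attrs are constants and must be defined at graph construction time. In contrast, inputs are Tensors whose values can be dynamic; that is, inputs can change every step, be set using a feed, etc. Attrs are used for things that can't be done with inputs: any configuration that affects the signature (number or type of inputs or outputs) or that can't change from step to step.

You define an attr when you register the op, by specifying its name and type using the Attr method, which expects a spec of the form:

<name>: <attr-type-expr>

where <name> begins with a letter and can be composed of alphanumeric characters and underscores, and <attr-type-expr> is a type expression of the form described below.

For example, if you'd like the ZeroOut op to preserve a user-specified index, instead of only the 0th element, you can register the op like so: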

REGISTER_OP("ZeroOut")
    .Attr("preserve_index: int")
    .Input("to_zero: int32")
    .Output("zeroed: int32");

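(Note that the set of attribute types is different from the tensor types used for inputs and outputs.)

Your kernel can then access this attr in its constructor via the context parameter: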

class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {
    // Get the index of the value to preserve
    OP_REQUIRES_OK(context,
                   context->GetAttr("preserve_index", &preserve_index_));
    // Check that preserve_index is non-negative
    OP_REQUIRES(context, preserve_index_ >= 0,
                errors::InvalidArgument("Need preserve_index >= 0, got ",
                                        preserve_index_));
  }
  void Compute(OpKernelContext* context) override {
    // ...
  }
 private:
  int preserve_index_;
};

You can then use it in the Compute method:

void Compute(OpKernelContext* context) override {
  // ...

  // We're using saved attr to validate potentially dynamic input
  // So we check that preserve_index is in range
  OP_REQUIRES(context, preserve_index_ < input.dimension(0),
              errors::InvalidArgument("preserve_index out of range"));
  // Set all the elements of the output tensor to 0
  const int N = input.size();
  for (int i = 0; i < N; i++) {
    output_flat(i) = 0;
  }
  // Preserve the requested input value
  output_flat(preserve_index_) = input(preserve_index_);
}
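
The two validation stages above (a construction-time check on the attr, then a per-call range check against the actual input) can be modeled in plain Python. This is an illustrative sketch only; zero_out_preserve is a hypothetical name, and ValueError stands in for the InvalidArgument status.

```python
def zero_out_preserve(values, preserve_index=0):
    """Zero every element of a flat list except the one at preserve_index."""
    # Mirrors the constructor check: the attr itself must be non-negative.
    if preserve_index < 0:
        raise ValueError("Need preserve_index >= 0, got %d" % preserve_index)
    # Mirrors the Compute check: the index must fit the (dynamic) input.
    if preserve_index >= len(values):
        raise ValueError("preserve_index out of range")
    output = [0] * len(values)
    output[preserve_index] = values[preserve_index]
    return output


print(zero_out_preserve([5, 4, 3, 2, 1], preserve_index=2))  # [0, 0, 3, 0, 0]
```

The first check depends only on the attr, so the kernel can do it once in its constructor; the second depends on the input size and has to run on every call.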

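Attr types

The following types are supported in an attr:

string: Any sequence of bytes (not required to be UTF8).
int: A signed integer.
float: A floating point number.
bool: True or false.
type: One of the (non-ref) values of DataType.
shape: A TensorShapeProto.
tensor: A TensorProto.
list(<type>): A list of <type>, where <type> is one of the above types. Note that list(list(<type>)) is invalid.

See also: op_def_builder.cc:FinalizeAttr for a definitive list.

Default values & constraints

Attrs may have default values, and some types of attrs can have constraints. To define an attr with constraints, you can use the following <attr-type-expr>s:

{'<string1>', '<string2>'}: The value must be a string that has either the value <string1> or <string2>. The name of the type, string, is implied when you use this syntax. This emulates an enum:

REGISTER_OP("EnumExample")
    .Attr("e: {'apple', 'orange'}");

{<type1>, <type2>}: The value is of type type, and must be one of <type1> or <type2>, where <type1> and <type2> are supported tensor types. You don't specify that the type of the attr is type. This is implied when you have a list of types in {...}. For example, in this case the attr t is a type that must be an int32, a float, or a bool: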

REGISTER_OP("RestrictedTypeExample")
    .Attr("t: {int32, float, bool}");

There are shortcuts for common type constraints:

numbertype: Type type restricted to the numeric (non-string and non-bool) types.
realnumbertype: Like numbertype without complex types.
quantizedtype: Like numbertype but just the quantized number types.

The specific lists of types allowed by these are defined by the functions (like NumberTypes()) in tensorflow/core/framework/types.h. In this example the attr t must be one of the numeric types:

REGISTER_OP("NumberType")
    .Attr("t: numbertype");

For this op:

tf.number_type(t=tf.int32)  # Valid
tf.number_type(t=tf.bool)   # Invalid

Lists can be combined with other lists and single types. The following op allows attr t to be any of the numeric types, or the bool type:

REGISTER_OP("NumberOrBooleanType")
    .Attr("t: {numbertype, bool}");

For this op:

tf.number_or_boolean_type(t=tf.int32)   # Valid
tf.number_or_boolean_type(t=tf.bool)    # Valid
tf.number_or_boolean_type(t=tf.string)  # Invalid

int >= <n>: The value must be an int whose value is greater than or equal to <n>, where <n> is a natural number.

For example, the following op registration specifies that the attr a must have a value that is at least 2:

REGISTER_OP("MinIntExample")
    .Attr("a: int >= 2");

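list(<type>) >= <n>: A list of type <type> whose length is greater than or equal to <n>.

For example, the following op registration specifies that the attr a is a list of types (either int32 or float), and that there must be at least 3 of them: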

REGISTER_OP("TypeListExample")
    .Attr("a: list({int32, float}) >= 3");

To set a default value for an attr (making it optional in the generated code), add = <default> to the end, as in:

REGISTER_OP("AttrDefaultExample")
    .Attr("i: int = 0");

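The supported syntax of the default value is what would be used in the proto representation of the resulting GraphDef definition.

Here are examples of how to specify a default for all types: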

REGISTER_OP("AttrDefaultExampleForAllTypes")
   .Attr("s: string = 'foo'")
   .Attr("i: int = 0")
   .Attr("f: float = 1.0")
   .Attr("b: bool = true")
   .Attr("ty: type = DT_INT32")
   .Attr("sh: shape = { dim { size: 1 } dim { size: 2 } }")
   .Attr("te: tensor = { dtype: DT_INT32 int_val: 5 }")
   .Attr("l_empty: list(int) = []")
   .Attr("l_int: list(int) = [2, 3, 5, 7]");

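Note in particular that the values of type type use the DT_* names for the types.

Polymorphism

Type polymorphism

For ops that can take different types as input or produce different output types, you can specify an attr in an input or output type in the op registration. Typically you would then register an OpKernel for each supported type.

For instance, if you'd like the ZeroOut op to work on floats in addition to int32s, your op registration might look like: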

REGISTER_OP("ZeroOut")
    .Attr("T: {float, int32}")
    .Input("to_zero: T")
    .Output("zeroed: T");

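Your op registration now specifies that the input's type must be float or int32, and that its output will be the same type, since both have type T.

A note on naming: Inputs, outputs, and attrs generally should be given snake_case names. The one exception is attrs that are used as the type of an input or in the type of an input. Those attrs can be inferred when the op is added to the graph and so don't appear in the op's function. For example, this last definition of ZeroOut will generate a Python function that looks like:

def zero_out(to_zero, name=None):
  """...
  Args:
    to_zero: A Tensor. Must be one of the following types: float32, int32.
    name: A name for the operation (optional).

  Returns:
    A Tensor. Has the same type as to_zero.
  """

If to_zero is passed an int32 tensor, then T is automatically set to int32 (well, actually DT_INT32). Those inferred attrs are given Capitalized or CamelCase names.

Compare this with an op that has a type attr that determines the output type:

REGISTER_OP("StringToNumber")
    .Input("string_tensor: string")
    .Output("output: out_type")
    .Attr("out_type: {float, int32} = DT_FLOAT")
    .Doc(R"doc(
Converts each string in the input Tensor to the specified numeric type.
)doc");

In this case, the user has to specify the output type, as in the generated Python:

def string_to_number(string_tensor, out_type=None, name=None):
  """Converts each string in the input Tensor to the specified numeric type.

  Args:
    string_tensor: A Tensor of type string.
    out_type: An optional tf.DType from: tf.float32, tf.int32.
      Defaults to tf.float32.
    name: A name for the operation (optional).

  Returns:
    A Tensor of type out_type.
  """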

#include "tensorflow/core/framework/op_kernel.h"
class ZeroOutInt32Op : public OpKernel {
  // as before
};
class ZeroOutFloatOp : public OpKernel {
 public:
  explicit ZeroOutFloatOp(OpKernelConstruction* context)
      : OpKernel(context) {}
  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<float>();
    // Create an output tensor
    Tensor* output = NULL;
    OP_REQUIRES_OK(context,
                   context->allocate_output(0, input_tensor.shape(), &output));
    auto output_flat = output->template flat<float>();
    // Set all the elements of the output tensor to 0
    const int N = input.size();
    for (int i = 0; i < N; i++) {
      output_flat(i) = 0;
    }
    // Preserve the first input value
    if (N > 0) output_flat(0) = input(0);
  }
};
// Note that TypeConstraint<int32>("T") means that attr "T" (defined
// in the op registration above) must be "int32" to use this template
// instantiation.
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<int32>("T"),
    ZeroOutInt32Op);
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<float>("T"),
    ZeroOutFloatOp);

To preserve backwards compatibility, you should specify a default value when adding an attr to an existing op:

REGISTER_OP("ZeroOut")
    .Attr("T: {float, int32} = DT_INT32")
    .Input("to_zero: T")
    .Output("zeroed: T");

Let's say you wanted to add more types, say double:

REGISTER_OP("ZeroOut")
    .Attr("T: {float, double, int32}")
    .Input("to_zero: T")
    .Output("zeroed: T");

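Instead of writing another OpKernel with redundant code as above, you will often be able to use a C++ template instead. You will still have one kernel registration (REGISTER_KERNEL_BUILDER call) per overload.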

template <typename T>
class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {}
  void Compute(OpKernelContext* context) override {
    // Grab the input tensor
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<T>();
    // Create an output tensor
    Tensor* output = NULL;
    OP_REQUIRES_OK(context,
                   context->allocate_output(0, input_tensor.shape(), &output));
    auto output_flat = output->template flat<T>();
    // Set all the elements of the output tensor to 0
    const int N = input.size();
    for (int i = 0; i < N; i++) {
      output_flat(i) = 0;
    }
    // Preserve the first input value
    if (N > 0) output_flat(0) = input(0);
  }
};
// Note that TypeConstraint<int32>("T") means that attr "T" (defined
// in the op registration above) must be "int32" to use this template
// instantiation.
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<int32>("T"),
    ZeroOutOp<int32>);
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<float>("T"),
    ZeroOutOp<float>);
REGISTER_KERNEL_BUILDER(
    Name("ZeroOut")
    .Device(DEVICE_CPU)
    .TypeConstraint<double>("T"),
    ZeroOutOp<double>);

If you have more than a couple of overloads, you can put the registration in a macro.

#include "tensorflow/core/framework/op_kernel.h"

#define REGISTER_KERNEL(type)                                       \
  REGISTER_KERNEL_BUILDER(                                          \
      Name("ZeroOut").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
      ZeroOutOp<type>)

REGISTER_KERNEL(int32);
REGISTER_KERNEL(float);
REGISTER_KERNEL(double);

#undef REGISTER_KERNEL

Depending on the list of types you are registering the kernel for, you may be able to use a macro provided by tensorflow/core/framework/register_types.h:

#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"

REGISTER_OP("ZeroOut")
    .Attr("T: realnumbertype")
    .Input("to_zero: T")
    .Output("zeroed: T");

template <typename T>
class ZeroOutOp : public OpKernel { ... };

#define REGISTER_KERNEL(type)                                       \
  REGISTER_KERNEL_BUILDER(                                          \
      Name("ZeroOut").Device(DEVICE_CPU).TypeConstraint<type>("T"), \
      ZeroOutOp<type>)

TF_CALL_REAL_NUMBER_TYPES(REGISTER_KERNEL);

#undef REGISTER_KERNEL

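List inputs and outputs

In addition to being able to accept or produce different types, ops can consume or produce a variable number of tensors.

In the next example, the attr T holds a list of types, and is used as the type of both the input in and the output out. The input and output are lists of tensors of that type (and the number and types of tensors in the output are the same as the input, since both have type T).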

REGISTER_OP("PolymorphicListExample")
    .Attr("T: list(type)")
    .Input("in: T")
    .Output("out: T");

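You can also place restrictions on what types can be specified in the list. In this next case, the input is a list of float and double tensors. The op accepts, for example, input types (float, double, float) and in that case the output type would also be (float, double, float).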

REGISTER_OP("ListTypeRestrictionExample")
    .Attr("T: list({float, double})")
    .Input("in: T")
    .Output("out: T");

If you want all the tensors in a list to be of the same type, you might do something like:

REGISTER_OP("IntListInputExample")
    .Attr("N: int")
    .Input("in: N * int32")
    .Output("out: int32");

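This accepts a list of int32 tensors, and uses an int attr N to specify the length of the list.

This can be made type polymorphic as well. In the next example, the input is a list of tensors (with length "N") of the same (but unspecified) type ("T"), and the output is a single tensor of matching type: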

REGISTER_OP("SameListInputExample")
    .Attr("N: int")
    .Attr("T: type")
    .Input("in: N * T")
    .Output("out: T");

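By default, tensor lists have a minimum length of 1. You can change that default using a ">=" constraint on the corresponding attr. In this next example, the input is a list of at least 2 int32 tensors: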

REGISTER_OP("MinLengthIntListExample")
    .Attr("N: int >= 2")
    .Input("in: N * int32")
    .Output("out: int32");

The same syntax works with "list(type)" attrs:

REGISTER_OP("MinimumLengthPolymorphicListExample")
    .Attr("T: list(type) >= 3")
    .Input("in: T")
    .Output("out: T");

Inputs and outputs

To summarize the above, an op registration can have multiple inputs and outputs:

REGISTER_OP("MultipleInsAndOuts")
    .Input("y: int32")
    .Input("z: float")
    .Output("a: string")
    .Output("b: int32");

Each input or output spec is of the form:

<name>: <io-type-expr>

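where <name> begins with a letter and can be composed of alphanumeric characters and underscores. <io-type-expr> is one of the following type expressions:

<type>, where <type> is a supported input type (e.g. float, int32, string). This specifies a single tensor of the given type.

See the list of supported Tensor types.

REGISTER_OP("BuiltInTypesExample")
    .Input("integers: int32")
    .Input("complex_numbers: complex64");

<attr-type>, where <attr-type> is the name of an Attr with type type or list(type) (with a possible type restriction). This syntax allows for polymorphic ops.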

REGISTER_OP("PolymorphicSingleInput")
    .Attr("T: type")
    .Input("in: T");

REGISTER_OP("RestrictedPolymorphicSingleInput")
    .Attr("T: {int32, int64}")
    .Input("in: T");

Referring to an attr of type list(type) allows you to accept a sequence of tensors.

REGISTER_OP("ArbitraryTensorSequenceExample")
    .Attr("T: list(type)")
    .Input("in: T")
    .Output("out: T");

REGISTER_OP("RestrictedTensorSequenceExample")
    .Attr("T: list({int32, int64})")
    .Input("in: T")
    .Output("out: T");

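Note that the number and types of tensors in the output out are the same as in the input in, since both are of type T.

For a sequence of tensors with the same type: <number> * <type>, where <number> is the name of an Attr with type int. The <type> can either be a specific type like int32 or float, or the name of an attr with type type. As an example of the first, this op accepts a list of int32 tensors: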

REGISTER_OP("Int32SequenceExample")
    .Attr("NumTensors: int")
    .Input("in: NumTensors * int32");

Whereas this op accepts a list of tensors of any type, as long as they are all the same:

REGISTER_OP("SameTypeSequenceExample")
    .Attr("NumTensors: int")
    .Attr("T: type")
    .Input("in: NumTensors * T");

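For a reference to a tensor: Ref(<type>), where <type> is one of the previous types.

A note on naming: Any attr used in the type of an input will be inferred. By convention those inferred attrs use capital names (like T or N). Otherwise inputs, outputs, and attrs have names like function parameters (e.g. num_outputs). For more details, see the earlier note on naming.

For more details, see tensorflow/core/framework/op_def_builder.h.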

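Backwards compatibility

Let's assume you have written a nice, custom op and shared it with others, so you have happy customers using your operation. However, you'd like to make changes to the op in some way.

In general, changes to existing, checked-in specifications must be backwards-compatible: changing the specification of an op must not break prior serialized GraphDef protocol buffers constructed from older specifications. The details of GraphDef compatibility are described separately in the TensorFlow documentation.

There are several ways to preserve backwards-compatibility.

Any new attrs added to an operation must have default values defined, and with that default value the op must have the original behavior. To change an operation from not polymorphic to polymorphic, you must give a default value to the new type attr to preserve the original signature by default. For example, if your operation was:

REGISTER_OP("MyGeneralUnaryOp")
    .Input("in: float")
    .Output("out: float");

you can make it polymorphic in a backwards-compatible way using: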

REGISTER_OP("MyGeneralUnaryOp")
    .Input("in: T")
    .Output("out: T")
    .Attr("T: numbertype = DT_FLOAT");

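You can safely make a constraint on an attr less restrictive. For example, you can change from {int32, int64} to {int32, int64, float} or type. Or you may change from {"apple", "orange"} to {"apple", "banana", "orange"} or string.
You can change single inputs / outputs into list inputs / outputs, as long as the default for the list type matches the old signature.
You can add a new list input / output, if it defaults to empty.
Namespace any new ops you create, by prefixing the op names with something unique to your project. This avoids having your op collide with any ops that might be included in future versions of TensorFlow.
Plan ahead! Try to anticipate future uses for the op. Some signature changes can't be done in a compatible way (for example, making a list of the same type into a list of varying types).

The full list of safe and unsafe changes can be found in tensorflow/core/framework/op_compatibility_test.cc. If you cannot make your change to an operation backwards compatible, then create a new operation with a new name with the new semantics.

Also note that while these changes can maintain GraphDef compatibility, the generated Python code may change in a way that isn't compatible with old callers. The Python API may be kept compatible by careful changes in a hand-written Python wrapper, by keeping the old signature except possibly adding new optional arguments to the end. Generally incompatible changes may only be made when TensorFlow changes major versions, and must conform to the GraphDef version semantics.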

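GPU support

You can implement different OpKernels and register one for CPU and another for GPU, just like you can register kernels for different types. There are several examples of kernels with GPU support in tensorflow/core/kernels/. Notice some kernels have a CPU version in a .cc file, a GPU version in a file ending in _gpu.cu.cc, and some code shared in common in a .h file.

For example, tf.pad has everything but the GPU kernel in tensorflow/core/kernels/pad_op.cc. The GPU kernel is in tensorflow/core/kernels/pad_op_gpu.cu.cc, and the shared code is a templated class defined in tensorflow/core/kernels/pad_op.h. We organize the code this way for two reasons: it allows you to share common code among the CPU and GPU implementations, and it puts the GPU implementation into a separate file so that it can be compiled only by the GPU compiler.

One thing to note: even when the GPU kernel version of pad is used, it still needs its "paddings" input in CPU memory. To mark that inputs or outputs are kept on the CPU, add a HostMemory() call to the kernel registration, e.g.: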

#define REGISTER_GPU_KERNEL(T)                         \
  REGISTER_KERNEL_BUILDER(Name("Pad")                  \
                              .Device(DEVICE_GPU)      \
                              .TypeConstraint<T>("T")  \
                              .HostMemory("paddings"), \
                          PadOp<GPUDevice, T>)

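Compiling the kernel for the GPU device

Look at cuda_op_kernel.cu.cc for an example that uses a CUDA kernel to implement an op. The tf_custom_op_library accepts a gpu_srcs argument in which the list of source files containing the CUDA kernels (*.cu.cc files) can be specified. For use with a binary installation of TensorFlow, the CUDA kernels have to be compiled with NVIDIA's nvcc compiler. Here is the sequence of commands you can use to compile cuda_op_kernel.cu.cc and cuda_op_kernel.cc into a single dynamically loadable library: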

nvcc -std=c++11 -c -o cuda_op_kernel.cu.o cuda_op_kernel.cu.cc \
-I $TF_INC -I$TF_INC/external/nsync/public -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC

g++ -std=c++11 -shared -o cuda_op_kernel.so cuda_op_kernel.cc \
cuda_op_kernel.cu.o -I $TF_INC -I$TF_INC/external/nsync/public -fPIC -lcudart -L$TF_LIB -ltensorflow_framework

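The cuda_op_kernel.so produced above can be loaded as usual in Python, using the tf.load_op_library function.

Note that if your CUDA libraries are not installed in /usr/local/lib64, you'll need to specify the path explicitly in the second (g++) command above. For example, add -L /usr/local/cuda-8.0/lib64/ if your CUDA is installed in /usr/local/cuda-8.0.

Note that in some Linux settings, additional options to the nvcc compiling step are needed. Add -D_MWAITXINTRIN_H_INCLUDED to the nvcc command line to avoid errors from mwaitxintrin.h.

Implement the gradient in Python

Given a graph of ops, TensorFlow uses automatic differentiation (backpropagation) to add new ops representing gradients with respect to the existing ops (see Gradient Computation). To make automatic differentiation work for new ops, you must register a gradient function which computes gradients with respect to the op's inputs given gradients with respect to the op's outputs.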

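Mathematically, if an op computes \(y = f(x)\), the registered gradient op converts gradients \(\partial L / \partial y\) of the loss \(L\) with respect to \(y\) into gradients \(\partial L / \partial x\) with respect to \(x\) via the chain rule:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial f}{\partial x}.$$

In the case of ZeroOut, only one entry in the input affects the output, so the gradient with respect to the input is a sparse "one hot" tensor. This is expressed as follows: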

from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import sparse_ops

@ops.RegisterGradient("ZeroOut")
def _zero_out_grad(op, grad):
  """The gradients for `zero_out`.

  Args:
    op: The `zero_out` `Operation` that we are differentiating, which we can use
      to find the inputs and outputs of the original op.
    grad: Gradient with respect to the output of the `zero_out` op.

  Returns:
    Gradients with respect to the input of `zero_out`.
  """
  to_zero = op.inputs[0]
  shape = array_ops.shape(to_zero)
  index = array_ops.zeros_like(shape)
  first_grad = array_ops.reshape(grad, [-1])[0]
  to_zero_grad = sparse_ops.sparse_to_dense([index], shape, first_grad, 0)
  return [to_zero_grad]  # List of one Tensor, since we have one input
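
To see why the gradient has this one-hot form, it can help to compare it with a finite-difference estimate, which is the same idea a gradient checker relies on. The sketch below is plain Python on flat lists; zero_out, zero_out_grad, and numeric_grad are hypothetical stand-ins for illustration, not TensorFlow functions.

```python
def zero_out(values):
    """Forward pass: keep element 0, zero the rest."""
    out = [0.0] * len(values)
    if values:
        out[0] = values[0]
    return out


def zero_out_grad(values, grad_output):
    """Analytic gradient: only input element 0 influences the output."""
    grad_input = [0.0] * len(values)
    if values:
        grad_input[0] = grad_output[0]
    return grad_input


def numeric_grad(values, grad_output, eps=1e-3):
    """Central differences of L(x) = sum(grad_output * zero_out(x))."""
    def loss(x):
        return sum(g * y for g, y in zip(grad_output, zero_out(x)))

    grads = []
    for i in range(len(values)):
        plus, minus = list(values), list(values)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus) - loss(minus)) / (2 * eps))
    return grads


x = [5.0, 4.0, 3.0]
upstream = [0.5, 2.0, -1.0]   # dL/dy, chosen arbitrarily for the check
analytic = zero_out_grad(x, upstream)
numeric = numeric_grad(x, upstream)
print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```

Only the first entry of the upstream gradient survives, matching the chain rule with \(\partial y_0 / \partial x_0 = 1\) and every other partial derivative equal to zero.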

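Details about registering gradient functions with tf.RegisterGradient:

For an op with one output, the gradient function will take a tf.Operation op and a tf.Tensor grad and build new ops out of the tensors op.inputs[i], op.outputs[i], and grad. Information about any attrs can be found via tf.Operation.get_attr.
If the op has multiple outputs, the gradient function will take op and grads, where grads is a list of gradients with respect to each output. The result of the gradient function must be a list of Tensor objects representing the gradients with respect to each input.
If there is no well-defined gradient for some input, such as for integer inputs used as indices, the corresponding returned gradient should be None. For example, for an op taking a floating point tensor x and an integer index i, the gradient function would return [x_grad, None].
If there is no meaningful gradient for the op at all, you often will not have to register any gradient, and as long as the op's gradient is never needed, you will be fine. In some cases, an op has no well-defined gradient but can be involved in the computation of the gradient. Here you can use ops.NotDifferentiable to automatically propagate zeros backwards.

Note that at the time the gradient function is called, only the data flow graph of ops is available, not the tensor data itself. Thus, all computation must be performed using other TensorFlow ops, to be run at graph execution time.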

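Shape functions in C++

The TensorFlow API has a feature called "shape inference" that provides information about the shapes of tensors without having to execute the graph. Shape inference is supported by "shape functions" that are registered for each op type in the C++ REGISTER_OP declaration, and perform two roles: asserting that the shapes of the inputs are compatible during graph construction, and specifying the shapes for the outputs.

Shape functions are defined as operations on the shape_inference::InferenceContext class. For example, in the shape function for ZeroOut:

.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
  c->set_output(0, c->input(0));
  return Status::OK();
});

c->set_output(0, c->input(0)); declares that the first output's shape should be set to the first input's shape. If the output is selected by its index as in the above example, the second parameter of set_output should be a ShapeHandle object. You can create an empty ShapeHandle object by its default constructor. The ShapeHandle object for an input with index idx can be obtained by c->input(idx).

There are a number of common shape functions that apply to many ops, such as shape_inference::UnchangedShape, which can be found in common_shape_fns.h and used as follows: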

REGISTER_OP("ZeroOut")
    .Input("to_zero: int32")
    .Output("zeroed: int32")
    .SetShapeFn(::tensorflow::shape_inference::UnchangedShape);

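A shape function can also constrain the shape of an input. For the version of ZeroOut with a vector shape constraint, the shape function would be as follows: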

.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
  ::tensorflow::shape_inference::ShapeHandle input;
  TF_RETURN_IF_ERROR(c->WithRank(c->input(0), 1, &input));
  c->set_output(0, input);
  return Status::OK();
});

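The WithRank call validates that the input shape c->input(0) has a shape with exactly one dimension (or, if the input shape is unknown, the output shape will be a vector with one unknown dimension).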
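
The behavior described for WithRank can be modeled in a few lines of plain Python. This is a conceptual sketch, not the TensorFlow implementation: with_rank is a hypothetical function, a shape is a tuple whose entries may be None for an unknown dimension, and a shape of None means the rank itself is unknown.

```python
def with_rank(shape, rank):
    """Return shape constrained to the given rank, or raise if it conflicts."""
    if shape is None:
        # Unknown rank: the result has the requested rank, all dims unknown.
        return (None,) * rank
    if len(shape) != rank:
        raise ValueError(
            "Shape must be rank %d but is rank %d" % (rank, len(shape)))
    return tuple(shape)


print(with_rank((5,), 1))  # (5,)
print(with_rank(None, 1))  # (None,)
```

A rank-2 input such as (2, 3) raises an error here, which corresponds to graph construction failing before the kernel ever runs.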

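If your op is polymorphic with multiple inputs, you can use members of InferenceContext to determine the number of shapes to check, and Merge to validate that the shapes are all compatible (alternatively, access attributes that indicate the lengths with InferenceContext::GetAttr, which provides access to the attributes of the op).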

.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
  ::tensorflow::shape_inference::ShapeHandle input;
  ::tensorflow::shape_inference::ShapeHandle output;
  for (size_t i = 0; i < c->num_inputs(); ++i) {
    TF_RETURN_IF_ERROR(c->WithRank(c->input(i), 2, &input));
    TF_RETURN_IF_ERROR(c->Merge(output, input, &output));
  }
  c->set_output(0, output);
  return Status::OK();
});

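Since shape inference is an optional feature, and the shapes of tensors may vary dynamically, shape functions must be robust to incomplete shape information for any of the inputs. The Merge method in InferenceContext allows the caller to assert that two shapes are the same, even if either or both of them do not have complete information. Shape functions are defined for all of the core TensorFlow ops and provide many different usage examples.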
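
Merging partially known shapes can be illustrated with a small plain-Python model. As an assumption for this sketch, a shape is a tuple whose entries may be None (unknown dimension), and a shape of None means unknown rank; merge_dims and merge_shapes are hypothetical names rather than TensorFlow APIs.

```python
def merge_dims(a, b):
    """Merge two dimensions, where None means the dimension is unknown."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    raise ValueError("Dimensions %r and %r are not compatible" % (a, b))


def merge_shapes(s0, s1):
    """Merge two shapes, keeping whatever information either side has."""
    if s0 is None:
        return s1
    if s1 is None:
        return s0
    if len(s0) != len(s1):
        raise ValueError("Ranks %d and %d differ" % (len(s0), len(s1)))
    return tuple(merge_dims(a, b) for a, b in zip(s0, s1))


print(merge_shapes((None, 3), (2, None)))  # (2, 3)
print(merge_shapes(None, (4,)))            # (4,)
```

Each side contributes the dimensions it knows, and an error is raised only when two known values disagree, which is the robustness to incomplete information that the paragraph above asks of shape functions.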

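The InferenceContext class has a number of functions that can be used to define shape function manipulations. For example, you can validate that a particular dimension has a very specific value using InferenceContext::Dim and InferenceContext::WithValue; you can specify that an output dimension is the sum / product of two input dimensions using InferenceContext::Add and InferenceContext::Multiply. See the InferenceContext class for all of the various shape manipulations you can specify. The following example sets the shape of the first output to (n, 3), where the first input has shape (n, ...):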

.SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
    c->set_output(0, c->Matrix(c->Dim(c->input(0), 0), 3));
    return Status::OK();
});

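If you have a complicated shape function, you should consider adding a test for validating that various input shape combinations produce the expected output shape combinations. You can see examples of how to write these tests in some of our core ops tests. (The syntax of INFER_OK and INFER_ERROR is a little cryptic, but it tries to be compact in representing input and output shape specifications in tests. For now, see the surrounding comments in those tests to get a sense of the shape string specification.)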


Last updated: 2017-12-18