This article is a translation of the official TVM documentation page: Adding an Operator to Relay

Adding an Operator to Relay

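In this document we go through the steps needed to register a new TVM operator in Relay. As our running example we follow the PR (pull request) that adds a cumulative product (cumprod) operation. That PR itself builds on another PR, which adds a cumulative sum (cumsum) operation.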

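Registering a new operator requires several steps:

Add an attribute node declaring the fixed arguments that are known at compile time
Write a type relation for your operation so that it integrates into Relay's type system
Use the RELAY_REGISTER_OP macro in C++ to register the operator's arity, type, and other hints for the compiler
Write how the operator is computed
Register the compute and schedule with the Relay operator
Define a C++ function that produces a call node for the operator, and register a Python API hook for that function
Wrap the above Python API hook in a neater interface
Write tests for the new Relay operator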

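1. Defining an Attribute Node

Attributes are fixed arguments that are supposed to be known at compile time. The stride and dilation of a convolution operator are good examples of fields that might belong to a convolution operator's attribute node.

Attributes should be defined in a file within the folder include/tvm/relay/attrs/.

Ultimately, we want to create an operator that can be called directly from the final Python interface: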

def cumprod(data, axis=None, dtype=None, exclusive=None):
    """Numpy style cumprod op. Return the cumulative inclusive product of the elements along
    a given axis.
    Parameters
    ----------
    data : relay.Expr
        The input data to the operator.
    axis : int, optional
        Axis along which the cumulative product is computed. The default (None) is to compute
        the cumprod over the flattened array.
    dtype : string, optional
        Type of the returned array and of the accumulator in which the elements are multiplied.
        If dtype is not specified, it defaults to the dtype of data.
    exclusive : bool, optional
        If true will return exclusive product in which the first element is not
        included. In other terms, if true, the j-th output element would be
        the product of the first (j-1) elements. Otherwise, it would be the product of
        the first j elements. The product of zero elements will be 1.
    Returns
    -------
    result : relay.Expr
        The result has the same size as data, and the same shape as data if axis is not None.
        If axis is None, the result is a 1-d array.
    """

A similar interface exists for cumsum().
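To make the inclusive and exclusive semantics in the docstring concrete, here is a minimal pure-Python sketch for flat input. The helper name cumprod_1d is hypothetical and is not part of TVM; it only mirrors the behaviour the docstring describes.

```python
def cumprod_1d(values, exclusive=False):
    """Cumulative product of a flat sequence (illustrative reference only)."""
    result = []
    running = 1  # the product of zero elements is 1
    for v in values:
        if exclusive:
            # Emit the product of the elements before v, then fold v in.
            result.append(running)
            running *= v
        else:
            # Fold v in first, so the j-th output covers the first j elements.
            running *= v
            result.append(running)
    return result


inclusive = cumprod_1d([1, 2, 3, 4])
exclusive = cumprod_1d([1, 2, 3, 4], exclusive=True)
print(inclusive)  # [1, 2, 6, 24]
print(exclusive)  # [1, 1, 2, 6]
```

With exclusive=True the output is shifted by one position and starts at 1, which matches the "product of the first (j-1) elements" wording above.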

Therefore, when defining our attributes in include/tvm/relay/attrs/transform.h, we choose the axis, the accumulation dtype, and the exclusivity of the operation as the fields of the struct.

/*! \brief Attributes used in cumsum and cumprod operator */
struct ScanopAttrs : public tvm::AttrsNode<ScanopAttrs> {
  Integer axis;
  DataType dtype;
  Bool exclusive = Bool(false);
  TVM_DECLARE_ATTRS(ScanopAttrs, "relay.attrs.ScanopAttrs") {
    TVM_ATTR_FIELD(axis).describe("The axis to operate over").set_default(NullValue<Integer>());
    TVM_ATTR_FIELD(dtype).describe("Output data type").set_default(NullValue<DataType>());
    TVM_ATTR_FIELD(exclusive)
        .describe("The first element is not included")
        .set_default(Bool(false));
  }
};

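2. Writing a Type Relation

To allow for flexibility in registering operators and for greater expressivity and granularity in expressing types in Relay, operators are typed using relations between input and output types. These relations are represented as functions that take in a list of input types and output types (any of which may be incomplete) and return a list of input and output types that satisfies the relation. This includes shape information that can be determined statically at compile time. Essentially, in addition to computing the output type, a relation for an operator can enforce all the necessary typing rules (namely by inspecting the input types).

The type relation for the cumprod and cumsum operators can be found in src/relay/op/tensor/transform.cc: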

TVM_REGISTER_NODE_TYPE(ScanopAttrs);
bool ScanopRel(const Array<Type>& types, int num_inputs, const Attrs& attrs, const TypeReporter& reporter) {
    // types: [data, output]
    ICHECK_EQ(types.size(), 2) << "Expects two types, one for the input and another for the output";
    const auto* data = types[0].as<TensorTypeNode>();
    if (data == nullptr) {
        ICHECK(types[0].as<IncompleteTypeNode>())
        << "Scanop: expect input type to be TensorType but get " << types[0];
        return false;
    }

    const auto* param = attrs.as<ScanopAttrs>();

    auto dtype = param->dtype;
    if (dtype.is_void()) {
        dtype = data->dtype;
    }

    if (param->axis.defined()) {
        reporter->Assign(types[1], TensorType(data->shape, dtype));
    } else {
        auto prod = data->shape[0];
        for (size_t i = 1; i < data->shape.size(); ++i) {
            prod = prod * data->shape[i];
        }
        reporter->Assign(types[1], TensorType({prod}, dtype));
    }

    return true;
}
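The rule this relation encodes can be restated as a small pure-Python sketch. The function scanop_output_type is a hypothetical stand-in, not a TVM API: shapes are plain tuples, dtypes are strings, and None plays the role of an undefined axis or a void dtype.

```python
from functools import reduce
from operator import mul


def scanop_output_type(input_shape, input_dtype, axis=None, out_dtype=None):
    """Return (shape, dtype) of the scan output, mirroring the relation's logic."""
    # Fall back to the input dtype when no output dtype was requested.
    dtype = out_dtype if out_dtype is not None else input_dtype
    if axis is not None:
        # Scanning along an axis keeps the input shape.
        return tuple(input_shape), dtype
    # Without an axis the input is flattened to a 1-d tensor.
    flat = reduce(mul, input_shape, 1)
    return (flat,), dtype


print(scanop_output_type((2, 3, 4), "float32", axis=1))
print(scanop_output_type((2, 3, 4), "int32", out_dtype="int64"))
```

One difference to be aware of: this sketch returns a one-element shape for rank-0 input, whereas the C++ relation reads data->shape[0] directly.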

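3. Relating the Arity and Attributes to an Operation

We then register the names of our new operations and annotate them with the calling interface. The RELAY_REGISTER_OP macro in C++ lets a developer specify the following information about an operator in Relay:

Arity (number of arguments)
Names and descriptions of the positional arguments
Support level (1 indicates an internal intrinsic; higher numbers indicate less integral or externally supported operators)
A type relation for the operator
Other annotations that are useful when optimizing the operation

Once again we add this to src/relay/op/tensor/transform.cc: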

RELAY_REGISTER_OP("cumsum")
    .describe(
        R"doc(Return the cumulative sum of the elements along a given axis.)doc" TVM_ADD_FILELINE)
    .set_num_inputs(1)
    .add_argument("data", "Tensor", "The input tensor.")
    .set_support_level(3)
    .add_type_rel("Cumsum", ScanopRel)
    .set_attr<TOpPattern>("TOpPattern", kOpaque);

RELAY_REGISTER_OP("cumprod")
    .describe(
        R"doc(Return the cumulative product of the elements along a given axis.)doc" TVM_ADD_FILELINE)
    .set_num_inputs(1)
    .add_argument("data", "Tensor", "The input tensor.")
    .set_support_level(3)
    .add_type_rel("Cumprod", ScanopRel)
    .set_attr<TOpPattern>("TOpPattern", kOpaque);

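In this case, TOpPattern is a hint to the compiler about the pattern of computation the operator performs, which can be useful for fusing operators. kOpaque tells TVM not to bother trying to fuse this operator.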

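4. Defining the Compute of the Operation

While we have now defined the interface for our operations, we still need to define how to perform the actual calculations for cumsum and cumprod.

Writing this code is outside the scope of this tutorial. For now, we assume we have a well-tested implementation of the operation's compute. For more details on how to do this, we recommend looking up the tutorials on tensor expressions and TVM's operator inventory (topi), and reading python/tvm/topi/scan.py and python/tvm/topi/cuda/scan.py. For the cumsum and cumprod operations we write things directly in TIR, which is the representation that tensor expressions and topi lower into.

5. Hooking up Compute and Strategy with Relay

After you have implemented your compute function, we now need to glue it to our Relay operation. Within TVM this means defining not only the computation but also the schedule for the operation. A strategy is a method that picks which computation and which schedule to use. For example, for 2D convolutions we might recognize that we are doing a depthwise convolution and dispatch to a more efficient computation and schedule as a result. In our case, however, we have no such need beyond dispatching between our CPU and GPU implementations. In python/tvm/relay/op/strategy/generic.py and python/tvm/relay/op/strategy/cuda.py we add the following strategies: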

def wrap_compute_scanop(topi_compute):
    """Wrap scanop style topi compute"""

    def _compute_scanop(attrs, inputs, _):
        return [topi_compute(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]

    return _compute_scanop

@override_native_generic_func("cumsum_strategy")
def cumsum_strategy(attrs, inputs, out_type, target):
    """cumsum generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumsum),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumsum.generic",
    )
    return strategy

@override_native_generic_func("cumprod_strategy")
def cumprod_strategy(attrs, inputs, out_type, target):
    """cumprod generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumprod),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumprod.generic",
    )
    return strategy

@cumsum_strategy.register(["cuda", "gpu"])
def cumsum_strategy_cuda(attrs, inputs, out_type, target):
    """cumsum cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumsum),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumsum.cuda",
    )
    return strategy

@cumprod_strategy.register(["cuda", "gpu"])
def cumprod_strategy_cuda(attrs, inputs, out_type, target):
    """cumprod cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumprod),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumprod.cuda",
    )
    return strategy
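The dispatch idea behind these strategies can be sketched without TVM. The following toy class is hypothetical and only imitates the pattern of a generic function with per-target overrides; the names ToyStrategy, _cumsum_generic, and _cumsum_cuda are illustrative and do not reflect how override_native_generic_func is implemented.

```python
class ToyStrategy:
    """A generic implementation plus overrides keyed by target name."""

    def __init__(self, generic_impl):
        self._generic = generic_impl
        self._overrides = {}

    def register(self, keys):
        """Decorator that registers func as the override for each key."""
        def decorator(func):
            for key in keys:
                self._overrides[key] = func
            return func
        return decorator

    def __call__(self, target_keys):
        # Use the first override matching one of the target's keys,
        # otherwise fall back to the generic implementation.
        for key in target_keys:
            if key in self._overrides:
                return self._overrides[key]()
        return self._generic()


def _cumsum_generic():
    return "cumsum.generic"


cumsum_strategy = ToyStrategy(_cumsum_generic)


@cumsum_strategy.register(["cuda", "gpu"])
def _cumsum_cuda():
    return "cumsum.cuda"


print(cumsum_strategy(["cuda", "gpu"]))  # cumsum.cuda
print(cumsum_strategy(["llvm", "cpu"]))  # cumsum.generic
```

A target with no registered override falls through to the generic implementation, which is the role the generic strategies above play for CPU targets.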

In each strategy we specify, within add_implementation(), the compute we wrote and the schedule to use. Finally, we link the strategy and compute with the Relay operator defined in python/tvm/relay/op/_transform.py:

# cumsum
@_reg.register_compute("cumsum")
def compute_cumsum(attrs, inputs, output_type):
    """Compute definition of cumsum"""
    return [topi.cumsum(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]

_reg.register_strategy("cumsum", strategy.cumsum_strategy)
_reg.register_shape_func("cumsum", False, elemwise_shape_func)

# cumprod
@_reg.register_compute("cumprod")
def compute_cumprod(attrs, inputs, output_type):
    """Compute definition of cumprod"""
    return [topi.cumprod(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]

_reg.register_strategy("cumprod", strategy.cumprod_strategy)
_reg.register_shape_func("cumprod", False, elemwise_shape_func)

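The shape functions are used to determine the output shape for a dynamically shaped input tensor. In this case we tell TVM that the output shape will be the same as the input shape.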

6. Creating a Relay Call Node and Exposing a Python Hook

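We now have a working operation and only need to call it properly via a Relay call node. This step simply requires writing a function that takes the arguments to the operator (as Relay expressions) and returns a call node to the operator (i.e., the node that should be placed into the Relay AST where the call to the operator is intended).

At present, call attributes and type arguments (the last two fields) are not supported, so it suffices to use Op::Get to fetch the operator's information from the operator registry and pass the arguments to the call node, as below. In src/relay/op/tensor/transform.cc: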

Expr MakeCumsum(Expr data, Integer axis, DataType dtype, Bool exclusive) {
    auto attrs = make_object<ScanopAttrs>();
    attrs->dtype = dtype;
    attrs->axis = axis;
    attrs->exclusive = exclusive;
    static const Op& op = Op::Get("cumsum");
    return Call(op, {data}, Attrs(attrs), {});
}

TVM_REGISTER_GLOBAL("relay.op._make.cumsum").set_body_typed(MakeCumsum);

Expr MakeCumprod(Expr data, Integer axis, DataType dtype, Bool exclusive) {
    auto attrs = make_object<ScanopAttrs>();
    attrs->dtype = dtype;
    attrs->axis = axis;
    attrs->exclusive = exclusive;
    static const Op& op = Op::Get("cumprod");
    return Call(op, {data}, Attrs(attrs), {});
}

TVM_REGISTER_GLOBAL("relay.op._make.cumprod").set_body_typed(MakeCumprod);

Here TVM_REGISTER_GLOBAL exposes the MakeCumsum and MakeCumprod functions in Python via relay.op._make.cumsum(...) and relay.op._make.cumprod(...), respectively.
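The mechanism of exposing a function under a global string name can be illustrated with a toy registry. Everything below is hypothetical: register_global, get_global, and the tuple-based "call node" are stand-ins for illustration, not TVM's implementation. Rejecting duplicate names is a design choice of this sketch, made to highlight that each operator needs its own distinct name.

```python
_GLOBAL_FUNCS = {}


def register_global(name):
    """Decorator that stores func in the toy registry under name."""
    def decorator(func):
        if name in _GLOBAL_FUNCS:
            raise ValueError(f"global function {name!r} is already registered")
        _GLOBAL_FUNCS[name] = func
        return func
    return decorator


def get_global(name):
    """Look up a previously registered function by its string name."""
    return _GLOBAL_FUNCS[name]


def _make_call(op_name, data, axis, dtype, exclusive):
    # A plain tuple stands in for a Relay call node.
    attrs = {"axis": axis, "dtype": dtype, "exclusive": exclusive}
    return ("call", op_name, data, attrs)


@register_global("relay.op._make.cumsum")
def make_cumsum(data, axis, dtype, exclusive):
    return _make_call("cumsum", data, axis, dtype, exclusive)


@register_global("relay.op._make.cumprod")
def make_cumprod(data, axis, dtype, exclusive):
    return _make_call("cumprod", data, axis, dtype, exclusive)


call = get_global("relay.op._make.cumprod")("x", 0, "float32", False)
print(call[1])  # cumprod
```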

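7. Including a Cleaner Python API Hook

The general convention in Relay is that functions exported through TVM_REGISTER_GLOBAL should be wrapped in a separate Python function rather than called directly from Python. For our operators, we expose this cleaner interface in python/tvm/relay/op/transform.py: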

def cumsum(data, axis=None, dtype=None, exclusive=None):
    return _make.cumsum(data, axis, dtype, exclusive)

def cumprod(data, axis=None, dtype=None, exclusive=None):
    return _make.cumprod(data, axis, dtype, exclusive)

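Note that these Python wrappers can also be a good opportunity to provide an easier interface to the operator. For example, the concat operator is registered as taking only one argument, namely a tuple of the tensors to be concatenated, but the Python wrapper takes the tensors as separate arguments and combines them into a tuple before producing the call node: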

def concat(*args):
    """Concatenate the input tensors along the zero axis.

    Parameters
    ----------
    args: list of Tensor

    Returns
    -------
    tensor: The concatenated tensor.
    """
    tup = Tuple(list(args))
    return _make.concat(tup)

8. Writing Unit Tests!

This is fairly self-explanatory! Some example unit tests for our cumulative sum and cumulative product operators can be found in tests/python/relay/test_op_level3.py.

Other Topics

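Gradient Operators

Gradient operators are important for writing differentiable programs in Relay. While Relay's autodiff algorithm can differentiate first-class language constructs, operators are opaque. Because Relay cannot look into the implementation, an explicit differentiation rule must be provided.

Both Python and C++ can be used to write gradient operators, but we focus our examples on Python, as it is more commonly used.

Adding a Gradient in Python

A collection of Python gradient operators can be found in python/tvm/relay/op/_tensor_grad.py. We will walk through two representative examples: sigmoid and multiply.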

@register_gradient("sigmoid")
def sigmoid_grad(orig, grad):
    """Returns [grad * sigmoid(x) * (1 - sigmoid(x))]."""
    return [grad * orig * (ones_like(orig) - orig)]

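The inputs here are the original operator call orig and a gradient grad to accumulate into. What we return is a list, where the element at the i-th index is the derivative of the operator with respect to the operator's i-th input. In general, the gradient returns a list with as many elements as the base operator has inputs.

Before we analyze this definition further, we should first recall the derivative of the sigmoid function: $\frac{\partial \sigma}{\partial x} = \sigma(x)(1 - \sigma(x))$. The definition above looks similar to the mathematical definition, but with one important addition, which we describe below.

The term orig * (ones_like(orig) - orig) directly matches the derivative, because orig here is the sigmoid function, but we are not only interested in how to compute the gradient of this function. We are interested in composing this gradient with other gradients, so that we can accumulate the gradient across an entire program. This is where the grad term comes in. In the expression grad * orig * (ones_like(orig) - orig), multiplying by grad specifies how to compose the derivative with the gradient so far.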
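As a quick sanity check of this rule, the sketch below compares it against a central finite difference on plain Python floats. It does not use Relay: the scalar sigmoid and sigmoid_grad functions here are illustrative stand-ins, with orig holding the value sigmoid(x) and grad holding an arbitrary upstream gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(orig, grad):
    """Scalar version of the rule: grad * sigmoid(x) * (1 - sigmoid(x))."""
    return grad * orig * (1.0 - orig)


x = 0.7
upstream = 2.5  # gradient flowing in from the rest of the program
eps = 1e-6

# Chain rule by hand: upstream gradient times a numerical derivative.
numeric = upstream * (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid_grad(sigmoid(x), upstream)

print(abs(numeric - analytic) < 1e-7)
```

The two values agree to within the finite-difference error, which shows that multiplying by grad is simply the chain rule applied to the upstream gradient.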

Now, let us consider a slightly more interesting example, multiply:

@register_gradient("multiply")
def multiply_grad(orig, grad):
    """Returns [grad * y, grad * x]"""
    x, y = orig.args
    return [collapse_sum_like(grad * y, x),
            collapse_sum_like(grad * x, y)]

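In this example, there are two elements in the returned list, because multiply is a binary operator. Recall that if $f(x, y) = xy$, the partial derivatives are $\frac{\partial f}{\partial x} = y$ and $\frac{\partial f}{\partial y} = x$.

multiply requires one step that sigmoid does not, because multiply has broadcasting semantics. Since the shape of grad might not match the shapes of the inputs, we use collapse_sum_like to take the contents of the grad * <var> terms and make the shape match the shape of the input we are differentiating with respect to.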
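To see why the collapsing step is needed, consider a matrix x multiplied by a vector y that is broadcast across the rows. The sketch below uses nested Python lists instead of Relay tensors; multiply_grad_broadcast is a hypothetical helper that handles only this one broadcasting pattern.

```python
def multiply_grad_broadcast(x, y, grad):
    """Gradients of x * y where x is rows x cols and y (length cols) is broadcast over rows."""
    # grad * y already has the shape of x, so nothing needs collapsing.
    grad_x = [[g * yj for g, yj in zip(grad_row, y)] for grad_row in grad]
    # grad * x has the shape of x, which is larger than y ...
    full = [[g * xv for g, xv in zip(grad_row, x_row)]
            for grad_row, x_row in zip(grad, x)]
    # ... so sum over the broadcast (row) dimension to recover y's shape.
    grad_y = [sum(column) for column in zip(*full)]
    return grad_x, grad_y


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = [10.0, 20.0, 30.0]
grad = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

grad_x, grad_y = multiply_grad_broadcast(x, y, grad)
print(grad_x)  # [[10.0, 20.0, 30.0], [10.0, 20.0, 30.0]]
print(grad_y)  # [5.0, 7.0, 9.0]
```

Each element of y contributes to every row of the output, so its gradient is the sum of the per-row contributions. That summation over broadcast dimensions is the job collapse_sum_like performs in the Relay gradient.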

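Adding a Gradient in C++

Adding a gradient in C++ is similar to adding one in Python, but the interface for registering is slightly different.

First, make sure src/relay/transforms/pattern_utils.h is included. It provides helper functions for creating nodes in the Relay AST. Then, define the gradient in a similar fashion as in the Python example: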

tvm::Array<Expr> MultiplyGrad(const Expr& orig_call, const Expr& output_grad) {
    // Downcast the generic expression to a call node to access its arguments.
    const Call& call = Downcast<Call>(orig_call);
    return { CollapseSumLike(Multiply(output_grad, call->args[1]), call->args[0]),
             CollapseSumLike(Multiply(output_grad, call->args[0]), call->args[1]) };
}

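Notice that in C++ we cannot use the same operator overloading that we have in Python, and we need to downcast, so the implementation is more verbose. Even so, we can easily verify that this definition mirrors the earlier example in Python.

Now, instead of using a Python decorator, we need to append a set_attr call for "FPrimalGradient" to the end of the base operator's registration in order to register the gradient.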

RELAY_REGISTER_OP("multiply")
    // ...
    // Set other attributes
    // ...
    .set_attr<FPrimalGradient>("FPrimalGradient", MultiplyGrad);