CMU10-414-Assignment

CMU 10-414 Assignments: Lab Notes

Preface

This post records my process of working through the assignments for CMU 10-414/714 Deep Learning Systems, together with the accompanying notes. There are six homeworks in total, which step by step build a deep learning framework from scratch and then use it to construct network models commonly seen in deep learning, including CNNs, RNNs, and Transformers. Environment: Ubuntu 24 @ WSL2. Since the official autograder no longer accepts registrations from students not enrolled in the course, this code is only guaranteed to pass the provided test cases.

Resource Archive

The source code comes from the official site: Assignments

All code is uploaded to cmu10-414-assignments: cmu10-414-assignments. If the official site ever takes the handouts down, the original code can be recovered by rolling back through the git history.

hw0

The first homework requires completing 7 functions. The first one is trivial and only serves to get familiar with the grading system, so we start from the second.

parse_mnist

The signature is parse_mnist(image_filename, label_filename); it reads the MNIST handwritten digit dataset. The official website describes the file format in detail; just scroll down to the section FILE FORMATS FOR THE MNIST DATABASE.

The dataset is split into a training set and a test set, each consisting of digit images and labels. In a label file, the first 8 bytes hold the magic number and the number of items, followed by one byte per sample. In an image file, the first 16 bytes hold non-image metadata, after which the pixels are laid out in row-major order, one byte per pixel, with 28×28 pixels per image.

In the implementation, the gzip library is used to read the data files as raw bytes. Note that the whole dataset needs to be normalized, i.e., each pixel's grayscale value is divided by 255. The full implementation:

import gzip
import numpy as np

def parse_mnist(image_filename, label_filename):
    image_file_handle = gzip.open(image_filename, 'rb')
    label_file_handle = gzip.open(label_filename, 'rb')

    # Read past the headers
    image_file_handle.read(16)
    label_file_handle.read(8)

    # Read the data
    image_data = image_file_handle.read()
    label_data = label_file_handle.read()

    # Close the files
    image_file_handle.close()
    label_file_handle.close()

    # Process image data
    X = np.frombuffer(image_data, dtype=np.uint8).reshape(-1, 28*28).astype(np.float32)
    X = X / 255.0
    
    # Process label data
    y = np.frombuffer(label_data, dtype=np.uint8)

    return X, y

softmax_loss

The signature is softmax_loss(Z, y). Note that it computes the softmax loss, i.e., the cross-entropy loss, rather than performing softmax normalization.

Two lines of code following the formula are enough, so no further commentary:

def softmax_loss(Z, y):
    rows = np.arange(Z.shape[0])
    return -np.mean(Z[rows, y] - np.log(np.sum(np.exp(Z), axis=1)))
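
As a quick sanity check (an illustrative snippet, not part of the submission), the value agrees with the usual negative log-likelihood of the softmax probabilities:

```python
import numpy as np

Z = np.array([[1.0, 2.0, 0.5],
              [0.2, 0.1, 3.0]])
y = np.array([1, 2])

probs = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
manual = -np.mean(np.log(probs[np.arange(2), y]))
print(np.isclose(softmax_loss(Z, y), manual))  # True
```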

softmax_regression_epoch

The signature is softmax_regression_epoch(X, y, theta, lr=0.1, batch=100); it implements one epoch of training for softmax regression.

First compute the total number of batches and loop that many times. In each iteration, take the corresponding samples from X and y and then simply follow the formulas. One small trick converts the labels into one-hot encoding: E_batch = np.eye(theta.shape[1])[y_batch]; the rest is straightforward:

def softmax_regression_epoch(X, y, theta, lr=0.1, batch=100):
    total_batches = (X.shape[0] + batch - 1) // batch
    for i in range(total_batches):
        X_batch = X[i*batch:(i+1)*batch]
        y_batch = y[i*batch:(i+1)*batch]
        E_batch = np.eye(theta.shape[1])[y_batch]
        logits = X_batch @ theta
        Z_batch = np.exp(logits)
        Z_batch /= np.sum(Z_batch, axis=1, keepdims=True)
        gradients = X_batch.T @ (Z_batch - E_batch) / batch
        theta -= lr * gradients
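
To illustrate the one-hot trick on its own (a standalone NumPy snippet, not part of the submission):

```python
import numpy as np

y_batch = np.array([0, 2, 1])
E_batch = np.eye(4)[y_batch]   # 4 classes -> one one-hot row per label
print(E_batch)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```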

nn_epoch

The signature is nn_epoch(X, y, W1, W2, lr=0.1, batch=100); it implements one epoch of training for a two-layer perceptron.

Just follow the formulas when writing the code; there were a couple of details worth keeping in mind.

The complete code:

def nn_epoch(X, y, W1, W2, lr=0.1, batch=100):
    total_batches = (X.shape[0] + batch - 1) // batch
    for i in range(total_batches):
        X_batch = X[i*batch:(i+1)*batch]
        y_batch = y[i*batch:(i+1)*batch]
        E_batch = np.eye(W2.shape[1])[y_batch]
        Z1_batch = np.maximum(X_batch @ W1, 0)
        G2_batch = np.exp(Z1_batch @ W2)
        G2_batch /= np.sum(G2_batch, axis=1, keepdims=True)
        G2_batch -= E_batch
        G1_batch = (Z1_batch > 0) * (G2_batch @ W2.T)
        gradients_W1 = X_batch.T @ G1_batch
        gradients_W2 = Z1_batch.T @ G2_batch
        W1 -= lr * gradients_W1
        W2 -= lr * gradients_W2

softmax_regression_epoch_cpp

The signature is void softmax_regression_epoch_cpp(const float *X, const unsigned char *y, float *theta, size_t m, size_t n, size_t k, float lr, size_t batch); it is the C++ version of softmax regression.

Unlike Python, which conveniently handles array index bounds for us, the C++ version has to treat the full batches and the final partial batch separately. Computing the logits requires three nested loops to emulate matrix multiplication. The C++ version also does not need to materialize the $E_y$ matrix: instead of computing the gradient with matrix operations, use two nested loops and test whether class_idx is the true label: softmax[sample_idx * k + class_idx] -= (y[start_idx + sample_idx] == class_idx);

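Since the original C++ listing is not reproduced here, below is a minimal sketch of the approach just described, using the handout's signature. The loop structure and the in-place indicator subtraction follow the description above; treat it as an illustration rather than the exact submitted code.

```cpp
#include <cmath>
#include <vector>
#include <algorithm>

void softmax_regression_epoch_cpp(const float *X, const unsigned char *y,
                                  float *theta, size_t m, size_t n, size_t k,
                                  float lr, size_t batch) {
  for (size_t start = 0; start < m; start += batch) {
    size_t bs = std::min(batch, m - start);            // the last batch may be partial
    std::vector<float> softmax(bs * k, 0.0f);

    // logits = X_batch @ theta, via three nested loops
    for (size_t i = 0; i < bs; i++)
      for (size_t j = 0; j < k; j++)
        for (size_t l = 0; l < n; l++)
          softmax[i * k + j] += X[(start + i) * n + l] * theta[l * k + j];

    // row-wise softmax
    for (size_t i = 0; i < bs; i++) {
      float sum = 0.0f;
      for (size_t j = 0; j < k; j++) {
        softmax[i * k + j] = std::exp(softmax[i * k + j]);
        sum += softmax[i * k + j];
      }
      for (size_t j = 0; j < k; j++) softmax[i * k + j] /= sum;
    }

    // subtract the one-hot indicator without building an E_y matrix
    for (size_t i = 0; i < bs; i++)
      for (size_t j = 0; j < k; j++)
        softmax[i * k + j] -= (y[start + i] == j);

    // gradient = X_batch^T @ (softmax - E_y) / batch_size, then the SGD update
    for (size_t l = 0; l < n; l++)
      for (size_t j = 0; j < k; j++) {
        float g = 0.0f;
        for (size_t i = 0; i < bs; i++)
          g += X[(start + i) * n + l] * softmax[i * k + j];
        theta[l * k + j] -= lr * g / bs;
      }
  }
}
```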

hw0 wrap-up

hw0 is meant to be completed before Lecture 2. A beginner staring at a pile of formulas will probably feel lost, but the whole homework is fairly easy; just follow the formulas step by step (apart from a strange precision error in the two-layer perceptron). It mainly serves to get familiar with NumPy and with basic DL models.

hw1

This homework has six parts: forward computation, backward gradients, topological sort, reverse-mode automatic differentiation, softmax loss, and SGD for a two-layer neural network.

Implementing forward & backward computation

I'll discuss the first two parts together. The first implements a number of common operators on top of NumPy's API; the second implements the gradients of those operators in terms of the operators from the first part.

Note that the notebook stresses that the first part operates on NDArray while the second operates on Tensor. The former mimics the low-level implementation of these operators; the latter implements the gradient computation by calling those operators, in other words it wraps the gradient computation as yet another operator, so that taking a gradient becomes an ordinary operation and gradients of gradients can be derived automatically. See Lecture 4 for the details.


```python
class EWiseDiv(TensorOp):
    """Op to element-wise divide two nodes."""

    def compute(self, a, b):
        return array_api.true_divide(a, b)

    def gradient(self, out_grad, node):
        a, b = node.inputs
        return out_grad/b, -a/b/b*out_grad
```

```python
class DivScalar(TensorOp):
    def __init__(self, scalar):
        self.scalar = scalar

    def compute(self, a):
        return array_api.true_divide(a, self.scalar)

    def gradient(self, out_grad, node):
        return out_grad/self.scalar
```

For MatMul, the textbook adjoints are:

```python
adjoint1 = out_grad @ transpose(b)
adjoint2 = transpose(a) @ out_grad
```

But, but, but: the above is only the theoretical derivation. In practice there are two problems: 1) the operands of a matrix multiplication may be higher-dimensional tensors rather than 2-D matrices, e.g. tensors of shape (2, 2, 3, 4) and (2, 2, 4, 5); 2) the multiplication may involve broadcasting, and we have to decide how to handle the gradient in that case.

For the first problem, NumPy handles almost everything for us: as long as the last two dimensions of the two tensors are compatible for 2-D matrix multiplication and the remaining (batch) dimensions are equal, or a batch dimension is 1 (in which case it is broadcast), the tensors can be multiplied.

There is no free lunch, though: automatic broadcasting brings convenience, but it also brings the second problem. The adjoint, i.e. the partial derivative we return, should have the same shape as the corresponding input, yet the gradient computed from the formula has the broadcast shape, so a reduce step is needed.

Here is my loose, informal derivation of that reduce step. Suppose the matrix $A_{m\times n}$ is broadcast to $A'_{p\times m\times n}$; it is $A'$ that actually takes part in the computation. First pretend that $A'$ replaces $A$ in the computation graph. When some input node $x_1$ of $A'@B$ (call that node $f(x_1,\dots)$) needs its gradient, we must multiply $\partial f/\partial x_1$ by the adjoint computed for $A'$. Now put $A$ back: correspondingly, the gradient at node $f(x_1,\dots)$ has to sum the partial derivatives over the $p$ dimension, which for $A'_{p\times m\times n}$ means adding up its elements along the $p$ axis, giving back something of shape $m\times n$.

That description is not very clear; the bottom line is that every broadcast dimension has to be summed away.
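
A small NumPy check of this rule (the shapes are arbitrary and chosen only for illustration):

```python
import numpy as np

a = np.random.randn(3, 4)          # will be broadcast to (2, 3, 4) by matmul
b = np.random.randn(2, 4, 5)

out_grad = np.ones((2, 3, 5))                      # gradient of (a @ b).sum()
grad_broadcast = out_grad @ b.transpose(0, 2, 1)   # shape (2, 3, 4)
grad_a = grad_broadcast.sum(axis=0)                # sum away the broadcast axis -> (3, 4)

# finite-difference check on a single entry
eps = 1e-6
a_plus = a.copy(); a_plus[1, 2] += eps
numeric = ((a_plus @ b).sum() - (a @ b).sum()) / eps
print(grad_a.shape)                                  # (3, 4)
print(np.isclose(grad_a[1, 2], numeric, atol=1e-4))  # True
```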

In NumPy, dimensions added by broadcasting are always prepended at the front, so it suffices to count how many dimensions have to be summed away and take that many leading axes; see the code:

```python
class MatMul(TensorOp):
    def compute(self, a, b):
        return a@b

    def gradient(self, out_grad, node):
        a, b = node.inputs
        adjoint1 = out_grad @ transpose(b)
        adjoint2 = transpose(a) @ out_grad
        adjoint1 = summation(adjoint1, axes=tuple(range(len(adjoint1.shape) - len(a.shape))))
        adjoint2 = summation(adjoint2, axes=tuple(range(len(adjoint2.shape) - len(b.shape))))
        return adjoint1, adjoint2
```

The Summation operator computes

$$\mathrm{SUM}(X_{s_1\times s_2\times\dots\times s_n},\ \mathrm{axes}) = \Big[\sum_{s_i\in \mathrm{axes}} X\Big]_{\{s_j \mid j\notin \mathrm{axes}\}}$$

Suppose an input has shape 3×2×4×5 and we sum over dimensions 0 and 2; the output shape is 2×5. The backward pass first expands out_grad to 1×2×1×5 and then broadcasts it to the input's shape.
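
A tiny NumPy walkthrough of exactly this example (illustrative only):

```python
import numpy as np

x = np.random.randn(3, 2, 4, 5)
out = x.sum(axis=(0, 2))                 # shape (2, 5)
out_grad = np.ones_like(out)             # pretend upstream gradient

# Backward: restore the reduced axes as size-1 dims, then broadcast to x.shape
grad = np.broadcast_to(out_grad.reshape(1, 2, 1, 5), x.shape)
print(grad.shape)                        # (3, 2, 4, 5)
# Each input element contributes exactly once to one output element,
# so its gradient is simply the corresponding out_grad entry.
```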

A note to my future self: I have not fully understood this part, i.e. how to express the summation operation and its derivative formally. The following code happened to pass the tests:

```python
class Summation(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        return array_api.sum(a, axis=self.axes)

    def gradient(self, out_grad, node):
        a = node.inputs[0]
        shape = list(a.shape)
        axes = self.axes
        if axes is None:
            axes = list(range(len(shape)))
        for _ in axes:
            shape[_] = 1
        return broadcast_to(reshape(out_grad, shape), a.shape)
```

For the analysis of the broadcast operator's forward and gradient computation, see the MatMul operator above, which discusses the broadcasting case in detail. The implementation of this operator:

```python
class BroadcastTo(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.broadcast_to(a, self.shape)

    def gradient(self, out_grad, node):
        input_shape = node.inputs[0].shape
        ret = summation(out_grad, tuple(range(len(out_grad.shape) - len(input_shape))))
        for i, dim in enumerate(input_shape):
            if dim == 1:
                ret = summation(ret, axes=(i,))
        return reshape(ret, input_shape)
```

```python
class Reshape(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        return array_api.reshape(a, self.shape)

    def gradient(self, out_grad, node):
        return reshape(out_grad, node.inputs[0].shape)
```

```python
class Negate(TensorOp):
    def compute(self, a):
        return array_api.negative(a)

    def gradient(self, out_grad, node):
        return negate(out_grad)
```

```python
class Transpose(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, a):
        if self.axes is None:
            return array_api.swapaxes(a, -1, -2)
        else:
            return array_api.swapaxes(a, *self.axes)

    def gradient(self, out_grad, node):
        return transpose(out_grad, self.axes)
```

Topological sort

This part asks for a topological sort; everything involved is standard data-structures material: topological ordering of a graph, post-order traversal, and DFS.

The problem statement explicitly asks us to derive the topological order of the computation graph via post-order traversal. In short: if the current node has unvisited children (its inputs), visit the children first; otherwise visit the node itself. Visiting a node means marking it as visited and appending it to the topological order.

Combined with DFS, the topological-sort code is:

```python
def find_topo_sort(node_list: List[Value]) -> List[Value]:
    visited = dict()
    topo_order = []
    for node in node_list:
        if not visited.get(node, False):
            topo_sort_dfs(node, visited, topo_order)
    return topo_order


def topo_sort_dfs(node, visited: dict, topo_order):
    sons = node.inputs
    for son in sons:
        if not visited.get(son, False):
            topo_sort_dfs(son, visited, topo_order)
    visited[node] = True
    topo_order.append(node)
```

Implementing reverse mode differentiation

Finally we get to assemble our automatic differentiation algorithm! The core is turning the reverse-mode AD algorithm from the lectures into code:

(Figure: reverse-mode AD pseudocode from the lecture slides.)

There are a few details to watch out for.

Before writing the code it is best to review the theory once more; while debugging, drawing the computation graph by hand works wonders. The concrete implementation of reverse-mode AD:

```python
def compute_gradient_of_variables(output_tensor, out_grad) -> None:
    # The two setup lines below come from the provided skeleton: a map from each
    # node to its list of gradient contributions, and the reverse topological order.
    node_to_output_grads_list = {output_tensor: [out_grad]}
    reverse_topo_order = list(reversed(find_topo_sort([output_tensor])))

    for node in reverse_topo_order:
        node.grad = sum_node_list(node_to_output_grads_list[node])
        if len(node.inputs) > 0:
            gradient = node.op.gradient(node.grad, node)
            if isinstance(gradient, tuple):
                for i, son_node in enumerate(node.inputs):
                    node_to_output_grads_list.setdefault(son_node, [])
                    node_to_output_grads_list[son_node].append(gradient[i])
            else:
                node_to_output_grads_list.setdefault(node.inputs[0], [])
                node_to_output_grads_list[node.inputs[0]].append(gradient)
```

Softmax loss

This problem first asks for the forward and backward computation of the log and exp operators, and then for the softmax loss, i.e. the cross-entropy loss.

According to the handout, the y passed in here has already been converted to one-hot encoding. The implementation just follows the formula in the handout; nothing special to add:

```python
def softmax_loss(Z, y_one_hot):
    batch_size = Z.shape[0]
    lhs = ndl.log(ndl.exp(Z).sum(axes=(1, )))
    rhs = (Z * y_one_hot).sum(axes=(1, ))
    loss = (lhs - rhs).sum()
    return loss / batch_size
```

SGD for a two-layer neural network

The last part: use the components built so far to implement a two-layer perceptron and its stochastic gradient descent training loop. There are a few pitfalls, which show up in the code below.

```python
def nn_epoch(X, y, W1, W2, lr=0.1, batch=100):
    batch_cnt = (X.shape[0] + batch - 1) // batch
    num_classes = W2.shape[1]
    one_hot_y = np.eye(num_classes)[y]
    for batch_idx in range(batch_cnt):
        start_idx = batch_idx * batch
        end_idx = min(X.shape[0], (batch_idx+1)*batch)
        X_batch = X[start_idx:end_idx, :]
        y_batch = one_hot_y[start_idx:end_idx]
        X_tensor = ndl.Tensor(X_batch)
        y_tensor = ndl.Tensor(y_batch)
        first_logits = X_tensor @ W1            # type: ndl.Tensor
        first_output = ndl.relu(first_logits)   # type: ndl.Tensor
        second_logits = first_output @ W2       # type: ndl.Tensor
        loss_err = softmax_loss(second_logits, y_tensor)  # type: ndl.Tensor
        loss_err.backward()

        new_W1 = ndl.Tensor(W1.numpy() - lr * W1.grad.numpy())
        new_W2 = ndl.Tensor(W2.numpy() - lr * W2.grad.numpy())
        W1, W2 = new_W1, new_W2

    return W1, W2
```

hw1 wrap-up

The difficulty clearly ramps up in this homework. Not being very familiar with NumPy operations, I consulted quite a lot of material and other people's implementations along the way. Thanks to @xx要努力 for their article [1]; several parts follow their implementation.

As for debugging the final two-layer perceptron: because I implemented it with Tensor operators it ran for more than ten minutes, and only then did I notice that the problem statement asks for NumPy operations. A lesson well learned; next time I'll read the problem statement properly.

hw2

Q1: Weight Initialization

Q1 implements several ways of generating initial parameter values. Combined with the helper functions in init_basic.py, it is just a matter of transcribing the formulas given in the notebook, which is fairly simple. Remember to pass kwargs through to the helpers, since it carries dtype, device, and other information.

```python
def xavier_uniform(fan_in, fan_out, gain=1.0, **kwargs):
    ### BEGIN YOUR SOLUTION
    a = gain * math.sqrt(6 / (fan_in + fan_out))
    return rand(fan_in, fan_out, low=-a, high=a, **kwargs)
    ### END YOUR SOLUTION


def xavier_normal(fan_in, fan_out, gain=1.0, **kwargs):
    ### BEGIN YOUR SOLUTION
    std = gain * math.sqrt(2 / (fan_in + fan_out))
    return randn(fan_in, fan_out, mean=0, std=std, **kwargs)
    ### END YOUR SOLUTION


def kaiming_uniform(fan_in, fan_out, nonlinearity="relu", **kwargs):
    assert nonlinearity == "relu", "Only relu supported currently"
    if nonlinearity == "relu":
        gain = math.sqrt(2)
    ### BEGIN YOUR SOLUTION
    bound = gain * math.sqrt(3 / fan_in)
    return rand(fan_in, fan_out, low=-bound, high=bound, **kwargs)
    ### END YOUR SOLUTION


def kaiming_normal(fan_in, fan_out, nonlinearity="relu", **kwargs):
    assert nonlinearity == "relu", "Only relu supported currently"
    if nonlinearity == "relu":
        gain = math.sqrt(2)
    ### BEGIN YOUR SOLUTION
    std = gain / math.sqrt(fan_in)
    return randn(fan_in, fan_out, mean=0, std=std, **kwargs)
    ### END YOUR SOLUTION
```

Q2: nn_basic

In Q2 we implement the most basic Module components. While debugging I ran into two rather strange problems.

The second problem arises because many NumPy operations broadcast automatically; this broadcasting is invisible to our needle library and cannot be recorded in the computation graph, which leads to shape mismatches during backpropagation. The fix is to modify the implementations of the basic operators from Q1 so that shapes are checked before computing. The revised ops_mathematic.py:

python<br>"""Operator implementations."""<br><br>from numbers import Number<br>from typing import Optional, List, Tuple, Union<br><br>from ..autograd import NDArray<br>from ..autograd import Op, Tensor, Value, TensorOp<br>from ..autograd import TensorTuple, TensorTupleOp<br>import numpy<br><br># NOTE: we will import numpy as the array_api<br># as the backend for our computations, this line will change in later homeworks<br><br>import numpy as array_api<br><br><br>class EWiseAdd(TensorOp):<br> def compute(self, a: NDArray, b: NDArray):<br> assert a.shape == b.shape , "The shape of lhs {} and rhs {} should be the same".format(a.shape, b.shape)<br> return a + b<br><br> def gradient(self, out_grad: Tensor, node: Tensor):<br> return out_grad, out_grad<br><br><br>def add(a, b):<br> return EWiseAdd()(a, b)<br><br><br>class AddScalar(TensorOp):<br> def __init__(self, scalar):<br> self.scalar = scalar<br><br> def compute(self, a: NDArray):<br> return a + self.scalar<br><br> def gradient(self, out_grad: Tensor, node: Tensor):<br> return out_grad<br><br><br>def add_scalar(a, scalar):<br> return AddScalar(scalar)(a)<br><br><br>class EWiseMul(TensorOp):<br> def compute(self, a: NDArray, b: NDArray):<br> assert a.shape == b.shape, "The shape of two tensors should be the same"<br> return a * b<br><br> def gradient(self, out_grad: Tensor, node: Tensor):<br> lhs, rhs = node.inputs<br> return out_grad * rhs, out_grad * lhs<br><br><br>def multiply(a, b):<br> return EWiseMul()(a, b)<br><br><br>class MulScalar(TensorOp):<br> def __init__(self, scalar):<br> self.scalar = scalar<br><br> def compute(self, a: NDArray):<br> return a * self.scalar<br><br> def gradient(self, out_grad: Tensor, node: Tensor):<br> return (out_grad * self.scalar,)<br><br><br>def mul_scalar(a, scalar):<br> return MulScalar(scalar)(a)<br><br><br>class PowerScalar(TensorOp):<br> """Op raise a tensor to an (integer) power."""<br><br> def __init__(self, scalar: int):<br> self.scalar = scalar<br><br> def compute(self, a: NDArray) -> NDArray:<br> ### BEGIN YOUR SOLUTION<br> return array_api.power(a, self.scalar, dtype=a.dtype)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> a = node.inputs[0]<br> return self.scalar * (power_scalar(a, self.scalar-1)) * out_grad<br> ### END YOUR SOLUTION<br><br><br>def power_scalar(a, scalar):<br> return PowerScalar(scalar)(a)<br><br><br>class EWisePow(TensorOp):<br> """Op to element-wise raise a tensor to a power."""<br><br> def compute(self, a: NDArray, b: NDArray) -> NDArray:<br> assert a.shape == b.shape, "The shape of two tensors should be the same"<br> return a**b<br><br> def gradient(self, out_grad, node):<br> if not isinstance(node.inputs[0], NDArray) or not isinstance(<br> node.inputs[1], NDArray<br> ):<br> raise ValueError("Both inputs must be tensors (NDArray).")<br><br> a, b = node.inputs[0], node.inputs[1]<br> grad_a = out_grad * b * (a ** (b - 1))<br> grad_b = out_grad * (a**b) * array_api.log(a.data)<br> return grad_a, grad_b<br><br>def power(a, b):<br> return EWisePow()(a, b)<br><br><br>class EWiseDiv(TensorOp):<br> """Op to element-wise divide two nodes."""<br><br> def compute(self, a, b):<br> ### BEGIN YOUR SOLUTION<br> assert a.shape == b.shape, "The shape of two tensors should be the same"<br> return array_api.true_divide(a, b)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> a, b = node.inputs<br> return out_grad/b , -a/b/b*out_grad<br> ### END YOUR SOLUTION<br><br><br>def divide(a, 
b):<br> return EWiseDiv()(a, b)<br><br><br>class DivScalar(TensorOp):<br> def __init__(self, scalar):<br> self.scalar = scalar<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> return array_api.true_divide(a, self.scalar, dtype=a.dtype)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return out_grad/self.scalar<br> ### END YOUR SOLUTION<br><br><br>def divide_scalar(a, scalar):<br> return DivScalar(scalar)(a)<br><br><br>class Transpose(TensorOp):<br> def __init__(self, axes: Optional[tuple] = None):<br> self.axes = axes<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> if self.axes is None:<br> return array_api.swapaxes(a, -1, -2)<br> else:<br> return array_api.swapaxes(a, *self.axes)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return transpose(out_grad, self.axes)<br> ### END YOUR SOLUTION<br><br><br>def transpose(a, axes=None):<br> return Transpose(axes)(a)<br><br><br>class Reshape(TensorOp):<br> def __init__(self, shape):<br> self.shape = shape<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> expect_size = 1<br> for i in self.shape:<br> expect_size *= i<br> real_size = 1<br> for i in a.shape:<br> real_size *= i<br> assert expect_size == real_size , "The reshape size is not compatible"<br> return array_api.reshape(a, self.shape)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return reshape(out_grad, node.inputs[0].shape)<br> ### END YOUR SOLUTION<br><br><br>def reshape(a, shape):<br> return Reshape(shape)(a)<br><br><br>class BroadcastTo(TensorOp):<br> def __init__(self, shape):<br> self.shape = shape<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> assert len(self.shape) >= len(a.shape), \<br> "The target shape's dimension count {} should be greater than \<br> or equal to the input shape's dimension count {}".format(len(self.shape), len(a.shape))<br> for i in range(len(a.shape)):<br> assert a.shape[-1 - i] == self.shape[-1 - i] or a.shape[-1 - i] == 1, \<br> "The input shape {} is not compatible with the target shape {}".format(a.shape, self.shape)<br> return array_api.broadcast_to(a, self.shape)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> input_shape = node.inputs[0].shape<br> ret = summation(out_grad, tuple(range(len(out_grad.shape) - len(input_shape))))<br> for i in range(len(input_shape)):<br> if input_shape[-1 - i] == 1 and self.shape[-1 - i] != 1:<br> ret = summation(ret, (len(input_shape) - 1 - i,))<br> return reshape(ret, input_shape)<br> ### END YOUR SOLUTION<br><br><br>def broadcast_to(a, shape):<br> return BroadcastTo(shape)(a)<br><br><br>class Summation(TensorOp):<br> def __init__(self, axes: Optional[tuple] = None):<br> self.axes = axes<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> return array_api.sum(a, axis=self.axes)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> a = node.inputs[0]<br> shape = list(a.shape)<br> axes = self.axes<br> if axes is None:<br> axes = list(range(len(shape)))<br> for _ in axes:<br> shape[_] = 1<br> return broadcast_to(reshape(out_grad, shape), a.shape)<br> ### END YOUR SOLUTION<br><br><br>def summation(a, axes=None):<br> return Summation(axes)(a)<br><br><br>class MatMul(TensorOp):<br> def compute(self, a, b):<br> ### BEGIN YOUR SOLUTION<br> return a@b<br> ### END YOUR SOLUTION<br><br> def gradient(self, 
out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> a, b = node.inputs<br> adjoint1 = out_grad @ transpose(b)<br> adjoint2 = transpose(a) @ out_grad<br> adjoint1 = summation(adjoint1, axes=tuple(range(len(adjoint1.shape) - len(a.shape))))<br> adjoint2 = summation(adjoint2, axes=tuple(range(len(adjoint2.shape) - len(b.shape))))<br> return adjoint1, adjoint2<br> ### END YOUR SOLUTION<br><br><br>def matmul(a, b):<br> return MatMul()(a, b)<br><br><br>class Negate(TensorOp):<br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> return array_api.negative(a)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return negate(out_grad)<br> ### END YOUR SOLUTION<br><br><br>def negate(a):<br> return Negate()(a)<br><br><br>class Log(TensorOp):<br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> return array_api.log(a)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return out_grad / node.inputs[0]<br> ### END YOUR SOLUTION<br><br><br>def log(a):<br> return Log()(a)<br><br><br>class Exp(TensorOp):<br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> return array_api.exp(a)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return out_grad * exp(node.inputs[0])<br> ### END YOUR SOLUTION<br><br><br>def exp(a):<br> return Exp()(a)<br><br><br>class ReLU(TensorOp):<br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> return array_api.maximum(a, 0)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> relu_mask = Tensor(node.inputs[0].cached_data > 0)<br> return out_grad * relu_mask<br> ### END YOUR SOLUTION<br><br><br>def relu(a):<br> return ReLU()(a)<br>

With all that in place, we can move on to Q2 proper.

The Linear layer computes

$$Y = XW + B$$

Note that weight and bias must be of type Parameter; defining them as plain Tensors will make the optimizer tests fail later on. The module's code:

```python
class Linear(Module):
    def __init__(
        self, in_features, out_features, bias=True, device=None, dtype="float32"
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

        ### BEGIN YOUR SOLUTION
        self.weight = init.kaiming_uniform(in_features, out_features, device=device, dtype=dtype)
        self.weight = Parameter(self.weight, device=device, dtype=dtype)
        self.bias = None
        if bias:
            self.bias = init.kaiming_uniform(out_features, 1, device=device, dtype=dtype)
            self.bias = self.bias.transpose()
            self.bias = Parameter(self.bias, device=device, dtype=dtype)
        ### END YOUR SOLUTION

    def forward(self, X: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        y = ops.matmul(X, self.weight)
        if self.bias:
            # guard the reshape inside the bias branch so bias=False does not crash
            if self.bias.shape != (1, self.out_features):
                self.bias = self.bias.reshape((1, self.out_features))
            y += self.bias.broadcast_to(y.shape)
        return y
        ### END YOUR SOLUTION
```

```python
class Sequential(Module):
    def __init__(self, *modules):
        super().__init__()
        self.modules = modules

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        y = x
        for module in self.modules:
            y = module(y)
        return y
        ### END YOUR SOLUTION
```
Next comes the LogSumExp operator; here we implement the numerically stable version. The handout gives the formula directly; here is the derivation:

$$\log\sum_i \exp(z_i) = \log\sum_i \exp(z_i - \max z + \max z) = \log\Big[\sum_i \exp(z_i - \max z)\cdot\exp(\max z)\Big] = \log\sum_i \exp(z_i - \max z) + \max z$$

This identity avoids the numerical overflow that the exponential could otherwise cause.

Clearly the numerically stable version has the same gradient as the original formula. Differentiating directly, or following the article [LogSumExp梯度推导](https://www.zhouxin.space/notes/gradient-of-log-sum-exp/), with $\hat z = z - \max z$ the gradient is

$$\frac{\partial f}{\partial z_j} = \frac{\exp \hat z_j}{\sum_{i=1}^{n}\exp \hat z_i} = \exp\Big(\hat z_j - \log\sum_{i=1}^{n}\exp \hat z_i\Big) = \exp(z_j - f)$$

Delightfully, the gradient of LogSumExp can be expressed purely in terms of its input and output, so in the implementation we only need the node's input and output to compute the gradient:

```python
class LogSumExp(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        self.axes = axes

    def compute(self, Z):
        ### BEGIN YOUR SOLUTION
        max_z = array_api.max(Z, axis=self.axes, keepdims=True)
        self.max_z = max_z
        return array_api.log(array_api.sum(array_api.exp(Z - max_z), axis=self.axes)) + max_z.squeeze()
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        if self.axes is None:
            self.axes = tuple(range(len(node.inputs[0].shape)))
        z = node.inputs[0]
        shape = [1 if i in self.axes else z.shape[i] for i in range(len(z.shape))]
        gradient = exp(z - node.reshape(shape).broadcast_to(z.shape))
        return out_grad.reshape(shape).broadcast_to(z.shape) * gradient
        ### END YOUR SOLUTION
```
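
A quick numerical check of the gradient formula in plain NumPy, independent of needle (the gradient of logsumexp is just the softmax of z):

```python
import numpy as np

z = np.random.randn(5)
f = np.log(np.exp(z - z.max()).sum()) + z.max()   # stable logsumexp
analytic = np.exp(z - f)                          # claimed gradient

eps = 1e-6
numeric = np.array([
    (np.log(np.exp(z + eps * np.eye(5)[j]).sum()) - np.log(np.exp(z).sum())) / eps
    for j in range(5)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```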

The softmax (cross-entropy) loss is

$$\ell_{\mathrm{softmax}}(z, y) = \log\sum_{i=1}^{k}\exp z_i - z_y$$

The code skeleton already provides a helper that converts labels into one-hot encoding. Also remember that the loss should be the mean over the batch, so don't forget to average.

```python
class SoftmaxLoss(Module):
    def forward(self, logits: Tensor, y: Tensor):
        ### BEGIN YOUR SOLUTION
        batch_size, label_size = logits.shape
        one_hot_y = init.one_hot(label_size, y)
        true_logits = ops.summation(logits * one_hot_y, axes=(1,))
        return (ops.logsumexp(logits, axes=(1, )) - true_logits).sum() / batch_size
        ### END YOUR SOLUTION
```

LayerNorm1d computes

$$y = w \circ \frac{x_i - \mathbf{E}[x]}{(\mathbf{Var}[x] + \epsilon)^{1/2}} + b$$

Transcribing the formula is enough, but pay attention to the shapes of the intermediate variables:

```python
class LayerNorm1d(Module):
    def __init__(self, dim, eps=1e-5, device=None, dtype="float32"):
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.weight = Parameter(init.ones(1, dim, device=device, dtype=dtype), device=device, dtype=dtype)
        ### BEGIN YOUR SOLUTION
        self.bias = Parameter(init.zeros(1, dim, device=device, dtype=dtype), device=device, dtype=dtype)
        ### END YOUR SOLUTION

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        batch_size, feature_size = x.shape
        mean = (x.sum(axes=(1, )) / feature_size).reshape((batch_size, 1)).broadcast_to(x.shape)
        var = (((x - mean) ** 2).sum(axes=(1, )) / feature_size).reshape((batch_size, 1)).broadcast_to(x.shape)
        std_x = (x - mean) / ops.power_scalar(var + self.eps, 0.5)
        weight = self.weight.broadcast_to(x.shape)
        bias = self.bias.broadcast_to(x.shape)
        return std_x * weight + bias
        ### END YOUR SOLUTION
```

```python
class Flatten(Module):
    def forward(self, X):
        ### BEGIN YOUR SOLUTION
        assert len(X.shape) >= 2
        elem_cnt = 1
        for i in range(1, len(X.shape)):
            elem_cnt *= X.shape[i]
        return X.reshape((X.shape[0], elem_cnt))
        ### END YOUR SOLUTION
```

As with LayerNorm, the implementation uses plenty of reshape and broadcast operations; keep an eye on the shapes of the intermediates.

```python
class BatchNorm1d(Module):
    def __init__(self, dim, eps=1e-5, momentum=0.1, device=None, dtype="float32"):
        super().__init__()
        self.dim = dim
        self.eps = eps
        self.momentum = momentum
        ### BEGIN YOUR SOLUTION
        self.weight = Parameter(init.ones(1, dim, device=device, dtype=dtype), device=device, dtype=dtype)
        self.bias = Parameter(init.zeros(1, dim, device=device, dtype=dtype), device=device, dtype=dtype)
        self.running_mean = init.zeros(dim, device=device, dtype=dtype)
        self.running_var = init.ones(dim, device=device, dtype=dtype)
        ### END YOUR SOLUTION

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        if self.weight.shape != (1, self.dim):
            self.weight = self.weight.reshape((1, self.dim))
        if self.bias.shape != (1, self.dim):
            self.bias = self.bias.reshape((1, self.dim))
        if self.training:
            batch_size, feature_size = x.shape
            mean = (x.sum(axes=(0, )) / batch_size).reshape((1, feature_size))
            var = (((x - mean.broadcast_to(x.shape)) ** 2).sum(axes=(0, )) / batch_size).reshape((1, feature_size))
            self.running_mean = self.running_mean * (1 - self.momentum) + mean.reshape(self.running_mean.shape) * (self.momentum)
            self.running_var = self.running_var * (1 - self.momentum) + var.reshape(self.running_var.shape) * (self.momentum)
            mean = mean.broadcast_to(x.shape)
            var = var.broadcast_to(x.shape)
            std_x = (x - mean) / ops.power_scalar(var + self.eps, 0.5)
            weight = self.weight.broadcast_to(x.shape)
            bias = self.bias.broadcast_to(x.shape)
            return std_x * weight + bias
        else:
            std_x = (x - self.running_mean.broadcast_to(x.shape)) / ops.power_scalar(self.running_var.broadcast_to(x.shape) + self.eps, 0.5)
            return std_x * self.weight.broadcast_to(x.shape) + self.bias.broadcast_to(x.shape)
```

```python
class Dropout(Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        if not self.training:
            return x
        mask = init.randb(*x.shape, p=1 - self.p)
        return x * mask / (1 - self.p)
        ### END YOUR SOLUTION
```

```python
class Residual(Module):
    def __init__(self, fn: Module):
        super().__init__()
        self.fn = fn

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        return x + self.fn(x)
        ### END YOUR SOLUTION
```

Q3: Optimizer Implementation

In this problem we implement the optimizer module, whose job is to update the model parameters using the gradients computed by loss.backward().

Note that this module enables L2 regularization, i.e. weight decay, by default, so the gradient equals param.grad + weight_decay * param.

```python
class SGD(Optimizer):
    def __init__(self, params, lr=0.01, momentum=0.0, weight_decay=0.0):
        super().__init__(params)
        self.lr = lr
        self.momentum = momentum
        self.u = {}
        self.weight_decay = weight_decay

    def step(self):
        ### BEGIN YOUR SOLUTION
        for param in self.params:
            if param.grad is not None:
                if param not in self.u:
                    self.u[param] = ndl.zeros_like(param.grad, requires_grad=False)
                self.u[param] = self.momentum * self.u[param].data + (1 - self.momentum) * (param.grad.data + self.weight_decay * param.data)
                param.data = param.data - self.lr * self.u[param]
        ### END YOUR SOLUTION
```

Q4: DataLoader Implementation

In this problem we implement some data-processing transforms plus the Dataset and DataLoader classes. Dataset provides a standard interface for accessing a dataset; DataLoader is an iterator that reads one batch at a time from a dataset.

```python
class RandomFlipHorizontal(Transform):
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, img):
        """
        Horizonally flip an image, specified as an H x W x C NDArray.
        Args:
            img: H x W x C NDArray of an image
        Returns:
            H x W x C ndarray corresponding to image flipped with probability self.p
        Note: use the provided code to provide randomness, for easier testing
        """
        flip_img = np.random.rand() < self.p
        ### BEGIN YOUR SOLUTION
        if flip_img:
            img = np.flip(img, axis=1)
        return img
        ### END YOUR SOLUTION
```

```python
class RandomCrop(Transform):
    def __init__(self, padding=3):
        self.padding = padding

    def __call__(self, img):
        """ Zero pad and then randomly crop an image.
        Args:
            img: H x W x C NDArray of an image
        Return
            H x W x C NAArray of cliped image
        Note: generate the image shifted by shift_x, shift_y specified below
        """
        shift_x, shift_y = np.random.randint(low=-self.padding, high=self.padding+1, size=2)
        ### BEGIN YOUR SOLUTION
        img_size = img.shape
        img = np.pad(img, ((self.padding, self.padding), (self.padding, self.padding), (0, 0)), 'constant')
        img = img[self.padding + shift_x:self.padding + shift_x + img_size[0], self.padding + shift_y:self.padding + shift_y + img_size[1], :]
        return img
        ### END YOUR SOLUTION
```

Things to note: 1) use the parse_mnist implemented earlier to parse the MNIST dataset; 2) the Dataset base class provides an apply_transforms method for processing the images; 3) __getitem__ should preferably accept a list of indices so that a batch of samples can be read at once; 4) the image-processing functions expect data in H*W*C format, whereas __getitem__ should return data shaped batch_size*n.

The implementation:

```python
class MNISTDataset(Dataset):
    def __init__(
        self,
        image_filename: str,
        label_filename: str,
        transforms: Optional[List] = None,
    ):
        ### BEGIN YOUR SOLUTION
        self.transforms = transforms
        self.X, self.y = parse_mnist(image_filename, label_filename)
        ### END YOUR SOLUTION

    def __getitem__(self, index) -> object:
        ### BEGIN YOUR SOLUTION
        x = self.apply_transforms(self.X[index].reshape(28, 28, -1))
        return x.reshape(-1, 28*28), self.y[index]
        ### END YOUR SOLUTION

    def __len__(self) -> int:
        ### BEGIN YOUR SOLUTION
        return self.X.shape[0]
        ### END YOUR SOLUTION
```

```python
class DataLoader:
    r"""
    Data loader. Combines a dataset and a sampler, and provides an iterable over
    the given dataset.
    Args:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: ``1``).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: ``False``).
    """
    dataset: Dataset
    batch_size: Optional[int]

    def __init__(
        self,
        dataset: Dataset,
        batch_size: Optional[int] = 1,
        shuffle: bool = False,
    ):
        self.dataset = dataset
        self.shuffle = shuffle
        self.batch_size = batch_size
        if not self.shuffle:
            self.ordering = np.array_split(np.arange(len(dataset)),
                                           range(batch_size, len(dataset), batch_size))

    def __iter__(self):
        ### BEGIN YOUR SOLUTION
        if self.shuffle:
            self.ordering = np.array_split(np.random.permutation(len(self.dataset)),
                                           range(self.batch_size, len(self.dataset), self.batch_size))
        self.index = 0
        ### END YOUR SOLUTION
        return self

    def __next__(self):
        ### BEGIN YOUR SOLUTION
        if self.index >= len(self.ordering):
            raise StopIteration
        else:
            batch = [Tensor.make_const(x) for x in self.dataset[self.ordering[self.index]]]
            self.index += 1
            return batch
        ### END YOUR SOLUTION
```

Q5: MLPResNet Implementation

By this point all the basic components of our needle library are in place. In this problem we piece them together into an MLP ResNet and train it on MNIST.

```python
def ResidualBlock(dim, hidden_dim, norm=nn.BatchNorm1d, drop_prob=0.1):
    ### BEGIN YOUR SOLUTION
    return nn.Sequential(
        nn.Residual(
            nn.Sequential(
                nn.Linear(dim, hidden_dim),
                norm(hidden_dim),
                nn.ReLU(),
                nn.Dropout(drop_prob),
                nn.Linear(hidden_dim, dim),
                norm(dim),
            )
        ),
        nn.ReLU(),
    )
    ### END YOUR SOLUTION
```

```python
def MLPResNet(
    dim,
    hidden_dim=100,
    num_blocks=3,
    num_classes=10,
    norm=nn.BatchNorm1d,
    drop_prob=0.1,
):
    ### BEGIN YOUR SOLUTION
    return nn.Sequential(
        nn.Linear(dim, hidden_dim),
        nn.ReLU(),
        *[ResidualBlock(hidden_dim, hidden_dim//2, norm, drop_prob) for _ in range(num_blocks)],
        nn.Linear(hidden_dim, num_classes),
    )
    ### END YOUR SOLUTION
```

```python
def epoch(dataloader, model, opt=None):
    np.random.seed(4)
    ### BEGIN YOUR SOLUTION
    loss_func = nn.SoftmaxLoss()
    error_count = 0
    loss = 0
    for x, y in dataloader:
        if opt is None:
            model.eval()
        else:
            model.train()
        y_pred = model(x)
        batch_loss = loss_func(y_pred, y)
        loss += batch_loss.numpy() * x.shape[0]
        if opt is not None:
            opt.reset_grad()
            batch_loss.backward()
            opt.step()
        y = y.numpy()
        y_pred = y_pred.numpy()
        y_pred = np.argmax(y_pred, axis=1)
        error_count += np.sum(y_pred != y)
    return error_count / len(dataloader.dataset), loss / len(dataloader.dataset)
    ### END YOUR SOLUTION
```

```python
def train_mnist(
    batch_size=100,
    epochs=10,
    optimizer=ndl.optim.Adam,
    lr=0.001,
    weight_decay=0.001,
    hidden_dim=100,
    data_dir="data",
):
    np.random.seed(4)
    ### BEGIN YOUR SOLUTION
    train_dataset = ndl.data.MNISTDataset(data_dir+"/train-images-idx3-ubyte.gz", data_dir+"/train-labels-idx1-ubyte.gz")
    test_dataset = ndl.data.MNISTDataset(data_dir+"/t10k-images-idx3-ubyte.gz", data_dir+"/t10k-labels-idx1-ubyte.gz")
    train_dataloader = ndl.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_dataloader = ndl.data.DataLoader(test_dataset, batch_size=batch_size)
    model = MLPResNet(784, hidden_dim)
    opt = optimizer(model.parameters(), lr=lr, weight_decay=weight_decay)
    for i in range(epochs):
        train_error, train_loss = epoch(train_dataloader, model, opt)
        test_error, test_loss = epoch(test_dataloader, model)
        # print(f"Epoch {i+1}/{epochs} Train Error: {train_error:.4f} Train Loss: {train_loss:.4f} Test Error: {test_error:.4f} Test Loss: {test_loss:.4f}")
    return train_error, train_loss, test_error, test_loss
    ### END YOUR SOLUTION
```

hw2 wrap-up

And that wraps up hw2. I kept procrastinating and only finished it after a month. The tests in this course are not particularly strict; while debugging hw2 I found quite a few bugs in my hw1 code. When you get stuck, besides debugging on your own it is also worth looking at other people's implementations, which makes locating the problem much faster.

hw3

In this homework we build a simple low-level library implementing NDArray. Previously this was backed by NumPy; this time we hand-write the CPU and GPU backends ourselves, without calling any existing highly optimized matrix multiplication or other kernels.

Part 1: Python array operations

The first part implements a few operations in pure Python by manipulating the strides, shape and offset fields; since nothing low-level is involved, Python is already efficient enough for these methods.

Before implementing them, skim through ndarray.py: it provides plenty of helper functions that simplify the work.

With those helpers, reshape becomes almost trivial:

```python
def reshape(self, new_shape):
    assert prod(self.shape) == prod(new_shape), "Product of shapes must be equal"
    assert self.is_compact(), "Matrix must be compact"
    return self.as_strided(new_shape, NDArray.compact_strides(new_shape))
```

```python
def permute(self, new_axes):
    new_shape = tuple(self.shape[i] for i in new_axes)
    new_strides = tuple(self.strides[i] for i in new_axes)
    return NDArray.make(shape=new_shape, strides=new_strides, device=self.device, handle=self._handle, offset=self._offset)
```

```python
def broadcast_to(self, new_shape):
    assert all(
        new_shape[i] == self.shape[i] or self.shape[i] == 1
        for i in range(len(self.shape))
    ), "Invalid broadcast shape"
    new_strides = tuple(
        self.strides[i] if self.shape[i] == new_shape[i] else 0 for i in range(len(self.shape))
    )
    return self.compact().as_strided(new_shape, new_strides)
```
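
The stride-0 trick that broadcast_to relies on can be observed directly on NumPy arrays (illustrative only; note that NumPy strides are measured in bytes, whereas needle's are in elements):

```python
import numpy as np

x = np.arange(4, dtype=np.int64).reshape(1, 4)
b = np.broadcast_to(x, (3, 4))
print(x.strides)   # (32, 8)
print(b.strides)   # (0, 8): the broadcast axis gets stride 0, so no data is copied
```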

Computing the result's shape is simple: count how many elements each dimension's slice contains. The strides map an index to a position in the underlying 1-D array; if the slice step along a dimension is not 1, we effectively skip several elements to reach the next one, and a quick calculation shows that the new stride equals slice.step times the original stride of that dimension.

Next, the offset. Because each slice has a start value, an index whose component along some dimension is smaller than that dimension's start does not belong to the new NDArray. For example, if the per-dimension start values are (2, 3, 4, 5), then the original indices (1, 3, 4, 5) or (2, 3, 4, 1) lie before the first element of the slice and must be skipped by the offset. Hence the offset equals the sum over dimensions of slice.start times the corresponding stride.

```python
def __getitem__(self, idxs):
    ...  # (skeleton code that normalizes idxs into a tuple of slices is omitted here)
    ### BEGIN YOUR SOLUTION
    shape = tuple(max(0, (s.stop - s.start + s.step - 1) // s.step) for s in idxs)
    strides = tuple(s.step * self.strides[i] for i, s in enumerate(idxs))
    offset = reduce(operator.add, (s.start * self.strides[i] for i, s in enumerate(idxs)))
    return NDArray.make(shape, strides, device=self.device, handle=self._handle, offset=offset)
    ### END YOUR SOLUTION
```
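
These shape/strides/offset formulas can be sanity-checked against NumPy's own strided views (again byte strides rather than element strides; the slices below are arbitrary):

```python
import numpy as np

x = np.arange(3 * 4 * 5, dtype=np.int64).reshape(3, 4, 5)
sl = (slice(1, 3, 1), slice(0, 4, 2), slice(2, 5, 2))

shape   = tuple((s.stop - s.start + s.step - 1) // s.step for s in sl)
strides = tuple(s.step * st for s, st in zip(sl, x.strides))
offset  = sum(s.start * st for s, st in zip(sl, x.strides))

view = x[sl]
print(shape == view.shape)     # True
print(strides == view.strides) # True
print(offset == view.__array_interface__['data'][0]
              - x.__array_interface__['data'][0])  # True
```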

Part 2: CPU Backend - Compact and setitem

In this part we implement the CPU versions of compact and setitem: the former creates a densely packed copy of the data in memory, the latter writes given values into a (possibly non-compact) array.

The two share one difficulty: looping over a variable number of dimensions. Since the number of dimensions of an NDArray is not fixed, we cannot traverse the data with a fixed n-fold nested loop. My approach is to maintain an index (0, 0, ..., 0) and manually add 1 to its last digit; when a digit reaches the corresponding shape value it carries over, and once the highest digit carries out, the traversal is finished.

I defined two helper functions, bool next_index(std::vector<int32_t>& index, const std::vector<int32_t>& shape) and size_t index_to_offset(const std::vector<int32_t>& index, const std::vector<int32_t>& strides, const size_t offset), which respectively advance the index and convert an index into a flat offset. Their implementations:

```cpp
bool next_index(std::vector<int32_t>& index, const std::vector<int32_t>& shape) {
  /**
   * Increment the index by one, and return true if the index is still valid.
   *
   * Args:
   *   index: current index
   *   shape: shape of the array
   *
   * Returns:
   *   true if the index is still valid, false otherwise
   */
  if (index.size() == 0) {
    return false;
  }
  index[index.size() - 1]++;
  for (int i = index.size() - 1; i >= 0; i--) {
    if (index[i] >= shape[i]) {
      index[i] = 0;
      if (i > 0) {
        index[i - 1]++;
      } else {
        return false;
      }
    } else {
      return true;
    }
  }
}

size_t index_to_offset(const std::vector<int32_t>& index, const std::vector<int32_t>& strides, const size_t offset) {
  /**
   * Convert an index to an offset.
   *
   * Args:
   *   index: index to convert
   *   strides: strides of the array
   *   offset: offset of the array
   *
   * Returns:
   *   offset of the index
   */
  size_t res = offset;
  for (int i = 0; i < index.size(); i++) {
    res += index[i] * strides[i];
  }
  return res;
}
```

```cpp
void Compact(const AlignedArray& a, AlignedArray* out, std::vector<int32_t> shape, std::vector<int32_t> strides, size_t offset) {
  /// BEGIN SOLUTION
  auto a_index = std::vector<int32_t>(shape.size(), 0);
  for (int out_index = 0; out_index < out->size; out_index++) {
    size_t a_offset = index_to_offset(a_index, strides, offset);
    out->ptr[out_index] = a.ptr[a_offset];
    next_index(a_index, shape);
  }
  /// END SOLUTION
}
```

```cpp
void EwiseSetitem(const AlignedArray& a, AlignedArray* out, std::vector<int32_t> shape, std::vector<int32_t> strides, size_t offset) {
  /// BEGIN SOLUTION
  auto out_index = std::vector<int32_t>(shape.size(), 0);
  for (int a_index = 0; a_index < a.size; a_index++) {
    size_t out_offset = index_to_offset(out_index, strides, offset);
    out->ptr[out_offset] = a.ptr[a_index];
    next_index(out_index, shape);
  }
  /// END SOLUTION
}

void ScalarSetitem(const size_t size, scalar_t val, AlignedArray* out, std::vector<int32_t> shape, std::vector<int32_t> strides, size_t offset) {
  /// BEGIN SOLUTION
  auto out_index = std::vector<int32_t>(shape.size(), 0);
  for (int i = 0; i < size; i++) {
    size_t out_offset = index_to_offset(out_index, strides, offset);
    out->ptr[out_offset] = val;
    next_index(out_index, shape);
  }
  /// END SOLUTION
}
```

Part 3: CPU Backend - Elementwise and scalar operations

In this part we implement the CPU versions of a set of very simple operators; the main point is to get familiar with the workflow of registering C++ functions through pybind. The handout encourages using templates, macros and the like to keep the implementation short.

Rather than writing an explicit declaration and definition for every operator, I first implemented void EwiseOp(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, std::function<scalar_t(scalar_t, scalar_t)> op) and void ScalarOp(const AlignedArray& a, scalar_t val, AlignedArray* out, std::function<scalar_t(scalar_t, scalar_t)> op), which apply op element-wise and against a scalar respectively; passing in different ops yields the different operators.

```cpp
void EwiseOp(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, std::function<scalar_t(scalar_t, scalar_t)> op) {
  /**
   * Element-wise operation on two arrays
   *
   * Args:
   *   a: first array
   *   b: second array
   *   out: output array
   *   op: operation to perform
   */
  for (size_t i = 0; i < a.size; i++) {
    out->ptr[i] = op(a.ptr[i], b.ptr[i]);
  }
}

void ScalarOp(const AlignedArray& a, scalar_t val, AlignedArray* out, std::function<scalar_t(scalar_t, scalar_t)> op) {
  /**
   * Element-wise operation on an array and a scalar
   *
   * Args:
   *   a: array
   *   val: scalar
   *   out: output array
   *   op: operation to perform
   */
  for (size_t i = 0; i < a.size; i++) {
    out->ptr[i] = op(a.ptr[i], val);
  }
}
```

Then these two functions are partially instantiated (curried) with lambda expressions so that they only take the arguments a and b, and can be registered with pybind.

For example, to register element-wise multiplication, the full code is:

```cpp
m.def("ewise_mul", [](const AlignedArray& a, const AlignedArray& b, AlignedArray* out) {
  EwiseOp(a, b, out, std::multiplies<scalar_t>());
});
```

Reading from the outside in: m.def registers a method with pybind. The method name is given by the first argument, ewise_mul; the second argument specifies the corresponding C++ callable, which may be a function pointer, a lambda, and so on. Note that on the Python side ewise_mul is called with just the two NDArrays (plus the output), so we curry the three-operand EwiseOp by binding std::multiplies<scalar_t>() and wrapping the result in a lambda.

Writing a lambda for every registration is still a bit tedious, so this step can itself be wrapped in macros:

```cpp
#define REGISTER_EWISW_OP(NAME, OP) \
  m.def(NAME, [](const AlignedArray& a, const AlignedArray& b, AlignedArray* out) { \
    EwiseOp(a, b, out, OP); \
  });

#define REGISTER_SCALAR_OP(NAME, OP) \
  m.def(NAME, [](const AlignedArray& a, scalar_t val, AlignedArray* out) { \
    ScalarOp(a, val, out, OP); \
  });

#define REGISTER_SINGLE_OP(NAME, OP) \
  m.def(NAME, [](const AlignedArray& a, AlignedArray* out) { \
    for (size_t i = 0; i < a.size; i++) { \
      out->ptr[i] = OP(a.ptr[i]); \
    } \
  });
```

The three macros above register, respectively, element-wise binary operators, scalar binary operators, and unary operators with pybind.

Applying these macros, all the required methods can be registered:

```cpp
REGISTER_EWISW_OP("ewise_mul", std::multiplies<scalar_t>());
REGISTER_SCALAR_OP("scalar_mul", std::multiplies<scalar_t>());
REGISTER_EWISW_OP("ewise_div", std::divides<scalar_t>());
REGISTER_SCALAR_OP("scalar_div", std::divides<scalar_t>());
REGISTER_SCALAR_OP("scalar_power", static_cast<scalar_t(*)(scalar_t, scalar_t)>(std::pow));
REGISTER_EWISW_OP("ewise_maximum", static_cast<scalar_t(*)(scalar_t, scalar_t)>(std::fmax));
REGISTER_SCALAR_OP("scalar_maximum", static_cast<scalar_t(*)(scalar_t, scalar_t)>(std::fmax));
REGISTER_EWISW_OP("ewise_eq", std::equal_to<scalar_t>());
REGISTER_SCALAR_OP("scalar_eq", std::equal_to<scalar_t>());
REGISTER_EWISW_OP("ewise_ge", std::greater_equal<scalar_t>());
REGISTER_SCALAR_OP("scalar_ge", std::greater_equal<scalar_t>());
REGISTER_SINGLE_OP("ewise_log", std::log);
REGISTER_SINGLE_OP("ewise_exp", std::exp);
REGISTER_SINGLE_OP("ewise_tanh", std::tanh);
```

Note that std::pow and a few others have multiple overloads; static_cast selects the desired one.

Part 4: CPU Backend - Reductions

Here we implement the two reduction operators max and sum. To simplify matters, only a single dimension is reduced. Even for a single dimension, a fully general reduction would be quite involved, so the task is simplified further: before the reduction operator is called, the dimension to be reduced is permuted to the last position, and everything is restored automatically afterwards, so we only have to reduce over the last dimension.

After all these simplifications the two operators become almost trivial: take the max/sum of each run of reduce_size consecutive elements as one output element. Finally, remember to register both methods with pybind:

```cpp
void ReduceMax(const AlignedArray& a, AlignedArray* out, size_t reduce_size) {
  /// BEGIN SOLUTION
  for (size_t i = 0; i < out->size; i++) {
    out->ptr[i] = a.ptr[i*reduce_size];
    for (size_t j = 1; j < reduce_size; j++) {
      out->ptr[i] = std::max(out->ptr[i], a.ptr[i*reduce_size + j]);
    }
  }
  /// END SOLUTION
}

void ReduceSum(const AlignedArray& a, AlignedArray* out, size_t reduce_size) {
  /// BEGIN SOLUTION
  for (size_t i = 0; i < out->size; i++) {
    out->ptr[i] = 0;
    for (size_t j = 0; j < reduce_size; j++) {
      out->ptr[i] += a.ptr[i*reduce_size + j];
    }
  }
  /// END SOLUTION
}
```

Part 5: CPU Backend - Matrix multiplication

In this part we implement matrix multiplication.

```cpp
void Matmul(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, uint32_t m, uint32_t n,
            uint32_t p) {
  for (uint32_t i = 0; i < m*p; i++) {
    out->ptr[i] = 0;
  }
  for (uint32_t i = 0; i < m; i++) {
    for (uint32_t j = 0; j < p; j++) {
      for (uint32_t k = 0; k < n; k++) {
        out->ptr[i*p + j] += a.ptr[i*n + k] * b.ptr[k*p + j];
      }
    }
  }
}
```

```cpp
inline void AlignedDot(const float* __restrict__ a,
                       const float* __restrict__ b,
                       float* __restrict__ out) {

  a = (const float*)__builtin_assume_aligned(a, TILE * ELEM_SIZE);
  b = (const float*)__builtin_assume_aligned(b, TILE * ELEM_SIZE);
  out = (float*)__builtin_assume_aligned(out, TILE * ELEM_SIZE);

  /// BEGIN SOLUTION
  for (uint32_t i = 0; i < TILE; i++) {
    for (uint32_t j = 0; j < TILE; j++) {
      for (uint32_t k = 0; k < TILE; k++) {
        out[i*TILE + j] += a[i*TILE + k] * b[k*TILE + j];
      }
    }
  }
  /// END SOLUTION
}
```

```cpp
void MatmulTiled(const AlignedArray& a, const AlignedArray& b, AlignedArray* out, uint32_t m,
                 uint32_t n, uint32_t p) {
  for (uint32_t i = 0; i < m*p; i++) {
    out->ptr[i] = 0;
  }
  for (uint32_t i = 0; i < m/TILE; i++) {
    for (uint32_t j = 0; j < p/TILE; j++) {
      for (uint32_t k = 0; k < n/TILE; k++) {
        AlignedDot(a.ptr + (i*n/TILE + k)*TILE*TILE, b.ptr + (k*p/TILE + j)*TILE*TILE, out->ptr + (i*p/TILE + j)*TILE*TILE);
      }
    }
  }
}
```

Part 6: GPU Backend - Compact and setitem

From this part onward we write CUDA code. If this is your first contact with CUDA programming, the tutorial CUDA编程基础入门系列(持续更新)_哔哩哔哩_bilibili (under five hours) is a quick way to get up to speed.

In this part we implement the compact and setitem operators. With the CPU versions as experience, first write a helper that converts a logical index into a physical memory location:

```cpp
__device__ size_t indexToMemLocation(size_t index, CudaVec shape, CudaVec strides, size_t offset) {
  size_t ret = offset;
  for (int i = shape.size - 1; i >= 0; i--) {
    ret += (index % shape.data[i]) * strides.data[i];
    index /= shape.data[i];
  }
  return ret;
}
```

According to the handout, CompactKernel copies the element of a whose logical index is gid into out[gid]; remember to check that gid does not go out of bounds:

```cpp
__global__ void CompactKernel(const scalar_t* a, scalar_t* out, size_t size, CudaVec shape,
                              CudaVec strides, size_t offset) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;

  if (gid >= size)
    return;
  size_t memLocation = indexToMemLocation(gid, shape, strides, offset);
  out[gid] = a[memLocation];
}
```

The two setitem operators follow the same pattern and are straightforward, so here is the code directly:

cpp<br>__global__ void EwiseSetitemKernel(const scalar_t* a, scalar_t* out, size_t size, CudaVec shape, CudaVec strides,<br> size_t offset) {<br> /**<br> * <br> * Args:<br> * a: _compact_ array whose items will be written to out<br> * out: non-compact array whose items are to be written<br> * shape: shapes of each dimension for a and out<br> * strides: strides of the *out* array (not a, which has compact strides)<br> * offset: offset of the *out* array (not a, which has zero offset, being compact)<br> */<br> size_t gid = blockIdx.x * blockDim.x + threadIdx.x;<br> if (gid < size){<br> size_t memLocation = indexToMemLocation(gid, shape, strides, offset);<br> out[memLocation] = a[gid];<br> }<br> <br> <br>}<br><br><br>void EwiseSetitem(const CudaArray& a, CudaArray* out, std::vector<int32_t> shape,<br> std::vector<int32_t> strides, size_t offset) {<br> /**<br> * Set items in a (non-compact) array using CUDA. Yyou will most likely want to implement a<br> * EwiseSetitemKernel() function, similar to those above, that will do the actual work.<br> * <br> * Args:<br> * a: _compact_ array whose items will be written to out<br> * out: non-compact array whose items are to be written<br> * shape: shapes of each dimension for a and out<br> * strides: strides of the *out* array (not a, which has compact strides)<br> * offset: offset of the *out* array (not a, which has zero offset, being compact)<br> */<br> /// BEGIN SOLUTION<br> CudaDims dim = CudaOneDim(a.size);<br> EwiseSetitemKernel<<<dim.grid, dim.block>>>(a.ptr, out->ptr, a.size, VecToCuda(shape),<br> VecToCuda(strides), offset);<br> /// END SOLUTION<br>}<br><br>__global__ void ScalarSetitemKernel(size_t size, scalar_t val, scalar_t* out, CudaVec shape, <br> CudaVec strides, size_t offset){<br> size_t gid = blockIdx.x * blockDim.x + threadIdx.x;<br> if (gid < size){<br> size_t memLocation = indexToMemLocation(gid, shape, strides, offset);<br> out[memLocation] = val;<br> }<br>}<br><br>void ScalarSetitem(size_t size, scalar_t val, CudaArray* out, std::vector<int32_t> shape,<br> std::vector<int32_t> strides, size_t offset) {<br> /**<br> * Set items is a (non-compact) array<br> * <br> * Args:<br> * size: number of elements to write in out array (note that this will note be the same as<br> * out.size, because out is a non-compact subset array); it _will_ be the same as the <br> * product of items in shape, but covenient to just pass it here.<br> * val: scalar value to write to<br> * out: non-compact array whose items are to be written<br> * shape: shapes of each dimension of out<br> * strides: strides of the out array<br> * offset: offset of the out array<br> */<br> /// BEGIN SOLUTION<br> CudaDims dim = CudaOneDim(size);<br> ScalarSetitemKernel<<<dim.grid, dim.block>>>(size, val, out->ptr, VecToCuda(shape),<br> VecToCuda(strides), offset);<br> /// END SOLUTION<br>}<br><br>////////////////////////////////////////////////////////////////////////////////<br>// Elementwise and scalar operations<br>////////////////////////////////////////////////////////////////////////////////<br><br>__global__ void EwiseAddKernel(const scalar_t* a, const scalar_t* b, scalar_t* out, size_t size) {<br> size_t gid = blockIdx.x * blockDim.x + threadIdx.x;<br> if (gid < size) out[gid] = a[gid] + b[gid];<br>}<br><br>void EwiseAdd(const CudaArray& a, const CudaArray& b, CudaArray* out) {<br> /**<br> * Add together two CUDA array<br> */<br> CudaDims dim = CudaOneDim(out->size);<br> EwiseAddKernel<<<dim.grid, dim.block>>>(a.ptr, b.ptr, out->ptr, out->size);<br>}<br>

Part 7: CUDA Backend - Elementwise and scalar operations

This part implements a series of fairly simple unary and binary operators; the interesting question is how to keep the code concise.

In the CPU version we passed the Op in dynamically via std::function, but std is not available inside CUDA kernels, so here we switch to templates.

Write one template kernel for element-wise operations and one for scalar operations:

```cpp
template <typename Op>
__global__ void EwiseKernel(const scalar_t* a, const scalar_t* b, scalar_t* out, size_t size, Op op) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
  if (gid < size) out[gid] = op(a[gid], b[gid]);
}

template <typename Op>
__global__ void ScalarKernel(const scalar_t* a, scalar_t val, scalar_t* out, size_t size, Op op) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
  if (gid < size) out[gid] = op(a[gid], val);
}
```

Any function called from a CUDA kernel must itself be a kernel or a device function, so we also wrap each operator in a small struct that overloads operator(), which is then used to instantiate the two template kernels above:

```cpp
struct Add {
  __device__ scalar_t operator()(scalar_t x, scalar_t y) const { return x + y; }
};

struct Mul {
  __device__ scalar_t operator()(scalar_t x, scalar_t y) const { return x * y; }
};

struct Div {
  __device__ scalar_t operator()(scalar_t x, scalar_t y) const { return x / y; }
};

struct Maximum {
  __device__ scalar_t operator()(scalar_t x, scalar_t y) const { return max(x, y); }
};

struct Eq {
  __device__ scalar_t operator()(scalar_t x, scalar_t y) const { return x == y; }
};

struct Ge {
  __device__ scalar_t operator()(scalar_t x, scalar_t y) const { return x >= y; }
};

struct Power {
  scalar_t val;
  Power(scalar_t v) : val(v) {}
  __device__ scalar_t operator()(scalar_t x, scalar_t) const { return pow(x, val); }
};
```

Next, define the host-side interfaces so they can be registered with pybind11:

cpp<br>void EwiseMul(const CudaArray& a, const CudaArray& b, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> EwiseKernel<<<dim.grid, dim.block>>>(a.ptr, b.ptr, out->ptr, out->size, Mul());<br>}<br><br>void ScalarMul(const CudaArray& a, scalar_t val, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> ScalarKernel<<<dim.grid, dim.block>>>(a.ptr, val, out->ptr, out->size, Mul());<br>}<br><br>void EwiseDiv(const CudaArray& a, const CudaArray& b, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> EwiseKernel<<<dim.grid, dim.block>>>(a.ptr, b.ptr, out->ptr, out->size, Div());<br>}<br><br>void ScalarDiv(const CudaArray& a, scalar_t val, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> ScalarKernel<<<dim.grid, dim.block>>>(a.ptr, val, out->ptr, out->size, Div());<br>}<br><br>void ScalarPower(const CudaArray& a, scalar_t val, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> ScalarKernel<<<dim.grid, dim.block>>>(a.ptr, val, out->ptr, out->size, Power(val));<br>}<br><br>void EwiseMaximum(const CudaArray& a, const CudaArray& b, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> EwiseKernel<<<dim.grid, dim.block>>>(a.ptr, b.ptr, out->ptr, out->size, Maximum());<br>}<br><br>void ScalarMaximum(const CudaArray& a, scalar_t val, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> ScalarKernel<<<dim.grid, dim.block>>>(a.ptr, val, out->ptr, out->size, Maximum());<br>}<br><br>void EwiseEq(const CudaArray& a, const CudaArray& b, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> EwiseKernel<<<dim.grid, dim.block>>>(a.ptr, b.ptr, out->ptr, out->size, Eq());<br>}<br><br>void ScalarEq(const CudaArray& a, scalar_t val, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> ScalarKernel<<<dim.grid, dim.block>>>(a.ptr, val, out->ptr, out->size, Eq());<br>}<br><br>void EwiseGe(const CudaArray& a, const CudaArray& b, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> EwiseKernel<<<dim.grid, dim.block>>>(a.ptr, b.ptr, out->ptr, out->size, Ge());<br>}<br><br>void ScalarGe(const CudaArray& a, scalar_t val, CudaArray* out) {<br> CudaDims dim = CudaOneDim(out->size);<br> ScalarKernel<<<dim.grid, dim.block>>>(a.ptr, val, out->ptr, out->size, Ge());<br>}<br>

The above covers the binary operators; next come the unary ones. They could be implemented with templates just like the binary case, but Copilot generated the corresponding code directly and I didn't bother to change it:

```cpp
__global__ void EwiseLogKernel(const scalar_t* a, scalar_t* out, size_t size) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
  if (gid < size) out[gid] = log(a[gid]);
}

void EwiseLog(const CudaArray& a, CudaArray* out) {
  CudaDims dim = CudaOneDim(out->size);
  EwiseLogKernel<<<dim.grid, dim.block>>>(a.ptr, out->ptr, out->size);
}

__global__ void EwiseExpKernel(const scalar_t* a, scalar_t* out, size_t size) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
  if (gid < size) out[gid] = exp(a[gid]);
}

void EwiseExp(const CudaArray& a, CudaArray* out) {
  CudaDims dim = CudaOneDim(out->size);
  EwiseExpKernel<<<dim.grid, dim.block>>>(a.ptr, out->ptr, out->size);
}

__global__ void EwiseTanhKernel(const scalar_t* a, scalar_t* out, size_t size) {
  size_t gid = blockIdx.x * blockDim.x + threadIdx.x;
  if (gid < size) out[gid] = tanh(a[gid]);
}

void EwiseTanh(const CudaArray& a, CudaArray* out) {
  CudaDims dim = CudaOneDim(out->size);
  EwiseTanhKernel<<<dim.grid, dim.block>>>(a.ptr, out->ptr, out->size);
}
```

Finally, uncomment the m.def lines at the end of the file to register the corresponding interfaces with pybind11.

Part 8: CUDA Backend - Reductions

本 Part 将实现两个规约算子 sum 和 max

和 CPU 版本一样,待归约的元素在内存中是连续排列的。在 CUDA 中,由每个线程负责一个规约任务,其负责的规约范围为 [gid*size, min(gid*size+size, a_size)],其中 size 是单个线程负责规约的长度,a_size 是输入数据的长度。

Inside the kernel, simply compute either the sum or the maximum depending on the reduction operator:

cpp<br>__global__ void ReduceMaxKernel(const scalar_t* a, scalar_t* out, size_t size, size_t a_size) {<br> /**<br> * 对a中连续`size`个元素进行规约<br> */<br> size_t gid = blockIdx.x * blockDim.x + threadIdx.x;<br> size_t start = gid * size;<br> size_t end = min(start + size, a_size);<br> if(start < end){<br> scalar_t max_val = a[start];<br> for(size_t i=start+1; i<end; i++){<br> max_val = max(max_val, a[i]);<br> }<br> out[gid] = max_val;<br> }<br>}<br><br>void ReduceMax(const CudaArray& a, CudaArray* out, size_t reduce_size) {<br> /**<br> * Reduce by taking maximum over `reduce_size` contiguous blocks. Even though it is inefficient,<br> * for simplicity you can perform each reduction in a single CUDA thread.<br> * <br> * Args:<br> * a: compact array of size a.size = out.size * reduce_size to reduce over<br> * out: compact array to write into<br> * redice_size: size of the dimension to reduce over<br> */<br> /// BEGIN SOLUTION<br> CudaDims dim = CudaOneDim(out->size);<br> ReduceMaxKernel<<<dim.grid, dim.block>>>(a.ptr, out->ptr, reduce_size, a.size);<br> /// END SOLUTION<br>}<br><br>__global__ void ReduceSumKernel(const scalar_t* a, scalar_t* out, size_t size, size_t a_size) {<br> /**<br> * 对a中连续`size`个元素进行规约<br> */<br> size_t gid = blockIdx.x * blockDim.x + threadIdx.x;<br> size_t start = gid * size;<br> size_t end = min(start + size, a_size);<br> if(start >= end){<br> return;<br> }<br> out[gid] = 0; // 如果进行初始化,必须只有需要运行线程才能初始化,否则会越界修改数据<br> for(size_t i=start; i<end; i++){<br> out[gid] += a[i];<br> }<br>}<br><br><br><br>void ReduceSum(const CudaArray& a, CudaArray* out, size_t reduce_size) {<br> /**<br> * Reduce by taking summation over `reduce_size` contiguous blocks. Again, for simplicity you <br> * can perform each reduction in a single CUDA thread.<br> * <br> * Args:<br> * a: compact array of size a.size = out.size * reduce_size to reduce over<br> * out: compact array to write into<br> * redice_size: size of the dimension to reduce over<br> */<br> /// BEGIN SOLUTION<br> CudaDims dim = CudaOneDim(out->size);<br> ReduceSumKernel<<<dim.grid, dim.block>>>(a.ptr, out->ptr, reduce_size, a.size);<br> /// END SOLUTION<br>}<br>

Part 9: CUDA Backend - Matrix multiplication

This is the last task and also the hardest part. As the handout says, a basic matmul operator is fairly easy: just let each thread compute one output element. But using cooperative fetching and block shared-memory register tiling, especially following the pseudocode from the lectures, is much harder.

First, the pseudocode from the lecture:

cpp<br>__global__ void mm(float A[N][N], float B[N][N], float C[N][N]) {<br> __shared__ float sA[S][L], sB[S][L];<br> float c[V][V] = {0};<br> float a[V], b[V];<br> int yblock = blockIdx.y;<br> int xblock = blockIdx.x;<br><br> for (int ko = 0; ko < N; ko += S) {<br> __syncthreads();<br> // needs to be implemented by thread cooperative fetching<br> sA[:, :] = A[ko + S, yblock * L : yblock * L + L];<br> sB[:, :] = B[ko + S, xblock * L : xblock * L + L];<br> __syncthreads();<br><br> for (int ki = 0; ki < S; ++ki) {<br> a[:] = sA[ki, threadIdx.x * V + V];<br> b[:] = sB[ki, threadIdx.x * V + V];<br> for (int y = 0; y < V; ++y) {<br> for (int x = 0; x < V; ++x) {<br> c[y][x] += a[y] * b[x];<br> }<br> }<br> }<br> }<br><br> int ybase = blockIdx.y * blockDim.y + threadIdx.y;<br> int xbase = blockIdx.x * blockDim.x + threadIdx.x;<br> C[ybase * V : ybase * V + V, xbase * V : xbase * V + V] = c[:, :];<br>}<br>

image1.png|352x399
As shown in the figure above, we are multiplying two N×N square matrices; the result matrix C is partitioned into (L, L) sub-matrices, and each block is responsible for computing one of them.

To compute its sub-matrix, the block with index (block_x, block_y) needs A' = A[L*block_x : L*block_x+L, :] and B' = B[:, L*block_y : L*block_y+L]. A' and B' can be large, so they are further split along the other dimension into N/S chunks of length S, giving tiles of shape (L, S) and (S, L). Each (L, S) × (S, L) product has shape (L, L), and accumulating the N/S partial products yields the sub-matrix this block is responsible for.

Below, a matrix will be referred to by its shape.

When computing a single (L, S) × (S, L) product, each block first loads its slice of the data (the shaded regions of A and B in the figure) into shared memory, which is visible to all threads in the block.

The (L, S) × (S, L) product itself is computed via outer products: take one column of the (L, S) tile and the corresponding row of the (S, L) tile, form their outer product, and accumulate the results over all such column/row pairs to obtain the matrix product.

Each outer product is computed cooperatively by the threads in the block; as the figure shows, every thread is responsible for a smaller (V, V) tile. Concretely, a column taken from the (L, S) tile has shape (L, 1) and a row taken from the (S, L) tile has shape (1, L); both are further partitioned into chunks of length V, i.e. pieces of shape (V, 1) and (1, V), and a single thread computes their outer product, producing a (V, V) result.
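Before the CUDA version, a small NumPy sketch (with made-up sizes N, L, S) may help convince yourself that accumulating outer products of the (L, S) / (S, L) tiles really reproduces one (L, L) block of C:

```python
import numpy as np

N, L, S = 8, 4, 2                      # hypothetical sizes, N divisible by both L and S
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

by, bx = 1, 0                          # this "block" computes C[by*L:(by+1)*L, bx*L:(bx+1)*L]
c_block = np.zeros((L, L), dtype=np.float32)
for ko in range(0, N, S):
    sA = A[by*L:(by+1)*L, ko:ko+S]     # (L, S) tile that would sit in shared memory
    sB = B[ko:ko+S, bx*L:(bx+1)*L]     # (S, L) tile that would sit in shared memory
    for ki in range(S):                # accumulate the outer product of column ki and row ki
        c_block += np.outer(sA[:, ki], sB[ki, :])

assert np.allclose(c_block, (A @ B)[by*L:(by+1)*L, bx*L:(bx+1)*L], atol=1e-5)
```

In the real kernel, each thread only accumulates a (V, V) patch of `c_block` in its registers rather than the whole tile.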

That is the algorithm described by the lecture pseudocode (the NumPy sketch above walks through the same tiling). Turning it into CUDA code requires handling a number of corner cases, mainly boundary conditions when the matrix sizes are not exact multiples of the tile sizes, as the checks in the code below show.

The code is commented fairly thoroughly. This part is complex and hard to explain in prose alone, so feel free to leave a comment if anything is unclear and we can discuss it.

cpp<br>__global__ void MatmulKernel(const scalar_t* a, const scalar_t* b, scalar_t* c, uint32_t M, uint32_t N,<br> uint32_t P){<br>#define V 2<br>#define TILE 4<br> /**<br> * 使用分块计算矩阵乘法,按照TILE大小分块<br> * a: M x N<br> * b: N x P<br> */<br> int block_x = blockIdx.x;<br> int block_y = blockIdx.y;<br> int thread_x = threadIdx.x;<br> int thread_y = threadIdx.y;<br> int thread_id = thread_x + thread_y * blockDim.x;<br> int nthreads = blockDim.x * blockDim.y;<br> // 每个block负责计算一个子矩阵的结果,具体来说,就是c[block_x*TILE: (block_x+1)*TILE, block_y*TILE: (block_y+1)*TILE]<br> // 通过累加"outer product"的结果计算这个子矩阵,product的两个元素都是分块后行列子矩阵的一个stripe<br> // 例如,a按行分块后每一块shape是(TILE, N),再取一个stripe的shape就是(TILE, TILE)<br> // outer product每次的步长不是1,而是TILE<br><br> __shared__ scalar_t a_shared[TILE][TILE];<br> __shared__ scalar_t b_shared[TILE][TILE];<br> scalar_t c_reg[V][V] = {0};<br> scalar_t a_reg[V]={0}, b_reg[V]={0};<br><br><br> for(int start=0; start<N; start+=TILE){<br> __syncthreads();<br> // 一共有TILE * TILE个元素要导入,每个线程平均负责(TILE * TILE+nthreads-1)/nthreads个元素<br> // for (int i=0; i<(TILE * TILE+nthreads-1)/nthreads; i++){<br> // int idx = thread_id + i * nthreads; // 在shared中的索引<br> // int x = idx / TILE; // 在shared中的索引<br> // int y = idx % TILE; // 在shared中的索引<br> // // a_shared中的(x, y)相当于a中的(x+block_x*TILE, y+start)<br> // // b_shared中的(x, y)相当于b中的(x+start, y+block_y*TILE)<br> // if(x+block_x*TILE < M && y+start < N){<br> // a_shared[x][y] = a[(x+block_x*TILE)*N + y+start];<br> // }<br> // if(x+start < N && y+block_y*TILE < P){<br> // b_shared[x][y] = b[(x+start)*P + y+block_y*TILE];<br> // }<br> // }<br> for (int idx = thread_id; idx < TILE * TILE; idx += nthreads){<br> int x = idx / TILE; // 在shared中的索引<br> int y = idx % TILE; // 在shared中的索引<br> // a_shared中的(x, y)相当于a中的(x+block_x*TILE, y+start)<br> // b_shared中的(x, y)相当于b中的(x+start, y+block_y*TILE)<br> if(x+block_x*TILE < M && y+start < N){<br> a_shared[x][y] = a[(x+block_x*TILE)*N + y+start];<br> }<br> if(x+start < N && y+block_y*TILE < P){<br> b_shared[x][y] = b[(x+start)*P + y+block_y*TILE];<br> }<br> }<br> __syncthreads();<br> // 接下来开始计算外积<br> // 通过遍历a_shared的列和b_shared的行,也就是a_shared的第stripe_i行和b_shared的第stripe_i列<br> int stripe_cnt = min(TILE, N-start);<br> for(int stripe_i=0; stripe_i<stripe_cnt; stripe_i++){<br> // 这个外积由nthreads负责计算,这个外积将stripe_a 和 stripe_b 按照连续的V行/列分块,由每个线程计算<br> // 接下来把计算V*V的外积结果的要用的数据加载到寄存器数组中<br> if(thread_x * V >= TILE | thread_y * V >= TILE)<br> continue;<br> for(int reg_x=0; reg_x<V; reg_x++){<br> int shared_x = reg_x + thread_x * V;<br> if(shared_x >= TILE){<br> break;<br> }<br> a_reg[reg_x] = a_shared[shared_x][stripe_i];<br> // b_reg[reg_x] = b_shared[stripe_i][shared_x];<br> }<br> for(int reg_y=0; reg_y<V; reg_y++){<br> int shared_y = reg_y + thread_y * V;<br> if(shared_y >= TILE){<br> printf("quit: thread id: %d, shared_y: %d, TILE: %d\n", thread_id, shared_y, TILE);<br> break;<br> }<br> // a_reg[reg_y] = a_shared[stripe_i][shared_y];<br> b_reg[reg_y] = b_shared[stripe_i][shared_y];<br> }<br> for(int i=0; i<V; i++){<br> for(int j=0; j<V; j++){<br> // 这里“越界”可以不管吧?把c_reg放到结果中的时候再处理<br> c_reg[i][j] += a_reg[i] * b_reg[j];<br> }<br> }<br> }<br> }<br><br> // 把c_reg的结果写入到c中<br> if(thread_x * V >= TILE | thread_y * V >= TILE)<br> return;<br> for(int i=0; i<V; i++){<br> for(int j=0; j<V; j++){<br> int x = block_x * TILE + thread_x * V + i;<br> int y = block_y * TILE + thread_y * V + j;<br> if(x < M && y < P){<br> c[x*P + y] = c_reg[i][j];<br> } else {<br> break;<br> }<br><br> }<br> }<br><br><br>}<br><br>void Matmul(const CudaArray& a, const 
CudaArray& b, CudaArray* out, uint32_t M, uint32_t N,<br> uint32_t P) {<br> /**<br> * Multiply two (compact) matrices into an output (also comapct) matrix. You will want to look<br> * at the lecture and notes on GPU-based linear algebra to see how to do this. Since ultimately<br> * mugrade is just evaluating correctness, you _can_ implement a version that simply parallelizes<br> * over (i,j) entries in the output array. However, to really get the full benefit of this<br> * problem, we would encourage you to use cooperative fetching, shared memory register tiling, <br> * and other ideas covered in the class notes. Note that unlike the tiled matmul function in<br> * the CPU backend, here you should implement a single function that works across all size<br> * matrices, whether or not they are a multiple of a tile size. As with previous CUDA<br> * implementations, this function here will largely just set up the kernel call, and you should<br> * implement the logic in a separate MatmulKernel() call.<br> * <br> *<br> * Args:<br> * a: compact 2D array of size m x n<br> * b: comapct 2D array of size n x p<br> * out: compact 2D array of size m x p to write the output to<br> * M: rows of a / out<br> * N: columns of a / rows of b<br> * P: columns of b / out<br> */<br><br> /// BEGIN SOLUTION<br> // 结果的shape是M*P,每个block负责计算一个TILE*TILE的子矩阵<br> dim3 grid_dim = dim3((M + TILE - 1) / TILE, (P + TILE - 1) / TILE, 1);<br> dim3 block_dim = dim3(16, 16, 1);<br> // dim3 block_dim = dim3(2, 2, 1);<br> MatmulKernel<<<grid_dim, block_dim>>>(a.ptr, b.ptr, out->ptr, M, N, P);<br> /// END SOLUTION<br>}<br>

hw3 wrap-up

This homework is all about the low-level CPU and GPU implementations of the various operators. Since this was my first contact with CUDA code, I spent a lot of time debugging the GPU matrix multiplication, to the point of nearly passing out with a headache. Fortunately the effort paid off and the flash of insight eventually came; the relief of finally seeing the way out is hard to describe. Special thanks to my friend for explaining the matmul implementation to me and for patiently debugging the code with me in the middle of the night.

hw4

In this assignment we first implement a few more operators, then build CNN and RNN models and train them on real datasets.

Part 1: ND Backend

First copy the unimplemented methods in the `src/*` backends and in `autograd.py` / `ndarray.py` over from the previous homeworks, then implement the previously-seen ops in `ops_*.py`; most of this is copy and paste.

A quick note on the pitfalls I ran into:

```python
import needle
# from .backend_numpy import Device, cpu, all_devices
from typing import List, Optional, NamedTuple, Tuple, Union
from collections import namedtuple
import numpy

from needle import init

# needle version
LAZY_MODE = False
TENSOR_COUNTER = 0

from .backend_selection import array_api, NDArray, default_device
from .backend_selection import Device, cpu, all_devices
```
```python
def sum(self, axis=None, keepdims=False):
    if isinstance(axis, int):
        view, out = self.reduce_view_out(axis, keepdims=keepdims)
        self.device.reduce_sum(view.compact()._handle, out._handle, view.shape[-1])
    elif isinstance(axis, (tuple, list)):
        for axis_ in axis:
            view, out = self.reduce_view_out(axis_, keepdims=keepdims)
            self.device.reduce_sum(view.compact()._handle, out._handle, view.shape[-1])
    else:
        view, out = self.reduce_view_out(axis, keepdims=keepdims)
        self.device.reduce_sum(view.compact()._handle, out._handle, view.shape[-1])

    return out


def max(self, axis=None, keepdims=False):
    if isinstance(axis, int):
        view, out = self.reduce_view_out(axis, keepdims=keepdims)
        self.device.reduce_max(view.compact()._handle, out._handle, view.shape[-1])
    elif isinstance(axis, (tuple, list)):
        for axis_ in axis:
            view, out = self.reduce_view_out(axis_, keepdims=keepdims)
            self.device.reduce_max(view.compact()._handle, out._handle, view.shape[-1])
    else:
        view, out = self.reduce_view_out(axis, keepdims=keepdims)
        self.device.reduce_max(view.compact()._handle, out._handle, view.shape[-1])

    return out
```
```python
def __rsub__(self, other):
    return needle.ops.AddScalar(other)(needle.ops.Negate()(self))
```

Next, let's implement the three newly added ops.

$\tanh'(x) = 1 - \tanh^2(x)$

In the backward pass, simply subtract the square of node from 1. Note that this hits one of the pitfalls mentioned above: `__rsub__` has to be defined.

```python
class Tanh(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.tanh(a)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return out_grad * (1 - node ** 2)
        ### END YOUR SOLUTION
```
python<br>class Stack(TensorOp):<br> def __init__(self, axis: int):<br> """<br> Concatenates a sequence of arrays along a new dimension.<br> Parameters:<br> axis - dimension to concatenate along<br> All arrays need to be of the same size.<br> """<br> self.axis = axis<br><br> def compute(self, args: TensorTuple) -> Tensor:<br> ### BEGIN YOUR SOLUTION<br> if len(args) > 0:<br> shape = args[0].shape<br> for arg in args:<br> assert arg.shape == shape, "The shape of all tensors should be the same"<br> ret_shape = list(shape)<br> ret_shape.insert(self.axis, len(args))<br> ret = array_api.empty(ret_shape, device=args[0].device)<br> for i, arg in enumerate(args):<br> slices = [slice(None)] * len(ret_shape)<br> slices[self.axis] = i<br> ret[tuple(slices)] = arg<br> return ret<br> ### END YOUR SOLUTION<br><br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return split(out_grad, self.axis)<br> ### END YOUR SOLUTION<br>
python<br>class Split(TensorTupleOp):<br> def __init__(self, axis: int):<br> """<br> Splits a tensor along an axis into a tuple of tensors.<br> (The "inverse" of Stack)<br> Parameters:<br> axis - dimension to split<br> """<br> self.axis = axis<br><br> def compute(self, A):<br> ### BEGIN YOUR SOLUTION<br> ret = []<br> ret_shape = list(A.shape)<br> ret_shape.pop(self.axis)<br> for i in range(A.shape[self.axis]):<br> slices = [slice(None)] * len(A.shape)<br> slices[self.axis] = i<br> ret.append((A[tuple(slices)]).compact().reshape(ret_shape))<br> return tuple(ret)<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return stack(out_grad, self.axis)<br> ### END YOUR SOLUTION<br><br><br>def split(a, axis):<br> return Split(axis)(a)<br>

Part 2: CIFAR-10 dataset

This part parses the CIFAR-10 dataset. First copy `python/needle/data/data_transforms.py` and `python/needle/data/data_basic.py` from the earlier homework, and change `DataLoader.__next__` in `data_basic` to:

```python
def __next__(self):
    if self.index >= len(self.ordering):
        raise StopIteration
    else:
        batch = [Tensor(x) for x in self.dataset[self.ordering[self.index]]]
        self.index += 1
        return batch
```

The earlier homework used `Tensor.make_const` here, but that does not switch the underlying data structure of `cached_data` according to the current backend.

The CIFAR-10 data format is described on the CIFAR-10 and CIFAR-100 datasets page; in short, the data is laid out as batch, channel, height, width. The `__init__` method can follow the loading code given on that page, then reshape and normalize; the other two methods are straightforward.

python<br>class CIFAR10Dataset(Dataset):<br> def __init__(<br> self,<br> base_folder: str,<br> train: bool,<br> p: Optional[int] = 0.5,<br> transforms: Optional[List] = None<br> ):<br> """<br> Parameters:<br> base_folder - cifar-10-batches-py folder filepath<br> train - bool, if True load training dataset, else load test dataset<br> Divide pixel values by 255. so that images are in 0-1 range.<br> Attributes:<br> X - numpy array of images<br> y - numpy array of labels<br> """<br> ### BEGIN YOUR SOLUTION<br> train_names = ['data_batch_1', 'data_batch_2', 'data_batch_3', 'data_batch_4', 'data_batch_5']<br> test_names = ['test_batch']<br> names = train_names if train else test_names<br> dicts = []<br> for name in names:<br> with open(os.path.join(base_folder, name), 'rb') as f:<br> dicts.append(pickle.load(f, encoding='bytes'))<br> self.X = np.concatenate([d[b'data'] for d in dicts], axis=0).reshape(-1, 3, 32, 32)<br> self.X = self.X / 255.0<br> self.y = np.concatenate([d[b'labels'] for d in dicts], axis=0)<br> <br> ### END YOUR SOLUTION<br><br> def __getitem__(self, index) -> object:<br> """<br> Returns the image, label at given index<br> Image should be of shape (3, 32, 32)<br> """<br> ### BEGIN YOUR SOLUTION<br> return self.X[index], self.y[index]<br> ### END YOUR SOLUTION<br><br> def __len__(self) -> int:<br> """<br> Returns the total number of examples in the dataset<br> """<br> ### BEGIN YOUR SOLUTION<br> return len(self.X)<br> ### END YOUR SOLUTION<br>

Part 3: Convolutional neural network

In this part we first implement a few more operators, then build a CNN and train it on the CIFAR dataset.

```python
def pad(self, axes):
    out_shape = tuple(self.shape[i] + axes[i][0] + axes[i][1] for i in range(len(self.shape)))
    out = self.device.full(out_shape, 0)
    slices = tuple(slice(axes[i][0], axes[i][0] + self.shape[i]) for i in range(len(self.shape)))
    out[slices] = self
    return out
```
```python
# ndarray.py
def flip(self, axes):
    assert isinstance(axes, tuple), "axes must be a tuple"

    strides = tuple(self.strides[i] if i not in axes else -self.strides[i] for i in range(len(self.shape)))
    sum = __builtins__["sum"]
    offset = sum((self.shape[i] - 1) * self.strides[i] for i in range(len(self.shape)) if i in axes)
    out = NDArray.make(self.shape, strides=strides, device=self.device, handle=self._handle, offset=offset).compact()
    return out


# ops_mathematic.py
class Flip(TensorOp):
    def __init__(self, axes: Optional[tuple] = None):
        if isinstance(axes, int):
            axes = (axes,)
        if isinstance(axes, list):
            axes = tuple(axes)
        self.axes = axes

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.flip(a, self.axes)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        return flip(out_grad, self.axes)
        ### END YOUR SOLUTION
```

That flipping can be implemented by manipulating the offset and strides can be shown mathematically; I won't spell it out here.
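For intuition, the same trick can be reproduced in NumPy (a throwaway check, not needle code): point the view at the last element along the flipped axes and negate those strides.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(6, dtype=np.float32).reshape(2, 3)
# Offset the base pointer to the last element along both axes, then read backwards.
flipped = as_strided(a[1:, 2:], shape=a.shape, strides=(-a.strides[0], -a.strides[1]))
assert np.array_equal(flipped, np.flip(a, (0, 1)))
```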

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \Longrightarrow \begin{bmatrix} 1 & 0 & 2 & 0 \\ 0 & 0 & 0 & 0 \\ 3 & 0 & 4 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

The `dilation` parameter is the number of zeros inserted.

The implementation idea is very close to flip: first compute the output shape, then create an all-zero array, and finally select the target positions with a strided slice. A small NumPy sketch of this pattern is below, followed by the full Dilate/UnDilate ops:
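A NumPy rendering of the same idea (the helper `dilate_np` is made up for illustration and simply mirrors what the op below does):

```python
import numpy as np

def dilate_np(a, axes, dilation):
    # Grow the dilated axes by a factor of (dilation + 1), fill with zeros,
    # then scatter the original values with a step of (dilation + 1).
    out_shape = list(a.shape)
    for ax in axes:
        out_shape[ax] *= dilation + 1
    out = np.zeros(out_shape, dtype=a.dtype)
    slices = [slice(None)] * a.ndim
    for ax in axes:
        slices[ax] = slice(None, None, dilation + 1)
    out[tuple(slices)] = a
    return out

print(dilate_np(np.array([[1., 2.], [3., 4.]]), (0, 1), 1))   # reproduces the 4x4 matrix above
```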

python<br>class Dilate(TensorOp):<br> def __init__(self, axes: tuple, dilation: int):<br> self.axes = axes<br> self.dilation = dilation<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> if self.dilation == 0:<br> return a<br> out_shape = list(a.shape)<br> for i in self.axes:<br> out_shape[i] *= self.dilation + 1<br> out = array_api.full(out_shape, 0, device=a.device)<br> slices = [slice(None)] * len(a.shape)<br> for dim in self.axes:<br> slices[dim] = slice(None, None, self.dilation+1)<br> out[tuple(slices)] = a<br> return out<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return undilate(out_grad, self.axes, self.dilation)<br> ### END YOUR SOLUTION<br><br><br>def dilate(a, axes, dilation):<br> return Dilate(axes, dilation)(a)<br><br><br>class UnDilate(TensorOp):<br> def __init__(self, axes: tuple, dilation: int):<br> self.axes = axes<br> self.dilation = dilation<br><br> def compute(self, a):<br> ### BEGIN YOUR SOLUTION<br> if self.dilation == 0:<br> return a<br> out_shape = list(a.shape)<br> for i in self.axes:<br> out_shape[i] //= self.dilation + 1<br> out = array_api.empty(out_shape, device=a.device)<br> slices = [slice(None)] * len(a.shape)<br> for dim in self.axes:<br> slices[dim] = slice(None, None, self.dilation+1)<br> out = a[tuple(slices)]<br> return out<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> return dilate(out_grad, self.axes, self.dilation)<br> ### END YOUR SOLUTION<br><br><br>def undilate(a, axes, dilation):<br> return UnDilate(axes, dilation)(a)<br>

dilate and undilate are inverses of each other, so each one's gradient simply calls the other.

Note that a convolution with padding can be rewritten as a padding-free convolution over an explicitly padded input:

```python
conv(X, W, padding=n)
# ...is equivalent to:
conv(pad(X, n), W, padding=0)
```

So the first step is to pad X and treat the padded array as the new X. After that, X and W are vectorized via the im2col technique and stride manipulation, and the convolution is carried out as a matrix multiplication. The underlying idea is covered in my course notes: 《CMU 10-414 deep learning system》学习笔记 | 周鑫的个人博客

For the backward-pass derivation, see the post: 2d 卷积梯度推导与实现 | 周鑫的个人博客
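For intuition, here is a minimal NumPy sketch of the im2col forward pass (NHWC layout, input assumed already padded, no bias; all names are made up). The Conv op implemented later does the same thing on NDArrays through shape/stride manipulation:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def conv2d_im2col(X, W, stride=1):
    # X: (N, H, W, C_in), already padded; W: (K, K, C_in, C_out).
    n, h, w, c_in = X.shape
    k, _, _, c_out = W.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    ns, hs, ws, cs = X.strides
    # Receptive fields, built purely from strides: (N, out_h, out_w, K, K, C_in).
    fields = as_strided(X, shape=(n, out_h, out_w, k, k, c_in),
                        strides=(ns, hs * stride, ws * stride, hs, ws, cs))
    cols = fields.reshape(n * out_h * out_w, k * k * c_in)   # the im2col matrix
    out = cols @ W.reshape(k * k * c_in, c_out)              # one big matmul
    return out.reshape(n, out_h, out_w, c_out)

X = np.random.rand(1, 5, 5, 2).astype(np.float32)
W = np.random.rand(3, 3, 2, 4).astype(np.float32)
print(conv2d_im2col(X, W).shape)   # (1, 3, 3, 4)
```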

The Conv implementation uses quite a few permute reorderings; doing them with transpose would be painful, so it is easier to implement a dedicated Permute TensorOp:

```python
class Permute(TensorOp):
    def __init__(self, axes: tuple):
        self.axes = axes

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return a.compact().permute(self.axes)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        a = node.inputs[0]
        index = [0] * len(self.axes)
        for i in range(len(self.axes)):
            index[self.axes[i]] = i
        return permute(out_grad, tuple(index))
        ### END YOUR SOLUTION


def permute(a, axes):
    return Permute(axes)(a)
```

The final implementation:

python<br>class Conv(TensorOp):<br> def __init__(self, stride: Optional[int] = 1, padding: Optional[int] = 0):<br> self.stride = stride<br> self.padding = padding<br><br> def compute(self, A, B):<br> ### BEGIN YOUR SOLUTION<br> assert len(A.shape) == 4, "The input tensor should be 4D"<br> assert len(B.shape) == 4, "The kernel tensor should be 4D"<br> A = A.compact()<br> B = B.compact()<br> batch_size, in_height, in_width, in_channel = A.shape<br> bs, hs, ws, cs = A.strides<br> kernel_height, kernel_width, in_channel, out_channel = B.shape<br> <br> <br> <br> pad_A = A.pad(((0, 0), (self.padding, self.padding), (self.padding, self.padding), (0, 0))).compact()<br> batch_size, in_height, in_width, in_channel = pad_A.shape<br> bs, hs, ws, cs = pad_A.strides<br> receiptive_field_shape = (batch_size, (in_height - kernel_height) // self.stride + 1, (in_width - kernel_width) // self.stride + 1, kernel_height, kernel_width, in_channel)<br> receiptive_field_strides = (bs, hs * self.stride, ws * self.stride, hs, ws, cs)<br> receiptive_field = pad_A.as_strided(receiptive_field_shape, receiptive_field_strides).compact()<br> reveiptive_vector = receiptive_field.reshape((receiptive_field.size //(kernel_height * kernel_width * in_channel), kernel_height * kernel_width * in_channel)).compact()<br> kernel_vector = B.reshape((kernel_height * kernel_width * in_channel, out_channel)).compact()<br> out = reveiptive_vector @ kernel_vector<br> out = out.reshape((batch_size, (in_height - kernel_height) // self.stride + 1, (in_width - kernel_width) // self.stride + 1, out_channel)).compact()<br> return out<br> ### END YOUR SOLUTION<br><br> def gradient(self, out_grad, node):<br> ### BEGIN YOUR SOLUTION<br> X, W = node.inputs<br> s, _, _, _ = W.shape<br> <br> # 计算X_grad<br> W_flipped = flip(W, (0, 1))<br> W_flipped_permuted = transpose(W_flipped, (2, 3)) # transpose 只支持两个维度的交换<br> outgrad_dilated = dilate(out_grad, (1, 2), self.stride - 1)<br> X_grad = conv(outgrad_dilated, W_flipped_permuted, padding=s - 1 - self.padding)<br> <br> # 计算W_grad<br> # outgrad_dilated = dilate(out_grad, (1, 2), self.stride - 1)<br> outgrad_dilated_permuted = permute(outgrad_dilated, (1, 2, 0, 3))<br> X_permuted = permute(X, (3, 1, 2, 0))<br> W_grad = conv(X_permuted, outgrad_dilated_permuted, padding=self.padding)<br> W_grad = permute(W_grad, (1, 2, 0, 3))<br> return X_grad, W_grad<br> ### END YOUR SOLUTION<br><br><br>def conv(a, b, stride=1, padding=1):<br> return Conv(stride, padding)(a, b)<br>

First, modify the Kaiming uniform initializer so it can initialize convolution kernels: add a branch that passes a different shape to rand depending on whether the `shape` argument is None:

```python
def kaiming_uniform(fan_in, fan_out, shape=None, nonlinearity="relu", **kwargs):
    assert nonlinearity == "relu", "Only relu supported currently"
    ### BEGIN YOUR SOLUTION
    if nonlinearity == "relu":
        gain = math.sqrt(2)
    bound = gain * math.sqrt(3 / fan_in)
    if shape is None:
        return rand(fan_in, fan_out, low=-bound, high=bound, **kwargs)
    else:
        return rand(*shape, low=-bound, high=bound, **kwargs)
    ### END YOUR SOLUTION
```

The hw4 code has an issue in `NDArray.sum`: when the reduction axes are given as an empty tuple, no reduction should take place, but the original code cannot handle that case. The branch for list/tuple axes needs an extra check: if the list or tuple is empty, the output equals the input:

```python
def sum(self, axis=None, keepdims=False):
    if isinstance(axis, int):
        view, out = self.reduce_view_out(axis, keepdims=keepdims)
        self.device.reduce_sum(view.compact()._handle, out._handle, view.shape[-1])
    elif isinstance(axis, (tuple, list)):
        if len(axis) == 0:
            out = self
        for axis_ in axis:
            view, out = self.reduce_view_out(axis_, keepdims=keepdims)
            self.device.reduce_sum(view.compact()._handle, out._handle, view.shape[-1])
    else:
        view, out = self.reduce_view_out(axis, keepdims=keepdims)
        self.device.reduce_sum(view.compact()._handle, out._handle, view.shape[-1])

    return out
```

With everything in place, the Conv layer just calls the functions above. For initialization, set up the weights and the bias term as the handout describes. For a stride-1 convolution the output shrinks by k-1 rows and k-1 columns, so to keep the shape unchanged the input is padded by (k-1)/2 on each side; since k is conventionally odd, that is the same as padding by k // 2.
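As a quick sanity check of that arithmetic, using the standard output-size formula for a stride-1 convolution:

```python
# H_out = (H + 2*pad - k) + 1 for stride 1; pad = k // 2 keeps H_out == H when k is odd.
for k in (1, 3, 5, 7):
    H, pad = 32, k // 2
    assert (H + 2 * pad - k) + 1 == H
```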

In the forward pass, first permute X into NHWC format and apply the convolution. If there is a bias term, broadcast it and add it to the result, then permute the result back to NCHW and return it. The complete code:

python<br>class Conv(Module):<br> """<br> Multi-channel 2D convolutional layer<br> IMPORTANT: Accepts inputs in NCHW format, outputs also in NCHW format<br> Only supports padding=same<br> No grouped convolution or dilation<br> Only supports square kernels<br> """<br> def __init__(self, in_channels, out_channels, kernel_size, stride=1, bias=True, device=None, dtype="float32"):<br> super().__init__()<br> if isinstance(kernel_size, tuple):<br> kernel_size = kernel_size[0]<br> if isinstance(stride, tuple):<br> stride = stride[0]<br> self.in_channels = in_channels<br> self.out_channels = out_channels<br> self.kernel_size = kernel_size<br> self.stride = stride<br><br> ### BEGIN YOUR SOLUTION<br> self.weight = Parameter(init.kaiming_uniform(self.in_channels, self.out_channels, shape=(kernel_size, kernel_size, in_channels, out_channels), device=device, dtype=dtype))<br> bias_bound = 1.0 / np.sqrt(in_channels * kernel_size * kernel_size)<br> self.bias = Parameter(init.rand(out_channels, low=-bias_bound, high=bias_bound, device=device, dtype=dtype)) if bias else None<br> self.padding = kernel_size // 2<br> ### END YOUR SOLUTION<br><br> def forward(self, x: Tensor) -> Tensor:<br> ### BEGIN YOUR SOLUTION<br> # convert NCHW to NHWC<br> x = ops.permute(x, [0, 2, 3, 1])<br> conv_x = ops.conv(x, self.weight, stride=self.stride, padding=self.padding)<br> if self.bias is not None:<br> broadcasted_bias = ops.broadcast_to(ops.reshape(self.bias, (1, 1, 1, self.out_channels)), conv_x.shape)<br> conv_x = conv_x + broadcasted_bias<br> out = ops.permute(conv_x, [0, 3, 1, 2])<br> return out<br>
The ReLU gradient likewise needs to build its mask Tensor on the input's device:

```python
class ReLU(TensorOp):
    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        return array_api.maximum(a, 0)
        ### END YOUR SOLUTION

    def gradient(self, out_grad, node):
        ### BEGIN YOUR SOLUTION
        relu_mask = Tensor(node.inputs[0].cached_data > 0, device=node.inputs[0].device)
        return out_grad * relu_mask
        ### END YOUR SOLUTION
```

Similarly, the earlier SoftmaxLoss implementation did not specify a device when generating the one-hot labels; that needs fixing too:

```python
class SoftmaxLoss(Module):
    def forward(self, logits: Tensor, y: Tensor):
        ### BEGIN YOUR SOLUTION
        batch_size, label_size = logits.shape
        one_hot_y = init.one_hot(label_size, y, device=logits.device)
        true_logits = ops.summation(logits * one_hot_y, axes=(1,))
        return (ops.logsumexp(logits, axes=(1,)) - true_logits).sum() / batch_size
        ### END YOUR SOLUTION
```

In addition, the Reshape op may be handed a non-compact array; fix its implementation by calling compact before passing the array to array_api:

```python
class Reshape(TensorOp):
    def __init__(self, shape):
        self.shape = shape

    def compute(self, a):
        ### BEGIN YOUR SOLUTION
        expect_size = 1
        for i in self.shape:
            expect_size *= i
        real_size = 1
        for i in a.shape:
            real_size *= i
        assert expect_size == real_size, "The reshape size is not compatible"
        return array_api.reshape(a.compact(), self.shape)
        ### END YOUR SOLUTION
```

After all this patching, the code is robust enough to build ResNet9 🎉. The ResNet9 architecture is shown below. No need to be too hard on ourselves about the holes found along the way; even the two course instructors occasionally slip up. One layer in the ResNet9 diagram below is written incorrectly, which is marked on the figure.
image.png|600|360x416
Let's start with ConvBN; its four positional arguments are, in order, channels_in, channels_out, kernel_size and stride. The hw4 starter code already provides BatchNorm2d, so don't blindly overwrite it when copying `nn_basic.py` over. The rest is straightforward: assemble the blocks following the diagram and fill in whatever raises a Not Implemented Error. A hypothetical ConvBN sketch is shown below, followed by the full ResNet9 code:
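A minimal sketch of what ConvBN might look like, assuming it is simply Conv followed by BatchNorm2d and ReLU as in the diagram (the exact signature in the starter code may differ):

```python
import needle as ndl

class ConvBN(ndl.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 bias=True, device=None, dtype="float32"):
        super().__init__()
        # Conv -> BatchNorm2d -> ReLU, assembled with Sequential so the
        # sub-modules' parameters are picked up automatically.
        self.model = ndl.nn.Sequential(
            ndl.nn.Conv(in_channels, out_channels, kernel_size, stride=stride,
                        bias=bias, device=device, dtype=dtype),
            ndl.nn.BatchNorm2d(out_channels, device=device, dtype=dtype),
            ndl.nn.ReLU(),
        )

    def forward(self, x):
        return self.model(x)
```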

python<br>class ResNet9(ndl.nn.Module):<br> def __init__(self, device=None, dtype="float32"):<br> super().__init__()<br> bias = True<br> ### BEGIN YOUR SOLUTION ###<br> self.conv1 = ConvBN(3, 16, 7, 4, bias=bias, device=device, dtype=dtype)<br> self.conv2 = ConvBN(16, 32, 3, 2, bias=bias, device=device, dtype=dtype)<br> self.res = ndl.nn.Residual(<br> ndl.nn.Sequential(<br> ConvBN(32, 32, 3, 1, bias=bias, device=device, dtype=dtype),<br> ConvBN(32, 32, 3, 1, bias=bias, device=device, dtype=dtype)<br> )<br> )<br> self.conv3 = ConvBN(32, 64, 3, 2, bias=bias, device=device, dtype=dtype)<br> self.conv4 = ConvBN(64, 128, 3, 2, bias=bias, device=device, dtype=dtype)<br> self.res2 = ndl.nn.Residual(<br> ndl.nn.Sequential(<br> ConvBN(128, 128, 3, 1, bias=bias, device=device, dtype=dtype),<br> ConvBN(128, 128, 3, 1, bias=bias, device=device, dtype=dtype)<br> )<br> )<br> self.flatten = ndl.nn.Flatten()<br> self.linear = ndl.nn.Linear(128, 128, bias=bias, device=device, dtype=dtype)<br> self.relu = ndl.nn.ReLU()<br> self.linear2 = ndl.nn.Linear(128, 10, bias=bias, device=device, dtype=dtype)<br> ### END YOUR SOLUTION<br><br> def forward(self, x):<br> ### BEGIN YOUR SOLUTION<br> x = self.conv1(x)<br> x = self.conv2(x)<br> x = self.res(x)<br> x = self.conv3(x)<br> x = self.conv4(x)<br> x = self.res2(x)<br> x = self.flatten(x)<br> x = self.linear(x)<br> x = self.relu(x)<br> x = self.linear2(x)<br> return x<br> ### END YOUR SOLUTION<br>

Unfortunately, on my machine this code does not pass the ResNet9 test case: the error is 0.09, far above the 0.01 tolerance. Yet it passes the subsequent test that trains for 2 epochs on the CIFAR-10 training set, with an error of 5e-5, well under the 0.01 tolerance. I suspect the reference data of the first test case is off.

Part 4: Recurrent neural network

python<br>class RNNCell(Module):<br> def __init__(self, input_size, hidden_size, bias=True, nonlinearity='tanh', device=None, dtype="float32"):<br> super().__init__()<br> ### BEGIN YOUR SOLUTION<br> bound = 1 / np.sqrt(hidden_size)<br> self.W_ih = Parameter(init.rand(input_size, hidden_size, low=-bound, high=bound, device=device, dtype=dtype))<br> self.W_hh = Parameter(init.rand(hidden_size, hidden_size, low=-bound, high=bound, device=device, dtype=dtype))<br> self.bias_ih = Parameter(init.rand(hidden_size, low=-bound, high=bound, device=device, dtype=dtype)) if bias else None<br> self.bias_hh = Parameter(init.rand(hidden_size, low=-bound, high=bound, device=device, dtype=dtype)) if bias else None<br> self.nonlinearity = ops.tanh if nonlinearity == 'tanh' else ops.relu<br> ### END YOUR SOLUTION<br><br> def forward(self, X, h=None):<br> ### BEGIN YOUR SOLUTION<br> if h is None:<br> h = init.zeros(X.shape[0], self.W_hh.shape[0], device=X.device, dtype=X.dtype)<br> Z = X@self.W_ih + h@self.W_hh<br> if self.bias_ih:<br> bias = self.bias_ih + self.bias_hh<br> bias = bias.reshape((1, bias.shape[0]))<br> bias = bias.broadcast_to(Z.shape)<br> Z += bias<br> return self.nonlinearity(Z)<br> ### END YOUR SOLUTION<br>

As the unrolled-RNN diagram shows, the input to each layer keeps changing, so maintain an X_input list holding the vertical inputs of the cells not yet computed, and likewise an h_input list holding their horizontal inputs. Concretely, when computing cell $h_{ij}$, its inputs are X_input[i] and h_input[j], and after the computation both X_input[i] and h_input[j] are updated to that cell's output.

For this stacked RNN, the cells can be evaluated left-to-right then bottom-to-top, or bottom-to-top then left-to-right. I went with vertical first, then horizontal.

The model returns two things. One is the output of the last layer, i.e. the collection of y values in the diagram; notice that the last layer's outputs are exactly the vertical inputs of the (hypothetical) layer above it, which is just the X_input list we have been maintaining. The other is the final column of hidden states, which is likewise the h_input list we have been maintaining. Everything falls into place.

Note that Tensor does not implement `__getitem__` / `__setitem__`; whenever sliced access is needed, use the previously implemented split and stack ops instead.
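A tiny sketch of that pattern (assuming the usual needle imports; the shapes are made up): to "assign" into one slice, split into a Python list, replace the element, and stack back.

```python
from needle import init, ops

h0 = init.zeros(2, 3, 4)              # (num_layers, bs, hidden_size)
new_h = init.ones(3, 4)
h_list = list(ops.split(h0, 0))       # list of (bs, hidden_size) tensors
h_list[0] = new_h                     # plain Python list assignment
h0 = ops.stack(h_list, 0)             # back to a (num_layers, bs, hidden_size) tensor
```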

The complete code:

python<br>class RNN(Module):<br> def __init__(self, input_size, hidden_size, num_layers=1, bias=True, nonlinearity='tanh', device=None, dtype="float32"):<br> super().__init__()<br> ### BEGIN YOUR SOLUTION<br> self.rnn_cells = []<br> self.rnn_cells.append(RNNCell(input_size, hidden_size, bias, nonlinearity, device, dtype))<br> for i in range(1, num_layers):<br> self.rnn_cells.append(RNNCell(hidden_size, hidden_size, bias, nonlinearity, device, dtype))<br> ### END YOUR SOLUTION<br><br> def forward(self, X, h0=None):<br> ### BEGIN YOUR SOLUTION<br> seq_len = X.shape[0]<br> layer_num = len(self.rnn_cells)<br> if h0 is None:<br> h0 = init.zeros(len(self.rnn_cells), X.shape[1], self.rnn_cells[0].W_hh.shape[0], device=X.device, dtype=X.dtype)<br> h_input = list(ops.split(h0, 0)) # list length = num_layers, element shape = (bs, hidden_size)<br> X_input = list(ops.split(X, 0)) # list length = seq_len, element shape = (bs, input_size)<br> for i in range(seq_len):<br> for j in range(layer_num):<br> X_input[i] = self.rnn_cells[j](X_input[i], h_input[j])<br> h_input[j] = X_input[i]<br> output = ops.stack(X_input, 0) # output features of last layer == input X of last+1 layer<br> h_n = ops.stack(h_input, 0)<br> return output, h_n<br> <br> <br> ### END YOUR SOLUTION<br>

Part 5: LSTM

This part implements the LSTM. Its structure mirrors the RNN above, with the LSTM equations copied verbatim, so here is the code directly:

python<br>class LSTMCell(Module):<br> def __init__(self, input_size, hidden_size, bias=True, device=None, dtype="float32"):<br> super().__init__()<br> ### BEGIN YOUR SOLUTION<br> bound = 1.0 / np.sqrt(hidden_size)<br> self.W_ih = Parameter(init.rand(input_size, 4*hidden_size, low=-bound, high=bound, device=device, dtype=dtype))<br> self.W_hh = Parameter(init.rand(hidden_size, 4*hidden_size, low=-bound, high=bound, device=device, dtype=dtype))<br> self.bias_ih = Parameter(init.rand(4*hidden_size, low=-bound, high=bound, device=device, dtype=dtype)) if bias else None<br> self.bias_hh = Parameter(init.rand(4*hidden_size, low=-bound, high=bound, device=device, dtype=dtype)) if bias else None<br> self.sigmoid = Sigmoid()<br> ### END YOUR SOLUTION<br><br><br> def forward(self, X, h=None):<br> ### BEGIN YOUR SOLUTION<br> bs = X.shape[0]<br> hidden_size = self.W_hh.shape[0]<br> if h is None:<br> h0 = init.zeros(bs, hidden_size, device=X.device, dtype=X.dtype)<br> c0 = init.zeros(bs, hidden_size, device=X.device, dtype=X.dtype)<br> else:<br> h0, c0 = h<br> Z = X@self.W_ih + h0@self.W_hh # [bs, 4*hidden_size]<br> if self.bias_ih:<br> bias = self.bias_ih + self.bias_hh<br> bias = bias.reshape((1, bias.shape[0]))<br> bias = bias.broadcast_to(Z.shape)<br> Z += bias<br> stripes = list(ops.split(Z, 1))<br> i = self.sigmoid(ops.stack(stripes[0: hidden_size], 1))<br> f = self.sigmoid(ops.stack(stripes[hidden_size: 2*hidden_size], 1))<br> g = ops.tanh(ops.stack(stripes[2*hidden_size: 3*hidden_size], 1))<br> o = self.sigmoid(ops.stack(stripes[3*hidden_size: 4*hidden_size], 1))<br> c = f * c0 + i * g<br> h = o * ops.tanh(c)<br> return h, c<br> <br> ### END YOUR SOLUTION<br><br><br>class LSTM(Module):<br> def __init__(self, input_size, hidden_size, num_layers=1, bias=True, device=None, dtype="float32"):<br> super().__init__()<br> ### BEGIN YOUR SOLUTION<br> self.lstm_cells = []<br> self.lstm_cells.append(LSTMCell(input_size, hidden_size, bias, device, dtype))<br> for i in range(1, num_layers):<br> self.lstm_cells.append(LSTMCell(hidden_size, hidden_size, bias, device, dtype))<br> ### END YOUR SOLUTION<br><br> def forward(self, X, h=None):<br> ### BEGIN YOUR SOLUTION<br> seq_len, bs, _ = X.shape<br> num_layers = len(self.lstm_cells)<br> hidden_size = self.lstm_cells[0].W_hh.shape[0]<br> if h is None:<br> h0 = init.zeros(num_layers, bs, hidden_size, device=X.device, dtype=X.dtype)<br> c0 = init.zeros(num_layers, bs, hidden_size, device=X.device, dtype=X.dtype)<br> else:<br> h0, c0 = h<br> h_input = list(ops.split(h0, 0))<br> c_input = list(ops.split(c0, 0))<br> X_input = list(ops.split(X, 0))<br> for i in range(seq_len):<br> for j in range(num_layers):<br> X_input[i], c_input[j] = self.lstm_cells[j](X_input[i], (h_input[j], c_input[j]))<br> h_input[j] = X_input[i]<br> output = ops.stack(X_input, 0)<br> h_n = ops.stack(h_input, 0)<br> c_n = ops.stack(c_input, 0)<br> return output, (h_n, c_n)<br> <br> ### END YOUR SOLUTION<br>

Part 6: Penn Treebank dataset

```python
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        ### BEGIN YOUR SOLUTION
        if self.word2idx.get(word) is None:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]
        ### END YOUR SOLUTION

    def __len__(self):
        ### BEGIN YOUR SOLUTION
        return len(self.idx2word)
        ### END YOUR SOLUTION
```

The implementation just follows the docstrings; together with the diagram it is self-explanatory. The complete code:

python<br>class Corpus(object):<br> def __init__(self, base_dir, max_lines=None):<br> self.dictionary = Dictionary()<br> self.train = self.tokenize(os.path.join(base_dir, 'train.txt'), max_lines)<br> self.test = self.tokenize(os.path.join(base_dir, 'test.txt'), max_lines)<br><br> def tokenize(self, path, max_lines=None):<br> ### BEGIN YOUR SOLUTION<br> with open(path, 'r') as f:<br> ids = []<br> line_idx = 0<br> for line in f:<br> if max_lines is not None and line_idx >= max_lines:<br> break<br> words = line.split() + ['<eos>']<br> for word in words:<br> ids.append(self.dictionary.add_word(word))<br> line_idx += 1<br> return ids<br> ### END YOUR SOLUTION<br><br><br>def batchify(data, batch_size, device, dtype):<br> ### BEGIN YOUR SOLUTION<br> data_len = len(data)<br> nbatch = data_len // batch_size<br> data = data[:nbatch * batch_size]<br> return np.array(data).reshape(batch_size, -1).T<br> ### END YOUR SOLUTION<br><br><br>def get_batch(batches, i, bptt, device=None, dtype=None):<br> ### BEGIN YOUR SOLUTION<br> data = batches[i: i + bptt, :]<br> target = batches[i + 1: i + 1 + bptt, :]<br> return Tensor(data, device=device, dtype=dtype), Tensor(target.flatten(), device=device, dtype=dtype)<br> ### END YOUR SOLUTION<br>

Part 7: Training a word-level language model

There is a big pitfall here: the ndarray matmul does not support batched matrix multiplication, so when a 3-D tensor is multiplied by a 2-D matrix you have to reshape manually, multiply, and reshape back.

```python
class Embedding(Module):
    def __init__(self, num_embeddings, embedding_dim, device=None, dtype="float32"):
        super().__init__()
        ### BEGIN YOUR SOLUTION
        self.weight = Parameter(init.randn(num_embeddings, embedding_dim, device=device, dtype=dtype))
        ### END YOUR SOLUTION

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        one_hot = init.one_hot(self.weight.shape[0], x, device=x.device, dtype=x.dtype)
        seq_len, bs, num_embeddings = one_hot.shape
        one_hot = one_hot.reshape((seq_len * bs, num_embeddings))

        return ops.matmul(one_hot, self.weight).reshape((seq_len, bs, self.weight.shape[1]))
        ### END YOUR SOLUTION
```
python<br>class LanguageModel(nn.Module):<br> def __init__(self, embedding_size, output_size, hidden_size, num_layers=1,<br> seq_model='rnn', device=None, dtype="float32"):<br> super(LanguageModel, self).__init__()<br> ### BEGIN YOUR SOLUTION<br> self.embedding = nn.Embedding(output_size, embedding_size, device=device, dtype=dtype)<br> if seq_model == 'rnn':<br> self.model = nn.RNN(embedding_size, hidden_size, num_layers, device=device, dtype=dtype)<br> elif seq_model == 'lstm':<br> self.model = nn.LSTM(embedding_size, hidden_size, num_layers, device=device, dtype=dtype)<br> self.linear = nn.Linear(hidden_size, output_size, device=device, dtype=dtype)<br> ### END YOUR SOLUTION<br><br> def forward(self, x, h=None):<br> ### BEGIN YOUR SOLUTION<br> x = self.embedding(x) # (seq_len, bs, embedding_size)<br> out, h = self.model(x, h)<br> seq_len, bs, hidden_size = out.shape<br> out = out.reshape((seq_len * bs, hidden_size))<br> out = self.linear(out)<br> return out, h<br> ### END YOUR SOLUTION<br>

If a NotImplementedError shows up, paste the missing piece over from hw2.

python<br>def epoch_general_ptb(data, model, seq_len=40, loss_fn=nn.SoftmaxLoss(), opt=None,<br> clip=None, device=None, dtype="float32"):<br> np.random.seed(4)<br> ### BEGIN YOUR SOLUTION<br> if opt:<br> model.train()<br> else:<br> model.eval()<br> total_loss = 0<br> total_error = 0<br> n_batch, batch_size = data.shape<br> iter_num = n_batch - seq_len<br> for iter_idx in range(iter_num):<br> X, target = ndl.data.get_batch(data, iter_idx, seq_len, device=device, dtype=dtype)<br> if opt:<br> opt.reset_grad()<br> pred, _ = model(X)<br> loss = loss_fn(pred, target)<br> if opt:<br> opt.reset_grad()<br> loss.backward()<br> if clip:<br> opt.clip_grad_norm(clip)<br> opt.step()<br> total_loss += loss.numpy()<br> total_error += np.sum(pred.numpy().argmax(1)!=target.numpy())<br> avg_loss = total_loss / iter_num<br> avg_acc = 1 - total_error / (iter_num * seq_len)<br> return avg_acc, avg_loss<br> ### END YOUR SOLUTION<br>
```python
def train_ptb(model, data, seq_len=40, n_epochs=1, optimizer=ndl.optim.SGD,
              lr=4.0, weight_decay=0.0, loss_fn=nn.SoftmaxLoss, clip=None,
              device=None, dtype="float32"):
    np.random.seed(4)
    ### BEGIN YOUR SOLUTION
    for epoch in range(n_epochs):
        avg_acc, avg_loss = epoch_general_ptb(data, model, seq_len, loss_fn(),
                                               optimizer(model.parameters(), lr=lr, weight_decay=weight_decay),
                                               clip=clip, device=device, dtype=dtype)
    return avg_acc, avg_loss
    ### END YOUR SOLUTION

def evaluate_ptb(model, data, seq_len=40, loss_fn=nn.SoftmaxLoss,
                 device=None, dtype="float32"):
    np.random.seed(4)
    ### BEGIN YOUR SOLUTION
    avg_acc, avg_loss = epoch_general_ptb(data, model, seq_len, loss_fn(),
                                           device=device, dtype=dtype)
    return avg_acc, avg_loss
    ### END YOUR SOLUTION
```

hw4 wrap-up

The hardest part of this homework was deriving the convolution backward pass, which nearly drove me bald. The rest was mostly assembling building blocks and patching earlier implementations, which got fairly tedious.

Finally done, confetti 🎉

hw4_extra

Fine, there is one more assignment. Onward!

Part 1: Implementing the Multi-Head Attention Activation Layer

This part implements the forward pass of a multi-head self-attention layer. The class provides a set of helper functions; skim through them first.

There are a couple of points the handout does not mention.

The previously implemented Dropout module has a problem: it does not specify dtype and device when building the mask, so it needs fixing:

```python
class Dropout(Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        if not self.training:
            return x
        mask = init.randb(*x.shape, p=1 - self.p, dtype="float32", device=x.device)
        return x * mask / (1 - self.p)
        ### END YOUR SOLUTION
```

Since the incoming K, Q, V already have the heads separated out as an independent dimension, implementing multi-head self-attention is much simpler: just apply the single-head formula directly:

```python
def forward(
    self,
    q, k, v,
):
    batch_size, num_head, queries_len, q_dim = q.shape
    _, _, keys_values_len, k_dim = k.shape
    _, _, _, v_dim = v.shape

    assert q_dim == k_dim == v_dim

    result = None
    probs = None

    ### BEGIN YOUR SOLUTION
    sqrt_d = np.sqrt(q_dim)
    Z = self.matmul(q, k) / sqrt_d
    if self.causal:
        mask = self.create_causal_mask(queries_len, keys_values_len, self.device)
        Z = Z + mask.broadcast_to(Z.shape)
    probs = self.softmax(Z)
    probs = self.dropout(probs)
    result = self.matmul(probs, v.transpose((2, 3)))
    ### END YOUR SOLUTION

    return result, probs
```

Part 2 Implementing the Self-Attention Layer with trainable parameters

This part implements a full multi-head self-attention layer: prenorm on Q, K, V, splitting into heads, calling the forward pass implemented above, merging the heads, and a final linear projection.

First, modify class MatMul so it supports batched matmul when A carries batch dimensions:

```python
class MatMul(TensorOp):
    def compute(self, a, b):
        ### BEGIN YOUR SOLUTION
        a_shape = a.shape
        if len(a.shape) > 2:
            batch_size = 1
            for i in range(0, len(a.shape) - 1):
                batch_size *= a.shape[i]
            a = a.reshape((batch_size, a_shape[-1]))
        out = a @ b
        if len(a_shape) > 2:
            out = out.reshape((*a_shape[:-1], b.shape[-1]))
        return out
        ### END YOUR SOLUTION
```

The previously implemented LayerNorm1d only supports input of shape (batch_size, hidden_size), so either reshape manually before the prenorm call or modify LayerNorm1d itself.
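If you take the manual-reshape route, a small (hypothetical) helper is enough; here `norm` stands for the existing LayerNorm1d instance and `x` for a (batch_size, seq_len, q_features) tensor:

```python
def prenorm_3d(norm, x):
    # Flatten (batch, seq_len) into one axis, apply the 2-D LayerNorm1d, restore the shape.
    b, s, f = x.shape
    return norm(x.reshape((b * s, f))).reshape((b, s, f))
```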

The previously implemented Linear module also has a problem: it tries to access the bias even when there is none, so it needs a fix:

```python
class Linear(Module):
    def forward(self, X: Tensor) -> Tensor:
        ### BEGIN YOUR SOLUTION
        y = ops.matmul(X, self.weight)
        if self.bias:
            if self.bias.shape != (1, self.out_features):
                self.bias = self.bias.reshape((1, self.out_features))
            y += self.bias.broadcast_to(y.shape)
        return y
```

Splitting into heads is just a reshape followed by a permute, a pattern that has shown up several times in earlier homeworks. The overall implementation is simple, under ten lines of code:

```python
def forward(
    self,
    q, k=None, v=None,
):
    if k is None:
        k = q
    if v is None:
        v = q

    batch_size, queries_len, q_dim = q.shape
    _, keys_values_len, k_dim = k.shape
    _, _, v_dim = v.shape

    result = None

    ### BEGIN YOUR SOLUTION
    q, k, v = self.prenorm_q(q), self.prenorm_k(k), self.prenorm_v(v)
    q, k, v = self.q_projection(q), self.k_projection(k), self.v_projection(v)
    q = ops.permute(q.reshape((batch_size, queries_len, self.num_head, self.dim_head)), (0, 2, 1, 3))
    k = ops.permute(k.reshape((batch_size, keys_values_len, self.num_head, self.dim_head)), (0, 2, 1, 3))
    v = ops.permute(v.reshape((batch_size, keys_values_len, self.num_head, self.dim_head)), (0, 2, 1, 3))
    attn_res, _ = self.attn(q, k, v)
    attn_res = ops.permute(attn_res, (0, 2, 1, 3)).reshape((batch_size, keys_values_len, self.num_head * self.dim_head))
    result = self.out_projection(attn_res)
    ### END YOUR SOLUTION

    return result
```

Part 3 Implementing a prenorm residual Transformer Layer

This part builds a residual Transformer layer; there is nothing difficult here, it is pure block assembly. As usual, patch the building blocks first: the Linear layer fixed in the previous part still has a problem, its bias does not support multiple batch dimensions, so change it to the following:

```python
def forward(self, X: Tensor) -> Tensor:
    ### BEGIN YOUR SOLUTION
    y = ops.matmul(X, self.weight)
    if self.bias:
        broadcast_shape = [1] * (len(y.shape) - 1) + [self.out_features]
        bias = self.bias.reshape(broadcast_shape).broadcast_to(y.shape)
        y += bias
    return y
```

Now we can happily assemble the blocks:

python<br>class TransformerLayer(Module):<br><br> def __init__(<br> self,<br> q_features: int,<br> num_head: int,<br> dim_head: int,<br> hidden_size: int,<br> *,<br> dropout = 0.,<br> causal = True,<br> device = None,<br> dtype = "float32",<br> ):<br><br> super().__init__()<br><br> self.device = device<br> self.dtype = dtype<br><br> ### BEGIN YOUR SOLUTION<br> self.layer1 = Sequential(<br> AttentionLayer(<br> q_features=q_features,<br> num_head=num_head,<br> dim_head=dim_head,<br> out_features=q_features,<br> dropout=dropout,<br> causal=causal,<br> device=device,<br> dtype=dtype<br> ),<br> Dropout(dropout),<br> )<br> self.layer2 = Sequential(<br> LayerNorm1d(q_features, device=device, dtype=dtype),<br> Linear(q_features, hidden_size, bias=True, device=device, dtype=dtype),<br> ReLU(),<br> Dropout(dropout),<br> Linear(hidden_size, q_features, bias=True, device=device, dtype=dtype),<br> Dropout(dropout),<br> )<br> <br> ### END YOUR SOLUTION<br><br> def forward(<br> self,<br> x<br> ):<br> batch_size, seq_len, x_dim = x.shape<br><br> ### BEGIN YOUR SOLUTION<br> x = self.layer1(x) + x<br> x = self.layer2(x) + x<br> ### END YOUR SOLUTION<br><br> return x<br>

Part 4 Implementing the Transformer model

This part assembles the complete Transformer network. The handout says to embed each token's position index within the sentence, so an extra embedding layer is created at init time and its output is added to the input before it enters the Transformer layers. The rest is block assembly; a quick check of how the position indices are built is shown below, followed by the full code:
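A standalone check of the per-token position indices (shapes assumed: batch size bs and sequence length seq_len, after the batch_first handling):

```python
import numpy as np

bs, seq_len = 2, 5
positions = np.repeat(np.arange(seq_len), bs).reshape(seq_len, bs).T   # (bs, seq_len)
print(positions)
# [[0 1 2 3 4]
#  [0 1 2 3 4]]
```

In the forward pass this array is wrapped in a Tensor, run through the positional embedding, and added to x.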

python<br>class Transformer(Module):<br><br> def __init__(<br> self,<br> embedding_size: int,<br> hidden_size: int,<br> num_layers: int, <br> *,<br> num_head: int = 8,<br> dim_head: int = 32,<br> dropout = 0.,<br> causal = True,<br> device = None,<br> dtype = "float32",<br> batch_first = False,<br> sequence_len = 2048<br> ):<br><br> super().__init__()<br><br> self.device = device<br> self.dtype = dtype<br> self.batch_first = batch_first<br><br> ### BEGIN YOUR SOLUTION<br> self.embedding = Embedding(<br> num_embeddings=sequence_len,<br> embedding_dim=embedding_size,<br> device=device,<br> dtype=dtype<br> )<br> layers = [TransformerLayer(<br> q_features=embedding_size,<br> num_head=num_head,<br> dim_head=dim_head,<br> hidden_size=hidden_size,<br> dropout=dropout,<br> causal=causal,<br> device=device,<br> dtype=dtype<br> ) for _ in range(num_layers)]<br> self.model = Sequential(*layers)<br> <br> ### END YOUR SOLUTION<br><br> def forward(<br> self,<br> x, h=None<br> ):<br><br> if not self.batch_first:<br> x = ops.transpose(x, axes=(0, 1))<br><br> ### BEGIN YOUR SOLUTION<br> bs, seq_len, input_dim = x.shape<br> time = np.repeat(np.arange(seq_len), bs).reshape((seq_len, bs)).T<br> time = Tensor(time, device=self.device, dtype=self.dtype)<br> time = self.embedding(time)<br> x = x + time<br> x = self.model(x)<br> ### END YOUR SOLUTION<br><br> if not self.batch_first:<br> x = ops.transpose(x, axes=(0, 1))<br><br> return x, init.zeros_like(x)<br>

Because batched matmul in ops.matmul has too many pitfalls, only the forward pass was patched; the backward pass still does not support it, so in the end I was not able to train the Transformer on the dataset, which is a bit of a pity.

hw4_extra wrap-up

hw4_extra is much easier than hw4; at least we were not asked to derive the Transformer backward pass by hand, which would have been another bloodbath.

This time it really is finished, confetti 🎉