Hello World, GNN

曼亚灿

Published 2024-01-17 09:37:29


This article is included in the column: 亚灿网志


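This example builds a GCN (graph convolutional network) that classifies papers by using the citation relationships between them. The data structure and contents are described in detail below.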

Code walkthrough

1. Imports

import hues
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from torch import nn
from pathlib import Path
from prettytable import PrettyTable
from torch.nn import functional as nn_fun
from scipy.sparse import coo_matrix, csr_matrix, diags, eye

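In Python, the first step is always the imports. Roughly, the packages above do the following:

hues: a Python library for adding color and styling to terminal output. It makes console output easier to read when debugging or presenting data.
torch and from torch import nn: torch is the core of the PyTorch framework, a popular deep learning library widely used in machine learning and artificial intelligence. It offers a rich set of tensor operations with a NumPy-like interface, plus GPU acceleration.
- from torch import nn imports PyTorch's neural network module, which contains the layers, loss functions, and other components needed to build deep learning models.
numpy: the core library for scientific computing in Python. It provides a powerful N-dimensional array object, a wide range of mathematical functions, and tools for linear algebra, Fourier transforms, and random number generation.
pandas: a library for data processing and analysis. Its two main data structures, DataFrame and Series, handle both time-series and non-time-series data and are well suited to data cleaning, analysis, and visualization.
matplotlib.pyplot: a plotting interface for Python and its numerical library NumPy. It offers MATLAB-like plotting for static, animated, and interactive charts.
from pathlib import Path: Path comes from the pathlib module, which provides object-oriented file system paths. Working with Path is more intuitive and less error-prone than manipulating paths as plain strings.
from prettytable import PrettyTable: PrettyTable is a simple Python library that renders data as neat ASCII tables, which is handy for formatting and presenting data in command-line applications.
from torch.nn import functional as nn_fun: torch.nn.functional contains the functions used in neural networks, such as activation functions and loss functions. It is usually used together with the class-based interface of the nn module.
scipy.sparse functions: from scipy.sparse import coo_matrix, csr_matrix, diags, eye imports the sparse matrix tools of SciPy.
- coo_matrix: a sparse format that stores the non-zero entries in three NumPy arrays (row indices, column indices, values).
- csr_matrix: the compressed sparse row format, which stores row pointers, column indices, and values. It is well suited to efficient arithmetic and matrix-vector products.
- diags: a function for building a (sparse) diagonal matrix.
- eye: builds a sparse identity matrix (1 on the main diagonal, 0 elsewhere).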

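2. Check and define the compute device

Deep learning with PyTorch can run on the CPU or on the GPU. If a matching version of CUDA is installed correctly, the GPU can be used to speed things up:

# Print the compute device that will be used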
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

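If this prints cuda, a GPU build of PyTorch is installed and working in the current environment.

3. Read and preprocess the data

Feature and label matrices

Define the path of the folder that holds the data files:

# Raw string, so the backslashes of the Windows-style path are not read as escape sequences
path = Path(r'..\..\pytorch-GNN-1st-main\pytorch-GNN-1st-main\data\第9章28\第9章28\cora')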

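Two data files need to be read:

cora.content: the data matrix in this file has shape (2708, 1435). Each row is one sample, that is, one paper. The first column is the paper ID, the last column is the paper's class, and the 1433 columns in between are the encoded keywords of the paper. The structure is shown in the figure below.

# Read the paper content data into an array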
paper_features_label = np.genfromtxt(path / 'cora.content', dtype=np.str_)
print(paper_features_label.shape)
paper_features_label
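
To make the column layout concrete, here is a minimal sketch that parses three made-up rows written in the same layout. The rows, the four-column feature width, and the toy_ names are assumptions for illustration only; the real file has 2708 rows and 1433 feature columns.

```python
import io
import numpy as np

# Three made-up rows in the cora.content layout: <paper_id> <word flags...> <class_label>.
# Real rows carry 1433 word flags; four are used here to keep the example readable.
toy_content = io.StringIO(
    "31336 0 1 0 0 Neural_Networks\n"
    "1061127 1 0 0 1 Rule_Learning\n"
    "1106406 0 0 1 0 Reinforcement_Learning\n"
)

rows = np.genfromtxt(toy_content, dtype=np.str_)

toy_ids = rows[:, 0].astype(np.int32)         # first column: paper IDs
toy_words = rows[:, 1:-1].astype(np.float32)  # middle columns: word flags
toy_classes = rows[:, -1]                     # last column: class names

print(rows.shape, toy_ids, toy_words.shape, toy_classes)
```

Everything is read as strings first because the last column holds text; the numeric columns are converted afterwards.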

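paper_features_label first needs to be split up. The paper IDs in the first column are non-consecutive integers, so they are renumbered:

# Take the first column of the data: the paper IDs
papers = paper_features_label[:, 0].astype(np.int32)
# Renumber the papers, e.g. {31336: 0, 1061127: 1, ...}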
paper_id = {k: v for v, k in enumerate(papers)}
paper_id

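paper_id maps each original paper ID to its new consecutive index.

Take the encoded words in the middle columns as the feature matrix:

# Take the word flags in the middle columns and convert them to a (sparse) matrix
features = csr_matrix(paper_features_label[:, 1:-1].astype(np.float32))
print(np.shape(features))
# Convert the sparse matrix to a dense matrix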
features.todense()

Encode the labels in the last column:

# Take the class name in the last column and convert it to a class index
labels = paper_features_label[:, -1]
lbl2idx = {k: v for v, k in enumerate(sorted(np.unique(labels)))}
labels = [lbl2idx[e] for e in labels]
print(lbl2idx, labels[:5])

Relation matrix

The other data file is cora.cites:

# Read the citation data into an array
edges = np.genfromtxt(path / 'cora.cites', dtype=np.int32)
print(np.shape(edges))
edges

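The matrix in this file has shape (5429, 2), and each row is one citation relation. For example, the first row links the papers with IDs 35 and 1033. According to the dataset's README, the first column is the cited paper and the second column is the citing paper, so this row says that paper 1033 cites paper 35.

Because the paper IDs were renumbered while building the feature and label matrices, the paper IDs in this matrix have to be renumbered in the same way.

# Convert to relations between the renumbered nodes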
edges = np.asarray([paper_id[e] for e in edges.flatten()], np.int32).reshape(edges.shape)
print(edges.shape)
edges
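
The renumbering step can be checked on a toy example. This is a minimal sketch with made-up IDs and edges; the toy_ names are hypothetical and only mirror the variables used above.

```python
import numpy as np

# Hypothetical raw paper IDs, in the order they appear in the content file.
toy_papers = np.array([35, 1033, 40, 114], dtype=np.int32)
toy_paper_id = {raw: new for new, raw in enumerate(toy_papers.tolist())}

# Hypothetical citation pairs written with the raw IDs.
toy_edges = np.array([[35, 1033], [35, 114], [40, 114]], dtype=np.int32)

# Look up every raw ID, then restore the (n_edges, 2) layout.
remapped = np.asarray(
    [toy_paper_id[e] for e in toy_edges.flatten().tolist()], dtype=np.int32
).reshape(toy_edges.shape)

print(remapped.tolist())  # [[0, 1], [0, 3], [2, 3]]
```

After the remapping, every entry is a valid row or column index into an n×n adjacency matrix.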

Build the adjacency matrix from the edge list:

# Build the adjacency matrix; both rows and columns are indexed by paper
adj = coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])),
                 shape=(len(labels), len(labels)), dtype=np.float32)

adj.todense()

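Note that the adjacency matrix built above is that of a directed graph, whereas for citation relations the direction is not needed. Whether paper A cites paper B or paper B cites paper A, a citation between the two already suggests that the papers are related.

This raises a question: how is the adjacency matrix of a directed graph converted into that of an undirected graph? See the figure I put together below:

It may look hard at first glance, but if you draw the graph yourself and work through it once by hand, it becomes very clear.

With the figure understood, the computation follows directly from the formula:

# Convert the directed adjacency matrix into a symmetric (undirected) one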
adj_long = adj.multiply(adj.T < adj)
adj = adj_long + adj_long.T
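
It can help to run this recipe on a tiny graph. The sketch below uses a made-up 3-node graph and compares the recipe with an element-wise maximum, which is an alternative assumed here for comparison and not taken from the article. On this toy graph, a pair of nodes that cite each other is dropped by the strict comparison; whether such pairs occur in the real Cora data is not checked here.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy directed graph on 3 nodes: 0 -> 1 (one-way), and 1 <-> 2 (mutual citation).
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 1])
toy_adj = coo_matrix((np.ones(3), (rows, cols)), shape=(3, 3), dtype=np.float32).tocsr()

# Recipe used above: keep entries strictly larger than their mirror entry, then mirror them.
one_way = toy_adj.multiply(toy_adj.T < toy_adj)
sym_article = (one_way + one_way.T).toarray()

# Alternative (an assumption, not the article's method): element-wise maximum with the transpose.
sym_max = toy_adj.maximum(toy_adj.T).toarray()

print(sym_article)  # only the 0-1 edge survives
print(sym_max)      # both the 0-1 and the 1-2 edges are kept
```

Both results are symmetric; they differ only in how mutual citations are treated.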

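At this point the data preparation is essentially done, and we have everything the GCN model takes as input:

an undirected adjacency matrix, adj;
a feature matrix, features;
a label matrix (vector), labels.

One thing still needs attention: the input feature matrix and the adjacency matrix should be normalized. The main reasons are:

Preventing vanishing or exploding gradients: in deep learning models, especially multi-layer networks, unnormalized data can lead to vanishing or exploding gradients. Normalization eases this because it keeps the data on roughly the same scale in every dimension.
Keeping feature scales consistent: in a GCN, node features and structural features (those expressed through the adjacency matrix) are equally important. Normalization keeps these two kinds of data on a comparable scale, so neither dominates training.
Improving stability and convergence speed: normalization improves the numerical stability of the model and can speed up convergence. When the data varies within a small range, optimizers such as gradient descent find a good solution more easily.
The special role of the adjacency matrix: in a GCN, the adjacency matrix propagates node features and thereby captures the graph structure. Without normalization, a node's features are scaled up or down by its degree (the number of edges attached to it), which makes the propagation unbalanced. Normalizing, for example with the inverse square root of the degree matrix, standardizes each node's contribution so that propagation is more effective and balanced.

Define a function that normalizes a matrix, then apply it to the feature matrix and the adjacency matrix: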

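def normalize(mx):  # Row-normalize a (sparse) matrix
    '''Row-normalize sparse matrix'''
    rowsum = np.array(mx.sum(1))  # Sum of each row (for features: the number of words flagged in each paper)
    r_inv = (rowsum ** -1.0).flatten()  # Reciprocal of each row sum
    r_inv[np.isinf(r_inv)] = 0.  # An all-zero row has rowsum 0, so its reciprocal is inf; set it to 0
    r_mat_inv = diags(r_inv)  # Put the reciprocals on the diagonal of a matrix
    mx = r_mat_inv.dot(mx)  # Left-multiplying divides every element by its row sum
    return mx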


# Normalize the features matrix (each row sums to 1)
features = normalize(features)

# Add 1 to the diagonal of the adjacency matrix (self-loops), then normalize it as well
adj = normalize(adj + eye(adj.shape[0]))
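
The normalize function above divides each row by its sum, while the list of reasons also mentions the inverse square root of the degree matrix. The sketch below puts the two schemes side by side on a made-up 3-node path graph. The names row_normalize and sym_normalize are hypothetical helpers written for this comparison, and the symmetric variant is shown only for reference; it is not what the code in this article uses.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, eye


def row_normalize(mx):
    """D^-1 * M: every non-empty row sums to 1 (same idea as normalize above)."""
    rowsum = np.asarray(mx.sum(1)).flatten()
    with np.errstate(divide='ignore'):
        r_inv = 1.0 / rowsum
    r_inv[np.isinf(r_inv)] = 0.0
    return diags(r_inv).dot(mx)


def sym_normalize(mx):
    """D^-1/2 * M * D^-1/2: the symmetric variant, shown only for comparison."""
    rowsum = np.asarray(mx.sum(1)).flatten()
    with np.errstate(divide='ignore'):
        d_inv_sqrt = 1.0 / np.sqrt(rowsum)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0
    d_mat = diags(d_inv_sqrt)
    return d_mat.dot(mx).dot(d_mat)


# Toy undirected path graph 0 - 1 - 2, plus self-loops.
toy_adj = csr_matrix(np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=np.float32))
toy_hat = toy_adj + eye(3, dtype=np.float32)

row_norm = row_normalize(toy_hat).toarray()
sym_norm = sym_normalize(toy_hat).toarray()

print(row_norm.sum(axis=1))               # every row sums to 1
print(np.allclose(sym_norm, sym_norm.T))  # the symmetric variant stays symmetric
```

Row normalization turns each propagation step into an average over a node and its neighbours; the symmetric variant keeps the matrix symmetric but its rows no longer sum to 1.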

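4. Split the dataset and move it to the device

First convert the NumPy/SciPy data to tensors:

# Data as tensors
adj = torch.tensor(adj.toarray(), dtype=torch.float32)  # Relations between nodes
features = torch.tensor(features.toarray(), dtype=torch.float32)  # Features of each node
labels = torch.LongTensor(labels)  # Class label of each node

Then split the dataset:

# Split the dataset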
n_train = 200
n_val = 300
n_test = len(features) - n_train - n_val

np.random.seed(34)
idxs = np.random.permutation(len(features))  # Shuffle the original indices

# Indices of each subset
idx_train = torch.LongTensor(idxs[:n_train])
idx_val = torch.LongTensor(idxs[n_train:n_train + n_val])
idx_test = torch.LongTensor(idxs[n_train + n_val:])
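
As a quick sanity check on this kind of split, the sketch below repeats the slicing with plain NumPy and confirms that the three index sets are disjoint and cover all nodes. It uses a local RandomState with the same seed value; the variable names are illustrative assumptions, and the node count 2708 is the row count reported for cora.content above.

```python
import numpy as np

n_nodes, n_train, n_val = 2708, 200, 300

rng = np.random.RandomState(34)  # same seed value as above, but a local generator
idxs = rng.permutation(n_nodes)

train_idx = idxs[:n_train]
val_idx = idxs[n_train:n_train + n_val]
test_idx = idxs[n_train + n_val:]

# The three slices never overlap and together cover every node exactly once.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(train_idx), len(val_idx), len(test_idx))  # 200 300 2208
print(len(np.unique(all_idx)) == n_nodes)           # True
```

Slicing one permutation guarantees that no node ends up in two subsets.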

Then move all the data to the selected device (the GPU when one is available):

# Move everything to the compute device
adj = adj.to(device)
features = features.to(device)
labels = labels.to(device)
idx_train = idx_train.to(device)
idx_val = idx_val.to(device)
idx_test = idx_test.to(device)

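5. Single graph convolution layer design

The computation of a single graph convolution layer is shown in the figure below:

In essence, the input feature matrix is first projected to a higher or lower dimension and then left-multiplied by the adjacency matrix, which folds the relations between nodes into the network.

Put loosely, a GCN layer is an ordinary network layer (here a plain linear projection) left-multiplied by an adjacency matrix, and it is the graph structure stored in that adjacency matrix that lets features propagate between connected nodes.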

class GraphConvolutionLayer(nn.Module):
    """
    Graph convolution: a single graph convolution layer
    """

    def __init__(self, f_in: int, f_out: int, use_bias: bool = True, activation='mish'):
        """
        Initialize the layer
        :param f_in: number of input features per sample
        :param f_out: number of output features per sample
        :param use_bias: whether to use a bias term
        :param activation: 'mish' (default), a callable, or None for no activation
        """
        super().__init__()
        self.f_in = f_in
        self.f_out = f_out
        self.use_bias = use_bias

        # Define the weight and bias parameters (values are set in initialize_weights)
        self.weight = nn.Parameter(torch.empty(f_in, f_out))
        self.bias = nn.Parameter(torch.empty(f_out)) if use_bias else None

        # Activation function: Mish by default, None disables it
        if activation == 'mish':
            self.activation = lambda x: x * torch.tanh(nn_fun.softplus(x))
        else:
            self.activation = activation

        # Initialize the learnable parameters
        self.initialize_weights()

    def initialize_weights(self):
        """
        Initialize the parameters
        """
        # Initialize the weights
        if self.activation is None:  # No activation: Xavier initialization
            nn.init.xavier_uniform_(self.weight)
        else:  # With a ReLU-like activation: Kaiming initialization
            nn.init.kaiming_uniform_(self.weight, nonlinearity='leaky_relu')

        # Initialize the bias
        if self.use_bias:
            nn.init.zeros_(self.bias)

    def forward(self, f_mat, adj_mat) -> torch.Tensor:
        """
        Forward pass
        :param f_mat: input feature matrix, shape n×feature_in
        :param adj_mat: adjacency matrix of the samples, shape n×n
        :return: output feature matrix, shape n×feature_out; for the computation see Figure 9-14 on p. 285
        """
        # f_mat: n×feature_in, self.weight: feature_in×feature_out, support: n×feature_out
        support = torch.mm(f_mat, self.weight)

        # The step that sets a GCN layer apart: every layer multiplies by the adjacency matrix
        output = torch.mm(adj_mat, support)  # adj_mat: n×n, support: n×feature_out, output: n×feature_out

        if self.use_bias:  # Add the bias, broadcast over the rows
            output = output + self.bias

        if self.activation is not None:  # Apply the activation function if there is one
            output = self.activation(output)
        return output
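
The two matrix products in forward can be traced by hand on a tiny graph. The sketch below is independent of the class above and uses made-up sizes and random values; it only illustrates that, with a row-normalized adjacency matrix, each output row is a weighted average of the projected features of a node and its neighbours.

```python
import torch

torch.manual_seed(0)

# Toy graph: 3 nodes, 4 input features, 2 output features (all sizes are arbitrary).
toy_adj = torch.tensor([[0.5, 0.5, 0.0],
                        [1 / 3, 1 / 3, 1 / 3],
                        [0.0, 0.5, 0.5]])  # row-normalized adjacency with self-loops
toy_x = torch.rand(3, 4)                   # node features, n x f_in
toy_w = torch.rand(4, 2)                   # weight matrix, f_in x f_out
toy_b = torch.zeros(2)                     # bias, f_out

support = torch.mm(toy_x, toy_w)           # project the features: n x f_out
out = torch.mm(toy_adj, support) + toy_b   # mix each node with its neighbours

# Node 0 only sees itself and node 1, so its output is the mean of their projections.
expected_row0 = 0.5 * support[0] + 0.5 * support[1]
print(out.shape, torch.allclose(out[0], expected_row0))
```

Stacking several such layers lets information travel further: after k layers a node has mixed in features from nodes up to k hops away.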

5. Multi-layer graph convolution network design

class GCN(nn.Module):
    """
    Graph convolution: the multi-layer graph convolution network
    """

    def __init__(self, f_in: int, n_classes: int, hidden: list, dropout_p: float = 0.5):
        """
        Initialize the network
        :param f_in: number of input features per sample
        :param n_classes: number of classes produced by the final layer
        :param hidden: output feature counts (f_out) of the hidden layers
        :param dropout_p: dropout probability
        """
        super().__init__()
        # Create the graph convolution layers in a loop
        self.layers = nn.Sequential()
        for i, (f_in, f_out) in enumerate(zip([f_in] + hidden[:-1], hidden)):
            self.layers.add_module(f'GCN Layer-{i}', GraphConvolutionLayer(f_in, f_out))

        # Dropout layer
        self.layers.add_module('Dropout Layer', nn.Dropout(dropout_p))
        # Final output layer: no activation, it returns raw scores for the cross-entropy loss
        self.layers.add_module('Output Layer', GraphConvolutionLayer(hidden[-1], n_classes, activation=None))

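    def forward(self, f_mat, adj_mat):
        """
        Forward pass
        :param f_mat: input feature matrix, shape n×feature_in
        :param adj_mat: adjacency matrix of the samples, shape n×n
        :return: class scores, shape n×n_classes; for the computation see Figure 9-14 on p. 285
        """
        for layer in self.layers:
            # The Dropout layer only takes the features; graph layers also need the adjacency matrix
            f_mat = layer(f_mat) if isinstance(layer, nn.Dropout) else layer(f_mat, adj_mat)

        return f_mat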
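
The zip expression in the constructor decides how wide each layer is. The sketch below isolates that pairing logic in a hypothetical helper, layer_sizes, which is not part of the article's code, and prints the widths for the configuration used later in this article.

```python
def layer_sizes(f_in, hidden, n_classes):
    """Return the (input, output) width of every graph convolution layer, in order."""
    pairs = list(zip([f_in] + hidden[:-1], hidden))  # hidden layers
    pairs.append((hidden[-1], n_classes))            # output layer
    return pairs


# Widths used later in the article: 1433 input features, hidden=[16, 32, 16], 7 classes.
print(layer_sizes(1433, [16, 32, 16], 7))
# [(1433, 16), (16, 32), (32, 16), (16, 7)]
```

Each layer's output width is the next layer's input width, so only the first input size and the list of hidden widths have to be given.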

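6. Create and test the model

Get the number of classes and the number of features per node:

n_labels = labels.max().item() + 1  # Number of classes: 7
n_features = features.shape[1]  # Number of features per node: 1433
print(n_labels, n_features)

Create the model and check the input and output shapes of each layer: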

model = GCN(n_features, n_labels, hidden=[16, 32, 16]).to(device)

# Test the graph neural network with dummy inputs of the right shapes
demo_features, demo_adj = torch.ones_like(features).to(device), torch.ones_like(adj).to(device)

# Create a PrettyTable object
table = PrettyTable()
table.field_names = ['Index', "Layer Name", "Input Shape", "Output Shape"]  # Set the header

for i, layer in enumerate(model.layers):
    input_shape = demo_features.shape
    demo_features = layer(demo_features) if isinstance(layer, nn.Dropout) else layer(demo_features, demo_adj)
    table.add_row([i, layer.__class__.__name__, input_shape, demo_features.shape])

# Print the table
print(table)

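7. Create the training and evaluation functions

# Import the Ranger optimizer (a third-party optimizer, not part of PyTorch)
from ranger import Ranger

# Create a multi-layer GCN and move it to the compute device
model = GCN(n_features, n_labels, hidden=[16, 32, 16]).to(device)

# Create the optimizer
optimizer = Ranger(model.parameters())

# Accuracy function: mean accuracy over all predictions passed in.
# output.argmax(1) and y are 1-D tensors with one entry per selected node.
get_acc = lambda output, y: (output.argmax(1) == y).type(torch.float32).mean().item()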


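def train():
    """
    Training step
    :return: training loss and accuracy
    """
    model.train()  # Switch to training mode (Dropout enabled)
    optimizer.zero_grad()  # Reset the gradients
    output = model(features, adj)  # Forward pass over the whole graph
    loss = nn_fun.cross_entropy(output[idx_train], labels[idx_train])  # Cross-entropy loss on the training nodes
    acc = get_acc(output[idx_train], labels[idx_train])  # Accuracy on the training nodes
    loss.backward()  # Backpropagate the loss
    optimizer.step()  # Update the parameters
    return loss.item(), acc  # Return the training loss and accuracy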


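def evaluate(idx):
    """
    Evaluation function
    :param idx: indices of the rows (nodes) to evaluate
    :return: loss and accuracy
    """
    model.eval()  # Switch to evaluation mode (Dropout disabled)
    with torch.no_grad():  # No gradients are needed for evaluation
        output = model(features, adj)
        loss = nn_fun.cross_entropy(output[idx], labels[idx]).item()  # Cross-entropy loss
    return loss, get_acc(output[idx], labels[idx])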
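
Both functions run the model on the whole graph and then score only the rows selected by an index tensor. The sketch below shows that indexing pattern on made-up scores and labels; the toy_ tensors are assumptions for illustration and are not produced by the model above.

```python
import torch
from torch.nn import functional as F

# Made-up scores for 4 nodes and 3 classes, plus their true labels.
toy_output = torch.tensor([[2.0, 0.1, 0.1],
                           [0.1, 3.0, 0.1],
                           [0.2, 0.1, 1.5],
                           [1.0, 0.9, 0.1]])
toy_labels = torch.tensor([0, 1, 2, 1])

# Pretend that only nodes 0 and 3 belong to the subset being scored.
toy_idx = torch.tensor([0, 3])

loss = F.cross_entropy(toy_output[toy_idx], toy_labels[toy_idx]).item()
acc = (toy_output[toy_idx].argmax(1) == toy_labels[toy_idx]).float().mean().item()

print(f'loss={loss:.4f}, acc={acc:.2f}')  # node 0 is right, node 3 is wrong, so acc=0.50
```

Because the forward pass always sees every node, the unselected nodes still contribute their features through the adjacency matrix; only the loss and accuracy are restricted to the subset.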

8. Train the model

# Train the model
epochs = 1000
print_steps = 50
train_loss, train_acc = [], []
val_loss, val_acc = [], []

for i in range(epochs):
    tl, ta = train()
    train_loss.append(tl)
    train_acc.append(ta)
    if (i + 1) % print_steps == 0 or i == 0:
        # tl, ta = evaluate(idx_train)
        vl, va = evaluate(idx_val)
        val_loss.append(vl)
        val_acc.append(va)

        hues.log(f'[{i + 1:4d}/{epochs}]: train_loss={tl:.4f}, train_acc={ta:.4f}' +
                 f', val_loss={vl:.4f}, val_acc={va:.4f}')
# Print the final results
final_train, final_val, final_test = evaluate(idx_train), evaluate(idx_val), evaluate(idx_test)
hues.success(f'Train     : loss={final_train[0]:.4f}, accuracy={final_train[1]:.4f}')
hues.success(f'Test      : loss={final_test[0]:.4f}, accuracy={final_test[1]:.4f}')
hues.success(f'Validation: loss={final_val[0]:.4f}, accuracy={final_val[1]:.4f}')

9. Loss and accuracy visualization

# Visualize the training process
from matplotlib.ticker import FuncFormatter

plt.rcParams.update({
    'font.size': 15,
    'font.family': ['Times New Roman', 'SimSun']
})

# Iterations (0-based) at which the validation metrics were recorded in the training loop
eval_idx = [i for i in range(epochs) if (i + 1) % print_steps == 0 or i == 0]
eval_epochs = [i + 1 for i in eval_idx]

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
axes[0].plot(eval_epochs, [train_loss[i] for i in eval_idx], label='Train')
axes[0].plot(eval_epochs, val_loss, label='Validation')
axes[1].plot(eval_epochs, [train_acc[i] for i in eval_idx], label='Train')
axes[1].plot(eval_epochs, val_acc, label='Validation')

# Show the accuracy axis as percentages
axes[1].yaxis.set_major_formatter(FuncFormatter(lambda temp, _: '%1.0f' % (100 * temp) + '%'))

for ax, t in zip(axes, ['Loss', 'Accuracy']):
    ax.set_title(t, size=15)
    ax.set_xlabel('Epochs')

axes[0].set_ylabel('Value')
axes[1].set_ylabel('Accuracy(%)')

lines, texts = fig.axes[-1].get_legend_handles_labels()
fig.legend(lines, texts, ncol=2, loc='lower center', bbox_to_anchor=(0.5, -.1), markerscale=2)

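10. Inspect the model's predictions

# Output the model's predictions
model.eval()  # Make sure Dropout is disabled
with torch.no_grad():
    output = model(features, adj)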

samples = 10
idx_sample = idx_test[torch.randperm(len(idx_test))[:samples]]

idx2lbl = {v: k for k, v in lbl2idx.items()}
df = pd.DataFrame({'Real': [idx2lbl[e] for e in labels[idx_sample].tolist()],
                   'Pred': [idx2lbl[e] for e in output[idx_sample].argmax(1).tolist()]})
df