Author | Sven Balnojan
Source | Medium
Editor | Code Doctor Team
PyTorch BigGraph is a tool for creating and handling large graph embeddings for machine learning. There are currently two approaches to graph-based neural networks: you can either feed the graph structure directly into a network, as graph convolutional networks do, or you can first learn an embedding for every node and then work with those embeddings.
PyTorch BigGraph handles the second approach, and so will we below. Just for reference, a quick word on sizes: graphs are usually encoded by their adjacency matrix. A graph with 3,000 nodes and an edge between every pair of nodes ends up with roughly 10,000,000 entries in the matrix. Even if that matrix is sparse, it obviously blows up most GPUs.
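To make the size argument concrete, a quick back-of-the-envelope sketch (the node count is the one from the paragraph above; the byte figures assume float32 entries):

```python
# Dense adjacency matrix for a 3,000-node graph: one entry per node pair.
n_nodes = 3_000
entries = n_nodes * n_nodes
print(entries)  # 9,000,000 entries, roughly the 10 million quoted above

# At 4 bytes per float32 entry that is ~36 MB for this one small graph; a
# 3-million-node graph of recommender-system scale would need ~36 TB,
# which is why BigGraph works from files instead of GPU memory.
megabytes = entries * 4 / 1e6
print(megabytes)  # 36.0
```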
If you look at the graphs commonly used in recommender systems, you will find they are usually much larger than that. I was interested in applying BigGraph to a machine-learning problem using the simplest possible examples, so I built two, which we will walk through step by step.
The complete, refactored code is available on GitHub. It is adapted from an example in the BigGraph repository.
https://github.com/sbalnojan/biggraph-examples
https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/examples/livejournal.py
The first example is a part of the LiveJournal graph; the data looks like this:
# FromNodeId ToNodeId
0 1
0 2
0 3
...
0 10
0 11
0 12
...
0 46
1 0
...
The second example is just 8 nodes with edges:
# FromNodeId ToNodeId
0 1
0 2
0 3
0 4
1 0
1 2
1 3
1 4
2 1
2 3
2 4
3 1
3 2
3 4
3 7
4 1
5 1
6 2
7 3
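Edge lists like these are simple to read by hand; a minimal sketch of parsing this whitespace-separated format into pairs (the data is inlined here instead of being read from the example file):

```python
# A few lines in the same format as the edge lists above.
raw = """# FromNodeId ToNodeId
0 1
0 2
1 0
"""

# Skip comment lines, split each remaining line into a (from, to) pair.
edges = [
    tuple(line.split())
    for line in raw.splitlines()
    if line and not line.startswith("#")
]
print(edges)  # [('0', '1'), ('0', '2'), ('1', '0')]
```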
Embedding a part of the LiveJournal graph
BigGraph is built to work around machine memory limits, so it is entirely file-based. You have to trigger processes to create the appropriate file structure, and if you want to re-run an example you have to delete the checkpoints first. You also have to split into train and test beforehand, again as files. The file format is TSV, tab-separated values.
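Since everything lives in files, a small hypothetical helper for wiping the checkpoint directory and the split files before a re-run can be handy (the paths match the examples below; this helper is our own, not part of BigGraph):

```python
import os
import shutil

def clean_run_artifacts(checkpoint_dir, *split_files):
    # Remove the model checkpoints so BigGraph trains from scratch.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
    # Remove the previously written train/test split files, if any.
    for path in split_files:
        if os.path.exists(path):
            os.remove(path)

clean_run_artifacts("model/example_1",
                    "data/example_1/train.txt",
                    "data/example_1/test.txt")
```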
Let's jump right in. The first code snippet declares two helper functions taken from the BigGraph source, sets a few constants, and runs the file split.
import os
import random

def convert_path(fname):
    basename, _ = os.path.splitext(fname)
    out_dir = basename + '_partitioned'
    return out_dir

def random_split_file(fpath):
    root = os.path.dirname(fpath)
    output_paths = [
        os.path.join(root, FILENAMES['train']),
        os.path.join(root, FILENAMES['test']),
    ]
    if all(os.path.exists(path) for path in output_paths):
        print("Found some files that indicate that the input data "
              "has already been shuffled and split, not doing it again.")
        print("These files are: %s" % ", ".join(output_paths))
        return
    print('Shuffling and splitting train/test file. This may take a while.')
    train_file = os.path.join(root, FILENAMES['train'])
    test_file = os.path.join(root, FILENAMES['test'])
    print('Reading data from file: ', fpath)
    with open(fpath, "rt") as in_tf:
        lines = in_tf.readlines()
    # The first few lines are comments
    lines = lines[4:]
    print('Shuffling data')
    random.shuffle(lines)
    split_len = int(len(lines) * TRAIN_FRACTION)
    print('Splitting to train and test files')
    with open(train_file, "wt") as out_tf_train:
        for line in lines[:split_len]:
            out_tf_train.write(line)
    with open(test_file, "wt") as out_tf_test:
        for line in lines[split_len:]:
            out_tf_test.write(line)

DATA_PATH = "data/example_1/example.txt"
DATA_DIR = "data/example_1"
CONFIG_PATH = "config_1.py"
FILENAMES = {
    'train': 'train.txt',
    'test': 'test.txt',
}
TRAIN_FRACTION = 0.75

random_split_file(DATA_PATH)
Helper functions and the random_split_file call
This splits the edges into a test and a training set by creating the two files data/example_1/test.txt and train.txt. Next we use BigGraph's converter to create the file-based structure for our dataset. It will be "partitioned" into 1 partition. For that we already need part of the config file. Here are the relevant parts of it: the I/O data section and the graph structure.
entities_base = 'data/example_1'

def get_torchbiggraph_config():
    config = dict(
        # I/O data
        entity_path=entities_base,
        edge_paths=[],
        checkpoint_path='model/example_1',
        # Graph structure
        entities={
            'user_id': {'num_partitions': 1},
        },
        relations=[{
            'name': 'follow',
            'lhs': 'user_id',
            'rhs': 'user_id',
            'operator': 'none',
        }],
        ...
This tells BigGraph where to find the data and how to interpret the tab-separated values. With this config we can run the next Python snippet.
edge_paths = [os.path.join(DATA_DIR, name) for name in FILENAMES.values()]

from torchbiggraph.converters.import_from_tsv import convert_input_data

convert_input_data(
    CONFIG_PATH,
    edge_paths,
    lhs_col=0,
    rhs_col=1,
    rel_col=None,
)
Converting the data into a _partitioned structure
The result should be a bunch of new files in the data directory, among them dictionary.json, which is important later for mapping BigGraph's output back to the embeddings we actually want. Enough preparation; to train the embeddings, take another look at config_1.py, which contains three more relevant sections.
        # Scoring model - the embedding size
        dimension=1024,
        global_emb=False,
        # Training - the epochs to train and the learning rate
        num_epochs=10,
        lr=0.001,
        # Misc - not important
        hogwild_delay=2,
    )
    return config
To train, run the following Python code.
from torchbiggraph.config import parse_config
import attr
train_config = parse_config(CONFIG_PATH)
train_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['train']))]
train_config = attr.evolve(train_config, edge_paths=train_path)
from torchbiggraph.train import train
train(train_config)
Training the embeddings
You can evaluate the model on the test set, based on a set of pre-packaged metrics, with this snippet.
from torchbiggraph.eval import do_eval
eval_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['test']))]
eval_config = attr.evolve(train_config, edge_paths=eval_path)
do_eval(eval_config)
Evaluating the embeddings
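do_eval prints ranking metrics, among them the mean reciprocal rank (MRR). As a reminder of what that number measures, a tiny hand-rolled sketch (our own helper, not BigGraph's implementation):

```python
def mean_reciprocal_rank(ranks):
    # ranks: 1-based positions at which the true edge was ranked
    # among the candidate edges; higher MRR (closer to 1.0) is better.
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three test edges ranked 1st, 2nd and 4th among the candidates:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```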
Now let's retrieve the actual embeddings. Again, because everything is file-based, you should now find .h5 files in the model/ folder. We can load the embedding of user 0 by looking up its offset in the dictionary, like so:
import json
import h5py

with open(os.path.join(DATA_DIR, "dictionary.json"), "rt") as tf:
    dictionary = json.load(tf)

user_id = "0"
offset = dictionary["entities"]["user_id"].index(user_id)
print("our offset for user_id ", user_id, " is: ", offset)

# the 0 in the filename is the partition number, not the user id
with h5py.File("model/example_1/embeddings_user_id_0.v10.h5", "r") as hf:
    embedding = hf["embeddings"][offset, :]
    print(embedding)
    print(embedding.shape)
Printing the embedding
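For reference, dictionary.json is assumed to look roughly like the sketch below: a list of relation names plus, per entity type, the entity ids in offset order. The .index() lookup above works against that list:

```python
import json

# Assumed shape of dictionary.json (ids ordered the way BigGraph
# assigned offsets); the real file holds all ids of the dataset.
dictionary = json.loads("""
{
  "relations": ["follow"],
  "entities": {"user_id": ["0", "1", "2", "3"]}
}
""")

offset = dictionary["entities"]["user_id"].index("2")
print(offset)  # 2 here, but in general the offset need not equal the id
```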
Now let's switch to the second example, a constructed one that will hopefully do something moderately useful. The LiveJournal data is simply too large to run in a reasonable amount of time.
A constructed example for link prediction and ranking
We will repeat the same steps for the second example, except that we will produce embeddings of dimension 10, so we can actually look at them and work with them. With dimension 10, 8 vertices are also plenty. Let's set those things up in config_2.py.
entities_base = 'data/example_2'

def get_torchbiggraph_config():
    config = dict(
        # I/O data
        entity_path=entities_base,
        edge_paths=[],
        checkpoint_path='model/example_2',
        # Graph structure
        entities={
            'user_id': {'num_partitions': 1},
        },
        relations=[{
            'name': 'follow',
            'lhs': 'user_id',
            'rhs': 'user_id',
            'operator': 'none',
        }],
        # Scoring model
        dimension=10,
        global_emb=False,
        # Training
        num_epochs=10,
        lr=0.001,
        # Misc
        hogwild_delay=2,
    )
    return config
Then we run the same code as before, but all in one go, accounting for the different file paths and format. In this case there are only 3 comment lines at the top of the data file:
import os
import random

"""
adapted from https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/examples/livejournal.py
"""

FILENAMES = {
    'train': 'train.txt',
    'test': 'test.txt',
}
TRAIN_FRACTION = 0.75

def convert_path(fname):
    basename, _ = os.path.splitext(fname)
    out_dir = basename + '_partitioned'
    return out_dir

def random_split_file(fpath):
    root = os.path.dirname(fpath)
    output_paths = [
        os.path.join(root, FILENAMES['train']),
        os.path.join(root, FILENAMES['test']),
    ]
    if all(os.path.exists(path) for path in output_paths):
        print("Found some files that indicate that the input data "
              "has already been shuffled and split, not doing it again.")
        print("These files are: %s" % ", ".join(output_paths))
        return
    print('Shuffling and splitting train/test file. This may take a while.')
    train_file = os.path.join(root, FILENAMES['train'])
    test_file = os.path.join(root, FILENAMES['test'])
    print('Reading data from file: ', fpath)
    with open(fpath, "rt") as in_tf:
        lines = in_tf.readlines()
    # The first few lines are comments
    lines = lines[3:]
    print('Shuffling data')
    random.shuffle(lines)
    split_len = int(len(lines) * TRAIN_FRACTION)
    print('Splitting to train and test files')
    with open(train_file, "wt") as out_tf_train:
        for line in lines[:split_len]:
            out_tf_train.write(line)
    with open(test_file, "wt") as out_tf_test:
        for line in lines[split_len:]:
            out_tf_test.write(line)

DATA_PATH = "data/example_2/example.txt"
DATA_DIR = "data/example_2"
CONFIG_PATH = "config_2.py"

random_split_file(DATA_PATH)

edge_paths = [os.path.join(DATA_DIR, name) for name in FILENAMES.values()]

from torchbiggraph.converters.import_from_tsv import convert_input_data

convert_input_data(
    CONFIG_PATH,
    edge_paths,
    lhs_col=0,
    rhs_col=1,
    rel_col=None,
)

from torchbiggraph.config import parse_config
import attr

train_config = parse_config(CONFIG_PATH)
train_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['train']))]
train_config = attr.evolve(train_config, edge_paths=train_path)

from torchbiggraph.train import train

train(train_config)

from torchbiggraph.eval import do_eval

eval_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['test']))]
eval_config = attr.evolve(train_config, edge_paths=eval_path)

do_eval(eval_config)

import json
import h5py

with open(os.path.join(DATA_DIR, "dictionary.json"), "rt") as tf:
    dictionary = json.load(tf)

user_id = "0"
offset = dictionary["entities"]["user_id"].index(user_id)
print("our offset for user_id ", user_id, " is: ", offset)

with h5py.File("model/example_2/embeddings_user_id_0.v10.h5", "r") as hf:
    embedding_user_0 = hf["embeddings"][offset, :]
    embedding_all = hf["embeddings"][:]
    print(embedding_all)
    print(embedding_all.shape)
As the final output you should get a bunch of things, in particular all the embeddings. Let's do some basic tasks with them. Of course you could now take them and load them into whatever framework you like, Keras or TensorFlow, but BigGraph already ships implementations for common tasks such as link prediction and ranking, so let's try those. The first task is link prediction: we predict scores for the entity pairs 0-7 and 0-1, and as we know from the data, 0-1 should be much more likely.
print("Now let's do some simple things within torch:")

from torchbiggraph.model import DotComparator

src_entity_offset = dictionary["entities"]["user_id"].index("0")     # user 0
dest_1_entity_offset = dictionary["entities"]["user_id"].index("7")  # user 7, no edge 0-7
dest_2_entity_offset = dictionary["entities"]["user_id"].index("1")  # user 1, edge 0-1 exists
rel_type_index = dictionary["relations"].index("follow")  # note we only have one...

with h5py.File("model/example_2/embeddings_user_id_0.v10.h5", "r") as hf:
    src_embedding = hf["embeddings"][src_entity_offset, :]
    dest_1_embedding = hf["embeddings"][dest_1_entity_offset, :]
    dest_2_embedding = hf["embeddings"][dest_2_entity_offset, :]
    dest_embeddings = hf["embeddings"][...]

import torch

comparator = DotComparator()

scores_1, _, _ = comparator(
    comparator.prepare(torch.tensor(src_embedding.reshape([1, 1, 10]))),
    comparator.prepare(torch.tensor(dest_1_embedding.reshape([1, 1, 10]))),
    torch.empty(1, 0, 10),  # Left-hand side negatives, not needed
    torch.empty(1, 0, 10),  # Right-hand side negatives, not needed
)

scores_2, _, _ = comparator(
    comparator.prepare(torch.tensor(src_embedding.reshape([1, 1, 10]))),
    comparator.prepare(torch.tensor(dest_2_embedding.reshape([1, 1, 10]))),
    torch.empty(1, 0, 10),  # Left-hand side negatives, not needed
    torch.empty(1, 0, 10),  # Right-hand side negatives, not needed
)

print(scores_1)
print(scores_2)
As the comparator we loaded the DotComparator, which computes the dot product (scalar product) of the two 10-dimensional vectors. The resulting numbers are tiny, but at least scores_2 is much higher than scores_1, as expected.
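The DotComparator score is essentially just the dot product of the two embeddings; a plain NumPy sketch of the same comparison, with made-up 3-dimensional vectors standing in for the real 10-dimensional embeddings:

```python
import numpy as np

# Stand-ins for src_embedding, dest_1_embedding and dest_2_embedding above.
src = np.array([0.1, 0.2, 0.3])
dest_1 = np.array([0.3, -0.1, 0.0])  # plays the role of the non-edge 0-7
dest_2 = np.array([0.2, 0.4, 0.5])   # plays the role of the edge 0-1

score_1 = float(src @ dest_1)  # 0.01
score_2 = float(src @ dest_2)  # 0.25
print(score_1, score_2)        # the "edge" pair scores higher
```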
Finally, as a last piece of code, we can produce a ranking of similar items. It uses the same mechanism as before: the scalar product measures the distance from one embedding to all the other entities, and we rank by it.
print("finally, let's do some ranking...")

entity_count = 8
scores, _, _ = comparator(
    comparator.prepare(torch.tensor(src_embedding.reshape([1, 1, 10]))).expand(1, entity_count, 10),
    comparator.prepare(torch.tensor(dest_embeddings.reshape([1, 8, 10]))),
    torch.empty(1, 0, 10),  # Left-hand side negatives, not needed
    torch.empty(1, 0, 10),  # Right-hand side negatives, not needed
)

permutation = scores.flatten().argsort(descending=True)
top_entities = [dictionary["entities"]["user_id"][index] for index in permutation]
print(top_entities)
In this case the top entities come out in the order 0, 1, 3, 7, ... which looks pretty much right if you look at the data.
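The same ranking can be reproduced in plain NumPy once you have embedding_all in hand; a sketch with a deterministic stand-in matrix (not the trained embeddings, so the resulting order is only illustrative):

```python
import numpy as np

emb = np.arange(80, dtype=np.float32).reshape(8, 10)  # stand-in for embedding_all
scores = emb @ emb[0]          # dot product of user 0 against all 8 users
ranking = np.argsort(-scores)  # indices sorted by score, highest first
print(ranking.tolist())        # [7, 6, 5, 4, 3, 2, 1, 0] for this stand-in
```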
More fun
These are the most basic examples I could come up with. I did not run the original examples on the Freebase or LiveJournal data simply because they take quite a while to train. You can find the code and references here:
https://github.com/facebookresearch/PyTorch-BigGraph
https://github.com/sbalnojan/biggraph-examples
https://arxiv.org/pdf/1903.12287.pdf
https://arxiv.org/abs/1609.02907
Problems you might run into
I ran the code on a Mac and hit three problems: