Author | Sven Balnojan
Source | Medium
Editor | Code Doctor Team
PyTorch BigGraph is a tool for creating and handling large graph embeddings for machine learning. There are currently two approaches to graph-based neural networks: you can either feed the graph structure directly into a network, as graph convolutional networks do, or you can first learn an embedding for every node and then work with those embeddings.
PyTorch BigGraph handles the second approach, and so will we below. Just for reference, a quick word on sizes: graphs are usually encoded by their adjacency matrix. A graph with 3,000 nodes and an edge between every pair of nodes ends up with roughly 10,000,000 entries in the matrix. Even if that matrix is sparse, it obviously blows up most GPUs.
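To make the size argument concrete, a quick back-of-the-envelope sketch (the node count is the one from the paragraph above; the byte figures assume float32 entries):

```python
# Dense adjacency matrix for a 3,000-node graph: one entry per node pair.
n_nodes = 3_000
entries = n_nodes * n_nodes
print(entries)  # 9,000,000 entries, roughly the 10 million quoted above

# At 4 bytes per float32 entry that is ~36 MB for this one small graph; a
# 3-million-node graph of recommender-system scale would need ~36 TB,
# which is why BigGraph works from files instead of GPU memory.
megabytes = entries * 4 / 1e6
print(megabytes)  # 36.0
```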
If you look at the graphs commonly used in recommender systems, you will find they are usually much larger than that. I was interested in applying BigGraph to a machine-learning problem using the simplest possible examples, so I built two, which we will walk through step by step.
The complete, refactored code is available on GitHub. It is adapted from an example in the BigGraph repository.
https://github.com/sbalnojan/biggraph-examples
https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/examples/livejournal.py
The first example is a part of the LiveJournal graph; the data looks like this:
# FromNodeId ToNodeId
0 1
0 2
0 3
...
0 10
0 11
0 12
...
0 46
1 0
...
The second example is just 8 nodes with edges:
# FromNodeId ToNodeId
0 1
0 2
0 3
0 4
1 0
1 2
1 3
1 4
2 1
2 3
2 4
3 1
3 2
3 4
3 7
4 1
5 1
6 2
7 3
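Edge lists like these are simple to read by hand; a minimal sketch of parsing this whitespace-separated format into pairs (the data is inlined here instead of being read from the example file):

```python
# A few lines in the same format as the edge lists above.
raw = """# FromNodeId ToNodeId
0 1
0 2
1 0
"""

# Skip comment lines, split each remaining line into a (from, to) pair.
edges = [
    tuple(line.split())
    for line in raw.splitlines()
    if line and not line.startswith("#")
]
print(edges)  # [('0', '1'), ('0', '2'), ('1', '0')]
```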
Embedding a part of the LiveJournal graph
BigGraph is built to work around machine memory limits, so it is entirely file-based. You have to trigger processes to create the appropriate file structure, and if you want to re-run an example you have to delete the checkpoints first. You also have to split into train and test beforehand, again as files. The file format is TSV, tab-separated values.
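Since everything lives in files, a small hypothetical helper for wiping the checkpoint directory and the split files before a re-run can be handy (the paths match the examples below; this helper is our own, not part of BigGraph):

```python
import os
import shutil

def clean_run_artifacts(checkpoint_dir, *split_files):
    # Remove the model checkpoints so BigGraph trains from scratch.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
    # Remove the previously written train/test split files, if any.
    for path in split_files:
        if os.path.exists(path):
            os.remove(path)

clean_run_artifacts("model/example_1",
                    "data/example_1/train.txt",
                    "data/example_1/test.txt")
```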
Let's jump right in. The first code snippet declares two helper functions taken from the BigGraph source, sets a few constants, and runs the file split.
import os
import random

def convert_path(fname):
    basename, _ = os.path.splitext(fname)
    out_dir = basename + '_partitioned'
    return out_dir

def random_split_file(fpath):
    root = os.path.dirname(fpath)
    output_paths = [
        os.path.join(root, FILENAMES['train']),
        os.path.join(root, FILENAMES['test']),
    ]
    if all(os.path.exists(path) for path in output_paths):
        print("Found some files that indicate that the input data "
              "has already been shuffled and split, not doing it again.")
        print("These files are: %s" % ", ".join(output_paths))
        return
    print('Shuffling and splitting train/test file. This may take a while.')
    train_file = os.path.join(root, FILENAMES['train'])
    test_file = os.path.join(root, FILENAMES['test'])
    print('Reading data from file: ', fpath)
    with open(fpath, "rt") as in_tf:
        lines = in_tf.readlines()
    # The first few lines are comments
    lines = lines[4:]
    print('Shuffling data')
    random.shuffle(lines)
    split_len = int(len(lines) * TRAIN_FRACTION)
    print('Splitting to train and test files')
    with open(train_file, "wt") as out_tf_train:
        for line in lines[:split_len]:
            out_tf_train.write(line)
    with open(test_file, "wt") as out_tf_test:
        for line in lines[split_len:]:
            out_tf_test.write(line)

DATA_PATH = "data/example_1/example.txt"
DATA_DIR = "data/example_1"
CONFIG_PATH = "config_1.py"
FILENAMES = {
    'train': 'train.txt',
    'test': 'test.txt',
}
TRAIN_FRACTION = 0.75

random_split_file(DATA_PATH)
Helper functions and the random_split_file call
This splits the edges into a test and a training set by creating the two files data/example_1/test.txt and train.txt. Next we use BigGraph's converter to create the file-based structure for our dataset. It will be "partitioned" into 1 partition. For that we already need part of the config file. Here are the relevant parts of it: the I/O data section and the graph structure.
entities_base = 'data/example_1'

def get_torchbiggraph_config():
    config = dict(
        # I/O data
        entity_path=entities_base,
        edge_paths=[],
        checkpoint_path='model/example_1',
        # Graph structure
        entities={
            'user_id': {'num_partitions': 1},
        },
        relations=[{
            'name': 'follow',
            'lhs': 'user_id',
            'rhs': 'user_id',
            'operator': 'none',
        }],
        ...
This tells BigGraph where to find the data and how to interpret the tab-separated values. With this config we can run the next Python snippet.
edge_paths = [os.path.join(DATA_DIR, name) for name in FILENAMES.values()]

from torchbiggraph.converters.import_from_tsv import convert_input_data

convert_input_data(
    CONFIG_PATH,
    edge_paths,
    lhs_col=0,
    rhs_col=1,
    rel_col=None,
)
Converting the data into a _partitioned structure
The result should be a bunch of new files in the data directory, among them dictionary.json, which is important later for mapping BigGraph's output back to the embeddings we actually want. Enough preparation; to train the embeddings, take another look at config_1.py, which contains three more relevant sections.
        # Scoring model - the embedding size
        dimension=1024,
        global_emb=False,
        # Training - the epochs to train and the learning rate
        num_epochs=10,
        lr=0.001,
        # Misc - not important
        hogwild_delay=2,
    )
    return config
To train, run the following Python code.
from torchbiggraph.config import parse_config
import attr
train_config = parse_config(CONFIG_PATH)
train_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['train']))]
train_config = attr.evolve(train_config, edge_paths=train_path)
from torchbiggraph.train import train
train(train_config)
Training the embeddings
You can evaluate the model on the test set, based on a set of pre-packaged metrics, with this snippet.
from torchbiggraph.eval import do_eval
eval_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['test']))]
eval_config = attr.evolve(train_config, edge_paths=eval_path)
do_eval(eval_config)
Evaluating the embeddings
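do_eval prints ranking metrics, among them the mean reciprocal rank (MRR). As a reminder of what that number measures, a tiny hand-rolled sketch (our own helper, not BigGraph's implementation):

```python
def mean_reciprocal_rank(ranks):
    # ranks: 1-based positions at which the true edge was ranked
    # among the candidate edges; higher MRR (closer to 1.0) is better.
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three test edges ranked 1st, 2nd and 4th among the candidates:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```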
Now let's retrieve the actual embeddings. Again, because everything is file-based, you should now find .h5 files in the model/ folder. We can load the embedding of user 0 by looking up its offset in the dictionary, like so:
import json
import h5py

with open(os.path.join(DATA_DIR, "dictionary.json"), "rt") as tf:
    dictionary = json.load(tf)

user_id = "0"
offset = dictionary["entities"]["user_id"].index(user_id)
print("our offset for user_id ", user_id, " is: ", offset)

# the 0 in the filename is the partition number, not the user id
with h5py.File("model/example_1/embeddings_user_id_0.v10.h5", "r") as hf:
    embedding = hf["embeddings"][offset, :]
    print(embedding)
    print(embedding.shape)
Printing the embedding
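For reference, dictionary.json is assumed to look roughly like the sketch below: a list of relation names plus, per entity type, the entity ids in offset order. The .index() lookup above works against that list:

```python
import json

# Assumed shape of dictionary.json (ids ordered the way BigGraph
# assigned offsets); the real file holds all ids of the dataset.
dictionary = json.loads("""
{
  "relations": ["follow"],
  "entities": {"user_id": ["0", "1", "2", "3"]}
}
""")

offset = dictionary["entities"]["user_id"].index("2")
print(offset)  # 2 here, but in general the offset need not equal the id
```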
Now let's switch to the second example, a constructed one that will hopefully do something moderately useful. The LiveJournal data is simply too large to run in a reasonable amount of time.
A constructed example for link prediction and ranking
We will repeat the same steps for the second example, except that we will produce embeddings of dimension 10, so we can actually look at them and work with them. With dimension 10, 8 vertices are also plenty. Let's set those things up in config_2.py.
entities_base = 'data/example_2'

def get_torchbiggraph_config():
    config = dict(
        # I/O data
        entity_path=entities_base,
        edge_paths=[],
        checkpoint_path='model/example_2',
        # Graph structure
        entities={
            'user_id': {'num_partitions': 1},
        },
        relations=[{
            'name': 'follow',
            'lhs': 'user_id',
            'rhs': 'user_id',
            'operator': 'none',
        }],
        # Scoring model
        dimension=10,
        global_emb=False,
        # Training
        num_epochs=10,
        lr=0.001,
        # Misc
        hogwild_delay=2,
    )
    return config
Then we run the same code as before, but all in one go, accounting for the different file paths and format. In this case there are only 3 comment lines at the top of the data file:
import os
import random

"""
adapted from https://github.com/facebookresearch/PyTorch-BigGraph/blob/master/torchbiggraph/examples/livejournal.py
"""

FILENAMES = {
    'train': 'train.txt',
    'test': 'test.txt',
}
TRAIN_FRACTION = 0.75

def convert_path(fname):
    basename, _ = os.path.splitext(fname)
    out_dir = basename + '_partitioned'
    return out_dir

def random_split_file(fpath):
    root = os.path.dirname(fpath)
    output_paths = [
        os.path.join(root, FILENAMES['train']),
        os.path.join(root, FILENAMES['test']),
    ]
    if all(os.path.exists(path) for path in output_paths):
        print("Found some files that indicate that the input data "
              "has already been shuffled and split, not doing it again.")
        print("These files are: %s" % ", ".join(output_paths))
        return
    print('Shuffling and splitting train/test file. This may take a while.')
    train_file = os.path.join(root, FILENAMES['train'])
    test_file = os.path.join(root, FILENAMES['test'])
    print('Reading data from file: ', fpath)
    with open(fpath, "rt") as in_tf:
        lines = in_tf.readlines()
    # The first few lines are comments
    lines = lines[3:]
    print('Shuffling data')
    random.shuffle(lines)
    split_len = int(len(lines) * TRAIN_FRACTION)
    print('Splitting to train and test files')
    with open(train_file, "wt") as out_tf_train:
        for line in lines[:split_len]:
            out_tf_train.write(line)
    with open(test_file, "wt") as out_tf_test:
        for line in lines[split_len:]:
            out_tf_test.write(line)

DATA_PATH = "data/example_2/example.txt"
DATA_DIR = "data/example_2"
CONFIG_PATH = "config_2.py"

random_split_file(DATA_PATH)

edge_paths = [os.path.join(DATA_DIR, name) for name in FILENAMES.values()]

from torchbiggraph.converters.import_from_tsv import convert_input_data

convert_input_data(
    CONFIG_PATH,
    edge_paths,
    lhs_col=0,
    rhs_col=1,
    rel_col=None,
)

from torchbiggraph.config import parse_config
import attr

train_config = parse_config(CONFIG_PATH)
train_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['train']))]
train_config = attr.evolve(train_config, edge_paths=train_path)

from torchbiggraph.train import train

train(train_config)

from torchbiggraph.eval import do_eval

eval_path = [convert_path(os.path.join(DATA_DIR, FILENAMES['test']))]
eval_config = attr.evolve(train_config, edge_paths=eval_path)

do_eval(eval_config)

import json
import h5py

with open(os.path.join(DATA_DIR, "dictionary.json"), "rt") as tf:
    dictionary = json.load(tf)

user_id = "0"
offset = dictionary["entities"]["user_id"].index(user_id)
print("our offset for user_id ", user_id, " is: ", offset)

with h5py.File("model/example_2/embeddings_user_id_0.v10.h5", "r") as hf:
    embedding_user_0 = hf["embeddings"][offset, :]
    embedding_all = hf["embeddings"][:]
    print(embedding_all)
    print(embedding_all.shape)
As the final output you should get a bunch of things, in particular all the embeddings. Let's do some basic tasks with them. Of course you could now take them and load them into whatever framework you like, Keras or TensorFlow, but BigGraph already ships implementations for common tasks such as link prediction and ranking, so let's try those. The first task is link prediction: we predict scores for the entity pairs 0-7 and 0-1, and as we know from the data, 0-1 should be much more likely.
print("Now let's do some simple things within torch:")

from torchbiggraph.model import DotComparator

src_entity_offset = dictionary["entities"]["user_id"].index("0")     # user 0
dest_1_entity_offset = dictionary["entities"]["user_id"].index("7")  # user 7, no edge 0-7
dest_2_entity_offset = dictionary["entities"]["user_id"].index("1")  # user 1, edge 0-1 exists
rel_type_index = dictionary["relations"].index("follow")  # note we only have one...

with h5py.File("model/example_2/embeddings_user_id_0.v10.h5", "r") as hf:
    src_embedding = hf["embeddings"][src_entity_offset, :]
    dest_1_embedding = hf["embeddings"][dest_1_entity_offset, :]
    dest_2_embedding = hf["embeddings"][dest_2_entity_offset, :]
    dest_embeddings = hf["embeddings"][...]

import torch

comparator = DotComparator()

scores_1, _, _ = comparator(
    comparator.prepare(torch.tensor(src_embedding.reshape([1, 1, 10]))),
    comparator.prepare(torch.tensor(dest_1_embedding.reshape([1, 1, 10]))),
    torch.empty(1, 0, 10),  # Left-hand side negatives, not needed
    torch.empty(1, 0, 10),  # Right-hand side negatives, not needed
)

scores_2, _, _ = comparator(
    comparator.prepare(torch.tensor(src_embedding.reshape([1, 1, 10]))),
    comparator.prepare(torch.tensor(dest_2_embedding.reshape([1, 1, 10]))),
    torch.empty(1, 0, 10),  # Left-hand side negatives, not needed
    torch.empty(1, 0, 10),  # Right-hand side negatives, not needed
)

print(scores_1)
print(scores_2)
As the comparator we loaded the DotComparator, which computes the dot product (scalar product) of the two 10-dimensional vectors. The resulting numbers are tiny, but at least scores_2 is much higher than scores_1, as expected.
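The DotComparator score is essentially just the dot product of the two embeddings; a plain NumPy sketch of the same comparison, with made-up 3-dimensional vectors standing in for the real 10-dimensional embeddings:

```python
import numpy as np

# Stand-ins for src_embedding, dest_1_embedding and dest_2_embedding above.
src = np.array([0.1, 0.2, 0.3])
dest_1 = np.array([0.3, -0.1, 0.0])  # plays the role of the non-edge 0-7
dest_2 = np.array([0.2, 0.4, 0.5])   # plays the role of the edge 0-1

score_1 = float(src @ dest_1)  # 0.01
score_2 = float(src @ dest_2)  # 0.25
print(score_1, score_2)        # the "edge" pair scores higher
```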
Finally, as a last piece of code, we can produce a ranking of similar items. It uses the same mechanism as before: the scalar product measures the distance from one embedding to all the other entities, and we rank by it.
print("finally, let's do some ranking...")

entity_count = 8
scores, _, _ = comparator(
    comparator.prepare(torch.tensor(src_embedding.reshape([1, 1, 10]))).expand(1, entity_count, 10),
    comparator.prepare(torch.tensor(dest_embeddings.reshape([1, 8, 10]))),
    torch.empty(1, 0, 10),  # Left-hand side negatives, not needed
    torch.empty(1, 0, 10),  # Right-hand side negatives, not needed
)

permutation = scores.flatten().argsort(descending=True)
top_entities = [dictionary["entities"]["user_id"][index] for index in permutation]
print(top_entities)
In this case the top entities come out in the order 0, 1, 3, 7, ... which looks pretty much right if you look at the data.
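The same ranking can be reproduced in plain NumPy once you have embedding_all in hand; a sketch with a deterministic stand-in matrix (not the trained embeddings, so the resulting order is only illustrative):

```python
import numpy as np

emb = np.arange(80, dtype=np.float32).reshape(8, 10)  # stand-in for embedding_all
scores = emb @ emb[0]          # dot product of user 0 against all 8 users
ranking = np.argsort(-scores)  # indices sorted by score, highest first
print(ranking.tolist())        # [7, 6, 5, 4, 3, 2, 1, 0] for this stand-in
```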
More fun
These are the most basic examples I could come up with. I did not run the original examples on the Freebase or LiveJournal data simply because they take quite a while to train. You can find the code and references here:
https://github.com/facebookresearch/PyTorch-BigGraph
https://github.com/sbalnojan/biggraph-examples
https://arxiv.org/pdf/1903.12287.pdf
https://arxiv.org/abs/1609.02907
Problems you might run into
I ran the code on a Mac and hit three problems: