An Introduction to Reading and Writing TFRecord

haoxiang

Last modified 2022-08-11 16:51:17

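To read data efficiently, it helps to serialize the data and store it in a set of files (100-200 MB each) that can be read linearly. This is especially true when the data is streamed over a network. It is also useful for caching the results of any data preprocessing. The TFRecord format is a simple format for storing a sequence of binary records.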
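
The post does not spell out what "a sequence of binary records" looks like on disk. As a rough sketch, assuming the layout described in the TensorFlow TFRecord guide (a little-endian uint64 length, a masked CRC-32C of the length, the payload, then a masked CRC-32C of the payload), the framing can be reproduced in plain Python. The helper names crc32c, masked_crc32c, frame_record and read_records are made up for this illustration and are not TensorFlow APIs.

```python
import struct


def _make_crc32c_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    # Plain table-driven CRC-32C over the whole byte string.
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    # Assumed masking step: rotate right by 15 bits, then add a constant.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    # length (8 bytes) + CRC of length (4) + payload + CRC of payload (4)
    length = struct.pack('<Q', len(payload))
    return (length
            + struct.pack('<I', masked_crc32c(length))
            + payload
            + struct.pack('<I', masked_crc32c(payload)))


def read_records(blob: bytes):
    # Walk the buffer front to back, checking both checksums per record.
    offset = 0
    while offset < len(blob):
        length_bytes = blob[offset:offset + 8]
        (length,) = struct.unpack('<Q', length_bytes)
        (length_crc,) = struct.unpack('<I', blob[offset + 8:offset + 12])
        if length_crc != masked_crc32c(length_bytes):
            raise ValueError('corrupt length field')
        payload = blob[offset + 12:offset + 12 + length]
        (data_crc,) = struct.unpack(
            '<I', blob[offset + 12 + length:offset + 16 + length])
        if data_crc != masked_crc32c(payload):
            raise ValueError('corrupt payload')
        yield payload
        offset += 16 + length


blob = frame_record(b'first record') + frame_record(b'second')
records = list(read_records(blob))
```

Because each record carries its own length prefix, a reader only ever moves forward through the file, which is what makes the format suitable for linear reads and streaming.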

1. Writing a TFRecord

Feature data

feature_data = {
    'name': 'xiaoming',
    'age': 20,
    'height': 172.8,
    'scores': [[120,130,140],[82,95,43]]
}

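The tf.Example message (a protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout higher-level APIs such as TFX.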

import tensorflow as tf

example_proto = tf.train.Example(
    features=tf.train.Features(feature={
        # Wrap each standard value in a tf.train.Feature, the type tf.Example accepts
        'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'xiaoming'])),
        'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[20])),
        'height': tf.train.Feature(float_list=tf.train.FloatList(value=[172.8])),
        'scores': tf.train.Feature(bytes_list=tf.train.BytesList(
            # The simplest way to handle a non-scalar feature is to convert the
            # tensor to a binary string with tf.io.serialize_tensor
            value=[tf.io.serialize_tensor([[120,130,140],[82,95,43]]).numpy()]))
    })
)

""" Output:
features {
    feature {
        key: "age"
        value {
            int64_list {
                value: 20
            }
        }
    }
    feature {
        key: "height"
        value {
            float_list {
                value: 172.8000030517578
            }
        }
    }
    feature {
        key: "name"
        value {
            bytes_list {
                value: "xiaoming"
            }
        }
    }
    feature {
        key: "scores"
        value {
            bytes_list {
                value: "\010\003\022\010\022\002\010\002\022\002\010\003\"\030x\000\000\000\202\000\000\000\214\000\000\000R\000\000\000_\000\000\000+\000\000\000"
            }
        }
    }
}
"""

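Writing the nested constructors by hand gets repetitive once there are several records. One possible way to tidy this up, sketched here under the assumption of TensorFlow 2.x with eager execution, is a set of small wrapper functions. The names bytes_feature, int64_feature, float_feature, tensor_feature and make_example are hypothetical helpers for this illustration, not part of the TensorFlow API.

```python
import tensorflow as tf


def bytes_feature(value):
    # Accept str or bytes; BytesList itself only takes bytes.
    if isinstance(value, str):
        value = value.encode('utf-8')
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def tensor_feature(value):
    # Non-scalar values are stored as a serialized tensor (a byte string).
    return bytes_feature(tf.io.serialize_tensor(value).numpy())


def make_example(record):
    # Build one tf.train.Example from a dict shaped like feature_data.
    return tf.train.Example(features=tf.train.Features(feature={
        'name': bytes_feature(record['name']),
        'age': int64_feature(record['age']),
        'height': float_feature(record['height']),
        'scores': tensor_feature(tf.constant(record['scores'], dtype=tf.int32)),
    }))


feature_data = {
    'name': 'xiaoming',
    'age': 20,
    'height': 172.8,
    'scores': [[120, 130, 140], [82, 95, 43]],
}
example_proto = make_example(feature_data)
```

Pinning the dtype of scores to tf.int32 here matters later: tf.io.parse_tensor has to be told the same dtype when the record is read back.
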
Any protocol buffer message can be serialized to a binary string with the .SerializeToString method.

serialized_example = example_proto.SerializeToString()

# Output: b'\nn\n4\n\x06scores\x12*\n(\n&\x08\x03\x12\x08\x12\x02\x08\x02\x12\x02\x08\x03"\x18x\x00\x00\x00\x82\x00\x00\x00\x8c\x00\x00\x00R\x00\x00\x00_\x00\x00\x00+\x00\x00\x00\n\x14\n\x04name\x12\x0c\n\n\n\x08xiaoming\n\x12\n\x06height\x12\x08\x12\x06\n\x04\xcd\xcc,C\n\x0c\n\x03age\x12\x05\x1a\x03\n\x01\x14'
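
Serialization is reversible, which is handy for inspecting a record without going through tf.data. A minimal sketch, assuming only the standard protobuf FromString class method that tf.train.Example inherits; the single-field example below is a cut-down stand-in for the full message above.

```python
import tensorflow as tf

# A cut-down example with one field, just to show the round trip.
example_proto = tf.train.Example(features=tf.train.Features(feature={
    'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[20])),
}))

serialized_example = example_proto.SerializeToString()

# Decode the bytes back into a tf.train.Example message.
decoded = tf.train.Example.FromString(serialized_example)
age = decoded.features.feature['age'].int64_list.value[0]
```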

Write TFRecord

file_path = 'example.tfrecord'  # placeholder path; the original does not define file_path
with tf.io.TFRecordWriter(file_path) as writer:
    writer.write(serialized_example)
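
The introduction recommends a set of files rather than one large file. A rough sketch of how records might be spread across several shards follows, assuming round-robin assignment and a temporary output directory. The shard naming pattern, the make_example helper and the sample ages are all invented for this illustration.

```python
import os
import tempfile

import tensorflow as tf


def make_example(age):
    # Hypothetical one-field record, kept small for the illustration.
    return tf.train.Example(features=tf.train.Features(feature={
        'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
    }))


output_dir = tempfile.mkdtemp()
num_shards = 2
ages = [18, 19, 20, 21, 22]

shard_paths = [
    os.path.join(output_dir, f'people-{i:05d}-of-{num_shards:05d}.tfrecord')
    for i in range(num_shards)
]

# One writer per shard; records are assigned round-robin.
writers = [tf.io.TFRecordWriter(path) for path in shard_paths]
try:
    for index, age in enumerate(ages):
        writers[index % num_shards].write(make_example(age).SerializeToString())
finally:
    for writer in writers:
        writer.close()

# TFRecordDataset accepts a list of files and reads them one after another.
raw_dataset = tf.data.TFRecordDataset(shard_paths)
record_count = sum(1 for _ in raw_dataset)
```

In practice the number of shards would be chosen so that each file lands in the 100-200 MB range mentioned above.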

2. Reading a TFRecord

feature_description is required because datasets use graph execution and need this description to build their shape and type signature.

feature_description = {    
    'name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'age': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'height': tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
    'scores': tf.io.FixedLenFeature([], tf.string, default_value=''),
}

Parsing

def parse_from_example(serialized_example):
    # tf.io.parse_single_example unpacks the tf.Example fields into standard tensors
    feature_data = tf.io.parse_single_example(serialized_example, feature_description)
    # tf.io.parse_tensor converts the binary string back into a tensor
    feature_data['scores'] = tf.reshape(tf.io.parse_tensor(feature_data['scores'], out_type=tf.int32), (2, 3))
    return feature_data

parse_from_example(serialized_example)
"""Output:
{'age': <tf.Tensor: id=15, shape=(), dtype=int64, numpy=20>, 'height': <tf.Tensor: id=16, shape=(), dtype=float32, numpy=172.8>, 'name': <tf.Tensor: id=17, shape=(), dtype=string, numpy=b'xiaoming'>, 'scores': <tf.Tensor: id=21, shape=(2, 3), dtype=int32, numpy=
array([[120, 130, 140],
        [ 82,  95,  43]], dtype=int32)>}
"""

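When the shape of a non-scalar feature is known ahead of time, an alternative to serialize_tensor is to store the flattened values in an Int64List and let FixedLenFeature restore the shape during parsing. This is a different technique from the one used in this post, sketched under the assumption that every record has a 2x3 scores matrix; note the parsed dtype becomes int64 rather than int32.

```python
import tensorflow as tf

scores = [[120, 130, 140], [82, 95, 43]]

# Flatten row by row; Int64List only holds a flat list of integers.
flat_scores = [value for row in scores for value in row]

example = tf.train.Example(features=tf.train.Features(feature={
    'scores': tf.train.Feature(int64_list=tf.train.Int64List(value=flat_scores)),
}))

# Declaring the shape in FixedLenFeature reshapes the six values to (2, 3).
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {'scores': tf.io.FixedLenFeature([2, 3], tf.int64)},
)
```

This avoids the extra parse_tensor and reshape calls, at the cost of fixing the shape and the integer width in the feature description.
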
Read TFRecord

# tf.data.Dataset.map applies a function to every element of a Dataset
# Tip: an eager tensor can be converted to a NumPy array with tensor.numpy(), but that does not work inside a function passed to Dataset.map. Use tf.numpy_function or tf.py_function there instead.
dataset = tf.data.TFRecordDataset(file_path).map(parse_from_example)
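
To tie the write and read halves together, here is a small end-to-end sketch, assuming TensorFlow 2.x with eager execution. It writes two records to a temporary file, parses them, and then uses tf.py_function for a step that needs .numpy(). The second record, the upper_name function and the name_upper key are invented for this illustration.

```python
import os
import tempfile

import tensorflow as tf

file_path = os.path.join(tempfile.mkdtemp(), 'people.tfrecord')

# Write two small records (the second one is made up for the example).
with tf.io.TFRecordWriter(file_path) as writer:
    for name, age in [(b'xiaoming', 20), (b'xiaohong', 21)]:
        example = tf.train.Example(features=tf.train.Features(feature={
            'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
            'age': tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
        }))
        writer.write(example.SerializeToString())

feature_description = {
    'name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'age': tf.io.FixedLenFeature([], tf.int64, default_value=0),
}


def parse_from_example(serialized_example):
    return tf.io.parse_single_example(serialized_example, feature_description)


def upper_name(name):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    return name.numpy().upper()


def add_upper_name(row):
    # The map function itself is traced as a graph; py_function bridges to Python.
    upper = tf.py_function(upper_name, [row['name']], tf.string)
    return dict(row, name_upper=upper)


dataset = tf.data.TFRecordDataset(file_path).map(parse_from_example)

# Outside of map, iteration yields eager tensors, so .numpy() works directly.
rows = [(row['name'].numpy(), int(row['age'].numpy())) for row in dataset]
names_upper = [row['name_upper'].numpy() for row in dataset.map(add_upper_name)]
```

tf.py_function gives up graph-level optimizations for the wrapped step, so it is usually kept for logic that cannot be expressed with TensorFlow ops.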

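Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.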