Q: HDF5 dataset only stores zeros / empty values

Stack Overflow user

Asked 2020-06-11 22:21:03

Answers: 1 · Views: 66 · Followers: 0 · Votes: 1

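I am writing a CSV file into an HDF5 file so that I can load the data more efficiently without filling up memory. The CSV file contains indices, which I convert to their corresponding values through dictionaries.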

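The CSV file is very large (4 GB of indices), and each corresponding value is an array of size 512. To build the datasets, I first define them in the H5 file and then read the CSV file in chunks so that it uses a reasonable amount of memory: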

num_lines = 1000000
chunksize = 100000
num_features = 512

with h5py.File('./data/dataset.h5', 'w') as h5f:

    # use num_features-1 if the csv file has a column header
    dset1 = h5f.create_dataset('paragraph_embeddings',
                               shape=(num_lines, num_features),
                               compression=None,
                               dtype='float32')
    dset2 = h5f.create_dataset('sentence_embeddings',
                               shape=(num_lines, num_features),
                               compression=None,
                               dtype='float32')
    dset3 = h5f.create_dataset('labels',
                               shape=(num_lines,),
                               compression=None,
                               dtype='int32')

    # Read csv in chunks so that RAM does not overflow
    for i in range(0, num_lines, chunksize):

        df = pd.read_csv(csv_path,
                header=None,
                nrows=chunksize, # number of rows to read at each iteration
                skiprows=i)   # skip rows that were already read
        df.columns = ["para_index", "sentence_index", "label"]

        # Get embeddings from dictionaries (para_mappings and sentence_mappings)
        paragraph_embeddings = df["para_index"].map(para_mappings)
        sentence_embeddings = df["sentence_index"].astype(str).map(sentence_mappings)
        label = df["label"]

        # Append to the datasets
        dset1[i:i+chunksize, num_features:] = paragraph_embeddings
        dset2[i:i+chunksize, num_features:] = sentence_embeddings
        dset3[i:i+chunksize] = label

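I use map on each DataFrame column to turn the indices into their corresponding values, which gives me the embeddings (the size-512 arrays described above). I then append them to the corresponding datasets.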

To test this, however, I printed an embedding from the H5 file with the following code:

with h5py.File('./data/dataset.h5', 'r') as h5f:
    print('Embedding', h5f['paragraph_embeddings'][2])

The output is an array of zeros (of size 512).

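Can anyone point out where I went wrong? My guess is that the problem lies where I "append" the embeddings to the datasets: it seems the values are not actually being written, and I am doing something wrong there.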

Also, when I test the labels, they are correct. So I suspect the problem comes down to this line:

dset1[i:i+chunksize, num_features:] = paragraph_embeddings
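
One thing worth checking on that line is the column slice. The sketch below uses a plain NumPy array as a stand-in for the HDF5 dataset, on the assumption that h5py treats this kind of simple slice the way NumPy does; num_features is shrunk to 4 for readability. In a dataset with num_features columns, the slice num_features: starts past the last column and so selects nothing, which would be consistent with the dataset keeping its default zeros.

```python
import numpy as np

num_features = 4  # 512 in the question; small here for illustration

# NumPy stand-in for the (num_lines, num_features) HDF5 dataset
dset = np.zeros((6, num_features), dtype="float32")

# Columns num_features.. lie past the last column, so the selection is empty
empty_sel = dset[0:3, num_features:]
print(empty_sel.shape)  # (3, 0)

# Dropping the column slice selects the full rows instead
full_sel = dset[0:3]
print(full_sel.shape)  # (3, 4)
```

The labels dataset is one-dimensional and is written without a column slice, which may be why it was the only one that came out correct.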

Tags: word-embedding, python, csv, hdf5, h5py

1 Answer

Stack Overflow user

Posted 2020-06-11 22:52:16

The problem was this line:

paragraph_embeddings = df["para_index"].map(para_mappings)

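It gave me a dataframe (strictly, a pandas Series) of length chunk-size in which each element is a numeric array of size 512. When I tried to write it into the dataset, the assignment was not accepted: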

dset1[i:i+chunksize] = paragraph_embeddings

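It raised an error saying that data of size chunk-size cannot be written into a dataset of size [chunk-size, 512]. That gave me the clue to first convert the dataframe into a NumPy array of shape [chunk-size, 512] and then write that to the dataset. As expected, it worked: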

paragraph_embeddings = np.array(df["para_index"].map(para_mappings).tolist())
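
For anyone wanting to try the fix end to end, here is a minimal sketch rather than the asker's actual pipeline. The dictionary contents, the three-row DataFrame, the file name demo.h5, and the small num_features are all made-up stand-ins; the file is opened with h5py's in-memory core driver so nothing is written to disk. It shows the mapped Series being one-dimensional, the conversion to a 2-D array, and a write that uses the row slice only.

```python
import h5py
import numpy as np
import pandas as pd

num_features = 4  # 512 in the question; small here for illustration
chunksize = 3

# Hypothetical stand-ins for para_mappings and one CSV chunk
para_mappings = {k: np.full(num_features, k, dtype="float32") for k in range(10)}
df = pd.DataFrame({"para_index": [2, 5, 7]})

# Series.map gives a 1-D Series whose elements are arrays, not a 2-D block
mapped = df["para_index"].map(para_mappings)

# Convert to a 2-D (chunksize, num_features) float array, as in the answer
paragraph_embeddings = np.array(mapped.tolist(), dtype="float32")

# In-memory HDF5 file (core driver, no backing store)
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as h5f:
    dset1 = h5f.create_dataset("paragraph_embeddings",
                               shape=(chunksize, num_features),
                               dtype="float32")
    i = 0
    # Row slice only: every column of the selected rows is written
    dset1[i:i + chunksize] = paragraph_embeddings
    stored = dset1[...]  # read back before the file closes

print(mapped.shape, paragraph_embeddings.shape)
print(stored[0])
```

Note that the write uses dset1[i:i+chunksize] with no column slice, matching the assignment shown in this answer rather than the one in the question.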

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/62326566
