I'm not sure this is the most efficient way (I've never tried the combination; I'm just pulling together tools I've used independently), but you could read the CSV file into a numpy recarray using the <a href="http://matplotlib.org/api/mlab_api.html#matplotlib.mlab.csv2rec" rel="nofollow noreferrer">matplotlib helper methods for csv</a>.

You can probably also find a way to read the CSV file in chunks, to avoid loading the whole thing into memory. Then use the recarray (or slices of it) to write the whole thing (or large chunks of it) to the h5py dataset. I'm not exactly sure how h5py handles recarrays, but the documentation indicates that it should be OK.

Basically if possible, try to write big chunks of data at once instead of iterating over individual elements.
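The advice above can be sketched as follows. This is only an illustration under stated assumptions: the helper name <code>copy_column_in_batches</code>, the sample data, and the presence of a header row are all made up for the example. The target is anything that supports slice assignment; a plain list is used here so the sketch runs without extra packages, and a pre-sized h5py dataset should accept the same <code>target[start:stop] = batch</code> assignment.

```python
import csv
import io
from itertools import islice


def copy_column_in_batches(lines, column, target, batch_size=10_000):
    """Copy one integer CSV column into `target`, one slice per batch.

    `lines` is any iterable of text lines (for example an open file).
    Assumes the first line is a header row. Returns the rows written.
    """
    reader = csv.reader(lines)
    next(reader)  # skip the (assumed) header row
    start = 0
    while True:
        # Pull at most `batch_size` rows; only this batch is held in memory.
        batch = [int(row[column]) for row in islice(reader, batch_size)]
        if not batch:
            return start
        # One slice assignment per batch instead of one write per element.
        target[start:start + len(batch)] = batch
        start += len(batch)


# Hypothetical sample data standing in for a large CSV file.
sample = io.StringIO("id,label\n1,a\n2,b\n3,c\n4,d\n5,e\n")
out = [0] * 5
written = copy_column_in_batches(sample, column=0, target=out, batch_size=2)
```

The batch size trades memory for the number of write calls; the 10,000 used as the default here is just the figure mentioned in the question, not a tuned value.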

Another possibility for reading the CSV file is <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html" rel="nofollow noreferrer"><code>numpy.genfromtxt</code></a>.

You can grab the columns you want using the keyword <code>usecols</code>, and then only read in a specified set of lines by properly setting the <code>skip_header</code> and <code>skip_footer</code> keywords.
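A small sketch of those keywords, assuming a made-up five-line CSV held in memory (the column names and values are invented for illustration). Here <code>skip_header=2</code> skips the header line plus the first data row, and <code>skip_footer=1</code> drops the last row, so only the middle two rows are read.

```python
import io

import numpy as np

# Hypothetical sample data: a header line followed by four data rows.
text = "\n".join([
    "id,name,value",
    "1,a,10.5",
    "2,b,20.5",
    "3,c,30.5",
    "4,d,40.5",
])

subset = np.genfromtxt(
    io.StringIO(text),
    delimiter=",",
    usecols=(0, 2),   # keep only the numeric "id" and "value" columns
    skip_header=2,    # skip the header line and the first data row
    skip_footer=1,    # skip the last data row
)
```

To walk through a large file this way you would vary <code>skip_header</code> and <code>skip_footer</code> per call, at the cost of re-scanning the file each time.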


I would avoid chunking the data and would instead store it as a series of single-array datasets (along the lines of what Benjamin is suggesting). I just finished loading the output of an enterprise app I've been working on into HDF5, and was able to pack about 4.5 billion compound datatypes into 450,000 datasets, each containing an array of 10,000 values. Writes and reads now seem fairly instantaneous, but were painfully slow when I initially tried to chunk the data.

Just a thought!

Update:

These are a couple of snippets lifted from my actual code (I'm coding in C rather than Python, but you should get the idea of what I'm doing) and modified for clarity. I'm just writing unsigned long integers in arrays (10,000 values per array) and reading them back when I need an actual value.

This is my typical writer code. In this case, I'm simply writing a sequence of unsigned long integers into a series of arrays and loading each array into HDF5 as it is created.

<pre class="lang-cpp prettyprint-override"><code>// Our dummy data: a rolling count of unsigned long integers
unsigned long int k = 0UL;
// Buffer for the dummy data, 10,000 values at a time
unsigned long int kValues[NUMPERDATASET];
// Create the HDF5 file (truncating any existing one).
hid_t ssdb = H5Fcreate(SSHDF, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
// NUMPERDATASET = 10,000, so each dataset is a 1-D array of 10,000 values
hsize_t dsDim[1] = {NUMPERDATASET};
// Create the data space (shared by every dataset).
hid_t dSpace = H5Screate_simple(1, dsDim, NULL);
// NUMDATASETS = MAXSSVALUE / NUMPERDATASET, where MAXSSVALUE = 4,500,000,000
for (unsigned long int i = 0UL; i &lt; NUMDATASETS; i++){
    // Fill the buffer with the next 10,000 values.
    for (unsigned long int j = 0UL; j &lt; NUMPERDATASET; j++){
        kValues[j] = k;
        k += 1UL;
    }
    // Name the data set after its index; g_strdup_printf allocates, so free it below.
    gchar *dsName = g_strdup_printf("%lu", i);
    // Create the data set.
    hid_t dssSet = H5Dcreate2(ssdb, dsName, H5T_NATIVE_ULONG, dSpace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    // Write the whole buffer to the data set in one call.
    H5Dwrite(dssSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, kValues);
    // Close the data set and release its name.
    H5Dclose(dssSet);
    g_free(dsName);
}
// Release the data space.
H5Sclose(dSpace);
// Close the data file.
H5Fclose(ssdb);
</code></pre>

This is a slightly modified version of my reader code. There are more elegant ways of doing this (e.g., I could use a hyperslab selection to read just the one value), but this was the cleanest solution with respect to my fairly disciplined Agile/BDD development process.

<pre class="lang-cpp prettyprint-override"><code>unsigned long int getValueByIndex(unsigned long int nnValue){
    // NUMPERDATASET = 10,000
    unsigned long int ssValue[NUMPERDATASET];
    // MAXSSVALUE = 4,500,000,000; i takes the smaller of MAXSSVALUE-1 and nnValue
    // to avoid an index-out-of-range error.
    unsigned long int i = MIN(MAXSSVALUE-1,nnValue);
    // Open the data file in read-only mode.
    hid_t db = H5Fopen(_indexFilePath, H5F_ACC_RDONLY, H5P_DEFAULT);
    // Open the data set. Each data set is an array of 10,000 unsigned long int
    // and is named after the integer division of i by the number per data set.
    // g_strdup_printf allocates the name, so free it once the data set is open.
    gchar *dsName = g_strdup_printf("%lu", i / NUMPERDATASET);
    hid_t dSet = H5Dopen(db, dsName, H5P_DEFAULT);
    g_free(dsName);
    // Read the whole data set array.
    H5Dread(dSet, H5T_NATIVE_ULONG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ssValue);
    // Close the data set.
    H5Dclose(dSet);
    // Close the data file.
    H5Fclose(db);
    // Return the indexed value, using i modulo the number per data set.
    return ssValue[i % NUMPERDATASET];
}
</code></pre>

The main take-aways are the inner loop in the writer code, and the integer division and modulo operations in the reader, which give the index of the dataset and the index of the desired value within that dataset's array. Let me know if this is clear enough for you to put together something similar or better in h5py. In C, this is dead simple and gives me significantly better read/write times than a chunked-dataset solution. Plus, since I can't use compression with compound datasets anyway, the apparent upside of chunking is moot, so all my compounds are stored the same way.
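For the Python side, the index arithmetic alone can be sketched like this. The function name <code>locate</code> is made up for illustration; the constants mirror the values quoted in the C comments, and the dataset names follow the same decimal-index naming as the C code. The returned name and offset would then be used to open the dataset and index into its array.

```python
NUM_PER_DATASET = 10_000        # values stored in each dataset
MAX_SS_VALUE = 4_500_000_000    # total number of stored values


def locate(index):
    """Map a global index to (dataset name, offset within that dataset)."""
    # Clamp to the last valid index, as the C reader does with MIN().
    clamped = min(MAX_SS_VALUE - 1, index)
    # Integer division picks the dataset; the remainder picks the element.
    dataset_number, offset = divmod(clamped, NUM_PER_DATASET)
    return str(dataset_number), offset


name, offset = locate(25_003)
```

Here <code>name</code> is <code>"2"</code> and <code>offset</code> is <code>5003</code>, since 25,003 = 2 × 10,000 + 5,003.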

Using the flexibility of <code>numpy.loadtxt</code>, you can get the data from the file into a <code>numpy array</code>, which in turn is perfect for initializing the <code>hdf5</code> dataset.

<pre><code>import h5py
import numpy as np

d = np.loadtxt('data.txt')
h = h5py.File('data.hdf5', 'w')
dset = h.create_dataset('data', data=d)
h.close()  # flush the data to disk and close the file
</code></pre>

Given a large (10s of GB) CSV file of mixed text/numbers, what is the fastest way to create an HDF5 file with the same content, while keeping the memory usage reasonable?

I'd like to use the <code>h5py</code> module if possible.

In the toy example below, I've found one incredibly slow way and one incredibly fast way to write data to HDF5. Would it be best practice to write to HDF5 in chunks of 10,000 rows or so? Or is there a better way to write a massive amount of data to such a file?

<pre><code>import h5py

n = 10000000
f = h5py.File('foo.h5','w')
dset = f.create_dataset('int',(n,),'i')

# this is terribly slow: one write per element
for i in range(n):
    dset[i] = i

# instantaneous
dset[...] = 42
</code></pre>

Fastest way to write HDF5 files with Python?
