I think the two best approaches are:

1) use <code>Get</code> on the *.mx file, or

2) read the data in once, save it in some binary format for which you write LibraryLink code, and then read it back via that code. This has the disadvantage, of course, that you would need to convert your MX files. But perhaps it is an option.

Generally speaking, <code>Get</code> with MX files is pretty fast.
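
As a minimal sketch of option 1, an MX round trip could look like the following. The symbol name <code>data</code> and the file name <code>demo.mx</code> are placeholders chosen for illustration, not taken from the question.

<pre><code>(* placeholder data; in practice this would be one chunk of the real data set *)
data = RandomReal[1, 1000000];
file = FileNameJoin[{$TemporaryDirectory, "demo.mx"}];

DumpSave[file, data];   (* writes the definition of the symbol data *)
Clear[data];

Get[file];              (* restores the definition of data *)
Length[data]
</code></pre>

Note that <code>DumpSave</code> stores the symbol together with its value, so <code>Get</code> redefines <code>data</code> rather than returning the array.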

Are you sure this is not a swapping problem?

Edit 1:
You could then also write an import converter: <a href="http://reference.wolfram.com/mathematica/tutorial/DevelopingAnImportConverter.html" rel="nofollow">tutorial/DevelopingAnImportConverter</a>

Here's an idea:

You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.

You could flatten the matrix to get a single long packed 1D array (use <code>Developer`ToPackedArray</code> to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
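
A small sketch of that idea, using a toy matrix and illustrative variable names (<code>ragged</code>, <code>flat</code>, <code>starts</code>, <code>ends</code> and <code>rebuilt</code> are my own, not from the question):

<pre><code>(* toy ragged matrix; the real data would be far larger *)
ragged = {{1., 2., 3.}, {4.}, {5., 6.}};

(* one packed 1D array plus the start position of every sublist *)
flat = Developer`ToPackedArray[Flatten[ragged]];
lengths = Map[Length, ragged];
starts = Most[Accumulate[Prepend[lengths, 1]]];

(* after import: rebuild the sublists from flat and starts alone *)
ends = Append[Rest[starts] - 1, Length[flat]];
rebuilt = Table[flat[[starts[[i]] ;; ends[[i]]]], {i, Length[starts]}];

(* compare with the original *)
rebuilt === ragged
</code></pre>

Only <code>flat</code> and <code>starts</code> would need to be stored; the end positions follow from the next start position and the total length.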

<hr>

Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.

<pre><code>data = RandomReal[1, 10000000];

indexes = Union@RandomInteger[{1, 10000000}, 10000]; 
ranges = #1 ;; (#2 - 1) &amp; @@@ Partition[indexes, 2, 1];

data[[#]] &amp; /@ ranges; // Timing

{0.093, Null}
</code></pre>

Alternatively, store a sequence of sublist lengths and use <a href="https://stackoverflow.com/a/5433867/695132">Mr.Wizard's <code>dynamicPartition</code> function</a>, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.
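
For readers on a recent version, the built-in <code>TakeList</code> (added in version 11.2, so not available when this answer was written) splits a list by a sequence of lengths. This is a sketch with toy values and is not Mr.Wizard's implementation:

<pre><code>flat = {1., 2., 3., 4., 5., 6.};
lengths = {3, 1, 2};

(* split flat into consecutive sublists of the given lengths *)
TakeList[flat, lengths]
(* {{1., 2., 3.}, {4.}, {5., 6.}} *)
</code></pre>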

<hr>

Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This will solve the problem that importing data that is not packed can be slower (although as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
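
To check whether the data is actually packed before saving it, something along these lines may help. The variable names are illustrative, and <code>Developer`FromPackedArray</code> is used here only to produce an unpacked list for the comparison:

<pre><code>(* an unpacked list of machine reals, for demonstration *)
row = Developer`FromPackedArray[RandomReal[1, 1000000]];
Developer`PackedArrayQ[row]

(* packing it gives a much more compact representation *)
packedRow = Developer`ToPackedArray[row];
Developer`PackedArrayQ[packedRow]

{ByteCount[row], ByteCount[packedRow]}

(* a ragged matrix cannot be packed as a whole, but each row can *)
ragged = {{1., 2., 3.}, {4.}, {5., 6.}};
packedRows = Map[Developer`ToPackedArray, ragged];
</code></pre>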

<hr>

If <code>BinaryReadList</code> were fast (it isn't as fast as reading MX files now, but it looks like <a href="http://library.wolfram.com/infocenter/Conferences/8025/" rel="nofollow noreferrer">it will be significantly sped up in Mathematica 9</a>), you could store the whole dataset as one big binary file, without the need of breaking it into separate MX files. Then you could import relevant parts of the file like this:

First make a test file:

<pre><code>In[3]:= f = OpenWrite["test.bin", BinaryFormat -&gt; True]

In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}

In[5]:= Close[f]
</code></pre>

Open it: 

<pre><code>In[6]:= f = OpenRead["test.bin", BinaryFormat -&gt; True] 

In[7]:= StreamPosition[f]

Out[7]= 0
</code></pre>

Skip the first 5 million entries:

<pre><code>In[8]:= SetStreamPosition[f, 5000000*8]

Out[8]= 40000000
</code></pre>

Read 5 million entries:

<pre><code>In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing 
Out[9]= {0.609, 5000000}
</code></pre>

Read all the remaining entries:

<pre><code>In[10]:= BinaryReadList[f, "Real64"] // Length // Timing 
Out[10]= {7.782, 70000000}

In[11]:= Close[f]
</code></pre>

(For comparison, <code>Get</code> usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)
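
The open / skip / read steps above can be wrapped into one function. This is only a sketch: <code>readChunk</code> and the file name <code>chunks.bin</code> are hypothetical, and it assumes the file contains nothing but <code>"Real64"</code> values (8 bytes each).

<pre><code>(* hypothetical helper: read count Real64 values, starting at the 1-based position start *)
readChunk[file_String, start_Integer, count_Integer] :=
 Module[{stream, values},
  stream = OpenRead[file, BinaryFormat -&gt; True];
  SetStreamPosition[stream, 8*(start - 1)];   (* 8 bytes per Real64 *)
  values = BinaryReadList[stream, "Real64", count];
  Close[stream];
  values]

(* small test file holding 1., 2., ..., 10. *)
file = FileNameJoin[{$TemporaryDirectory, "chunks.bin"}];
out = OpenWrite[file, BinaryFormat -&gt; True];
BinaryWrite[out, N[Range[10]], "Real64"];
Close[out];

readChunk[file, 4, 3]
(* {4., 5., 6.} *)
</code></pre>

Combined with a stored list of sublist start positions and lengths, this would let you read a single row of the ragged matrix without loading the rest of the file.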

<hr>

EDIT: If you are willing to spend time on this and write some C code, another idea is to create a library function (using <a href="http://reference.wolfram.com/mathematica/LibraryLink/tutorial/Overview.html" rel="nofollow noreferrer">LibraryLink</a>) that will memory-map the file (<a href="http://msdn.microsoft.com/en-us/library/ms810613.aspx" rel="nofollow noreferrer">link for Windows</a>) and copy it directly into an <code>MTensor</code> object (an <code>MTensor</code> is just a packed Mathematica array, as seen from the C side of LibraryLink).

Can anybody suggest an alternative for importing a couple of GB of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MB?

The research problem, which is too large to post here, involves simple statistical operations on about twice as much data (around 34 GB) as there is RAM available (16 GB).
To handle the data size, I just split things up and used a <code>Get</code> / <code>Clear</code> strategy to do the math.
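
For illustration, a minimal sketch of what such a <code>Get</code> / <code>Clear</code> loop might look like. The symbol <code>chunk</code>, the file names and the running mean are placeholders, not the actual research code:

<pre><code>(* write three small placeholder .mx files into a fresh temporary directory *)
dir = CreateDirectory[];
Do[
  chunk = RandomReal[1, 1000];
  DumpSave[FileNameJoin[{dir, StringJoin["part", ToString[i], ".mx"]}], chunk],
  {i, 3}];
Clear[chunk];

(* load one file at a time, accumulate, and free the memory again *)
total = 0.; count = 0;
Do[
  Get[f];                          (* defines chunk again *)
  total = total + Total[chunk];
  count = count + Length[chunk];
  Clear[chunk],                    (* release the data before the next file *)
  {f, FileNames["*.mx", dir]}];

total/count   (* mean over all chunks *)
</code></pre>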

It does work, but calling <code>Get["bigfile.mx"]</code> takes quite some time, so I was wondering if it would be quicker to use BLOBs or whatever with PostgreSQL or MySQL or whatever database people use for GB of numeric data.

So my question really is:
What is the most efficient way to handle truly large data set imports in Mathematica?

I have not tried it yet, but I think that SQLImport from DatabaseLink will be slower than <code>Get["bigfile.mx"]</code>.

Does anyone have some experience to share?

(Sorry if this is not a very specific programming question, but it would really help me to move on with that time-consuming finding-out-what-is-the-best-of-the-137-possibilities-to-tackle-a-problem-in-Mathematica).

Faster huge data-import than Get["raggedmatrix.mx"]?
