It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.

And, while using a CSV reading module (whether the stdlib's <code>csv</code> or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
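For reference, here is a minimal sketch of what the <code>csv</code> route might look like for tab-delimited input. The function name <code>read_columns</code> is made up for illustration, it assumes every row has at least two columns, and it is written for Python 3 (on Python 2.7 you would open the file in <code>"rb"</code> mode without the <code>newline</code> argument).

```python
import csv


def read_columns(path):
    """Collect the first two tab-separated columns of every row into lists.

    Hypothetical helper: assumes each row has at least two columns.
    """
    list1, list2 = [], []
    # newline="" is what the csv docs recommend for files handed to csv.reader
    with open(path, newline="") as infile:
        for row in csv.reader(infile, delimiter="\t"):
            list1.append(row[0])
            list2.append(row[1])
    return list1, list2
```

This is about readability rather than speed: the reader handles the splitting for you, but the file still has to come off the disk.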

Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment: run your program with just <code>pass</code> for the processing and see whether that cuts off 5% of the time or 70%. You may even want to compare against a loop over <code>os.open</code> and <code>os.read(1024*1024)</code> or something similar and see if that's any faster.
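A rough way to run that comparison might look like the sketch below. The names <code>time_raw_read</code> and <code>time_line_loop</code> are invented for this example, and the line loop only counts lines as a stand-in for <code>pass</code>. Keep in mind that the OS page cache makes a second pass over the same file much faster than the first, so compare like with like.

```python
import os
import time


def time_raw_read(path, blocksize=1024 * 1024):
    """Read the whole file in large blocks via os.read, discarding the data."""
    start = time.time()
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            block = os.read(fd, blocksize)
            if not block:  # empty result means end of file
                break
            total += len(block)
    finally:
        os.close(fd)
    return total, time.time() - start


def time_line_loop(path):
    """Iterate the file line by line, doing no real processing."""
    start = time.time()
    count = 0
    with open(path, "rb") as infile:
        for line in infile:
            count += 1  # stand-in for `pass`, so the loop has a visible result
    return count, time.time() - start
```

If the two timings are close, the line-by-line loop is already about as fast as the disk allows, and the processing step is where any remaining time goes.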

<hr>

Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use <code>readlines(bufsize)</code> for some large <code>bufsize</code>. (You can try different numbers and measure them to see where the peak is. In my experience, anything from 64KB to 8MB is usually about the same, but that may differ depending on your system, especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)

So, for example:

<pre><code>bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)
</code></pre>
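To find where the peak is on your own system, something like the following sketch could be used. The helper names <code>time_readlines</code> and <code>compare_bufsizes</code> and the three candidate sizes are arbitrary choices for illustration, not part of any library.

```python
import time


def time_readlines(path, bufsize):
    """Read the file in batches of roughly bufsize bytes; return (lines, seconds)."""
    start = time.time()
    count = 0
    with open(path, "rb") as infile:
        while True:
            lines = infile.readlines(bufsize)
            if not lines:
                break
            count += len(lines)
    return count, time.time() - start


def compare_bufsizes(path, sizes=(64 * 1024, 1024 * 1024, 8 * 1024 * 1024)):
    """Time the same file with several batch sizes (hypothetical candidates)."""
    return {size: time_readlines(path, size) for size in sizes}
```

Run it more than once per size and ignore the first pass, since caching by the OS can dominate the difference between sizes.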

<hr>

Meanwhile, assuming you're on a 64-bit system, you may want to try using <a href="https://docs.python.org/2/library/mmap.html" rel="noreferrer"><code>mmap</code></a> instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:

<pre><code>import mmap

with open(path, "rb") as infile:
    # mmap.mmap takes a file descriptor, not a file object
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
</code></pre>

A Python <code>mmap</code> is a slightly odd object: it acts like a <code>str</code> and like a <code>file</code> at the same time, so you can, e.g., iterate manually, scanning for newlines, or call <code>readline</code> on it as if it were a file. Both take more Python-level processing than iterating the file as lines or doing batch <code>readlines</code>, because a loop that would run in C now runs in pure Python (although you may be able to get around that with <code>re</code> or a simple Cython extension). Still, the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
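The two access styles can be sketched roughly as follows. This is written for Python 3, where the mapping yields <code>bytes</code> rather than <code>str</code>; the function names are made up for the example, and the file is assumed to be non-empty (mapping an empty file raises <code>ValueError</code>).

```python
import mmap


def count_lines_readline(path):
    """File-like style: call readline on the mapping until it returns b''."""
    with open(path, "rb") as infile:
        m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            for line in iter(m.readline, b""):
                count += 1
            return count
        finally:
            m.close()


def count_newlines_find(path):
    """String-like style: scan the mapping for newline bytes with find."""
    with open(path, "rb") as infile:
        m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count
        finally:
            m.close()
```

Note that the two counts differ by one when the file does not end with a newline, because the last line has no terminator for <code>find</code> to hit.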

Unfortunately, Python 2 doesn't expose the <a href="http://man7.org/linux/man-pages/man2/madvise.2.html" rel="noreferrer"><code>madvise</code></a> call that you'd use in C to tune this (e.g., explicitly setting <code>MADV_SEQUENTIAL</code> instead of making the kernel guess, or forcing transparent huge pages), but you can <code>ctypes</code> the function out of <code>libc</code>.
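If you are able to run on Python 3.8 or later, the <code>ctypes</code> detour is not needed: the <code>mmap</code> object gained a <code>madvise</code> method there, on platforms that have the underlying call. The sketch below uses that method instead of the <code>ctypes</code> approach described above, and falls back to doing nothing where it is unavailable. The function name <code>open_sequential_mmap</code> is hypothetical.

```python
import mmap


def open_sequential_mmap(path):
    """Map a file read-only and hint sequential access where supported."""
    with open(path, "rb") as infile:
        m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap.madvise and the MADV_* constants exist only on Python 3.8+ and
    # only on platforms with madvise(); elsewhere we skip the hint.
    if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        m.madvise(mmap.MADV_SEQUENTIAL)
    return m
```

The caller is responsible for calling <code>close()</code> on the returned mapping. Whether the hint helps at all depends on the kernel and the storage, so it is worth measuring with and without it.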

I know this question is old, but I wanted to do something similar, so I wrote a simple framework that reads and processes a large file in parallel. I'm leaving what I tried as an answer.

Here is the code; an example follows at the end.

<pre><code>import gc
import multiprocessing as mp
import os
import time

import psutil  # third-party; used only for cpu_count()


def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks each having size ~= size so that the chunks are line aligned

    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~&gt; this
        skiplines : number of lines at the beginning to skip, -1 means don't skip any lines
    Returns :
        list of (start offset, length in bytes, fname) tuples, one per chunk
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if skiplines &gt; 0:
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        while chunkEnd &lt; fileEnd:
            chunkStart = chunkEnd
            f.seek(chunkStart + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = min(f.tell(), fileEnd)  # never report a chunk past EOF
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
    return chunks


def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : [chunk_start, chunk_size, file_path, func_apply, *func_args]
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')

    for line in cont.splitlines():
        ret = func_apply(line, *func_args)
        if ret is not None:
            chunk_res.append(ret)
    return chunk_res


def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : list of extra arguments to function func_apply
        fout : optional open file object; if given, results are printed to it instead of being returned
    Returns :
        list of the non-None results obtained by processing each line
    """
    # leave one core free, but always keep at least one worker
    num_parallel = max(1, min(num_procs, psutil.cpu_count()) - 1)

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)

    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    # maxtasksperchild recycles workers; without it the processes linger and memory use blows up
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])

        for subl in chunk_outputs:
            for x in subl:
                if fout is not None:
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del chunk_outputs
        gc.collect()
        print("Batch done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.join()  # wait for the workers to exit cleanly
    return outputs
</code></pre>
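The part that is easiest to get wrong is the line alignment of the chunks, so it may help to check it in isolation. The sketch below is a stripped-down, hypothetical version of the chunking step (the names <code>line_aligned_chunks</code> and <code>read_chunk</code> are not from the code above); the property to verify is that the chunks, read back in order, reproduce the file exactly and that each one ends on a line boundary.

```python
import os


def line_aligned_chunks(path, size):
    """Return (start, length) byte ranges that each end on a line boundary."""
    chunks = []
    file_end = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < file_end:
            f.seek(start + size)
            f.readline()  # advance to the end of the line straddling the cut
            end = min(f.tell(), file_end)  # seeking past EOF is allowed, so clamp
            chunks.append((start, end - start))
            start = end
    return chunks


def read_chunk(path, start, length):
    """Read one byte range back from the file."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length)
```

Because the offsets are computed in binary mode, they stay valid for any encoding in which a newline is a single <code>\n</code> byte, which includes ASCII and UTF-8.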

Say, for example, I have a file in which I want to count the number of words on each line. The per-line processing would then look like this:

<pre><code>def count_words_line(line):
    return len(line.strip().split())
</code></pre>

and the call would be:

<pre><code>parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
</code></pre>

Using this, I get a speedup of about 8x compared to plain line-by-line reading, on a sample file of about 20GB with some moderately complicated processing on each line.

I have multiple 3 GB tab-delimited files. There are 20 million rows in each file. All the rows have to be processed independently; there is no relation between any two rows. My question is: which will be faster?
<ol>
<li>Reading line-by-line?
<pre><code>with open() as infile:
    for line in infile:
</code></pre>
</li>
<li>Reading the file into memory in chunks and processing it, say 250 MB at a time?
</li>
</ol>
The processing is not very complicated: I am just grabbing the value in column1 into <code>List1</code>, column2 into <code>List2</code>, etc. I might need to add some column values together.
I am using Python 2.7 on a Linux box that has 30GB of memory. The files are ASCII text.
Is there any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using a <code>CSVReader</code> module going to help?
I don't have to do it in Python; ideas involving any other language or a database are welcome.

Fastest way to process a large file?
