If the file is big and you don't want to read the whole thing into memory at once:

<pre><code>with open("file") as fp:
    for i, line in enumerate(fp):
        if i == 25:
            pass  # 26th line: process `line` here
        elif i == 29:
            pass  # 30th line: process `line` here
        elif i &gt; 29:
            break  # stop reading once past the last line of interest
</code></pre>

Note that <code>i == n-1</code> for the nth line.
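The same streaming idea can be written with <code>itertools.islice</code>. This is a minimal sketch; the helper name <code>read_lines</code> is an assumption made for illustration. Earlier lines are still read from disk, they are just never kept in memory.

```python
from itertools import islice


def read_lines(path, start, stop):
    """Return lines start..stop-1 (0-based) without loading the whole file."""
    with open(path) as fp:
        # islice still consumes the first `start` lines, but discards them
        # one at a time, and stops reading as soon as `stop` is reached.
        return list(islice(fp, start, stop))
```

For example, <code>read_lines("file", 25, 30)</code> returns the 26th through 30th lines.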

If the lines are not fixed length, you are out of luck: some function will have to read the file to find where each line starts. If every line has the same length, you can open the file, call <code>file.seek(line*linesize)</code>, and read from there.
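For variable-length lines that are read many times, one option is to pay for a single full scan and remember where each line starts. The sketch below is one way to do that; <code>build_line_index</code> and <code>read_line</code> are hypothetical helper names, and the file is opened in binary mode so that offsets are plain byte counts.

```python
def build_line_index(path):
    """Scan the file once and return the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as fp:
        for line in fp:
            offsets.append(position)
            position += len(line)  # len() of bytes is the line's size on disk
    return offsets


def read_line(path, offsets, n):
    """Return line n (0-based) by seeking straight to its recorded offset."""
    with open(path, "rb") as fp:
        fp.seek(offsets[n])
        return fp.readline().decode()
```

The index costs one integer per line, so for a very large file it may be worth storing only every k-th offset and scanning forward from there.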

You can use the method <code>fileObject.seek(offset[, whence])</code>:

<pre><code># offset -- the position of the read/write pointer within the file.
# whence -- optional, defaults to 0 (absolute file positioning); 1 means seek
#           relative to the current position, 2 means relative to the file's end.

# Open in binary mode so that offsets are plain byte counts.
with open("test.txt", "rb") as file:
    line_size = 8  # 6 digits plus a 2-byte "\r\n" line ending (use 7 for "\n")
    line_number = 5
    file.seek(line_number * line_size, 0)
    for i in range(5):
        print(file.readline().decode().rstrip())
</code></pre>

For this code I used the following file:

<pre><code>100101
101102
102103
103104
104105
105106
106107
107108
108109
109110
110111
</code></pre>
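The fixed-width approach can be wrapped in a small function. This is a sketch under the assumption that every line, including its line ending, occupies exactly <code>line_size</code> bytes; the name <code>read_fixed_width_lines</code> is made up for illustration. If the sample file above is saved with <code>"\n"</code> line endings, each line is 7 bytes; with <code>"\r\n"</code> it is 8.

```python
def read_fixed_width_lines(path, line_number, count, line_size):
    """Read `count` lines starting at 0-based `line_number`.

    Assumes every line occupies exactly `line_size` bytes, line ending included.
    """
    with open(path, "rb") as fp:
        fp.seek(line_number * line_size)   # jump directly, nothing before is read
        data = fp.read(count * line_size)  # one read covers all requested lines
    return data.decode().splitlines()
```

With a 7-byte line size, <code>read_fixed_width_lines("test.txt", 5, 5, 7)</code> returns the lines from <code>105106</code> through <code>109110</code>.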

Python has no way to skip "lines" in a file. The best way I know is to use a generator that yields only the lines meeting a certain condition, e.g. <code>date &gt; 'YYYY-MM-DD'</code>. At least this way you reduce memory usage and time spent on I/O.

Example:

<pre><code># using Python 3 syntax (parameter type annotations)

from datetime import datetime

def yield_right_dates(filepath: str, mydate: datetime):
    with open(filepath, 'r') as myfile:
        for line in myfile:
            # assume:
            #    the file is tab separated (because .tsv is the extension)
            #    the date column has column-index == 0
            #    the date format is '%Y-%m-%d'
            line_splt = line.split('\t')
            if datetime.strptime(line_splt[0], '%Y-%m-%d') &gt; mydate:
                yield line_splt

my_file_gen = yield_right_dates(filepath='/path/to/my/file', mydate=datetime(2015, 1, 1))
# then you can do whatever processing you need on the stream, or put it in one giant list.
desired_lines = list(my_file_gen)
</code></pre>

But this is still limiting you to one processor :(

Assuming you're on a unix-like system and bash is your shell, I would split the file using the shell utility <code>split</code>, then use multiprocessing and the generator defined above.
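A rough sketch of that idea, assuming the file has already been split into chunk files (for example with <code>split -l</code>). The function names and chunk paths here are hypothetical. A generator cannot be handed back from a worker process, so each worker returns a list of its matching rows instead.

```python
from datetime import datetime
from functools import partial
from multiprocessing import Pool


def filter_chunk(path, cutoff):
    """Return the split rows of one chunk file whose date column is after cutoff."""
    rows = []
    with open(path, "r") as chunk:
        for line in chunk:
            fields = line.rstrip("\n").split("\t")
            # Same assumptions as above: date in column 0, format '%Y-%m-%d'.
            if datetime.strptime(fields[0], "%Y-%m-%d") > cutoff:
                rows.append(fields)
    return rows


def filter_chunks(paths, cutoff, processes=4):
    """Filter each chunk in a worker process; results keep the order of `paths`."""
    with Pool(processes) as pool:
        per_chunk = pool.map(partial(filter_chunk, cutoff=cutoff), paths)
    return [row for rows in per_chunk for row in rows]
```

When calling <code>filter_chunks</code> from a script, put the call under <code>if __name__ == "__main__":</code> so that worker processes can import the module safely.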

I don't have a large file to test with right now, but I'll update this answer later with a benchmark on iterating it whole, vs. splitting and then iterating it with the generator and multiprocessing module.

With greater knowledge on the file (e.g. if all the desired dates are clustered at the beginning | center | end), you might be able to optimize the read further.
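If the records are sorted by date, one such optimization is a binary search over byte offsets rather than line numbers. The sketch below assumes the date is the first tab-separated column in <code>YYYY-MM-DD</code> form, so that comparing the raw text gives the same order as comparing the dates; <code>seek_to_date</code> and <code>_line_at_or_after</code> are hypothetical names.

```python
import os


def _line_at_or_after(fp, pos):
    """Return (start, line) for the first line starting at byte offset >= pos."""
    if pos == 0:
        fp.seek(0)
    else:
        fp.seek(pos - 1)
        fp.readline()  # finish the line that contains byte pos - 1
    start = fp.tell()
    return start, fp.readline()  # line is b"" when start is the end of the file


def seek_to_date(path, target_date):
    """Return the byte offset of the first record whose date is >= target_date.

    Returns the file size if every record is earlier than target_date.
    """
    target = target_date.encode()
    with open(path, "rb") as fp:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            _, line = _line_at_or_after(fp, mid)
            if not line or line.split(b"\t", 1)[0] >= target:
                hi = mid      # the answer starts at or before this line
            else:
                lo = mid + 1  # this line is too early, look further right
        start, _ = _line_at_or_after(fp, lo)
        return start
```

After <code>offset = seek_to_date(path, "2015-01-01")</code>, open the file in binary mode, call <code>fp.seek(offset)</code>, and iterate from there. The number of probes grows with the logarithm of the file size in bytes, so it stays in the tens even for very large files.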

I am working with a very large text file (TSV) of around 200 million entries. One of the columns is a date, and the records are sorted by date. I want to start reading records from a given date. Currently I read from the start, which is very slow, since I have to read 100-150 million records just to reach that record. If I could use binary search, I would need at most about 28 extra record reads (log2 of 200 million). Does Python allow reading the nth line without caching or reading the lines before it?

Python goto text file line without reading previous lines
