
Note: this is not an answer, but rather a quick list of questions &amp; suggestions.

<ul>
<li>Are you using <code>ThreadPool()</code> from <code>multiprocessing.pool</code>? That isn't really well documented (in <code>python3</code>), and I'd rather use <a href="https://docs.python.org/dev/library/concurrent.futures.html#threadpoolexecutor" rel="nofollow noreferrer">ThreadPoolExecutor</a> (also see <a href="https://stackoverflow.com/a/11529742/565489">here</a>).</li>
<li>Try to debug which objects are held in memory at the very end of each loop iteration, e.g. using <a href="https://stackoverflow.com/a/40997868/565489">this solution</a>, which relies on <code>sys.getsizeof()</code> to list all declared <code>globals()</code> together with their memory footprint.</li>
<li>Also call <code>del results</code> (although that shouldn't be too large, I guess).</li>
</ul>
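The second suggestion could be sketched roughly as follows. This is a minimal illustration, not the linked solution itself: `largest_objects` is a hypothetical helper name, and `sys.getsizeof()` reports only the shallow size of each object, so containers that hold large objects will look smaller than they really are.

```python
import sys


def largest_objects(namespace, limit=10):
    """Return (name, shallow_size_in_bytes) pairs, largest first.

    `largest_objects` is a hypothetical helper. sys.getsizeof() measures only
    the object itself, not the objects it refers to.
    """
    sizes = [
        (name, sys.getsizeof(value))
        for name, value in namespace.items()
        if not name.startswith("__")  # skip module internals
    ]
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes[:limit]


if __name__ == "__main__":
    results = list(range(100_000))  # stand-in for a real result list
    for name, size in largest_objects(globals(), limit=5):
        print(f"{name:>20}: {size} bytes")
```

Calling it at the end of each loop iteration and comparing the output between iterations should show which names keep growing.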


Do NOT call list(): it creates an in-memory list of whatever divide_chunks() returns.
That is probably where your memory issue comes from.

You don't need all of that data in memory at once.
Just iterate over the filenames one at a time, so that the data is never all in memory at the same moment.
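A minimal sketch of that lazy iteration is below. The question does not show `divide_chunks()`, so the generator here is an assumed stand-in with the same name and call signature, and the filenames are placeholders.

```python
from itertools import islice


def divide_chunks(items, chunk_size):
    """Yield successive lists of at most `chunk_size` items.

    Assumption: the question does not show divide_chunks(), so this is a
    stand-in with the same name and call signature.
    """
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


file_list_1 = [f"img_{i}.png" for i in range(10)]  # placeholder filenames

# No list(...) around the call: only one chunk exists at a time.
for count, chunk in enumerate(divide_chunks(file_list_1, 4)):
    print(count, len(chunk))
```

Note that a list of filenames is usually small compared with the computed features, so this change alone may not be enough if the results of earlier chunks are still referenced somewhere.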

Please post the stack trace so we have more information.


In short, you can't release memory back to the operating system from within the Python interpreter. Your best bet would be to use multiprocessing, as each process handles its own memory.

The garbage collector will "free" memory, but not in the context you may expect. The handling of pages and pools can be explored in the CPython source. There is also a high level article here: <a href="https://realpython.com/python-memory-management/" rel="nofollow noreferrer">https://realpython.com/python-memory-management/</a>


I think this is possible with <a href="http://docs.celeryproject.org/en/latest/index.html" rel="nofollow noreferrer">celery</a>. Thanks to celery, you can use concurrency and parallelism easily with python.

Processing an image seems to be idempotent and atomic, so it can be a <a href="http://docs.celeryproject.org/en/latest/userguide/tasks.html#basics" rel="nofollow noreferrer">celery task</a>.

You can run <a href="http://docs.celeryproject.org/en/latest/userguide/workers.html#starting-the-worker" rel="nofollow noreferrer">a few workers</a> that will process the tasks, i.e. work on the images.

Additionally, it has a <a href="https://docs.celeryproject.org/en/latest/userguide/workers.html#max-tasks-per-child-setting" rel="nofollow noreferrer">configuration</a> setting for dealing with memory leaks.


My solution to this kind of problem is to use a parallel processing tool. I prefer <a href="https://joblib.readthedocs.io/en/latest/" rel="nofollow noreferrer">joblib</a>, since it can parallelize even locally created functions (which are "implementation details", so it is better to avoid making them global in a module). My favorite weapon of choice is <a href="https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html" rel="nofollow noreferrer">joblib.Parallel</a>.

My other advice: do not use threads (and thread pools) in python; use processes (and process pools) instead. This is almost always a better idea! Just make sure to create a pool of at least 2 processes in joblib, otherwise it will run everything in the original python process and the RAM will not be released in the end. Once the joblib worker processes are closed automatically, the RAM they allocated is fully released by the OS.

If you need to transfer large data (i.e. larger than 2GB) to the workers, use <a href="https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html" rel="nofollow noreferrer">joblib.dump</a> (to write a python object into a file in the main process) and <a href="https://joblib.readthedocs.io/en/latest/generated/joblib.load.html" rel="nofollow noreferrer">joblib.load</a> (to read it in a worker process).

About <code>del object</code>: in python, this command does not actually delete an object. It only removes the name and decreases the object's reference counter; the object is destroyed only once that counter reaches zero. When you run <code>import gc; gc.collect()</code>, the garbage collector decides for itself which memory to free and which to leave allocated, and I am not aware of a way to force it to free all the memory possible. Even worse, if some memory was allocated not by python but, for example, in some external C/C++/Cython/etc code, and that code did not associate a python reference counter with the memory, there is absolutely nothing you can do to free it from within python, except what I wrote above: terminate the python process which allocated the RAM, in which case the OS is guaranteed to free it. That is why the only 100% reliable way to free some memory in python is to run the code which allocates it in a parallel process and then to terminate that process.
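The reference-counting behaviour described above can be observed with a small sketch. It assumes CPython, where an object is destroyed as soon as its last reference disappears; `Features` is a hypothetical stand-in class for a large result object.

```python
import gc
import sys
import weakref


class Features:
    """Tiny stand-in for a large result object (hypothetical)."""


first_name = Features()
second_name = first_name         # a second reference to the same object
probe = weakref.ref(first_name)  # lets us check whether the object is alive

# The reported count includes the temporary reference held by the call itself.
print(sys.getrefcount(first_name))

del first_name                   # removes one name; the object survives
alive_after_first_del = probe() is not None

del second_name                  # the last reference is gone
gc.collect()                     # not required in CPython here; harmless elsewhere
alive_after_second_del = probe() is not None

print(alive_after_first_del, alive_after_second_del)
```

Even when the object is destroyed, this only shows that Python has reclaimed it, not that the process has handed the memory back to the operating system.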


Now, it could be that something around the 50,000th file is very large and that is what causes the OOM, so to test this I'd first try skipping the first 30,000 files:

<pre><code>file_list_chunks = list(divide_chunks(file_list_1[30000:], 20000))
</code></pre>

If it fails at 10,000, this will confirm whether 20k is too big a chunk size; if it fails at 50,000 again, there is an issue with the code...

<hr>

Okay, onto the code...

Firstly, you don't need the explicit <code>list</code> constructor. In python it's much better to iterate than to generate the entire list in memory.

<pre><code>file_list_chunks = list(divide_chunks(file_list_1,20000))
# becomes
file_list_chunks = divide_chunks(file_list_1,20000)
</code></pre>

I think you might be misusing ThreadPool here:

<blockquote>
 Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
</blockquote>

This reads like <code>close</code> might leave some things still running. Although I guess this is safe, it feels a little un-pythonic; it's better to use the context manager for ThreadPool:

<pre><code>with ThreadPool(64) as pool:
    results = pool.map(get_image_features,f)
    # etc.
</code></pre>

The explicit <code>del</code>s in python <a href="https://stackoverflow.com/a/1316788/1240268">aren't actually guaranteed to free memory</a>.

You should collect after the join/after the with:

<pre><code>with ThreadPool(..) as pool:
    ...
    pool.close()  # join() raises ValueError unless the pool is closed first
    pool.join()
gc.collect()
</code></pre>

You could also try chunking this into smaller pieces, e.g. 10,000 or even smaller!

<hr>

<h3>Hammer 1</h3>

One thing I would consider doing here, instead of using pandas DataFrames and large lists, is to use a SQL database. You can do this locally with <a href="https://docs.python.org/3/library/sqlite3.html" rel="noreferrer">sqlite3</a>:

<pre><code>import sqlite3
conn = sqlite3.connect(':memory:', check_same_thread=False) # or, use a file e.g. 'image-features.db'
</code></pre>

and use the connection as a context manager:

<pre><code>with conn:
    conn.execute('''CREATE TABLE images
                    (filename text, features text)''')

with conn:
    # Insert a row of data
    conn.execute("INSERT INTO images VALUES ('my-image.png','feature1,feature2')")
</code></pre>

That way, we won't have to handle the large list objects or DataFrame.

You can pass the connection to each of the threads... you might have to do something a little weird like:

<pre><code>results = pool.map(get_image_features, zip(itertools.repeat(conn), f))
</code></pre>

Then, after the calculation is complete, you can select everything from the database into whichever format you like, e.g. using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html" rel="noreferrer">read_sql</a>.
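A rough, self-contained sketch of the sqlite3 idea follows. It deliberately differs from the snippet above in one respect: rather than passing the connection into every thread, the worker threads only compute features and the calling thread does all the inserts, which avoids sharing the connection across threads. `get_image_features` here is a placeholder for the question's function, and `store_features`, `workers` and `batch_size` are hypothetical names chosen for illustration.

```python
import sqlite3
from multiprocessing.pool import ThreadPool


def get_image_features(filename):
    """Placeholder for the question's function; returns (filename, features)."""
    return filename, f"feature-for-{filename}"


def store_features(conn, filenames, workers=4, batch_size=1000):
    """Compute features in worker threads, insert rows from the calling thread."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images (filename TEXT, features TEXT)"
    )
    batch = []
    with ThreadPool(workers) as pool:
        # imap yields results one by one instead of building one big list.
        for row in pool.imap(get_image_features, filenames, chunksize=64):
            batch.append(row)
            if len(batch) >= batch_size:
                with conn:  # commits on success, rolls back on error
                    conn.executemany("INSERT INTO images VALUES (?, ?)", batch)
                batch.clear()
    if batch:  # flush the final partial batch
        with conn:
            conn.executemany("INSERT INTO images VALUES (?, ?)", batch)


conn = sqlite3.connect(":memory:")  # or a file such as 'image-features.db'
store_features(conn, [f"img_{i}.png" for i in range(2500)])
print(conn.execute("SELECT COUNT(*) FROM images").fetchone()[0])
```

With this layout at most `batch_size` rows are held in Python at any time, and a file-backed database keeps the results on disk rather than in RAM.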

<hr>

<h3>Hammer 2</h3>

Use a subprocess here: rather than running this in the same instance of python, "shell out" to another.

Since you can pass start and end to python via <code>sys.argv</code>, you can slice on these:

<pre><code># main.py
import subprocess

# a for loop to iterate over this
subprocess.check_call(["python", "chunk.py", "0", "20000"])

# chunk.py a b
import sys

for count,f in enumerate(file_list_chunks):
    if count &lt; int(sys.argv[1]) or count &gt; int(sys.argv[2]):
        continue  # outside the slice given on the command line
    # do stuff
</code></pre>

That way, the subprocess will properly clean up python (there's no way there'll be memory leaks, since the process will be terminated).
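A runnable sketch of this pattern is shown below. To stay self-contained, the child program is passed to the interpreter with `-c` instead of living in a separate `chunk.py`; `CHILD_CODE` and `run_chunk` are hypothetical names, and the child only prints a message where a real script would compute and save features.

```python
import subprocess
import sys

# Child program passed via -c so the sketch is self-contained; in practice
# this would live in its own file (the answer calls it chunk.py).
CHILD_CODE = """
import sys
start, end = int(sys.argv[1]), int(sys.argv[2])
# Placeholder work: a real script would load filenames[start:end] and
# write the features to disk before exiting.
print(f"processed {end - start} items")
"""


def run_chunk(start, end):
    """Process items [start, end) in a fresh interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(start), str(end)],
        check=True,           # raise CalledProcessError if the child fails
        capture_output=True,
        text=True,
    )
    return completed.stdout.strip()


total, chunk_size = 100, 40  # small numbers for illustration
for start in range(0, total, chunk_size):
    end = min(start + chunk_size, total)
    print(run_chunk(start, end))
```

Using `sys.executable` rather than a bare `"python"` makes the child run under the same interpreter and environment as the parent.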

<hr>

My bet is that Hammer 1 is the way to go. It feels like you're gluing together a lot of data and reading it into python lists unnecessarily; using sqlite3 (or some other database) completely avoids that.


Your problem is that you are using threading where multiprocessing should be used (CPU bound vs IO bound).

I would refactor your code a bit like this:

<pre><code>import multiprocessing
from multiprocessing import Pool

if __name__ == '__main__':
    cpus = multiprocessing.cpu_count()
    # leave one core free, but never ask for fewer than one worker
    with Pool(max(1, cpus - 1)) as p:
        p.map(get_image_features, file_list_1)
</code></pre>

and then I would change the function <code>get_image_features</code> by appending (something like) these two lines to the end of it. I can't tell how exactly you are processing those images but the idea is to do every image inside each process and then immediately also save it to disk:

<pre><code>df = pd.DataFrame({'filename':list_a,'image_features':list_b})
df.to_pickle("PATH_TO_FILE"+str(count)+".pickle")
</code></pre>

So the dataframe will be pickled and saved inside each process, instead of after it exits. Processes get cleaned out of memory as soon as they exit, so this should keep the memory footprint low.


<code>pd.DataFrame(...)</code> may leak on some linux builds (see github <a href="https://github.com/pandas-dev/pandas/issues/2659#issuecomment-415177442" rel="nofollow noreferrer">issue and "workaround"</a>), so even <code>del df</code> might not help.

In your case, the solution from github can be used without monkey-patching <code>pd.DataFrame.__del__</code>:

<pre class="lang-py prettyprint-override"><code>from ctypes import cdll, CDLL
from multiprocessing.pool import ThreadPool

import pandas as pd

try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None


if not libc:
    print("Sorry, but pandas.DataFrame may leak over time even if its instances are deleted...")


CHUNK_SIZE = 20000


#file_list_1 contains 100,000 images
with ThreadPool(64) as pool:
    for count,f in enumerate(divide_chunks(file_list_1, CHUNK_SIZE)):
        # process one chunk with the shared pool of workers
        results = pool.map(get_image_features,f)
        # unpack the results and save them to disk
        list_a, list_b = zip(*results)
        df = pd.DataFrame({'filename':list_a,'image_features':list_b})
        df.to_pickle("PATH_TO_FILE"+str(count)+".pickle")

        del df

        # 2 new lines of code:
        if libc:  # Fix leaking of pd.DataFrame(...)
            libc.malloc_trim(0)

print("pool closed")
</code></pre>

P.S. This solution will not help if any single dataframe is too big. That can only be helped by reducing <code>CHUNK_SIZE</code>.

I am trying to iterate over 100,000 images, capture some image features, and store the resulting DataFrame on disk as a pickle file.

Unfortunately, due to RAM constraints, I am forced to split the images into chunks of 20,000 and perform the operations on each chunk before saving the results to disk.

The code below is supposed to save the DataFrame of results for 20,000 images before starting the loop iteration that processes the next 20,000 images.

However, this does not seem to solve my problem, as the memory is not released from RAM at the end of the first for-loop iteration.

So somewhere while processing the 50,000th record, the program crashes with an out-of-memory error.

I tried deleting the objects after saving them to disk and invoking the garbage collector, but the RAM usage does not seem to go down.

What am I missing?

<pre><code>#file_list_1 contains 100,000 images
file_list_chunks = list(divide_chunks(file_list_1,20000))
for count,f in enumerate(file_list_chunks):
    # make the Pool of workers
    pool = ThreadPool(64)
    results = pool.map(get_image_features,f)
    # close the pool and wait for the work to finish
    list_a, list_b = zip(*results)
    df = pd.DataFrame({'filename':list_a,'image_features':list_b})
    df.to_pickle("PATH_TO_FILE"+str(count)+".pickle")
    del list_a
    del list_b
    del df
    gc.collect()
    pool.close()
    pool.join()
    print("pool closed")
</code></pre>

How to destroy Python objects and free up memory
