
TensorFlow GPU training fails with OOM, while the exact same code runs without problems on an older computer

Stack Overflow user

Asked on 2022-02-02 13:56:29

Answers: 1 · Views: 238 · Followers: 0 · Votes: 0

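I recently bought a new laptop that has integrated Intel graphics as well as a discrete Nvidia GPU. I installed CUDA and the following driver version: NVIDIA-SMI 510.39.01, Driver Version: 510.39.01, CUDA Version: 11.6.

I also have TensorFlow 2.7. I am trying to run a network that worked fine on my old computer; it is taken more or less directly from this repository: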

https://github.com/zhixuhao/unet.git

However, when I start the model, I get the following warning:

2022-02-02 14:47:03.039319: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

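When I try to train the model, it runs out of memory (OOM) without training at all. (The error message is shortened because of the character limit.)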

2022-02-02 14:50:00.390958: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:00.391000: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:02.218748: W tensorflow/core/common_runtime/bfc_allocator.cc:275] 
2022-02-02 14:50:12.687857: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16daade200 of size 524288 next 155
2022-02-02 14:50:12.687865: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16dab5e200 of size 1179648 next 157
2022-02-02 14:50:12.688670: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 5.17GiB
2022-02-02 14:50:12.688678: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 6427901952 memory_limit_: 6427901952 available bytes: 0 curr_region_allocation_bytes_: 12855803904
2022-02-02 14:50:12.688695: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats: 
Limit:                      6427901952
InUse:                      5546784512
MaxInUse:                   6126439680
NumAllocs:                         645
MaxAllocSize:               3716153344
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-02-02 14:50:12.688718: W tensorflow/core/common_runtime/bfc_allocator.cc:474] **********************************************************_____******************************_______
2022-02-02 14:50:12.688819: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_input_ops.cc:335 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
Input In [7], in <module>
----> 1 model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])

File ~/.local/lib/python3.8/site-packages/keras/engine/training.py:2016, in Model.fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2005 """Fits the model on data yielded batch-by-batch by a Python generator.
   2006 
   2007 DEPRECATED:
   2008   `Model.fit` now supports generators, so there is no longer any need to use
   2009   this endpoint.
   2010 """
   2011 warnings.warn(
   2012     '`Model.fit_generator` is deprecated and '
   2013     'will be removed in a future version. '
   2014     'Please use `Model.fit`, which supports generators.',
   2015     stacklevel=2)
-> 2016 return self.fit(
   2017     generator,
   2018     steps_per_epoch=steps_per_epoch,
   2019     epochs=epochs,
   2020     verbose=verbose,
   2021     callbacks=callbacks,
   2022     validation_data=validation_data,
   2023     validation_steps=validation_steps,
   2024     validation_freq=validation_freq,
   2025     class_weight=class_weight,
   2026     max_queue_size=max_queue_size,
   2027     workers=workers,
   2028     use_multiprocessing=use_multiprocessing,
   2029     shuffle=shuffle,
   2030     initial_epoch=initial_epoch)

File ~/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     65 except Exception as e:  # pylint: disable=broad-except
     66   filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67   raise e.with_traceback(filtered_tb) from None
     68 finally:
     69   del filtered_tb

File ~/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:58, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     56 try:
     57   ctx.ensure_initialized()
---> 58   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     59                                       inputs, attrs, num_outputs)
     60 except core._NotOkStatusException as e:
     61   if name is not None:

ResourceExhaustedError:  OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput
 (defined at /home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py:464)
]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_3020]

Errors may have originated from an input operation.
Input Source operations connected to node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput:
In[0] gradient_tape/model/conv2d_20/Conv2D/ShapeN:      
In[1] model/conv2d_20/Conv2D/ReadVariableOp (defined at /home/john/.local/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py:224)      
In[2] gradient_tape/model/conv2d_20/ReluGrad:

Operation defined at: (most recent call last)
>>>   File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>>>     return _run_code(code, main_globals, None,
>>> 
>>>   File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
>>>     exec(code, run_globals)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
>>>     app.launch_new_instance()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/traitlets/config/application.py", line 846, in launch_instance
>>>     app.start()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 677, in start
>>>     self.io_loop.start()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
>>>     self.asyncio_loop.run_forever()
>>> 
>>>   File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
>>>     self._run_once()
>>> 
>>>   File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
>>>     handle._run()
>>> 
>>>   File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
>>>     self._context.run(self._callback, *self._args)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 461, in dispatch_queue
>>>     await self.process_one()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 450, in process_one
>>>     await dispatch(*args)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 357, in dispatch_shell
>>>     await result
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 652, in execute_request
>>>     reply_content = await reply_content
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 353, in do_execute
>>>     res = shell.run_cell(code, store_history=store_history, silent=silent)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 532, in run_cell
>>>     return super().run_cell(*args, **kwargs)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2768, in run_cell
>>>     result = self._run_cell(
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2814, in _run_cell
>>>     return runner(coro)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
>>>     coro.send(None)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3012, in run_cell_async
>>>     has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3191, in run_ast_nodes
>>>     if await self.run_code(code, result, async_=asy):
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3251, in run_code
>>>     exec(code_obj, self.user_global_ns, self.user_ns)
>>> 
>>>   File "/tmp/ipykernel_20858/1898079364.py", line 1, in <module>
>>>     model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 2016, in fit_generator
>>>     return self.fit(
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>>     return fn(*args, **kwargs)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1216, in fit
>>>     tmp_logs = self.train_function(iterator)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function
>>>     return step_function(self, iterator)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function
>>>     outputs = model.distribute_strategy.run(run_step, args=(data,))
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step
>>>     outputs = model.train_step(data)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 816, in train_step
>>>     self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 530, in minimize
>>>     grads_and_vars = self._compute_gradients(
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 583, in _compute_gradients
>>>     grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 464, in _get_gradients
>>>     grads = tape.gradient(loss, var_list, grad_loss)
>>>

Does anyone know what is causing this and how to fix it? Best

python

tensorflow

1 Answer

Stack Overflow user

Answered on 2022-02-14 08:22:35

Try setting a hard limit on total GPU memory, as shown in this guide, and let us know whether it works.

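import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of reserving it all at startup.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
# Create the session first, before any other TensorFlow GPU work.
sess = tf.compat.v1.Session(config=config)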
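Note that `allow_growth` enables on-demand allocation; it is not itself a hard limit. As a minimal sketch, the same two ideas can be expressed with the TensorFlow 2 `tf.config` API. The flag `USE_HARD_LIMIT` and the value `MEMORY_LIMIT_MB = 4096` are assumptions chosen for illustration and do not come from the original answer. The two options are alternatives for a given device, so the sketch applies only one of them. Neither adds memory: if a training step genuinely needs more than the card provides, the OOM will most likely remain.

```python
import tensorflow as tf

# Hypothetical settings for illustration; adjust them for the actual GPU.
USE_HARD_LIMIT = False    # True: fixed cap, False: on-demand growth
MEMORY_LIMIT_MB = 4096    # assumed cap in MiB, not taken from the answer

gpus = tf.config.list_physical_devices("GPU")
try:
    if USE_HARD_LIMIT and gpus:
        # Expose the first GPU as one logical device with a fixed memory cap.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=MEMORY_LIMIT_MB)],
        )
    else:
        # Allocate GPU memory as needed instead of reserving it up front.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as err:
    # These settings can only be changed before the GPUs are initialized.
    print(f"GPU configuration must be set before first use: {err}")
```

Run this at the very top of the notebook or script, before building the model; in a Jupyter session that usually means restarting the kernel first.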

You can also check here.
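To put the numbers from the error message in perspective, the following sketch computes the size of the tensor that failed to allocate and compares it with the allocator statistics printed in the log. The shape, `Limit`, and `InUse` values are copied from the question; the helper `tensor_bytes` is a hypothetical name introduced here, and 4 bytes per element assumes `float32`, which is what "type float" normally means in TensorFlow.

```python
from math import prod

MIB = 1024 ** 2
GIB = 1024 ** 3


def tensor_bytes(shape, bytes_per_element=4):
    """Return the memory footprint of a dense tensor (float32 by default)."""
    return prod(shape) * bytes_per_element


# Values copied from the log in the question.
oom_shape = (1, 128, 1024, 1024)   # tensor that could not be allocated
limit_bytes = 6427901952           # "Limit" in the allocator stats
in_use_bytes = 5546784512          # "InUse" in the allocator stats

size = tensor_bytes(oom_shape)
print(f"failed tensor: {size / MIB:.0f} MiB")        # 512 MiB
print(f"allocator limit: {limit_bytes / GIB:.2f} GiB")
print(f"in use at failure: {in_use_bytes / GIB:.2f} GiB")

# Activation memory scales with height * width, so halving both spatial
# dimensions cuts this tensor to a quarter of its size.
half_res = tensor_bytes((1, 128, 512, 512))
print(f"same tensor at half resolution: {half_res / MIB:.0f} MiB")
```

A single tensor of this shape takes 512 MiB even with a batch size of 1, and backpropagation keeps many activations alive at once, so a limit of roughly 6 GiB fills up quickly. If the memory settings do not help, reducing the input resolution or the number of filters is the more likely fix, assuming the model can tolerate it.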

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70956910
