TensorFlow GPU training fails with OOM, while the exact same code runs without problems on an older computer

Stack Overflow user
Asked 2022-02-02 13:56:29
1 answer, 238 views, 0 votes

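I recently bought a new laptop that has integrated Intel graphics as well as a graphics card from Nvidia. I installed CUDA and this driver version: NVIDIA-SMI 510.39.01, Driver Version: 510.39.01, CUDA Version: 11.6.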

I also have TensorFlow 2.7. I am trying to run a network that worked fine on my old computer; it is taken more or less from this repository:

https://github.com/zhixuhao/unet.git

However, when I start up the model, I get the following warning:

2022-02-02 14:47:03.039319: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

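When I try to train the model, it runs out of memory (OOM) without training at all. (The error message is shortened because of the character limit.)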

2022-02-02 14:50:00.390958: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:00.391000: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:02.218748: W tensorflow/core/common_runtime/bfc_allocator.cc:275] 
2022-02-02 14:50:12.687857: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16daade200 of size 524288 next 155
2022-02-02 14:50:12.687865: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16dab5e200 of size 1179648 next 157
2022-02-02 14:50:12.688670: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 5.17GiB
2022-02-02 14:50:12.688678: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 6427901952 memory_limit_: 6427901952 available bytes: 0 curr_region_allocation_bytes_: 12855803904
2022-02-02 14:50:12.688695: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats: 
Limit:                      6427901952
InUse:                      5546784512
MaxInUse:                   6126439680
NumAllocs:                         645
MaxAllocSize:               3716153344
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2022-02-02 14:50:12.688718: W tensorflow/core/common_runtime/bfc_allocator.cc:474] **********************************************************_____******************************_______
2022-02-02 14:50:12.688819: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_input_ops.cc:335 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
Input In [7], in <module>
----> 1 model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])

File ~/.local/lib/python3.8/site-packages/keras/engine/training.py:2016, in Model.fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2005 """Fits the model on data yielded batch-by-batch by a Python generator.
   2006 
   2007 DEPRECATED:
   2008   `Model.fit` now supports generators, so there is no longer any need to use
   2009   this endpoint.
   2010 """
   2011 warnings.warn(
   2012     '`Model.fit_generator` is deprecated and '
   2013     'will be removed in a future version. '
   2014     'Please use `Model.fit`, which supports generators.',
   2015     stacklevel=2)
-> 2016 return self.fit(
   2017     generator,
   2018     steps_per_epoch=steps_per_epoch,
   2019     epochs=epochs,
   2020     verbose=verbose,
   2021     callbacks=callbacks,
   2022     validation_data=validation_data,
   2023     validation_steps=validation_steps,
   2024     validation_freq=validation_freq,
   2025     class_weight=class_weight,
   2026     max_queue_size=max_queue_size,
   2027     workers=workers,
   2028     use_multiprocessing=use_multiprocessing,
   2029     shuffle=shuffle,
   2030     initial_epoch=initial_epoch)

File ~/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     65 except Exception as e:  # pylint: disable=broad-except
     66   filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67   raise e.with_traceback(filtered_tb) from None
     68 finally:
     69   del filtered_tb

File ~/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:58, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     56 try:
     57   ctx.ensure_initialized()
---> 58   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     59                                       inputs, attrs, num_outputs)
     60 except core._NotOkStatusException as e:
     61   if name is not None:

ResourceExhaustedError:  OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput
 (defined at /home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py:464)
]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_3020]

Errors may have originated from an input operation.
Input Source operations connected to node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput:
In[0] gradient_tape/model/conv2d_20/Conv2D/ShapeN:      
In[1] model/conv2d_20/Conv2D/ReadVariableOp (defined at /home/john/.local/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py:224)      
In[2] gradient_tape/model/conv2d_20/ReluGrad:

Operation defined at: (most recent call last)
>>>   File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>>>     return _run_code(code, main_globals, None,
>>> 
>>>   File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
>>>     exec(code, run_globals)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
>>>     app.launch_new_instance()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/traitlets/config/application.py", line 846, in launch_instance
>>>     app.start()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 677, in start
>>>     self.io_loop.start()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
>>>     self.asyncio_loop.run_forever()
>>> 
>>>   File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
>>>     self._run_once()
>>> 
>>>   File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
>>>     handle._run()
>>> 
>>>   File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
>>>     self._context.run(self._callback, *self._args)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 461, in dispatch_queue
>>>     await self.process_one()
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 450, in process_one
>>>     await dispatch(*args)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 357, in dispatch_shell
>>>     await result
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 652, in execute_request
>>>     reply_content = await reply_content
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 353, in do_execute
>>>     res = shell.run_cell(code, store_history=store_history, silent=silent)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 532, in run_cell
>>>     return super().run_cell(*args, **kwargs)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2768, in run_cell
>>>     result = self._run_cell(
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2814, in _run_cell
>>>     return runner(coro)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
>>>     coro.send(None)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3012, in run_cell_async
>>>     has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3191, in run_ast_nodes
>>>     if await self.run_code(code, result, async_=asy):
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3251, in run_code
>>>     exec(code_obj, self.user_global_ns, self.user_ns)
>>> 
>>>   File "/tmp/ipykernel_20858/1898079364.py", line 1, in <module>
>>>     model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 2016, in fit_generator
>>>     return self.fit(
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>>     return fn(*args, **kwargs)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1216, in fit
>>>     tmp_logs = self.train_function(iterator)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function
>>>     return step_function(self, iterator)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function
>>>     outputs = model.distribute_strategy.run(run_step, args=(data,))
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step
>>>     outputs = model.train_step(data)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 816, in train_step
>>>     self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 530, in minimize
>>>     grads_and_vars = self._compute_gradients(
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 583, in _compute_gradients
>>>     grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
>>> 
>>>   File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 464, in _get_gradients
>>>     grads = tape.gradient(loss, var_list, grad_loss)
>>> 

Does anyone know what causes this and how to fix it? Best regards


1 Answer

Stack Overflow user

Answered 2022-02-14 08:22:35

Try setting a hard limit on the total GPU memory, as shown in this guide, and let us know whether it works.

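import tensorflow as tf

# TF1-style session configuration, reached through the compat API
config = tf.compat.v1.ConfigProto()
# Allocate GPU memory on demand instead of reserving it all up front
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)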
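
Note that `allow_growth` makes the allocation grow on demand; it does not by itself impose a fixed cap. For a hard limit in the TensorFlow 2.x API, a minimal sketch could look like the following. The helper name `cap_gpu_memory` and the 4096 MB value are illustrative assumptions, not taken from the question, and whether a cap resolves this particular OOM is not confirmed here.

```python
import tensorflow as tf


def cap_gpu_memory(limit_mb):
    """Cap TensorFlow's allocation on every visible GPU to limit_mb megabytes.

    Hypothetical helper for illustration. It has to run before any op
    initializes the GPU; otherwise TensorFlow raises a RuntimeError.
    Returns the number of GPUs that were configured.
    """
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # One logical device per physical GPU, with a fixed memory ceiling
        tf.config.set_logical_device_configuration(
            gpu,
            [tf.config.LogicalDeviceConfiguration(memory_limit=limit_mb)],
        )
    return len(gpus)


# 4096 MB is an arbitrary example value, chosen below the ~6 GiB limit in the log
capped = cap_gpu_memory(4096)
print(f"GPUs capped: {capped}")
```

On a machine without a visible GPU the loop simply does nothing and the function returns 0, so the call is safe to leave in a notebook's first cell.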

You can also check here.
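
To put the numbers from the question's log in perspective, the size of the tensor that failed to allocate can be worked out by hand. The sketch below only does arithmetic on values copied from the log; the 512 x 512 variant is a hypothetical shape, included purely to show how the size scales when the last two dimensions are halved.

```python
GIB = 1024 ** 3


def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Values copied from the allocator stats and the OOM message in the question
limit_bytes = 6427901952
in_use_bytes = 5546784512
failed_bytes = tensor_bytes([1, 128, 1024, 1024])

# Hypothetical shape with the last two dimensions halved (not from the question)
halved_bytes = tensor_bytes([1, 128, 512, 512])

print(f"failed tensor:   {failed_bytes / GIB:.3f} GiB")
print(f"limit - in use:  {(limit_bytes - in_use_bytes) / GIB:.3f} GiB")
print(f"halved variant:  {halved_bytes / GIB:.3f} GiB")
```

A single float32 tensor of that shape needs 0.5 GiB, and halving the last two dimensions cuts it to a quarter, which is why reducing the input resolution is a common way to test whether a model fits in memory at all.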

Votes: 0

Page content originally provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/70956910
