I recently bought a new laptop that has an Intel integrated GPU plus a dedicated Nvidia card. I installed CUDA and this driver version: NVIDIA-SMI 510.39.01, Driver Version: 510.39.01, CUDA Version: 11.6.
I also have TensorFlow 2.7. I am trying to run a network that worked fine on my old computer; it is more or less taken from this repository:
https://github.com/zhixuhao/unet.git
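For reference, the training step is essentially the repo's own example. Below is a minimal sketch of what I run: the unet(), trainGenerator and ModelCheckpoint names follow that repository, the data paths, augmentation dict and the 1024x1024 input size are placeholders/assumptions on my part, and only the final fit_generator call is copied verbatim from the traceback further down.

from model import unet
from data import trainGenerator
from keras.callbacks import ModelCheckpoint

# placeholder data pipeline; the paths and target_size are assumptions, not necessarily my exact values
myGene = trainGenerator(1, 'data/membrane/train', 'image', 'label', dict(), target_size=(1024, 1024))
# assuming 1024x1024 single-channel inputs (a guess based on the tensor shape in the OOM message)
model = unet(input_size=(1024, 1024, 1))
model_checkpoint = ModelCheckpoint('unet_membrane.hdf5', monitor='loss', verbose=1, save_best_only=True)
# this line is the exact call shown in the traceback below
model.fit_generator(myGene, steps_per_epoch=300, epochs=10, callbacks=[model_checkpoint])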
However, when I launch the model I get the following warning:
2022-02-02 14:47:03.039319: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
And when I try to train the model, it runs out of memory (OOM) before any training happens. (Error messages trimmed to fit the character limit.)
2022-02-02 14:50:00.390958: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:00.391000: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.02GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-02-02 14:50:02.218748: W tensorflow/core/common_runtime/bfc_allocator.cc:275]
2022-02-02 14:50:12.687857: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16daade200 of size 524288 next 155
2022-02-02 14:50:12.687865: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f16dab5e200 of size 1179648 next 157
2022-02-02 14:50:12.688670: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 5.17GiB
2022-02-02 14:50:12.688678: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 6427901952 memory_limit_: 6427901952 available bytes: 0 curr_region_allocation_bytes_: 12855803904
2022-02-02 14:50:12.688695: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit: 6427901952
InUse: 5546784512
MaxInUse: 6126439680
NumAllocs: 645
MaxAllocSize: 3716153344
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-02-02 14:50:12.688718: W tensorflow/core/common_runtime/bfc_allocator.cc:474] **********************************************************_____******************************_______
2022-02-02 14:50:12.688819: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_input_ops.cc:335 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
Input In [7], in <module>
----> 1 model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])
File ~/.local/lib/python3.8/site-packages/keras/engine/training.py:2016, in Model.fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
2005 """Fits the model on data yielded batch-by-batch by a Python generator.
2006
2007 DEPRECATED:
2008 `Model.fit` now supports generators, so there is no longer any need to use
2009 this endpoint.
2010 """
2011 warnings.warn(
2012 '`Model.fit_generator` is deprecated and '
2013 'will be removed in a future version. '
2014 'Please use `Model.fit`, which supports generators.',
2015 stacklevel=2)
-> 2016 return self.fit(
2017 generator,
2018 steps_per_epoch=steps_per_epoch,
2019 epochs=epochs,
2020 verbose=verbose,
2021 callbacks=callbacks,
2022 validation_data=validation_data,
2023 validation_steps=validation_steps,
2024 validation_freq=validation_freq,
2025 class_weight=class_weight,
2026 max_queue_size=max_queue_size,
2027 workers=workers,
2028 use_multiprocessing=use_multiprocessing,
2029 shuffle=shuffle,
2030 initial_epoch=initial_epoch)
File ~/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py:67, in filter_traceback.<locals>.error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
File ~/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py:58, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
56 try:
57 ctx.ensure_initialized()
---> 58 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
59 inputs, attrs, num_outputs)
60 except core._NotOkStatusException as e:
61 if name is not None:
ResourceExhaustedError: OOM when allocating tensor with shape[1,128,1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput
(defined at /home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py:464)
]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_3020]
Errors may have originated from an input operation.
Input Source operations connected to node gradient_tape/model/conv2d_20/Conv2D/Conv2DBackpropInput:
In[0] gradient_tape/model/conv2d_20/Conv2D/ShapeN:
In[1] model/conv2d_20/Conv2D/ReadVariableOp (defined at /home/john/.local/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py:224)
In[2] gradient_tape/model/conv2d_20/ReluGrad:
Operation defined at: (most recent call last)
>>> File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
>>> return _run_code(code, main_globals, None,
>>>
>>> File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
>>> exec(code, run_globals)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel_launcher.py", line 16, in <module>
>>> app.launch_new_instance()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/traitlets/config/application.py", line 846, in launch_instance
>>> app.start()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelapp.py", line 677, in start
>>> self.io_loop.start()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
>>> self.asyncio_loop.run_forever()
>>>
>>> File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
>>> self._run_once()
>>>
>>> File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
>>> handle._run()
>>>
>>> File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
>>> self._context.run(self._callback, *self._args)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 461, in dispatch_queue
>>> await self.process_one()
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 450, in process_one
>>> await dispatch(*args)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 357, in dispatch_shell
>>> await result
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/kernelbase.py", line 652, in execute_request
>>> reply_content = await reply_content
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/ipkernel.py", line 353, in do_execute
>>> res = shell.run_cell(code, store_history=store_history, silent=silent)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/ipykernel/zmqshell.py", line 532, in run_cell
>>> return super().run_cell(*args, **kwargs)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2768, in run_cell
>>> result = self._run_cell(
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 2814, in _run_cell
>>> return runner(coro)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
>>> coro.send(None)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3012, in run_cell_async
>>> has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3191, in run_ast_nodes
>>> if await self.run_code(code, result, async_=asy):
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3251, in run_code
>>> exec(code_obj, self.user_global_ns, self.user_ns)
>>>
>>> File "/tmp/ipykernel_20858/1898079364.py", line 1, in <module>
>>> model.fit_generator(myGene,steps_per_epoch=300,epochs=10,callbacks=[model_checkpoint])
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 2016, in fit_generator
>>> return self.fit(
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 1216, in fit
>>> tmp_logs = self.train_function(iterator)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function
>>> return step_function(self, iterator)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function
>>> outputs = model.distribute_strategy.run(run_step, args=(data,))
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step
>>> outputs = model.train_step(data)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/engine/training.py", line 816, in train_step
>>> self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 530, in minimize
>>> grads_and_vars = self._compute_gradients(
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 583, in _compute_gradients
>>> grads_and_vars = self._get_gradients(tape, loss, var_list, grad_loss)
>>>
>>> File "/home/john/.local/lib/python3.8/site-packages/keras/optimizer_v2/optimizer_v2.py", line 464, in _get_gradients
>>> grads = tape.gradient(loss, var_list, grad_loss)
>>>
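For what it's worth, the single tensor that fails to allocate is already sizeable compared with the memory limit reported by the allocator; a quick back-of-the-envelope check using only the numbers from the log above:

# shape [1, 128, 1024, 1024], float32 (4 bytes per element), per the OOM message
failing_tensor_bytes = 1 * 128 * 1024 * 1024 * 4
print(failing_tensor_bytes / 2**30)   # 0.5 GiB for this one tensor
print(6427901952 / 2**30)             # memory_limit_ from the allocator stats: ~6.0 GiB
# with 5.17 GiB already in use ("Sum Total of in-use chunks"), the extra 0.5 GiB does not fit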
Does anyone have an idea what is causing this / how to solve it? Best,
https://stackoverflow.com/questions/70956910