Q: Running out of memory with PyTorch

Stack Overflow user

Asked 2021-08-02 15:39:25

Answers: 1, views: 1.6K, followers: 0, votes: 1

I am trying to train a model for audio classification using Hugging Face's wav2vec. I keep getting this error:

The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: name, emotion, path.
***** Running training *****
  Num examples = 2708
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 42
 [ 2/42 : < :, Epoch 0.02/1]
Step    Training Loss   Validation Loss

RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "<ipython-input-81-dd9fe3ea0f13>", line 77, in forward
    return_dict=return_dict,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1073, in forward
    return_dict=return_dict,
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 732, in forward
    hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 574, in forward
    hidden_states = hidden_states + self.feed_forward(self.final_layer_norm(hidden_states))
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 510, in forward
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1555, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 11.44 MiB free; 10.68 GiB reserved in total by PyTorch)

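I am on an AWS Deep Learning AMI EC2 instance.

I have been researching this a lot. I have already tried:

- reducing the batch size (I want 4, but I have gone down to 1 with no change in the error)
- adding import gc, gc.collect(), and torch.cuda.empty_cache()
- removing all wav files in my dataset that are longer than 6 seconds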

Is there anything else I can do? I am on a p2.8xlarge with 105 GiB mounted.
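For orientation, the batch figures in the training log are consistent with the batch being spread over 8 GPUs. The sketch below reproduces that arithmetic. The GPU count of 8 is an assumption (it is what a p2.8xlarge provides), and the step-counting rule is an assumption about how the Trainer counts steps, chosen because it matches both logs in this question. The function names are hypothetical.

```python
import math


def total_train_batch_size(per_device: int, n_gpus: int, grad_accum: int) -> int:
    """Examples consumed per optimizer step across all GPUs."""
    return per_device * n_gpus * grad_accum


def optimization_steps(n_examples: int, per_device: int, n_gpus: int,
                       grad_accum: int, epochs: int = 1) -> int:
    """Optimizer steps, assuming the last partial batch is kept and batches
    that do not fill a whole accumulation window are not counted."""
    batches_per_epoch = math.ceil(n_examples / (per_device * n_gpus))
    return (batches_per_epoch // grad_accum) * epochs


# Values taken from the two training logs; n_gpus=8 is an assumption.
print(total_train_batch_size(per_device=4, n_gpus=8, grad_accum=2))
print(optimization_steps(2708, per_device=4, n_gpus=8, grad_accum=2))
print(optimization_steps(1411, per_device=4, n_gpus=8, grad_accum=2))
```

Under these assumptions each GPU still receives 4 examples per forward pass, and the out-of-memory error is raised on a single device with about 11 GiB, so the per-device batch size and the clip length are the quantities that drive per-GPU memory.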

Running torch.cuda.memory_summary(device=None, abbreviated=False) gives me:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 3            |        cudaMalloc retries: 4         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| Active memory         |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   10936 MB |   10960 MB |   63236 MB |   52300 MB |
|       from large pool |   10928 MB |   10954 MB |   63124 MB |   52196 MB |
|       from small pool |       8 MB |      98 MB |     112 MB |     104 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  443755 KB |    1309 MB |  155426 MB |  154992 MB |
|       from large pool |  443551 KB |    1306 MB |  155081 MB |  154648 MB |
|       from small pool |     204 KB |      12 MB |     344 MB |     344 MB |
|---------------------------------------------------------------------------|
| Allocations           |    1940    |    2622    |   32288    |   30348    |
|       from large pool |    1036    |    1618    |   21855    |   20819    |
|       from small pool |     904    |    1203    |   10433    |    9529    |
|---------------------------------------------------------------------------|
| Active allocs         |    1940    |    2622    |   32288    |   30348    |
|       from large pool |    1036    |    1618    |   21855    |   20819    |
|       from small pool |     904    |    1203    |   10433    |    9529    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     495    |     495    |    2169    |    1674    |
|       from large pool |     491    |     491    |    2113    |    1622    |
|       from small pool |       4    |      49    |      56    |      52    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     179    |     335    |   15998    |   15819    |
|       from large pool |     165    |     272    |   12420    |   12255    |
|       from small pool |      14    |      63    |    3578    |    3564    |
|===========================================================================|

After reducing the data to inputs shorter than 2 seconds, training gets further but still fails:

The following columns in the training set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running training *****
  Num examples = 1411
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 22
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 [11/22 01:12 < 01:28, 0.12 it/s, Epoch 0.44/1]
Step    Training Loss   Validation Loss Accuracy
10  2.428100    2.257138    0.300283
The following columns in the evaluation set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running Evaluation *****
  Num examples = 353
  Batch size = 32
Saving model checkpoint to trainingArgs/checkpoint-10
Configuration saved in trainingArgs/checkpoint-10/config.json
Model weights saved in trainingArgs/checkpoint-10/pytorch_model.bin
Configuration saved in trainingArgs/checkpoint-10/preprocessor_config.json
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    378             with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 379                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    380                 return

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
    498         num_bytes = storage.size() * storage.element_size()
--> 499         zip_file.write_record(name, storage.data_ptr(), num_bytes)
    500 

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1334                     self.control = self.callback_handler.on_step_end(args, self.state, self.control)
   1335 
-> 1336                     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1337                 else:
   1338                     self.control = self.callback_handler.on_substep_end(args, self.state, self.control)

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
   1441 
   1442         if self.control.should_save:
-> 1443             self._save_checkpoint(model, trial, metrics=metrics)
   1444             self.control = self.callback_handler.on_save(self.args, self.state, self.control)
   1445 

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
   1531         elif self.args.should_save and not self.deepspeed:
   1532             # deepspeed.save_checkpoint above saves model/optim/sched
-> 1533             torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
   1534             with warnings.catch_warnings(record=True) as caught_warnings:
   1535                 torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    378             with _open_zipfile_writer(opened_file) as opened_zipfile:
    379                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
--> 380                 return
    381         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
    382 

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
    257 
    258     def __exit__(self, *args) -> None:
--> 259         self.file_like.write_end_of_file()
    260         self.buffer.flush()
    261 

RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 1849920000 vs 1849919888

When I run !free in the notebook, I get:

The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.
              total        used        free      shared  buff/cache   available
Mem:      503392908     6223452   478499292      346492    18670164   492641984
Swap:             0           0           0
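Note that free reports RAM, whereas "[Errno 28] No space left on device" and "database or disk is full" refer to disk space. A minimal sketch for checking the filesystem that receives the checkpoints, using only the standard library, might look like the following. The helper names are hypothetical, and "." stands in for the checkpoint directory (trainingArgs in the log above).

```python
import shutil


def disk_free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 2**30


def has_room_for(path: str, needed_bytes: int) -> bool:
    """True if the filesystem containing `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


# "." is a placeholder for the directory the checkpoints are written to.
print(f"{disk_free_gib('.'):.1f} GiB free")
```

The traceback shows the write failing roughly 1.85 GB into optimizer.pt, so a call such as has_room_for("trainingArgs", 2 * 10**9) before training would be one rough way to catch this early; the 2 GB figure is only an estimate read off that traceback.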

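For the training code, I am essentially running this Colab notebook as an example: https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb#scrollTo=6M8bNvLLJnG1

The only thing I am changing is the data and labels passed in, which I have deliberately placed in the same directory structure that the tutorial notebook uses. For some reason the tutorial notebook runs fine, even though my data is of similar size and has a similar number of classes.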

deep-learning

pytorch

huggingface-transformers

Stack Overflow user

Accepted answer

Answered 2021-08-23 13:17:51

You can use the DataParallel or DistributedDataParallel frameworks in PyTorch.

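import torch
import torch.nn as nn

# Model, input_size and output_size come from your own code.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)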

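With this approach, the model is replicated on every device (GPU) and the data is split across the devices.

DataParallel automatically splits the data and dispatches the work to the model replicas on the different GPUs. After each replica finishes, DataParallel collects and merges the results before returning them to you.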
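The split-and-merge behaviour described above can be imitated on the CPU to make it concrete. This is only an illustration of the idea, not DataParallel's actual implementation: it reuses a single model rather than real per-GPU replicas, and scatter_apply_gather and toy_model are hypothetical names.

```python
import torch
import torch.nn as nn


def scatter_apply_gather(model: nn.Module, batch: torch.Tensor,
                         n_replicas: int) -> torch.Tensor:
    """Split `batch` along dim 0, run the model on each chunk, concatenate."""
    chunks = torch.chunk(batch, n_replicas, dim=0)  # "scatter"
    outputs = [model(chunk) for chunk in chunks]    # one chunk per replica
    return torch.cat(outputs, dim=0)                # "gather"


torch.manual_seed(0)
toy_model = nn.Linear(5, 2)   # hypothetical stand-in for the real model
batch = torch.randn(30, 5)    # 30 examples, as in the code comment above

print([tuple(c.shape) for c in torch.chunk(batch, 3, dim=0)])
merged = scatter_apply_gather(toy_model, batch, n_replicas=3)
print(tuple(merged.shape))
```

Because every replica is a full copy of the model, this approach spreads the batch but does not shrink the model's own footprint on each GPU. The "Caught RuntimeError in replica 0" line in the question's traceback also suggests that the training run was already using replicas of this kind.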

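If the model does not fit into the memory of a single GPU, you should use a model-parallel approach instead.

Starting from your existing model, you can specify which layer sits on which GPU with .to('cuda:0'), .to('cuda:1'), and so on.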

import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000  # adjust to your task


class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        # First half of the network lives on GPU 0
        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,

            self.layer1,
            self.layer2
        ).to('cuda:0')

        # Second half lives on GPU 1
        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')

        self.fc.to('cuda:1')

    def forward(self, x):
        # Move the intermediate activations from GPU 0 to GPU 1
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))

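Because this can cost you performance (only one GPU is busy at any moment), you may want to use a pipelining approach, that is, split each input batch further into smaller chunks that are processed on the different devices in parallel.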
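A minimal sketch of that pipelining idea, assuming a toy two-stage network rather than ResNet or wav2vec2: the class name, layer sizes, and split_size are all hypothetical, and the code falls back to the CPU when fewer than two GPUs are present (in which case nothing actually overlaps).

```python
import torch
import torch.nn as nn


class TwoStagePipeline(nn.Module):
    """Two sequential stages on (possibly) different devices, fed micro-batches."""

    def __init__(self, dev0: str, dev1: str, split_size: int = 4):
        super().__init__()
        self.dev0, self.dev1, self.split_size = dev0, dev1, split_size
        self.stage1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU()).to(dev0)
        self.stage2 = nn.Linear(32, 3).to(dev1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = iter(x.split(self.split_size, dim=0))
        # Prime the pipeline with the first micro-batch.
        prev = self.stage1(next(splits).to(self.dev0)).to(self.dev1)
        outputs = []
        for micro_batch in splits:
            # Stage 2 consumes the previous micro-batch ...
            outputs.append(self.stage2(prev))
            # ... while stage 1 is queued for the next one; on GPUs the
            # kernels launch asynchronously, so the two stages can overlap.
            prev = self.stage1(micro_batch.to(self.dev0)).to(self.dev1)
        outputs.append(self.stage2(prev))  # drain the last micro-batch
        return torch.cat(outputs, dim=0)


two_gpus = torch.cuda.device_count() >= 2
dev0, dev1 = ("cuda:0", "cuda:1") if two_gpus else ("cpu", "cpu")

torch.manual_seed(0)
pipe = TwoStagePipeline(dev0, dev1, split_size=4)
x = torch.randn(10, 16)
out = pipe(x)
print(tuple(out.shape))
```

The split_size value trades kernel-launch overhead against overlap: very small micro-batches keep both devices busy but add per-call overhead, so it is usually worth measuring a few values rather than guessing.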

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/68624392
