I'm trying to train a model for audio classification using Hugging Face's wav2vec2. I keep getting this error:
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: name, emotion, path.
***** Running training *****
Num examples = 2708
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 2
Total optimization steps = 42
[ 2/42 : < :, Epoch 0.02/1]
Step Training Loss Validation Loss
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "<ipython-input-81-dd9fe3ea0f13>", line 77, in forward
return_dict=return_dict,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1073, in forward
return_dict=return_dict,
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 732, in forward
hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 574, in forward
hidden_states = hidden_states + self.feed_forward(self.final_layer_norm(hidden_states))
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 510, in forward
hidden_states = self.intermediate_act_fn(hidden_states)
File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1555, in gelu
return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 11.44 MiB free; 10.68 GiB reserved in total by PyTorch)
I'm on an AWS Deep Learning AMI EC2 instance.
I've been researching this issue. I've already tried:
- reducing the batch size
- torch.cuda.empty_cache()
- shortening all of the wav files
Is there anything else I can do? I'm on a p2.8xlarge with 105 GiB mounted.
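For reference, these settings go through TrainingArguments; a rough sketch that mirrors the values shown in the training log above (the actual arguments in my notebook may differ):

import torch
from transformers import TrainingArguments

torch.cuda.empty_cache()  # one of the things I tried before re-running training

# values mirror the training log above; all other arguments omitted
training_args = TrainingArguments(
    output_dir="trainingArgs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
)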
Running torch.cuda.memory_summary(device=None, abbreviated=False) gives me:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|           CUDA OOMs: 3            |       cudaMalloc retries: 4           |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| Active memory         |    7550 MB |   10852 MB |  209624 MB |  202073 MB |
|       from large pool |    7544 MB |   10781 MB |  209325 MB |  201780 MB |
|       from small pool |       5 MB |      87 MB |     298 MB |     293 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   10936 MB |   10960 MB |   63236 MB |   52300 MB |
|       from large pool |   10928 MB |   10954 MB |   63124 MB |   52196 MB |
|       from small pool |       8 MB |      98 MB |     112 MB |     104 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  443755 KB |    1309 MB |  155426 MB |  154992 MB |
|       from large pool |  443551 KB |    1306 MB |  155081 MB |  154648 MB |
|       from small pool |     204 KB |      12 MB |     344 MB |     344 MB |
|---------------------------------------------------------------------------|
| Allocations           |       1940 |       2622 |      32288 |      30348 |
|       from large pool |       1036 |       1618 |      21855 |      20819 |
|       from small pool |        904 |       1203 |      10433 |       9529 |
|---------------------------------------------------------------------------|
| Active allocs         |       1940 |       2622 |      32288 |      30348 |
|       from large pool |       1036 |       1618 |      21855 |      20819 |
|       from small pool |        904 |       1203 |      10433 |       9529 |
|---------------------------------------------------------------------------|
| GPU reserved segments |        495 |        495 |       2169 |       1674 |
|       from large pool |        491 |        491 |       2113 |       1622 |
|       from small pool |          4 |         49 |         56 |         52 |
|---------------------------------------------------------------------------|
| Non-releasable allocs |        179 |        335 |      15998 |      15819 |
|       from large pool |        165 |        272 |      12420 |      12255 |
|       from small pool |         14 |         63 |       3578 |       3564 |
|===========================================================================|
After cutting the data down to only inputs shorter than 2 seconds, it trains for longer, but still fails with an error:
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running training *****
Num examples = 1411
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 2
Total optimization steps = 22
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
[11/22 01:12 < 01:28, 0.12 it/s, Epoch 0.44/1]
Step Training Loss Validation Loss Accuracy
10 2.428100 2.257138 0.300283
The following columns in the evaluation set don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, emotion, name.
***** Running Evaluation *****
Num examples = 353
Batch size = 32
Saving model checkpoint to trainingArgs/checkpoint-10
Configuration saved in trainingArgs/checkpoint-10/config.json
Model weights saved in trainingArgs/checkpoint-10/pytorch_model.bin
Configuration saved in trainingArgs/checkpoint-10/preprocessor_config.json
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
378 with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 379 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
380 return
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
498 num_bytes = storage.size() * storage.element_size()
--> 499 zip_file.write_record(name, storage.data_ptr(), num_bytes)
500
OSError: [Errno 28] No space left on device
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-25-3435b262f1ae> in <module>
----> 1 trainer.train()
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1334 self.control = self.callback_handler.on_step_end(args, self.state, self.control)
1335
-> 1336 self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
1337 else:
1338 self.control = self.callback_handler.on_substep_end(args, self.state, self.control)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _maybe_log_save_evaluate(self, tr_loss, model, trial, epoch, ignore_keys_for_eval)
1441
1442 if self.control.should_save:
-> 1443 self._save_checkpoint(model, trial, metrics=metrics)
1444 self.control = self.callback_handler.on_save(self.args, self.state, self.control)
1445
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/transformers/trainer.py in _save_checkpoint(self, model, trial, metrics)
1531 elif self.args.should_save and not self.deepspeed:
1532 # deepspeed.save_checkpoint above saves model/optim/sched
-> 1533 torch.save(self.optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
1534 with warnings.catch_warnings(record=True) as caught_warnings:
1535 torch.save(self.lr_scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
378 with _open_zipfile_writer(opened_file) as opened_zipfile:
379 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
--> 380 return
381 _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
382
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/serialization.py in __exit__(self, *args)
257
258 def __exit__(self, *args) -> None:
--> 259 self.file_like.write_end_of_file()
260 self.buffer.flush()
261
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 1849920000 vs 1849919888
When I run !free in the notebook, I get:
The history saving thread hit an unexpected error (OperationalError('database or disk is full')).History will not be written to the database.
total used free shared buff/cache available
Mem: 503392908 6223452 478499292 346492 18670164 492641984
Swap: 0 0 0
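RAM clearly isn't exhausted there, so the "No space left on device" error presumably refers to disk rather than memory. A quick way to check (not something from the original notebook, just a sketch) would be:

import shutil

# free space on the filesystem the checkpoints are written to ("trainingArgs" above)
total, used, free = shutil.disk_usage("trainingArgs")
print(f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")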
For the training code, I'm essentially running this Colab notebook as an example: https://colab.research.google.com/github/m3hrdadfi/soxan/blob/main/notebooks/Emotion_recognition_in_Greek_speech_using_Wav2Vec2.ipynb#scrollTo=6M8bNvLLJnG1
All I'm changing is the data/labels passed in, which I've deliberately arranged into the same directory structure the tutorial notebook uses. The tutorial notebook runs fine for some reason, even though my data is of a similar size and has a similar number of classes.
Answered on 2021-08-23 13:17:51
You can use the DataParallel or DistributedDataParallel frameworks in PyTorch.
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model(input_size, output_size)  # your own model class / sizes

if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)
In this approach, the model is replicated onto each device (GPU) and the data is split across the devices. DataParallel splits your data automatically and dispatches jobs to the model replicas on each GPU. Once every replica has finished its work, DataParallel gathers and merges the results before returning them to you.
There are more examples here: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html.
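DistributedDataParallel is generally the recommended option of the two for multi-GPU training. A minimal sketch of the wiring, assuming the script is launched with torchrun (or torch.distributed.launch on older PyTorch versions) and using the same hypothetical Model as above:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = Model(input_size, output_size).to(local_rank)  # same hypothetical Model as above
model = DDP(model, device_ids=[local_rank])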
If the model does not fit into the memory of a single GPU, you should use a model-parallel approach instead.
Starting from your existing model, you decide which layers live on which GPU using .to('cuda:0'), .to('cuda:1'), and so on.
import torch.nn as nn
from torchvision.models.resnet import ResNet, Bottleneck

num_classes = 1000  # set this to your own number of classes

class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(
            Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)

        # first half of the network lives on the first GPU
        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,
            self.layer1,
            self.layer2
        ).to('cuda:0')

        # second half lives on the second GPU
        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')

        self.fc.to('cuda:1')

    def forward(self, x):
        # activations are moved from cuda:0 to cuda:1 between the two halves
        x = self.seq2(self.seq1(x).to('cuda:1'))
        return self.fc(x.view(x.size(0), -1))
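For example, a forward/backward pass through this split model could look roughly like the following (a sketch assuming two GPUs; the batch and label shapes are placeholders):

import torch
import torch.nn.functional as F

model = ModelParallelResNet50()

# hypothetical batch: 4 RGB images of 224x224; inputs start on cuda:0,
# labels go to cuda:1 because that is where the final fc layer lives
inputs = torch.randn(4, 3, 224, 224, device='cuda:0')
labels = torch.randint(0, num_classes, (4,), device='cuda:1')

outputs = model(inputs)                  # crosses cuda:0 -> cuda:1 inside forward()
loss = F.cross_entropy(outputs, labels)
loss.backward()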
Since you may lose performance this way, you may need a pipelining approach, i.e. further splitting each input batch into chunks that run on the different devices in parallel.
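A sketch of that pipelining, adapted from the pipelined ResNet example in PyTorch's model-parallel tutorial (split_size, the micro-batch chunk size, is a tunable assumption):

import torch

class PipelineParallelResNet50(ModelParallelResNet50):
    def __init__(self, split_size=8, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            # s_prev runs through seq2 on cuda:1 while the next chunk
            # runs through seq1 on cuda:0, so both GPUs stay busy
            s_prev = self.seq2(s_prev)
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))
            s_prev = self.seq1(s_next).to('cuda:1')

        # drain the last chunk
        s_prev = self.seq2(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))

        return torch.cat(ret)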