“他山之石,可以攻玉”,站在巨人的肩膀才能看得更高,走得更远。在科研的道路上,更需借助东风才能更快前行。为此,我们特别搜集整理了一些实用的代码链接,数据集,软件,编程技巧等,开辟“他山之石”专栏,助你乘风破浪,一路奋勇向前,敬请关注。
作者:知乎—Takanashi
地址:https://zhuanlan.zhihu.com/p/353985363
GitHub:https://github.com/miracleyoo
写在前面
Pytorch-Lightning这个库我“发现”过两次。第一次发现时,感觉它很重很难学,而且似乎自己也用不上。但是后面随着做的项目开始出现了一些稍微高阶的要求,我发现我总是不断地在相似工程代码上花费大量时间,Debug也是这些代码花的时间最多,而且渐渐产生了一个矛盾之处:如果想要更多更好的功能,如TensorBoard支持,Early Stop,LR Scheduler,分布式训练,快速测试等,代码就无可避免地变得越来越长,看起来也越来越乱,同时核心的训练逻辑也渐渐被这些工程代码盖过。那么有没有更好的解决方案,甚至能一键解决所有这些问题呢?
于是我第二次发现了Pytorch-Lightning。
真香。
但是问题还是来了。这个框架并没有因为香而变得更加易学。官网的教程很丰富,可以看出来开发者们在努力做了。但是很多相连的知识点都被分布在了不同的版块里,还有一些核心的理解要点并没有被强调出来,而是小字带过,这让我想做一个普惠的教程,包含所有我在学习过程中认为重要的概念,好用的参数,一些注意点、坑点,大量的示例代码段和一些核心问题的集中讲解。
最后,第三部分提供了一个我总结出来的易用于大型项目、容易迁移、易于复用的模板,有兴趣的可以去GitHub[1]试用。
核心
Pytorch-Lighting 的一大特点是把模型和系统分开来看。模型是像Resnet18, RNN之类的纯模型, 而系统定义了一组模型如何相互交互,如GAN(生成器网络与判别器网络)、Seq2Seq(Encoder与Decoder网络)和Bert。同时,有时候问题只涉及一个模型,那么这个系统则可以是一个通用的系统,用于描述模型如何使用,并可以被复用到很多其他项目。
Pytorch-Lighting 的核心设计思想是“自给自足”。每个网络也同时包含了如何训练、如何测试、优化器定义等内容。
推荐使用方法
这一部分放在最前面,因为全文内容太长,如果放后面容易忽略掉这部分精华。
Pytorch-Lightning 是一个很好的库,或者说是pytorch的抽象和包装。它的好处是可复用性强,易维护,逻辑清晰等。缺点也很明显,这个包需要学习和理解的内容还是挺多的,或者换句话说,很重。如果直接按照官方的模板写代码,小型project还好,如果是大型项目,有复数个需要调试验证的模型和数据集,那就不太好办,甚至更加麻烦了。经过几天的摸索和调试,我总结出了下面这样一套好用的模板,也可以说是对Pytorch-Lightning的进一步抽象。
欢迎大家尝试这一套代码风格,如果用习惯的话还是相当方便复用的,也不容易半道退坑。
root-
|-data
|-__init__.py
|-data_interface.py
|-xxxdataset1.py
|-xxxdataset2.py
|-...
|-model
|-__init__.py
|-model_interface.py
|-xxxmodel1.py
|-xxxmodel2.py
|-...
|-main.py
如果对每个模型直接上plmodule,对于已有项目、别人的代码等的转换将相当耗时。另外,这样的话,你需要给每个模型都加上一些相似的代码,如training_step,validation_step。显然,这并不是我们想要的,如果真的这样做,不但不易于维护,反而可能会更加杂乱。同理,如果把每个数据集类都直接转换成pl的DataModule,也会面临相似的问题。基于这样的考量,我建议使用上述架构:
完事。
完全版模板可以在GitHub找到。
Lightning Module
主页面[2]
outs = []
for batch in data:
out = training_step(batch)
outs.append(out)
training_epoch_end(outs)
等价Lightning代码:
def training_step(self, batch, batch_idx):
prediction = ...
return prediction
def training_epoch_end(self, training_step_outputs):
for prediction in predictions:
# do something with these
我们需要做的,就是像填空一样,填这些函数。
API页面[3]
返回值:Any of.
返回值无论如何也需要有一个loss量。如果是字典,要有这个key。没loss这个batch就被跳过了。例:
def training_step(self, batch, batch_idx):
x, y, z = batch
out = self.encoder(x)
loss = self.loss(out, x)
return loss
# Multiple optimizers (e.g.: GANs)
def training_step(self, batch, batch_idx, optimizer_idx):
if optimizer_idx == 0:
# do training_step with encoder
if optimizer_idx == 1:
# do training_step with decoder
# Truncated back-propagation through time
def training_step(self, batch, batch_idx, hiddens):
# hiddens are the hidden states from the previous truncated backprop step
...
out, hiddens = self.lstm(data, hiddens)
...
return {'loss': loss, 'hiddens': hiddens}
configure_optimizers: 优化器定义,返回一个优化器,或数个优化器,或两个List(优化器,Scheduler)。如:
# most cases
def configure_optimizers(self):
opt = Adam(self.parameters(), lr=1e-3)
return opt
# multiple optimizer case (e.g.: GAN)
def configure_optimizers(self):
generator_opt = Adam(self.model_gen.parameters(), lr=0.01)
disriminator_opt = Adam(self.model_disc.parameters(), lr=0.02)
return generator_opt, disriminator_opt
# example with learning rate schedulers
def configure_optimizers(self):
generator_opt = Adam(self.model_gen.parameters(), lr=0.01)
disriminator_opt = Adam(self.model_disc.parameters(), lr=0.02)
discriminator_sched = CosineAnnealing(discriminator_opt, T_max=10)
return [generator_opt, disriminator_opt], [discriminator_sched]
# example with step-based learning rate schedulers
def configure_optimizers(self):
gen_opt = Adam(self.model_gen.parameters(), lr=0.01)
dis_opt = Adam(self.model_disc.parameters(), lr=0.02)
gen_sched = {'scheduler': ExponentialLR(gen_opt, 0.99),
'interval': 'step'} # called after each training step
dis_sched = CosineAnnealing(discriminator_opt, T_max=10) # called every epoch
return [gen_opt, dis_opt], [gen_sched, dis_sched]
# example with optimizer frequencies
# see training procedure in `Improved Training of Wasserstein GANs`, Algorithm 1
# https://arxiv.org/abs/1704.00028
def configure_optimizers(self):
gen_opt = Adam(self.model_gen.parameters(), lr=0.01)
dis_opt = Adam(self.model_disc.parameters(), lr=0.02)
n_critic = 5
return (
{'optimizer': dis_opt, 'frequency': n_critic},
{'optimizer': gen_opt, 'frequency': 1}
)
参数 name (str) – key name value (Any) – value name prog_bar (bool) – if True logs to the progress bar logger (bool) – if True logs to the logger on_step (Optional[bool]) – if True logs at this step. None auto-logs at the training_step but not validation/test_step on_epoch (Optional[bool]) – if True logs epoch accumulated metrics. None auto-logs at the val/test step but not training_step reduce_fx (Callable) – reduction function over step values for end of epoch. Torch.mean by default tbptt_reduce_fx (Callable) – function to reduce on truncated back prop tbptt_pad_token (int) – token to use for padding enable_graph (bool) – if True, will not auto detach the graph sync_dist (bool) – if True, reduces the metric across GPUs/TPUs sync_dist_op (Union[Any, str]) – the op to sync across GPUs/TPUs sync_dist_group (Optional[Any]) – the ddp group
class LitModel(pl.LightningModule):
def __init__(...):
def forward(...):
def training_step(...)
def training_step_end(...)
def training_epoch_end(...)
def validation_step(...)
def validation_step_end(...)
def validation_epoch_end(...)
def test_step(...)
def test_step_end(...)
def test_epoch_end(...)
def configure_optimizers(...)
def any_extra_hook(...)
Trainer
model = MyLightningModule()
trainer = Trainer()
trainer.fit(model, train_dataloader, val_dataloader)
如果连validation_step都没有,那val_dataloader也就算了。
Hooks页面[4]
def fit(...):
on_fit_start()
if global_rank == 0:
# prepare data is called on GLOBAL_ZERO only
prepare_data()
for gpu/tpu in gpu/tpus:
train_on_device(model.copy())
on_fit_end()
def train_on_device(model):
# setup is called PER DEVICE
setup()
configure_optimizers()
on_pretrain_routine_start()
for epoch in epochs:
train_loop()
teardown()
def train_loop():
on_train_epoch_start()
train_outs = []
for train_batch in train_dataloader():
on_train_batch_start()
# ----- train_step methods -------
out = training_step(batch)
train_outs.append(out)
loss = out.loss
backward()
on_after_backward()
optimizer_step()
on_before_zero_grad()
optimizer_zero_grad()
on_train_batch_end(out)
if should_check_val:
val_loop()
# end training epoch
logs = training_epoch_end(outs)
def val_loop():
model.eval()
torch.set_grad_enabled(False)
on_validation_epoch_start()
val_outs = []
for val_batch in val_dataloader():
on_validation_batch_start()
# -------- val step methods -------
out = validation_step(val_batch)
val_outs.append(out)
on_validation_batch_end(out)
validation_epoch_end(val_outs)
on_validation_epoch_end()
# set up for train
model.train()
torch.set_grad_enabled(True)
参数介绍(附视频)[5]
类定义与默认参数[6]
# default used by the Trainer (no scaling of batch size)
trainer = Trainer(auto_scale_batch_size=None)
# run batch size scaling, result overrides hparams.batch_size
trainer = Trainer(auto_scale_batch_size='binsearch')
# call tune to find the batch size
trainer.tune(model)
# run learning rate finder, results override hparams.learning_rate
trainer = Trainer(auto_lr_find=True)
# run learning rate finder, results override hparams.my_lr_arg
trainer = Trainer(auto_lr_find='my_lr_arg')
# call tune to find the lr
trainer.tune(model)
# default used by the Trainer
trainer = Trainer(precision=32)
# 16-bit precision
trainer = Trainer(precision=16, gpus=1)
use (float) to check within a training epoch:此时这个值为一个epoch的百分比。每百分之多少测试一次。 use (int) to check every n steps (batches):每多少个batch测试一次。
# default used by the Trainer
trainer = Trainer(val_check_interval=1.0)
# check validation set 4 times during a training epoch
trainer = Trainer(val_check_interval=0.25)
# check validation set every 1000 training batches
# use this when using iterableDataset and your dataset has no length
# (ie: production cases with streaming data)
trainer = Trainer(val_check_interval=1000)
# default used by the Trainer (ie: train on CPU)
trainer = Trainer(gpus=None)
# equivalent
trainer = Trainer(gpus=0)
# int: train on 2 gpus
trainer = Trainer(gpus=2)
# list: train on GPUs 1, 4 (by bus ordering)
trainer = Trainer(gpus=[1, 4])
trainer = Trainer(gpus='1, 4') # equivalent
# -1: train on all gpus
trainer = Trainer(gpus=-1)
trainer = Trainer(gpus='-1') # equivalent
# combine with num_nodes to train on multiple GPUs across nodes
# uses 8 gpus in total
trainer = Trainer(gpus=2, num_nodes=4)
# train only on GPUs 1 and 4 across nodes
trainer = Trainer(gpus=[1, 4], num_nodes=4)
# default used by the Trainer
trainer = Trainer(limit_train_batches=1.0)
# run through only 25% of the training set each epoch
trainer = Trainer(limit_train_batches=0.25)
# run through only 10 batches of the training set each epoch
trainer = Trainer(limit_train_batches=10)
Setting this argument will disable tuner, checkpoint callbacks, early stopping callbacks, loggers and logger callbacks like LearningRateLogger and runs for only 1 epoch
# default used by the Trainer
trainer = Trainer(fast_dev_run=False)
# runs 1 train, val, test batch and program ends
trainer = Trainer(fast_dev_run=True)
# runs 7 train, val, test batches and program ends
trainer = Trainer(fast_dev_run=7)
Trainer.fit(model, train_dataloader=None, val_dataloaders=None, datamodule=None):输入第一个量一定是model,然后可以跟一个LigntningDataModule或一个普通的Train DataLoader。如果定义了Val step,也要有Val DataLoader。
参数 datamodule (Optional[LightningDataModule]) – A instance of LightningDataModule. model (LightningModule) – Model to fit. train_dataloader (Optional[DataLoader]) – A Pytorch DataLoader with training samples. If the model has a predefined train_dataloader method this will be skipped. val_dataloaders (Union[DataLoader, List[DataLoader], None]) – Either a single Pytorch Dataloader or a list of them, specifying validation samples. If the model has a predefined val_dataloaders method this will be skipped
1. 手动添加命令行参数:
from argparse import ArgumentParser
def main(hparams):
model = LightningModule()
trainer = Trainer(gpus=hparams.gpus)
trainer.fit(model)
if __name__ == '__main__':
parser = ArgumentParser()
parser.add_argument('--gpus', default=None)
args = parser.parse_args()
main(args)
2. 自动添加所有Trainer会用到的命令行参数:
from argparse import ArgumentParser
def main(args):
model = LightningModule()
trainer = Trainer.from_argparse_args(args)
trainer.fit(model)
if __name__ == '__main__':
parser = ArgumentParser()
parser = Trainer.add_argparse_args(
# group the Trainer arguments together
parser.add_argument_group(title="pl.Trainer args")
)
args = parser.parse_args()
main(args)
3. 混合式,既使用Trainer相关参数,又使用一些自定义参数,如各种模型超参:
from argparse import ArgumentParser
import pytorch_lightning as pl
from pytorch_lightning import LightningModule, Trainer
def main(args):
model = LightningModule()
trainer = Trainer.from_argparse_args(args)
trainer.fit(model)
if __name__ == '__main__':
parser = ArgumentParser()
parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--hidden_dim', type=int, default=128)
parser = Trainer.add_argparse_args(
# group the Trainer arguments together
parser.add_argument_group(title="pl.Trainer args")
)
args = parser.parse_args()
main(args)
Trainer.``__init__(logger=True, checkpoint_callback=True, callbacks=None, default_root_dir=None, gradient_clip_val=0, process_position=0, num_nodes=1, num_processes=1, gpus=None, auto_select_gpus=False, tpu_cores=None, log_gpu_memory=None, progress_bar_refresh_rate=None, overfit_batches=0.0, track_grad_norm=- 1, check_val_every_n_epoch=1, fast_dev_run=False, accumulate_grad_batches=1, max_epochs=None, min_epochs=None, max_steps=None, min_steps=None, limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0, limit_predict_batches=1.0, val_check_interval=1.0, flush_logs_every_n_steps=100, log_every_n_steps=50, accelerator=None, sync_batchnorm=False, precision=32, weights_summary='top', weights_save_path=None, num_sanity_val_steps=2, truncated_bptt_steps=None, resume_from_checkpoint=None, profiler=None, benchmark=False, deterministic=False, reload_dataloaders_every_epoch=False, auto_lr_find=False, replace_sampler_ddp=True, terminate_on_nan=False, auto_scale_batch_size=False, prepare_data_per_node=True, plugins=None, amp_backend='native', amp_level='O2', distributed_backend=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle', stochastic_weight_avg=False)
To add a training loop use the training_step method
class LitClassifier(pl.LightningModule):
def __init__(self, model):
super().__init__()
self.model = model
def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
return loss
Under the hood, Lightning does the following (pseudocode):
# put model in train mode
model.train()
torch.set_grad_enabled(True)
losses = []
for batch in train_dataloader:
# forward
loss = training_step(batch)
losses.append(loss.detach())
# backward
loss.backward()
# apply and clear grads
optimizer.step()
optimizer.zero_grad()
If you want to calculate epoch-level metrics and log them, use the .log method
def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
# logs metrics for each training_step,
# and the average across the epoch, to the progress bar and logger
self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
return loss
The .log object automatically reduces the requested metrics across the full epoch. Here’s the pseudocode of what it does under the hood:
outs = []
for batch in train_dataloader:
# forward
out = training_step(val_batch)
# backward
loss.backward()
# apply and clear grads
optimizer.step()
optimizer.zero_grad()
epoch_metric = torch.mean(torch.stack([x['train_loss'] for x in outs]))
If you need to do something with all the outputs of each training_step, override training_epoch_end yourself.
def training_step(self, batch, batch_idx):
x, y = batch
y_hat = self.model(x)
loss = F.cross_entropy(y_hat, y)
preds = ...
return {'loss': loss, 'other_stuff': preds}
def training_epoch_end(self, training_step_outputs):
for pred in training_step_outputs:
# do something
The matching pseudocode is:
outs = []
for batch in train_dataloader:
# forward
out = training_step(val_batch)
# backward
loss.backward()
# apply and clear grads
optimizer.step()
optimizer.zero_grad()
training_epoch_end(outs)
DataModule
主页面[7]
class MNISTDataModule(pl.LightningDataModule):
def __init__(self, data_dir: str = './', batch_size: int = 64, num_workers: int = 8):
super().__init__()
self.data_dir = data_dir
self.batch_size = batch_size
self.num_workers = num_workers
self.transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
# self.dims is returned when you call dm.size()
# Setting default dims here because we know them.
# Could optionally be assigned dynamically in dm.setup()
self.dims = (1, 28, 28)
self.num_classes = 10
def prepare_data(self):
# download
MNIST(self.data_dir, train=True, download=True)
MNIST(self.data_dir, train=False, download=True)
def setup(self, stage=None):
# Assign train/val datasets for use in dataloaders
if stage == 'fit' or stage is None:
mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
# Assign test dataset for use in dataloader(s)
if stage == 'test' or stage is None:
self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)
def train_dataloader(self):
return DataLoader(self.mnist_train, batch_size=self.batch_size, num_workers=self.num_workers)
def val_dataloader(self):
return DataLoader(self.mnist_val, batch_size=self.batch_size, num_workers=self.num_workers)
def test_dataloader(self):
return DataLoader(self.mnist_test, batch_size=self.batch_size, num_workers=self.num_workers)
要点
Saving and Loading
主页面[8]
from pytorch_lightning.callbacks import ModelCheckpoint
# saves a file like: my/path/sample-mnist-epoch=02-val_loss=0.32.ckpt
checkpoint_callback = ModelCheckpoint(
monitor='val_loss',
filename='sample-mnist-{epoch:02d}-{val_loss:.2f}',
save_top_k=3,
mode='min',
save_last=True
)
trainer = pl.Trainer(gpus=1, max_epochs=3, progress_bar_refresh_rate=20, callbacks=[checkpoint_callback])
model = MyLightingModule.load_from_checkpoint(PATH)
print(model.learning_rate)
# prints the learning_rate you used in this checkpoint
model.eval()
y_hat = model(x)
load模型时替换一些超参数:
class LitModel(LightningModule):
def __init__(self, in_dim, out_dim):
super().__init__()
self.save_hyperparameters()
self.l1 = nn.Linear(self.hparams.in_dim, self.hparams.out_dim)
# if you train and save the model like this it will use these values when loading
# the weights. But you can overwrite this
LitModel(in_dim=32, out_dim=10)
# uses in_dim=32, out_dim=10
model = LitModel.load_from_checkpoint(PATH)
# uses in_dim=128, out_dim=10
model = LitModel.load_from_checkpoint(PATH, in_dim=128, out_dim=10)
model = LitModel()
trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')
# automatically restores model, epoch, step, LR schedulers, apex, etc...
trainer.fit(model)
Callbacks
内建Callbacks[9]
参数: monitor (str) – quantity to be monitored. Default: 'early_stop_on'. min_delta (float) – minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta, will count as no improvement. Default: 0.0. patience (int) – number of validation epochs with no improvement after which training will be stopped. Default: 3. verbose (bool) – verbosity mode. Default: False. mode (str) – one of 'min', 'max'. In 'min' mode, training will stop when the quantity monitored has stopped decreasing and in 'max' mode it will stop when the quantity monitored has stopped increasing. strict (bool) – whether to crash the training if monitor is not found in the validation metrics. Default: True.
示例:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping
early_stopping = EarlyStopping('val_loss')
trainer = Trainer(callbacks=[early_stopping])
from pl_bolts.callbacks import PrintTableMetricsCallback
callback = PrintTableMetricsCallback()
trainer = pl.Trainer(callbacks=[callback])
trainer.fit(...)
# ------------------------------
# at the end of every epoch it will print
# ------------------------------
# loss│train_loss│val_loss│epoch
# ──────────────────────────────
# 2.2541470527648926│2.2541470527648926│2.2158432006835938│0
Logging
from pytorch_lightning import loggers as pl_loggers
# Default
tb_logger = pl_loggers.TensorBoardLogger(
save_dir=os.getcwd(),
version=None,
name='lightning_logs'
)
trainer = Trainer(logger=tb_logger)
# Or use the same format as others
tb_logger = pl_loggers.TensorBoardLogger('logs/')
# One Logger
comet_logger = pl_loggers.CometLogger(save_dir='logs/')
trainer = Trainer(logger=comet_logger)
# Save code snapshot
logger = pl_loggers.TestTubeLogger('logs/', create_git_tag=True)
# Multiple Logger
tb_logger = pl_loggers.TensorBoardLogger('logs/')
comet_logger = pl_loggers.CometLogger(save_dir='logs/')
trainer = Trainer(logger=[tb_logger, comet_logger])
默认情况下,每50个batch log一次,可以通过调整参数
def training_step(...):
...
# the logger you used (in this case tensorboard)
tensorboard = self.logger.experiment
tensorboard.add_image()
tensorboard.add_histogram(...)
tensorboard.add_figure(...)
使用log:如果是TensorBoard,那么:tensorboard --logdir ./lightning_logs。在Jupyter Notebook中,可以使用:
# Start tensorboard.
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
在行内打开TensorBoard。
tensorboard --logdir lightning_logs --bind_all -> http://SERVER-NAME:6006/
Transfer Learning
主页面[10]
import torchvision.models as models
class ImagenetTransferLearning(LightningModule):
def __init__(self):
super().__init__()
# init a pretrained resnet
backbone = models.resnet50(pretrained=True)
num_filters = backbone.fc.in_features
layers = list(backbone.children())[:-1]
self.feature_extractor = nn.Sequential(*layers)
# use the pretrained model to classify cifar-10 (10 image classes)
num_target_classes = 10
self.classifier = nn.Linear(num_filters, num_target_classes)
def forward(self, x):
self.feature_extractor.eval()
with torch.no_grad():
representations = self.feature_extractor(x).flatten(1)
x = self.classifier(representations)
...
关于device操作
LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.
# bad
t = torch.rand(2, 2).cuda()
# good (self is LightningModule)
t = torch.rand(2, 2, device=self.device)
For tensors that need to be model attributes, it is best practice to register them as buffers in the modules’ __init__ method:
# bad
self.t = torch.rand(2, 2, device=self.device)
# good
self.register_buffer("t", torch.rand(2, 2))
前面两段是教程中的文本。然而实际上有一个暗坑:
如果你使用了一个中继的pl.LightningModule,而这个module里面实例化了某个普通的nn.Module,而这个模型中又需要内部生成一些tensor,比如图片每个通道的mean,std之类,那么如果你从pl.LightningModule中pass一个self.device,实际上在一开始这个self.device永远是cpu。所以如果你在调用的nn.Module的__init__()中初始化,使用to(device)或干脆什么都不用,结果就是它永远都在cpu上。
但是,经过实验,虽然pl.LightningModule在__init__()阶段self.device还是cpu,当进入了training_step()之后,就迅速变为了cuda。所以,对于子模块,最佳方案是,使用一个forward中传入的量,如x,作为一个reference变量,用type_as函数将在模型中生成的tensor都放到和这个参考变量相同的device上即可。
class RDNFuse(nn.Module):
...
def init_norm_func(self, ref):
self.mean = torch.tensor(np.array(self.mean_sen), dtype=torch.float32).type_as(ref)
def forward(self, x):
if not hasattr(self, 'mean'):
self.init_norm_func(x)
Points
# Single optimizer
for epoch in epochs:
for batch in data:
loss = model.training_step(batch, batch_idx, ...)
loss.backward()
optimizer.step()
optimizer.zero_grad()
for scheduler in schedulers:
scheduler.step()
# Multiple optimizers
for epoch in epochs:
for batch in data:
for opt in optimizers:
disable_grads_for_other_optimizers()
train_step(opt)
opt.step()
for scheduler in schedulers:
scheduler.step()
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])
Parameters: dataset (Dataset)[12] – Dataset to be split lengths (sequence) – lengths of splits to be produced generator (Generator)[13] – Generator used for the random permutation.
链接:
[1] https://github.com/miracleyoo/pytorch-lightning-template
[2] https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html
[3] https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#lightningmodule-api
[4] https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#hooks
[5] https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#trainer-flags
[6] https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#trainer-class-api
[7] https://pytorch-lightning.readthedocs.io/en/latest/extensions/datamodules.html
[8] https://pytorch-lightning.readthedocs.io/en/latest/common/weights_loading.html
[9] https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html#built-in-callbacks
[10] https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction_guide.html#transfer-learning
[11] https://pytorch-lightning.readthedocs.io/en/latest/common/optimizers.html?highlight=scheduler#
[12] https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset
[13] https://pytorch.org/docs/stable/generated/torch.Generator.html#torch.Generator