
Introduction to SlowFast


SlowFast is a two-stream model framework for human action recognition, published by Kaiming He and colleagues at Facebook at ICCV 2019.

The figure above shows the main structure of the SlowFast model. The top part is the spatial pathway, a low-frame-rate branch intended mainly to capture RGB appearance features; it uses few frames but many channels. The bottom part is the temporal pathway, a high-frame-rate branch intended to capture motion features; since motion changes quickly, it uses many frames but few channels.

To the right of the spatial pathway, C denotes the number of channels and T the temporal dimension; its channel count is large and its temporal dimension small. To the right of the temporal pathway, βC denotes the channels and αT the temporal dimension; its channel count is small and its temporal dimension large. Here α = 8, β = 1/8, and τ = 16 (the sampling stride).

Both pathways take the same raw input of size 3×64×224×224 (3 channels, 64 frames, 224×224 spatial size); the difference lies in their temporal strides.

In the table above, the raw clip is 64×224². The slow pathway uses a temporal stride of 16, so it samples 64/16 = 4 frames; the fast pathway's stride is 2, so it samples 64/2 = 32 frames. The sampled frames then enter the convolutional stages. In the slow pathway, the temporal kernel size is 1 from conv1 through res3, i.e. there is no convolution along the temporal dimension; the authors found that extracting temporal features in the shallow layers is harmful. Only in res4 and res5, where the temporal kernel size is 3, are temporal features extracted. In the fast pathway, conv1 already has a temporal kernel of size 5, and res2 through res5 keep extracting temporal features throughout. At the end, the slow pathway's feature map is 4×7×7 and the fast pathway's is 32×7×7: the frame counts never change, so neither pathway downsamples along the temporal dimension.
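
To make the shape bookkeeping concrete, here is a small sketch of my own (the variable names are illustrative) that derives the frame and channel counts of both pathways from a 64-frame clip, using the paper's values α=8, β=1/8, τ=16:

raw_frames = 64                              # frames in the raw clip
tau = 16                                     # temporal stride of the slow pathway
alpha = 8                                    # the fast pathway samples alpha times more frames
beta = 1 / 8                                 # fast/slow channel ratio

slow_frames = raw_frames // tau              # 64 / 16 = 4
fast_frames = raw_frames // (tau // alpha)   # stride 16 / 8 = 2, so 64 / 2 = 32
slow_channels = 64                           # base channel count C of the slow pathway
fast_channels = int(beta * slow_channels)    # beta * C = 8

print(slow_frames, fast_frames, slow_channels, fast_channels)  # 4 32 64 8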

  • Three fusion strategies

In the network structure diagram, arrows run from the temporal pathway into the spatial pathway. There are three ways to perform this fusion:

  1. Time-to-channel fusion: reshape the temporal pathway's feature map (temporal dimension αT, spatial dimension S², channels βC) into (T, S², αβC), which matches the spatial pathway's feature map shape, and then concatenate.
  2. Time-strided sampling: the temporal pathway's feature map has temporal dimension αT, i.e. α frames for every slow frame; sampling one frame out of every α aligns it with the spatial pathway, turning (αT, S², βC) into (T, S², βC), which can then be concatenated.
  3. Time-strided 3D convolution: instead of strided sampling, apply a strided convolution: a 5×1×1 3D kernel with 2βC output channels and stride α.

The overall idea is simply to make the temporal pathway's feature map match the spatial pathway's temporal and spatial sizes, so the two can be concatenated along the channel dimension, as the sketch below illustrates.
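
As an illustration, all three fusions can be written as simple tensor operations. The following is my own minimal PyTorch sketch (not the paper's code), assuming α=8, β=1/8, C=64, T=4 and spatial size S=56; each variant reduces the fast pathway's αT frames to the slow pathway's T frames so the result can be concatenated along channels:

import torch
import torch.nn as nn

alpha, T, S = 8, 4, 56
bC = 8                                      # beta * C = 64 / 8 channels
fast = torch.rand(1, bC, alpha * T, S, S)   # fast pathway feature map {alpha*T, S^2, beta*C}

# 1. Time-to-channel: pack each group of alpha consecutive frames into the channel dimension
t2c = fast.reshape(1, bC, T, alpha, S, S).permute(0, 1, 3, 2, 4, 5).reshape(1, alpha * bC, T, S, S)

# 2. Time-strided sampling: keep one frame out of every alpha
sampled = fast[:, :, ::alpha, :, :]

# 3. Time-strided convolution: 5x1x1 kernel, stride alpha, 2*beta*C output channels
fuse = nn.Conv3d(bC, 2 * bC, kernel_size=(5, 1, 1), stride=(alpha, 1, 1), padding=(2, 0, 0), bias=False)
convolved = fuse(fast)

print(t2c.shape)        # torch.Size([1, 64, 4, 56, 56])  -> {T, S^2, alpha*beta*C}
print(sampled.shape)    # torch.Size([1, 8, 4, 56, 56])   -> {T, S^2, beta*C}
print(convolved.shape)  # torch.Size([1, 16, 4, 56, 56])  -> {T, S^2, 2*beta*C}

The third variant is the one the network implemented below actually uses for its lateral connections.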

Data Processing

We will use the UCF101 action recognition dataset, released in 2012: 101 classes and 13,320 short videos.

Download: https://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar

Official train/test splits: https://www.crcv.ucf.edu/wp-content/uploads/2019/03/UCF101TrainTestSplits-RecognitionTask.zip

The dataset covers 101 human actions captured in natural settings, spanning five broad groups: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports. Each class is performed by 25 people, with 4-7 clips per person; the videos are 320×240.

The specific classes are: Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, Band Marching, Baseball Pitch, Basketball, Basketball Dunk, Bench Press, Biking, Billiards, Blow Dry Hair, Blowing Candles, Body Weight Squats, Bowling, Boxing Punching Bag, Boxing Speed Bag, Breast Stroke, Brushing Teeth, Clean and Jerk, Cliff Diving, Cricket Bowling, Cricket Shot, Cutting in Kitchen, Diving, Drumming, Fencing, Field Hockey Penalty, Floor Gymnastics, Frisbee Catch, Front Crawl, Golf Swing, Haircut, Hammer Throw, Hammering, Handstand Pushups, Handstand Walking, Head Massage, High Jump, Horse Race, Horse Riding, Hula Hoop, Ice Dancing, Javelin Throw, Juggling Balls, Jump Rope, Jumping Jack, Kayaking, Knitting, Long Jump, Lunges, Military Parade, Mixing, Mopping Floor, Nunchucks, Parallel Bars, Pizza Tossing, Playing Cello, Playing Daf, Playing Dhol, Playing Flute, Playing Guitar, Playing Piano, Playing Sitar, Playing Tabla, Playing Violin, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rafting, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spin, Shaving Beard, Shotput, Skateboarding, Skiing, Skijet, Sky Diving, Soccer Juggling, Soccer Penalty, Still Rings, Sumo Wrestling, Surfing, Swing, Table Tennis Shot, Tai Chi, Tennis Swing, Throw Discus, Trampoline Jumping, Typing, Uneven Bars, Volleyball Spiking, Walking with Dog, Wall Pushups, Writing on Board, and Yo-Yo.

Data processing involves three parts:

  1. Extracting frames from the videos to obtain RGB images for training
  2. Splitting the data into training and validation sets
  3. Preprocessing (data augmentation) and loading

First, we split the downloaded UCF101 dataset into training and validation sets according to the official split 1:

import shutil
import os

if __name__ == '__main__':

    train_txtlist = ['trainlist01.txt']
    valid_txtlist = ['testlist01.txt']
    dataset_dir = '/Users/admin/Downloads/UCF-101/'
    list_dir = '/Users/admin/Downloads/ucfTrainTestlist/'
    copy_train_path = '/Users/admin/Downloads/UCF-101/train/'
    copy_valid_path = '/Users/admin/Downloads/UCF-101/validation/'

    # move every video listed in the official train split into train/
    for txtfile in train_txtlist:
        with open(list_dir + txtfile, 'r') as f:
            for line in f:
                o_filename = dataset_dir + line.strip().split(' ')[0]
                n_filename = copy_train_path + line.strip().split(' ')[0]
                if not os.path.exists('/'.join(n_filename.split('/')[:-1])):
                    os.makedirs('/'.join(n_filename.split('/')[:-1]))
                shutil.move(o_filename, n_filename)

    # move every video listed in the official test split into validation/
    for txtfile in valid_txtlist:
        with open(list_dir + txtfile, 'r') as f:
            for line in f:
                o_filename = dataset_dir + line.strip().split(' ')[0]
                n_filename = copy_valid_path + line.strip().split(' ')[0]
                if not os.path.exists('/'.join(n_filename.split('/')[:-1])):
                    os.makedirs('/'.join(n_filename.split('/')[:-1]))
                shutil.move(o_filename, n_filename)

    # remove the now-empty class directories left behind after moving
    for folder in os.listdir(dataset_dir):
        path = os.path.join(dataset_dir, folder)
        if os.path.isdir(path) and not os.listdir(path):
            os.rmdir(path)

After running this, every video sits in a class subfolder under train/ or validation/.
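
As a quick sanity check (my addition, not in the original post), you can count the moved files; for the official split 1 there should be 9,537 training and 3,783 validation clips:

import os

def count_videos(root):
    # count the .avi files in all class subdirectories under root
    return sum(len([f for f in files if f.endswith('.avi')])
               for _, _, files in os.walk(root))

print(count_videos('/Users/admin/Downloads/UCF-101/train'))       # expected: 9537
print(count_videos('/Users/admin/Downloads/UCF-101/validation'))  # expected: 3783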

  • Building the PyTorch dataset
import os
from pathlib import Path
import cv2
import numpy as np
from torch.utils.data import DataLoader, Dataset


class VideoDataset(Dataset):

    def __init__(self, directory, mode='train', clip_len=16, frame_sample_rate=1):
        folder = Path(directory)/mode  # directory of the specified split
        # clip length (how many frames to sample per video)
        self.clip_len = clip_len
        # resize range for the short side
        self.short_side = [128, 160]
        # crop size
        self.crop_size = 112
        # frame sampling rate
        self.frame_sample_rate = frame_sample_rate
        self.mode = mode
        self.fnames, labels = [], []
        # walk all class folders, using each folder name as the label
        for label in sorted(os.listdir(folder)):
            if label == ".DS_Store":
                continue
            for fname in os.listdir(os.path.join(folder, label)):
                # collect every video file name
                self.fnames.append(os.path.join(folder, label, fname))
                # and its label
                labels.append(label)
        # map each string label to an integer index
        self.label2index = {label: index for index, label in enumerate(sorted(set(labels)))}
        # convert string labels to int labels
        self.label_array = np.array([self.label2index[label] for label in labels], dtype=int)
        # write all class labels to a text file
        label_file = str(len(self.label2index)) + 'class_labels.txt'
        with open(label_file, 'w') as f:
            for id, label in enumerate(sorted(self.label2index)):
                f.writelines(str(id + 1) + ' ' + label + '\n')

    def __getitem__(self, index):
        # load the sampled frames of a video
        buffer = self.loadvideo(self.fnames[index])
        # if there are fewer sampled frames than clip_len + 2, pick another video
        while buffer.shape[0] < self.clip_len + 2:
            # pick a random video index
            index = np.random.randint(self.__len__())
            # load that video's sampled frames
            buffer = self.loadvideo(self.fnames[index])
        # preprocessing
        if self.mode == 'train' or self.mode == 'training':
            # in training mode, augment with random horizontal flips
            buffer = self.randomflip(buffer)
        # crop the sampled frames
        buffer = self.crop(buffer, self.clip_len, self.crop_size)
        # normalize the sampled frames
        buffer = self.normalize(buffer)
        # rearrange into PyTorch tensor layout
        buffer = self.to_tensor(buffer)

        return buffer, self.label_array[index]

    def to_tensor(self, buffer):
        # convert from [D, H, W, C] format to [C, D, H, W] (what PyTorch uses)
        # D = Depth (in this case, time), H = Height, W = Width, C = Channels
        return buffer.transpose((3, 0, 1, 2))

    def loadvideo(self, fname):
        # random offset within the sampling stride (0 when frame_sample_rate == 1)
        remainder = np.random.randint(self.frame_sample_rate)
        # open the video
        capture = cv2.VideoCapture(fname)
        # number of frames
        frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        # frame width
        frame_width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
        # frame height
        frame_height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
        # if the video is wider than it is tall
        if frame_height < frame_width:
            # resize the height to somewhere between 128 and 160
            resize_height = np.random.randint(self.short_side[0], self.short_side[1] + 1)
            # scale the width proportionally
            resize_width = int(float(resize_height) / frame_height * frame_width)
        else:  # the video is taller than it is wide
            # resize the width to somewhere between 128 and 160
            resize_width = np.random.randint(self.short_side[0], self.short_side[1] + 1)
            # scale the height proportionally
            resize_height = int(float(resize_width) / frame_width * frame_height)

        # create a buffer. Must have dtype float, so it gets converted to a FloatTensor by Pytorch later
        start_idx = 0
        end_idx = frame_count - 1
        # number of frames to sample
        frame_count_sample = frame_count // self.frame_sample_rate - 1
        # if the video has more than 300 frames, take a random 300-frame window
        if frame_count > 300:
            end_idx = np.random.randint(300, frame_count)
            start_idx = end_idx - 300
            frame_count_sample = 301 // self.frame_sample_rate - 1
        # allocate an uninitialized numpy array of shape (samples, resize_height, resize_width, 3)
        buffer = np.empty((frame_count_sample, resize_height, resize_width, 3), np.dtype('float32'))

        count = 0
        retaining = True
        sample_count = 0

        # sample frames from the video,
        # reading one frame at a time into the numpy array
        while (count <= end_idx and retaining):
            retaining, frame = capture.read()
            if count < start_idx:
                count += 1
                continue
            if retaining is False or count > end_idx:
                break
            if count % self.frame_sample_rate == remainder and sample_count < frame_count_sample:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                # resize the frame if needed
                if (frame_height != resize_height) or (frame_width != resize_width):
                    frame = cv2.resize(frame, (resize_width, resize_height))
                buffer[sample_count] = frame
                sample_count = sample_count + 1
            count += 1
        capture.release()
        return buffer
    
    def crop(self, buffer, clip_len, crop_size):
        # pick a random temporal start index
        time_index = np.random.randint(buffer.shape[0] - clip_len)
        # pick a random height start index
        height_index = np.random.randint(buffer.shape[1] - crop_size)
        # pick a random width start index
        width_index = np.random.randint(buffer.shape[2] - crop_size)

        # crop along every dimension
        buffer = buffer[time_index: time_index + clip_len,
                 height_index: height_index + crop_size,
                 width_index: width_index + crop_size, :]

        return buffer

    def normalize(self, buffer):
        # normalize each frame to roughly [-1, 1]
        # buffer = (buffer - 128)/128.0
        for i, frame in enumerate(buffer):
            frame = (frame - np.array([[[128.0, 128.0, 128.0]]])) / 128.0
            buffer[i] = frame
        return buffer

    def randomflip(self, buffer):
        """Horizontally flip the given image and ground truth randomly with a probability of 0.5."""
        if np.random.random() < 0.5:
            for i, frame in enumerate(buffer):
                buffer[i] = cv2.flip(frame, flipCode=1)

        return buffer

    def __len__(self):
        # number of videos
        return len(self.fnames)


if __name__ == '__main__':

    datapath = '/Users/admin/Downloads/UCF-101/'
    train_dataloader = DataLoader(VideoDataset(datapath, mode='train'), batch_size=10, shuffle=True, num_workers=0)
    for step, (buffer, label) in enumerate(train_dataloader):
        # print the shape of the sampled clips
        print("buffer", buffer.size())
        # print the labels
        print("label: ", label)

Partial output:

buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([34, 14,  3, 95, 27, 99, 37, 88, 93, 75])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([10, 72, 77, 57, 88, 80, 97, 94, 74, 48])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([65, 14, 91, 40, 98,  7, 34, 42, 93, 15])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([39, 39, 37, 91, 74, 14, 18,  1, 56, 63])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([64, 32, 60, 29, 49, 49, 38, 22, 81, 30])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([33, 64, 23, 11, 64, 22,  7, 76, 10, 75])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([55, 52, 74, 84, 74, 81, 38, 46, 92, 54])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([17, 59, 95, 88, 87, 59, 49, 72,  3, 14])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([69, 57, 61, 57,  6, 41, 82, 25, 21, 75])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([97, 34, 29, 29, 60, 86, 94,  0, 24, 72])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([45, 26, 81, 71, 18, 41, 79, 61, 73, 26])
buffer torch.Size([10, 3, 16, 112, 112])
label:  tensor([ 2, 87, 23, 17, 98, 33, 20, 62, 51, 38])

It also creates a class-label text file with the following contents:

1 ApplyEyeMakeup
2 ApplyLipstick
3 Archery
4 BabyCrawling
5 BalanceBeam
6 BandMarching
7 BaseballPitch
8 Basketball
9 BasketballDunk
10 BenchPress
11 Biking
12 Billiards
13 BlowDryHair
14 BlowingCandles
15 BodyWeightSquats
16 Bowling
17 BoxingPunchingBag
18 BoxingSpeedBag
19 BreastStroke
20 BrushingTeeth
21 CleanAndJerk
22 CliffDiving
23 CricketBowling
24 CricketShot
25 CuttingInKitchen
26 Diving
27 Drumming
28 Fencing
29 FieldHockeyPenalty
30 FloorGymnastics
31 FrisbeeCatch
32 FrontCrawl
33 GolfSwing
34 Haircut
35 HammerThrow
36 Hammering
37 HandstandPushups
38 HandstandWalking
39 HeadMassage
40 HighJump
41 HorseRace
42 HorseRiding
43 HulaHoop
44 IceDancing
45 JavelinThrow
46 JugglingBalls
47 JumpRope
48 JumpingJack
49 Kayaking
50 Knitting
51 LongJump
52 Lunges
53 MilitaryParade
54 Mixing
55 MoppingFloor
56 Nunchucks
57 ParallelBars
58 PizzaTossing
59 PlayingCello
60 PlayingDaf
61 PlayingDhol
62 PlayingFlute
63 PlayingGuitar
64 PlayingPiano
65 PlayingSitar
66 PlayingTabla
67 PlayingViolin
68 PoleVault
69 PommelHorse
70 PullUps
71 Punch
72 PushUps
73 Rafting
74 RockClimbingIndoor
75 RopeClimbing
76 Rowing
77 SalsaSpin
78 ShavingBeard
79 Shotput
80 SkateBoarding
81 Skiing
82 Skijet
83 SkyDiving
84 SoccerJuggling
85 SoccerPenalty
86 StillRings
87 SumoWrestling
88 Surfing
89 Swing
90 TableTennisShot
91 TaiChi
92 TennisSwing
93 ThrowDiscus
94 TrampolineJumping
95 Typing
96 UnevenBars
97 VolleyballSpiking
98 WalkingWithDog
99 WallPushups
100 WritingOnBoard
101 YoYo

Building the SlowFast Network

The SlowFast structure table contains the stages res2, res3, res4, and res5, so we first define their residual block.

import torch
import torch.nn as nn

# the available 3D ResNet depths
__all__ = ['resnet50', 'resnet101', 'resnet152', 'resnet200']


class Bottleneck(nn.Module):
    # ResBlock
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, head_conv=1):
        super(Bottleneck, self).__init__()
        # head_conv == 1: no temporal convolution in the first layer
        # (used in the shallow stages of the slow pathway)
        if head_conv == 1:
            self.conv1 = nn.Conv3d(inplanes, planes, kernel_size=1, bias=False)
            self.bn1 = nn.BatchNorm3d(planes)
        elif head_conv == 3:
            # head_conv == 3: the first layer has a temporal kernel of size 3
            # (used throughout the fast pathway and in slow res4/res5)
            self.conv1 = nn.Conv3d(inplanes, planes, kernel_size=(3, 1, 1), bias=False, padding=(1, 0, 0))
            self.bn1 = nn.BatchNorm3d(planes)
        else:
            raise ValueError("Unsupported head_conv!")
        # the second convolution is the same in both pathways (spatial only)
        self.conv2 = nn.Conv3d(planes, planes, kernel_size=(1, 3, 3), stride=(1, stride, stride),
                               padding=(0, 1, 1), bias=False)
        self.bn2 = nn.BatchNorm3d(planes)
        # so is the third (1x1x1, expanding the channels)
        self.conv3 = nn.Conv3d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm3d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        # optional downsampling for the residual connection
        self.downsample = downsample
        # spatial stride
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)

        return out
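
A quick way to see what head_conv changes is to push a dummy feature map through one block. This shape check is my own addition (it assumes the imports and the Bottleneck class above); note that neither variant changes the temporal length, since the temporal convolution uses stride 1 with padding:

# (N, C, T, H, W) dummy input
x = torch.rand(1, 64, 4, 56, 56)
# projection shortcut: the residual must be expanded from 64 to 64 * expansion channels
shortcut = nn.Sequential(
    nn.Conv3d(64, 256, kernel_size=1, bias=False),
    nn.BatchNorm3d(256))

block_spatial = Bottleneck(64, 64, downsample=shortcut, head_conv=1)   # 1x1x1 first conv
block_temporal = Bottleneck(64, 64, downsample=shortcut, head_conv=3)  # 3x1x1 first conv

print(block_spatial(x).shape)   # torch.Size([1, 256, 4, 56, 56])
print(block_temporal(x).shape)  # torch.Size([1, 256, 4, 56, 56])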

Next, we assemble the main SlowFast network.

class SlowFast(nn.Module):

    def __init__(self, block=Bottleneck, layers=[3, 4, 6, 3], class_num=10, dropout=0.5):
        super(SlowFast, self).__init__()
        # temporal (fast) pathway
        self.fast_inplanes = 8
        # first convolution of the fast pathway, downsampling spatially
        self.fast_conv1 = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False)
        self.fast_bn1 = nn.BatchNorm3d(8)
        self.fast_relu = nn.ReLU(inplace=True)
        # spatial downsampling by pooling
        self.fast_maxpool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
        # stack res2, res3, res4, res5; each stage downsamples spatially only once
        self.fast_res2 = self._make_layer_fast(block, 8, layers[0], head_conv=3)
        self.fast_res3 = self._make_layer_fast(block, 16, layers[1], stride=2, head_conv=3)
        self.fast_res4 = self._make_layer_fast(block, 32, layers[2], stride=2, head_conv=3)
        self.fast_res5 = self._make_layer_fast(block, 64, layers[3], stride=2, head_conv=3)
        # lateral connections using the third fusion strategy (time-strided convolution),
        # which fuse fast-pathway feature maps into the slow pathway
        self.lateral_p1 = nn.Conv3d(8, 8 * 2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
        self.lateral_res2 = nn.Conv3d(32, 32 * 2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
        self.lateral_res3 = nn.Conv3d(64, 64 * 2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
        self.lateral_res4 = nn.Conv3d(128, 128 * 2, kernel_size=(5, 1, 1), stride=(8, 1, 1), bias=False, padding=(2, 0, 0))
        # spatial (slow) pathway; its input channels include the 2*beta*C lateral channels
        self.slow_inplanes = 64 + 64 // 8 * 2
        # first convolution of the slow pathway, downsampling spatially
        self.slow_conv1 = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3), bias=False)
        self.slow_bn1 = nn.BatchNorm3d(64)
        self.slow_relu = nn.ReLU(inplace=True)
        # spatial downsampling by pooling
        self.slow_maxpool = nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
        # stack res2, res3, res4, res5; temporal kernels (head_conv=3) only appear from res4 onwards
        self.slow_res2 = self._make_layer_slow(block, 64, layers[0], head_conv=1)
        self.slow_res3 = self._make_layer_slow(block, 128, layers[1], stride=2, head_conv=1)
        self.slow_res4 = self._make_layer_slow(block, 256, layers[2], stride=2, head_conv=3)
        self.slow_res5 = self._make_layer_slow(block, 512, layers[3], stride=2, head_conv=3)
        self.dp = nn.Dropout(dropout)
        # final fully connected layer producing the class_num-dimensional output
        self.fc = nn.Linear(self.fast_inplanes + 2048, class_num, bias=False)

    def forward(self, input):
        # feed the clip into the fast pathway with temporal stride 2
        fast, lateral = self.FastPath(input[:, :, ::2, :, :])
        # feed the clip into the slow pathway with temporal stride 16,
        # fusing in the fast pathway's lateral feature maps
        slow = self.SlowPath(input[:, :, ::16, :, :], lateral)
        x = torch.cat([slow, fast], dim=1)
        x = self.dp(x)
        x = self.fc(x)
        return x

    def SlowPath(self, input, lateral):
        x = self.slow_conv1(input)
        x = self.slow_bn1(x)
        x = self.slow_relu(x)
        x = self.slow_maxpool(x)
        # fuse in the lateral feature maps from the fast pathway
        x = torch.cat([x, lateral[0]], dim=1)
        x = self.slow_res2(x)
        x = torch.cat([x, lateral[1]], dim=1)
        x = self.slow_res3(x)
        x = torch.cat([x, lateral[2]], dim=1)
        x = self.slow_res4(x)
        x = torch.cat([x, lateral[3]], dim=1)
        x = self.slow_res5(x)
        x = nn.AdaptiveAvgPool3d(1)(x)
        x = x.view(-1, x.size(1))
        return x

    def FastPath(self, input):
        lateral = []
        x = self.fast_conv1(input)
        x = self.fast_bn1(x)
        x = self.fast_relu(x)
        pool1 = self.fast_maxpool(x)
        # produce a lateral feature map whose temporal and spatial sizes match the slow pathway's (only the channel count differs)
        lateral_p = self.lateral_p1(pool1)
        lateral.append(lateral_p)

        res2 = self.fast_res2(pool1)
        lateral_res2 = self.lateral_res2(res2)
        lateral.append(lateral_res2)
        
        res3 = self.fast_res3(res2)
        lateral_res3 = self.lateral_res3(res3)
        lateral.append(lateral_res3)

        res4 = self.fast_res4(res3)
        lateral_res4 = self.lateral_res4(res4)
        lateral.append(lateral_res4)

        res5 = self.fast_res5(res4)
        x = nn.AdaptiveAvgPool3d(1)(res5)
        x = x.view(-1, x.size(1))

        return x, lateral

    def _make_layer_fast(self, block, planes, blocks, stride=1, head_conv=1):
        downsample = None
        # projection shortcut when spatial downsampling or a channel change is needed
        if stride != 1 or self.fast_inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv3d(self.fast_inplanes, planes * block.expansion, kernel_size=1,
                          stride=(1, stride, stride), bias=False),
                nn.BatchNorm3d(planes * block.expansion))

        layers = []
        # first ResBlock, with downsampling
        layers.append(block(self.fast_inplanes, planes, stride, downsample, head_conv=head_conv))
        self.fast_inplanes = planes * block.expansion
        # stack the remaining ResBlocks, without downsampling
        for i in range(1, blocks):
            layers.append(block(self.fast_inplanes, planes, head_conv=head_conv))
        return nn.Sequential(*layers)

    def _make_layer_slow(self, block, planes, blocks, stride=1, head_conv=1):
        downsample = None
        # projection shortcut when spatial downsampling or a channel change is needed
        if stride != 1 or self.slow_inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv3d(self.slow_inplanes, planes * block.expansion, kernel_size=1,
                          stride=(1, stride, stride), bias=False),
                nn.BatchNorm3d(planes * block.expansion))

        layers = []
        # first ResBlock, with downsampling
        layers.append(block(self.slow_inplanes, planes, stride, downsample, head_conv=head_conv))
        self.slow_inplanes = planes * block.expansion
        # stack the remaining ResBlocks, without downsampling
        for i in range(1, blocks):
            layers.append(block(self.slow_inplanes, planes, head_conv=head_conv))

        # account for the 2*beta*C lateral channels concatenated before the next stage
        self.slow_inplanes = planes * block.expansion + planes * block.expansion // 8 * 2
        return nn.Sequential(*layers)
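
Before adding the constructors, it is worth tracing the channel bookkeeping of the lateral fusion (my own back-of-envelope check): after fast_maxpool the fast pathway has 8 channels and lateral_p1 maps them to 16; the slow pathway has 64 channels after its maxpool, so slow_res2 receives 64 + 16 = 80 channels, which is exactly slow_inplanes = 64 + 64 // 8 * 2. The same pattern repeats before res3, res4 and res5, and the classifier sees 2048 (slow) + 256 (fast) = 2304 features:

# channels entering each slow stage = slow channels + 2 * (fast channels at the same point)
slow_c = [64, 256, 512, 1024]   # slow pathway: after maxpool, res2, res3, res4
fast_c = [8, 32, 64, 128]       # fast pathway at the same points
for s, f in zip(slow_c, fast_c):
    print(s + 2 * f)            # 80, 320, 640, 1280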

Constructors for 3D ResNet backbones of different depths:

def resnet50(**kwargs):
    """Constructs a ResNet-50 model.
    """
    # each Bottleneck has 3 convolutions, so 3*3 + 4*3 + 6*3 + 3*3 = 48 layers,
    # plus the first convolution and the final fc layer = 50
    model = SlowFast(Bottleneck, [3, 4, 6, 3], **kwargs)
    return model


def resnet101(**kwargs):
    """Constructs a ResNet-101 model.
    """
    # 3*3 + 4*3 + 23*3 + 3*3 = 99, plus the first convolution and the final fc layer = 101
    model = SlowFast(Bottleneck, [3, 4, 23, 3], **kwargs)
    return model


def resnet152(**kwargs):
    """Constructs a ResNet-152 model.
    """
    # 3*3 + 8*3 + 36*3 + 3*3 = 150, plus the first convolution and the final fc layer = 152
    model = SlowFast(Bottleneck, [3, 8, 36, 3], **kwargs)
    return model


def resnet200(**kwargs):
    """Constructs a ResNet-200 model.
    """
    # 3*3 + 24*3 + 36*3 + 3*3 = 198, plus the first convolution and the final fc layer = 200
    model = SlowFast(Bottleneck, [3, 24, 36, 3], **kwargs)
    return model


if __name__ == "__main__":

    num_classes = 101
    input_tensor = torch.rand(1, 3, 64, 224, 224)
    model = resnet50(class_num=num_classes)
    output = model(input_tensor)
    print(output.size())

Output:

torch.Size([1, 101])

Training the Model

First, set the hyperparameters needed for training; they live in a params dict in config.py, which the training script imports:

params = dict()
# number of action classes
params['num_classes'] = 101

params['dataset'] = '/Users/admin/Downloads/UCF-101/'
# number of training epochs
params['epoch_num'] = 40
# batch size
params['batch_size'] = 16
# the learning rate is divided by 10 every 'step' epochs
params['step'] = 10
# number of data-loading worker processes
params['num_workers'] = 4
# learning rate
params['learning_rate'] = 1e-2
params['momentum'] = 0.9
params['weight_decay'] = 1e-5
# print a log line every 'display' steps
params['display'] = 10
# optional pretrained weights
params['pretrained'] = None
# GPUs to use
params['gpu'] = [0]
params['log'] = 'log'
params['save_path'] = 'UCF101'
# clip length
params['clip_len'] = 64
# frame sampling rate
params['frame_sample_rate'] = 1

The training code:

import os
import time
import torch
from config import params
from torch import nn, optim
from torch.utils.data import DataLoader
import torch.backends.cudnn as cudnn
from lib.dataset import VideoDataset
from lib import slowfastnet
from tensorboardX import SummaryWriter

class AverageMeter(object):
    """Computes and stores running statistics: current value, sum, count, and average"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def accuracy(output, target, topk=(1,)):
    """计算对于指定k值的精度"""
    maxk = max(topk)
    batch_size = target.size(0)

    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res.append(correct_k.mul_(100.0 / batch_size))
    return res

def train(model, train_dataloader, epoch, criterion, optimizer, writer, device):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()

    model.train()
    end = time.time()
    # iterate over the training set
    for step, (inputs, labels) in enumerate(train_dataloader):
        data_time.update(time.time() - end)

        # inputs = inputs.cuda()
        # labels = labels.cuda()
        inputs = inputs.to(device)
        labels = labels.to(device)
        # forward pass
        outputs = model(inputs)
        # cross-entropy loss
        loss = criterion(outputs, labels)

        # compute top-1 and top-5 accuracy
        prec1, prec5 = accuracy(outputs.data, labels, topk=(1, 5))
        # record the loss
        losses.update(loss.item(), inputs.size(0))
        # record top-1 accuracy
        top1.update(prec1.item(), inputs.size(0))
        # record top-5 accuracy
        top5.update(prec5.item(), inputs.size(0))
        # backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # time taken to process one batch
        batch_time.update(time.time() - end)
        end = time.time()
        # print logs periodically
        if (step + 1) % params['display'] == 0:
            print('-------------------------------------------------------')
            for param in optimizer.param_groups:
                print('lr: ', param['lr'])
            print_string = 'Epoch: [{0}][{1}/{2}]'.format(epoch, step+1, len(train_dataloader))
            print(print_string)
            print_string = 'data_time: {data_time:.3f}, batch time: {batch_time:.3f}'.format(
                data_time=data_time.val,
                batch_time=batch_time.val)
            print(print_string)
            print_string = 'loss: {loss:.5f}'.format(loss=losses.avg)
            print(print_string)
            print_string = 'Top-1 accuracy: {top1_acc:.2f}%, Top-5 accuracy: {top5_acc:.2f}%'.format(
                top1_acc=top1.avg,
                top5_acc=top5.avg)
            print(print_string)
    # write epoch statistics to TensorBoard
    writer.add_scalar('train_loss_epoch', losses.avg, epoch)
    writer.add_scalar('train_top1_acc_epoch', top1.avg, epoch)
    writer.add_scalar('train_top5_acc_epoch', top5.avg, epoch)

def validation(model, val_dataloader, epoch, criterion, optimizer, writer, device):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top1 = AverageMeter()
    top5 = AverageMeter()
    model.eval()

    end = time.time()
    with torch.no_grad():
        for step, (inputs, labels) in enumerate(val_dataloader):
            data_time.update(time.time() - end)
            # inputs = inputs.cuda()
            # labels = labels.cuda()
            inputs = inputs.to(device)
            labels = labels.to(device)
            # forward pass only; no backpropagation during validation
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            # measure accuracy and record loss

            prec1, prec5 = accuracy(outputs.data, labels, topk=(1, 5))
            losses.update(loss.item(), inputs.size(0))
            top1.update(prec1.item(), inputs.size(0))
            top5.update(prec5.item(), inputs.size(0))
            batch_time.update(time.time() - end)
            end = time.time()
            if (step + 1) % params['display'] == 0:
                print('----validation----')
                print_string = 'Epoch: [{0}][{1}/{2}]'.format(epoch, step + 1, len(val_dataloader))
                print(print_string)
                print_string = 'data_time: {data_time:.3f}, batch time: {batch_time:.3f}'.format(
                    data_time=data_time.val,
                    batch_time=batch_time.val)
                print(print_string)
                print_string = 'loss: {loss:.5f}'.format(loss=losses.avg)
                print(print_string)
                print_string = 'Top-1 accuracy: {top1_acc:.2f}%, Top-5 accuracy: {top5_acc:.2f}%'.format(
                    top1_acc=top1.avg,
                    top5_acc=top5.avg)
                print(print_string)
    writer.add_scalar('val_loss_epoch', losses.avg, epoch)
    writer.add_scalar('val_top1_acc_epoch', top1.avg, epoch)
    writer.add_scalar('val_top5_acc_epoch', top5.avg, epoch)


def main():
    # disable the cudnn auto-tuner that searches for the fastest convolution algorithms
    cudnn.benchmark = False
    cur_time = time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime(time.time()))
    logdir = os.path.join(params['log'], cur_time)
    if not os.path.exists(logdir):
        os.makedirs(logdir)

    writer = SummaryWriter(log_dir=logdir)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    print("Loading dataset")
    # training data loader
    train_dataloader = \
        DataLoader(
            VideoDataset(params['dataset'], mode='train', clip_len=params['clip_len'], frame_sample_rate=params['frame_sample_rate']),
            batch_size=params['batch_size'], shuffle=True, num_workers=params['num_workers'])
    # validation data loader
    val_dataloader = \
        DataLoader(
            VideoDataset(params['dataset'], mode='validation', clip_len=params['clip_len'], frame_sample_rate=params['frame_sample_rate']),
            batch_size=params['batch_size'], shuffle=False, num_workers=params['num_workers'])

    print("load model")
    # build the SlowFast network on a 3D ResNet-50 backbone
    model = slowfastnet.resnet50(class_num=params['num_classes'])
    # optionally load pretrained weights
    if params['pretrained'] is not None:
        pretrained_dict = torch.load(params['pretrained'], map_location='cpu')
        try:
            model_dict = model.module.state_dict()
        except AttributeError:
            model_dict = model.state_dict()
        pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
        print("load pretrain model")
        model_dict.update(pretrained_dict)
        model.load_state_dict(model_dict)
    
    # model = model.cuda(params['gpu'][0])
    # model = nn.DataParallel(model, device_ids=params['gpu'])  # multi-Gpu
    model = model.to(device)

    # criterion = nn.CrossEntropyLoss().cuda()
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = optim.SGD(model.parameters(), lr=params['learning_rate'], momentum=params['momentum'], weight_decay=params['weight_decay'])
    # drop the learning rate to 1/10 of its value every 'step' epochs
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=params['step'], gamma=0.1)

    model_save_dir = os.path.join(params['save_path'], cur_time)
    if not os.path.exists(model_save_dir):
        os.makedirs(model_save_dir)
    for epoch in range(params['epoch_num']):
        # train for one epoch
        train(model, train_dataloader, epoch, criterion, optimizer, writer, device)
        if epoch % 2 == 0:
            # run validation every other epoch
            validation(model, val_dataloader, epoch, criterion, optimizer, writer, device)
        scheduler.step()
        if epoch % 1 == 0:
            checkpoint = os.path.join(model_save_dir,
                                      "clip_len_" + str(params['clip_len']) + "frame_sample_rate_" +
                                      str(params['frame_sample_rate']) + "_checkpoint_" +
                                      str(epoch) + ".pth.tar")
            # for multi-GPU training, save model.module.state_dict() instead
            # torch.save(model.module.state_dict(), checkpoint)
            torch.save(model.state_dict(), checkpoint)

    writer.close()

if __name__ == '__main__':
    main()
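
After training, a saved checkpoint can be used for inference on single clips. Below is a minimal sketch of my own (the checkpoint file name is a placeholder for whatever your run wrote under the save_path directory):

import torch
from torch.utils.data import DataLoader
from lib.dataset import VideoDataset
from lib import slowfastnet

# placeholder: substitute the checkpoint file written by your own training run
checkpoint_path = 'UCF101/clip_len_64frame_sample_rate_1_checkpoint_39.pth.tar'

model = slowfastnet.resnet50(class_num=101)
model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))
model.eval()

val_loader = DataLoader(
    VideoDataset('/Users/admin/Downloads/UCF-101/', mode='validation', clip_len=64),
    batch_size=1, shuffle=True)

with torch.no_grad():
    clip, label = next(iter(val_loader))   # clip: (1, 3, 64, 112, 112)
    probs = torch.softmax(model(clip), dim=1)
    conf, pred = probs.max(dim=1)
    print('predicted class %d (confidence %.2f), ground truth %d'
          % (pred.item(), conf.item(), label.item()))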