PyTorch for Audio + Music Processing (8/9/10): Building, Training, and Running Inference with a CNN
迪丽瓦拉
2024-04-25 06:12:49

Building, Training, and Running Inference with a CNN

Table of Contents

  • Building, Training, and Running Inference with a CNN
  • Preface
    • 08 Implementing a CNN network
    • 09 Training urban sound classifier
    • 10 Predictions with sound classifier
  • Part 1: Building the CNN Model
    • Construction steps
    • Printing the network structure with torchsummary
    • Input/output shapes and parameter counts
      • Computing the output shape
      • Computing the parameter count
  • Part 2: Training the Model
    • Creating the dataloader
    • Training for a single epoch
    • Training for multiple epochs
    • The full training script
    • Final training output
  • Part 3: Model Inference
    • Defining class_mapping
    • The prediction function
  • Summary


Preface

This is the final part of the series: building and training a neural-network model for urban sound classification. For the course outline and the dataset preparation, see my earlier posts:
1. PyTorch for Audio + Music Processing (1): Course Overview
2. PyTorch for Audio + Music Processing (2/3/4/5/6/7): Building the Dataset and Extracting Audio Features
This post covers:

08 Implementing a CNN network

Building a CNN model with a VGG-like structure

09 Training urban sound classifier

Training the urban sound audio classification model

10 Predictions with sound classifier

Implementing the inference step


Part 1: Building the CNN Model

Construction steps

The model is built as follows:

1. Four convolutional blocks (conv1 through conv4), each containing Conv2d, ReLU, and MaxPool2d
2. A Flatten layer
3. A fully connected Linear layer
4. A Softmax layer

The code is as follows:

class CNNNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        # 4 conv blocks / flatten / linear / softmax
        self.conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels=1,
                out_channels=16,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(
                in_channels=32,
                out_channels=64,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(
                in_channels=64,
                out_channels=128,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(128 * 5 * 4, 10)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_data):
        x = self.conv1(input_data)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.flatten(x)
        logits = self.linear(x)
        predictions = self.softmax(logits)
        return predictions

Printing the network structure with torchsummary
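
The summary below can be reproduced with a call along these lines (a minimal sketch: it assumes the torchsummary package is installed, and (1, 64, 44) is the single-channel 64×44 mel-spectrogram input shape from the previous post):

from torchsummary import summary

# print a layer-by-layer summary for a 1-channel 64x44 mel-spectrogram input
cnn = CNNNetwork()
summary(cnn, (1, 64, 44), device="cpu")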

        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 16, 66, 46]             160
              ReLU-2           [-1, 16, 66, 46]               0
         MaxPool2d-3           [-1, 16, 33, 23]               0
            Conv2d-4           [-1, 32, 35, 25]           4,640
              ReLU-5           [-1, 32, 35, 25]               0
         MaxPool2d-6           [-1, 32, 17, 12]               0
            Conv2d-7           [-1, 64, 19, 14]          18,496
              ReLU-8           [-1, 64, 19, 14]               0
         MaxPool2d-9             [-1, 64, 9, 7]               0
           Conv2d-10           [-1, 128, 11, 9]          73,856
             ReLU-11           [-1, 128, 11, 9]               0
        MaxPool2d-12            [-1, 128, 5, 4]               0
          Flatten-13                 [-1, 2560]               0
           Linear-14                   [-1, 10]          25,610
          Softmax-15                   [-1, 10]               0
================================================================
Total params: 122,762
Trainable params: 122,762
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 1.83
Params size (MB): 0.47
Estimated Total Size (MB): 2.31
----------------------------------------------------------------

Input/output shapes and parameter counts

Take the first convolutional block as an example. The mel-spectrogram features extracted from the audio earlier in the series form a tensor of shape 64×44.
The block is defined as:

self.conv1 = nn.Sequential(
    nn.Conv2d(
        in_channels=1,
        out_channels=16,
        kernel_size=3,
        stride=1,
        padding=2
    ),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2)
)

Computing the output shape

With kernel_size=3, stride=1, and padding=2, each spatial dimension follows the standard convolution formula out = (in + 2 × padding − kernel_size) / stride + 1, so the height becomes (64 + 4 − 3) + 1 = 66 and the width becomes (44 + 4 − 3) + 1 = 46.
out_channels=16 means there are 16 kernels (output channels), each convolving the input independently, so the output also has 16 channels.
The output tensor therefore has shape 16×66×46.
MaxPool2d with kernel_size=2 then halves each spatial dimension, giving a final shape of 16×33×23.
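
These numbers are easy to sanity-check by pushing a dummy input through the block (a minimal sketch; the 64×44 dummy spectrogram shape is taken from the previous post):

import torch
import torch.nn as nn

# rebuild the first conv block on its own
conv1 = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2)
)

dummy = torch.randn(1, 1, 64, 44)   # (batch, channels, n_mels, time frames)
print(conv1[0](dummy).shape)        # torch.Size([1, 16, 66, 46]) after Conv2d
print(conv1(dummy).shape)           # torch.Size([1, 16, 33, 23]) after pooling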

Computing the parameter count

Each kernel is 3×3 and spans all input channels, with weights shared across spatial positions; since in_channels=1 here, each kernel has 3 × 3 × 1 = 9 weights.
With 16 kernels that gives 16 × 9 = 144 weights; adding one bias per kernel (16 biases) yields the final count of 16 × 9 + 16 = 160 parameters.
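
This can be verified directly on the block from the previous sketch:

# count the parameters of the Conv2d inside the block above
conv_layer = conv1[0]
n_weights = conv_layer.weight.numel()   # 16 * 1 * 3 * 3 = 144
n_biases = conv_layer.bias.numel()      # 16
print(n_weights + n_biases)             # 160, matching the torchsummary output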

Part 2: Training the Model

Creating the dataloader

from torch.utils.data import DataLoader  # torch's DataLoader

def create_data_loader(train_data, batch_size):
    # train_data is the UrbanSoundDataset defined in the earlier posts;
    # batch_size is the number of samples per training batch
    train_dataloader = DataLoader(train_data, batch_size=batch_size)
    return train_dataloader

Training for a single epoch

def train_single_epoch(model, data_loader, loss_fn, optimiser, device):
    # model is the CNN defined in Part 1; loss_fn is the loss function;
    # optimiser is the optimisation method; device is the training device
    for input, target in data_loader:
        # fetch training data and labels from the iterator, moved to the device
        input, target = input.to(device), target.to(device)

        # calculate loss
        prediction = model(input)           # forward pass
        loss = loss_fn(prediction, target)  # loss from model output and labels

        # backpropagate error and update weights
        optimiser.zero_grad()  # zero the gradients: training uses mini-batches,
                               # and without clearing them the gradients would
                               # carry over from the previous batch
        loss.backward()        # backpropagation computes the gradients
        optimiser.step()       # update the weights using the optimiser and gradients

    print(f"loss: {loss.item()}")  # loss of the last batch in the epoch

Training for multiple epochs

def train(model, data_loader, loss_fn, optimiser, device, epochs):
    for i in range(epochs):
        print(f"Epoch {i+1}")
        train_single_epoch(model, data_loader, loss_fn, optimiser, device)
        print("---------------------------")
    print("Finished training")

The full training script

if __name__ == "__main__":
    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
    print(f"Using {device}")

    # define the mel-spectrogram transform with torchaudio,
    # to be passed into the UrbanSoundDataset below
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE,
        n_fft=1024,
        hop_length=512,
        n_mels=64
    )

    # instantiate the dataset
    usd = UrbanSoundDataset(ANNOTATIONS_FILE,
                            AUDIO_DIR,
                            mel_spectrogram,
                            SAMPLE_RATE,
                            NUM_SAMPLES,
                            device)

    # call the create_data_loader defined above
    train_dataloader = create_data_loader(usd, BATCH_SIZE)

    # construct model and assign it to device
    cnn = CNNNetwork().to(device)
    print(cnn)  # instantiate and inspect the CNNNetwork model

    # initialise loss function + optimiser
    loss_fn = nn.CrossEntropyLoss()  # cross-entropy loss
    optimiser = torch.optim.Adam(cnn.parameters(),
                                 lr=LEARNING_RATE)  # optimisation method

    # train model
    train(cnn, train_dataloader, loss_fn, optimiser, device, EPOCHS)

    # save model
    torch.save(cnn.state_dict(), "feedforwardnet.pth")
    print("Trained feed forward net saved at feedforwardnet.pth")
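
The ALL-CAPS constants (ANNOTATIONS_FILE, AUDIO_DIR, SAMPLE_RATE, NUM_SAMPLES, BATCH_SIZE, LEARNING_RATE, EPOCHS) are defined in the earlier posts of the series. For reference, values along the following lines are consistent with the run below, though apart from EPOCHS = 10 (which matches the ten epochs in the log) they are assumptions, and the paths are placeholders:

# assumed hyperparameters and placeholder paths, not verbatim from this post
BATCH_SIZE = 128
EPOCHS = 10                  # matches the ten epochs in the log below
LEARNING_RATE = 0.001
SAMPLE_RATE = 22050
NUM_SAMPLES = 22050          # one second of audio at 22050 Hz
ANNOTATIONS_FILE = "UrbanSound8K/metadata/UrbanSound8K.csv"
AUDIO_DIR = "UrbanSound8K/audio"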

Final training output

Epoch 1
loss: 2.241577625274658
---------------------------
Epoch 2
loss: 2.2747385501861572
---------------------------
Epoch 3
loss: 2.3089897632598877
---------------------------
Epoch 4
loss: 2.348045587539673
---------------------------
Epoch 5
loss: 2.315420150756836
---------------------------
Epoch 6
loss: 2.3148367404937744
---------------------------
Epoch 7
loss: 2.31473708152771
---------------------------
Epoch 8
loss: 2.3141160011291504
---------------------------
Epoch 9
loss: 2.3157730102539062
---------------------------
Epoch 10
loss: 2.3171067237854004
---------------------------
Finished training
Trained feed forward net saved at feedforwardnet.pth

Process finished with exit code 0
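
Note that the loss hovers around 2.30 for all ten epochs, which is suspicious: ln(10) ≈ 2.303 is exactly the cross-entropy loss of a uniform guess over ten classes, so the model is barely learning here. A likely culprit (my reading, not something stated in the course): nn.CrossEntropyLoss already applies log-softmax internally, so passing it softmax outputs squashes the gradients. The usual fix is to return raw logits from forward():

# sketch of the common fix: drop the explicit Softmax during training and
# let nn.CrossEntropyLoss apply log-softmax itself
def forward(self, input_data):
    x = self.conv1(input_data)
    x = self.conv2(x)
    x = self.conv3(x)
    x = self.conv4(x)
    x = self.flatten(x)
    logits = self.linear(x)
    return logits  # apply softmax only at inference time, if probabilities are needed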

Part 3: Model Inference

Defining class_mapping

The model outputs the index of the predicted class, so a mapping from index (position) to class name is defined here; the entries follow the classes used by the UrbanSoundDataset.

class_mapping = [
    "air_conditioner",
    "car_horn",
    "children_playing",
    "dog_bark",
    "drilling",
    "engine_idling",
    "gun_shot",
    "jackhammer",
    "siren",
    "street_music"
]
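
The order matters: index i must correspond to classID i in the UrbanSound8K metadata. As a hypothetical alternative safeguard, the mapping could be derived from the annotations file itself (assuming the standard UrbanSound8K.csv columns classID and class):

import pandas as pd

# derive the mapping from the metadata so the index always matches classID
df = pd.read_csv(ANNOTATIONS_FILE)
class_mapping = (
    df[["classID", "class"]]
    .drop_duplicates()
    .sort_values("classID")["class"]
    .tolist()
)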

The prediction function

def predict(model, input, target, class_mapping):
    model.eval()  # required: in eval() mode PyTorch freezes BatchNorm and
                  # Dropout, using the trained values instead of batch statistics
    with torch.no_grad():
        predictions = model(input)
        # Tensor (1, 10) -> [ [0.1, 0.01, ..., 0.6] ]
        predicted_index = predictions[0].argmax(0)
        predicted = class_mapping[predicted_index]  # the model's prediction
        expected = class_mapping[target]            # the ground-truth label
    return predicted, expected
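
One way to drive predict() end to end (a sketch assuming the dataset class, transform, and constants from the training script; inference is run on the CPU here):

if __name__ == "__main__":
    # load the trained weights back into a fresh model
    cnn = CNNNetwork()
    state_dict = torch.load("feedforwardnet.pth")
    cnn.load_state_dict(state_dict)

    # rebuild the dataset exactly as during training
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE,
        n_fft=1024,
        hop_length=512,
        n_mels=64
    )
    usd = UrbanSoundDataset(ANNOTATIONS_FILE, AUDIO_DIR, mel_spectrogram,
                            SAMPLE_RATE, NUM_SAMPLES, "cpu")

    # take the first sample; unsqueeze_ adds the batch dimension -> (1, 1, 64, 44)
    input, target = usd[0][0], usd[0][1]
    input.unsqueeze_(0)

    predicted, expected = predict(cnn, input, target, class_mapping)
    print(f"Predicted: '{predicted}', expected: '{expected}'")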

Summary

This PyTorch for Audio + Music Processing series has covered, end to end:
1. Processing and loading an audio dataset and extracting mel-spectrogram features with torchaudio
2. Building a basic CNN classification model
3. Training a PyTorch model and running predictions
The logic is clear and the explanations are detailed, making it a good entry point. That said, as the course author notes, it only introduces the basic framework and the general approach to this class of problem; the network is a very basic VGG-like structure, and interested readers can try more state-of-the-art models and richer features to improve performance.
