自闭症康复网 > 深度学习中训练迭代次数理解【源码阅读技巧分享】【深度学习循环迭代理解】【for X

深度学习中训练迭代次数理解【源码阅读技巧分享】【深度学习循环迭代理解】【for X

时间：2023-04-10 02:42:46

这里写目录标题

起因配置环境问题探索一、由近及远二、追根溯源三、问题总结本文源码

起因

近日，博主在学习《动手学深度学习》(PyTorch版)时，用fashion_mnist复现LeNet时想知道这个for循环运行了多少次：

代码如下：（在文末会给出整个代码）

for X, y in train_iter:X = X.to(device)y = y.to(device)y_hat = net(X)l = loss(y_hat, y)optimizer.zero_grad()l.backward()optimizer.step()train_l_sum += l.cpu().item()train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()n += y.shape[0]batch_count += 1test_acc = evaluate_accuracy(test_iter, net)

这段代码的意思就是从train_iter读取样本X和标签y，但是奇怪的是这个for循环执行次数为235次，但是设置的batch_size 为256。这引起了我的兴趣，这个235的数从何而来，本代码自始至终从未定义过235这个数

通过Debug注意到：

更吊轨的是train_iter长度为40,40 在此代码中也从未出定义过。

如果你和我有一样的困惑，并感兴趣的话请往下继续观看，这或许会对你理解源代码有所帮助

配置环境

使用环境：python3.8

平台：Windows10

IDE：PyCharm

问题探索

一、由近及远

这个疑问发生的地方在函数def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):中

这个函数传入的数据被用于for循环：for X, y in train_iter:在于 train_iter变量

train_iter是通过函数train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)传入的

即train_ch5函数外部的train_iter将其数据传给了train_ch5函数内部的train_ite

train_ch5函数外部的train_iter的由来是通过赋值：train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)得到的

最终定位到问题关键：d2l.load_data_fashion_mnist(batch_size=batch_size)函数

二、追根溯源

d2l.load_data_fashion_mnist（）

的源码如下：

def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):"""Download the fashion mnist dataset and then load into memory."""trans = []if resize:trans.append(torchvision.transforms.Resize(size=resize))trans.append(torchvision.transforms.ToTensor())transform = pose(trans)mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)if sys.platform.startswith('win'):num_workers = 0 # 0表示不用额外的进程来加速读取数据else:num_workers = 4train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)return train_iter, test_iter

可以看到，里面涉及到train_iter数据的内容关键为这三句代码：

transform = pose(trans)mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)

而其中第一句

transform = pose(trans)

意义再远转换图片格式为张量

第二句为第三句通过支撑，所以关键在于第三句

train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)

第三句的意思为按照要求（传入参数）加载数据，其实就是老生常谈的DataLoader问题了，DataLoader源码如下（删去了源码中大部分的注释）：

class DataLoader(object):__initialized = Falsedef __init__(self, dataset, batch_size=1, shuffle=False, sampler=None,batch_sampler=None, num_workers=0, collate_fn=None,pin_memory=False, drop_last=False, timeout=0,worker_init_fn=None, multiprocessing_context=None):torch._C._log_api_usage_once("python.data_loader")if num_workers < 0:raise ValueError('num_workers option should be non-negative; ''use num_workers=0 to disable multiprocessing.')if timeout < 0:raise ValueError('timeout option should be non-negative')self.dataset = datasetself.num_workers = num_workersself.pin_memory = pin_memoryself.timeout = timeoutself.worker_init_fn = worker_init_fnself.multiprocessing_context = multiprocessing_contextif isinstance(dataset, IterableDataset):self._dataset_kind = _DatasetKind.Iterableif shuffle is not False:raise ValueError("DataLoader with IterableDataset: expected unspecified ""shuffle option, but got shuffle={}".format(shuffle))elif sampler is not None:# See NOTE [ Custom Samplers and IterableDataset ]raise ValueError("DataLoader with IterableDataset: expected unspecified ""sampler option, but got sampler={}".format(sampler))elif batch_sampler is not None:# See NOTE [ Custom Samplers and IterableDataset ]raise ValueError("DataLoader with IterableDataset: expected unspecified ""batch_sampler option, but got batch_sampler={}".format(batch_sampler))else:self._dataset_kind = _DatasetKind.Mapif sampler is not None and shuffle:raise ValueError('sampler option is mutually exclusive with ''shuffle')if batch_sampler is not None:# auto_collation with custom batch_samplerif batch_size != 1 or shuffle or sampler is not None or drop_last:raise ValueError('batch_sampler option is mutually exclusive ''with batch_size, shuffle, sampler, and ''drop_last')batch_size = Nonedrop_last = Falseelif batch_size is None:# no auto_collationif shuffle or drop_last:raise ValueError('batch_size=None option disables auto-batching ''and is mutually exclusive with ''shuffle, and drop_last')if sampler is None: # give default samplersif self._dataset_kind == _DatasetKind.Iterable:# See NOTE [ Custom Samplers and IterableDataset ]sampler = _InfiniteConstantSampler()else: # map-styleif shuffle:sampler = RandomSampler(dataset)else:sampler = SequentialSampler(dataset)if batch_size is not None and batch_sampler is None:# auto_collation without custom batch_samplerbatch_sampler = BatchSampler(sampler, batch_size, drop_last)self.batch_size = batch_sizeself.drop_last = drop_lastself.sampler = samplerself.batch_sampler = batch_samplerif collate_fn is None:if self._auto_collation:collate_fn = _utils.collate.default_collateelse:collate_fn = _utils.collate.default_convertself.collate_fn = collate_fnself.__initialized = Trueself._IterableDataset_len_called = None

其中传入参数的意义可以参考这篇博客：传送门

好了继续回到我们的问题

我们来理一下思路：我们找到了torch.utils.data.DataLoader()源代码，通过阅读源代码，我们期望找到该源码返回的train_iter为什么是235的长度

通过Debug进入dataloader.py：发现

可以看到，235这个数字首先出现在batch_sampler中，那么问题转到了命令

batch_sampler = BatchSampler(sampler, batch_size, drop_last)

即函数BatchSampler(sampler, batch_size, drop_last)

我们理一下这个函数传入的参数意思

sampler：读取的fashion_mnist训练数据集，长度为60000batch_size：小批量的大小，设定值为256drop_last：一个为False的bool型

好了，我们继续对BatchSampler（）进行Debug，进入sampler.py发现BatchSampler（）函数相对简单，其源码（同样的，删去了部分注释）如下：

class BatchSampler(Sampler)def __init__(self, sampler, batch_size, drop_last):if not isinstance(sampler, Sampler):raise ValueError("sampler should be an instance of ""torch.utils.data.Sampler, but got sampler={}".format(sampler))if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \batch_size <= 0:raise ValueError("batch_size should be a positive integer value, ""but got batch_size={}".format(batch_size))if not isinstance(drop_last, bool):raise ValueError("drop_last should be a boolean value, but got ""drop_last={}".format(drop_last))self.sampler = samplerself.batch_size = batch_sizeself.drop_last = drop_last# print("**")def __iter__(self):batch = []for idx in self.sampler:batch.append(idx)if len(batch) == self.batch_size:yield batchbatch = []if len(batch) > 0 and not self.drop_last:yield batchdef __len__(self):if self.drop_last:return len(self.sampler) // self.batch_sizeelse:return (len(self.sampler) + self.batch_size - 1) // self.batch_size

在Debug中会发现，BatchSampler（）源码真正执行的部分就只有这一块儿：

def __init__(self, sampler, batch_size, drop_last):if not isinstance(sampler, Sampler):raise ValueError("sampler should be an instance of ""torch.utils.data.Sampler, but got sampler={}".format(sampler))if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \batch_size <= 0:raise ValueError("batch_size should be a positive integer value, ""but got batch_size={}".format(batch_size))if not isinstance(drop_last, bool):raise ValueError("drop_last should be a boolean value, but got ""drop_last={}".format(drop_last))self.sampler = samplerself.batch_size = batch_sizeself.drop_last = drop_last# print("**")

在Debug的时候会解开print("**")的注释，来查看运行进程

运行完上面一段后最后一句print("**")就会跳出BatchSampler（）

但是上面打码中根本没有return回去任何东西，真是让人头大的情况

及时将下面的

def __iter__(self):

和

def __len__(self):

代码打上断点也直接跳出BatchSampler（）

为了解决上面的问题，查看

def __iter__(self):

和

def __len__(self):

将这段代码：到底有没有运行，在二者下第一节加上一个print函数，具体如下：

def __iter__(self):print("***")batch = []for idx in self.sampler:batch.append(idx)if len(batch) == self.batch_size:yield batchbatch = []if len(batch) > 0 and not self.drop_last:yield batchdef __len__(self):print("****")if self.drop_last:return len(self.sampler) // self.batch_sizeelse:return (len(self.sampler) + self.batch_size - 1) // self.batch_size

再运行代码，发现在Console打印出如下：

可以知道：BatchSampler（）源码中运行了：

def __len__(self):

下的代码，即：

def __len__(self):print("****")if self.drop_last:return len(self.sampler) // self.batch_sizeelse:return (len(self.sampler) + self.batch_size - 1) // self.batch_size

我们来细细看一下这个函数下面的流程：

当self.drop_last为true时，返回len(self.sampler) // self.batch_size

当self.drop_last为false时，返回(len(self.sampler) + self.batch_size - 1) // self.batch_size

在Debug时可以发现：

self.drop_last为false，所以返回的是：

self.sampler的长度+self.batch_size再减去1最后整除self.batch_size

以本案以为例：

（60000+256-1）//256 = 235

哇哦！我们终于得到了235的来源了

好了，到这里我们就探清了235的来源；

同理，我们的test_iter为40的来源便是：

（10000+256-1）//256 = 40

有必要解释一下，为什么test_iter相较于train_iter从60000变成了10000

因为fashion_mnist数据集中训练集的长度为60000，测试集长度为10000

三、问题总结

我们的初衷是为了知道循环次数，知道循环次数的目的在于知道循环的原理，为什么要这样循环，这样循环的好处在于哪儿？

我们回到本文最开始的for循环（将其上面的迭代此处循环也加上）中：

for epoch in range(num_epochs):train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()for X, y in train_iter:X = X.to(device)y = y.to(device)y_hat = net(X)l = loss(y_hat, y)optimizer.zero_grad()l.backward()optimizer.step()train_l_sum += l.cpu().item()train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()n += y.shape[0]batch_count += 1test_acc = evaluate_accuracy(test_iter, net)print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'% (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))

面对这中循环，在训练网络中随处可见，不禁思考：这个循环的意义何在？

首先进入循环着，epoch为循环迭代次数，第一次迭代epoch为0在这个迭代中首先将误差项（train_l_sum, train_acc_sum, n, batch_count）设为0然后进入对数据处理的循环中，从训练集train_iter中读取样本数据X和真实标签y，每次读入个数为256个，一共读取235次

随后进行向前传递y_hat = net(X)再计算损失函数l = loss(y_hat, y)再反向传播，采用优化算法进行对参数进行修正，以得到最优的网络参数并且每次迭代中会计算偏差值，以得到每次循环之后的正确率

本文源码

# 本书链接https://tangshusen.me/Dive-into-DL-PyTorch/#/chapter03_DL-basics/3.8_mlp# 5.5 卷积神经网络（LeNet）#注释：黄文俊#邮箱：hurri_cane@import timeimport torchfrom torch import nn, optimimport syssys.path.append("..")import d2lzh_pytorch as d2ldevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')class LeNet(nn.Module):def __init__(self):super(LeNet, self).__init__()# 卷积层块self.conv = nn.Sequential(nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_sizenn.Sigmoid(),nn.MaxPool2d(2, 2), # kernel_size, stridenn.Conv2d(6, 16, 5),nn.Sigmoid(),nn.MaxPool2d(2, 2))# 全连接层块self.fc = nn.Sequential(nn.Linear(16*4*4, 120),nn.Sigmoid(),nn.Linear(120, 84),nn.Sigmoid(),nn.Linear(84, 10))def forward(self, img):feature = self.conv(img)output = self.fc(feature.view(img.shape[0], -1))return outputnet = LeNet()print(net)# 本函数已保存在d2lzh_pytorch包中方便以后使用。该函数将被逐步改进。def evaluate_accuracy(data_iter, net, device=None):if device is None and isinstance(net, torch.nn.Module):# 如果没指定device就使用net的devicedevice = list(net.parameters())[0].deviceacc_sum, n = 0.0, 0with torch.no_grad():for X, y in data_iter:if isinstance(net, torch.nn.Module):net.eval() # 评估模式, 这会关闭dropoutacc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()net.train() # 改回训练模式else: # 自定义的模型, 3.13节之后不会用到, 不考虑GPUif('is_training' in net.__code__.co_varnames): # 如果有is_training这个参数# 将is_training设置成Falseacc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()else:acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()n += y.shape[0]return acc_sum / n# 本函数已保存在d2lzh_pytorch包中方便以后使用def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):net = net.to(device)print("training on ", device)loss = torch.nn.CrossEntropyLoss()for epoch in range(num_epochs):train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()for X, y in train_iter:X = X.to(device)y = y.to(device)y_hat = net(X)l = loss(y_hat, y)optimizer.zero_grad()l.backward()optimizer.step()train_l_sum += l.cpu().item()train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()n += y.shape[0]batch_count += 1test_acc = evaluate_accuracy(test_iter, net)print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'% (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))# 下面我们来实验LeNet模型。实验中，我们仍然使用Fashion-MNIST作为训练数据集。batch_size = 256train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)lr, num_epochs = 0.001, 5optimizer = torch.optim.Adam(net.parameters(), lr=lr)train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)print("*"*50)

深度学习中训练迭代次数理解【源码阅读技巧分享】【深度学习循环迭代理解】【for X y in train_iter:】

如果觉得《深度学习中训练迭代次数理解【源码阅读技巧分享】【深度学习循环迭代理解】【for X 》对你有帮助，请点赞、收藏，并留下你的观点哦！

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。