这里写目录标题
起因配置环境问题探索一、由近及远二、追根溯源三、问题总结本文源码起因
近日,博主在学习《动手学深度学习》(PyTorch版)时,用fashion_mnist复现LeNet时想知道这个for循环运行了多少次:
代码如下:(在文末会给出整个代码)
for X, y in train_iter:X = X.to(device)y = y.to(device)y_hat = net(X)l = loss(y_hat, y)optimizer.zero_grad()l.backward()optimizer.step()train_l_sum += l.cpu().item()train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()n += y.shape[0]batch_count += 1test_acc = evaluate_accuracy(test_iter, net)
这段代码的意思就是从train_iter读取样本X和标签y,但是奇怪的是这个for循环执行次数为235次,但是设置的batch_size 为256。这引起了我的兴趣,这个235的数从何而来,本代码自始至终从未定义过235这个数
通过Debug注意到:
更吊轨的是train_iter长度为40,40 在此代码中也从未出定义过。
如果你和我有一样的困惑,并感兴趣的话请往下继续观看,这或许会对你理解源代码有所帮助
配置环境
使用环境:python3.8 平台:Windows10 IDE:PyCharm
问题探索
一、由近及远
这个疑问发生的地方在函数def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
中
这个函数传入的数据被用于for循环:for X, y in train_iter:
在于 train_iter变量
train_iter是通过函数train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
传入的
即train_ch5函数外部的train_iter将其数据传给了train_ch5函数内部的train_ite
train_ch5函数外部的train_iter的由来是通过赋值:train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
得到的
最终定位到问题关键:d2l.load_data_fashion_mnist(batch_size=batch_size)
函数
二、追根溯源
d2l.load_data_fashion_mnist()
的源码如下:
def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):"""Download the fashion mnist dataset and then load into memory."""trans = []if resize:trans.append(torchvision.transforms.Resize(size=resize))trans.append(torchvision.transforms.ToTensor())transform = pose(trans)mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)if sys.platform.startswith('win'):num_workers = 0 # 0表示不用额外的进程来加速读取数据else:num_workers = 4train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)return train_iter, test_iter
可以看到,里面涉及到train_iter数据的内容关键为这三句代码:
transform = pose(trans)mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
而其中第一句
transform = pose(trans)
意义再远转换图片格式为张量
第二句为第三句通过支撑,所以关键在于第三句
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
第三句的意思为按照要求(传入参数)加载数据,其实就是老生常谈的DataLoader
问题了,DataLoader
源码如下(删去了源码中大部分的注释):
class DataLoader(object):__initialized = Falsedef __init__(self, dataset, batch_size=1, shuffle=False, sampler=None,batch_sampler=None, num_workers=0, collate_fn=None,pin_memory=False, drop_last=False, timeout=0,worker_init_fn=None, multiprocessing_context=None):torch._C._log_api_usage_once("python.data_loader")if num_workers < 0:raise ValueError('num_workers option should be non-negative; ''use num_workers=0 to disable multiprocessing.')if timeout < 0:raise ValueError('timeout option should be non-negative')self.dataset = datasetself.num_workers = num_workersself.pin_memory = pin_memoryself.timeout = timeoutself.worker_init_fn = worker_init_fnself.multiprocessing_context = multiprocessing_contextif isinstance(dataset, IterableDataset):self._dataset_kind = _DatasetKind.Iterableif shuffle is not False:raise ValueError("DataLoader with IterableDataset: expected unspecified ""shuffle option, but got shuffle={}".format(shuffle))elif sampler is not None:# See NOTE [ Custom Samplers and IterableDataset ]raise ValueError("DataLoader with IterableDataset: expected unspecified ""sampler option, but got sampler={}".format(sampler))elif batch_sampler is not None:# See NOTE [ Custom Samplers and IterableDataset ]raise ValueError("DataLoader with IterableDataset: expected unspecified ""batch_sampler option, but got batch_sampler={}".format(batch_sampler))else:self._dataset_kind = _DatasetKind.Mapif sampler is not None and shuffle:raise ValueError('sampler option is mutually exclusive with ''shuffle')if batch_sampler is not None:# auto_collation with custom batch_samplerif batch_size != 1 or shuffle or sampler is not None or drop_last:raise ValueError('batch_sampler option is mutually exclusive ''with batch_size, shuffle, sampler, and ''drop_last')batch_size = Nonedrop_last = Falseelif batch_size is None:# no auto_collationif shuffle or drop_last:raise ValueError('batch_size=None option disables auto-batching ''and is mutually exclusive with ''shuffle, and drop_last')if sampler is None: # give default samplersif self._dataset_kind == _DatasetKind.Iterable:# See NOTE [ Custom Samplers and IterableDataset ]sampler = _InfiniteConstantSampler()else: # map-styleif shuffle:sampler = RandomSampler(dataset)else:sampler = SequentialSampler(dataset)if batch_size is not None and batch_sampler is None:# auto_collation without custom batch_samplerbatch_sampler = BatchSampler(sampler, batch_size, drop_last)self.batch_size = batch_sizeself.drop_last = drop_lastself.sampler = samplerself.batch_sampler = batch_samplerif collate_fn is None:if self._auto_collation:collate_fn = _utils.collate.default_collateelse:collate_fn = _utils.collate.default_convertself.collate_fn = collate_fnself.__initialized = Trueself._IterableDataset_len_called = None
其中传入参数的意义可以参考这篇博客:传送门
好了继续回到我们的问题
我们来理一下思路:我们找到了torch.utils.data.DataLoader()
源代码,通过阅读源代码,我们期望找到该源码返回的train_iter
为什么是235的长度
通过Debug进入dataloader.py:发现
可以看到,235这个数字首先出现在batch_sampler
中,那么问题转到了命令
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
即函数BatchSampler(sampler, batch_size, drop_last)
我们理一下这个函数传入的参数意思
sampler:读取的fashion_mnist训练数据集,长度为60000batch_size:小批量的大小,设定值为256drop_last:一个为False的bool型
好了,我们继续对BatchSampler()
进行Debug,进入sampler.py发现BatchSampler()
函数相对简单,其源码(同样的,删去了部分注释)如下:
class BatchSampler(Sampler)def __init__(self, sampler, batch_size, drop_last):if not isinstance(sampler, Sampler):raise ValueError("sampler should be an instance of ""torch.utils.data.Sampler, but got sampler={}".format(sampler))if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \batch_size <= 0:raise ValueError("batch_size should be a positive integer value, ""but got batch_size={}".format(batch_size))if not isinstance(drop_last, bool):raise ValueError("drop_last should be a boolean value, but got ""drop_last={}".format(drop_last))self.sampler = samplerself.batch_size = batch_sizeself.drop_last = drop_last# print("**")def __iter__(self):batch = []for idx in self.sampler:batch.append(idx)if len(batch) == self.batch_size:yield batchbatch = []if len(batch) > 0 and not self.drop_last:yield batchdef __len__(self):if self.drop_last:return len(self.sampler) // self.batch_sizeelse:return (len(self.sampler) + self.batch_size - 1) // self.batch_size
在Debug中会发现,BatchSampler()
源码真正执行的部分就只有这一块儿:
def __init__(self, sampler, batch_size, drop_last):if not isinstance(sampler, Sampler):raise ValueError("sampler should be an instance of ""torch.utils.data.Sampler, but got sampler={}".format(sampler))if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \batch_size <= 0:raise ValueError("batch_size should be a positive integer value, ""but got batch_size={}".format(batch_size))if not isinstance(drop_last, bool):raise ValueError("drop_last should be a boolean value, but got ""drop_last={}".format(drop_last))self.sampler = samplerself.batch_size = batch_sizeself.drop_last = drop_last# print("**")
在Debug的时候会解开print("**")
的注释,来查看运行进程
运行完上面一段后最后一句print("**")
就会跳出BatchSampler()
但是上面打码中根本没有return回去任何东西,真是让人头大的情况
及时将下面的
def __iter__(self):
和
def __len__(self):
代码打上断点也直接跳出BatchSampler()
为了解决上面的 问题,查看
def __iter__(self):
和
def __len__(self):
将这段代码:到底有没有运行,在二者下第一节加上一个print函数,具体如下:
def __iter__(self):print("***")batch = []for idx in self.sampler:batch.append(idx)if len(batch) == self.batch_size:yield batchbatch = []if len(batch) > 0 and not self.drop_last:yield batchdef __len__(self):print("****")if self.drop_last:return len(self.sampler) // self.batch_sizeelse:return (len(self.sampler) + self.batch_size - 1) // self.batch_size
再运行代码,发现在Console打印出如下:
可以知道:BatchSampler()
源码中运行了:
def __len__(self):
下的代码,即:
def __len__(self):print("****")if self.drop_last:return len(self.sampler) // self.batch_sizeelse:return (len(self.sampler) + self.batch_size - 1) // self.batch_size
我们来细细看一下这个函数下面的流程:
当self.drop_last
为true时,返回len(self.sampler) // self.batch_size
当self.drop_last
为false时,返回(len(self.sampler) + self.batch_size - 1) // self.batch_size
在Debug时可以发现:
self.drop_last
为false,所以返回的是:
self.sampler
的长度+self.batch_size
再减去1最后整除self.batch_size
以本案以为例:
(60000+256-1)//256 = 235
哇哦!我们终于得到了235的来源了
好了,到这里我们就探清了235的来源;
同理,我们的test_iter
为40的来源便是:
(10000+256-1)//256 = 40
有必要解释一下,为什么test_iter
相较于train_iter
从60000变成了10000
因为fashion_mnist数据集中训练集的长度为60000,测试集长度为10000
三、问题总结
我们的初衷是为了知道循环次数,知道循环次数的目的在于知道循环的原理,为什么要这样循环,这样循环的好处在于哪儿?
我们回到本文最开始的for循环(将其上面的迭代此处循环也加上)中:
for epoch in range(num_epochs):train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()for X, y in train_iter:X = X.to(device)y = y.to(device)y_hat = net(X)l = loss(y_hat, y)optimizer.zero_grad()l.backward()optimizer.step()train_l_sum += l.cpu().item()train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()n += y.shape[0]batch_count += 1test_acc = evaluate_accuracy(test_iter, net)print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'% (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))
面对这中循环,在训练网络中随处可见,不禁思考:这个循环的意义何在?
首先进入循环着,epoch
为循环迭代次数,第一次迭代epoch
为0在这个迭代中首先将误差项(train_l_sum, train_acc_sum, n, batch_count
)设为0然后进入对数据处理的循环中,从训练集train_iter
中读取样本数据X和真实标签y,每次读入个数为256个,一共读取235次
随后进行向前传递y_hat = net(X)
再计算损失函数l = loss(y_hat, y)
再反向传播,采用优化算法进行对参数进行修正,以得到最优的网络参数并且每次迭代中会计算偏差值,以得到每次循环之后的正确率
本文源码
# 本书链接https://tangshusen.me/Dive-into-DL-PyTorch/#/chapter03_DL-basics/3.8_mlp# 5.5 卷积神经网络(LeNet)#注释:黄文俊#邮箱:hurri_cane@import timeimport torchfrom torch import nn, optimimport syssys.path.append("..")import d2lzh_pytorch as d2ldevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')class LeNet(nn.Module):def __init__(self):super(LeNet, self).__init__()# 卷积层块self.conv = nn.Sequential(nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_sizenn.Sigmoid(),nn.MaxPool2d(2, 2), # kernel_size, stridenn.Conv2d(6, 16, 5),nn.Sigmoid(),nn.MaxPool2d(2, 2))# 全连接层块self.fc = nn.Sequential(nn.Linear(16*4*4, 120),nn.Sigmoid(),nn.Linear(120, 84),nn.Sigmoid(),nn.Linear(84, 10))def forward(self, img):feature = self.conv(img)output = self.fc(feature.view(img.shape[0], -1))return outputnet = LeNet()print(net)# 本函数已保存在d2lzh_pytorch包中方便以后使用。该函数将被逐步改进。def evaluate_accuracy(data_iter, net, device=None):if device is None and isinstance(net, torch.nn.Module):# 如果没指定device就使用net的devicedevice = list(net.parameters())[0].deviceacc_sum, n = 0.0, 0with torch.no_grad():for X, y in data_iter:if isinstance(net, torch.nn.Module):net.eval() # 评估模式, 这会关闭dropoutacc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()net.train() # 改回训练模式else: # 自定义的模型, 3.13节之后不会用到, 不考虑GPUif('is_training' in net.__code__.co_varnames): # 如果有is_training这个参数# 将is_training设置成Falseacc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()else:acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()n += y.shape[0]return acc_sum / n# 本函数已保存在d2lzh_pytorch包中方便以后使用def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):net = net.to(device)print("training on ", device)loss = torch.nn.CrossEntropyLoss()for epoch in range(num_epochs):train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()for X, y in train_iter:X = X.to(device)y = y.to(device)y_hat = net(X)l = loss(y_hat, y)optimizer.zero_grad()l.backward()optimizer.step()train_l_sum += l.cpu().item()train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()n += y.shape[0]batch_count += 1test_acc = evaluate_accuracy(test_iter, net)print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'% (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))# 下面我们来实验LeNet模型。实验中,我们仍然使用Fashion-MNIST作为训练数据集。batch_size = 256train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)lr, num_epochs = 0.001, 5optimizer = torch.optim.Adam(net.parameters(), lr=lr)train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)print("*"*50)
深度学习中训练迭代次数理解【源码阅读技巧分享】【深度学习循环迭代理解】【for X y in train_iter:】
如果觉得《深度学习中训练迭代次数理解【源码阅读技巧分享】【深度学习循环迭代理解】【for X 》对你有帮助,请点赞、收藏,并留下你的观点哦!