Gradient Descent and Stochastic Gradient Descent with Gluon

In Gluon, using mini-batch stochastic gradient descent is straightforward; we do not need to re-implement the algorithm ourselves. In particular, when the batch size equals the size of the training set, the algorithm is gradient descent; when the batch size is 1, it is stochastic gradient descent.
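
For reference, here is a minimal sketch of the parameter update that Trainer with the 'sgd' optimizer is assumed to perform when trainer.step(batch_size) is called; since loss.backward() produces gradients summed over the mini-batch, dividing by the batch size gives the average gradient. The function name sgd_update is ours, not part of Gluon.

def sgd_update(params, lr, batch_size):
    # Sketch (assumption): the equivalent manual update. Average the summed
    # mini-batch gradient for each parameter and take a step of size lr in the
    # opposite direction.
    for param in params:
        param[:] = param - lr * param.grad / batch_size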

In [1]:
import mxnet as mx
from mxnet import autograd
from mxnet import gluon
from mxnet import ndarray as nd
import numpy as np
import random

mx.random.seed(1)
random.seed(1)

# Generate the dataset.
num_inputs = 2
num_examples = 1000
true_w = [2, -3.4]
true_b = 4.2
X = nd.random_normal(scale=1, shape=(num_examples, num_inputs))
y = true_w[0] * X[:, 0] + true_w[1] * X[:, 1] + true_b
y += .01 * nd.random_normal(scale=1, shape=y.shape)
dataset = gluon.data.ArrayDataset(X, y)

net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(1))
square_loss = gluon.loss.L2Loss()

To make the learning rate decay on its own after two epochs, we need access to the learning_rate attribute and the set_learning_rate function of gluon.Trainer.
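
As a minimal sketch, reusing the net defined above, reading the learning rate and shrinking it by a factor of 10 looks like this:

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.2})
print(trainer.learning_rate)                             # 0.2
trainer.set_learning_rate(trainer.learning_rate * 0.1)   # decay by a factor of 10
print(trainer.learning_rate)                             # roughly 0.02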

In [2]:
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt

def train(batch_size, lr, epochs, period):
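    # period controls how often the training loss is recorded: once after every
    # period training examples (hence the assertion below that period is a
    # multiple of the batch size).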
    assert period >= batch_size and period % batch_size == 0
    net.collect_params().initialize(mx.init.Normal(sigma=1), force_reinit=True)
    # Stochastic gradient descent.
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': lr})
    data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)
    total_loss = [np.mean(square_loss(net(X), y).asnumpy())]
    for epoch in range(1, epochs + 1):
        # Learning rate self-decay.
        if epoch > 2:
            trainer.set_learning_rate(trainer.learning_rate * 0.1)
        for batch_i, (data, label) in enumerate(data_iter):
            with autograd.record():
                output = net(data)
                loss = square_loss(output, label)
            loss.backward()
            trainer.step(batch_size)
            if batch_i * batch_size % period == 0:
                total_loss.append(np.mean(square_loss(net(X), y).asnumpy()))
        print("Batch size %d, Learning rate %f, Epoch %d, loss %.4e" %
              (batch_size, trainer.learning_rate, epoch, total_loss[-1]))

    print('w:', np.reshape(net[0].weight.data().asnumpy(), (1, -1)),
          'b:', net[0].bias.data().asnumpy()[0], '\n')
    x_axis = np.linspace(0, epochs, len(total_loss), endpoint=True)
    plt.semilogy(x_axis, total_loss)
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.show()

When the batch size is 1, training uses stochastic gradient descent. At the current learning rate, the objective function value drops quickly early on and then fluctuates slightly. After epoch 2, once the learning rate has decayed, the objective value decreases and then levels off. The learned parameter values end up close to the true values.

In [3]:
train(batch_size=1, lr=0.2, epochs=3, period=10)
Batch size 1, Learning rate 0.200000, Epoch 1, loss 6.5111e-05
Batch size 1, Learning rate 0.200000, Epoch 2, loss 8.4869e-05
Batch size 1, Learning rate 0.020000, Epoch 3, loss 4.8941e-05
w: [[ 2.00122356 -3.40042353]] b: 4.20143

../_images/chapter_optimization_gd-sgd-gluon_5_1.png

When the batch size is 1000, and the training set contains 1000 examples, training uses gradient descent. At the current learning rate, the objective function value drops quickly in the first two epochs. After epoch 2, once the learning rate has decayed, the objective value decreases more slowly. The learned parameter values end up close to the true values.

In [4]:
train(batch_size=1000, lr=0.999, epochs=3, period=1000)
Batch size 1000, Learning rate 0.999000, Epoch 1, loss 2.8226e-03
Batch size 1000, Learning rate 0.999000, Epoch 2, loss 7.6969e-05
Batch size 1000, Learning rate 0.099900, Epoch 3, loss 7.2206e-05
w: [[ 1.99328578 -3.40189672]] b: 4.19996

../_images/chapter_optimization_gd-sgd-gluon_7_1.png

When the batch size is 10, and the training set contains 1000 examples, training uses (mini-batch) stochastic gradient descent. The learned parameter values end up close to the true values.

In [5]:
train(batch_size=10, lr=0.2, epochs=3, period=10)
Batch size 10, Learning rate 0.200000, Epoch 1, loss 4.9146e-05
Batch size 10, Learning rate 0.200000, Epoch 2, loss 4.8791e-05
Batch size 10, Learning rate 0.020000, Epoch 3, loss 4.8380e-05
w: [[ 1.99995196 -3.39923882]] b: 4.20079

../_images/chapter_optimization_gd-sgd-gluon_9_1.png

With the same batch size of 10, we now increase the learning rate. This time we observe the objective function value growing steadily: a typical case of overshooting.

In [6]:
train(batch_size=10, lr=5, epochs=3, period=10)
Batch size 10, Learning rate 5.000000, Epoch 1, loss nan
Batch size 10, Learning rate 5.000000, Epoch 2, loss nan
Batch size 10, Learning rate 0.500000, Epoch 3, loss nan
w: [[ nan  nan]] b: nan

/var/lib/jenkins/miniconda3/envs/gluon_zh_docs/lib/python3.6/site-packages/matplotlib/scale.py:111: RuntimeWarning: invalid value encountered in less_equal
  out[a <= 0] = -1000
../_images/chapter_optimization_gd-sgd-gluon_11_2.png

With the same batch size of 10, we now decrease the learning rate instead. This time we observe the objective function value decreasing slowly, and even after 3 epochs we do not obtain a solution close to the true values.

In [7]:
train(batch_size=10, lr=0.002, epochs=3, period=10)
Batch size 10, Learning rate 0.002000, Epoch 1, loss 1.0582e+01
Batch size 10, Learning rate 0.002000, Epoch 2, loss 7.1162e+00
Batch size 10, Learning rate 0.000200, Epoch 3, loss 6.8397e+00
w: [[ 0.27345622 -1.61730075]] b: 1.42968

../_images/chapter_optimization_gd-sgd-gluon_13_1.png

Conclusion

  • Using Gluon's Trainer makes model training easier.
  • The learning_rate attribute and set_learning_rate function of gluon.Trainer let us adjust the learning rate as we like (see the sketch after this list).
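
As an aside, a hypothetical alternative to calling set_learning_rate by hand is to attach a learning rate scheduler to the optimizer. The sketch below uses mx.lr_scheduler.FactorScheduler; the step and factor values are illustrative assumptions rather than values used in this section.

# Sketch (assumption): FactorScheduler multiplies the learning rate by factor
# after every step parameter updates, so no manual decay call is needed.
scheduler = mx.lr_scheduler.FactorScheduler(step=200, factor=0.1)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.2, 'lr_scheduler': scheduler})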

Exercises

  • What other ways of making the learning rate decay on its own can you think of?
