Multi-GPU Training with Gluon

Data parallelism is easy to use in Gluon. In Multi-GPU Training from Scratch we implemented several data synchronization functions by hand to realize data parallelism; Gluon provides the same functionality out of the box.

Initialization on multiple devices

Earlier we saw how to use the ctx argument of initialize() to initialize a model on the CPU or on a specific GPU. In fact, ctx also accepts a list of devices, in which case the initialized parameters are copied to every device in the list.

Here we use the ResNet-18 introduced earlier as the running example.

In [1]:
import sys
sys.path.append('..')
import utils
from mxnet import gpu
from mxnet import cpu

net = utils.resnet18(10)
# initialize one copy of the parameters on each of the two GPUs
ctx = [gpu(0), gpu(1)]
net.initialize(ctx=ctx)

Recall deferred initialization from before: at this point the parameters have not actually been initialized yet. We first need to run a forward pass with some data.
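As a quick sanity check, asking for a parameter's value before any data has gone through the network raises a deferred-initialization error. A minimal sketch (the exact exception class is an implementation detail of mxnet.gluon, so we only print its name):

try:
    # no data has been fed through net yet, so the parameter
    # values do not exist on any device
    net[1].params.get('weight').data()
except Exception as e:
    print('Not initialized yet:', type(e).__name__)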

Gluon provides the split_and_load function that we implemented before: it splits the data along the batch dimension and returns the pieces copied to the given devices. Computation then runs on the device that holds the corresponding piece of data.

In [2]:
from mxnet import nd
from mxnet import gluon

# a random batch of 4 examples, split into one piece per GPU
x = nd.random.uniform(shape=(4, 1, 28, 28))
x_list = gluon.utils.split_and_load(x, ctx)
print(net(x_list[0]))
print(net(x_list[1]))

[[-0.00299451 -0.114948   -0.04571831 -0.08353794  0.09219883 -0.10255374
   0.08285993  0.08471885 -0.03377745  0.0142048 ]
 [-0.01095816 -0.12053964 -0.05160385 -0.08963331  0.08892047 -0.10402268
   0.07713397  0.08005997 -0.02352627  0.02912929]]
<NDArray 2x10 @gpu(0)>

[[-0.00705471 -0.11297002 -0.04886303 -0.08850279  0.0929175  -0.10409503
   0.07982443  0.08671783 -0.01739573  0.02761966]
 [-0.01419117 -0.10728938 -0.0441721  -0.08339462  0.09654251 -0.09772847
   0.08203971  0.09051772 -0.02636191  0.02598564]]
<NDArray 2x10 @gpu(1)>

Now we can see what happened during initialization. Recall that we can read a parameter's value through data(), which by default returns the value on the CPU. Since we only initialized on the two GPUs, we have to specify the device whose copy we want to read.

In [3]:
weight = net[1].params.get('weight')
print(weight.data(ctx[0])[0])
print(weight.data(ctx[1])[0])
try:
    # the parameters were only initialized on the two GPUs, not on the CPU
    weight.data(cpu())
except Exception:
    print('Not initialized on', cpu())

[[[-0.00613896 -0.03968295  0.00958075]
  [-0.05106945 -0.06736943 -0.02462026]
  [ 0.01646897 -0.04904552  0.0156934 ]]]
<NDArray 1x3x3 @gpu(0)>

[[[-0.00613896 -0.03968295  0.00958075]
  [-0.05106945 -0.06736943 -0.02462026]
  [ 0.01646897 -0.04904552  0.0156934 ]]]
<NDArray 1x3x3 @gpu(1)>
Not initialized on cpu(0)

In the previous chapter we showed how to copy gradients across GPUs, sum them, and broadcast the result back. gluon.Trainer performs this aggregation for us by default, so we can now write the complete training function.
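As a reminder, here is a minimal sketch of the kind of allreduce that the from-scratch chapter implemented by hand (Trainer does the equivalent internally; the helper below is only for illustration):

def allreduce(data):
    # sum every copy onto the first device ...
    for i in range(1, len(data)):
        data[0][:] += data[i].copyto(data[0].context)
    # ... then broadcast the summed result back to all devices
    for i in range(1, len(data)):
        data[0].copyto(data[i])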

Training

In [4]:
from mxnet import gluon
from mxnet import autograd
from mxnet import init
from mxnet import nd
from time import time

def train(num_gpus, batch_size, lr):
    train_data, test_data = utils.load_data_fashion_mnist(batch_size)

    ctx = [gpu(i) for i in range(num_gpus)]
    print('Running on', ctx)

    net = utils.resnet18(10)
    # initialize a fresh copy of the parameters on every GPU
    net.initialize(init=init.Xavier(), ctx=ctx)
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(
        net.collect_params(), 'sgd', {'learning_rate': lr})

    for epoch in range(5):
        start = time()
        total_loss = 0
        for data, label in train_data:
            # split each batch across the GPUs
            data_list = gluon.utils.split_and_load(data, ctx)
            label_list = gluon.utils.split_and_load(label, ctx)
            with autograd.record():
                losses = [loss(net(X), y) for X, y in zip(
                    data_list, label_list)]
            for l in losses:
                l.backward()
            total_loss += sum([l.sum().asscalar() for l in losses])
            # Trainer sums the gradients across devices before updating;
            # step() normalizes by the full batch size
            trainer.step(batch_size)

        nd.waitall()
        print('Epoch %d, training time = %.1f sec'%(
            epoch, time()-start))

        test_acc = utils.evaluate_accuracy(test_data, net, ctx[0])
        print('         validation accuracy = %.4f'%(test_acc))

Let's first try it on a single GPU.

In [5]:
train(1, 256, .1)

Running on [gpu(0)]
Epoch 0, training time = 15.1 sec
         validation accuracy = 0.8619
Epoch 1, training time = 14.4 sec
         validation accuracy = 0.8952
Epoch 2, training time = 14.4 sec
         validation accuracy = 0.8976
Epoch 3, training time = 14.4 sec
         validation accuracy = 0.8924
Epoch 4, training time = 14.4 sec
         validation accuracy = 0.8993

Same hyperparameters, but now using two GPUs.

In [6]:
train(2, 256, .1)

Running on [gpu(0), gpu(1)]
Epoch 0, training time = 11.0 sec
         validation accuracy = 0.8820
Epoch 1, training time = 10.4 sec
         validation accuracy = 0.9005
Epoch 2, training time = 10.5 sec
         validation accuracy = 0.9077
Epoch 3, training time = 10.5 sec
         validation accuracy = 0.9054
Epoch 4, training time = 10.5 sec
         validation accuracy = 0.8786

Now increase both the batch size and the learning rate. With a larger batch each update averages over more examples, so a proportionally larger learning rate is usually reasonable.

In [7]:
train(2, 512, .2)

Running on [gpu(0), gpu(1)]
Epoch 0, training time = 8.5 sec
         validation accuracy = 0.8321
Epoch 1, training time = 8.4 sec
         validation accuracy = 0.8551
Epoch 2, training time = 8.4 sec
         validation accuracy = 0.8986
Epoch 3, training time = 8.4 sec
         validation accuracy = 0.8893
Epoch 4, training time = 8.4 sec
         validation accuracy = 0.9104

Conclusion

Both parameter initialization and Trainer in Gluon support multiple devices, so moving from a single device to several is straightforward.

Exercises

  • Unlike Multi-GPU Training from Scratch, here we used a more modern ResNet. Try different batch sizes and learning rates and see how the effect changes with the number of GPUs.
  • Sometimes the devices have different compute power, for example a CPU used together with a GPU, or GPUs of different models. How should the work be divided in that case?

Feedback and discussion are welcome here.