YOLO — Using Gluon

Following the SSD tutorial from the previous chapter, we continue with another classic object detection algorithm: YOLO2. YOLO2 is the successor to YOLO and, like it, uses a purely convolutional network. We will not repeat the dataset chores here and instead jump straight into the implementation; if this is unfamiliar, we recommend warming up with the SSD example first.

In [1]:
import numpy as np
import mxnet as mx
from mxnet import gluon
from mxnet import nd
from mxnet.gluon import nn
from mxnet.gluon import Block, HybridBlock
from mxnet.gluon.model_zoo import vision

YOLO v2

Transforming the raw convolution output

We know the raw convolution output is a (B, N, H, W) tensor, where B is the batch size, H and W are the spatial dimensions of the feature map, and N is the number of output channels. N is relatively complex and is the dimension we care about. Simply put, at each spatial location (x, y) we need each anchor box to output (num_class + 1 + 4) predictions. num_class is easy to understand: it is the number of positive classes in the training set. The 1 is the objectness score, which says whether the box contains an object at all, and the 4 are the offset predictions (x, y, w, h) for each box.
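
To make the layout concrete, here is a small sketch (with made-up numbers, not part of the original notebook): with 2 classes and 2 anchor boxes, the output convolution needs 2 × (2 + 1 + 4) = 14 channels, and moving the channel axis to the end recovers the per-anchor predictions.

# A sketch of the channel layout; 2 classes and 2 anchors are hypothetical choices.
n_cls, n_anchors = 2, 2
n_channels = n_anchors * (n_cls + 1 + 4)      # 14 channels in total
demo_x = nd.zeros((1, n_channels, 8, 8))      # raw conv output: (B, N, H, W)
demo_x = demo_x.transpose((0, 2, 3, 1)).reshape((0, 0, 0, n_anchors, n_cls + 1 + 4))
print(demo_x.shape)                           # (1, 8, 8, 2, 7)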

Transforming the center points

The center predictions come out of a sigmoid, so they lie between 0 and 1 and encode a position inside each grid cell; we need to convert them into relative positions on the whole image. We use the arange function to generate monotonically increasing grid offsets; adding the predictions to these offsets and dividing by the grid size gives the relative coordinates of each object center on the image.

In [2]:
def transform_center(xy):
    """Given x, y prediction after sigmoid(), convert to relative coordinates (0, 1) on image."""
    b, h, w, n, s = xy.shape
    offset_y = nd.tile(nd.arange(0, h, repeat=(w * n * 1), ctx=xy.context).reshape((1, h, w, n, 1)), (b, 1, 1, 1, 1))
    # print(offset_y[0].asnumpy()[:, :, 0, 0])
    offset_x = nd.tile(nd.arange(0, w, repeat=(n * 1), ctx=xy.context).reshape((1, 1, w, n, 1)), (b, h, 1, 1, 1))
    # print(offset_x[0].asnumpy()[:, :, 0, 0])
    x, y = xy.split(num_outputs=2, axis=-1)
    x = (x + offset_x) / w
    y = (y + offset_y) / h
    return x, y
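
A quick sanity check (our own sketch, not from the original): on a 2×2 grid with a single anchor, a sigmoid output of 0.5 everywhere should put every center in the middle of its cell, i.e. at 0.25 and 0.75 of the image.

# Sketch: all-0.5 predictions on a 2x2 grid with one anchor (hypothetical input).
demo_xy = nd.full((1, 2, 2, 1, 2), 0.5)   # (b, h, w, n, 2)
cx, cy = transform_center(demo_xy)
print(cx.reshape((2, 2)).asnumpy())       # columns at 0.25 and 0.75
print(cy.reshape((2, 2)).asnumpy())       # rows at 0.25 and 0.75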

Transforming the width and height

The width and height come out of an exp function and represent ratios relative to the anchor size. We multiply exp() of the predictions by the anchor width/height and divide by the number of grid cells, which gives the object size relative to the whole image.

In [3]:
def transform_size(wh, anchors):
    """Given w, h prediction after exp() and anchor sizes, convert to relative width/height (0, 1) on image"""
    b, h, w, n, s = wh.shape
    aw, ah = nd.tile(nd.array(anchors, ctx=wh.context).reshape((1, 1, 1, -1, 2)), (b, h, w, 1, 1)).split(num_outputs=2, axis=-1)
    w_pred, h_pred = nd.exp(wh).split(num_outputs=2, axis=-1)
    w_out = w_pred * aw / w
    h_out = h_pred * ah / h
    return w_out, h_out
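
Another quick check (again a sketch with made-up values): a prediction of 0 gives exp(0) = 1, so the box takes exactly the anchor size; a (1, 1) anchor on a 2×2 grid therefore covers half of the image in each direction.

# Sketch: zero predictions mean exp(0) = 1, so the box equals the anchor size.
demo_wh = nd.zeros((1, 2, 2, 1, 2))                    # (b, h, w, n, 2)
w_out, h_out = transform_size(demo_wh, [[1.0, 1.0]])   # one hypothetical (1, 1) anchor
print(w_out[0, 0, 0, 0].asscalar())                    # 1 * 1 / 2 = 0.5 of the image width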

yolo2_forward is a convenience function that splits the convolution channels apart, applies the transformations above, and finally produces the detection boxes we need.

In [4]:
def yolo2_forward(x, num_class, anchor_scales):
    """Transpose/reshape/organize convolution outputs."""
    stride = num_class + 5
    # transpose and reshape, 4th dim is the number of anchors
    x = x.transpose((0, 2, 3, 1))
    x = x.reshape((0, 0, 0, -1, stride))
    # now x is (batch, h, w, num_anchors, stride), stride = num_class + 1(object score) + 4(coordinates)
    # class probs
    cls_pred = x.slice_axis(begin=0, end=num_class, axis=-1)
    # object score
    score_pred = x.slice_axis(begin=num_class, end=num_class + 1, axis=-1)
    score = nd.sigmoid(score_pred)
    # center prediction, in range(0, 1) for each grid
    xy_pred = x.slice_axis(begin=num_class + 1, end=num_class + 3, axis=-1)
    xy = nd.sigmoid(xy_pred)
    # width/height prediction
    wh = x.slice_axis(begin=num_class + 3, end=num_class + 5, axis=-1)
    # convert x, y to positions relative to image
    x, y = transform_center(xy)
    # convert w, h to width/height relative to image
    w, h = transform_size(wh, anchor_scales)
    # cid is the argmax channel
    cid = nd.argmax(cls_pred, axis=-1, keepdims=True)
    # convert to corner format boxes
    half_w = w / 2
    half_h = h / 2
    left = nd.clip(x - half_w, 0, 1)
    top = nd.clip(y - half_h, 0, 1)
    right = nd.clip(x + half_w, 0, 1)
    bottom = nd.clip(y + half_h, 0, 1)
    output = nd.concat(*[cid, score, left, top, right, bottom], dim=4)
    return output, cls_pred, score, nd.concat(*[xy, wh], dim=4)
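
Before wiring this into a network, we can sanity-check the shapes end to end (a sketch; the 2-class, 2-anchor setup below is arbitrary):

# Sketch: run a random conv output through yolo2_forward and inspect the shapes.
demo_anchors = [[1.0, 1.0], [2.0, 2.0]]   # hypothetical anchor scales
demo_raw = nd.random.uniform(shape=(1, 2 * (2 + 1 + 4), 8, 8))
demo_out, demo_cls, demo_score, demo_xywh = yolo2_forward(demo_raw, 2, demo_anchors)
print(demo_out.shape, demo_cls.shape, demo_score.shape, demo_xywh.shape)
# (1, 8, 8, 2, 6) (1, 8, 8, 2, 2) (1, 8, 8, 2, 1) (1, 8, 8, 2, 4)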

Define a function to generate YOLO2 training targets

YOLO2 matches ground-truth objects in an unusual way: matching is done locally within each grid cell rather than against a global set of priors. Since the generated training targets do not require backpropagation, we can convert to numpy and use for loops to describe the matching concisely (remember that converting to numpy breaks the autograd recording; this trick is only safe when no gradients are needed). In practice, if this becomes a speed bottleneck, it can be rewritten with mx.ndarray matrix operations. We also use a trick here: a sample_weight matrix that adjusts per-sample weights inside the loss function. With this weight matrix we can also mask out samples entirely, which matters a lot in object detection, because the vast majority of background regions should not predict any boxes.

In [5]:
def corner2center(boxes, concat=True):
    """Convert left/top/right/bottom style boxes into x/y/w/h format"""
    left, top, right, bottom = boxes.split(axis=-1, num_outputs=4)
    x = (left + right) / 2
    y = (top + bottom) / 2
    width = right - left
    height = bottom - top
    if concat:
        last_dim = len(x.shape) - 1
        return nd.concat(*[x, y, width, height], dim=last_dim)
    return x, y, width, height

def center2corner(boxes, concat=True):
    """Convert x/y/w/h style boxes into left/top/right/bottom format"""
    x, y, w, h = boxes.split(axis=-1, num_outputs=4)
    w2 = w / 2
    h2 = h / 2
    left = x - w2
    top = y - h2
    right = x + w2
    bottom = y + h2
    if concat:
        last_dim = len(left.shape) - 1
        return nd.concat(*[left, top, right, bottom], dim=last_dim)
    return left, top, right, bottom

def yolo2_target(scores, boxes, labels, anchors, ignore_label=-1, thresh=0.5):
    """Generate training targets given predictions and labels."""
    b, h, w, n, _ = scores.shape
    anchors = np.reshape(np.array(anchors), (-1, 2))
    #scores = nd.slice_axis(outputs, begin=1, end=2, axis=-1)
    #boxes = nd.slice_axis(outputs, begin=2, end=6, axis=-1)
    gt_boxes = nd.slice_axis(labels, begin=1, end=5, axis=-1)
    target_score = nd.zeros((b, h, w, n, 1), ctx=scores.context)
    target_id = nd.ones_like(target_score, ctx=scores.context) * ignore_label
    target_box = nd.zeros((b, h, w, n, 4), ctx=scores.context)
    sample_weight = nd.zeros((b, h, w, n, 1), ctx=scores.context)
    for b in range(scores.shape[0]):
        # find the best match for each ground-truth
        label = labels[b].asnumpy()
        valid_label = label[np.where(label[:, 0] > -0.5)[0], :]
        # shuffle because multiple ground truths could match the same anchor; keeping the last match makes it random
        np.random.shuffle(valid_label)
        for l in valid_label:
            gx, gy, gw, gh = (l[1] + l[3]) / 2, (l[2] + l[4]) / 2, l[3] - l[1], l[4] - l[2]
            ind_x = int(gx * w)
            ind_y = int(gy * h)
            tx = gx * w - ind_x
            ty = gy * h - ind_y
            gw = gw * w
            gh = gh * h
            # find the best match using width and height only, assuming centers are identical
            intersect = np.minimum(anchors[:, 0], gw) * np.minimum(anchors[:, 1], gh)
            ovps = intersect / (gw * gh + anchors[:, 0] * anchors[:, 1] - intersect)
            best_match = int(np.argmax(ovps))
            target_id[b, ind_y, ind_x, best_match, :] = l[0]
            target_score[b, ind_y, ind_x, best_match, :] = 1.0
            tw = np.log(gw / anchors[best_match, 0])
            th = np.log(gh / anchors[best_match, 1])
            target_box[b, ind_y, ind_x, best_match, :] = mx.nd.array([tx, ty, tw, th])
            sample_weight[b, ind_y, ind_x, best_match, :] = 1.0
            # print('ind_y', ind_y, 'ind_x', ind_x, 'best_match', best_match, 't', tx, ty, tw, th, 'ovp', ovps[best_match], 'gt', gx, gy, gw/w, gh/h, 'anchor', anchors[best_match, 0], anchors[best_match, 1])
    return target_id, target_score, target_box, sample_weight
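
To see the matching in action, here is a sketch with a single made-up ground-truth box on a 4×4 grid: exactly one (cell, anchor) pair becomes a positive sample, and the larger anchor wins because its IoU with the ground truth is higher.

# Sketch: one hypothetical ground truth centered on the image, 4x4 grid, 2 anchors.
demo_anchors = [[1.0, 1.0], [2.0, 2.0]]
demo_scores = nd.zeros((1, 4, 4, 2, 1))
demo_boxes = nd.zeros((1, 4, 4, 2, 4))
demo_labels = nd.array([[[0, 0.25, 0.25, 0.75, 0.75]]])  # (class, left, top, right, bottom)
demo_tid, demo_tscore, demo_tbox, demo_sw = yolo2_target(
    demo_scores, demo_boxes, demo_labels, demo_anchors)
print(demo_sw.sum().asscalar())              # 1.0: a single positive anchor
print(demo_tid[0, 2, 2].asnumpy().ravel())   # [-1.  0.]: anchor 1 in cell (2, 2) gets class 0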

We use YOLO2Output as the output layer of YOLO2. It is essentially just a HybridBlock that wraps a convolutional layer producing the final output.

In [6]:
class YOLO2Output(HybridBlock):
    def __init__(self, num_class, anchor_scales, **kwargs):
        super(YOLO2Output, self).__init__(**kwargs)
        assert num_class > 0, "number of classes should > 0, given {}".format(num_class)
        self._num_class = num_class
        assert isinstance(anchor_scales, (list, tuple)), "list or tuple of anchor scales required"
        assert len(anchor_scales) > 0, "at least one anchor scale required"
        for anchor in anchor_scales:
            assert len(anchor) == 2, "expected each anchor scale to be (width, height), provided {}".format(anchor)
        self._anchor_scales = anchor_scales
        out_channels = len(anchor_scales) * (num_class + 1 + 4)
        with self.name_scope():
            self.output = nn.Conv2D(out_channels, 1, 1)

    def hybrid_forward(self, F, x, *args):
        return self.output(x)
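
A quick instantiation check (a sketch of our own): with 2 classes and 2 anchors, the inner 1×1 convolution should produce 2 × (2 + 1 + 4) = 14 output channels.

# Sketch: verify the channel count; the 16-channel input feature map is hypothetical.
demo_head = YOLO2Output(2, [[1.0, 1.0], [2.0, 2.0]])
demo_head.initialize()
demo_y = demo_head(nd.zeros((1, 16, 8, 8)))
print(demo_y.shape)   # (1, 14, 8, 8): 2 anchors * (2 classes + 1 score + 4 offsets)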

Next, we download and load the dataset.

In [7]:
from mxnet import gluon

root_url = ('https://apache-mxnet.s3-accelerate.amazonaws.com/'
            'gluon/dataset/pikachu/')
data_dir = '../data/pikachu/'
dataset = {'train.rec': 'e6bcb6ffba1ac04ff8a9b1115e650af56ee969c8',
           'train.idx': 'dcf7318b2602c06428b9988470c731621716c393',
           'val.rec': 'd6c33f799b4d058e82f2cb5bd9a976f69d72d520'}
for k, v in dataset.items():
    gluon.utils.download(root_url+k, data_dir+k, sha1_hash=v)
In [8]:
from mxnet import image
from mxnet import nd

data_shape = 256
batch_size = 32
rgb_mean = nd.array([123, 117, 104])
rgb_std = nd.array([58.395, 57.12, 57.375])

def get_iterators(data_shape, batch_size):
    class_names = ['pikachu', 'dummy']
    num_class = len(class_names)
    train_iter = image.ImageDetIter(
        batch_size=batch_size,
        data_shape=(3, data_shape, data_shape),
        path_imgrec=data_dir+'train.rec',
        path_imgidx=data_dir+'train.idx',
        shuffle=True,
        mean=True,
        std=True,
        rand_crop=1,
        min_object_covered=0.95,
        max_attempts=200)
    val_iter = image.ImageDetIter(
        batch_size=batch_size,
        data_shape=(3, data_shape, data_shape),
        path_imgrec=data_dir+'val.rec',
        shuffle=False,
        mean=True,
        std=True)
    return train_iter, val_iter, class_names, num_class

train_data, test_data, class_names, num_class = get_iterators(
    data_shape, batch_size)
In [9]:
batch = train_data.next()
print(batch)
DataBatch: data shapes: [(32, 3, 256, 256)] label shapes: [(32, 1, 5)]
In [10]:
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 120
import matplotlib.pyplot as plt

def box_to_rect(box, color, linewidth=3):
    """convert an anchor box to a matplotlib rectangle"""
    box = box.asnumpy()
    return plt.Rectangle(
        (box[0], box[1]), box[2]-box[0], box[3]-box[1],
        fill=False, edgecolor=color, linewidth=linewidth)

_, figs = plt.subplots(3, 3, figsize=(6,6))
for i in range(3):
    for j in range(3):
        img, labels = batch.data[0][3*i+j], batch.label[0][3*i+j]
        img = img.transpose((1, 2, 0)) * rgb_std + rgb_mean
        img = img.clip(0,255).asnumpy()/255
        fig = figs[i][j]
        fig.imshow(img)
        for label in labels:
            rect = box_to_rect(label[1:5]*data_shape,'red',2)
            fig.add_patch(rect)
        fig.axes.get_xaxis().set_visible(False)
        fig.axes.get_yaxis().set_visible(False)
plt.show()
(Figure: a 3×3 grid of sampled training images, each with its ground-truth box drawn in red.)

Loss functions

In [11]:
sce_loss = gluon.loss.SoftmaxCrossEntropyLoss(from_logits=False)
l1_loss = gluon.loss.L1Loss()

Evaluation metrics

Here we take a shortcut and define our own metric that simply records raw loss values. When you cannot think of a particularly fitting evaluation function, just watching whether the loss goes down is often good enough.

In [12]:
from mxnet import metric

class LossRecorder(mx.metric.EvalMetric):
    """LossRecorder is used to record raw loss so we can observe loss directly
    """
    def __init__(self, name):
        super(LossRecorder, self).__init__(name)

    def update(self, labels, preds=0):
        """Update metric with pure loss
        """
        for loss in labels:
            if isinstance(loss, mx.nd.NDArray):
                loss = loss.asnumpy()
            self.sum_metric += loss.sum()
            self.num_inst += 1

obj_loss = LossRecorder('objectness_loss')
cls_loss = LossRecorder('classification_loss')
box_loss = LossRecorder('box_refine_loss')
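
Usage sketch (with a made-up batch of losses, not from the original): update() accumulates the sum of the raw per-sample losses, and get() reports the running mean per sample.

# Sketch: feed a hypothetical batch of per-sample losses to a recorder.
demo_rec = LossRecorder('demo_loss')
demo_rec.update(nd.array([0.5, 1.5]))
print(demo_rec.get())   # ('demo_loss', 1.0)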

Coarsely tune the relative weight of each loss

In [13]:
positive_weight = 5.0
negative_weight = 0.1
class_weight = 1.0
box_weight = 5.0

The network

Load a ResNet pretrained in the model zoo, keep its intermediate feature-extraction layers, and create our YOLO2 output layer on top with suitable anchor sizes.

In [14]:
pretrained = vision.get_model('resnet18_v1', pretrained=True).features
net = nn.HybridSequential()
for i in range(len(pretrained) - 2):
    net.add(pretrained[i])

# anchor scales; try adjusting them yourself
scales = [[3.3004, 3.59034],
          [9.84923, 8.23783]]

# use 2 classes, with 1 as a dummy class, otherwise softmax won't work
predictor = YOLO2Output(2, scales)
predictor.initialize()
net.add(predictor)

As before, we need a GPU to speed up training.

In [15]:
from mxnet import init
from mxnet import gpu

ctx = gpu(0)
net.collect_params().reset_ctx(ctx)
trainer = gluon.Trainer(net.collect_params(),
                        'sgd', {'learning_rate': 1, 'wd': 5e-4})
In [16]:
import time
from mxnet import autograd
for epoch in range(20):
    # reset data iterators and metrics
    train_data.reset()
    cls_loss.reset()
    obj_loss.reset()
    box_loss.reset()
    tic = time.time()
    for i, batch in enumerate(train_data):
        x = batch.data[0].as_in_context(ctx)
        y = batch.label[0].as_in_context(ctx)
        with autograd.record():
            x = net(x)
            output, cls_pred, score, xywh = yolo2_forward(x, 2, scales)
            with autograd.pause():
                tid, tscore, tbox, sample_weight = yolo2_target(score, xywh, y, scales, thresh=0.5)
            # losses
            loss1 = sce_loss(cls_pred, tid, sample_weight * class_weight)
            score_weight = nd.where(sample_weight > 0,
                                    nd.ones_like(sample_weight) * positive_weight,
                                    nd.ones_like(sample_weight) * negative_weight)
            loss2 = l1_loss(score, tscore, score_weight)
            loss3 = l1_loss(xywh, tbox, sample_weight * box_weight)
            loss = loss1 + loss2 + loss3
        loss.backward()
        trainer.step(batch_size)
        # update metrics
        cls_loss.update(loss1)
        obj_loss.update(loss2)
        box_loss.update(loss3)

    print('Epoch %2d, train %s %.5f, %s %.5f, %s %.5f time %.1f sec' % (
        epoch, *cls_loss.get(), *obj_loss.get(), *box_loss.get(), time.time()-tic))
Epoch  0, train classification_loss 0.00096, objectness_loss 0.03676, box_refine_loss 0.00246 time 9.0 sec
Epoch  1, train classification_loss 0.00032, objectness_loss 0.01557, box_refine_loss 0.00205 time 8.3 sec
Epoch  2, train classification_loss 0.00010, objectness_loss 0.00709, box_refine_loss 0.00194 time 8.3 sec
Epoch  3, train classification_loss 0.00007, objectness_loss 0.00453, box_refine_loss 0.00194 time 8.3 sec
Epoch  4, train classification_loss 0.00006, objectness_loss 0.00341, box_refine_loss 0.00198 time 8.3 sec
Epoch  5, train classification_loss 0.00005, objectness_loss 0.00275, box_refine_loss 0.00198 time 8.3 sec
Epoch  6, train classification_loss 0.00006, objectness_loss 0.00251, box_refine_loss 0.00185 time 8.3 sec
Epoch  7, train classification_loss 0.00005, objectness_loss 0.00219, box_refine_loss 0.00183 time 8.3 sec
Epoch  8, train classification_loss 0.00004, objectness_loss 0.00200, box_refine_loss 0.00178 time 8.3 sec
Epoch  9, train classification_loss 0.00004, objectness_loss 0.00190, box_refine_loss 0.00167 time 8.3 sec
Epoch 10, train classification_loss 0.00005, objectness_loss 0.00182, box_refine_loss 0.00173 time 8.3 sec
Epoch 11, train classification_loss 0.00004, objectness_loss 0.00172, box_refine_loss 0.00160 time 8.3 sec
Epoch 12, train classification_loss 0.00003, objectness_loss 0.00159, box_refine_loss 0.00153 time 8.3 sec
Epoch 13, train classification_loss 0.00004, objectness_loss 0.00160, box_refine_loss 0.00151 time 8.3 sec
Epoch 14, train classification_loss 0.00005, objectness_loss 0.00167, box_refine_loss 0.00143 time 8.3 sec
Epoch 15, train classification_loss 0.00003, objectness_loss 0.00147, box_refine_loss 0.00140 time 8.3 sec
Epoch 16, train classification_loss 0.00004, objectness_loss 0.00146, box_refine_loss 0.00134 time 8.3 sec
Epoch 17, train classification_loss 0.00004, objectness_loss 0.00146, box_refine_loss 0.00130 time 8.3 sec
Epoch 18, train classification_loss 0.00003, objectness_loss 0.00133, box_refine_loss 0.00125 time 8.3 sec
Epoch 19, train classification_loss 0.00003, objectness_loss 0.00140, box_refine_loss 0.00126 time 8.3 sec

Preprocessing and prediction functions

In [17]:
def process_image(fname):
    with open(fname, 'rb') as f:
        im = image.imdecode(f.read())
    # resize to data_shape
    data = image.imresize(im, data_shape, data_shape)
    # subtract rgb mean, divide by rgb std
    data = (data.astype('float32') - rgb_mean) / rgb_std
    # convert to batch x channel x height x width
    return data.transpose((2,0,1)).expand_dims(axis=0), im

def predict(x):
    x = net(x)
    output, cls_prob, score, xywh = yolo2_forward(x, 2, scales)
    return nd.contrib.box_nms(output.reshape((0, -1, 6)))

We load a Pikachu image again for testing.

In [18]:
x, im = process_image('../img/pikachu.jpg')
out = predict(x.as_in_context(ctx))
out.shape
out
Out[18]:

[[[ 0.          0.99522698  0.51428801  0.49950701  0.67562884  0.70625955]
  [ 0.          0.98376417  0.2181921   0.58949018  0.35859996  0.73523128]
  [ 0.          0.96043801  0.7542755   0.5093382   0.90127486  0.70931464]
  ...,
  [-1.         -1.         -1.         -1.         -1.         -1.        ]
  [-1.         -1.         -1.         -1.         -1.         -1.        ]
  [-1.         -1.         -1.         -1.         -1.         -1.        ]]]
<NDArray 1x512x6 @gpu(0)>

Display the results

In [19]:
mpl.rcParams['figure.figsize'] = (6,6)

colors = ['blue', 'green', 'red', 'black', 'magenta']

def display(im, out, threshold=0.5):
    plt.imshow(im.asnumpy())
    for row in out:
        row = row.asnumpy()
        class_id, score = int(row[0]), row[1]
        if class_id < 0 or score < threshold:
            continue
        color = colors[class_id%len(colors)]
        box = row[2:6] * np.array([im.shape[1], im.shape[0]] * 2)
        rect = box_to_rect(nd.array(box), color, 2)
        plt.gca().add_patch(rect)
        text = class_names[class_id]
        plt.gca().text(box[0], box[1],
                       '{:s} {:.2f}'.format(text, score),
                       bbox=dict(facecolor=color, alpha=0.5),
                       fontsize=10, color='white')
    plt.show()

display(im, out[0], threshold=0.5)
(Figure: the test image with the predicted Pikachu bounding boxes and confidence scores drawn on top.)

Exercises

  1. Try changing the default anchor scales: adjust their sizes, or increase or decrease their number.
  2. Adjust the relative weights of the different losses and observe how this affects training.
  3. In object detection, valid targets make up only a tiny fraction of each batch. What problem does averaging inside the loss function cause in this setting, and can you think of a better approach?