SiamRPN++ [1] is a Siamese-network-based tracker that successfully adopts very deep backbone networks by breaking the strict translation invariance restriction. It performs layer-wise and depth-wise aggregations to train a ResNet-driven Siamese tracker. This note collects some details from reading and implementing it.
Contents
Paper & Code & Note
Paper: SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks (CVPR 2019)
Code: PyTorch
Note: Mendeley
Paper
Abstract

- Siamese trackers formulate tracking as convolutional feature cross-correlation, but they still have an accuracy gap compared with state-of-the-art methods because they cannot take advantage of features from very deep networks.
- This paper shows that the core reason is the lack of strict translation invariance, and it breaks this restriction through a simple yet effective spatial-aware sampling strategy.
- It further proposes a new model architecture to perform layer-wise and depth-wise aggregations.
- It obtains the currently best results on five large tracking benchmarks.
Problem Description

- Padding in deep networks destroys the strict translation invariance.
- RPN requires asymmetrical features for classification and regression.
Problem Solution

- Sampling strategy: break the spatial invariance restriction with spatial-aware sampling.
- Layer-wise feature aggregation: predict the similarity map from features learned at multiple levels.
- Depth-wise separable correlation: produce multiple similarity maps associated with different semantic meanings while reducing the number of parameters.
Conceptual Understanding

- Hypothesis: the violation of strict translation invariance leads to a spatial bias.
- Experiments: targets are placed at the center with different shift ranges in separate training experiments (a minimal sampling sketch follows this list).
- Results: a strong center bias is learned; increasing the shift range covers more of the search area and alleviates the bias.
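
A minimal sketch of such a spatial-aware sampling step, assuming a NumPy image in HxWxC layout; the function name, padding mode, and `shift_range` default are illustrative and do not come from pysot.

```python
import numpy as np

def spatial_aware_crop(image, target_center, crop_size, shift_range=64):
    """Crop a search patch whose center is offset from the target center by a
    uniform random shift in [-shift_range, shift_range] pixels on each axis,
    so the target does not always sit at the center of the training patch.
    Illustrative only; not the pysot data-augmentation code.
    """
    dx, dy = np.random.uniform(-shift_range, shift_range, size=2)
    cx, cy = target_center[0] + dx, target_center[1] + dy

    # Pad generously so the shifted window always stays inside the frame
    # (assumes shift_range is small relative to crop_size).
    pad = crop_size
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    x1 = int(round(cx + pad - crop_size / 2))
    y1 = int(round(cy + pad - crop_size / 2))
    return padded[y1:y1 + crop_size, x1:x1 + crop_size]
```

Training with shift_range = 0 reproduces the center bias; larger shift ranges spread the positive samples over the search region, which is the effect measured in the experiments above.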
Core Conception

Layer-wise Aggregation

- They exploit multi-level features, carrying both low-level and semantic information, extracted from the last three residual blocks; these outputs are referred to as $F_3(z)$, $F_4(z)$, and $F_5(z)$.
- The outputs of the three RPN modules have the same spatial resolution, so a weighted sum is adopted directly on the RPN outputs (see the fusion sketch below).
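
A minimal sketch of the weighted fusion, assuming the per-level RPN heads already produce classification and regression maps of equal spatial size; the class and parameter names are illustrative and differ from pysot's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedRPNFusion(nn.Module):
    """Fuse the per-level RPN outputs (driven by F_3, F_4, F_5) with
    learned scalar weights, following the paper's weighted-sum idea."""

    def __init__(self, num_levels=3):
        super().__init__()
        # One weight per level for cls and loc; normalized with softmax.
        self.cls_weight = nn.Parameter(torch.ones(num_levels))
        self.loc_weight = nn.Parameter(torch.ones(num_levels))

    def forward(self, cls_maps, loc_maps):
        # cls_maps / loc_maps: lists of per-level RPN outputs, e.g.
        # [S_3, S_4, S_5] and [B_3, B_4, B_5], all of the same spatial size.
        a = F.softmax(self.cls_weight, dim=0)
        b = F.softmax(self.loc_weight, dim=0)
        cls = sum(w * m for w, m in zip(a, cls_maps))
        loc = sum(w * m for w, m in zip(b, loc_maps))
        return cls, loc
```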
Depth-wise Cross Correlation


- A conv-bn block is adopted so that the two feature maps have the same number of channels before the correlation operation.
- Another conv-bn-relu block is appended to fuse the outputs of different channels (a sketch of the correlation step follows this list).
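
Below is a minimal sketch of the depth-wise correlation step itself (the surrounding conv-bn and conv-bn-relu blocks are omitted), implemented with a grouped convolution; tensor shapes follow the usual PyTorch (B, C, H, W) convention.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat, kernel_feat):
    """Depth-wise (per-channel) cross correlation.

    search_feat: (B, C, Hs, Ws) feature map of the search region.
    kernel_feat: (B, C, Hk, Wk) feature map of the exemplar, used as a
        per-sample, per-channel correlation kernel.
    Returns a (B, C, Ho, Wo) response with one similarity map per channel.
    """
    b, c, hk, wk = kernel_feat.size()
    # Fold the batch into the channel dimension so each sample is correlated
    # with its own kernel, then use groups=b*c for channel-wise correlation.
    x = search_feat.view(1, b * c, search_feat.size(2), search_feat.size(3))
    k = kernel_feat.view(b * c, 1, hk, wk)
    out = F.conv2d(x, k, groups=b * c)
    return out.view(b, c, out.size(2), out.size(3))
```

Because the kernel has only one input channel per group, the parameter and output sizes stay proportional to C rather than C squared, which is where the parameter reduction comes from.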

Furthermore, an interesting phenomenon is that objects of the same category have high responses on the same channels, while the responses of the remaining channels are suppressed. This can be interpreted as each channel representing some semantic information.
Experiments

Code
The complete code can be found in [pysot][2].
Note
More details of SiamRPN++ and the like can be found in [3].
References
[1] Li, Bo, et al. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019. pp. 4282-4291.
[2] pysot. https://github.com/STVIR/pysot.
[3] Erer Huang. “Overview of Siamese Network Methods.” https://zhuanlan.zhihu.com/p/66757733.