Spatiotemporal feature fusion network based multi-objects tracking and segmentation
2022, Vol. 27, No. 11, Pages 3257-3266
Print publication date: 2022-11-16
Accepted: 2021-11-02
DOI: 10.11834/jig.210417
Yuting Liu, Kaihua Zhang, Jiaqing Fan, Qingshan Liu. Spatiotemporal feature fusion network based multi-objects tracking and segmentation[J]. Journal of Image and Graphics, 2022,27(11):3257-3266.
Objective
Multi-object tracking and segmentation is an important research direction in computer vision. Most existing methods borrow the detect-then-track-and-segment pipeline from multi-object tracking; such methods pay insufficient attention to important feature information and have difficulty handling problems such as object occlusion. To address these issues, this paper proposes a multi-object tracking and segmentation model based on spatiotemporal feature fusion, which uses a spatial tri-coordinate attention module and a temporal-reduced self-attention module to select salient features and thereby achieve excellent multi-object tracking and segmentation performance.
Method
The proposed network consists of a 2D encoder and a 3D decoder. Multiple consecutive frames are first fed into the 2D encoder to extract image features at different resolutions. Starting from the low-resolution features, the spatial tri-coordinate attention module extracts important spatial features and the temporal-reduced self-attention module extracts temporal features that carry key-frame information; both are fused with the original features. The fused features are then fed, together with the higher-resolution features, into 3D convolutional layers, and features of different levels are aggregated repeatedly, producing multiply fused features that carry both key temporal information and important spatial information, from which the final tracking and segmentation results are obtained.
Result
Quantitative evaluation is performed on the YouTube-VIS (YouTube video instance segmentation) and KITTI MOTS (multi-object tracking and segmentation) datasets. On YouTube-VIS, the proposed method improves AP (average precision) by 0.2% over the second-best model, CompFeat. On KITTI MOTS, compared with the second-best model, STEm-Seg, the proposed method reduces ID switches by 9 on the car category; on the pedestrian category, it improves sMOTSA (soft multi-object tracking and segmentation accuracy), MOTSA (multi-object tracking and segmentation accuracy), and MOTSP (multi-object tracking and segmentation precision) by 0.7%, 0.6%, and 0.9%, respectively, and reduces ID switches by 1. Ablation experiments on KITTI MOTS verify the effectiveness of the spatial tri-coordinate attention module and the temporal-reduced self-attention module, and their results show that the proposed algorithm improves multi-object tracking and segmentation.
Conclusion
The proposed multi-object tracking and segmentation model fully exploits the feature information across multiple frames, making the multi-object tracking and segmentation results more accurate.
Objective
Multi-object tracking and segmentation aims to track and segment multiple objects in a video and involves detection, tracking, and segmentation. Most existing methods follow the detect-then-track-and-segment pipeline inherited from multi-object tracking; however, they pay insufficient attention to effective feature extraction and struggle to resolve target occlusion. Our research focuses on a joint multi-object tracking and segmentation method built on a 3D spatiotemporal feature fusion network (STFNet) with spatial tri-coordinate attention (STCA) and temporal-reduced self-attention (TRSA), which adaptively selects salient feature representations to optimize tracking and segmentation performance.
Method
STFNet consists of a 2D encoder and a 3D decoder. First, multiple consecutive frames are fed into the 2D encoder, and the 3D decoder takes the resulting low-resolution features as input. The low-resolution features are fused through three 3D convolutional layers; the spatial features that carry key spatial information are then obtained via the STCA module, the temporal features that carry key-frame information are obtained via the TRSA module, and both are merged with the original features. Next, the higher-resolution features and the low-level fused features are fed together into a 1×1×1 3D convolutional layer, and this aggregation of features from different levels is repeated to obtain features that carry both key-frame information and salient spatial information. Finally, STFNet fits the features to a three-dimensional Gaussian distribution for each instance; every Gaussian distribution assigns pixels of the consecutive frames to one object or to the background, which achieves the segmentation of each target.
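As a rough illustration of this instance-assignment step, the following PyTorch sketch scores every spatio-temporal pixel against one instance Gaussian and thresholds the probability; the tensor layout, the function name, and the 0.5 threshold are our assumptions for illustration, not the authors' exact implementation.

```python
import torch

def assign_pixels_to_instance(embeddings, center, sigma, threshold=0.5):
    """Assign the pixels of a clip to one instance via its 3D Gaussian.

    embeddings: (D, T, H, W) per-pixel embedding vectors from the 3D decoder
    center:     (D,)         embedding-space center of the instance
    sigma:      (D,)         per-dimension spread of the instance Gaussian
    Returns a boolean (T, H, W) mask of pixels assigned to the instance.
    """
    diff = embeddings - center.view(-1, 1, 1, 1)
    # Unnormalized Gaussian score: pixels whose embeddings lie close to the
    # instance center get a probability near 1, background pixels near 0.
    prob = torch.exp(-((diff ** 2) / (2 * sigma.view(-1, 1, 1, 1) ** 2)).sum(dim=0))
    return prob > threshold
```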
Specifically, STCA is an attention-enhanced version of coordinate attention. Coordinate attention uses only horizontal and vertical attention weights, so the attention is limited to a local range and ignores the channel dimension. STCA therefore adds a channel-wise attention mechanism to further retain useful information and discard useless information. First, STCA extracts horizontal, vertical, and channel-wise features via average pooling, encoding information along the three coordinate directions for the subsequent computation of weight coefficients. Second, STCA fuses the three features pairwise, concatenates them, and feeds them into a 1×1 convolution for feature fusion, followed by a batch normalization layer and a non-linear activation function, so that the features of the three coordinate directions are fused. Third, the fused features are split to obtain separate attention features for each coordinate direction; attention features of the same direction are added together, and the weight of each direction is obtained through a sigmoid function. Finally, the weights are multiplied with the original features to produce the output of STCA.
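A minimal PyTorch sketch of this tri-coordinate pooling-and-reweighting idea is given below; it is a simplified per-frame (2D) variant under our own assumptions about layer sizes and the fusion step, not the published STCA implementation.

```python
import torch
import torch.nn as nn

class TriCoordinateAttention(nn.Module):
    """Simplified spatial tri-coordinate attention (per-frame 2D sketch)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        # Shared 1x1 conv + batch norm + non-linearity fuses the three pooled descriptors.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        # One 1x1 conv per coordinate direction produces that direction's weights.
        self.to_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.to_w = nn.Conv2d(mid, channels, kernel_size=1)
        self.to_c = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        pool_h = x.mean(dim=3, keepdim=True)                       # (b, c, h, 1)
        pool_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (b, c, w, 1)
        pool_c = x.mean(dim=(2, 3), keepdim=True)                  # (b, c, 1, 1)
        # Concatenate the three descriptors along the spatial axis and fuse them.
        y = self.fuse(torch.cat([pool_h, pool_w, pool_c], dim=2))  # (b, mid, h+w+1, 1)
        y_h, y_w, y_c = torch.split(y, [h, w, 1], dim=2)
        a_h = torch.sigmoid(self.to_h(y_h))                        # (b, c, h, 1)
        a_w = torch.sigmoid(self.to_w(y_w)).permute(0, 1, 3, 2)    # (b, c, 1, w)
        a_c = torch.sigmoid(self.to_c(y_c))                        # (b, c, 1, 1)
        return x * a_h * a_w * a_c                                 # reweighted features
```

In the 3D decoder, the same reweighting would be applied to every frame of the feature volume.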
The TRSA module addresses the frequent occlusions in multi-object tracking and segmentation by selecting features along the temporal dimension; it makes the network pay more attention to the object information of key frames and weaken the information of occluded frames. 1) TRSA feeds the features into three 1×1×1 3D convolutions, whose purpose is to reduce the channel dimension. 2) TRSA merges dimensions to obtain three matrices in which only the temporal dimension is kept separate, and a one-dimensional convolution further reduces their dimensionality, which greatly reduces the amount of subsequent matrix computation. 3) TRSA transposes two of the matrices and multiplies the non-transposed matrix with a transposed one to obtain a low-dimensional matrix; the result is passed through a softmax function to obtain the attention weights. 4) The attention weights are multiplied with the original features, and the features are restored to the original dimensions by increasing dimensions, rearranging dimensions, and passing through a 3D convolution.
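The following PyTorch sketch captures the general shape of such a temporal-reduced self-attention block; the reduction ratio, the stride of the 1D convolution, and the residual connection are our assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalReducedSelfAttention(nn.Module):
    """Sketch of temporal-reduced self-attention over a (B, C, T, H, W) volume."""

    def __init__(self, channels, reduced_channels=None, stride=4):
        super().__init__()
        r = reduced_channels or max(channels // 2, 1)
        self.reduced = r
        # Three 1x1x1 3D convolutions reduce the channel dimension (query/key/value).
        self.q_proj = nn.Conv3d(channels, r, kernel_size=1)
        self.k_proj = nn.Conv3d(channels, r, kernel_size=1)
        self.v_proj = nn.Conv3d(channels, r, kernel_size=1)
        # A 1D convolution shrinks each flattened per-frame descriptor before the
        # T x T attention product, keeping the matrix multiplication cheap.
        self.squeeze = nn.Conv1d(1, 1, kernel_size=stride, stride=stride)
        # A final 1x1x1 3D convolution restores the original channel count.
        self.out_proj = nn.Conv3d(r, channels, kernel_size=1)

    def _flatten(self, x):
        # (B, r, T, H, W) -> (B, T, r*H*W): one descriptor vector per frame.
        b, r, t, h, w = x.shape
        return x.permute(0, 2, 1, 3, 4).reshape(b, t, r * h * w)

    def _reduce(self, x):
        # Apply the shared 1D convolution to every frame descriptor.
        b, t, l = x.shape
        return self.squeeze(x.reshape(b * t, 1, l)).reshape(b, t, -1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self._reduce(self._flatten(self.q_proj(x)))   # (B, T, L')
        k = self._reduce(self._flatten(self.k_proj(x)))   # (B, T, L')
        v = self._flatten(self.v_proj(x))                  # (B, T, r*H*W)
        # Frame-to-frame attention: occluded frames can be down-weighted here.
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, T, T)
        out = attn @ v                                      # temporal mixing of frame features
        out = out.reshape(b, t, self.reduced, h, w).permute(0, 2, 1, 3, 4)
        return x + self.out_proj(out)                       # residual connection (assumption)
```

Because the attention matrix is only T×T, the cost of the attention product no longer scales with the spatial resolution once the per-frame descriptors have been squeezed.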
Result
Our main evaluation datasets are YouTube video instance segmentation (YouTube-VIS) and KITTI multi-object tracking and segmentation (KITTI MOTS). For the YouTube-VIS dataset, the model is trained jointly on the YouTube-VIS training set and the common objects in context (COCO) training set, whose label sets overlap in just 20 object classes, and the input image size is 640×1 152 pixels; the evaluation indicators used in MaskTrack region convolutional neural network (R-CNN), average precision (AP) and average recall (AR), are used to evaluate tracking and segmentation performance. For the KITTI MOTS dataset, the model is trained on the KITTI MOTS training set with an input image size of 544×1 792 pixels; the evaluation indicators used in TrackR-CNN, soft multi-object tracking and segmentation accuracy (sMOTSA), multi-object tracking and segmentation accuracy (MOTSA), multi-object tracking and segmentation precision (MOTSP), and ID switches (IDS), are used to evaluate tracking and segmentation performance.
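For reference, these indicators follow the definitions introduced with TrackR-CNN (Voigtlaender et al. 2019a); restated in our own notation (not quoted from this paper), with |M| the number of ground-truth masks, TP and FP the sets of true and false positive hypothesis masks, $\widetilde{TP}$ the sum of mask IoUs over the true positives, and IDS the number of identity switches:

$$
\mathrm{MOTSA}=\frac{|TP|-|FP|-|\mathit{IDS}|}{|M|},\qquad
\mathrm{MOTSP}=\frac{\widetilde{TP}}{|TP|},\qquad
\mathrm{sMOTSA}=\frac{\widetilde{TP}-|FP|-|\mathit{IDS}|}{|M|}
$$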
The training data are augmented by random horizontal flipping, temporal reversal of the video, and image brightness enhancement. Our experiments use ResNet-101 as the backbone network, initialized with the weights of a Mask R-CNN model pre-trained on the COCO training set, while the decoder weights are randomly initialized. Three loss functions are used for training: the Lovász hinge loss is used to learn the feature embedding vectors, the smoothness loss is used to learn the variance values, and the L2 loss is used to generate the instance center heat map.
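A compact sketch of how the three terms could be combined during training is shown below; the loss weights, tensor layouts, and the exact smoothness formulation are illustrative assumptions, and `lovasz_hinge` refers to the publicly released implementation accompanying Berman et al. (2018), assumed to be available as `lovasz_losses.py`.

```python
import torch
import torch.nn.functional as F
from lovasz_losses import lovasz_hinge  # reference code released by Berman et al. (2018)

def total_loss(emb_logits, fg_labels, sigma, instance_mask,
               heatmap_pred, heatmap_gt, w_emb=1.0, w_smooth=1.0, w_center=1.0):
    """Combine the three training terms described above (weights are illustrative)."""
    # 1) Lovász hinge loss on per-instance foreground logits drives the embedding;
    #    emb_logits / fg_labels follow that implementation's (B, H, W) convention.
    l_emb = lovasz_hinge(emb_logits, fg_labels)
    # 2) Smoothness loss: variance predictions inside one instance should agree,
    #    so they are pulled toward their mean (sigma: (D, T, H, W), mask: (T, H, W)).
    sigma_in = sigma[:, instance_mask]                                 # (D, N)
    l_smooth = ((sigma_in - sigma_in.mean(dim=1, keepdim=True)) ** 2).mean()
    # 3) L2 (MSE) loss between the predicted and target instance-center heat maps.
    l_center = F.mse_loss(heatmap_pred, heatmap_gt)
    return w_emb * l_emb + w_smooth * l_smooth + w_center * l_center
```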
On the YouTube-VIS dataset, the AP value is increased by 0.2% compared to the second-best method, CompFeat. On the KITTI MOTS dataset, compared to the second-best method, STEm-Seg, the ID switch count is reduced by 9 in the car category; in the pedestrian category, sMOTSA is increased by 0.7%, MOTSA by 0.6%, and MOTSP by 0.9%, while the ID switch count is reduced by 1. Ablation experiments are also carried out on the KITTI MOTS dataset: STCA improves the performance by 0.5% over the baseline, and TRSA improves it by 0.3% over the baseline, which shows that both modules are effective.
Conclusion
We present a multi-object tracking and segmentation model based on spatiotemporal feature fusion. The model fully mines the feature information across multiple frames of a video, which makes the tracking and segmentation results more accurate. The experimental results illustrate that STFNet can alleviate the target occlusion problem to a certain extent.
deep learning; multi-object tracking and segmentation (MOTS); 3D convolutional neural network; feature fusion; attention mechanism
Athar A, Mahadevan S, Ošep A, Leal-Taixé L and Leibe B. 2020. STEm-Seg: spatio-temporal embeddings for instance segmentation in videos//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 158-177 [DOI: 10.1007/978-3-030-58621-8_10]
Berman M, Triki A R and Blaschko M B. 2018. The Lovász-Softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4413-4421 [DOI: 10.1109/CVPR.2018.00464]
Fang L and Yu F Q. 2020. Multi-object tracking based on adaptive online discriminative appearance learning and hierarchical association. Journal of Image and Graphics, 25(4): 708-720
Fu J, Liu J, Tian H J, Li Y, Bao Y J, Fang Z W and Lu H Q. 2019. Dual attention network for scene segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3141-3149 [DOI: 10.1109/CVPR.2019.00326]
Fu Y, Yang L J, Liu D, Huang T S and Shi H. 2020. CompFeat: comprehensive feature aggregation for video instance segmentation//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 1361-1369
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2980-2988 [DOI: 10.1109/ICCV.2017.322]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hou Q B, Zhou D Q and Feng J S. 2021. Coordinate attention for efficient mobile network design//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 13708-13717 [DOI: 10.1109/CVPR46437.2021.01350]
Hu J, Shen L and Sun G. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141 [DOI: 10.1109/CVPR.2018.00745]
Lin C C, Hung Y, Feris R and He L L. 2020. Video instance segmentation tracking with a modified VAE architecture//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13144-13154 [DOI: 10.1109/CVPR42600.2020.01316]
Lin C C, Zhao G S, Yin A H, Ding B C, Guo L and Chen H B. 2020. AS-PANet: a chromosome instance segmentation method based on improved path aggregation network architecture. Journal of Image and Graphics, 25(10): 2271-2280
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944 [DOI: 10.1109/CVPR.2017.106]
Luiten J, Zulfikar I E and Leibe B. 2020. UnOVOST: unsupervised offline video object segmentation and tracking//Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision. Snowmass, USA: IEEE: 1989-1998 [DOI: 10.1109/WACV45572.2020.9093285]
Porzi L, Hofinger M, Ruiz I, Serrat J, Bulò S R and Kontschieder P. 2020. Learning multi-object tracking and segmentation from automatic annotations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6845-6854 [DOI: 10.1109/CVPR42600.2020.00688]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Voigtlaender P, Chai Y N, Schroff F, Adam H, Leibe B and Chen L C. 2019b. FEELVOS: fast end-to-end embedding learning for video object segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9473-9482 [DOI: 10.1109/CVPR.2019.00971]
Voigtlaender P, Krause M, Osep A, Luiten J, Sekar B B G, Geiger A and Leibe B. 2019a. MOTS: multi-object tracking and segmentation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7934-7943 [DOI: 10.1109/CVPR.2019.00813]
Wang X L, Girshick R, Gupta A and He K M. 2018. Non-local neural networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7794-7803 [DOI: 10.1109/CVPR.2018.00813]
Wang X Q, Jiang J G and Qi M B. 2017. Hierarchical multi-object tracking algorithm based on globally multiple maximum clique graphs. Journal of Image and Graphics, 22(10): 1401-1408 [DOI: 10.11834/jig.160527]
Wojke N, Bewley A and Paulus D. 2017. Simple online and realtime tracking with a deep association metric//Proceedings of 2017 IEEE International Conference on Image Processing (ICIP). Beijing, China: IEEE: 3645-3649 [DOI: 10.1109/ICIP.2017.8296962]
Xu Z B, Zhang W, Tan X, Yang W, Huang H, Wen S L, Ding E R and Huang L S. 2020. Segment as points for efficient online multi-object tracking and segmentation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 264-281 [DOI: 10.1007/978-3-030-58452-8_16]
Yang L J, Fan Y C and Xu N. 2019. Video instance segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 5187-5196 [DOI: 10.1109/ICCV.2019.00529]
Yang L J, Wang Y R, Xiong X H, Yang J C and Katsaggelos A K. 2018. Efficient video object segmentation via network modulation//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6499-6507 [DOI: 10.1109/CVPR.2018.00680]