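Chen Zhenyuan1,2,3, Wang Zhendong1,2,3, Gong Chen1,2,3 (1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094; 2. Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing 210094; 3. Jiangsu Key Laboratory of Image and Video Understanding for Social Security, Nanjing 210094)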
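Object detection is one of the fundamental tasks in computer vision. Depending on the label information available, it can be divided into fully supervised, semi-supervised, and weakly supervised object detection. Weakly supervised object detection (WSOD) aims to train a detector using only image-level category labels and then to localize and classify all target objects in test images. Because it significantly reduces the cost of data annotation, WSOD has attracted growing attention and achieved remarkable progress. This paper begins with the research significance of WSOD. It first introduces the label setting and problem definition of WSOD, the basic framework based on multiple instance learning, and the three major challenges it faces: local dominance, instance ambiguity, and computational cost. It then groups the typical algorithms in the field into three categories according to their core network architecture, namely algorithms based on optimized candidate box generation, algorithms combined with image segmentation, and self-training-based algorithms, and describes the core contributions of each category. Furthermore, the detection performance of various WSOD algorithms is compared experimentally under multiple evaluation metrics. On the VOC2007 (visual object classes 2007) dataset, the method with the highest mean average precision (mAP) is MIST (multiple instance self-training) at 54.9%, and the method with the highest correct localization (CorLoc) is SLV (spatial likelihood voting) at 71.1%. On the VOC2012 dataset, the method with the highest mAP is NDI-WSOD (negative deterministic information weakly supervised object detection) at 53.9%, and the method with the highest CorLoc is P-MIDN (pyramidal multiple instance detection network) at 73.3%. On the MS COCO (Microsoft common objects in context) dataset, the method with the highest ValAP50, the average precision on the validation set at an intersection over union (IoU) threshold of 50%, is P-MIDN at 27.4%. Finally, future research directions for WSOD are discussed. The WSOD algorithm framework summarized in this paper offers a useful reference for subsequent researchers in network design, model exploration, and optimization directions.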
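The IoU threshold and the CorLoc metric mentioned above can be illustrated with a minimal sketch. It assumes boxes are given as `(x1, y1, x2, y2)` tuples in continuous coordinates and that, for each image containing a class, only the single top-scoring predicted box for that class is checked. The function names `iou` and `corloc` are hypothetical and are not taken from the surveyed papers or from any benchmark toolkit.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes, thresh=0.5):
    """Fraction of images whose top-scoring box overlaps some ground-truth box.

    top_boxes: one predicted box per image (the highest-scoring box for the class).
    gt_boxes:  for each image, the list of ground-truth boxes of that class.
    """
    hits = sum(
        1
        for pred, gts in zip(top_boxes, gt_boxes)
        if any(iou(pred, gt) >= thresh for gt in gts)
    )
    return hits / len(top_boxes)


# Illustrative data only: the first prediction overlaps its target, the second misses.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [[(1, 1, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(preds, targets)
```

Benchmark toolkits may use slightly different box conventions (for example, inclusive pixel coordinates), so the numbers produced by this sketch are not guaranteed to match official evaluation scripts.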
Image-level labeled weakly supervised object detection: a survey
Chen Zhenyuan1,2,3, Wang Zhendong1,2,3, Gong Chen1,2,3(1.School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China;2.Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing 210094, China;3.Jiangsu Key Laboratory of Image and Video Understanding for Social Security, Nanjing 210094, China)
Object detection is a fundamental problem in computer vision and image processing. From the perspective of supervision, it can be divided into fully supervised, semi-supervised, and weakly supervised object detection. In recent years, object detection has played an important role in various areas and shown great application value. Precise object detection depends on accurate region-level or instance-level image labeling during detector training. However, the complexity of backgrounds and the diversity of objects in real scenes make accurate image labeling extremely time-consuming and laborious. In particular, traditional fully supervised object detection algorithms require the position and category of every object in an image to be marked manually with a minimum bounding rectangle, which raises the cost of acquiring training labels. By contrast, weakly supervised object detection (WSOD) algorithms require only the category labels of the whole image for training, so a large number of training samples can be obtained easily by searching for category labels on image websites. WSOD has received increasing attention and achieved encouraging progress because it markedly reduces the labor cost of labeling. Researchers have therefore focused on WSOD algorithms based on image-level coarse labels, which depend only weakly on supervision. Unlike other supervised object detection tasks, WSOD aims to localize and classify objects in an image by using only image-level category annotations. The present study starts with the research significance of WSOD. First, the definition, basic framework, and main challenges of WSOD are introduced: 1) WSOD involves a training phase and a test phase, as standard detectors do. The whole problem can be understood as learning a mapping from the candidate boxes contained in an image to the image-level category labels. 2) The problem setup of WSOD is consistent with that of multiple instance learning in weakly supervised 
learning. Thus, WSOD can be treated as a multiple instance learning problem by taking each candidate box as an instance and the image containing all the candidate boxes as a bag. For each category, if the image contains at least one target object of this category, the image is a positive bag; otherwise, it is a negative bag. Detector parameters can therefore be learned from the candidate boxes in images. If an image is predicted to be a positive bag of a certain class, then the image contains a target of this class, and the target can be localized with a rectangular candidate box. 3) WSOD faces three major problems: local dominance, instance ambiguity, and heavy memory consumption. Afterward, advanced WSOD algorithms are classified into three categories according to their network architectures: algorithms based on optimized candidate box generation, segmentation-based algorithms, and self-training-based algorithms. The core of the first category is an improved candidate box generator in the basic framework, whereas the core of segmentation-based and self-training-based algorithms is an improved detector. The difference between the latter two is that segmentation-based algorithms add a segmentation branch and guide detection through segmentation, whereas self-training-based algorithms optimize the detection network itself. Furthermore, the detection results of various WSOD algorithms are compared under several evaluation metrics through extensive experiments. This study selects and compares current mainstream WSOD algorithms on the PASCAL visual object classes 2007 (VOC2007) and VOC2012 datasets. To ensure a fair comparison, all algorithms use the 16-layer Visual Geometry Group network (VGG16) pretrained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset as the backbone for feature extraction. Moreover, only the performance of the model itself is evaluated without considering the 
effect of fully supervised models, such as Fast R-CNN. In the mean average precision (mAP) comparison on the VOC2007 dataset, multiple instance self-training (MIST) performs best, with a single model obtaining 54.9% mAP. The mAP of existing advanced WSOD algorithms lies between 50% and 60%. Compared with the online instance classifier refinement (OICR) algorithm, which is often used as the baseline method, MIST improves mAP by less than 15 percentage points. This finding indicates that the field still has considerable room for improvement. The comparison of mAP and correct localization (CorLoc) on the VOC2012 dataset indicates that negative deterministic information weakly supervised object detection (NDI-WSOD) achieves the best mAP, reaching 53.9%, which is 16 percentage points higher than OICR. The best algorithm for CorLoc is the pyramidal multiple instance detection network (P-MIDN), which reaches 73.3%, 11.2 percentage points higher than OICR. In addition, various algorithms are compared on the Microsoft common objects in context (MS COCO) dataset. The algorithm with the highest ValAP50, the average precision on the validation set at an intersection over union (IoU) threshold of 50%, is again P-MIDN, which achieves 27.4%. MIST combines optimized pseudo-label generation, a regularization technique, and bounding box regression in the self-training process, and thus continues to compare favorably with its competitors on different datasets. Research on WSOD algorithms based on image-level labeling has made great breakthroughs thanks to the vigorous development of deep learning. However, WSOD still faces many challenges, and a clear gap remains between it and fully supervised object detection. Finally, some valuable future research directions in this field are discussed: 1) generating fewer but higher-quality candidate boxes, 2) designing a reasonable and efficient cooperative framework for detection and segmentation, 3) designing a reasonable strategy, or relying on the network itself, to mine more and better positive samples, and 
4) designing lightweight network models that can be applied to mobile terminals.
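The multiple instance learning view described in this abstract, where candidate boxes are instances and the image is a bag, can be sketched as follows. This is a simplified illustration and not the method of any surveyed algorithm: it assumes per-box class scores in (0, 1) are already available, aggregates them with a plain max over boxes (one classic choice among several), and uses hypothetical function names throughout.

```python
import math


def image_scores(box_scores):
    """Bag-level score per class: the maximum over all candidate boxes.

    box_scores[i][c] is the score of candidate box i for class c.
    An image is treated as positive for class c if at least one box is.
    """
    num_classes = len(box_scores[0])
    return [max(row[c] for row in box_scores) for c in range(num_classes)]


def bag_loss(box_scores, image_labels, eps=1e-7):
    """Binary cross-entropy between bag scores and image-level labels (0 or 1)."""
    loss = 0.0
    for p, y in zip(image_scores(box_scores), image_labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


def localize(box_scores, boxes, cls):
    """At test time, return the highest-scoring candidate box for class cls."""
    best = max(range(len(boxes)), key=lambda i: box_scores[i][cls])
    return boxes[best]


# Illustrative data only: three candidate boxes, two classes, class 0 present.
boxes = [(10, 10, 60, 80), (0, 0, 30, 30), (40, 20, 90, 70)]
scores = [[0.9, 0.1], [0.2, 0.3], [0.4, 0.05]]
labels = [1, 0]
loss = bag_loss(scores, labels)
picked = localize(scores, boxes, 0)
```

Because only the image-level label supervises training, the loss rewards whichever box scores highest, which is one way to see how the local dominance problem named above can arise: a box covering only the most discriminative part of an object can satisfy the bag label.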