Published: 2019-08-16
DOI: 10.11834/jig.180475
2019 | Volume 24 | Number 8

Image Analysis and Recognition

基于视觉感知的智能扫地机器人的垃圾检测与分类

宁凯¹, 张东波^1,2, 印峰^1,2, 肖慧辉¹

1. 湘潭大学信息工程学院, 湘潭 411105;

2. 机器人视觉感知与控制技术国家工程实验室, 长沙 410012

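Received: 2018-08-07; Revised: 2019-03-17

Funding: National Natural Science Foundation of China (61602397); Natural Science Foundation of Hunan Province (2017JJ2251, 2017JJ3315)

First author: Ning Kai, male, born in 1993, master's student. His main research interests are pattern recognition and image processing. E-mail:511602130@qq.com;
Yin Feng, male, lecturer. His main research interests are robotics and swarm intelligence algorithms and their applications. E-mail:433649161066@worlduc.com;
Xiao Huihui, female, master's student. Her main research interests are pattern recognition and image processing. E-mail:945773998@qq.com.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2019)08-1358-11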

摘要

目的为了提高扫地机器人的自主性和智能化程度，为扫地机器人配备视觉传感器，使其获得视觉感知能力，通过研究有效的垃圾检测分类模型与算法，实现对垃圾的定位与识别，引导扫地机器人对垃圾进行自动识别与按类处理，提高工作的目的性和效率，避免盲动和减少能耗。方法选择检测速度较快的YOLOv2作为主网络模型，结合密集连接卷积网络，嵌入深层密集模块，对YOLOv2进行改进，提出一种YOLOv2-dense网络，该网络可以充分利用图像的高分辨率特征，实现图像浅层和深层特征的复用与融合。结果测试结果表明，智能扫地机器人使用本文方法可以有效识别不同形态的常见垃圾类别，在真实场景中，测试识别准确率为84.98%，目标检测速度达到26帧/s。结论实验结果表明，本文构建的YOLOv2-dense网络模型具有实时检测的速度，并且在处理具有不同背景、光照、视角与分辨率的图片时，表现出较强的适应和识别性能。在机器人移动过程中，可以保证以较高的准确率识别出垃圾的种类，整体性能优于原YOLOv2模型。

关键词

YOLOv2网络; 扫地机器人; 密集连接; 神经网络; 深度学习

Garbage detection and classification of intelligent sweeping robot based on visual perception

Ning Kai¹, Zhang Dongbo^1,2, Yin Feng^1,2, Xiao Huihui¹

1. College of Information Engineering, Xiangtan University, Xiangtan 411105, China;

2. Robot Visual Perception & Control Technology National Engineering Laboratory, Changsha 410012, China

Supported by: National Natural Science Foundation of China (61602397)

Abstract

Objective Home service robots have attracted wide attention in recent years because they are closely tied to people's daily lives. Sweeping robots were the first home service robots to enter the consumer market and are now widely available. Current intelligent sweeping robots offer only basic functions such as automatic path planning, automatic charging, and automatic obstacle avoidance. These functions greatly reduce the housework burden, which is an important reason for their market acceptance. Nevertheless, their level of intelligence remains low, and the main shortcomings are twofold. First, they lack high-level perception and discrimination of the environment. The behavior pattern adopted during cleaning is usually a random walk, and although some more capable robots support simple path planning such as a "Z"-shaped path, cleaning is still essentially "blind": the robot performs cleaning whether or not garbage is present along its path. Work efficiency is therefore low and energy consumption is considerably increased. Second, current sweeping robots generally cannot distinguish the category of garbage. Handling garbage according to its category would facilitate sorting and meet environmental protection requirements. To improve autonomy and intelligence, sweeping robots are equipped with visual sensors to give them visual perception. By studying effective models and algorithms for garbage detection and classification, garbage can be located and recognized, and a sweeping robot can be guided to recognize and handle different types of garbage automatically, which makes its work more purposeful and efficient, avoids blind operation, and reduces unnecessary energy consumption.

Method The proposed garbage detection and classification method is based on machine vision and aims to improve the autonomy of the sweeping robot. YOLOv2, a regression-based detector with fast detection speed, is selected as the main network. It is combined with the densely connected convolutional network (DenseNet) to make full use of high-resolution features. By embedding deep dense modules, shallow and deep features are reused and fused, which reduces the loss of feature information. A garbage detection and classification model is then built by training the detection network with a multiscale strategy. A self-built sample dataset is used for training and testing. To expand the sample size, the image data augmentation tool ImageDataGenerator provided by Keras is used to perform horizontal mirror flipping, random rotation, cropping, zooming, and other operations. The YOLOv2-dense network is trained as follows: first, the original images are augmented, and the original and augmented images are manually annotated; then, YOLOv2-dense is trained on the dataset to obtain the trained model. To adapt to large dynamic changes in object scale while the mobile robot moves, a multiscale training strategy is adopted. The mobile robot experiment platform uses an omnidirectional Anycbot four-wheel-drive robot chassis, an AVR motion controller, and an NVIDIA Jetson TX2 with a 256-core GPU as the visual processor.

Result The detection network is trained and tested on the self-built dataset. Test results show that the improved network structure retains more of the original image information and extracts target features more effectively. The method effectively identifies common types of garbage, including liquids and solids in different forms, and marks their locations quickly and accurately. Real-time detection at different angles and distances also gives good results. The accuracy of YOLOv2 is 82.36% at 27 frames per second. By contrast, the proposed YOLOv2-dense reaches an accuracy of 84.98%, which is 2.62 percentage points higher than that of YOLOv2, at 26 frames per second, which satisfies real-time detection.

Conclusion The experimental results show that the YOLOv2-dense network model built in this study detects in real time and shows strong adaptability and recognition performance on images with different backgrounds, illumination, viewing angles, and resolutions. For the same object, the detection result varies considerably with angle and scale, and the detection of distant objects is poor. For 3D objects, the viewing angle has a large impact on detection. Object detection results change dynamically while the robot moves. The proposed method accurately identifies the type of garbage most of the time, or within a certain range of distances and viewing angles, and its overall performance is better than that of the original YOLOv2 model.

Key words

YOLOv2 network; sweeping robot; dense connectivity pattern; neural networks; deep learning

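0 Introduction

In recent years, home service robots have attracted wide attention. Among them, sweeping robots were the first to be industrialized and have already entered the consumer market on a large scale. Although sweeping robots currently on the market provide basic functions such as path planning, automatic charging, and obstacle avoidance, their level of intelligence is low, mainly in two respects. First, they lack high-level perception and discrimination of the environment. For example, cleaning usually follows a random-walk mode; simple path planning can be added^[1], but on the whole the cleaning process is blind: the robot performs the cleaning action whether or not there is garbage along its path. Effectiveness and efficiency are therefore low, and unnecessary energy consumption increases considerably. Second, they cannot distinguish garbage by category and handle it accordingly. In fact, different categories of garbage call for different handling. Category-specific handling facilitates sorting, meets environmental protection requirements, and can greatly strengthen the cleaning capability of sweeping robots.

A feasible solution to these problems is to equip the sweeping robot with a visual sensor^[2] so that it gains visual perception^[3], and to use detection and classification models and algorithms to locate and recognize garbage automatically^[4-5]. The robot can then be guided to clean autonomously and intelligently, which makes its work far more purposeful and efficient, avoids blind operation, and reduces energy consumption. To achieve this, accurate garbage detection and classification is the key problem that must be solved.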

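To our knowledge, no work on autonomous garbage detection and classification by sweeping robots has been publicly reported. Garbage detection and classification is an object detection and classification problem in computer vision. In object detection, hand-crafted invariant local feature descriptors^[6], such as the classic SIFT (scale-invariant feature transform)^[7], SURF (speeded-up robust features)^[8], ORB (oriented FAST and rotated BRIEF)^[9], and HOG (histogram of oriented gradients)^[10], have commonly been used in recent years to deal with occlusion, complex backgrounds, and similar problems. However, hand-crafted features have poor robustness, and the resulting algorithms adapt poorly. Methods based on deep convolutional neural networks (CNNs)^[11-12] can improve object detection performance substantially^[13]. The early Region-CNN (R-CNN) proposed by Girshick et al.^[14] detects objects with region proposals, but every region proposal has to be fed into the CNN, so the computational cost is very high. To improve efficiency, Girshick^[15] proposed Fast Region-CNN (Fast R-CNN) on the basis of R-CNN, which removes the repeated computation of R-CNN; it also fine-tunes the positions of candidate boxes, and both its training and testing times are reduced. Ren et al.^[16] then proposed Faster Region-CNN (Faster R-CNN), which introduces a region proposal network (RPN) to extract proposals and achieves truly end-to-end network learning, but its speed still cannot meet real-time requirements. To achieve real-time detection, Redmon et al.^[17] proposed the regression-based YOLO (you only look once) model, which unifies classification and localization in a single network and detects in real time, but its accuracy is lower than that of Fast R-CNN and Faster R-CNN. In the same year, Liu et al.^[18] proposed SSD (single shot multibox detector), which combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN and balances speed and accuracy. Redmon et al.^[19] then proposed the improved YOLOv2 (you only look once version 2), which further raises the detection accuracy of YOLO, reaching 76.8% on the PASCAL VOC2007 dataset at 67 frames/s on a GeForce GTX Titan X. As these network models become deeper, the data information is likely to fade gradually after passing through many layers, and the vanishing-gradient problem becomes more pronounced. To reuse features across layers and alleviate the vanishing-gradient problem, this paper improves the existing YOLOv2 model by combining it with the densely connected convolutional network^[20], which is expected to further improve network performance.

To detect garbage quickly and accurately in a home environment, this paper selects YOLOv2, a fast regression-based detection network. By combining it with the densely connected convolutional network (DenseNet), an improved YOLOv2-dense network structure is proposed. Data augmentation and a multiscale training strategy are also adopted, which raises detection accuracy while keeping real-time detection speed. The results are expected to provide technical support for developing highly intelligent sweeping robots and to raise the intelligence of future sweeping robots.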

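1 Experimental dataset

1.1 Garbage categories and handling modes

Household garbage comes in many kinds, with large differences in color, shape, and size, and a reasonable categorization is a prerequisite for intelligent robot operation. To simplify the problem and make the functions of the sweeping robot easier to realize, 25 of the most common categories of garbage were selected according to form, volume, and other factors, and sample images of these 25 categories were collected in different environments.

By form, common garbage is divided into solids and liquids. Solid garbage is handled in one of two modes according to volume: 1) Sweeping mode, for 7 categories of small objects, including nutshells, toilet paper, paper scraps, and other small objects. 2) Grasping mode, for 9 categories of large objects, including cans, waste paper balls, bottles, packaging bags, paper boxes, paper cups, plastic cups, and plastic bags. Fig. 1 shows the categorization of solid garbage.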

图 1 固体类垃圾类别及处理方式

Fig. 1 Solid waste classification diagram

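Liquid garbage is divided into two groups by handling mode: 1) Wiping mode, for 6 categories: transparent liquids such as liquor, and non-transparent liquids including milk, orange juice, watermelon juice, cola, and other colored drinks. 2) High-temperature steam plus wiping mode, for 3 categories of stains such as ink and oil. Transparent liquids on the floor are hard to recognize because they are transparent and have no definite shape, which makes feature extraction difficult. Fig. 2 shows the categorization of liquid garbage.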

图 2 液体类垃圾类别及处理方式

Fig. 2 Liquid waste classification diagram
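The four handling modes of Section 1.1 can be sketched as a lookup from a detected class label to a mode. This is only an illustration: the English label strings and mode names are hypothetical, and only the categories explicitly named in the text are listed (the paper's full set has 25 classes).

```python
# Handling modes from Section 1.1. Only categories named in the text appear;
# the label strings and mode names are hypothetical, chosen for illustration.
HANDLING_MODE = {
    # small solid objects -> sweeping mode
    "nutshell": "sweep",
    "toilet_paper": "sweep",
    "paper_scrap": "sweep",
    # large solid objects -> grasping mode
    "can": "grasp",
    "waste_paper_ball": "grasp",
    "bottle": "grasp",
    "packaging_bag": "grasp",
    "paper_box": "grasp",
    "paper_cup": "grasp",
    "plastic_cup": "grasp",
    "plastic_bag": "grasp",
    # liquids -> wiping mode
    "liquor": "wipe",
    "milk": "wipe",
    "orange_juice": "wipe",
    "watermelon_juice": "wipe",
    "cola": "wipe",
    "other_colored_drink": "wipe",
    # stains -> high-temperature steam plus wiping mode
    "ink": "steam_and_wipe",
    "oil_stain": "steam_and_wipe",
}


def handling_mode(label: str) -> str:
    """Return the handling mode for a class label, or 'unknown' if unlisted."""
    return HANDLING_MODE.get(label, "unknown")


print(handling_mode("can"))  # grasp
print(handling_mode("ink"))  # steam_and_wipe
```

A table like this keeps the detector and the cleaning logic decoupled: the network only has to output a class label, and the mode is resolved afterwards.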

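1.2 Dataset acquisition

Because no public dataset exists, the experimental data were captured with a camera in real home environments and then organized; they include videos and single images. Video frames usually contain several different objects, and 1 241 image samples were extracted from them; single images usually contain only one category of object, and 1 133 image samples were taken from them. Acquisition covered a variety of viewing angles, distances, and lighting conditions. Fig. 3 shows examples of the collected samples for each category.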

图 3 各类垃圾样本图片示例

Fig. 3 Examples of various types of garbage samples

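To enlarge the training set, the existing samples were augmented. The image data augmentation tool ImageDataGenerator provided by Keras was used to apply horizontal mirror flipping, random rotation, cropping, zooming, and similar operations to the training samples, giving 5 440 images in total, of which 80% were used as the training set and 20% as the test set. Fig. 4 shows examples of the augmentation. All augmented images were annotated with the LabelImg tool: the rectangular region of each real garbage object was marked, and the corresponding coordinates were generated and saved.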
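The 80%/20% split of the 5 440 images can be sketched as follows. The paper does not say how the split was drawn, so the seeded random shuffle and the image identifiers below are assumptions made only for illustration.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into (train, test) lists.

    The shuffle-based split is an assumption; the paper only gives the ratio.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical image identifiers standing in for the 5 440 annotated images.
image_ids = [f"img_{i:04d}" for i in range(5440)]
train_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(test_ids))  # 4352 1088
```

With the stated ratio, the training set holds 4 352 images and the test set 1 088.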

图 4 数据增强处理示例

Fig. 4 Example of data augmentation ((a)original; (b)random rotation and mirroring Ⅰ; (c)random rotation and mirroring Ⅱ)

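2 YOLOv2-dense detection model

2.1 YOLOv2 network

The YOLO model borrows the GoogLeNet network structure and learns end to end. It treats detection as regression: candidate regions are extracted directly from the whole image, and the positions and classes of the candidate boxes are output by regression. Fig. 5 shows the detection process. The image is first divided into a grid of $C×C$ cells. If the center of an object falls inside a cell, that cell is responsible for detecting the object. Each cell predicts $B$ bounding boxes, and each box is described by the 5-tuple $T(x, y, w, h, confidence)$: the center coordinates $(x, y)$, the width and height $(w, h)$, and the confidence $confidence$. YOLO detects objects in real time, but its detection accuracy is not high.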
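The rule "the cell containing the object center is responsible for the object" can be sketched as below. The function name and the convention that center coordinates are normalized to the image size are assumptions for illustration; the grid size 13 matches the 13×13 output listed in Table 1 for a 416×416 input.

```python
def responsible_cell(x_center, y_center, grid_size):
    """Return (row, col) of the grid cell that contains the object center.

    x_center and y_center are assumed to be normalized to [0, 1] by the
    image width and height; a center exactly on the right or bottom edge
    is clamped to the last cell.
    """
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


# An object centered at 50% of the width and 25% of the height, 13x13 grid.
print(responsible_cell(0.5, 0.25, 13))  # (3, 6)
```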

图 5 YOLO检测流程

Fig. 5 Detection process of YOLO

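YOLOv2 is an improved version of the YOLO detection model. It contains 22 convolutional layers and 5 pooling layers; the convolutional layers extract image features, and the pooling layers reduce feature dimensionality and compress the number of parameters and the amount of data. YOLOv2 removes the dropout layer at the final stage and adds batch normalization after the convolutional layers to avoid overfitting. The last 1×1 convolutional layer of the YOLO network is removed, three 3×3×1 024 convolutional layers are added, and a final 1×1 convolutional layer is attached whose output size is determined by the number of classes.

Fig. 6 shows the YOLOv2 network structure. Like YOLO, YOLOv2 divides the input image into $C×C$ cells; if the center of an object falls into a cell, that cell is responsible for detecting the object. Each bounding box is again described by the 5-tuple $T(x, y, w, h, confidence)$, where $x$ and $y$ are the horizontal and vertical coordinates of the center of the box predicted by the current cell, $w$ and $h$ are the width and height of the box, and $confidence$ is the confidence, which reflects both whether the box contains an object and how accurate the predicted box is. It is computed as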

$ confidence=\text{P}\left( object \right)\times IOU_{\text{pred}}^{\text{truth}}=\begin{cases} IOU_{\text{pred}}^{\text{truth}}, & \text{object present} \\ 0, & \text{no object} \end{cases} $

(1)

图 6 YOLOv2网络

Fig. 6 YOLOv2 network

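where $\text{P}\left(object \right)$ is the probability that the bounding box contains an object, and $IOU_{{\rm{pred}}}^{{\rm{truth}}}$ is the intersection over union of the predicted box and the ground-truth object region, that is, the area of their overlap divided by the area of their union. If an object is present in the cell, $\text{P}\left(object \right)$ is 1 and the confidence $confidence$ equals $IOU_{{\rm{pred}}}^{{\rm{truth}}}$; otherwise ${\rm{P}}\left({object} \right)$ is 0 and the confidence $confidence$ is 0. At test time, the class-specific confidence score of a candidate box is computed by Eq. (2).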

$ \text{P}\left( class_{i} \mid object \right)\times \text{P}\left( object \right)\times IOU_{\text{pred}}^{\text{truth}}=\text{P}\left( class_{i} \right)\times IOU_{\text{pred}}^{\text{truth}} $

(2)

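The value of Eq. (2) reflects both the probability that the class appears in the candidate box and how well the candidate box fits the target object.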
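Eqs. (1) and (2) can be illustrated with a small sketch. It assumes boxes given by their corner coordinates `(x_min, y_min, x_max, y_max)` rather than the center/width/height form used in the text, and the function names and example class probabilities are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_confidence(p_object, box_pred, box_truth):
    """Eq. (1): confidence = P(object) * IOU(pred, truth)."""
    return p_object * iou(box_pred, box_truth)


def class_scores(class_probs, confidence):
    """Eq. (2): P(class_i | object) * confidence, for every class."""
    return {name: p * confidence for name, p in class_probs.items()}


pred = (0.0, 0.0, 2.0, 2.0)
truth = (1.0, 0.0, 3.0, 2.0)
conf = box_confidence(1.0, pred, truth)  # overlap 2, union 6
# Hypothetical conditional class probabilities for one box.
scores = class_scores({"can": 0.8, "bottle": 0.15, "paper_cup": 0.05}, conf)
best_class = max(scores, key=scores.get)
```

In this example the two boxes overlap on half of each box, so the IoU is 1/3, and the highest class-specific score belongs to the class with the largest conditional probability.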

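To improve detection accuracy, YOLOv2 adopts a series of improvements, including batch normalization, a high-resolution classifier, an anchor mechanism, dimension clustering, direct location prediction, fine-grained features, and multiscale training. To improve speed, YOLOv2 uses the relatively simple Darknet-19 network. To further improve classification, joint training combined with methods such as WordTree can be used to extend the range of categories that YOLOv2 detects.

2.2 YOLOv2-dense network

After an image passes through the 20 convolutional layers and 5 pooling layers of the YOLOv2 network structure for feature extraction, the deep layers make almost no use of shallow-layer information. The high-resolution shallow features are therefore heavily underused, and the features on the corresponding feature maps are often hard to train adequately, which hurts detection accuracy. To make full use of high-resolution features, deep dense modules are embedded to reuse and fuse features: this paper introduces the densely connected convolutional network (DenseNet) to improve the YOLOv2 structure and proposes the YOLOv2-dense network shown in Fig. 7. The dashed part of Fig. 7 illustrates the dense block in YOLOv2-dense, which operates as follows: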

图 7 YOLOv2-dense网络

Fig. 7 YOLOv2-dense network

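1) In the YOLOv2-dense network, the feature maps of layer 21, $\boldsymbol{x}_{0}$, are the input of $H_{1}$. After batch normalization (BN) and the ReLU (rectified linear unit) activation, 256 1×1 kernels produce 256 feature maps; after another BN and ReLU, 128 3×3 kernels produce 128 feature maps $\boldsymbol{x}_{1}$. $\boldsymbol{x}_{0}$ and $\boldsymbol{x}_{1}$ are then concatenated into 640 feature maps, and [$\boldsymbol{x}_{0}$, $\boldsymbol{x}_{1}$] is the input of $H_{2}$.

2) In $H_{2}$, the input again passes through BN and ReLU, and 256 1×1 kernels produce 256 feature maps; after BN and ReLU, 128 3×3 kernels produce 128 feature maps $\boldsymbol{x}_{2}$. $\boldsymbol{x}_{0}$, $\boldsymbol{x}_{1}$, and $\boldsymbol{x}_{2}$ are then concatenated into 768 feature maps, and [$\boldsymbol{x}_{0}$, $\boldsymbol{x}_{1}$, $\boldsymbol{x}_{2}$] is the input of $H_{3}$.

3) The process continues in the same way until a deep feature map of 13×13×1 024 is obtained.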
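The channel counts in the steps above can be checked with a short sketch. It assumes that the block input has 512 channels (implied by 640 − 128 and by the 13×13×512 row preceding the dense block in Table 1) and that every dense layer adds 128 feature maps; the function name is hypothetical.

```python
def dense_block_widths(in_channels=512, growth=128, out_channels=1024):
    """Channel count of the concatenated feature map after each dense layer.

    Starts from the block input and adds `growth` channels per layer until
    the target width is reached.
    """
    if growth <= 0:
        raise ValueError("growth must be positive")
    widths = [in_channels]
    while widths[-1] < out_channels:
        widths.append(widths[-1] + growth)
    return widths


widths = dense_block_widths()
print(widths)           # [512, 640, 768, 896, 1024]
print(len(widths) - 1)  # 4 dense layers
```

Under these assumptions, four dense layers are needed to go from 512 to 1 024 channels, and the intermediate widths 640 and 768 match steps 1) and 2).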

DenseNet lets the input of layer $l$ directly affect all subsequent layers. Its output is

$ \boldsymbol{x}_{l}=H_{l}\left(\left[\boldsymbol{x}_{0}, \boldsymbol{x}_{1}, \cdots, \boldsymbol{x}_{l-1}\right]\right), {l}=1, 2, 3, 4, \cdots $

(3)

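where $\boldsymbol{x}_{0}$ is the input feature map of the module, $\boldsymbol{x}_{l}$ is the output of layer $l$, and $\left[\boldsymbol{x}_{0}, \boldsymbol{x}_{1}, \cdots, \boldsymbol{x}_{l-1}\right]$ denotes the concatenation of $\boldsymbol{x}_{0}, \boldsymbol{x}_{1}, \cdots, \boldsymbol{x}_{l-1}$. ${H_l}$(·) is a composite of batch normalization (BN), the ReLU activation, and convolution (Conv) that performs the nonlinear transformation of layer $l$. In this paper, $H_{l}$(·) is BN→ReLU→Conv(1×1)→BN→ReLU→Conv(3×3), where Conv($n$×$n$) denotes a convolution with kernel size $n$×$n$. Because each layer receives the outputs of all preceding layers, the vanishing-gradient problem that comes with increasing depth in deep convolutional neural networks is largely alleviated, and detection improves.

3 Experimental results and analysis

3.1 Robot experiment platform

The mobile robot platform built for the experiments consists of an omnidirectional Anycbot four-wheel-drive robot chassis, an AVR motion controller, an NVIDIA Jetson TX2 visual processor with a 256-core GPU, an on-board camera with its mounting bracket, a Bluetooth communication module, and an ultrasonic obstacle-avoidance module. The software includes the Ubuntu 16.04 operating system, ROS (robot operating system), OpenCV 3.2, TensorFlow 0.9, Caffe, and darknet. The model training server uses GTX 1080 Ti GPUs. Fig. 8 shows the overall workflow of the system.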

图 8 扫地机器人工作流程

Fig. 8 Sweeping robot workflow

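The dataset is first fed to the server, where the YOLOv2-dense network is trained to obtain the garbage detection and classification model. The trained model is then deployed on the sweeping robot and the on-board camera is turned on. When garbage is detected at some location in the camera view, the robot is guided to that location and carries out the cleaning mode given by the system.

3.2 YOLOv2-dense network training

The YOLOv2-dense network is trained as follows: 1) augment the original images to obtain the dataset; 2) manually annotate the original images and the augmented images; 3) train YOLOv2-dense on the dataset; 4) obtain the trained model. Because the number of original samples is limited, the dataset has to be augmented, after which all images are annotated by hand. To cope with the large dynamic changes in object scale while the mobile robot moves, a multiscale training strategy is used, and the detection model is finally obtained.

The training platform consists of two GeForce GTX 1080 Ti GPUs (11 GB each), an Intel Core i7-7700K CPU at 4.5 GHz, 32 GB of RAM, the Ubuntu 16.04 operating system, and the darknet framework.

Following the multiscale training strategy, the input size of the model is changed every 10 batches during training, which makes the model robust to images of different sizes. The size is computed as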

$ S=32(7+u) $

(4)

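where $S$ is the side length of the input image in pixels and $u$ is a random integer from 3 to 12, so $S$ ranges from 320 to 608 in steps of 32. Multiscale training forces the model to adapt to different input resolutions. Compared with a model trained at a fixed resolution, a model trained this way detects well at different resolutions.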
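Eq. (4) and the "every 10 batches" rule can be sketched as follows. This is a minimal illustration, not the darknet implementation: the function names, the seeded random generator, and the choice to draw a first size at batch 0 are assumptions.

```python
import random


def input_size(u):
    """Eq. (4): S = 32 * (7 + u)."""
    return 32 * (7 + u)


# All sizes reachable for u = 3, ..., 12.
ALL_SIZES = [input_size(u) for u in range(3, 13)]


def size_schedule(num_batches, interval=10, seed=0):
    """Input size used for each batch, re-drawn every `interval` batches."""
    rng = random.Random(seed)
    current = input_size(rng.randint(3, 12))
    sizes = []
    for batch in range(num_batches):
        if batch > 0 and batch % interval == 0:
            current = input_size(rng.randint(3, 12))
        sizes.append(current)
    return sizes


print(ALL_SIZES)  # [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
schedule = size_schedule(30)
```

Because every size is a multiple of 32 and the network downsamples by a factor of 32, the output grid always has an integer side length, from 10×10 at 320 pixels to 19×19 at 608 pixels.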

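Training parameters: learning rate 0.001; learning-rate policy steps; batch size 64; maximum number of iterations 80 200; learning-rate scale factor 0.1; momentum 0.9; weight-decay regularization coefficient 0.000 5. The learning rate is adjusted at 20 000, 40 000, and 60 000 iterations; it is 10^-2 at the start of training, is changed to 10^-3 after 2 000 iterations, and to 10^-4 after 60 000 iterations. Table 1 lists the network structure.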

表 1 YOLOv2-dense网络参数
Table 1 YOLOv2-dense network parameters

Operation	Output feature maps	Kernel size/stride	Output size/pixels
Conv	32	3×3	416×416×32
Pooling		2×2/2	208×208×32
Conv	64	3×3	208×208×64
Pooling		2×2/2	104×104×64
Conv	128	3×3	104×104×128
Conv	64	1×1	104×104×64
Conv	128	3×3	104×104×128
Pooling		2×2/2	52×52×128
Conv	256	3×3	52×52×256
Conv	128	1×1	52×52×128
Conv	256	3×3	52×52×256
Pooling		2×2/2	26×26×256
Conv	512	3×3	26×26×512
Conv	256	1×1	26×26×256
Conv	512	3×3	26×26×512
Conv	256	1×1	26×26×256
Conv	512	3×3	26×26×512
Pooling		2×2/2	13×13×512
Conv	1 024	3×3	13×13×1 024
Conv	512	1×1	13×13×512
Conv	1 024	3×3	13×13×1 024
Conv	512	1×1	13×13×512
Dense block	256, 128	1×1/1, 3×3/1	13×13×1 024
Conv	1 024	3×3	13×13×1 024
Conv	1 024	3×3	13×13×1 024
Fusion			26×26×512
Conv	64	3×3	26×26×64
Reorg			13×13×256
Fusion			13×13×1 280
Conv	1 024	3×3	13×13×1 024
Conv	150	1×1	13×13×150
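Two numbers in Table 1 can be cross-checked with a short calculation: the 13×13 output grid and the 150 output channels. The sketch assumes 5 anchor boxes per cell, which is the YOLOv2 default and is not stated in the paper; with 25 classes it reproduces the 150 channels of the last row. The function names are hypothetical.

```python
NUM_CLASSES = 25  # garbage categories from Section 1.1
NUM_ANCHORS = 5   # assumption: YOLOv2 default, not stated in the paper
NUM_POOLS = 5     # 2x2/2 pooling rows in Table 1


def output_channels(num_classes, num_anchors):
    """Each anchor predicts (x, y, w, h, confidence) plus one score per class."""
    return num_anchors * (num_classes + 5)


def grid_side(input_size, num_pools=NUM_POOLS):
    """Every 2x2/2 pooling layer halves the spatial resolution."""
    return input_size // (2 ** num_pools)


print(output_channels(NUM_CLASSES, NUM_ANCHORS))  # 150
print(grid_side(416))                             # 13
```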

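3.3 Detection examples and analysis

Figs. 9 and 10 show some detection results of the improved YOLOv2-dense network on test images. Fig. 9 shows single-object detection and Fig. 10 shows multi-object detection.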

图 9 单物体检测

Fig. 9 Single object detection

图 10 多物体检测

Fig. 10 Multi-object detection

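As Figs. 9 and 10 show, the detection accuracy of YOLOv2-dense is high overall, but some recognition errors remain. Fig. 11 gives examples of images that were recognized incorrectly.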

图 11 识别出错的情况

Fig. 11 Examples of recognition errors ((a) picture 1; (b) picture 2; (c) picture 3; (d) picture 4; (e) picture 5)

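In Fig. 11(a), two bottles of different shapes are detected at the same time and a plastic bottle is misjudged as a plastic bag. In Fig. 11(b), a plastic bottle is misjudged as a can. In Fig. 11(c), a waste paper ball and toilet paper are detected, but the toilet paper is misjudged as milk. In Fig. 11(d)(e), watermelon juice is missed. These results indicate that the sample set is not large enough and the adaptability of the model still needs improvement, which leads to missed detections and misjudgments. In addition, liquids are highly irregular in shape, and some solid objects (such as nutshells) are irregular as well; such objects are hard to detect and are prone to missed detections and misjudgments.

Fig. 12 shows the results of the improved YOLOv2-dense network on real-time video detection.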

图 12 实时检测情况

Fig. 12 Real-time detection((a)video1; (b)video2; (c)video3)

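Video 1 (Fig. 12(a)) contains two plastic cups and one paper cup; because of reflections on the floor, the floor is misjudged as a plastic cup. Video 2 (Fig. 12(b)) contains liquor, which goes undetected at some moments. Video 3 (Fig. 12(c)) contains another colored drink and orange juice; the colored drink is misjudged as cola, and the orange juice is missed at times. Table 2 summarizes the results for the three real-time videos.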

表 2 实时视频检测统计结果
Table 2 Real-time video detection statistics

Video	Total frames	Correctly detected frames	Misjudged frames	Missed frames	Accuracy/%
Video 1	1 000	950	50	30	95.0
Video 2	1 400	850	550	32	60.1
Video 3	1 100	753	347	39	68.3

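These results show that, in real-time detection, floor reflections easily cause misjudgments and degrade detection performance. Some objects are very similar to each other, for example another colored drink (coffee) and cola, and are easily confused.

Fig. 13 shows real-time detection results of the improved YOLOv2-dense network at different viewing angles and distances.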

图 13 不同角度与距离检测情况

Fig. 13 Detection of different angles and distances ((a)video4; (b)video5; (c)video6)

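Video 4 (Fig. 13(a)) contains a can. When only the bottom of the can is in view, the can cannot be detected; after the viewing angle and distance are changed, it is detected, and the accuracy reaches 80.9%. Video 5 (Fig. 13(b)) contains a bottle. As in video 4, detection is poor when the camera faces the bottom or the mouth of the bottle; after the viewing angle and distance are changed, the target is detected, and the accuracy reaches 77.4%. Video 6 (Fig. 13(c)) is a long-distance observation: the can cannot be detected when the camera is far away, and it is detected once the camera moves closer. Table 3 summarizes the results for the three videos.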

表 3 不同角度与距离实时检测统计结果
Table 3 Real-time detection of statistical results at different angles and distances

Video	Total frames	Correctly detected frames	Misjudged frames	Missed frames	Accuracy/%
Video 4	2 250	1 821	429	125	80.9
Video 5	1 160	898	262	210	77.4
Video 6	1 370	891	479	301	65.0

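These results show that, for the same object, detection varies considerably with angle and scale, and distant objects are detected poorly. For 3D objects, the viewing angle has a large effect on detection.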
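The accuracy column of Table 3 can be reproduced from the frame counts, assuming accuracy is the number of correctly detected frames divided by the total number of frames. The paper does not state this definition explicitly; it is inferred here because it matches the reported values. The dictionary keys are hypothetical labels.

```python
# (total frames, correctly detected frames) from Table 3.
TABLE3 = {
    "video4": (2250, 1821),
    "video5": (1160, 898),
    "video6": (1370, 891),
}


def accuracy_percent(total, correct):
    """Share of correctly detected frames, in percent, to one decimal place."""
    return round(100.0 * correct / total, 1)


table3_accuracy = {
    name: accuracy_percent(total, correct)
    for name, (total, correct) in TABLE3.items()
}
print(table3_accuracy)  # {'video4': 80.9, 'video5': 77.4, 'video6': 65.0}
```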

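The images of the self-built dataset were normalized to 416×416 pixels, and the detection performance of the original YOLOv2 and the improved YOLOv2-dense was tested. Table 4 gives the results. YOLOv2 achieves an accuracy of 82.36% at 27 frames/s. The proposed YOLOv2-dense achieves 84.98%, which is 2.62 percentage points higher than YOLOv2, at 26 frames/s, which is close to the speed of YOLOv2 and sufficient for real-time detection.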

表 4 网络对比实验结果
Table 4 Network comparison experiment results

Network	Input size/pixels	Maximum iterations	Accuracy/%	Recall/%	Detection speed/(frames/s)
YOLOv2	416×416	80 200	82.36	92.04	27
YOLOv2-dense	416×416	80 200	84.98	95.03	26
Note: bold indicates the best result.
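The differences between the two rows of Table 4 can be worked out directly. The sketch below only restates the table values; the per-frame time is derived as 1000 / (frames per second) and is not a figure reported in the paper, and the function names are hypothetical.

```python
# Values copied from Table 4.
RESULTS = {
    "YOLOv2": {"accuracy": 82.36, "recall": 92.04, "fps": 27},
    "YOLOv2-dense": {"accuracy": 84.98, "recall": 95.03, "fps": 26},
}


def gain(metric):
    """Difference YOLOv2-dense minus YOLOv2, in percentage points."""
    diff = RESULTS["YOLOv2-dense"][metric] - RESULTS["YOLOv2"][metric]
    return round(diff, 2)


def frame_time_ms(fps):
    """Average processing time per frame in milliseconds."""
    return round(1000.0 / fps, 1)


print(gain("accuracy"))   # 2.62
print(gain("recall"))     # 2.99
print(frame_time_ms(27))  # 37.0
print(frame_time_ms(26))  # 38.5
```

Read this way, the dense block costs roughly 1.5 ms per frame while adding 2.62 percentage points of accuracy and 2.99 percentage points of recall.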

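Despite these problems, the video tests show that detection results change dynamically while the robot moves, and the algorithm identifies the type of garbage accurately most of the time or within a particular range of distances and viewing angles.

The proposed YOLOv2-dense recognizes better than YOLOv2 because the improved structure retains more shallow-layer image information and extracts target features more effectively, so the algorithm shows stronger adaptability and recognition performance on images with different illumination, backgrounds, viewing angles, and resolutions.

4 Conclusion

To improve the autonomous cleaning capability of sweeping robots, this paper studied vision-based garbage detection and classification for sweeping robots. On the basis of the YOLOv2 object detection architecture, the dense convolutional block idea of DenseNet was embedded, and an improved YOLOv2-dense network structure was proposed. The method makes fuller use of image feature information and adopts data augmentation and a multiscale training strategy. Experiments show that YOLOv2-dense detects and classifies more accurately than YOLOv2 while keeping real-time detection speed.

As a supervised learning method, YOLOv2 places high demands on the quantity, quality, and diversity of the training dataset. Future work could adopt generative adversarial networks or deep detection networks based on semi-supervised or unsupervised learning, and could analyze object attributes to support decisions such as garbage disposal suggestions and cleaning-mode selection, further improving the autonomous decision-making of sweeping robots.

References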

[1] Zhu D Q, Yan M Z. Survey on technology of mobile robot path planning[J]. Control and Decision, 2010, 25(7): 961–967. [朱大奇, 颜明重. 移动机器人路径规划技术综述[J]. 控制与决策, 2010, 25(7): 961–967. ] [DOI:10.13195/j.cd.2010.07.4.zhudq.014]

[2] Zhang H B, Yuan K, Zhou Q R. Visual navigation of a mobile robot based on path recognition[J]. Journal of Image and Graphics, 2004, 9(7): 853–857. [张海波, 原魁, 周庆瑞. 基于路径识别的移动机器人视觉导航[J]. 中国图象图形学报, 2004, 9(7): 853–857. ] [DOI:10.11834/jig.200407161]

[3] Yang J Y, Ma L, Bai D C, et al. Robot vision environmental perception method based on hybrid features[J]. Journal of Image and Graphics, 2012, 17(1): 114–122. [杨俊友, 马乐, 白殿春, 等. 机器人的混合特征视觉环境感知方法[J]. 中国图象图形学报, 2012, 17(1): 114–122. ] [DOI:10.11834/jig.20120116]

[4] Wang W G, Wu K B, Zhu J, et al. Obstacle detection for robot in unknown environment[J]. Journal of Image and Graphics, 2012, 17(2): 209–214. [王文格, 武凯宾, 朱江, 等. 未知环境下机器人障碍物检测技术[J]. 中国图象图形学报, 2012, 17(2): 209–214. ] [DOI:10.11834/jig.20120208]

[5] Gao Q J, Li J, Ma L, et al. Road crossing scene recognition for robot vision-based location[J]. Journal of Image and Graphics, 2009, 14(12): 2510–2516. [高庆吉, 李娟, 马乐, 等. 机器人视觉定位中的路口场景识别方法研究[J]. 中国图象图形学报, 2009, 14(12): 2510–2516. ] [DOI:10.11834/jig.20091213]

[6] Sun H, Wang C, Wang R S. A review of local invariant features[J]. Journal of Image and Graphics, 2011, 16(2): 141–151. [孙浩, 王程, 王润生. 局部不变特征综述[J]. 中国图象图形学报, 2011, 16(2): 141–151. ] [DOI:10.11834/jig.20110207]

[7] Lowe D G. Object recognition from local scale-invariant features[C]//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE, 1999: 1150-1157.[DOI: 10.1109/ICCV.1999.790410]

[8] Bay H, Ess A, Tuytelaars T, et al. Speeded-up robust features (SURF)[J]. Computer Vision and Image Understanding, 2008, 110(3): 346–359. [DOI:10.1016/j.cviu.2007.09.014]

[9] Rublee E, Rabaud V, Konolige K, et al. ORB: an efficient alternative to SIFT or SURF[C]//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE, 2011: 2564-2571.[DOI: 10.1109/ICCV.2011.6126544]

[10] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA: IEEE, 2005: 886-893.[DOI: 10.1109/CVPR.2005.177]

[11] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: ACM, 2012: 1097-1105.

[12] Zheng Y, Chen Q Q, Zhang Y J. Deep learning and its new progress in object and behavior recognition[J]. Journal of Image and Graphics, 2014, 19(2): 175–184. [郑胤, 陈权崎, 章毓晋. 深度学习及其在目标和行为识别中的新进展[J]. 中国图象图形学报, 2014, 19(2): 175–184. ] [DOI:10.11834/jig.20140202]

[13] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436–444. [DOI:10.1038/nature14539]

[14] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.[DOI: 10.1109/CVPR.2014.81]

[15] Girshick R. Fast R-CNN[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1440-1448.[DOI: 10.1109/ICCV.2015.169]

[16] Ren S Q, He K M, Girshick R, et al. Faster R-CNN:towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. [DOI:10.1109/TPAMI.2016.2577031]

[17] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788.[DOI: 10.1109/CVPR.2016.91]

[18] Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 21-37.[DOI: 10.1007/978-3-319-46448-0_2]

[19] Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6517-6525.[DOI: 10.1109/CVPR.2017.690]

[20] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 2261-2269.[DOI: 10.1109/CVPR.2017.243]