Multi-scale crowd counting via adversarial dilated convolutions
2019, Vol. 24, No. 3, pp. 483-492
Received: 2018-06-20; Revised: 2018-09-03; Published in print: 2019-03-16
DOI: 10.11834/jig.180401
Objective
Crowd density estimation extracts and analyzes crowd features to estimate the density distribution and the crowd count. In existing CNN-based methods, downsampling operations lose part of the crowd information, and averaging-based fusion averages out multi-scale effects, so this strategy does not necessarily yield accurate estimates. To address these problems, we propose a new multi-scale crowd density estimation model based on adversarial dilated convolutions.
Method
Dilated convolutions extract features from the input image without losing resolution, and different dilation coefficients aggregate multi-scale context information. An adversarial loss function then fuses the features extracted at different scales in a collaborative manner, yielding an accurate density estimate.
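As an illustration of the dilation mechanism used here, the following NumPy sketch (illustrative only, not the authors' code) applies a k x k kernel with its taps spaced `dilation` pixels apart, so the receptive field grows while the number of weights stays at k x k:

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=1):
    """Valid 2D dilated convolution (cross-correlation), implemented directly.

    The k x k kernel is applied with taps spaced `dilation` pixels apart,
    so its receptive field is (k-1)*dilation + 1 per side while the number
    of multiplications per output pixel stays at k*k.
    """
    k = kernel.shape[0]
    span = (k - 1) * dilation + 1          # effective receptive field
    h, w = image.shape
    out = np.zeros((h - span + 1, w - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Strided slice picks k x k taps spread over a span x span region.
            patch = image[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(49, dtype=float).reshape(7, 7)
kern = np.ones((3, 3))

# dilation=1 behaves like an ordinary 3x3 convolution (5x5 valid output);
# dilation=2 sees a 5x5 region with the same 9 weights (3x3 valid output).
print(dilated_conv2d(img, kern, dilation=1).shape)  # (5, 5)
print(dilated_conv2d(img, kern, dilation=2).shape)  # (3, 3)
```

Because the number of nonzero taps is unchanged, the computation per output pixel is the same at every dilation rate; only the spatial extent the kernel covers grows.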
Result
Comparative experiments are conducted on four major crowd counting datasets. At test time, a test image is fed into the trained generator network, which outputs a predicted density map; integrating (summing) the density map gives the total count, and results are compared using mean absolute error (MAE) and mean squared error (MSE). On the ShanghaiTech dataset, MAE and MSE drop to 60.5 and 109.7 on Part_A and to 10.2 and 15.3 on Part_B, a clear improvement.
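The evaluation protocol described above can be sketched in a few lines (a minimal illustration, not the authors' evaluation script; note that crowd-counting papers conventionally report the root of the mean squared count error as "MSE"):

```python
import numpy as np

def count_from_density(density_map):
    # The predicted crowd count is the integral (sum over all pixels)
    # of the predicted density map.
    return float(density_map.sum())

def mae_mse(pred_counts, gt_counts):
    """MAE and MSE over per-image counts, with "MSE" as the root of the
    mean squared error, following crowd-counting convention."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = float(np.mean(np.abs(pred - gt)))
    mse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return mae, mse

# Toy check: two images with predicted counts 10 and 20 vs. ground truth 12 and 16.
print(mae_mse([10, 20], [12, 16]))  # (3.0, ~3.16)
```

A density map whose pixels sum to N therefore predicts N people, which is why the map can be compared against a ground-truth head count directly.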
Conclusion
This paper proposes a new multi-scale crowd density estimation model based on adversarial dilated convolutions. Experimental results show that the model adapts well to scenes with widely varying crowd distributions: it extracts features and estimates the density distribution for different scenes, and counts crowds accurately.
Objective
Crowd counting is the task of estimating the count and density distribution of a crowd by extracting and analyzing crowd features. A common strategy extracts crowd features at different scales through multi-scale convolutional neural networks (CNNs) and then fuses them to yield the final density estimate. However, crowd information is lost through the downsampling operations in CNNs, and the fusion method induces model-averaging effects across the multi-scale branches, so this strategy does not necessarily produce accurate estimates. Accordingly, this study proposes a novel model named multi-scale crowd counting via adversarial dilated convolutions.
Method
Our model builds on dilated convolutions, originally proposed for image semantic segmentation. In segmentation, the common approach feeds an image into a CNN, which applies convolution and then pooling; as a result, the feature map shrinks and the receptive field of the network grows. Segmentation, however, is a pixel-wise problem, so the pooled feature map must be upsampled back to the original image size for prediction (typically with a deconvolution). Two key operations therefore dominate: pooling, which reduces the image size while enlarging the receptive field, and upsampling, which restores the original size. Information may be lost in both. Dilated convolution was proposed to solve this issue by keeping the network's feature scale consistent: the pooling layers that reduce the feature-map resolution are removed. Without pooling, however, the model cannot capture a global view of the image, and simply enlarging the convolution kernel makes the computation increase sharply and overloads memory. Instead, the original kernel is expanded by a dilation coefficient, with the inserted positions filled with zeros. The receptive field widens because the expanded kernel covers a larger area, yet the computation is unchanged because the number of effective weights in the kernel stays the same; the scale of each feature map is preserved, and thus so is the image information. The proposed model combines dilated convolutions with an adversarial loss. On the one hand, dilated convolutions extract features from the input image without losing resolution, and different dilation rates aggregate multi-scale context information. On the other hand, the adversarial loss function fuses the information from different scales in a collaborative manner, improving the accuracy of the estimation results.
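The adversarial fusion is described only at a high level here. As a hedged sketch (the weighting `lam` and the exact loss form are assumptions for illustration, not values from the paper), the generator side of such an adversarial density estimator typically combines a pixel-wise loss on the density map with an adversarial term:

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy on discriminator output probabilities."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def generator_loss(pred_density, gt_density, d_out_on_pred, lam=0.01):
    """Illustrative combined objective: pixel-wise L2 between predicted and
    ground-truth density maps, plus an adversarial term rewarding density
    maps the discriminator judges as real. `lam` is an assumed weight."""
    pixel = float(np.mean((pred_density - gt_density) ** 2))
    adv = bce(d_out_on_pred, np.ones_like(d_out_on_pred))  # generator wants D -> "real"
    return pixel + lam * adv
```

In a full model the discriminator would in turn be trained with the same cross-entropy on real maps labeled 1 and generated maps labeled 0; only the generator's side of the game is sketched here.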
Result
The proposed method reduces the mean absolute error (MAE) and mean squared error (MSE) to 60.5 and 109.7 on Part_A of the ShanghaiTech dataset and to 10.2 and 15.3 on Part_B, respectively; compared with existing methods, the MAEs on the two parts improve by 7.7 and 0.4. A synthesized analysis over the five sets of video sequences in the WorldExpo'10 database shows that the average prediction improves by 0.66 over the classical algorithm. On the UCF_CC_50 dataset, MAE and MSE improve by 18.6 and 22.9, respectively, confirming that estimation accuracy is enhanced in complex, densely crowded scenes. On the UCSD database, however, the MAE falls only to 1.02 and the MSE does not improve; the adversarial loss function limits the robustness of crowd counting in low-density environments.
Conclusion
A new learning strategy named multi-scale crowd counting via adversarial dilated convolutions is proposed in this study. The network uses dilated convolutions to preserve significant image information, and dilated convolutions with different dilation coefficients aggregate multi-scale contextual information, which addresses the problem of counting heads of different scales in crowd scene images caused by perspective differences. The adversarial loss function uses the features extracted by the network to estimate crowd density. Experimental results show that the model adapts well to scenes with large variations in crowd distribution; it can estimate the density distribution for different scenes and count crowds accurately.