Published: 2018-08-16 | DOI: 10.11834/jig.180017 | 2018, Volume 23, Number 8 | Image Analysis and Recognition

Received: 2018-01-08; revised: 2018-03-07. First author: Chen Peng, born in 1992, male, master's student in information engineering at Zhejiang University of Technology; his research interests are computer vision and deep learning. E-mail: chenpeng96357@foxmail.com. Wang Liran, female, master's student in information engineering at Zhejiang University of Technology; her research interests are computer vision and deep learning. E-mail: 1406034706@qq.com. He Xia, female, master's student in information engineering at Zhejiang University of Technology; her research interests are computer vision and deep learning. E-mail: 178332747@qq.com. CLC number: TP391. Document code: A. Article ID: 1006-8961(2018)08-1181-12


Crowd density estimation based on multi-level feature fusion
Chen Peng1, Tang Yiping1,2, Wang Liran1, He Xia1
1. School of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China;
2. Enjoyor Co., Ltd, Hangzhou 310000, China
Supported by: National Natural Science Foundation of China (61070134, 61379078)

# Abstract

Objective With noticeable population growth, large-scale collective activities have become increasingly frequent, and in recent years a series of social problems caused by crowding has become progressively prominent. In particular, accidents occur frequently in densely populated areas such as scenic spots, railway stations, and shopping malls. Crowd analysis has thus become an important research topic in intelligent video surveillance, and crowd density estimation has become a focus of crowd safety control and management research. Crowd density estimation can help staff optimize management on the basis of statistics of the current crowd; preventing overcrowding and detecting potential safety issues are important contributions of such a process. However, several available technologies apply only to small numbers of people in relatively static scenes. Aiming at the visual estimation of crowd density under uneven distribution and occlusion, this study proposes a crowd density estimation method based on a multi-level feature fusion network. Method First, we generate the feature map of each level through the convolution and pooling operations of the network. After five of the eight convolution layers, a 128-dimensional feature map at 1/32 of the original size is generated, on which three deconvolution operations are then performed. At each deconvolution stage, the convolutional features of the corresponding earlier stage are fused in. Finally, a 1×1 convolution kernel produces a density feature map at 1/4 of the original size. Each convolution operation abstracts the image features of the previous layer, and different depths correspond to different levels of semantic features. The shallow layers of the network have high resolution and preserve more image details.
In contrast, the deep layers have low resolution but learn deep semantics and key features. Low-level features are suitable for extracting small targets, whereas high-level features are suitable for extracting large targets. We address the problem of inconsistent image scales by combining the feature information of different layers. Second, we use public datasets to generate the corresponding density label maps through manual calibration and then train the network to predict the density map of a test image independently. Finally, on the basis of the generated density map, we propose a quantitative measure of crowd congestion: the number of people is obtained by integrating the density map, and the degree of crowding is calculated through the reduction and combination of the crowd's spatial information on the density map. Result The proposed method reduces the MAE to 2.35 on the mall dataset and to 20.73 and 104.86 on the ShanghaiTech Part B and Part A datasets, respectively. Compared with existing methods, the crowd density estimation accuracy is improved, with a noticeable effect in environments with complex scenes. In addition, experiments with different network structures show that adding the deconvolution layers improves the test results over a purely convolutional network. On the complex scenes of the ShanghaiTech dataset, the feature fusion network improves performance further; the fusion of the level-1 and level-2 features has an especially prominent effect. Fusing the level-3 features yields essentially no further improvement, mainly because this level contains too many details, and the redundant information affects the generalization capacity of the network. For the mall dataset, with its standard scenario, the network improvements are likewise not noticeable, and even a purely convolutional network produces a noticeable result.
Conclusion This study proposes a crowd density estimation method based on a multi-level feature fusion network. Through the extraction and fusion of features from different semantic levels, the network can extract features of people at different scales and sizes, which effectively improves the robustness of the algorithm. Using the complete image as input better preserves the overall image information, and the spatial location of features is considered during network training. Estimating the number of people and the degree of congestion from the predicted density map combined with its spatial information makes the algorithm more scientific and efficient. The algorithm also has the advantages of few scene constraints, high crowd estimation accuracy, and simple, reliable crowd congestion assessment. The effectiveness of the proposed multi-level feature fusion network and crowd congestion evaluation method is verified through experiments.
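The pipeline described in the Method paragraph (a convolutional trunk down to 1/32 resolution, three deconvolution stages fusing earlier trunk features, and a 1×1 convolution regressing a 1/4-resolution density map) can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's exact configuration: the channel widths, kernel sizes, and the choice of which trunk stages are fused are ours.

```python
import torch
import torch.nn as nn

class FusionDensityNet(nn.Module):
    """Sketch of the multi-level feature fusion idea: downsample to 1/32,
    upsample via three transposed convolutions while concatenating trunk
    features of matching resolution, then a 1x1 conv to a density map."""

    def __init__(self):
        super().__init__()
        def stage(cin, cout):
            # conv + ReLU + 2x2 max pooling halves the spatial resolution
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))
        self.s1 = stage(3, 16)     # 1/2
        self.s2 = stage(16, 32)    # 1/4
        self.s3 = stage(32, 64)    # 1/8
        self.s4 = stage(64, 128)   # 1/16
        self.s5 = stage(128, 128)  # 1/32, 128-dimensional as in the text
        # deconvolution (transposed conv) stages, each doubling resolution
        self.up1 = nn.ConvTranspose2d(128, 128, 2, stride=2)        # -> 1/16
        self.up2 = nn.ConvTranspose2d(128 + 128, 64, 2, stride=2)   # -> 1/8
        self.up3 = nn.ConvTranspose2d(64 + 64, 32, 2, stride=2)     # -> 1/4
        self.head = nn.Conv2d(32 + 32, 1, 1)  # 1x1 conv to density map

    def forward(self, x):
        f1 = self.s1(x)
        f2 = self.s2(f1)
        f3 = self.s3(f2)
        f4 = self.s4(f3)
        f5 = self.s5(f4)
        u = self.up1(f5)
        u = self.up2(torch.cat([u, f4], dim=1))  # fuse 1/16-level features
        u = self.up3(torch.cat([u, f3], dim=1))  # fuse 1/8-level features
        return self.head(torch.cat([u, f2], dim=1))  # fuse 1/4-level features
```

For an input whose sides are divisible by 32, the output is a single-channel map at 1/4 of the input resolution; the predicted head count is the integral (sum) of this map.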

# Key words

crowd density estimation; crowded degree assessment; hierarchical feature fusion; convolutional neural network (CNN); deep learning; intelligent video analysis

# 1.2 Generation of training labels

$k = \dfrac{L_{\rm c} - L_{\rm p}}{r_{\rm p} - r_{\rm c}}$ (1)

$L_x = k(x - r_{\rm p}) + L_{\rm p}$ (2)

 $H(x) = \sum\limits_{i = 1}^N {\delta (x-{x_i})}$ (3)

 $D(x) = H(x)*{G_\sigma }(x)$ (4)

$G_\sigma (x, y) = \dfrac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp \left[ -\dfrac{1}{2(1 - \rho^2)} \left( \dfrac{(x - \mu_1)^2}{\sigma_1^2} + \dfrac{(y - \mu_2)^2}{\sigma_2^2} - \dfrac{2\rho (x - \mu_1)(y - \mu_2)}{\sigma_1 \sigma_2} \right) \right]$ (5)
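Equations (3) and (4) can be realized directly in code: the impulse map $H(x)$ places a unit spike at each annotated head position, and convolving it with a Gaussian kernel yields the density label map $D(x)$. The sketch below is a minimal version under our own assumptions — it uses a normalized isotropic Gaussian (a simplification of the bivariate kernel in Eq. (5)), and the grid size, annotation format, and σ value are illustrative.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized isotropic 2-D Gaussian sampled on a square grid."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()  # normalize so each head contributes mass 1

def density_map(shape, head_points, sigma=4.0):
    """D(x) = H(x) * G_sigma(x): paste one normalized Gaussian per
    annotated head (row, col), clipping the kernel at image borders."""
    radius = int(3 * sigma)
    kern = gaussian_kernel(sigma, radius)
    D = np.zeros(shape, dtype=np.float64)
    for r, c in head_points:
        r0, c0 = int(r), int(c)
        rs, re = max(r0 - radius, 0), min(r0 + radius + 1, shape[0])
        cs, ce = max(c0 - radius, 0), min(c0 + radius + 1, shape[1])
        D[rs:re, cs:ce] += kern[rs - (r0 - radius):re - (r0 - radius),
                                cs - (c0 - radius):ce - (c0 - radius)]
    return D
```

Because each pasted kernel integrates to one, summing the density map over the image recovers the annotated head count (up to border truncation), which is exactly how the count is read off the predicted map at test time.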

Table 7 Comparison of experimental results (MAE) of different network structures

| Network structure | Mall | ShanghaiTech Part B | ShanghaiTech Part A |
| :--- | :--- | :--- | :--- |
| $\rm C_{net}$ | 2.43 | 26.4 | 165.78 |
| $\rm DC_{net}$ | 2.45 | 24.6 | 143.35 |
| $\rm DC\_1_{net}$ | 2.32 | 21.3 | 120.25 |
| $\rm DC\_2_{net}$ | 2.41 | 20.65 | 108.39 |
| $\rm DC\_3_{net}$ | 2.35 | 20.73 | 104.86 |
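The values in Table 7 are mean absolute errors over per-image crowd counts. For reference, the metric is simply the average of the absolute differences between predicted and ground-truth counts; a minimal computation (the function name is ours) is:

```python
def mean_absolute_error(pred_counts, true_counts):
    """MAE over a test set of per-image crowd counts:
    mean of |predicted - ground truth| across images."""
    assert len(pred_counts) == len(true_counts)
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n
```

For example, predictions of 10 and 20 people against ground truths of 12 and 17 give an MAE of (2 + 3) / 2 = 2.5.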
