多层次特征融合的人群密度估计
Crowd density estimation based on multi-level feature fusion
2018, Vol. 23, No. 8, pp. 1181-1192
Received: 2018-01-08; Revised: 2018-03-07; Published in print: 2018-08-16
DOI: 10.11834/jig.180017

Objective
Crowd counting and density estimation are of great practical value in video surveillance, intelligent transportation, and public safety. Existing techniques still leave considerable room for improvement for large crowds and complex environments. For the visual estimation of dense, unevenly distributed, and heavily occluded crowds, this paper proposes a crowd density estimation method based on a multi-level feature fusion network to address the difficulty of crowd density estimation.
Method
First, a multi-level feature fusion network extracts and fuses crowd features and generates a crowd density map. The crowd count is then obtained by integrating the density map. Finally, the spatial position information encoded in the density map is restored and combined with the estimated count to quantify the degree of crowd congestion.
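As a concrete illustration of the last two steps, the sketch below sums (integrates) the predicted density map to obtain the crowd count and grades congestion by accumulating density over spatial cells. The cell size and thresholds are hypothetical values for illustration, not parameters taken from the paper.

import numpy as np

def count_from_density(density_map):
    # The estimated crowd count is the integral (pixel-wise sum) of the density map.
    return float(density_map.sum())

def congestion_grid(density_map, cell=32, thresholds=(0.5, 2.0)):
    # Split the density map into cells, take each cell's integral as a local count,
    # and grade every cell: 0 = sparse, 1 = moderate, 2 = crowded.
    h, w = density_map.shape
    grid = np.zeros((h // cell, w // cell), dtype=np.float32)
    for i in range(h // cell):
        for j in range(w // cell):
            grid[i, j] = density_map[i * cell:(i + 1) * cell,
                                     j * cell:(j + 1) * cell].sum()
    return np.digitize(grid, thresholds)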
Result
On the Mall dataset, the proposed method reduces the mean absolute error (MAE) to 2.35; on the two parts of the ShanghaiTech dataset, the MAE is reduced to 20.73 and 104.86. Compared with existing methods, the estimation accuracy is substantially improved, especially in complex scenes with large crowds.
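For reference, the mean absolute error used here is the standard counting metric, with $N$ test images, ground-truth counts $c_i$, and estimated counts $\hat{c}_i$:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{c}_i - c_i \right|$$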
Conclusion
The proposed multi-level feature fusion method effectively extracts features at different scales. It is weakly constrained by the scene, estimates crowd counts with high accuracy, and assesses crowd congestion simply and reliably. Comparative experiments verify the effectiveness of the method.
Objective
With the noticeable growth in population, large-scale collective activities have become increasingly frequent. In recent years, a series of social problems caused by overcrowding has become progressively prominent. In particular, accidents occur frequently in densely populated areas such as scenic spots, railway stations, and shopping malls. Crowd analysis has therefore become an important research topic in intelligent video surveillance, and crowd density estimation has become a focus of crowd safety control and management research. Crowd density estimation can help staff optimize management on the basis of statistics of the current crowd; preventing overcrowding and detecting potential safety issues are important contributions of such a process. However, several of the available technologies are applicable only to small crowds in relatively static scenes. Aiming at the visual detection of dense, unevenly distributed, and heavily occluded crowds, this study proposes a crowd density estimation method based on a multi-level feature fusion network.
Method
First, we generate the feature map of each level through the convolution and pooling layers of the network. After five of the eight convolution layers, a 128-dimensional feature map at 1/32 of the original size is produced, on which three deconvolution operations are then performed. At each deconvolution stage, the convolutional features of the corresponding earlier stage are fused in. Finally, a 1×1 convolution kernel is applied to form a density feature map at 1/4 of the original size. Each convolution operation abstracts the image features of the previous layer, and different depths correspond to different levels of semantic features. A shallow layer has high resolution and retains more image detail, whereas a deep layer has low resolution but learns deep semantic and key features. Low-level features are therefore suitable for extracting small targets, whereas high-level features are suitable for extracting large targets. By combining the feature information of different layers, we address the problem of inconsistent target scales within an image.
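The following PyTorch-style sketch illustrates one plausible reading of this structure: five convolution-and-pooling stages down to 1/32 resolution and 128 channels, three stride-2 deconvolutions back up to 1/4 resolution with fusion of earlier-stage features, and a final 1×1 convolution that outputs the density map. All channel widths except the 128-dimensional bottleneck, and the exact fusion points, are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class FusionCrowdNet(nn.Module):
    # Minimal sketch of a multi-level feature fusion counting network.
    # Input height and width are assumed to be divisible by 32.
    def __init__(self):
        super().__init__()
        def stage(cin, cout):  # conv + ReLU + 2x2 max pooling halves the resolution
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.enc1 = stage(3, 16)     # 1/2
        self.enc2 = stage(16, 32)    # 1/4
        self.enc3 = stage(32, 64)    # 1/8
        self.enc4 = stage(64, 128)   # 1/16
        self.enc5 = stage(128, 128)  # 1/32, 128-dimensional bottleneck
        # three x2 deconvolutions bring the map from 1/32 back to 1/4 resolution
        self.up1 = nn.ConvTranspose2d(128, 128, 2, stride=2)        # -> 1/16
        self.up2 = nn.ConvTranspose2d(128 + 128, 64, 2, stride=2)   # -> 1/8, fused with enc4
        self.up3 = nn.ConvTranspose2d(64 + 64, 32, 2, stride=2)     # -> 1/4, fused with enc3
        self.head = nn.Conv2d(32 + 32, 1, 1)  # 1x1 conv on fused features -> density map

    def forward(self, x):
        f2 = self.enc2(self.enc1(x))
        f3 = self.enc3(f2)
        f4 = self.enc4(f3)
        f5 = self.enc5(f4)
        d1 = self.up1(f5)                              # 1/16 resolution
        d2 = self.up2(torch.cat([d1, f4], dim=1))      # 1/8, fuses stage-4 features
        d3 = self.up3(torch.cat([d2, f3], dim=1))      # 1/4, fuses stage-3 features
        return self.head(torch.cat([d3, f2], dim=1))   # density map at 1/4 of input size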
Second, we use public datasets with manually calibrated annotations to generate the corresponding density label maps and then train the network to independently predict the density map of a test image. Finally, the crowd count is obtained by integrating the predicted density map; on the basis of this map, we propose a quantitative measure of crowd congestion, computed by restoring and combining the crowd spatial information in the density map with the estimated count.
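A common way to produce such a density label map, and one plausible reading of the manual calibration described above, is to place a unit impulse at every annotated head position and smooth it with a Gaussian kernel, so that the label map integrates to the true head count. The fixed kernel width below is an assumption for illustration, not the paper's setting.

import numpy as np
from scipy.ndimage import gaussian_filter

def density_label_map(shape, head_points, sigma=4.0):
    # shape: (height, width) of the image; head_points: (x, y) pixel coordinates of heads.
    label = np.zeros(shape, dtype=np.float32)
    for x, y in head_points:
        if 0 <= int(y) < shape[0] and 0 <= int(x) < shape[1]:
            label[int(y), int(x)] += 1.0          # one unit of "mass" per annotated person
    return gaussian_filter(label, sigma)          # smoothed map still sums to the head count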
Result
The proposed method reduces the MAE to 2.35 on the Mall dataset and to 20.73 and 104.86 on the ShanghaiTech dataset. Compared with existing methods, the crowd density estimation accuracy is improved, with a particularly noticeable gain in complex scenes containing many people. In addition, experiments with different network structures show that adding the deconvolution layers improves the test results compared with a purely convolutional network. In the complex scenes of the ShanghaiTech dataset, the feature fusion network further improves performance; fusing the first and second levels of features produces the most prominent gain. Fusing a third level of features yields essentially no further improvement, mainly because that level contains excessive detail, and the redundant information weakens the generalization capacity of the network. The gain from these network modifications is less noticeable on the Mall dataset with its standard scenario; however, even the purely convolutional network already produces a noticeable result there.
Conclusion
This study proposes a crowd density estimation method based on a multi-level feature fusion network. By extracting and fusing features from different semantic layers, the network can capture people at different scales and sizes, which effectively improves the robustness of the algorithm. Using the complete image as input better preserves the overall image information, and the spatial location of features is taken into account during network training. Estimating the crowd count and the degree of congestion from the predicted density map combined with its spatial information is both principled and efficient. The algorithm also has the advantages of weak scene constraints, high crowd counting accuracy, and simple, reliable crowd congestion assessment. The effectiveness of the proposed multi-level feature fusion network and the crowd congestion evaluation method is verified through experiments.