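Objective: Estimating crowd count and density is highly valuable in video surveillance, intelligent transportation, and public safety. Existing techniques still leave considerable room for improvement when estimating the density of large crowds in complex environments. To address the visual detection of crowds that are dense, unevenly distributed, and heavily occluded, this paper proposes an estimation method based on a multi-level feature fusion network. Method: First, a multi-level feature fusion network extracts and fuses crowd features to generate a crowd density map. Next, the density map is integrated to obtain the corresponding crowd count. Finally, the spatial position information of the crowd is recovered from the density map and combined with the estimated count to give a quantitative judgment of the degree of crowding. Result: The proposed method reduces the MAE to 2.35 on the Mall dataset and to 20.73 and 104.86, respectively, on the ShanghaiTech dataset. Compared with existing methods, estimation accuracy improves considerably, particularly in complex scenes with many people. Conclusion: The proposed multi-level feature fusion method for crowd density estimation effectively extracts features at different scales. It is only weakly constrained by the scene, estimates crowd counts accurately, and assesses the degree of crowding simply and reliably. Comparative experiments verify the effectiveness of the method.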
Crowd Density Estimation Based on Multi-level Feature Fusion
Chen Peng, Tang Yiping, Wang Liran, He Xia (School of Information Engineering, Zhejiang University of Technology, Hangzhou)
Objective: With the dramatic increase in population, large-scale collective activities have become increasingly frequent. In recent years, overcrowding has made a series of social problems increasingly prominent; in particular, accidents occur frequently in densely populated places such as scenic spots, railway stations, and shopping malls. Crowd analysis has therefore become an important research topic in intelligent video surveillance, and crowd density estimation has become a focus of research on crowd safety control and management. It helps staff manage crowds better by providing statistics on the current crowd, and it is extremely valuable for preventing overcrowding and detecting potential safety problems. Most available techniques apply only to scenes with few people and a relatively static environment. To address the visual detection of crowds that are dense, unevenly distributed, and occluded, this paper proposes a crowd density estimation method based on a multi-level feature fusion network. Method: First, feature maps at each level are generated by the network's convolution and pooling operations. After 8 convolutional layers spread over 5 convolution-and-pooling stages, a feature map with 1/32 of the original size and 128 channels is produced. This feature map is then deconvolved three times, and at each step the convolutional features of the corresponding earlier stage are fused in, layer by layer. Finally, a convolution with a 1×1 kernel produces a density map at 1/4 of the original size. Each convolution operation abstracts the image features of the previous layer, so different depths correspond to semantic features at different levels. Shallow layers have high resolution and learn more detail, whereas deep layers have low resolution and learn deeper semantic features and some key features.
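The resolution figures in the method description can be checked with a short sketch. It assumes that each of the five downsampling stages halves the spatial size and that each of the three deconvolutions doubles it (stride 2); the abstract states only the 1/32 and 1/4 ratios, not the strides. The function name `scale_after` and the 512-pixel input side are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def scale_after(num_downsamples: int, num_upsamples: int) -> Fraction:
    """Relative side length after stride-2 downsampling and upsampling stages.

    Assumption: every pooling stage halves and every deconvolution doubles
    the feature-map side length (the abstract does not state the strides).
    """
    return Fraction(2 ** num_upsamples, 2 ** num_downsamples)


# Scale of the deepest feature map, after 5 pooling stages.
encoder_scale = scale_after(5, 0)
# Scale of the output density map, after 3 further deconvolutions.
density_scale = scale_after(5, 3)

side = 512  # hypothetical input side length in pixels
print(encoder_scale, side * encoder_scale)
print(density_scale, side * density_scale)
```

Under these assumptions the two ratios agree with the text: five halvings give 1/32, and three doublings bring the map back to 1/4 of the input size.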
Low-level features are well suited to extracting small targets, while high-level features are suited to extracting large targets. We therefore address the problem of inconsistent scales within an image by combining feature information from different layers. Second, we generate density label maps for public datasets from manual annotations and train the network so that it can predict the density map of a test image on its own. Finally, the number of people is obtained by integrating the density map. Based on the generated density map, a method for quantifying the degree of crowding is also proposed: crowding is computed by recovering the crowd's spatial information from the density map and combining it with the estimated count. Result: The proposed method reduces the MAE to 2.35 on the Mall dataset and to 20.73 and 104.86, respectively, on the ShanghaiTech dataset. Compared with existing methods, the accuracy of crowd density estimation is greatly improved, especially in complex scenes with many people. In addition, experiments with different network structures show that adding the deconvolution layers greatly improves test results compared with a purely convolutional network. On the complex scenes of the ShanghaiTech dataset in particular, feature fusion improves performance further. Fusing the features of layers 1 and 2 has the most pronounced effect, whereas additionally fusing the features of layer 3 yields essentially no improvement. The main reason given is that the level of this layer is too high: it contains more detail but inevitably also more redundant information, and too much redundant information impairs the network's generalization ability. The results also indicate that the network improvements bring little benefit on the Mall dataset, which represents a standard scene and on which a purely convolutional network already achieves good results.
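The counting step and the reported metric can be sketched as follows. On a discrete density map, integration reduces to summing all pixel values, and MAE is the mean absolute difference between estimated and true counts. The tiny density maps and ground-truth counts below are made-up illustrative values, not data from the paper, and the function names are hypothetical.

```python
from typing import List, Sequence

DensityMap = List[List[float]]


def count_from_density(density: DensityMap) -> float:
    """Integrate a discrete density map: the count is the sum of all pixels."""
    return sum(sum(row) for row in density)


def mean_absolute_error(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute difference between predicted and true counts."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)


# Two toy 2x2 density maps (illustrative values only).
maps = [
    [[0.0, 0.5], [0.5, 1.0]],
    [[0.25, 0.25], [0.25, 0.25]],
]
true_counts = [3, 1]  # made-up ground-truth head counts

estimated = [count_from_density(m) for m in maps]
mae = mean_absolute_error(estimated, true_counts)
print(estimated, mae)
```

Because the count is a plain sum, the estimate does not depend on where people appear in the map; the spatial layout is what the separate crowding measure draws on.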
Conclusion: This paper proposes a crowd density estimation method based on a multi-level feature fusion network. By extracting and fusing features from different semantic levels, the network captures people at different scales and sizes, which effectively improves the robustness of the algorithm. Using the complete image as input better preserves global image information, so the spatial location information of features is also taken into account during training. Combining the predicted density map with spatial information makes estimating the number of people and the degree of crowding more principled and efficient, which gives the method high practical value. The method is only weakly constrained by the scene, estimates crowd counts accurately, and assesses crowding simply and reliably. Experiments verify the effectiveness of the proposed multi-level feature fusion network and the crowding evaluation method.