Liu Siqi, Lang Congyan, Feng Songhe (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044)
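Objective Crowd density estimation extracts and analyzes crowd features to estimate the density distribution and the crowd count. The downsampling operations in the CNNs used by existing methods lose part of the crowd information, and average fusion averages out the multi-scale effect, so this strategy does not necessarily yield accurate estimates. To address these problems, a new multi-scale crowd density estimation model based on adversarial dilated convolutions is proposed. Method Dilated convolutions extract features from the input image without loss of resolution, and different dilation coefficients aggregate multi-scale contextual information. An adversarial loss function then fuses the features of different scales extracted by the network in a cooperative manner to obtain accurate density estimates. Result Comparative experiments are conducted on four major crowd counting datasets. In the test phase, each test image is fed into the trained generator network, which outputs a predicted density map; the density map is integrated (summed) to obtain the total count, and the mean absolute error (MAE) and mean squared error (MSE) are used as evaluation metrics. On the ShanghaiTech dataset, the MAE and MSE are reduced to 60.5 and 109.7 on Part_A and to 10.2 and 15.3 on Part_B, a clear improvement. Conclusion This paper proposes a new multi-scale crowd density estimation model based on adversarial dilated convolutions. Experimental results show that the model adapts well to scenes with large differences in crowd distribution: it extracts features according to the scene, estimates the density distribution, and counts the crowd accurately.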
Multi-scale crowd counting via adversarial dilated convolutions
Liu Siqi, Lang Congyan, Feng Songhe (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
Objective Crowd counting estimates the count and density distribution of a crowd by extracting and analyzing crowd features. A common strategy extracts crowd features at different scales through multi-scale convolutional neural networks (CNNs) and then fuses them to yield the final density estimate. However, crowd information is lost through the downsampling operations in CNNs, and the fusion method introduces a model-averaging effect across the scales. This strategy therefore does not necessarily produce accurate estimates. Accordingly, this study proposes a novel model, multi-scale crowd counting via adversarial dilated convolutions. Method Our model builds on dilated convolution, which was originally proposed for image semantic segmentation. In image segmentation, the most common approach is to feed the image into a CNN, which applies convolution followed by pooling. As a result, the image size is reduced and the receptive field of the network is increased. However, image segmentation is a pixel-wise problem, so the reduced feature map must be upsampled back to the original image size for prediction (a deconvolution operation is generally used for upsampling). Image segmentation therefore involves two key steps: pooling, which reduces the image size and increases the receptive field, and upsampling, which restores the image size. Information may be lost in this process of reducing and then restoring the size. Dilated convolution, which keeps the feature scale consistent throughout the network, was proposed to solve this issue. Its principle is to remove the pooling layers that reduce the resolution of the feature map. Without pooling, however, the receptive field stays small and the model cannot capture a global view of the image.
Simply enlarging the convolution kernel would sharply increase the computation and the memory load. Instead, the original convolution kernel is expanded by a given dilation coefficient and the empty positions are filled with zeros, which enlarges the kernel and increases the receptive field. The receptive field widens because the kernel is expanded, while the computation remains unchanged because the number of effective (nonzero) points in the kernel remains unchanged. The scale of each feature map is also unchanged, and thus the image information is preserved. The proposed model is based on adversarial dilated convolutions. On the one hand, dilated convolution extracts features from the input image without losing resolution, and the module uses different dilation coefficients to aggregate multi-scale contextual information. On the other hand, the adversarial loss function fuses the information from different scales in a collaborative manner, which improves the accuracy of the estimates. Result The proposed method reduces the mean absolute error (MAE) and the mean squared error (MSE) to 60.5 and 109.7 on Part_A of the ShanghaiTech dataset and to 10.2 and 15.3 on Part_B. Compared with existing methods, the MAE improves by 7.7 and 0.4 on the two parts, respectively. A combined analysis of five sets of video sequences from the WorldExpo'10 dataset shows that the average prediction result improves by 0.66 compared with that of the classical algorithm. On the UCF_CC_50 dataset, the MAE and MSE improve by 18.6 and 22.9, respectively, indicating a noticeable gain in estimation accuracy in complex scenes. On the UCSD dataset, however, the MAE is reduced to 1.02 but the MSE does not improve, which suggests that the adversarial loss function limits the robustness of crowd counting in low-density scenes.
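The MAE and MSE values reported in the Result are computed from per-image counts, each obtained by summing a predicted density map. The following is a minimal sketch of that evaluation, assuming NumPy. The function names and the toy values are hypothetical. Taking the square root in "MSE" is an assumption based on common crowd-counting practice, not something stated in this abstract, so it is exposed as a flag.

```python
import numpy as np


def count_from_density(density_map):
    """Sum (integrate) a density map to get the estimated number of people."""
    return float(np.sum(density_map))


def mae_mse(pred_counts, true_counts, take_root=True):
    """Return (MAE, MSE) over a set of test images.

    take_root=True is an ASSUMPTION: crowd-counting papers usually report
    the square root of the mean squared error under the name "MSE".
    """
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    err = pred - true
    mae = float(np.mean(np.abs(err)))
    mse = float(np.mean(err ** 2))
    if take_root:
        mse = float(np.sqrt(mse))
    return mae, mse


# Toy density maps (hypothetical values, not data from the paper).
toy_maps = [np.full((4, 4), 0.5), np.full((2, 4), 1.5)]
toy_truth = [10, 11]

toy_pred = [count_from_density(m) for m in toy_maps]
toy_mae, toy_mse = mae_mse(toy_pred, toy_truth)
print("predicted counts:", toy_pred)
print("MAE:", toy_mae, "MSE:", toy_mse)
```

Both metrics are computed on counts, not on pixels, so a density map can be wrong locally and still score well if its sum is close to the true count.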
Conclusion This study proposes a new learning strategy, multi-scale crowd counting via adversarial dilated convolutions. The network uses dilated convolutions to preserve image information, and dilated convolutions with different dilation coefficients aggregate multi-scale contextual information, which addresses the problem of counting heads that appear at different scales in crowd images because of differences in viewing angle. The adversarial loss function uses the image features extracted by the network to estimate crowd density. Experimental results show that the model adapts well to scenes with large differences in crowd distribution. It estimates the density distribution according to the scene and counts the crowd accurately.
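The zero-filling view of dilated convolution described in the Method can be illustrated with a small NumPy sketch. It is an illustration only, not the network of this paper: the helper names, the 3x3 kernel, and the dilation coefficients 1, 2, and 3 are hypothetical choices. The sketch shows that the kernel's spatial extent grows with the dilation coefficient while the number of nonzero weights stays fixed, and that a zero-padded convolution with the dilated kernel keeps the input resolution.

```python
import numpy as np


def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring weights of a 2D kernel."""
    k_h, k_w = kernel.shape
    out_h = k_h + (k_h - 1) * (rate - 1)
    out_w = k_w + (k_w - 1) * (rate - 1)
    dilated = np.zeros((out_h, out_w), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel  # original weights land on a sparse grid
    return dilated


def correlate_same(image, kernel):
    """Naive zero-padded cross-correlation; output has the input's shape.

    Assumes an odd-sized kernel so that the padding is symmetric.
    """
    k_h, k_w = kernel.shape
    pad_h, pad_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k_h, j:j + k_w] * kernel)
    return out


base = np.ones((3, 3))
image = np.arange(64, dtype=float).reshape(8, 8)

for rate in (1, 2, 3):  # illustrative dilation coefficients
    k = dilate_kernel(base, rate)
    feat = correlate_same(image, k)
    print("rate", rate, "kernel extent", k.shape,
          "nonzero weights", int(np.count_nonzero(k)),
          "output shape", feat.shape)
```

Deep learning frameworks typically do not materialize the zeros. They sample the input at the dilated offsets instead, which is why the computation depends only on the number of original weights.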