Multi-Scale Crowd Counting via Adversarial Dilated Convolutions
Liu Siqi, Lang Congyan, Feng Songhe (Beijing Jiaotong University)
Objective Crowd counting estimates the number of people in a scene and their density distribution by extracting and analyzing crowd features. A common strategy extracts crowd features at different scales with multi-scale convolutional neural networks (CNNs) and then fuses them to yield the final density estimate. However, the down-sampling operations in a CNN lose crowd information, and the fusion method induces a model-averaging effect across the multi-scale CNNs, so this strategy does not necessarily produce accurate estimates. To address these problems, we propose a novel model named multi-scale crowd counting via adversarial dilated convolutions.
Method Our model builds on the dilated convolution model originally proposed for image semantic segmentation. The most common approach to image segmentation uses a CNN: the input image passes through convolution and then pooling, which reduces the image size and enlarges the receptive field of the network. However, because segmentation is a pixel-wise prediction problem, the smaller feature map produced by pooling must be upsampled back to the original image size before prediction (upsampling is generally implemented with deconvolution). Image segmentation therefore relies on two key operations: pooling, which reduces the image size and enlarges the receptive field, and upsampling, which restores the image size. Some information is inevitably lost in this process of shrinking and re-enlarging. Dilated convolution, which keeps the feature-map scale consistent throughout the network, was proposed to solve this issue.
The underlying idea is to remove the pooling layers that reduce the resolution of the feature map. Without pooling, however, the model cannot learn a global view of the image, and simply enlarging the convolution kernel sharply increases computation and can exhaust memory. To enlarge the kernel and the receptive field anyway, the original convolution kernel is expanded by a given dilation coefficient and the empty positions are filled with zeros. The receptive field grows because the kernel is expanded, while the computation stays the same because the number of effective (non-zero) points in the kernel is unchanged. In addition, the scale of each feature map is unchanged, so the image information is preserved. The model proposed in this paper is based on adversarial dilated convolutions. On the one hand, dilated convolution extracts features from the input image without losing resolution, and the module uses dilated convolutions with different dilation coefficients to aggregate multi-scale context information. On the other hand, the adversarial loss function improves the accuracy of the estimates by fusing information from different scales in a collaborative way.
Result On Part_A of the ShanghaiTech dataset, the proposed method reduces the mean absolute error (MAE) and the mean square error (MSE) to 60.5 and 109.7; on Part_B, it reduces the MAE and MSE to 10.2 and 15.3. Compared with existing methods, the MAE improves by 7.7 and 0.4, respectively. Across the five sets of video sequences in the WorldExpo'10 database, the average prediction result improves by 0.66 compared with the classical algorithm. On the UCF_CC_50 dataset, the MAE and MSE improve by 18.6 and 22.9, respectively, which shows that estimation accuracy is improved, with a noticeable effect in complex scenes containing many people.
On the UCSD database, however, the MAE is reduced to 1.02 but the MSE does not improve, which indicates that the adversarial loss function limits the robustness of crowd counting in low-density environments.
Conclusion This paper proposes a new learning strategy named multi-scale crowd counting via adversarial dilated convolutions. To preserve as much image information as possible, the network uses dilated convolutions, and dilated convolutions with different dilation coefficients aggregate multi-scale contextual information, which addresses the problem that heads appear at different scales in crowd images because of differences in viewing angle. The adversarial loss function uses the image features extracted by the network to estimate crowd density. The experimental results show that the proposed model adapts better to scenes with large crowd distributions. It can estimate the density distribution for different scenes and count the crowd accurately.
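The zero-filling construction of dilated convolution described in the Method part can be sketched as follows. This is a minimal NumPy illustration, not the network used in the paper: the helper names `dilate_kernel` and `conv2d_same`, the 3×3 all-ones kernel, the 16×16 random input, and the dilation rates 1, 2, and 4 are all assumptions chosen for illustration.

```python
import numpy as np


def dilate_kernel(kernel, rate):
    """Hypothetical helper: insert (rate - 1) zeros between neighbouring kernel taps."""
    k_h, k_w = kernel.shape
    out = np.zeros((rate * (k_h - 1) + 1, rate * (k_w - 1) + 1), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel  # original weights land on a sparse grid
    return out


def conv2d_same(image, kernel):
    """Hypothetical helper: stride-1 cross-correlation with zero padding.

    No pooling or striding is applied, so the output keeps the input resolution.
    Assumes an odd-sized kernel.
    """
    k_h, k_w = kernel.shape
    pad_h, pad_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))  # zero padding
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k_h, j:j + k_w] * kernel)
    return out


if __name__ == "__main__":
    kernel = np.ones((3, 3))                           # assumed toy kernel
    image = np.random.default_rng(0).random((16, 16))  # assumed toy input
    for rate in (1, 2, 4):                             # assumed dilation rates
        dilated = dilate_kernel(kernel, rate)
        features = conv2d_same(image, dilated)
        # Receptive field grows with the rate; non-zero taps and output size do not.
        print(rate, dilated.shape, int(np.count_nonzero(dilated)), features.shape)
```

In this sketch the kernel footprint grows from 3×3 to 5×5 to 9×9 as the rate increases, while the number of non-zero weights stays at 9 and every output keeps the 16×16 input size. A practical implementation would skip the zero positions rather than multiply by them, which is why the computation does not grow with the dilation coefficient.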