Multi-scale crowd counting via adversarial dilated convolutions
Liu Siqi, Lang Congyan, Feng Songhe
Department of Computer and Information, Beijing Jiaotong University, Beijing 100044, China

# Abstract

Objective Crowd counting estimates the number of people in a scene and their density distribution by extracting and analyzing crowd features. A common strategy extracts crowd features at different scales with multi-scale convolutional neural networks (CNNs) and fuses them into the final density estimate. However, the downsampling operations in CNNs discard crowd information, and fusing the outputs of the multi-scale branches averages their predictions, so this strategy does not necessarily yield accurate estimates. This study therefore proposes a model named multi-scale crowd counting via adversarial dilated convolutions.
Method Our model builds on dilated convolution, which was originally proposed for image semantic segmentation. In segmentation, images are typically fed into a CNN that applies convolution followed by pooling; pooling reduces the image size and enlarges the receptive field of the network. Because segmentation is a pixel-wise problem, the reduced feature map must then be upsampled to the original image size for prediction, generally with a deconvolution operation. Segmentation therefore involves two key steps: pooling, which shrinks the image and enlarges the receptive field, and upsampling, which restores the image size. Information may be lost in both. Dilated convolution addresses this issue by keeping the feature-map scale constant throughout the network. The principle is to remove the pooling layers that reduce feature-map resolution. Without pooling, however, the network can no longer capture a global view of the image, and simply enlarging the convolution kernel sharply increases computation and memory use. Instead, the original kernel is expanded by a dilation rate and the inserted positions are filled with zeros, which enlarges the kernel and its receptive field. The receptive field widens because the kernel is expanded, while the computation stays the same because the number of effective (non-zero) points in the kernel is unchanged. The scale of every feature map is preserved, and so is the image information. The proposed model combines dilated convolutions with adversarial training. On the one hand, dilated convolution extracts features from the input image without losing resolution, and the module uses different dilation rates to aggregate multi-scale context information. On the other hand, the adversarial loss function fuses information from different scales collaboratively, which improves the accuracy of the estimate.
Result On Part_A of the ShanghaiTech dataset, the proposed method reduces the mean absolute error (MAE) and the mean squared error (MSE) to 60.5 and 109.7; on Part_B, it reduces them to 10.2 and 15.3. Compared with existing methods, the MAE improves by 7.7 and 0.4 on the two parts, respectively. On the five test video sequences of the WorldExpo′10 dataset, the average MAE improves by 0.66 compared with CSRNet. On the UCF_CC_50 dataset, MAE and MSE improve by 18.6 and 22.9, respectively, which shows that the method is markedly more accurate in complex, densely crowded scenes. On the UCSD dataset, the MAE falls to 1.02 but the MSE does not improve, which indicates that the adversarial loss function limits the robustness of crowd counting in low-density environments.
Conclusion This study proposes a new learning strategy named multi-scale crowd counting via adversarial dilated convolutions. The network uses dilated convolutions to preserve image information, and dilated convolutions with different dilation rates aggregate multi-scale contextual information, which addresses the problem of counting heads that appear at different scales in crowd images because of perspective differences. The adversarial loss function uses the image features extracted by the network to estimate crowd density. Experimental results show that the model adapts well to scenes with large, densely distributed crowds: it estimates the density distribution across different scenes and counts the crowd accurately.

# Key words

crowd counting; multi-scale; adversarial loss; dilated convolutions; computer vision; crowd safety

# 0 Introduction

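1) Conventional networks downsample the input image to enlarge the receptive field of the convolution kernels and capture global features. In heavily crowded scenes such as the one in Figure 1, however, people severely occlude one another and some heads occupy only a few pixels, so downsampling easily loses part of the crowd distribution information. It also lowers the resolution of the density map output by each scale-specific channel, which affects the final density estimate.

2) The density map is obtained by averaging the density estimates output by the multi-channel network. This fusion averages out the multi-scale density estimates and reduces the accuracy of crowd counting.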
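The dilated convolution described in the abstract avoids the downsampling in item 1). The following is a minimal NumPy sketch of that idea, not the authors' network: `dilate_kernel` and `conv2d_same` are hypothetical helper names, and the 3×3 kernel, dilation rate 2 and 16×16 input are illustrative assumptions.

```python
import numpy as np

def dilate_kernel(kernel: np.ndarray, rate: int) -> np.ndarray:
    """Insert (rate - 1) zeros between neighbouring weights of a 2-D kernel."""
    k_h, k_w = kernel.shape
    out = np.zeros(((k_h - 1) * rate + 1, (k_w - 1) * rate + 1), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel  # original weights land on a sparse grid
    return out

def conv2d_same(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size cross-correlation with zero padding (odd-sized kernels only)."""
    k_h, k_w = kernel.shape
    pad_h, pad_w = k_h // 2, k_w // 2
    padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)))
    out = np.zeros_like(image, dtype=float)
    for dy in range(k_h):
        for dx in range(k_w):
            if kernel[dy, dx] != 0:  # zero taps are skipped, so they cost nothing
                out += kernel[dy, dx] * padded[dy:dy + image.shape[0],
                                               dx:dx + image.shape[1]]
    return out

# Illustrative values (assumptions, not taken from the paper).
kernel = np.ones((3, 3))
dilated = dilate_kernel(kernel, rate=2)
image = np.random.default_rng(0).random((16, 16))
feature = conv2d_same(image, dilated)

print(dilated.shape)              # spatial extent of the dilated kernel
print(np.count_nonzero(dilated))  # number of effective weights
print(feature.shape)              # output keeps the input resolution
```

With rate 2 the kernel covers a 5×5 window while still using only the nine original weights, and the output feature map has the same size as the input because no pooling is applied.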

# 2.1 Density map

 $\mathit{\boldsymbol{H}}\left( x \right) = \sum\limits_{i = 1}^N {\mathit{\boldsymbol{\delta }}\left( {x - {x_i}} \right)}$ (1)
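As a rough illustration of Eq. (1), the sketch below places a unit impulse at each annotated head position, so that the map sums to the head count N. The Gaussian smoothing step is an assumption added for illustration (the excerpt shows only the impulse representation), and the names `delta_map`, `gaussian_kernel_1d`, `smooth`, the head coordinates and `sigma` are all hypothetical.

```python
import numpy as np

def delta_map(heads, shape):
    """Eq. (1): one unit impulse per annotated head at (row, col)."""
    h = np.zeros(shape, dtype=float)
    for row, col in heads:
        h[row, col] += 1.0
    return h

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian truncated at 3 * sigma (assumed smoothing kernel)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    return g / g.sum()

def smooth(h, sigma):
    """Separable Gaussian blur: filter every row, then every column."""
    g = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, h)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)

# Hypothetical annotations, kept away from the border so no mass is cut off.
heads = [(8, 8), (10, 20), (20, 12)]
impulses = delta_map(heads, (32, 32))
density = smooth(impulses, sigma=2.0)

print(impulses.sum())  # equals the number of annotated heads
print(density.sum())   # smoothing with a normalized kernel preserves the count
```

Because the kernel is normalized, integrating the smoothed map still recovers the head count, which is what makes a density map usable for counting.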

 $\mathop {\max }\limits_D {E_{x \sim {P_{\rm{r}}}}}\left[ {{{\log }_2}D\left( x \right)} \right] + {E_{x \sim {P_{\rm{g}}}}}\left[ {{{\log }_2}\left( {1 - D\left( x \right)} \right)} \right]$ (4)

 $\mathop {\min }\limits_G \mathop {\max }\limits_D {E_{x \sim {P_{\rm{r}}}}}\left[ {{{\log }_2}D\left( x \right)} \right] + {E_{x \sim {P_{\rm{g}}}}}\left[ {{{\log }_2}\left( {1 - D\left( x \right)} \right)} \right]$ (5)

 $\begin{array}{*{20}{c}} {{L_{\rm{A}}}\left( {G,D} \right) = {E_{\mathit{\boldsymbol{x}},\mathit{\boldsymbol{y}} \sim {p_{{\rm{data}}}}\left( {\mathit{\boldsymbol{x}},\mathit{\boldsymbol{y}}} \right)}}\left[ {{{\log }_2}D\left( {\mathit{\boldsymbol{x}},\mathit{\boldsymbol{y}}} \right)} \right] + }\\ {{E_{\mathit{\boldsymbol{x}} \sim {p_{{\rm{data}}}}\left( \mathit{\boldsymbol{x}} \right)}}\left[ {{{\log }_2}\left( {1 - D\left( {\mathit{\boldsymbol{x}},G\left( \mathit{\boldsymbol{x}} \right)} \right)} \right)} \right]} \end{array}$ (6)

 ${L_{\rm{E}}}\left( G \right) = \frac{1}{C}\sum\limits_{c = 1}^C {\left\| {{\mathit{\boldsymbol{p}}^G}\left( c \right) - {\mathit{\boldsymbol{p}}^{{\rm{GT}}}}\left( c \right)} \right\|_2^2}$ (7)

 $L = \arg \mathop {\min }\limits_G \mathop {\max }\limits_D {L_{\rm{A}}}\left( {G,D} \right) + {\lambda _{\rm{E}}}{L_{\rm{E}}}\left( G \right)$ (8)
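To make Eqs. (6) to (8) concrete, the following sketch evaluates the adversarial term, the Euclidean term and their weighted sum on toy arrays. It is only an illustration under stated assumptions: `c` is treated as an index over C density maps (the excerpt does not define C), the discriminator outputs are given directly as probabilities, and the weight `lambda_e = 0.5` is a hypothetical value.

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    """Eq. (6): E[log2 D(x, y)] + E[log2(1 - D(x, G(x)))]."""
    return np.mean(np.log2(d_real)) + np.mean(np.log2(1.0 - d_fake))

def euclidean_loss(pred, gt):
    """Eq. (7): squared L2 distance per map, averaged over the C maps."""
    c = pred.shape[0]
    diff = (pred - gt).reshape(c, -1)
    return np.sum(diff ** 2) / c

def total_objective(d_real, d_fake, pred, gt, lambda_e):
    """Value inside the min-max of Eq. (8): L_A + lambda_E * L_E."""
    return adversarial_loss(d_real, d_fake) + lambda_e * euclidean_loss(pred, gt)

# Toy inputs (assumptions): an undecided discriminator and two 2x2 maps.
d_real = np.array([0.5, 0.5])   # D(x, y) on ground-truth pairs
d_fake = np.array([0.5, 0.5])   # D(x, G(x)) on generated pairs
pred = np.zeros((2, 2, 2))      # generated density maps
gt = np.ones((2, 2, 2))         # ground-truth density maps

l_a = adversarial_loss(d_real, d_fake)
l_e = euclidean_loss(pred, gt)
l_total = total_objective(d_real, d_fake, pred, gt, lambda_e=0.5)
print(l_a, l_e, l_total)
```

In training, the discriminator would be updated to increase the adversarial term while the generator is updated to decrease the whole objective; the sketch only evaluates the quantities and does not perform either update.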

# 3.1 Evaluation metrics

 ${f_{{\rm{MAE}}}} = \frac{1}{N}\sum\limits_{i = 1}^N {\left| {{C_i} - C_i^{{\rm{GT}}}} \right|}$ (9)

 ${f_{{\rm{MSE}}}} = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^N {{{\left( {{C_i} - C_i^{{\rm{GT}}}} \right)}^2}} }$ (10)

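MSE is an evaluation metric that reflects the degree of difference between the estimate and the quantity being estimated, and it indicates the robustness of the method.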
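The two metrics in Eqs. (9) and (10) can be computed directly from per-image counts. The sketch below follows the formulas as written, so the "MSE" includes the square root and is what is often called the root mean squared error. The function names and the sample counts are illustrative assumptions.

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Eq. (9): mean absolute difference between estimated and true counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return np.mean(np.abs(pred - gt))

def mse(pred_counts, gt_counts):
    """Eq. (10): square root of the mean squared count difference."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return np.sqrt(np.mean((pred - gt) ** 2))

# Hypothetical counts for three test images.
estimated = [10, 20, 30]
ground_truth = [12, 20, 26]

print(mae(estimated, ground_truth))  # absolute errors are 2, 0 and 4
print(mse(estimated, ground_truth))  # squared errors are 4, 0 and 16
```

Because the errors are squared before averaging, a single badly miscounted image raises the MSE far more than the MAE, which is why the MSE is read as a robustness indicator.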

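# 3.2 ShanghaiTech dataset

The ShanghaiTech dataset first appeared in Ref. [6]. It contains 1 198 images with a total of 330 165 annotated heads, making it the dataset with the largest number of annotated people known to date. The dataset has two parts: Part_A contains 482 images collected from the Internet, and Part_B contains 716 images taken on streets in Shanghai. Each part is split into a training set and a test set: Part_A has 300 training images, and the remaining 182 images are used for testing; Part_B has 400 training images and 316 test images. We follow the split of Ref. [6] for training and testing the network. Figure 6 compares the density maps generated by the proposed method on Part_A of ShanghaiTech with the ground truth.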

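Table 1 Comparison of results on the ShanghaiTech dataset

| Method | Part_A MAE | Part_A MSE | Part_B MAE | Part_B MSE |
|---|---|---|---|---|
| Ref. [17] | 181.8 | 277.7 | 32.0 | 49.8 |
| MCNN [6] | 110.2 | 173.2 | 26.4 | 41.3 |
| Switch-CNN [7] | 90.4 | 135.0 | 21.6 | 33.4 |
| CSRNet [20] | 68.2 | 115.0 | 10.6 | 16.0 |
| Ours | 60.5 | 109.7 | 10.2 | 15.3 |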

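# 3.3 WorldExpo′10 dataset

The WorldExpo′10 dataset first appeared in Ref. [17]. It contains 1 132 annotated video sequences captured by 108 surveillance cameras, all recorded at the 2010 Shanghai World Expo. Ref. [17] provides 3 980 images containing 199 923 pedestrians with annotated heads, of which 3 380 images are used as training data. The test data consist of five video sequences, each containing 120 annotated images, denoted S1 to S5. The density-map generation method in Ref. [17] differs from that in Ref. [6]: Ref. [17] provides a perspective map for each scene and computes the crowd density kernel from it, and only the region of interest (ROI) of each image is considered at test time. We use the MAE to compare the performance of different methods. Table 2 lists the results on each of the five video sequences together with the average over them. Although the proposed method is less accurate on S1 than the method of Ref. [20], its average MAE is 0.66 lower than that of CSRNet, so the overall accuracy improves.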

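Table 2 Comparison of results (MAE) on the WorldExpo′10 dataset

| Method | S1 | S2 | S3 | S4 | S5 | Average |
|---|---|---|---|---|---|---|
| Ref. [17] | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9 |
| MCNN [6] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6 |
| Switch-CNN [7] | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4 |
| CSRNet [20] | 2.9 | 14.7 | 10.5 | 10.4 | 5.8 | 8.86 |
| Ours | 3.2 | 13.4 | 9.6 | 10.0 | 4.6 | 8.2 |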
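As a quick arithmetic check of the 0.66 improvement quoted for WorldExpo′10, the sketch below recomputes the two averages being compared from the per-sequence MAEs in Table 2. The variable names are illustrative; the numbers are copied from the table.

```python
# Per-sequence MAE on S1..S5, copied from Table 2.
csrnet_mae = [2.9, 14.7, 10.5, 10.4, 5.8]
ours_mae = [3.2, 13.4, 9.6, 10.0, 4.6]

avg_csrnet = sum(csrnet_mae) / len(csrnet_mae)
avg_ours = sum(ours_mae) / len(ours_mae)

# The table reports the proposed method's average rounded to one decimal,
# so the quoted gap is taken against that rounded value.
gap = avg_csrnet - round(avg_ours, 1)

print(f"CSRNet average: {avg_csrnet:.2f}")
print(f"Ours average:   {avg_ours:.2f}")
print(f"Gap:            {gap:.2f}")
```

The per-sequence comparison also shows where the gain comes from: the proposed method is behind CSRNet on S1 and ahead on S2 to S5.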

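# 3.4 UCF_CC_50 dataset

The UCF_CC_50 dataset was created by Idrees et al. [27] for their study of crowd density estimation and contains 50 images collected from the Internet. It is a highly challenging crowd counting dataset, not only because it contains so few images but also because the number of people per image varies widely: from as few as 9 to as many as 4 543, with an average of 1 280 per image. The authors provide 63 974 annotations. Following the protocol of Ref. [27], we evaluate the proposed algorithm with five-fold cross-validation. Table 3 compares the results of different methods. As Table 3 shows, the proposed method improves considerably on earlier classical algorithms: compared with the algorithm of Ref. [20], the MAE improves by 18.6 and the MSE by 22.9. UCF_CC_50 mainly tests how well an algorithm predicts crowd counts in dense scenes. The experimental results show that the network based on adversarial dilated convolutions retains superior counting accuracy and robustness in dense scenes.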

Table 3 Comparison of results on the UCF_CC_50 dataset

| Method | MAE | MSE |
|---|---|---|
| Ref. [27] | 419.5 | 487.1 |
| Ref. [17] | 467.0 | 498.5 |
| MCNN [6] | 377.6 | 509.1 |
| Switch-CNN [7] | 318.1 | 439.2 |
| CSRNet [20] | 266.1 | 397.5 |
| Ours | 247.5 | 374.6 |
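The five-fold cross-validation protocol used for UCF_CC_50 can be sketched as follows. This is a generic illustration rather than the exact split of Ref. [27]: the random shuffling, the seed and the function name `five_fold_splits` are assumptions.

```python
import numpy as np

def five_fold_splits(n_images=50, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each image is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)        # assumed random assignment to folds
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

splits = list(five_fold_splits())
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train_idx)} training images, {len(test_idx)} test images")
```

With 50 images, each fold trains on 40 images and tests on the remaining 10. The reported MAE and MSE would then be obtained by averaging the per-fold results.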

# 3.5 UCSD dataset

Table 4 Comparison of results on the UCSD dataset

| Method | MAE | MSE |
|---|---|---|
| Kernel Ridge Regression [12] | 2.16 | 7.45 |
| Cumulative Attributes [13] | 2.07 | 6.86 |
| Ref. [17] | 1.60 | 3.31 |
| MCNN [6] | 1.07 | 1.35 |
| Switch-CNN [7] | 1.62 | 2.10 |
| CSRNet [20] | 1.07 | 1.35 |
| Ours | 1.02 | 1.75 |

