FastFace: a real-time robust algorithm for face detection
2019, Vol. 24, No. 10: 1761-1771
Received: 2018-12-14; Revised: 2019-05-09; Accepted: 2019-05-16; Published in print: 2019-10-16
DOI: 10.11834/jig.180662
Objective
Although face detectors based on deep neural networks have greatly improved detection accuracy, they do so at the cost of relying on powerful computing resources. Achieving high detection accuracy on a CPU while reaching real-time speed is therefore a major challenge. Aiming at fast and robust face detection under unconstrained conditions, this paper proposes a detection method based on a lightweight neural network.
Method
Inspired by the lightweight network MobileNet, the proposed algorithm extracts features with channel-separated (depth-wise separable) convolutions and combines the ideas of Inception and residual connections to build several feature extraction modules, which are finally trained into a simple and efficient backbone network. During detection, a one-stage strategy is adopted: convolutions applied at several levels of the backbone simultaneously classify and localize candidate regions. When refining these regions, prior boxes are first placed on the corresponding feature layers, and a bounding-box regression algorithm then adjusts their positions and sizes toward the ground-truth boxes. To reduce the number of prior boxes and save model parameters, the boxes are set according to the characteristics of face bounding boxes. A minimal sketch of such a multi-level detection head is given below.
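The paper's exact head configuration is not reproduced here; the following is only a rough sketch, assuming a tf.keras model with three hypothetical backbone feature maps (shapes, channel counts, and anchor counts are placeholders), of how a one-stage head applies classification and box-regression convolutions to each level in a single forward pass.

```python
import tensorflow as tf

def detection_head(feature_map, num_anchors, name):
    """3x3 conv heads on one feature level: per-anchor face/background
    scores and four box-regression offsets."""
    cls = tf.keras.layers.Conv2D(num_anchors * 2, 3, padding="same",
                                 name=name + "_cls")(feature_map)
    reg = tf.keras.layers.Conv2D(num_anchors * 4, 3, padding="same",
                                 name=name + "_reg")(feature_map)
    # Flatten the spatial grid so predictions from all levels can be concatenated.
    cls = tf.keras.layers.Reshape((-1, 2))(cls)
    reg = tf.keras.layers.Reshape((-1, 4))(reg)
    return cls, reg

# Hypothetical feature maps from three backbone levels (placeholder shapes).
p3 = tf.keras.Input(shape=(40, 40, 64))
p4 = tf.keras.Input(shape=(20, 20, 128))
p5 = tf.keras.Input(shape=(10, 10, 256))

heads = [detection_head(f, num_anchors=1, name=n)
         for f, n in zip([p3, p4, p5], ["p3", "p4", "p5"])]
cls_out = tf.keras.layers.Concatenate(axis=1)([h[0] for h in heads])
reg_out = tf.keras.layers.Concatenate(axis=1)([h[1] for h in heads])
model = tf.keras.Model([p3, p4, p5], [cls_out, reg_out])
```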
Result
The detection model is built and trained with the TensorFlow deep learning library, tested on the FDDB dataset, and compared with several classical algorithms in terms of detection speed and accuracy. Compared with typical deep learning methods such as the multi-task cascaded convolutional network (MTCNN), the proposed algorithm raises the detection speed on a CPU to 25 frames/s while keeping the mean average precision (mAP) at 0.892, which is higher than that of most traditional algorithms. The experimental results show that the method achieves real-time, high-accuracy detection on a CPU.
Conclusion
A face detection method based on a lightweight network model is proposed. The backbone network is built from simple and efficient convolution modules, and reasonable prior boxes are set according to the aspect-ratio characteristics of faces during detection. Under unconstrained conditions and with limited computing resources, the method not only achieves good accuracy but also runs fast, making it a robust detector.
Objective
Face detection is a crucial step in various problems involving verification, identification, and expression analysis. Although state-of-the-art convolutional neural network (CNN)-based face detectors exhibit improved detection accuracy, they are unsuitable for CPU devices because they are computationally prohibitive. Achieving high detection accuracy on CPUs while realizing real-time detection remains challenging. One reason is that most backbone networks in current face detection models are transferred from generic object detection networks; the models themselves are large and contain redundant information when modeling human faces. Moreover, the large search space of possible face locations and the variation of face sizes within one image require heavy computation for robust detection. Aiming at the fast and robust face detection problem under unconstrained conditions, this paper proposes a detection method based on a self-designed lightweight neural network.
Method
The intuition is to perform model compression and acceleration on deep networks without significantly decreasing model performance, and efforts have been made to design compact networks. Results have proved that changing the direction of convolution can save parameters in neural networks.
In this study, depth-wise separable convolution, which was first introduced in MobileNets, is used for feature extraction. We then combine the ideas of Inception and residual connection to construct several feature extraction modules, which finally constitute our backbone network. Unlike standard convolution, depth-wise separable convolution implements the convolution operation as a depth-wise convolution followed by a 1×1 point-wise convolution. When the kernel size is 3×3, depth-wise separable convolution uses 8 to 9 times less computation than standard convolution.
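To make this saving concrete: for a D_K × D_K kernel, M input channels, N output channels, and a D_F × D_F output map, a standard convolution costs D_K·D_K·M·N·D_F·D_F multiply-adds, whereas the depth-wise separable version costs D_K·D_K·M·D_F·D_F + M·N·D_F·D_F, a reduction factor of 1/N + 1/D_K², i.e., roughly 1/8 to 1/9 for 3×3 kernels (the figure quoted from MobileNets). Below is a minimal TensorFlow/Keras sketch of such a factorized convolution block; the layer sizes are illustrative rather than the paper's actual configuration.

```python
import tensorflow as tf

def depthwise_separable_block(x, out_channels, stride=1):
    """Depth-wise separable convolution as in MobileNets: a 3x3 per-channel
    (depth-wise) convolution followed by a 1x1 point-wise convolution
    that mixes channels."""
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

# Illustrative usage: 32 input channels -> 64 output channels, stride 2.
inputs = tf.keras.Input(shape=(160, 160, 32))
outputs = depthwise_separable_block(inputs, out_channels=64, stride=2)
model = tf.keras.Model(inputs, outputs)
model.summary()
```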
Given that Inception modules and residual connections have become essential in modern networks, we also use them in our model to enrich receptive fields. In our backbone network, depth-wise separable convolution is used to extract features, and residual connections and Inception modules are introduced into the feature extraction modules to enrich receptive fields. In contrast to existing convolutional modules, we design our own bottleneck modules (with different strides), inception modules, and residual inception modules based on depth-wise separable convolution; these modules are then concatenated to form a complete network model. Inception modules, which are composed of bottleneck modules in parallel, aim at rapidly reducing the size of the input image. As the name suggests, residual inception modules are inception modules with residual connections; they decrease the sizes of feature maps and enrich receptive fields. Detection is carried out on multiple feature layers to increase robustness to scale variation of faces in input images.
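The precise module layout is defined in the full paper; the sketch below only assumes the general pattern: a few depth-wise separable branches with different receptive fields run in parallel, their outputs are concatenated, and a shortcut adds the input back. Branch widths and kernel sizes are placeholders, and the downsampling (strided) variant, which would need a strided shortcut, is omitted for brevity.

```python
import tensorflow as tf

def sep_conv(x, channels, kernel_size):
    """Depth-wise separable convolution helper (see the previous sketch)."""
    x = tf.keras.layers.DepthwiseConv2D(kernel_size, padding="same", use_bias=False)(x)
    x = tf.keras.layers.Conv2D(channels, 1, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

def residual_inception_module(x, channels_per_branch=32):
    """Parallel branches with different receptive fields, concatenated and
    added back to the input through a shortcut (residual) path."""
    b1 = sep_conv(x, channels_per_branch, 3)                                      # 3x3 field
    b2 = sep_conv(sep_conv(x, channels_per_branch, 3), channels_per_branch, 3)    # ~5x5 field
    b3 = tf.keras.layers.Conv2D(channels_per_branch, 1, padding="same")(x)        # 1x1 field
    merged = tf.keras.layers.Concatenate()([b1, b2, b3])
    # Project back to the input width so the element-wise addition is valid.
    merged = tf.keras.layers.Conv2D(x.shape[-1], 1, padding="same")(merged)
    return tf.keras.layers.Add()([x, merged])

inputs = tf.keras.Input(shape=(40, 40, 96))
outputs = residual_inception_module(inputs)
model = tf.keras.Model(inputs, outputs)
```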
While detecting faces, a one-stage detection strategy is applied for speed. We conduct detection at three different levels of feature maps in a single feed-forward pass; that is, we simultaneously classify and regress object areas on these feature maps by using convolutions. When fine-tuning the exact locations of the object areas, we first place prior boxes, namely default anchors, on the corresponding feature layers and then use the bounding-box regression algorithm to adjust the locations and sizes of the anchors so that they move closer to the ground-truth boxes. To reduce the number of default anchors and save model parameters, we set the default anchors according to prior knowledge of face box aspect ratios.
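The actual anchor scales and the exact regression encoding are specified in the paper; the sketch below merely assumes SSD-style offsets (shift the anchor centre, rescale width and height exponentially) and a single near-square aspect ratio, reflecting the prior that face boxes are roughly square. All numeric values are placeholders.

```python
import numpy as np

def make_anchors(feature_size, image_size, scale, aspect_ratio=1.0):
    """Tile one anchor per cell of a feature map. Face boxes are close to
    square, so a single aspect ratio keeps the anchor count small."""
    stride = image_size / feature_size
    centers = (np.arange(feature_size) + 0.5) * stride
    cx, cy = np.meshgrid(centers, centers)
    w = scale * np.sqrt(aspect_ratio)
    h = scale / np.sqrt(aspect_ratio)
    return np.stack([cx.ravel(), cy.ravel(),
                     np.full(cx.size, w), np.full(cx.size, h)], axis=1)  # (N, 4) cx, cy, w, h

def decode(anchors, offsets):
    """SSD-style bounding-box regression decode: offsets = (tx, ty, tw, th)."""
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2])
    h = anchors[:, 3] * np.exp(offsets[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

# Illustrative: 20x20 feature map of a 320x320 input, anchor scale 64 px.
anchors = make_anchors(feature_size=20, image_size=320, scale=64.0)
boxes = decode(anchors, np.zeros_like(anchors))  # zero offsets reproduce the anchors
```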
Result
We construct and train our detection model with the TensorFlow deep learning library. The model is trained on the WIDER FACE dataset with several data augmentation tricks. We test the model on the Face Detection Data Set and Benchmark (FDDB) and compare its mean average precision (mAP) and detection speed with those of several classical algorithms. The proposed method achieves real-time, high-precision detection on the CPU. Compared with typical deep learning methods such as multi-task cascaded convolutional networks (MTCNN), our method increases the detection speed to 25 frames per second on CPUs while maintaining an mAP of 0.892, which is higher than that of most traditional methods and reaches a relatively high precision level.
Conclusion
Face detectors based on deep learning exhibit improved detection accuracy. However, the high computational complexity of these methods leads to very slow detection speed on CPUs. This paper presents a fast and robust face detection method based on a lightweight neural network. A simple and efficient convolutional neural network is constructed with depth-wise separable convolutions, and the ideas of Inception and residual connections are also used to keep the model lightweight yet powerful. The default anchors are set according to the characteristics of face boxes while applying the one-stage detection strategy. Experiments demonstrate that the proposed method significantly reduces redundant operations in the detection process. With a detection speed of 25 frames/s on CPUs, the method is robust: it not only performs well in terms of accuracy but also detects quickly with limited computing resources under unconstrained conditions.
Wu K, Zhu H L, Hao Y Y, et al. Cascade regression based multi-pose face alignment[J]. Journal of Image and Graphics, 2017, 22(2): 257-264. [DOI: 10.11834/jig.20170214]
Wang X H, Li R J, Hu M, et al. Occluded facial expression recognition based on the fusion of local features[J]. Journal of Image and Graphics, 2016, 21(11): 1473-1482. [DOI: 10.11834/jig.20161107]
Ding C X, Tao D C. Robust face recognition via multimodal deep face representation[J]. IEEE Transactions on Multimedia, 2015, 17(11):2049-2058.[DOI:10.1109/TMM.2015.2477042]
Zhang J, He H, Zhan X S, et al. Three dimensional face reconstruction via feature adaption and Laplace deformation[J]. Journal of Image and Graphics, 2014, 19(9): 1349-1359. [DOI: 10.11834/jig.20140912]
Viola P, Jones M J. Robust real-time face detection[J]. International Journal of Computer Vision, 2004, 57(2):137-154.[DOI:10.1023/b:visi.0000013087.49260.fb]
Mathias M, Benenson R, Pedersoli M, et al. Face detection without bells and whistles[C]//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 720-735. [DOI: 10.1007/978-3-319-10593-2_47]
Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7):971-987.[DOI:10.1109/TPAMI.2002.1017623]
Smola A J, Schölkopf B. A tutorial on support vector regression[J]. Statistics and Computing, 2004, 14(3):199-222.[DOI:10.1023/B:STCO.0000035301.49549.88]
Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model[C]//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008: 1-8. [DOI: 10.1109/CVPR.2008.4587597]
Zhang K P, Zhang Z P, Li Z F, et al. Joint face detection and alignment using multitask cascaded convolutional networks[J]. IEEE Signal Processing Letters, 2016, 23(10):1499-1503.[DOI:10.1109/LSP.2016.2603342]
Jiang H Z, Learned-Miller E. Face detection with the faster R-CNN[C]//Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition. Washington, DC, USA: IEEE, 2017: 650-657. [DOI: 10.1109/FG.2017.82]
Najibi M, Samangouei P, Chellappa R, et al. SSH: single stage headless face detector[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 4875-4884. [DOI: 10.1109/ICCV.2017.522]
Cheng Y, Wang D, Zhou P, et al. A survey of model compression and acceleration for deep neural networks[EB/OL]. (2017-10-23)[2018-12-15]. https://arxiv.org/pdf/1710.09282.pdf.
Howard A G, Zhu M, Chen B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[EB/OL]. (2017-04-17)[2018-12-15]. https://arxiv.org/pdf/1704.04861.pdf.
Sandler M, Howard A, Zhu M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 4510-4520. [DOI: 10.1109/CVPR.2018.00474]
Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 2818-2826. [DOI: 10.1109/CVPR.2016.308]
Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, CA, USA: AAAI, 2017.
Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 21-37. [DOI: 10.1007/978-3-319-46448-0_2]
Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2015: 91-99.
Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788. [DOI: 10.1109/CVPR.2016.91]
He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778. [DOI: 10.1109/CVPR.2016.90]
Yang S, Luo P, Loy C C, et al. WIDER FACE: a face detection benchmark[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 5525-5533. [DOI: 10.1109/CVPR.2016.596]
Jain V, Learned-Miller E. FDDB:A benchmark for face detection in unconstrained settings[R]. Amherst Town, Massachusetts:University of Massachusetts Amherst, 2010.
Zhou Z H. Machine Learning[M]. Beijing: Tsinghua University Press, 2016: 30-34.
Yang B, Yan J J, Lei Z, et al. Aggregate channel features for multi-view face detection[C]//Proceedings of 2014 IEEE International Joint Conference on Biometrics. Clearwater, FL, USA: IEEE, 2014: 1-8. [DOI: 10.1109/BTAS.2014.6996284]
Li H X, Lin Z, Brandt J, et al. Efficient boosted exemplar-based face detection[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 1843-1850. [DOI: 10.1109/CVPR.2014.238]
Li J G, Wang T, Zhang Y M. Face detection using SURF cascade[C]//Proceedings of 2011 IEEE International Conference on Computer Vision Workshops. Barcelona, Spain: IEEE, 2011: 2183-2190. [DOI: 10.1109/ICCVW.2011.6130518]
Li H, Hua G, Lin Z, et al. Probabilistic elastic part model for unsupervised face detector adaptation[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, NSW, Australia: IEEE, 2013: 793-800. [DOI: 10.1109/ICCV.2013.103]
Shen X H, Lin Z, Brandt J, et al. Detecting and aligning faces by image retrieval[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 3460-3467. [DOI: 10.1109/CVPR.2013.444]
Subburaman V B, Marcel S. Fast bounding box estimation based face detection[R]. Crete, Greece: ECCV, 2010.
Viola P, Jones M. Rapid object detection using a boosted cascade of simple features[C]//Proceedings of 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Kauai, HI, USA: IEEE, 2001. [DOI: 10.1109/CVPR.2001.990517]
Kalal Z, Mikolajczyk K, Matas J. Face-TLD: tracking-learning-detection applied to faces[C]//Proceedings of 2010 IEEE International Conference on Image Processing. Hong Kong, China: IEEE, 2010: 3789-3792. [DOI: 10.1109/ICIP.2010.5653525]
Kienzle W, Bakır G, Franz M, et al. Face detection-efficient and rank deficient[C]//Proceedings of the 17th International Conference on Neural Information Processing Systems. Vancouver, British Columbia, Canada: MIT Press, 2004: 673-680.
Hsu R L, Abdel-Mottaleb M, Jain A K. Face detection in color images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(5):696-706.[DOI:10.1109/34.1000242]
Li H X, Lin Z, Shen X H, et al. A convolutional neural network cascade for face detection[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 5325-5334. [DOI: 10.1109/CVPR.2015.7299170]
Qin H W, Yan J J, Li X, et al. Joint training of cascaded CNN for face detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 3456-3465. [DOI: 10.1109/CVPR.2016.376]