Review of optimization methods for supervised deep learning
Vol. 28, Issue 4, Pages: 963-983 (2023)
Published: 16 April 2023
DOI: 10.11834/jig.211139
江铃燚, 郑艺峰, 陈澈, 李国和, 张文杰. 2023. 有监督深度学习的优化方法研究综述. 中国图象图形学报, 28(04):0963-0983
Jiang Lingyi, Zheng Yifeng, Chen Che, Li Guohe, Zhang Wenjie. 2023. Review of optimization methods for supervised deep learning. Journal of Image and Graphics, 28(04):0963-0983
随着大数据的普及和算力的提升,深度学习已成为一个热门研究领域,但其强大的性能过分依赖网络结构和参数设置。因此,如何在提高模型性能的同时降低模型的复杂度,其关键在于模型优化。为了更加精简地描述优化问题,本文以有监督深度学习作为切入点,对其提升拟合能力和泛化能力的优化方法进行归纳分析。首先,给出优化的基本公式并阐述其核心;其次,从拟合能力的角度将优化问题分解为3个优化方向,即收敛性、收敛速度和全局质量问题,并总结分析这3个优化方向中的具体方法与研究成果;再次,从提升模型泛化能力的角度出发,分为数据预处理和模型参数限制两类对正则化方法的研究现状进行梳理;最后,结合上述理论基础,以生成对抗网络(generative adversarial network,GAN)变体模型的发展历程为主线,回顾各种优化方法在该领域的应用,并基于实验结果对优化效果进行比较和分析,进一步给出几种在GAN领域效果较好的优化策略。现阶段,各种优化方法已普遍应用于深度学习模型,能够较好地提升模型的拟合能力,同时通过正则化缓解模型过拟合问题来提高模型的鲁棒性。尽管深度学习的优化领域已得到广泛研究,但仍缺少成熟的系统性理论来指导优化方法的使用,且存在几个优化问题有待进一步研究,包括无法保证全局梯度的Lipschitz限制、在GAN中找寻稳定的全局最优解,以及优化方法的可解释性缺乏严格的理论证明。
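The "basic formula of optimization" referred to in the abstract is not reproduced on this page. As a generic illustration only (the notation below is ours, not necessarily the paper's), supervised deep learning optimization is commonly written as regularized empirical risk minimization:

$$\min_{\theta}\ \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i;\theta),\,y_i\big) + \lambda\,\Omega(\theta)$$

where $f(\cdot;\theta)$ is the network with parameters $\theta$, $L$ is the loss over the $N$ labeled samples $(x_i, y_i)$, and $\Omega$ is a regularizer (for example an $L_2$ penalty) weighted by $\lambda$. The three fitting-related directions discussed below (convergence, convergence speed, global quality) concern the minimization itself, while the regularization term targets generalization.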
Deep learning has developed rapidly in the big data era, but its performance depends heavily on the design of the network structure and the parameter settings. The key to improving model performance while keeping model complexity under control is therefore model optimization. In terms of learning paradigm, machine learning can be divided into supervised learning, unsupervised learning, semi-supervised learning, deep learning, and reinforcement learning. To describe the optimization problem concisely, we take supervised deep learning as the entry point and summarize and analyze the optimization methods that improve its fitting ability and generalization ability. First, the basic formulation of optimization is given and its core elements are explained. Then, with respect to fitting ability, the optimization problem is decomposed into three directions: 1) convergence, 2) convergence speed, and 3) global quality, and the specific methods and research results in each direction are summarized and analyzed. Convergence concerns whether the algorithm converges to a solution such as a stationary point. The gradient exploding/vanishing problem arises because small perturbations in a multi-layer network can be amplified or attenuated layer by layer until they explode or vanish. Convergence speed concerns how fast the model converges; once convergence is ensured, acceleration methods should be considered to further improve training efficiency. The global quality problem concerns whether the model converges to a low-loss solution, ideally the global minimum. The first two problems are local, the last one is global, and the boundaries between the three are fuzzy; for example, some methods that improve convergence also accelerate it. Beyond fitting, the large number of parameters in deep learning models must also be considered, because it can cause overfitting and thus poor generalization. Regularization is an effective remedy. To improve generalization, we review the current regularization methods from two aspects: 1) data processing and 2) model parameter constraints. Data processing refers to operations on the data during training, such as dataset augmentation, noise injection, and adversarial training, which effectively improve generalization. Model parameter constraints restrict the parameters of the network directly and likewise improve generalization. We take the generative adversarial network (GAN), a widely used deep learning model, as the application background, review the development of its variants, and analyze the application of the relevant optimization methods in the GAN domain from the two aspects of fitting and generalization ability. Taking WGAN with gradient penalty (WGAN-GP) as the base model, we design an experiment on the MNIST-10 dataset to study the applicability of six algorithms (stochastic gradient descent (SGD), momentum SGD, Adagrad, Adadelta, root mean square propagation (RMSProp), and Adam) in deep learning based GAN training.
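The experimental code is not part of this abstract. The following is a minimal, illustrative PyTorch sketch (hyperparameters and helper names are our own assumptions, not the paper's settings) of how the six optimizers could be instantiated for a WGAN-GP comparison, together with the gradient penalty term that pushes the critic's gradient norm toward 1 on interpolated samples:

```python
# Illustrative sketch only, not the paper's code.
import torch
from torch import nn

def make_optimizers(params, lr=1e-4):
    """The six optimizers compared in the abstract; settings are assumed for
    illustration. `params` should be a list, e.g. list(model.parameters())."""
    return {
        "SGD":          torch.optim.SGD(params, lr=lr),
        "Momentum SGD": torch.optim.SGD(params, lr=lr, momentum=0.9),
        "Adagrad":      torch.optim.Adagrad(params, lr=lr),
        "Adadelta":     torch.optim.Adadelta(params),   # adapts its own step size
        "RMSProp":      torch.optim.RMSprop(params, lr=lr),
        "Adam":         torch.optim.Adam(params, lr=lr, betas=(0.5, 0.9)),
    }

def gradient_penalty(critic, real, fake):
    """WGAN-GP term: penalize deviation of the critic's gradient norm from 1
    on random interpolations between real and generated batches (N, C, H, W)."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```

In an actual comparison, the generator and critic would be re-initialized and a single optimizer built per training run; the dictionary above only lists the six configurations side by side.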
Based on the experimental results of multiple optimization methods on GAN variants, the optimization effects are compared and analyzed, and several optimization strategies that work well in the GAN domain are further clarified. At present, various optimization methods are widely used in deep learning models: methods that improve fitting ability raise model performance, while regularization methods alleviate overfitting and improve robustness. However, there is still a lack of mature, systematic theory to guide the use of these optimization methods, and several optimization problems remain open. The Lipschitz constraint on global gradients cannot be guaranteed in deep neural networks because of the gap between theory and practice. In the GAN domain, theoretical breakthroughs are still needed to find a stable global optimal solution, that is, the optimal Nash equilibrium. Moreover, many existing optimization methods are empirical, and their interpretability lacks rigorous theoretical proof. Deep learning optimization methods are numerous and complex; when applying them, attention should be paid to the combined effect of multiple optimizations. This review can serve as a reference for selecting optimization methods in the design of deep neural networks.
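As a side note on the Lipschitz issue raised above, the following sketch (our illustration, not the paper's method) shows two mechanisms commonly used in practice to approximately constrain the Lipschitz constant of a GAN critic, spectral normalization and weight clipping:

```python
# Illustrative sketch only: approximate Lipschitz control for a GAN critic.
import torch
from torch import nn

def build_sn_critic(in_features=784, hidden=256):
    """Spectral normalization bounds each layer's spectral norm by ~1,
    so the stacked critic is approximately 1-Lipschitz."""
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Linear(in_features, hidden)), nn.LeakyReLU(0.2),
        sn(nn.Linear(hidden, hidden)), nn.LeakyReLU(0.2),
        sn(nn.Linear(hidden, 1)),
    )

def clip_weights(critic, c=0.01):
    """Original WGAN heuristic: clip every weight into [-c, c] after each update."""
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
```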
机器学习;深度学习;深度学习优化;正则化;生成对抗网络(GAN)
machine learning; deep learning; deep learning optimization; regularization; generative adversarial network (GAN)