Video anomaly detection by fusing self-attention and autoencoder
2023, Vol. 28, No. 4, Pages 1029-1040
Print publication date: 2023-04-16
DOI: 10.11834/jig.211147
Liang Jiafei, Li Ting, Yang Jiaqi, Li Yanan, Fang Zhiwen, Yang Feng. 2023. Video anomaly detection by fusing self-attention and autoencoder. Journal of Image and Graphics, 28(04):1029-1040
Objective
Video anomaly detection mines the patterns of normal event samples to detect abnormal events that do not conform to the normal patterns. Autoencoder-based models are widely used in video anomaly detection, but because feature extraction by self-supervised learning is somewhat blind, the feature expression ability of such networks is limited. To improve the model's ability to learn normal patterns, we propose a video anomaly detection method based on Transformer and U-Net.
Method
First, the encoder downsamples the input consecutive frames to extract low-level features, and feeds the feature map of the last layer into a Transformer to encode global information and learn the correlations between feature pixels. The decoder then upsamples the encoded features and fuses them, via skip connections, with the encoder's low-level features of the same resolution, combining global spatial information with local detail information to localize anomalies. To meet the abnormal-feedback requirements of close-range rehabilitation actions, we collect an indoor close-range dataset based on periodic actions and further introduce a dynamic image constraint that guides the network to focus on close-range periodic motion regions.
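The encoder/decoder data flow with a Transformer bottleneck and skip connections can be sketched in numpy. This is a minimal illustration, not the paper's implementation: average pooling, nearest-neighbour upsampling, random projection matrices, a single attention head, and two levels stand in for the trained 3 × 3 convolutions, max pooling, four-level structure, and full Transformer encoder, so only the data flow and tensor shapes are faithful.

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample(x):
    # 2x2 average pooling, standing in for a conv + max-pool encoder stage
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample(x):
    # nearest-neighbour 2x upsampling, standing in for deconvolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

def self_attention(tokens, d):
    # single-head self-attention over flattened bottleneck feature pixels;
    # Wq/Wk/Wv are random placeholders for trained weights
    Wq, Wk, Wv = (rng.standard_normal((tokens.shape[1], d)) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ v

def forward(frames):
    # frames: (H, W, C) stack of consecutive input frames along channels
    e1 = frames                       # encoder level 1 (full resolution)
    e2 = downsample(e1)               # encoder level 2
    bott = downsample(e2)             # bottleneck feature map
    h, w, c = bott.shape
    tokens = bott.reshape(h * w, c)   # flatten pixels into tokens
    t = self_attention(tokens, d=c).reshape(h, w, c)  # global context
    d2 = np.concatenate([upsample(t), e2], axis=-1)   # skip connection
    # channel slice stands in for a conv that halves the channel count
    d1 = np.concatenate([upsample(d2[..., :c]), e1], axis=-1)
    return d1
```

For a (16, 16, 4) input, `forward` returns a (16, 16, 8) tensor whose last concatenation carries both the globally attended features and the full-resolution encoder details.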
Result
Experiments compare our method with related methods on four outdoor public datasets and one indoor close-range dataset. On the outdoor datasets CUHK (Chinese University of Hong Kong) Avenue, UCSD Ped1 (University of California, San Diego, pedestrian1), UCSD Ped2, and LV (live videos), the frame-level AUC (area under curve) of our algorithm improves by 1%, 0.4%, 1.1%, and 6.8%, respectively. On the indoor dataset, our algorithm outperforms related algorithms by more than 1.6%. Ablation results verify the effectiveness of the Transformer module and the dynamic image constraint.
Conclusion
Combining the U-Net with a Transformer based on the self-attention mechanism improves the model's ability to learn normal patterns and thus detects abnormal events in videos effectively.
Objective
Anomaly detection has been developing in the video surveillance domain. Video anomaly detection aims to detect irregular motions, and is also relevant to long-distance rehabilitation motion analysis. However, it is difficult to obtain training samples that cover all types of abnormal events. Therefore, existing video anomaly detection methods usually train a model on datasets that contain only normal samples. In the testing phase, events whose patterns differ from the normal patterns are detected as anomalies. To represent normal motion patterns in videos, early works relied on hand-crafted features and focused on low-level trajectory features. However, effective trajectory features are hard to obtain in complicated scenarios. Spatial-temporal features such as the histogram of oriented flows (HOF) and the histogram of oriented gradients (HOG) are commonly used to represent motion and content in anomaly detection. To model motion and appearance patterns, spatial-temporal features have been combined with the Markov random field (MRF), the mixture of probabilistic PCA (MPPCA), and the Gaussian mixture model. Based on the assumption that normal patterns can be represented as linear combinations over dictionaries, sparse coding and dictionary learning can be used to encode normal patterns. Due to the insufficient descriptive power of hand-crafted features, the robustness of these models remains poor across multiple scenarios. More recently, autoencoder-based deep learning methods have been introduced into video anomaly detection. A 3D convolutional autoencoder is designed to model normal patterns in regular frames. A convolutional long short-term memory (LSTM) autoencoder, which combines a convolutional neural network (CNN) with LSTM, is developed to model normal appearance and motion patterns simultaneously.
Motivated by the strong performance of sparse coding-based anomaly detection, an adaptive iterative hard-thresholding algorithm is designed within an LSTM framework to learn the sparse representation and dictionary of normal patterns. In contrast to reconstruction-based models, autoencoder-based prediction networks have been introduced into anomaly detection; they detect anomalies by computing the error between predicted frames and ground truth frames. Additionally, to process spatial-temporal information at different scales, a convolutional gated recurrent unit (ConvGRU) based multipath frame prediction network has been demonstrated. Due to the blindness of self-supervised learning in anomaly detection, CNN-based methods are limited in mining normal patterns. To improve the capability of feature expression, the vision transformer (ViT) extends the Transformer from natural language processing to the image domain, and CNNs can be integrated with Transformers to learn global context information. Hence, we develop a Transformer and U-Net-based anomaly detection method.
Method
In this study, a Transformer is embedded in a plain U-Net to learn the local and global spatial-temporal information of normal events. First, an encoder extracts spatial-temporal features from consecutive frames. To encode global information and learn the correlations between feature pixels, the final features of the encoder are fed into the Transformer. A decoder then upsamples the Transformer features and merges them, via skip connections, with the encoder's low-level features of the same resolution. The whole network thus combines global spatial-temporal information with local detail information. The convolution and deconvolution kernels are 3 × 3, and the max-pooling kernel is 2 × 2. The encoder and decoder both have four layers. To make the predicted frames close to their ground truth, we minimize the intensity and gradient distances between them. Because existing anomaly detection datasets are based on long-distance outdoor settings, we collected an indoor motion dataset from published hand-movement datasets to meet the requirements of anomaly detection for close-range rehabilitation movement. For periodic hand movements, in addition to the traditional reconstruction loss, we introduce a dynamic image constraint to further guide the network to focus on the periodic close-range motion area.
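The three distances involved in training (intensity, gradient, and the dynamic image constraint) can be sketched in numpy. The dynamic image uses the approximate rank-pooling coefficients of Bilen et al. (2016), where frame t of a T-frame clip is weighted by 2t - T - 1; how the paper weights and combines the terms is not reproduced here, so treat the functions as illustrative building blocks only.

```python
import numpy as np

def intensity_loss(pred, gt):
    # L2 distance between a predicted frame and its ground truth
    return np.mean((pred - gt) ** 2)

def gradient_loss(pred, gt):
    # L1 distance between the spatial gradients of prediction and ground truth
    dy_p, dx_p = np.abs(np.diff(pred, axis=0)), np.abs(np.diff(pred, axis=1))
    dy_g, dx_g = np.abs(np.diff(gt, axis=0)), np.abs(np.diff(gt, axis=1))
    return np.mean(np.abs(dy_p - dy_g)) + np.mean(np.abs(dx_p - dx_g))

def dynamic_image(clip):
    # approximate rank pooling (Bilen et al., 2016): weighted frame sum with
    # alpha_t = 2t - T - 1, t = 1..T; a static clip collapses to zero
    T = clip.shape[0]
    alphas = 2 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alphas, clip, axes=1)

def dynamic_image_loss(pred_clip, gt_clip):
    # penalize motion-summary mismatch between predicted and real clips
    return np.mean((dynamic_image(pred_clip) - dynamic_image(gt_clip)) ** 2)
```

Because the alpha coefficients sum to zero, static background cancels out of the dynamic image, which is why this constraint steers the network toward the moving foreground region.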
Result
We compare the proposed approach with several anomaly detection methods on four outdoor public datasets and one indoor dataset. The frame-level area under curve (AUC) on Avenue, Ped1, and Ped2 improves by 1.0%, 0.4%, and 1.1%, respectively, and abnormal events in the low-resolution Ped1/Ped2 videos are detected effectively. On the LV dataset, our method achieves an AUC of 65.1%. Since the self-attention mechanism lets the Transformer-based network capture richer feature information, the proposed network mines various normal patterns in multiple scenes and improves detection performance effectively. On the collected indoor dataset, the AUCs for the four actions, denoted A1-1, A1-2, A1-3, and A1-4, reach 60.3%, 63.4%, 67.7%, and 64.4%, respectively. To verify the effectiveness of the Transformer module and the dynamic image constraint, we conduct ablation experiments by removing each of them in the training phase. The results show that the Transformer module improves anomaly detection performance, and the dynamic image constraint improves the four indoor actions by 0.6%, 2.4%, 1.1%, and 0.9%, respectively, indicating that the dynamic image loss guides the network to attend to the foreground motion area.
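Frame-level AUC for prediction-based detectors is commonly computed by scoring each frame with the PSNR between the predicted and real frame, min-max normalizing the scores per video, and ranking frames against the ground-truth anomaly labels. A minimal sketch of that standard recipe (ignoring rank ties; not the paper's exact evaluation code):

```python
import numpy as np

def psnr(pred, gt, peak=1.0):
    # peak signal-to-noise ratio; lower PSNR means larger prediction error
    mse = np.mean((pred - gt) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def normalize_scores(psnrs):
    # min-max normalize PSNR within one video to [0, 1]
    p = np.asarray(psnrs, dtype=float)
    return (p - p.min()) / (p.max() - p.min() + 1e-8)

def frame_level_auc(scores, labels):
    # rank-based AUC over anomaly scores (1 - normalized PSNR), so that
    # anomalous frames (label 1) should rank highest; ties are not handled
    anomaly = 1.0 - np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = anomaly.argsort()
    ranks = np.empty(len(anomaly), dtype=float)
    ranks[order] = np.arange(1, len(anomaly) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A detector whose normalized PSNR is always lower on anomalous frames than on normal ones scores an AUC of 1.0 under this metric; chance-level scoring gives 0.5.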
Conclusion
We develop a video anomaly detection method based on Transformer and U-Net, and collect an indoor motion dataset for the abnormal analysis of indoor close-up rehabilitation movement. Experimental results show that our method can detect abnormal behaviors in both indoor and outdoor videos effectively.
anomaly detection; convolutional neural network (CNN); Transformer encoder; self-attention mechanism; self-supervised learning
Bilen H, Fernando B, Gavves E, Vedaldi A and Gould S. 2016. Dynamic image networks for action recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3034-3042 [DOI: 10.1109/CVPR.2016.331]
Cong Y, Yuan J S and Liu J. 2011. Sparse reconstruction cost for abnormal event detection//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, USA: IEEE: 3449-3456 [DOI: 10.1109/CVPR.2011.5995434]
Dalal N and Triggs B. 2005. Histograms of oriented gradients for human detection//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE: 886-893 [DOI: 10.1109/CVPR.2005.177]
Dalal N, Triggs B and Schmid C. 2006. Human detection using oriented histograms of flow and appearance//Proceedings of the 9th European Conference on Computer Vision. Graz, Austria: Springer: 428-441 [DOI: 10.1007/11744047_33]
Deepak K, Srivathsan G, Roshan S and Chandrakala S. 2021. Deep multi-view representation learning for video anomaly detection using spatiotemporal autoencoders. Circuits, Systems, and Signal Processing, 40(3): 1333-1349 [DOI: 10.1007/s00034-020-01522-7]
Denton E, Chintala S, Szlam A and Fergus R. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press: 1486-1494
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: transformers for image recognition at scale//Proceedings of the 9th International Conference on Learning Representations. Vienna, Austria: OpenReview.net
Georgescu M I, Bărbălău A, Ionescu R T, Khan F S, Popescu M and Shah M. 2021. Anomaly detection in video via self-supervised and multi-task learning//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 12737-12747 [DOI: 10.1109/CVPR46437.2021.01255]
Hasan M, Choi J, Neumann J, Roy-Chowdhury A K and Davis L S. 2016. Learning temporal regularity in video sequences//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 733-742 [DOI: 10.1109/CVPR.2016.86]
Ionescu R T, Khan F S, Georgescu M I and Shao L. 2019. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7834-7843 [DOI: 10.1109/CVPR.2019.00803]
Ionescu R T, Smeureanu S, Alexe B and Popescu M. 2017. Unmasking the abnormal events in video//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2914-2922 [DOI: 10.1109/ICCV.2017.315]
Kim J and Grauman K. 2009. Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 2921-2928 [DOI: 10.1109/CVPR.2009.5206569]
Kiran B R, Thomas D M and Parakkal R. 2018. An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. Journal of Imaging, 4(2): #36 [DOI: 10.3390/jimaging4020036]
Leyva R, Sanchez V and Li C T. 2017. The LV dataset: a realistic surveillance video dataset for abnormal event detection//Proceedings of the 5th International Workshop on Biometrics and Forensics. Coventry, UK: IEEE: #7935096 [DOI: 10.1109/IWBF.2017.7935096]
Liu W, Luo W X, Lian D Z and Gao S H. 2018. Future frame prediction for anomaly detection – a new baseline//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6536-6545 [DOI: 10.1109/CVPR.2018.00684]
Lu C W, Shi J P and Jia J Y. 2013. Abnormal event detection at 150 FPS in MATLAB//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 2720-2727 [DOI: 10.1109/ICCV.2013.338]
Luo W X, Liu W and Gao S H. 2017a. Remembering history with convolutional LSTM for anomaly detection//Proceedings of 2017 IEEE International Conference on Multimedia and Expo. Hong Kong, China: IEEE: 439-444 [DOI: 10.1109/ICME.2017.8019325]
Luo W X, Liu W and Gao S H. 2017b. A revisit of sparse coding based anomaly detection in stacked RNN framework//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 341-349 [DOI: 10.1109/ICCV.2017.45]
Luo W X, Liu W, Lian D Z, Tang J H, Duan L X, Peng X and Gao S H. 2021. Video anomaly detection with sparse coding inspired deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3): 1070-1084 [DOI: 10.1109/TPAMI.2019.2944377]
Mahadevan V, Li W X, Bhalodia V and Vasconcelos N. 2010. Anomaly detection in crowded scenes//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 1975-1981 [DOI: 10.1109/CVPR.2010.5539872]
Mathieu M, Couprie C and LeCun Y. 2016. Deep multi-scale video prediction beyond mean square error//Proceedings of the 4th International Conference on Learning Representations. San Juan, Puerto Rico: [s.n.]
Morais R, Le V, Tran T, Saha B, Mansour M and Venkatesh S. 2019. Learning regularity in skeleton trajectories for anomaly detection in videos//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 11988-11996 [DOI: 10.1109/CVPR.2019.01227]
Negin F, Rodriguez P, Koperski M, Kerboua A, Gonzàlez J, Bourgeois J, Chapoulie E, Robert P and Bremond F. 2018. PRAXIS: towards automatic cognitive assessment using gesture recognition. Expert Systems with Applications, 106: 21-35 [DOI: 10.1016/j.eswa.2018.03.063]
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Tang Y, Zhao L, Zhang S S, Gong C, Li G Y and Yang J. 2020. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters, 129: 123-130 [DOI: 10.1016/j.patrec.2019.11.024]
Wang X Z, Che Z P, Jiang B, Xiao N, Yang K, Tang J, Ye J P, Wang J Y and Qi Q. 2022. Robust unsupervised video anomaly detection by multipath frame prediction. IEEE Transactions on Neural Networks and Learning Systems, 33(6): 2301-2312 [DOI: 10.1109/TNNLS.2021.3083152]
Wu S D, Moore B E and Shah M. 2010. Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 2054-2060 [DOI: 10.1109/CVPR.2010.5539882]
Zhang D, Gatica-Perez D, Bengio S and McCowan I. 2005. Semi-supervised adapted HMMs for unusual event detection//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE: 611-618 [DOI: 10.1109/CVPR.2005.316]
Zhou J T, Du J W, Zhu H Y, Peng X, Liu Y and Goh R S M. 2019. AnomalyNet: an anomaly detection network for video surveillance. IEEE Transactions on Information Forensics and Security, 14(10): 2537-2550 [DOI: 10.1109/TIFS.2019.2900907]