Abnormal event detection by fusing an autoencoder and a one-class SVM

Hu Haiyang, Zhang Li, Li Zhongjin (School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China)

Abstract
Objective In automated, intelligent modern manufacturing, video abnormal event detection plays an increasingly important role, but the complexity of abnormal events in real production and the interference of irrelevant background make it a highly challenging task. Many traditional methods extract hand-designed low-level features from local regions of a video, yet such features can hardly represent motion and appearance at the same time. In addition, some deep-learning-based video anomaly detection methods judge whether a test sample is normal or abnormal directly from the reconstruction error of an autoencoder. In practice, however, some test samples that are actually abnormal still yield reconstruction errors below the set threshold after autoencoding, so they are wrongly judged as normal events and abnormal events are missed. To address this shortcoming, this paper proposes an abnormal event detection model that fuses an autoencoder with a one-class support vector machine (SVM). Method First, fixed-size spatio-temporal blocks of interest (regions of interest, ROIs) are extracted with a Gaussian mixture model (GMM). Second, high-level features are extracted from the ROIs with a pre-trained 3D convolutional neural network (C3D). Third, the extracted high-dimensional features are used to train a stacked denoising autoencoder, and each test sample is judged as normal, abnormal, or suspicious by comparing its reconstruction error with the set thresholds. Finally, a one-class SVM is trained on the features whose dimensions have been reduced by the autoencoder and is used to re-examine the suspicious test samples and further rule out abnormal events. Result Experiments are conducted on robot working scenes in an actual manufacturing environment, using two common indicators, the area under the ROC curve (AUC) and the equal error rate (EER). With an appropriate error threshold, the AUC under the receiver operating characteristic (ROC) curve reaches 91.7% and the EER is 13.8%. The model is also evaluated on the public UCSD (University of California, San Diego) Ped1 and Ped2 datasets and compared with several widely used methods. On UCSD Ped1, the AUC improves by 2.6% at the frame level and 22.3% at the pixel level over the second-best method; on UCSD Ped2, the AUC improves by 6.7% at the frame level, verifying the effectiveness and accuracy of the proposed detection method. Conclusion The proposed video abnormal event detection model combines a traditional model with deep learning models and makes video abnormal event detection more accurate.
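For illustration, the following is a minimal sketch of the two-stage pipeline described above, not the authors' exact implementation. It assumes OpenCV 4 for the GMM (MOG2) foreground model, represents the pre-trained C3D network only by its output feature vectors, and uses illustrative values for the feature dimension, layer sizes, input noise level, and the two error thresholds.

```python
# Minimal sketch of the detection pipeline (assumptions noted above).
import cv2
import torch
import torch.nn as nn
from sklearn.svm import OneClassSVM

FEAT_DIM, CODE_DIM = 4096, 128            # assumed C3D feature / bottleneck sizes


def motion_rois(frames, min_area=500):
    """Locate moving regions with a Gaussian mixture background model (MOG2)."""
    mog2 = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)
    rois = []
    for frame in frames:
        mask = mog2.apply(frame)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        rois.append([cv2.boundingRect(c) for c in contours
                     if cv2.contourArea(c) >= min_area])
    return rois


class DenoisingAE(nn.Module):
    """Stacked denoising autoencoder over C3D features."""
    def __init__(self, in_dim=FEAT_DIM, code_dim=CODE_DIM):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, code_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))


def train_ae(ae, normal_feats, epochs=50, noise=0.1, lr=1e-3):
    """Train on C3D features of normal clips only, with Gaussian input noise."""
    opt, loss_fn = torch.optim.Adam(ae.parameters(), lr=lr), nn.MSELoss()
    x = torch.as_tensor(normal_feats, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(ae(x + noise * torch.randn_like(x)), x)
        loss.backward()
        opt.step()
    return ae


def detect(ae, ocsvm, feat, t_low=0.10, t_high=0.20):
    """Two-stage decision: reconstruction error first, one-class SVM second."""
    x = torch.as_tensor(feat, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        err = float(((ae(x) - x) ** 2).mean())
        code = ae.encoder(x).numpy()
    if err < t_low:                        # clearly normal
        return "normal"
    if err > t_high:                       # clearly abnormal
        return "abnormal"
    # suspicious range: re-examine the encoded feature with the one-class SVM
    return "normal" if ocsvm.predict(code)[0] == 1 else "abnormal"


# Training flow (normal samples only):
#   ae    = train_ae(DenoisingAE(), normal_c3d_feats)
#   codes = ae.encoder(torch.as_tensor(normal_c3d_feats,
#                                      dtype=torch.float32)).detach().numpy()
#   ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(codes)
```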
Keywords
Anomaly detection with autoencoder and one-class SVM

Hu Haiyang, Zhang Li, Li Zhongjin(School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China)

Abstract
Objective With the recent improvement in living standards and the rapid development of digital information technology, all sectors of society have paid increasing attention to the application of science and technology in the field of public safety. To maintain a safe public environment, video surveillance equipment has been increasingly installed in streets, schools, communities, subways, and other public places. However, traditional video surveillance systems have gradually become unable to process the ever-increasing volume of video data. Therefore, the development of intelligent surveillance systems with automatic detection, identification, and alarm functions has broad and far-reaching significance for maintaining public safety and developing artificial intelligence. Anomaly detection is an important part of intelligent monitoring systems and plays a key role in maintaining public safety. As such, it has become a hot research topic for both academic and industrial practitioners. In the past, video anomalies were detected manually, which required much human labor; an efficient, automated anomaly detection system can therefore significantly reduce the labor cost of this task. Video anomaly detection technologies also play an important role in automated and intelligent modern production and manufacturing, but anomaly detection remains a challenging task in complex factory environments given the complexity of abnormal events and the interference of unrelated backgrounds in such scenarios. Many methods use hand-designed low-level features to extract features from the local areas of a video. However, these features can hardly represent motion and appearance at the same time. To address this problem, we propose a novel detection method based on deep spatio-temporal features. Method First, given that abnormalities are mainly observed in the motion areas of videos, this article extracts the motion areas of surveillance video via a Gaussian mixture model (GMM). Specifically, this model is used to extract fixed-size spatio-temporal regions of interest (ROIs) from a video. Second, to facilitate the subsequent detection of abnormal events, high-level features are extracted from each ROI via a pre-trained 3D convolutional neural network (C3D). Third, to enhance anomaly detection efficiency, the extracted features are used to train a stacked denoising autoencoder, and each test sample is judged as normal, abnormal, or suspicious by comparing its reconstruction error with set thresholds. Finally, given that the reconstruction errors of some abnormal test samples tend to be very small, a model that relies only on reconstruction errors for anomaly detection can miss many abnormal events. To further rule out anomalies, a one-class support vector machine (SVM) is therefore trained on the low-dimensional encoded features and used to re-examine the suspicious samples. Result Several experiments are performed in an actual manufacturing environment operated by robots. Two common indicators are used for evaluation, namely, the area under the ROC curve (AUC) and the equal error rate (EER). The receiver operating characteristic (ROC) curve is drawn from the results obtained under various classification thresholds and can be used to evaluate classifier performance. The AUC denotes the area covered under the ROC curve, whereas the EER corresponds to the point where the ROC curve intersects the anti-diagonal line from (0, 1) to (1, 0), i.e., the point at which the false positive rate equals the miss rate. A smaller EER indicates a better detection effect.
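As a reference for how these two indicators can be obtained, the following is a minimal sketch of the frame-level AUC and EER computation using scikit-learn; the score and label arrays are placeholders rather than data from the paper.

```python
# Minimal sketch of frame-level AUC / EER computation (illustrative only).
# `scores` are per-frame anomaly scores (e.g., reconstruction errors) and
# `labels` are ground-truth frame labels (1 = abnormal, 0 = normal).
import numpy as np
from sklearn.metrics import roc_curve, auc


def auc_and_eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    fnr = 1.0 - tpr                      # miss rate
    # EER: point where the false positive rate equals the miss rate,
    # i.e., where the ROC curve crosses the descending diagonal.
    idx = int(np.nanargmin(np.abs(fpr - fnr)))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return roc_auc, eer


# Example: frame_auc, eer = auc_and_eer(labels, scores)
```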
When an appropriate error threshold is set (approximately 0.15), the AUC under the ROC curve reaches 91.7%, whereas the EER is 13.8%. The performance of the proposed model is also evaluated and compared with that of other models on the public University of California, San Diego (UCSD) Ped1 and Ped2 datasets. On the UCSD Ped1 dataset, the proposed model demonstrates 2.6% and 22.3% improvements in AUC over the second-best method at the frame and pixel levels, respectively. On the UCSD Ped2 dataset, compared with the second-best method, the proposed model achieves a 6.7% higher AUC at the frame level, thereby verifying its effectiveness and accuracy. Conclusion The proposed video abnormal event detection model combines traditional and deep learning models to increase the accuracy of video abnormal event detection. A 3D convolutional neural network (C3D) is used to extract deep spatio-temporal features, and a video anomaly detection method based on these features is developed by combining a stacked denoising autoencoder with a one-class SVM model. When deep spatio-temporal features are extracted through the pre-trained C3D network, the features from the last convolutional layer are treated as the features of each spatio-temporal interest block; these features capture both appearance and motion patterns. A denoising autoencoder is also trained to reduce the dimensions of the C3D features, and its reconstruction error is used to detect abnormal events. Experimental results show that the proposed model can still detect anomalies when abnormal events are partially occluded, so it can be applied to anomalous event detection in dense scenes. Future studies may consider examining other network architectures, integrating multiple input data (e.g., RGB or optical flow frames), and introducing trajectory tracking methods to track occluded objects and improve detection accuracy. The proposed framework is suitable for highly complex scenarios.
Keywords
