朱新瑞, 钱小燕, 施俞洲, 陶旭东, 李智昱(南京航空航天大学)
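Objective Multiple instance learning (MIL) is a powerful tool for weakly supervised video anomaly event detection. Abnormal events tend to be sparse, sudden, and locally continuous, yet current MIL methods do not fully consider the relationships among instances and ignore the temporal correlations between video segments, so they cannot adequately separate normal segments from abnormal ones. To address this problem, this paper proposes a two-stage anomaly detection network with long-short-term time series correlations. Methods The first stage is a long-and-short-term correlated MIL abnormal detection framework (LSC-transMIL), which applies the Transformer structure to MIL and adds local and global temporal attention mechanisms, strengthening the time series correlations of consecutive video segments while learning the spatially correlated semantic information among different segments. The second stage builds an anomaly detection network based on a spatio-temporal attention mechanism; it takes the abnormal scores generated in the first stage as fine-grained pseudo-labels, trains the anomaly event detection network with a pseudo-label training strategy, and fine-tunes the backbone network to improve the adaptability of the detection network. Results Experiments on two large public datasets compare the method with similar approaches. The two-stage model reaches an area under curve (AUC) of 82.88% on UCF-crime and 96.34% on ShanghaiTech, improvements of 1.58% and 0.58%, respectively, over other two-stage methods. Ablation experiments confirm the effectiveness of the time-series-oriented Transformer module and of the long-short-term attention. Conclusion This paper applies the Transformer to time-series MIL and adds long-short-term attention to highlight the difference between local abnormal events and normal events, effectively detecting abnormal events in video.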
Anomaly Event Detection with Long-Short-Term Time Series Correlations
Zhu Xinrui, Qian Xiaoyan, Shi Yuzhou, Tao Xudong, Li Zhiyu (Nanjing University of Aeronautics and Astronautics)
Objective Video anomaly detection has been applied in many fields, such as manufacturing, traffic management, and security monitoring. However, detailed annotation of video data is labor-intensive and cumbersome, so many researchers have turned to weakly supervised learning. Compared with supervised learning, weakly supervised learning requires only video-level labels in the training stage, which greatly reduces the annotation workload; frame-level labels are needed only for the test set. Multiple instance learning (MIL) is recognized as a powerful tool for weakly supervised video abnormal event detection. Abnormal behavior in video is highly correlated with the video context. The traditional MIL method extracts video features with a C3D (convolutional 3D) network and trains with a ranking loss function, adding sparsity and temporal smoothness constraints to that loss to bring temporal information into the ranking model. Introducing temporal information only through the loss function is not enough. Using a TCN (temporal convolutional network) to extract video context information further improves the anomaly detection network. However, introducing temporal information globally in this way cannot sufficiently separate abnormal video clips from normal ones. The attention MIL method therefore builds a temporally enhanced network to learn motion features and uses an attention mechanism to incorporate temporal information into the ranking model; the learned attention weights help to better distinguish abnormal from normal video clips. The spatio-temporal fusion graph network, in turn, constructs a spatial similarity graph and a temporal continuity graph for the video segments and fuses them into a spatio-temporal fusion graph.
This strengthens the spatio-temporal correlations among video segments and ultimately improves the accuracy of abnormal behavior detection. MIST (multiple instance self-training framework) adopts a pseudo-label training strategy, which is an effective way to improve model quality in weakly supervised learning. It constructs a two-stage training network and uses the pseudo-labels produced by the first-stage MIL model to guide the training of the second-stage self-guided attention feature extractor, providing a general recipe for improving model quality. However, these approaches do not fully exploit temporal correlations, because the feature representation of each instance is not fused with neighboring and global features. Abnormal events are often sparse, sudden, and locally continuous, and insufficient modeling of the temporal correlations between video segments leads to inadequate separation of normal and abnormal segments. To address this issue, this paper proposes a two-stage abnormal detection network with long-short-term time series correlations. Methods The first stage is a long-short-term time series correlated abnormal detection network (LSC-transMIL) that applies the Transformer structure to MIL. It consists of two layers, each containing a local temporal sequence correlation attention module and a global instance correlation attention module. The former learns temporal information between an instance and its neighboring instances, while the latter focuses on the association between an instance and global information. Combining local and global attention establishes meaningful correlations among instances and highlights the distinction between local and global features within the video, which makes abnormal video segments easier to distinguish from normal ones.
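The combination of local and global attention described above can be sketched in a few lines. This is a minimal illustration, not the LSC-transMIL implementation: it omits the learned query, key, and value projections of a real Transformer layer, and the names `attend`, `long_short_term_layer`, the `window` parameter, and the residual sum used to merge the two branches are all assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, window=None):
    """Scaled dot-product self-attention over T instance features.

    window=None  -> global attention: every instance attends to all instances.
    window=k     -> local attention: instance i attends only to instances j
                    with |i - j| <= k (its temporal neighbours).
    No learned projections are used (an assumption made for brevity).
    """
    dim = len(features[0])
    out = []
    for i, query in enumerate(features):
        idx = [j for j in range(len(features))
               if window is None or abs(i - j) <= window]
        scores = [sum(q * k for q, k in zip(query, features[j])) / math.sqrt(dim)
                  for j in idx]
        weights = softmax(scores)
        out.append([sum(w * features[j][d] for w, j in zip(weights, idx))
                    for d in range(dim)])
    return out


def long_short_term_layer(features, window=1):
    """One hypothetical layer: input + local branch + global branch."""
    local = attend(features, window=window)
    global_ = attend(features, window=None)
    return [[x + l + g for x, l, g in zip(f, lo, gl)]
            for f, lo, gl in zip(features, local, global_)]


# Four instances (video segments) with 2-dimensional features.
segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
new_features = long_short_term_layer(segments, window=1)
```

Stacking two such layers would mirror the two-layer structure mentioned in the text; in practice each branch would carry its own learned weights.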
This module generates new instance features, which are then fed into the ranking model to produce video abnormal scores and pseudo-labels. In the second stage, an abnormal detection network based on a spatio-temporal attention mechanism is constructed. The SlowFast backbone network extracts video features, and the slow and fast pathway features are weighted and fused using spatio-temporal attention. The slow branch attends to the spatio-temporal information of the video frames through a spatio-temporal attention module, while the fast branch directs attention to temporal information through a temporal attention module; the features of the two branches are then concatenated to obtain the final video features. The abnormal scores generated in the first stage serve as fine-grained pseudo-labels for training the abnormal event detection network with a pseudo-label strategy. Furthermore, the backbone network is fine-tuned to enhance the adaptive capability of the abnormal event detection network. Results Extensive experiments were conducted on two large-scale public datasets, UCF-crime and ShanghaiTech, to compare the proposed two-stage abnormal detection model with similar methods. The two-stage model achieved area under the curve (AUC) scores of 82.88% and 96.34% on the UCF-crime and ShanghaiTech datasets, respectively, an improvement of 1.58% and 0.58% over other two-stage methods. In addition, ablation experiments were conducted on both datasets, comparing the proposed LSC-transMIL with the traditional MIL method and the attention MIL method under three backbone networks and confirming the effectiveness of LSC-transMIL.
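For readers unfamiliar with the metric, frame-level AUC can be computed as the probability that a randomly chosen abnormal frame receives a higher score than a randomly chosen normal frame. The sketch below is an illustration only and is not taken from the paper's evaluation code; the helper names `expand_to_frames` and `roc_auc` and the `frames_per_snippet` value are assumptions.

```python
def expand_to_frames(snippet_scores, frames_per_snippet):
    """Repeat each snippet-level score for every frame in that snippet."""
    return [s for s in snippet_scores for _ in range(frames_per_snippet)]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (abnormal, normal) frame pairs, how often the abnormal
    frame is scored higher; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one abnormal and one normal frame")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy video: three snippets of two frames each; only the middle one is abnormal.
snippet_scores = [0.1, 0.9, 0.2]
frame_scores = expand_to_frames(snippet_scores, frames_per_snippet=2)
frame_labels = [0, 0, 1, 1, 0, 0]
print(roc_auc(frame_labels, frame_scores))  # 1.0
```

The pairwise form is quadratic in the number of frames, which is fine for illustration; a sorting-based routine would be preferable on full test sets.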
Both qualitative and quantitative analyses are given for the ablation experiments that compare global attention alone with combined global and local attention; they confirm the effectiveness of combining local and global attention, and the roles of local and global temporal correlation are visualized using heat maps. Conclusion This paper applies the Transformer to time series-based MIL and introduces long-short-term attention to highlight the differences between local abnormal events and normal events. The proposed two-stage abnormal detection network uses the abnormal scores generated in the first stage as pseudo-labels, trains a network built on the SlowFast backbone and spatio-temporal attention modules, and fine-tunes the backbone network to enhance the adaptive capability of the abnormal detection network. The results show that the proposed approach effectively improves the accuracy of abnormal event detection.
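The use of first-stage abnormal scores as fine-grained pseudo-labels can be illustrated as follows. This is a hedged sketch under stated assumptions, not the paper's procedure: the moving-average smoothing, the min-max normalization, the all-zero labels for normal videos, and the soft binary cross-entropy loss are illustrative choices, and the names `moving_average`, `make_pseudo_labels`, and `soft_bce` are hypothetical.

```python
import math


def moving_average(scores, radius=1):
    """Smooth scores over a temporal window of +/- radius segments."""
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def make_pseudo_labels(scores, video_is_abnormal, radius=1):
    """Turn first-stage segment scores into soft labels in [0, 1].

    Normal videos contain no abnormal segments, so every label is 0.
    For abnormal videos the scores are smoothed and min-max normalized
    (both steps are assumptions made for this example).
    """
    if not video_is_abnormal:
        return [0.0] * len(scores)
    smoothed = moving_average(scores, radius)
    lo, hi = min(smoothed), max(smoothed)
    if hi == lo:  # constant scores: nothing to rescale, keep them as they are
        return list(smoothed)
    return [(s - lo) / (hi - lo) for s in smoothed]


def soft_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predictions and soft pseudo-labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


first_stage_scores = [0.05, 0.10, 0.80, 0.90, 0.15]
labels = make_pseudo_labels(first_stage_scores, video_is_abnormal=True)
loss = soft_bce([0.1, 0.3, 0.7, 0.8, 0.2], labels)
```

In a full pipeline the predictions passed to `soft_bce` would come from the second-stage network, and the loss gradient would also update the backbone during fine-tuning.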