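Objective Video action quality assessment (AQA) aims to evaluate how well a specific action in a video is executed and completed. Automated action quality assessment can effectively reduce the human effort required and evaluate video content more accurately and impartially. However, traditional video action quality assessment methods mainly suffer from the following problems: 1) the multi-scale spatio-temporal features of the acting subject in the video; 2) the inherent ambiguity of labels caused by cognitive differences; 3) the redundancy of attention heads in multi-head self-attention. To address these problems, this paper proposes SALDL, an action quality assessment model that can attend to different spatio-temporal locations in a video sequence and generate fine-grained labels. Method SALDL uses an I3D network structure with multi-receptive-field convolution kernels to extract spatial features within video clips, and proposes an Attention-Inc module that progressively incorporates self-attention into the Inception module through Embedding, MHSA, and MLP, so that the model can obtain contextual information between convolutional features at different scales. A temporal attention module with positive and negative attention heads, PNTA (Pos-Neg Temporal Attention), is proposed; it fully exploits temporal attention features through the PNTA loss, thereby reducing self-attention head redundancy and extracting attention features from different time segments. The SALDL model generates fine-grained action quality labels through label enhancement and label distribution learning. Result Extensive comparison and ablation experiments were conducted on datasets such as MTL-AQA and JIGSAWS. The Spearman rank correlation coefficient (Sp.Corr) is 0.9416 on the MTL-AQA dataset and 0.8364, 0.8660, and 0.7531 on the JIGSAWS sub-datasets, all of which are state-of-the-art results. Conclusion The proposed SALDL model addresses the multi-scale spatio-temporal feature problem by fully mining spatio-temporal features at different scales, and introduces prior knowledge that labels follow a distribution to perform label enhancement, thereby addressing the inherent ambiguity of labels and reducing the redundancy of attention heads.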
Video Action Quality Assessment Based on Label Distribution
ZHANG Yu, XU Tianyu, MI Siya
Objective Video action quality assessment (AQA) aims to assess how well a specific action in a video is executed and completed. Automated action quality assessment can effectively reduce the human effort required and can evaluate video content more accurately and fairly. However, traditional video action quality assessment methods mainly suffer from the following problems. 1) Multi-scale spatial and temporal features: the spatial and temporal location of the action in the video is critical for action quality assessment, yet sample videos contain much information unrelated to the action. In the spatial dimension, the subject may appear at different scales in different videos, which makes it difficult to capture action information. In the temporal dimension, actions may differ in duration and execution rate, and different time segments correlate differently with the labels. 2) Inherent ambiguity of labels caused by cognitive differences: previous action quality assessment methods tend to focus on a single score label and ignore the fact that different judges may give different scores and that scoring is subjective. For example, scores in diving are given by seven judges rather than determined by a single label.
3) Redundancy of self-attention heads: past work tends to use a large number of self-attention heads, yet many of them are redundant at test time, and model performance is not greatly affected even after most heads are removed. In the experiments in this paper, action quality assessment performance actually worsens as the number of heads increases. To address the above problems, this paper proposes SALDL, an action quality assessment model capable of focusing on different spatio-temporal locations in video sequences and generating fine-grained labels. Method SALDL focuses on action information at different spatio-temporal locations in video sequences and handles label ambiguity by generating fine-grained labels through label distribution learning. It is composed of three main parts: a video representation module, a PNTA module, and an LDL module. In the video representation module, SALDL uses an I3D network structure with multi-receptive-field convolution kernels to extract spatial features within video clips, and proposes an Attention-Inc module that uses Embedding, MHSA, and MLP to progressively incorporate self-attention into the Inception module, enabling the model to obtain contextual information between convolutional features at different scales. In the PNTA module, a temporal attention module with positive and negative attention heads (PNTA, Pos-Neg Temporal Attention) fully exploits temporal attention features through the PNTA loss, thus reducing the redundancy of self-attention heads and extracting attention features from different time segments. In the LDL module, SALDL uses label distribution learning to generate fine-grained action quality labels, thereby addressing the inherent ambiguity of the labels.
We introduce the prior knowledge that the score label follows a certain distribution, and use label enhancement to convert each single label into a label distribution. Finally, a Kullback-Leibler divergence loss drives the predicted label distribution toward the ground-truth label distribution. Result Extensive comparison experiments were conducted on MTL-AQA and JIGSAWS. The Spearman rank correlation coefficient (Sp.Corr) is 0.9416 on the MTL-AQA dataset and 0.8364, 0.8660, and 0.7531 on the JIGSAWS sub-datasets, all of which are state-of-the-art results. In addition, extensive ablation experiments were conducted on the PNTA, LDL, and Attention-Inc structures of the SALDL model. A regression-based variant of SALDL, in which the output dimension of the fully connected layer is changed to 1 and the Softmax function is removed so that the model directly outputs the predicted score, achieves an Sp.Corr of 0.9320. SALDL-w/o PNTA denotes the SALDL model without the PNTA module, and its Sp.Corr is 0.9384. SALDL-w/o Attention-Inc denotes the SALDL model without the Attention-Inc structure, and its Sp.Corr is 0.9399. These results show that each module contributes to the performance of SALDL. We also conducted ablation experiments on the choice of segmentation strategy and distribution function. The results show that both should be chosen according to the type of dataset. Future work can therefore investigate how to determine which distribution function to select, how to fuse different distribution functions, and other ways to achieve adaptive label enhancement.
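The label enhancement and Kullback-Leibler loss described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian shape, the score grid `bins`, the width `sigma`, and all function names are assumptions introduced here, since the abstract only states that the score label follows "a certain distribution".

```python
import math

def enhance_label(score, bins, sigma=1.0):
    """Convert a single score into a discrete distribution over score bins.

    Assumption: a Gaussian centred on the annotated score, normalised over
    the bins. The source does not fix the distribution family.
    """
    weights = [math.exp(-0.5 * ((b - score) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Numerically stable softmax, standing in for the model's output layer."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) between two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

# Hypothetical score grid from 0.0 to 10.0 in steps of 0.5.
bins = [i * 0.5 for i in range(21)]
# Ground-truth distribution built from a single annotated score of 7.5.
target = enhance_label(7.5, bins, sigma=0.5)
# Uniform prediction, as an untrained output head might produce.
predicted = softmax([0.0] * len(bins))
loss = kl_divergence(target, predicted)
```

In this sketch the loss is zero only when the predicted distribution matches the enhanced label exactly, so minimising it pulls the prediction toward the whole distribution rather than toward a single score value.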
Conclusion The proposed SALDL model addresses the problem of multi-scale spatio-temporal features by fully mining spatio-temporal features at different scales, and addresses the inherent ambiguity of labels and the redundancy of self-attention heads by introducing the prior knowledge that labels follow a certain distribution, which enables label enhancement and, in turn, label distribution learning. The proposed SALDL model achieves state-of-the-art performance on several action quality assessment datasets, which validates the effectiveness of the algorithm.
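For readers unfamiliar with the evaluation metric, the Spearman rank correlation (Sp.Corr) reported in the results is the Pearson correlation of the rank-transformed scores. The following is a minimal sketch using only the Python standard library; the function names and the sample scores are made up for illustration and are not taken from the paper.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Made-up predicted and judge scores for five videos.
predicted_scores = [62.1, 85.3, 70.0, 91.2, 55.4]
judge_scores = [60.0, 88.5, 75.0, 86.0, 50.0]
rho = spearman(predicted_scores, judge_scores)
```

Because the metric depends only on the ordering of the scores, a model can reach a high Sp.Corr even if its absolute score values are offset from the judges' scores.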