Latest Issue

    Vol. 25, Issue 11, 2020

      Review

    • Progress and challenges in facial action unit detection

      Yong Li, Jiabei Zeng, Xin Liu, Shiguang Shan
      Vol. 25, Issue 11, Pages: 2293-2305(2020) DOI: 10.11834/jig.200343
      Abstract: The anatomically based facial action coding system defines a unique set of atomic nonoverlapping facial muscle actions called action units (AUs),which can accurately characterize facial expression. AUs correspond to muscular activities that produce momentary changes in facial appearance. Combinations of AUs can represent any facial expression. As a multilabel classification problem,AU detection suffers from insufficient AU annotations,various head poses,individual differences,and imbalance among different AUs. This article systematically summarizes representative methods that have been proposed since 2016 to facilitate the development of AU detection methods. According to different input data,AU detection methods are categorized on the basis of images,videos,and other modalities. We also discuss how AU detection methods can deal with partial supervision given the large scale of unlabeled data. Image-based methods,including approaches that learn local facial representations,exploit AU relations and utilize multitask and weakly supervised learning methods. Handcrafted or automatically learned local facial representations can represent local deformations caused by active AUs. However,the former is incapable of representing different AUs with adaptive local regions while the latter suffers from insufficient training data. Approaches that exploit AU relations can utilize prior knowledge that some AUs appear together or exclusively at the same time. Such methods adopt either Bayesian networks or graph neural networks to model manually inferred AU relations from annotations of specified datasets. However,these inflexible methods fail to perform cross-dataset evaluation. Multitask AU detection methods are inspired by the phenomena that facial shapes represented by facial landmarks are helpful in AU detection and facial deformations caused by active AUs affect the location distribution of landmarks. In addition to detecting facial AUs,such methods typically estimate facial landmarks or recognize facial expressions in a multitask manner. Other tasks of facial emotion analysis,such as emotional dimension estimation,can be incorporated in the multitask learning setting. Video-based methods are categorized into strategies that rely on temporal representation and self-supervised learning. Temporal representation learning methods commonly adopt long short-term memory (LSTM) or 3D convolutional neural networks (3D-CNNs) to model the temporal information. Other temporal representation approaches utilize optical flow between frames to detect facial AUs. Several self-supervised approaches have recently exploited the prior knowledge that facial actions,i.e.,movements of facial muscles between frames,can be used as the self-supervisory signal. Such video-based weakly supervised AU detection methods are reasonable and explainable and can effectively alleviate the problem of insufficient AU annotations. However,these methods rely on massive amounts of unlabeled video data in the training phase and fail to perform AU detection in an end-to-end manner. We also review methods that exploit point clouds or thermal images for AU detection and are capable of alleviating the influence of head pose or illumination. Finally,we compare representative methods and analyze their advantages and drawbacks. The analysis summarizes and discusses challenges and potential directions of AU detection.
We conclude that methods capable of utilizing weakly annotated or unlabeled data are important research directions for future investigations. Such methods should be carefully designed according to the prior knowledge of AUs to alleviate the demand for large amounts of labeled data.  
      Keywords: facial action unit (AU); image-based AU detection; video-based AU detection; weakly-supervised learning; insufficient annotations
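      The review frames AU detection as a multilabel classification problem with severe label imbalance. The sketch below shows one common way to set that up: a generic CNN backbone, a per-AU sigmoid head, and a class-weighted binary cross-entropy loss. The backbone choice, AU count, and positive-class weights are illustrative assumptions, not a method from the surveyed papers.

```python
# Minimal sketch: facial AU detection as multi-label classification.
# Backbone, AU count, and positive-class weights are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_AUS = 12  # e.g., the 12 AUs annotated in BP4D-style datasets (assumption)

class AUDetector(nn.Module):
    def __init__(self, num_aus=NUM_AUS):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any CNN feature extractor
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.head = nn.Linear(512, num_aus)        # one logit per AU

    def forward(self, x):
        return self.head(self.backbone(x))         # (B, num_aus) logits

model = AUDetector()
# Per-AU positive weights counteract the AU imbalance noted in the review.
pos_weight = torch.full((NUM_AUS,), 3.0)           # placeholder statistics
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

images = torch.randn(4, 3, 224, 224)               # dummy batch
labels = torch.randint(0, 2, (4, NUM_AUS)).float() # multi-label AU targets
loss = criterion(model(images), labels)
loss.backward()
```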
    • Deep facial expression recognition: a survey

      Shan Li, Weihong Deng
      Vol. 25, Issue 11, Pages: 2306-2320(2020) DOI: 10.11834/jig.200233
      摘要:Facial expression is a powerful,natural,and universal signal for human beings to convey their emotional states and intentions. Numerous studies have been conducted on automatic facial expression analysis because of its practical importance in sociable robotics,medical treatment,driver fatigue surveillance,and many other human-computer interaction systems. Various facial expression recognition (FER) systems have been explored to encode expression information from facial representations in the field of computer vision and machine learning. Traditional methods typically use handcrafted features or shallow learning for FER. However,related studies have collected training samples from challenging real-world scenarios,which implicitly promote the transition of FER from laboratory-controlled to in-the-wild settings since 2013. Meanwhile,studies in various fields have increasingly used deep learning methods,which achieve state-of-the-art recognition accuracy and remarkably exceed the results of previous investigations due to considerably improved chip processing abilities (e.g.,GPU units) and appropriately designed network architectures. Moreover,deep learning techniques are increasingly utilized to handle challenging factors for emotion recognition in the wild because of the effective training of facial expression data. The transition of facial expression recognition from being laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields have promoted the use of deep neural networks to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on the following important issues. 1) Deep neural networks require a large amount of training data to avoid overfitting. However,existing facial expression databases are insufficient for training common neural networks with deep architecture,which achieve promising results in object recognition tasks. 2) Expression-unrelated variations are common in unconstrained facial expression scenarios,such as illumination,head pose,and identity bias. These disturbances are nonlinearly confounded with facial expressions and therefore strengthen the requirement of deep networks to address the large intraclass variability and learn effective expression-specific representations. We provide a comprehensive review of deep FER,including datasets and algorithms that provide insights into these intrinsic problems,in this survey. First,we introduce the background of fields of FER and summarize the development of available datasets widely used in the literature as well as FER algorithms in the past 10 years. Second,we divide the FER system into two main categories according to feature representations,namely,static image and dynamic sequence FER. The feature representation in static-based methods is encoded with only spatial information from the current single image,whereas dynamic-based methods consider temporal relations among contiguous frames in input facial expression sequences. On the basis of these two vision-based methods,other modalities,such as audio and physiological channels,have also been used in multimodal sentiment analysis systems to assist in FER. Although pure expression recognition based on visible face images can achieve promising results,incorporating it with other models into a high-level framework can provide complementary information and further enhance the robustness. 
We introduce existing novel deep neural networks and related training strategies,which are designed for FER based on both static and dynamic image sequences,and discuss their advantages and limitations in state-of-the-art deep FER. Competitive performance and experimental comparisons of these deep FER systems in widely used benchmarks are also summarized. We then discuss relative advantages and disadvantages of these different types of methods with respect to two open issues (data size requirement and expression-unrelated variations) and other focuses (computation efficiency,performance,and network training difficulty). Finally,we review and summarize the following challenges in this field and future directions for the design of robust deep FER systems. 1) Lacking training data in terms of both quantity and quality is a main challenge in deep FER systems. Abundant sample images with diverse head poses and occlusions as well as precise face attribute labels,including expression,age,gender,and ethnicity,are crucial for practical applications. The crowdsourcing model under the guidance of expert annotators is a reasonable approach for massive annotations. 2) Data bias and inconsistent annotations are very common among different facial expression datasets due to various collecting conditions and the subjectiveness of annotating. Furthermore,the FER performance fails to improve when training data is enlarged by directly merging multiple datasets due to inconsistent expression annotations. Cross-database performance is an important evaluation criterion of generalizability and practicability of FER systems. Deep domain adaption and knowledge distillation are promising trends to address this bias. 3) Another common issue is imbalanced class distribution in facial expression due to the practicality of sample acquirement. One solution is to resample and balance the class distribution on the basis of the number of samples for each class during the preprocessing stage using data augmentation and synthesis. Another alternative is to develop a cost-sensitive loss layer for reweighting during network work training. 4) Although FER within the categorical model has been extensively investigated,the definition of prototypical expressions covers only a small portion of specific categories and cannot capture the full repertoire of expressive behavior for realistic interactions. Incorporating other affective models,such as FACS(facial action coding system) and dimensional models,can facilitate the recognition of facial expressions and allow them to learn expression-discriminative representations. 5) Human expressive behavior in realistic applications involves encoding from different perspectives,with facial expressions as only one modality. Although pure expression recognition based on visible face images can achieve promising results,incorporating it with other models into a high-level framework can provide complementary information and further enhance the robustness. For example,the fusion of other modalities,such as the audio information,infrared images,and depth information from 3D face models and physiological data,has become a promising research direction due to the large complementarity of facial expressions and the good application value of human-computer interaction (HCI) applications.  
      Keywords: facial expression recognition (FER); real world; deep learning; survey
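      Challenge 3) above mentions cost-sensitive loss reweighting for imbalanced expression classes. The following sketch shows that idea with inverse-frequency class weights in a standard cross-entropy loss; the class counts and the seven-class setting are placeholders, not figures from the survey.

```python
# Minimal sketch of cost-sensitive reweighting for imbalanced expression classes.
# Class counts are illustrative assumptions.
import torch
import torch.nn as nn

class_counts = torch.tensor([4772., 1290., 281., 717., 4953., 705., 3171.])
# inverse-frequency weights, normalized so the average weight is roughly 1
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 7)                 # 7 basic expressions (assumption)
targets = torch.randint(0, 7, (8,))
loss = criterion(logits, targets)
```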
    • Remote photoplethysmography-based physiological measurement: a survey

      Xuesong Niu, Hu Han, Shiguang Shan
      Vol. 25, Issue 11, Pages: 2321-2336(2020) DOI: 10.11834/jig.200341
      Abstract: Physiological signals,such as heart rate (HR),respiration frequency (RF),and heart rate variability (HRV),are important clues to analyze a person's health and affective status. Traditional measurements of physiological signals are based on the electrocardiography (ECG) or contact photoplethysmography (cPPG) technology. However,both technologies require professional equipment,which may cause inconvenience and discomfort for subjects. Remote photoplethysmography (rPPG) technology for remote measurement of physiological signals has progressed considerably and recently attracted considerable research attention. The rPPG technology,which is based on skin color variations due to the periodical optical absorption of skin tissue caused by cardiac activity,demonstrates high potential in many applications,such as healthcare,sleep monitoring,and deception detection. The process for rPPG-based physiological measurement can be divided into three steps. First,regions of interest (ROIs) are extracted from the face video. Second,blood volume pulse(BVP) signal is reconstructed from signals generated from the ROIs. Finally,the reconstructed BVP signal is used for physiological measurements. The reconstruction of the BVP signal is the key step for rPPG-based remote physiological measurements. A detailed review of methods for rPPG-based remote physiological measurement is presented in this study from the aspect of assumptions they use,which can be categorized into three kinds,i.e.,methods based on the skin reflection model,methods based on the BVP signal's physical characteristics,and data-driven methods. Studies on the skin reflection model-based methods can be further categorized into spatial skin and skin reflection models of different color channels. Studies on methods that use the BVP signal's physical characteristics can be further categorized into blind signal separation,manifold projection,low rank factorization,and frequency domain constraint. Studies on data-driven methods can be further categorized into methods based on hand-crafted features and deep learning. A detailed review of evaluations of different rPPG-based physiological measurement methods is also presented from the aspects of tasks,databases,metrics,and protocols. Evaluation tasks used for remote physiological measurement include average heart rate measurement,respiration frequency measurement,and heart rate variability analysis. Databases of rPPG-based physiological measurements are summarized according to database scale and variations. Evaluation metrics for remote physiological measurement can be categorized into statistics of error,correlation,and signal quality. Evaluation protocols for data-driven methods are summarized into fixed partition,subject-independent division,subject-exclusive division,and cross-database protocols. Finally,we discuss the challenges of the rPPG-based remote physiological measurement and put forward the potential research directions for future investigations. Challenges include video quality (i.e.,video compression and pre-processing of frames),influence of subject's head movements,variations of lighting conditions,and lack of data. Future research trends include designing hand-crafted methods for different challenge scenarios and exploring technologies,such as self-supervised,semi-supervised,and weakly-supervised learning,for data-driven methods.
      Keywords: remote photoplethysmography (rPPG); cardiac cycle; physiological measurement; literature survey; algorithm evaluation
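      The three-step pipeline described above (ROI extraction, BVP reconstruction, physiological measurement) can be illustrated with a deliberately simple baseline: a spatially averaged green-channel trace, a band-pass filter, and a spectral peak as the heart-rate estimate. This is a hedged sketch, not any specific surveyed method such as CHROM or POS; the frame rate and frequency band are assumptions.

```python
# Minimal sketch of a three-step rPPG pipeline:
# ROI signal -> crude BVP reconstruction -> HR estimation.
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_hr(roi_frames, fs=30.0):
    """roi_frames: (T, H, W, 3) uint8 array of cropped skin regions."""
    # 1) spatially average the ROI per frame (green channel carries most pulse info)
    raw = roi_frames[..., 1].reshape(len(roi_frames), -1).mean(axis=1)
    # 2) band-pass to a plausible heart-rate band (0.7-4 Hz ~ 42-240 bpm)
    b, a = butter(3, [0.7, 4.0], btype="bandpass", fs=fs)
    bvp = filtfilt(b, a, raw - raw.mean())
    # 3) take the dominant frequency of the BVP spectrum as the heart rate
    spectrum = np.abs(np.fft.rfft(bvp)) ** 2
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    hr_bpm = 60.0 * freqs[band][np.argmax(spectrum[band])]
    return bvp, hr_bpm

bvp, hr = estimate_hr(np.random.randint(0, 255, (300, 64, 64, 3), dtype=np.uint8))
```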
    • Advances and challenges in facial expression analysis

      Xiaojiang Peng, Yu Qiao
      Vol. 25, Issue 11, Pages: 2337-2348(2020) DOI: 10.11834/jig.200308
      摘要:Facial expression analysis aims to understand human emotions by analyzing visual face information and has been a popular topic in the computer vision community. Its main challenges include annotating difficulties,inconsistent labels,large face poses,and occlusions in the wild. To promote the advance of facial expression analysis,this paper comprehensively reviews recent advances,challenges,and future trends. First,several common tasks for facial expression analysis,basic algorithm pipeline,and datasets are explored. In general,facial expression analysis mainly includes three tasks,namely,facial expression recognition (i.e.,basic emotion recognition),facial action unit detection,and valence-arousal regression. The well-known pipeline of facial expression analysis mainly consists of three steps,namely,face extraction (which includes face detection,face alignment,and face cropping),facial feature extraction,and classification (or regression for the valence-arousal task). Datasets for facial expression analysis come from laboratory and real-world environments,and recent datasets mainly focus on in-the-wild conditions. Second,this paper provides an algorithm survey on facial expression recognition,including hand-crafted feature-based methods,deep learning-based methods,and action unit(AU)-based methods. For hand-crafted features,appearance- and geometry-based features can be used. Specifically,traditional appearance-based features mainly include local binary patterns and Gabor features. Geometry-based features are mainly computed from facial key points. In deep learning-based methods,early studies apply deep belief networks while almost all recent methods use deep convolutional neural networks. Apart from direct facial expression recognition,some methods leverage the corresponding map between AUs and emotion categories,and infer categories from detected AUs. Third,this paper summarizes and discusses recent challenges in facial expression recognition (FER),including the small scale of reliable FER datasets,uncertainties in large-scale FER datasets,occlusion and large pose problems in FER datasets,and comparability of FER algorithms. Lastly,this paper discusses the future trends of facial expression analysis. According to our comprehensive review and discussion,1) for the small-scale challenge posed by reliable FER data,two important strategies are transfer learning from face recognition models and semi-supervised methods based on large-scale unlabeled facial data. 2) Owing to ambiguous expression,low-quality face images,and subjectivity of annotators,uncertain annotations inevitably exist in large-scale in-the-wild FER datasets. For better learning facial expression features,it is beneficial to suppress these uncertain annotations. 3) For the challenges posed by occlusion and large pose,combining different local regions is effective,and another valuable strategy is to first learn an occlusion- and pose-robust face recognition model and then transfer it to facial expression recognition. 4) Current FER methods are difficult to compare due to the impact of many hyper parameters in deep learning methods. Thus,comparing various baselines for different FER methods is necessary. For example,a facial expression recognition method should be compared in the configuration of learning from scratch and from pretrained models. Recently,although facial expression analysis has great progress,the abovementioned challenges remain unsolved. 
Facial expression analysis is a practical task,and algorithms should also pay attention to time and memory consumption in addition to accuracy in the future. In the deep learning era,facial action unit detection in the wild has achieved great progress,and using the results of action unit detection in inferring facial expression categories in the wild may be feasible in the future.
      Keywords: facial expression analysis; facial expression recognition (FER); convolutional neural network (CNN); deep learning; transfer learning
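      One of the strategies discussed above is transfer learning from a pretrained face model followed by fine-tuning an expression head. The sketch below shows that pattern; an ImageNet-pretrained ResNet stands in for a face recognition backbone, which is an assumption for illustration only.

```python
# Minimal sketch of transfer learning for FER: reuse a pretrained backbone,
# replace the classification head, and optionally freeze the backbone.
import torch.nn as nn
import torchvision.models as models

def build_fer_model(num_expressions=7, freeze_backbone=True):
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    if freeze_backbone:                 # keep the generic pretrained features
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_expressions)  # new head
    return model

fer_model = build_fer_model()
trainable = [p for p in fer_model.parameters() if p.requires_grad]  # head only
```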

      Dataset

    • MED: multimodal emotion dataset in the wild

      Jing Chen, Kejun Wang, Cong Zhao, Chaoqun Yin, Ziqiang Huang
      Vol. 25, Issue 11, Pages: 2349-2360(2020) DOI: 10.11834/jig.200215
      Abstract: Objective: Emotion recognition or affective computing is crucial in various human-computer interactions,including interaction with artificial intelligence (AI) assistants,such as home robots,Google assistant,Alexa,and even self-driving cars. AI assistants or other forms of technology can also be used to identify a person's emotional or cognitive state to help people live a happy,healthy,and productive life and even help with mental health treatment. Adding emotion recognition to human-machine systems can help the computer recognize emotion and intention of users when speaking and give an appropriate response. To date,computers inaccurately capture and interpret user emotions and intentions mainly because of the different datasets used when developing an intelligent system and lack of data collection in an actual application environment that reduces system robustness. The simple dataset collected in the laboratory environment,which uses an unreasonable induction method of emotion generation,is typically characterized by a solid background and uniform and strong illumination. The resulting emotion category is exaggerated and not genuine. User age,gender,and ethnicity as well as complexity of the application environment and diversity of collection angles in the actual application process are problems that need solutions when developing a system. Therefore,application of systems developed in the laboratory environment is difficult in the real world. Method: Creating a dataset from the real environment can solve the problem of inconsistency between datasets used in software development and the real-world application. Wild datasets,especially multimodal sentiment datasets containing dynamic information,are limited. Therefore,the paper collected and annotated a new multimodal emotion dataset (MED) in the real environment. First,five collectors watched videos from various data sources with different content,such as TV series,movies,talk shows,and live broadcasts,and extracted over 2 500 video clips containing emotional states. Second,the video frame of each video is obtained and saved in a folder to determine the video sequence. The pedestrian detection model is used to obtain valid video frames because only some video frames contain valid person or face information. Clips without people are considered invalid video frames and undetected. The resulting video frame containing only personal information can be used to investigate postural emotional information,such as limbs. Posture emotion can be used to assess the emotional state of a person when the face is blocked or the character has a large motion range. Facial expressions account for a large proportion of emotional judgment. Third,two methods are used for face detection. Finally,annotators manually annotated video sequences of detected people and faces although the staff collected videos according to the emotional state in the manual cutting process. Given that humans will have deviations in emotional judgment and each person has a different sensitivity to emotion,the paper used a crowdsourcing method to make annotations. Crowdsourcing methods are used in the collection of many datasets,such as ImageNet and RAF. Fifteen taggers with professional emotional information training independently tagged all the video clips. A total of 1 839 video clips were obtained on the basis of seven types of emotions after annotation. Result: Different divisions of the dataset are presented in the study.
The dataset is classified into training and validation sets at a ratio of 0.65:0.35 according to the acted facial expression in the wild (AFEW) division. The amount of data for each type of emotion in the AFEW and MED datasets is then compared and presented in the form of a graph. MED has more samples for each type of emotion than AFEW. The paper evaluates the dataset using a large number of deep and machine learning algorithms and provides baselines for each modality. First,classic machine learning methods,such as local binary patterns(LBP),histogram of oriented gradient(HOG),and Gabor wavelet are applied to obtain the baseline of the CK+ dataset. The same method is applied to the MED dataset,and accuracy decreases by more than 50%. Data collected in the real environment is complicated. The algorithm developed using the dataset in the laboratory environment is unsuitable for the real environment. Hence,creating the dataset in the real environment is necessary. The comparison of AFEW and MED datasets verifies that data of MED are reasonable and effective. The baselines of facial expression recognition and the two other modalities are also provided. The results indicate that other modalities can be used as an auxiliary method for comprehensively assessing emotions,especially when the face is blocked or the face information is unavailable. Finally,the accuracy of emotion recognition improves by 4.03% through multimodal fusion. Conclusion: MED is a multimodal real-world dataset that expands the existing multimodal dataset. Researchers can develop a deep learning algorithm by combining MED with other datasets to form a large multimodal database that contains multiple languages and ethnicities,promote cross-cultural emotion recognition and perception analysis of different emotion evaluations,and improve the performance of automatic emotion computing systems in real applications.
      Keywords: in the wild; multimodal; facial expression; body posture; speech emotion; dataset
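      The 15-annotator crowdsourcing step described above amounts to aggregating per-clip labels. A minimal majority-vote sketch is shown below; the consensus threshold used to discard ambiguous clips is an illustrative assumption, not the dataset's published rule.

```python
# Minimal sketch of aggregating crowdsourced emotion labels by majority vote.
from collections import Counter

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def aggregate_votes(votes, min_agreement=8):
    """votes: list of per-annotator labels for one clip (e.g., 15 strings)."""
    label, count = Counter(votes).most_common(1)[0]
    if count < min_agreement:          # too little consensus: drop the clip
        return None
    return label

clip_votes = ["happiness"] * 9 + ["surprise"] * 4 + ["neutral"] * 2
print(aggregate_votes(clip_votes))     # -> "happiness"
```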

      Facial Expression Recognition

    • Facial expression recognition improved by continual learning

      Jing Jiang, Weihong Deng
      Vol. 25, Issue 11, Pages: 2361-2369(2020) DOI: 10.11834/jig.200315
      摘要:ObjectiveFacial expression recognition (FER) has become an important research topic in the field of computer vision. FER plays an important role in human-computer interaction. Most studies focus on classifying basic discrete expressions (i.e.,anger,disgust,fear,happiness,sadness,and surprise) using static image-based approaches. Recognition performance in deep learning-based methods has progressed considerably. Deep neural networks,especially convolutional neural networks (CNNs),achieve outstanding performance in image classification tasks. A large amount of labeled data is needed for training deep networks. However,insufficient samples in many widely used FER datasets lead to overfitting in the trained model. Fine-tuning a network that has been well pre-trained on a large face recognition dataset is commonly performed to solve the shortage of samples in FER datasets and prevent overfitting. The pre-trained network can capture facial information and the similarity between face recognition (FR) and FER domains facilitates the transfer of features. Although this transfer learning strategy demonstrates satisfactory performance,the fine-tuned FR network may still contain face-dominated information,which can weaken the network's ability to represent different expressions. On the one hand,we expect to reserve the strong ability of the FR network to capture important facial information,such as face contour,and guide the FER network training in real cases. On the other hand,we want the network to learn additional expression-specific information. The FER model training using a continual learning approach is proposed to utilize the close relationship between FR and FER effectively and exploit the ability of the pre-trained FR network.MethodThis study aims to train an expression recognition network with auxiliary significant information of face recognition network instead of only using a fine-tuning approach. We first introduce a continual learning approach into the field of FER. Continual learning analyzes the problem learning from an infinite stream of data with the objective of gradually extending the acquired knowledge and using it for future learning. Synaptic intelligence consolidates important parameters of previous tasks to solve the problem of catastrophic forgetting and alleviate the reduction in performance by preventing those important parameters from changing in future tasks. Similar to continual learning,we conduct the FR task before the FER task is added. However,we only focus on the performance of the later task while continual learning also aims to alleviate the catastrophic forgetting of the original task. Sequential tasks in continual learning commonly contain a small number of classes so that important parameters are related to current classes. However,important parameters are more likely to capture common facial features rather than specific classes due to the large amount of categories in the FR task,thereby remarkably increasing their contributions to the total loss. Hence,a two-stage training strategy is proposed in this study. We train a FR network and compute each parameter's importance while training in the first stage. We refine the pre-trained network with the supervision of expression label information while preventing important parameters from excessively changing in the second stage. 
The loss function for expression classification is composed of two parts,namely,softmax loss and parameter-wise importance regularization.ResultWe conduct experiments on three widely used FER datasets,including CK+(the extended Cohn-Kanade database),Oulu-CASIA,and RAF-DB(real-world affective faces database). RAF-DB is an in-the-wild database while the two other databases are laboratory-controlled. The use of RAF-DB achieves an accuracy of 88.04%,which improves the performance of direct fine-tuning by 1.83% and surpasses the state-of-the-art algorithm self-cure network (SCN) by 1.01%. The result using CK+ improves the fine-tuning baseline by 1.1%. The experiment using Oulu-CASIA also indicated that the network has satisfactory generalization performance with the addition of parameter-wise importance regularization. Meanwhile,the effect of such regularization improves the performance on in-the-wild datasets more remarkably due to the more complex faces under occlusion and pose variations.ConclusionWe exploit the relationship between FR and FER and adopt the idea and algorithm of continual learning in FER to avoid overfitting in this study. The main purpose and effect of continual learning is to preserve the powerful feature extraction ability of the FR network via parameter-wise importance regularization and allow less-important parameters to learn additional expression-specific information. The experimental results showed that our training strategy helps the FER network to learn additional discriminative features and thus promotes recognition performance.
      Keywords: deep learning; facial expression recognition (FER); face recognition (FR); pre-trained network; continual learning; parameter-wise importance regularization
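      The second-stage objective described above combines a softmax loss with a parameter-wise importance penalty that keeps FR-critical weights close to their pretrained values. The sketch below shows that loss composition, assuming the importance weights (omega) and reference parameters were accumulated during the first (face recognition) stage, as in synaptic intelligence; hyperparameters are placeholders.

```python
# Minimal sketch of softmax loss + parameter-wise importance regularization.
import torch
import torch.nn as nn

def importance_regularizer(model, ref_params, omega, lam=1.0):
    reg = 0.0
    for name, p in model.named_parameters():
        if name in omega:                        # only penalize shared FR layers
            reg = reg + (omega[name] * (p - ref_params[name]) ** 2).sum()
    return lam * reg

def fer_loss(model, logits, targets, ref_params, omega, lam=1.0):
    ce = nn.functional.cross_entropy(logits, targets)   # softmax loss
    return ce + importance_regularizer(model, ref_params, omega, lam)
```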
    • Posed and spontaneous expression distinction through multi-task and adversarial learning

      Zhuangqiang Zheng, Qisheng Jiang, Shangfei Wang
      Vol. 25, Issue 11, Pages: 2370-2379(2020) DOI: 10.11834/jig.200264
      摘要:ObjectivePosed and spontaneous expression distinction is a major problem in the field of facial expression analysis. Posed expressions are deliberately performed to confuse or cheat others,while spontaneous expressions occur naturally. The difference between the posed and spontaneous delivery of the same expression by a person is little due to the subjective fraud of posed expressions. At the same time,posed and spontaneous expression distinction suffers from the problem of high intraclass differences caused by individual differences. These limitations bring difficulties in posed and spontaneous expression distinction. However,behavioral studies have shown that significant differences exist between posed and spontaneous expressions in spatial patterns. For example,compared with spontaneous smiles,the contraction of zygomatic muscles is more likely to be asymmetric in posed smiles. Moreover,constricted orbicularis oculi muscle is presented in spontaneous smiles but absent in posed smiles. Such inherent spatial patterns in posed and spontaneous expressions can be utilized to facilitate posed and spontaneous expression distinction. Therefore,modeling spatial patterns inherent in facial behavior and extracting subject-independent facial features are important for posed and spontaneous expression distinction. Previous works typically focused on spatial pattern modeling in the facial behavior. Researchers commonly use landmarks to describe motion patterns of facial muscles approximately and capture spatial patterns inherent in facial behavior based on landmark information due to the difficulty in obtaining motion patterns of facial muscles. According to the difference in modeling spatial patterns inherent in facial behavior,studies on posed and spontaneous expression distinction can be categorized into two approaches,namely,feature- and probabilistic graphical model (PGM)-based methods. Feature-based methods implicitly capture spatial patterns using handcrafted low-level or deep features extracted by deep convolution networks. However,handcrafted low-level features have difficulty in describing complex spatial patterns inherent in the facial behavior. PGM-based methods model the distribution among landmarks and explicitly capture spatial patterns existing in facial behavior using PGMs. However,PGMs frequently simplify reasoning and calculation of models through independence or energy distribution assumptions,which are sometimes inconsistent with the ground truth distribution. At the same time,PGM-based methods typically use handcrafted low-level features and thus face similar defects. An adversarial network for posed and spontaneous expression distinction is proposed to solve the problems.MethodOn the one hand,we use landmark displacements between onset and corresponding apex frames to describe motion patterns of facial muscles approximately and capture spatial patterns inherent in facial behavior explicitly by modeling the joint distribution between expressions and landmark displacements. On the other hand,we alleviate the problem of high intraclass differences by extracting subject-independent features. Specifically,the proposed adversarial network consists of a feature extractor,a multitask learner,a multitask discriminator,and a feature discriminator. The feature extractor attempts to extract facial features,which are discriminative for posed and spontaneous expression distinction and robust for subjects. 
The multitask learner is used to classify posed and spontaneous expressions as well as predict facial landmark displacement simultaneously. The multitask discriminator distinguishes the predicted expression and landmark displacement from ground truth ones. The feature discriminator is a subject classifier that can be used to measure the correlation and independence between extracted facial features and subject identities. The feature extractor is trained cooperatively with the multitask learner but in an adversarial way with the feature discriminator. Thus,the feature extractor can learn good facial features for expression distinction and landmark displacement regression but not for subject recognition. The multitask learner competes with the multitask discriminator. The distribution of predicted expression and landmark displacement converges to the distribution of ground truth labels through adversarial learning. Thus,spatial patterns can be thoroughly explored for posed and spontaneous expression distinction.ResultExperimental results on three benchmark datasets,i.e.,MMI(M&M Initiative),NVIE(Natural visible and infrared facial expression),and BioVid(Biopotential and Video),demonstrate that the proposed adversarial network not only effectively learns subject-independent and expression-discriminative facial features,which improves the generalization ability of the model on unseen subjects,but also makes full use of spatial and temporal patterns inherent in facial behaviors to improve the performance of posed and spontaneous expression distinction,leading to superior performance compared with state-of-the-art methods.ConclusionExperiments demonstrate the effectiveness of the proposed method.
      Keywords: posed and spontaneous expression distinction; adversarial learning; multi-task learning; spatial and temporal patterns of facial behavior; subject-independent facial feature
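      The subject-adversarial part of the framework can be sketched with a gradient reversal layer feeding a subject classifier, as below; this is one common way to implement adversarial feature learning and may differ from the paper's exact training scheme. The landmark-displacement input size, feature width, and subject count are assumptions, and the multitask discriminator is omitted.

```python
# Minimal sketch of subject-adversarial feature learning via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip gradients toward the extractor

feature_extractor = nn.Sequential(nn.Linear(136, 256), nn.ReLU())  # landmark displacements (assumption)
expression_head = nn.Linear(256, 2)        # posed vs. spontaneous
subject_head = nn.Linear(256, 30)          # number of subjects (assumption)

x = torch.randn(8, 136)
feat = feature_extractor(x)
expr_logits = expression_head(feat)                       # cooperative task
subj_logits = subject_head(GradReverse.apply(feat, 1.0))  # adversarial task
loss = (nn.functional.cross_entropy(expr_logits, torch.randint(0, 2, (8,)))
        + nn.functional.cross_entropy(subj_logits, torch.randint(0, 30, (8,))))
loss.backward()
```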
    • Spatiotemporal attention network for microexpression recognition

      Guohao Li, Yifan Yuan, Xianye Ben, Junping Zhang
      Vol. 25, Issue 11, Pages: 2380-2390(2020) DOI: 10.11834/jig.200325
      Abstract: Objective: Microexpression, a kind of spontaneous facial muscle movement, can conceal the real underlying emotions of people. Microexpression has potential applications in security, police interrogation, and psychological testing. Compared with macroexpression, the lower intensity and shorter duration of microexpressions increase the difficulty in recognition. Traditional methods can be divided into facial image- and optical flow-based approaches. Facial image-based methods utilize spatiotemporal partition blocks to construct feature vectors wherein spatiotemporal segmentation parameters are regarded as hyperparameters. Each sample of the dataset uses the same hyperparameters. The performance of microexpression recognition may suffer when using the same spatiotemporal division blocks for different samples, which may require varying spatiotemporal segmentation blocks. Optical flow-based methods are widely used for microexpression recognition. Although such methods demonstrate satisfactory robustness to variations in illumination, facial features in different regions are considered equally important, which ignores the fact that microexpressions appear only in partial regions. The attention mechanism, which has been introduced in many fields, such as natural language processing and computer vision, can focus on salient regions of the object and give additional weights to these regions. We apply the attention mechanism to the microexpression recognition task and propose a spatiotemporal attention network (STANet) due to its outstanding performance in recognition tasks. Method: STANet mainly consists of two attention modules: a spatial attention module (SAM) and a temporal attention module (TAM). SAM is used to focus on microexpression regions with high intensity while TAM is incorporated to learn discriminative frames, which are given additional weights. Inspired by the fully convolutional network (FCN), which was proposed in semantic segmentation, we propose a spatial attention branch (SAB) in the SAM. SAB, a top-down and bottom-up structure, is a crucial component of SAM. Convolutional layers and nonlinear transformation are used to extract salient features of the microexpression in the downsampling process, followed by maximum pooling. The maximum pooling operation is utilized to reduce the resolution and increase the receptive field of the feature map. We use bilinear interpolation in the upsampling process to recover the feature map to its original size gradually and adopt skip connections to retain detailed information, which may be lost in the upsampling process. A sigmoid function is ultimately adopted after the last layer of the feature map to normalize the SAB output to [0, 1]. Furthermore, we propose a temporal attention branch (TAB) to focus on the additional discriminative frames in the microexpression sequence, which are crucial in microexpression recognition. Experiments are conducted using the Chinese Academy of Sciences microexpression (CASME), the Chinese Academy of Sciences microexpression II (CASME II), and spontaneous microexpression database-high speed camera (SMIC-HS) datasets with 171, 246 and 164 samples, respectively. Corner crop and rescaling augmentations are used in CASME and CASME II to avoid overfitting. Scaling factors are set to 0.9, 1.0 and 1.1. Corner crop and horizontal flip augmentations are applied in the SMIC-HS dataset. Linear interpolation is used to interpolate samples into 20 frames because various samples have different numbers of frames.
Samples are then resized to 192×192 pixels. Finally, we use FlowNet 2.0 to obtain the optical flow sequence of each frame. Experimental settings use the Adam optimizer with a learning rate of 1E-5. The weight decay coefficient is set to 1E-4 and the coefficient λ of the ℓ1 regularization term is set to 1E-8. The number of iterations is 60, 30 and 100 for CASME, CASME II and SMIC-HS, respectively. Result: We compared our model with eight state-of-the-art frameworks, including facial image- and optical flow-based methods, using three public microexpression datasets, namely, CASME, CASME II and SMIC-HS. Leave-one-subject-out (LOSO) cross validation is used due to insufficient samples. We utilize classification accuracy to measure the performance of methods. The results showed that our model achieves the best performance with the CASME and CASME II datasets. Our model's classification accuracy rate in the CASME dataset is 1.78% higher than that of Sparse MDMO, which ranks second. The classification accuracy rate of STANet in the CASME II dataset is 1.90% higher than that of histogram of image gradient orientation (HIGO). The classification accuracy rate of our model in the SMIC-HS dataset is 68.90%. Ablation studies are also performed using the CASME dataset. The results verified the validity of the SAM and TAM, and the fusion algorithm can significantly improve the recognition accuracy. Conclusion: STANet is proposed in this study for microexpression recognition. SAM emphasizes salient regions of the microexpression by placing additional weights on these regions. Additionally, TAM can learn large weights for clips with high variation in sequences. Experiments performed using the three public microexpression datasets illustrated that STANet achieves the highest recognition accuracy rate on the CASME and CASME II datasets compared with eight other state-of-the-art methods and demonstrates satisfactory prediction performance using the SMIC-HS dataset.
      Keywords: microexpression recognition; classification; facial feature; deep learning; attention mechanism; spatiotemporal attention
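      The spatial attention branch described above (downsample, bilinear upsample, skip connection, sigmoid mask) can be sketched as a small PyTorch module. The channel sizes, depth, and single down/up stage here are illustrative, not the published STANet configuration.

```python
# Minimal sketch of a spatial attention branch: a top-down/bottom-up mask
# in [0, 1] that reweights the input feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionBranch(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.ReLU(), nn.MaxPool2d(2))
        self.up_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.skip = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        y = self.down(x)                                   # bottom-up
        y = F.interpolate(y, size=x.shape[-2:], mode="bilinear",
                          align_corners=False)             # top-down
        mask = torch.sigmoid(self.up_conv(y) + self.skip(x))
        return x * mask                                    # attended features

feat = torch.randn(2, 32, 48, 48)
out = SpatialAttentionBranch()(feat)                       # same shape as feat
```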
    • Self-supervised learning and generative adversarial network-based facial attribute recognition with small sample size training

      Ying Shu, Longbiao Mao, Si Chen, Yan Yan
      Vol. 25, Issue 11, Pages: 2391-2403(2020) DOI: 10.11834/jig.200334
      摘要:ObjectiveFacial attribute recognition is an important research topic in the fields of computer vision and emotion sensing. Face, an important biological feature of human beings, contains a large number of attributes, such as expression, age, and gender. Facial attribute recognition aims to predict the different attributes in a given facial image. Facial attribute recognition has progressed considerably with the remarkable development of deep learning. State-of-the-art deep learning-based facial attribute recognition methods typically rely on large-scale training facial data with complete attribute labels. However, the number of training facial data may be limited in some real-world applications and several attribute labels of the facial image are unavailable, mainly because attribute labeling is a time-consuming and labor-intensive task. Notably, defining a standard criterion for attribute labeling is difficult for some subjective attributes. As a result, the accuracy of these methods is poor when addressing the problem of missing attribute labels in small sample size training. Previous methods attempted to find samples that match the required label from the unlabeled dataset and then added these samples to the corresponding category of the training set to augment the training data. Note that the unlabeled dataset is typically of low quality, thereby affecting the final performance of the model. Furthermore, the selection of matching samples is time consuming. Some methods directly take advantage of similar data to augment the original dataset. However, deciding whether two datasets are similar and finding similar datasets are still challenging. Current methods need further investigation on facial attribute recognition under small sample size training. A self-supervised learning and generative adversarial network (GAN)-based method is proposed in this study to solve the above-mentioned problems and improve the accuracy of facial attribute recognition for small sample size training with missing attribute labels.MethodFirst, we adopt a rotation-based self-supervised learning technique to pretrain the attribute classification network. We use ResNet50 as the basic model of the network and modify output nodes of the last fully connected layer to four to predict the rotation angle of the input image (including 0°, 90°, 180° and 270°). Second, we concatenate rotated and original images along the channel dimension and use them as the input of the self-supervised learning network. Third, we utilize an attention mechanism-based GAN as the facial attribute synthesis model, where facial attributes can be edited to augment both attribute labels and training data. Specifically, the feature map is passed through a 1×1 convolution layer and then multiplied with its own transpose. In this way, we obtain attention features by multiplying the attention and feature maps. Fourth, we use this model to edit facial attributes to augment labels and training data. Finally, we use the augmented training data to train the attribute classification network initialized with self-supervised learning. We use the stochastic gradient descent algorithm during the training process. We select attributes of "baldness", "bangs", "black hair", "blonde", "brown hair", "dense eyebrows", "wearing glasses", "male", "slightly open mouth", "mustache", "no beard", "pale skin", and "young" in the experiment of synthesizing facial attributes. 
Both the encoder and decoder of the GAN generator contain five layers, and the discriminator also consists of five layers. We set the batch size to 64 and use a learning rate of 0.000 2.ResultWe use one-tenth of the data for training using CelebFaces attributes dataset (CelebA), labeled faces in the wild attributes dataset (LFWA), and University of Maryland attribute evaluation dataset (UMD-AED) in the experiments. The accuracy of the proposed method using CelebA and LFWA is improved by 2.42% and 3.17% respectively, compared with the traditional supervised learning-based method. The accuracy of the proposed method using UMD-AED is improved by 5.77%. We also conduct experiments on different sizes of training sets using CelebA, LFWA, and UMD-AED to verify the effectiveness of the self-supervised learning technique on the small dataset further. Experimental results showed that the model demonstrates a significantly improved performance with self-supervised learning when the size of the training set is small (from the complete training data to one-tenth of the training data). The performance of the supervised model using CelebA decreases from 90.86% to 81.97%, while the performance of the self-supervised model decreases from 90.72% to 83.57%. The performance of the supervised model using LFWA decreases by 6.11%, while the performance of the self-supervised model decreases by 3.90%. The performance of the supervised model using UMD-AED decreases by 16.95%, while the performance of the self-supervised model decreases by 11.50%.ConclusionThe proposed method utilizes self-supervised learning to pretrain the initial model and uses GAN for data augmentation. Experimental results showed that our proposed method effectively improves the accuracy of facial attribute recognition for small sample size training with missing attribute labels.  
      Keywords: facial attribute recognition; self-supervised learning; generative adversarial network (GAN); data augmentation; small sample size training
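      The rotation-prediction pretext task described above can be sketched as follows: each image is rotated by 0/90/180/270 degrees and a 4-way head predicts the rotation, pretraining the backbone without attribute labels. This is the standard RotNet-style setup; the paper's variant, which concatenates rotated and original images along the channel dimension, is simplified away here, and training details are assumptions.

```python
# Minimal sketch of rotation-based self-supervised pretraining.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 4)   # 4 rotation classes

def make_rotation_batch(images):
    """images: (B, 3, H, W) -> rotated images and their rotation labels."""
    rotated, labels = [], []
    for k in range(4):                                  # k quarter turns
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

imgs = torch.randn(8, 3, 224, 224)
x, y = make_rotation_batch(imgs)
loss = nn.functional.cross_entropy(backbone(x), y)
loss.backward()
```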

      Mental Health Assessment

    • Automated early identification of ASD under the deep spatiotemporal model-based facial expression analysis for infants and toddlers

      Chuangao Tang, Wenming Zheng, Yuan Zong, Nana Qiu, Simeng Yan, Mengyao Zhai, Xiaoyan Ke
      Vol. 25, Issue 11, Pages: 2404-2414(2020) DOI: 10.11834/jig.200360
      摘要:ObjectiveThe traditional screening of high-risk autism spectrum disorder (HR-ASD) mainly relies on the evaluation of pediatric clinicians. Owing to the low efficiency of this method, highly efficient automated screening tools have become a major research topic. Although most facial expression markers-related research have achieved some progress, the findings were derived from comparatively older children, and the effectiveness of the paradigm used in these research has not been tested in babies aged 8~18 months, whose intelligence quotient (IQ), language, and social ability are still developing. In these studies, a lack of a final diagnostic model led to low feasibility values in the large-scale screening of ASD. In this study, a novel automated screening method that provide diagnostic results based on the analysis of babies' facial expression symptoms under social stress environments was proposed.MethodDifferences among the babies' facial expressions in the HR-ASD and typically developing (TD) comparison groups were determined. A total of 30 infants and toddlers were enrolled in our study, of which 10 were at risk of HR-ASD and 20 were TD babies. All the babies were 8~18 months at the time of enrollment, all the participants received a re-diagnosis during 25 months of life in order that each case in the two groups is true ASD or TD. The still-face paradigm, including an amusing mother-baby interaction episode (baseline, 2 min) and a still-face episode (1 min), was employed to induce babies' emotion regulation behaviors for social stress environment in subsequent episode. We hypothesized that facial features derived from an accurate facial expression recognition system can be used in distinguishing the two groups. This hypothesis was then verified. For the establishment of an accurate facial expression recognition system, a deep spatiotemporal feature learning network was proposed. The spatial feature learning module was pretrained on an open-access dataset named AffectNet and was further trained on a video-based baby facial expression dataset, Research Center of Learning Science & Nanjing Brain Hospital dataset+(RCLS&NBH+ 53 babies subjects, 101 videos, and 95 207 babies' facial images), and a bidirectional long-short term memory network (Bi-LSTM) was used. The trained deep spatiotemporal neural network was verified using the collected babies' facial expression dataset including 30 babies, that is, the infant emotion dataset (IED). Three types of learned features derived from deep neural networks, including feature_a (the output of the last fully connected layer with 1 024 units in a convolutional neural network (CNN) that was only trained on the AffectNet dataset), feature_b (feature_a's counterpart in the CNN part of the CNN+LSTM(long short term memory) model), and feature_c (the output containing 1 024 units derived from the Bi-LSTM module), were compared. Pearson's correlation was computed between these frame-level learned features and their corresponding frame-level facial expression labels. Feature subsets were selected using different correlation thresholds, including without threshold (1 024-d features), 0 < |r| < 1, 0.2 < |r| < 1, 0.4 < |r| < 1, and 0.6 < |r| < 1. Then, the use of first-order statistical measurement, that is, frame-level mean values of selected features within a video was proposed in exploring the association between babies' mental health status and their facial expression symptoms under social stress environment. 
Such features were fed to linear classifiers for the automated screening of HR-ASD.Result1) Basing on the human coding for babies' facial expressions under the still-face paradigm, we find that babies at a high risk of ASD showed more neutral facial expressions (55.03±7.34 s) than those in the TD comparison group (46.26±11.02 s) during the one-minute still-face episode (p < 0.01). The other two types of facial expressions (positive and negative facial expressions) did not show statistically significant differences; 2) The proposed deep spatiotemporal neural network achieved an overall average recognition accuracy of 87.1% on a self-collected infants and toddlers' facial expression dataset, IED, which included 30 babies in this study. The recall rates for positive, neutral, and negative facial expressions were 68.82%, 93.79%, and 59.57%, respectively, whereas the recall rates for positive, neutral and negative facial expressions corresponding to the spatial model just trained on AffectNet were 36.32%, 77.06%, and 58.42%, respectively. A high consistency between the automated emotion prediction results from the CNN+LSTM model and human coding results was found, that is, a Kappa coefficient of 0.63 and Pearson coefficient of 0.67 were attained. 3) Through a leave-one subject out cross-validation, a sensitivity of 70%, specificity of 90%, and overall diagnostic accuracy of 83.3%, where the p-value of permutation test was lower than 0.05, were achieved using the proposed automated screening model based on the linear discriminant classifier and proposed features (feature_c derived from the CNN+LSTM model) under the correlation threshold of 0.6 < |r| < 1, which also verified the proposed hypothesis, and a more accurate facial expression recognition model showed better diagnostic performance between HR-ASD and TD according to the comparison between feature_a and feature_c.ConclusionThe automated screening method based on the proposed features from babies' facial expression is effective, showing potential for large-scale applications.  
      Keywords: autism spectrum disorder (ASD); automated screening; deep spatio-temporal neural networks; baby facial expression recognition; mental health prediction
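      The screening classifier described above (correlation-based feature selection, video-level mean pooling, linear discriminant analysis under leave-one-subject-out validation) can be sketched with scikit-learn on synthetic placeholders. Top-k selection by |r| stands in for the paper's 0.6 < |r| < 1 threshold so the toy data still yields features; all arrays are random stand-ins for the learned CNN+LSTM features.

```python
# Minimal sketch: correlation-based feature selection + video-level pooling
# + LDA with leave-one-subject-out cross-validation (synthetic data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(3000, 1024))      # frame-level learned features
frame_labels = rng.integers(0, 3, size=3000)     # frame-level expression labels
videos = np.repeat(np.arange(30), 100)           # 30 videos, 100 frames each

# Pearson correlation of each feature dimension with the expression label
r = np.array([np.corrcoef(frame_feats[:, j], frame_labels)[0, 1] for j in range(1024)])
selected = np.argsort(-np.abs(r))[:64]           # paper uses a threshold on |r|

# first-order statistics: per-video mean of the selected frame-level features
X = np.vstack([frame_feats[videos == v][:, selected].mean(axis=0) for v in range(30)])
y = rng.integers(0, 2, size=30)                  # HR-ASD (1) vs. TD (0), placeholder
groups = np.arange(30)                           # one subject per video here

acc = []
for tr, te in LeaveOneGroupOut().split(X, y, groups):
    clf = LinearDiscriminantAnalysis().fit(X[tr], y[tr])
    acc.append(clf.score(X[te], y[te]))
print(np.mean(acc))
```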
    • Automatic depression estimation using facial appearance

      Yi An, Zhen Qu, Ning Xu, Zhaxi Nima
      Vol. 25, Issue 11, Pages: 2415-2427(2020) DOI: 10.11834/jig.200322
      Abstract: Objective: Depression is a serious mood disorder that causes noticeable problems in day-to-day activities. Current methods for assessing depression depend almost entirely on clinical interviews or questionnaires and lack systematic and efficient ways for utilizing behavioral observations that are strong indicators of psychological disorder. To help clinicians effectively and efficiently diagnose depression severity, the affective computing community has shown a growing interest in developing automated systems using objective and quantifiable data for depression recognition. Based on these developments, we propose a framework for the automatic diagnosis of depression from facial expressions. Method: The method consists of the following steps. 1) To extract facial dynamic features, we propose a novel dynamic feature descriptor, namely, median robust local binary patterns from three orthogonal planes (MRELBP-TOP), which can capture the microstructure and macrostructure of facial appearance and dynamics. To extend the MRELBP descriptors to the temporal domain, we follow the procedure of the LBP-TOP algorithm, where an image sequence is regarded as a video volume from the perspective of three different stacks of planes, that is, the XY, XT, and YT planes. The XY plane provides spatial domain information, whereas the XT and YT planes provide temporal information. The robust center intensity based LBP (RELBP_CI) and robust neighborhood intensity based LBP (RELBP_NI) features are extracted independently from three sets of orthogonal planes, and co-occurrence statistics in these three directions are considered. The features are then stacked in a joint histogram. 2) The proposed MRELBP-TOP descriptors are typically high dimensional. Standard methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), have been widely used in dimensionality reduction. However, PCA and LDA have some drawbacks. Compared with PCA, random projection (RP) has a lower computational cost and is easier to implement. 3) To obtain a compact feature representation, sparse coding (SC) is used. SC refers to a general class of techniques that automatically select a sparse set of elements from a large pool of possible bases to encode an input signal. Basically, SC assumes that objects in the world and their relationships are simple and succinct and can be represented by only a small number of prominent elements. 4) Finally, support vector regression (SVR) is adopted to predict Beck depression inventory (BDI) scores over an entire video clip for depression recognition and analysis. Result: The root mean square error between the predicted values and the Beck depression inventory-II (BDI-II) scores is 9.70 and 9.01 on the test sets of the continuous audiovisual emotion and depression 2013 (AVEC 2013) and AVEC 2014 datasets, respectively. Conclusion: 1) We develop an automated framework that effectively captures facial dynamics information for the measurement of depression severity. 2) We propose a robust yet dynamic feature descriptor that captures the macrostructure, microstructure, and spatiotemporal motion patterns. The proposed feature descriptor can be adopted for facial expression recognition tasks in the future. Furthermore, we adopt sparse coding to learn an overcomplete dictionary and organize MRELBP-TOP feature descriptors into compact behavior patterns.
      Keywords: depression; median robust local binary patterns from three orthogonal planes (MRELBP-TOP); local binary patterns (LBP); sparse coding (SC); random projection (RP)
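      The descriptor above extends LBP statistics to the temporal axis by slicing the video volume along three orthogonal planes and concatenating the per-view histograms. The following is a minimal illustrative sketch of that three-plane histogram idea, using a plain 8-neighbour LBP operator rather than the paper's median robust (MRELBP) variants; the operator, bin count, and clip size are assumptions, not the authors' implementation.

```python
# Minimal sketch of an LBP-TOP-style descriptor (assumption: plain 8-neighbour
# LBP instead of the paper's MRELBP operators; uniform-pattern handling omitted).
import numpy as np

def lbp_codes(img):
    """Basic 3x3 local binary pattern codes for one 2D slice."""
    padded = np.pad(img, 1, mode="edge")
    center = padded[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = padded[1 + dy:padded.shape[0] - 1 + dy,
                          1 + dx:padded.shape[1] - 1 + dx]
        codes |= ((neighbor >= center).astype(np.uint8) << bit)
    return codes

def lbp_top_histogram(volume, bins=256):
    """Concatenate LBP histograms over the XY, XT, and YT planes of a T*H*W volume."""
    planes = {
        "XY": [volume[t] for t in range(volume.shape[0])],
        "XT": [volume[:, y, :] for y in range(volume.shape[1])],
        "YT": [volume[:, :, x] for x in range(volume.shape[2])],
    }
    hists = []
    for slices in planes.values():
        codes = np.concatenate([lbp_codes(s).ravel() for s in slices])
        hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
        hists.append(hist / hist.sum())   # normalise each view
    return np.concatenate(hists)          # joint descriptor over the three views

# Usage: descriptor for a hypothetical 64-frame grayscale face clip of size 128x128
clip = np.random.rand(64, 128, 128).astype(np.float32)
feat = lbp_top_histogram(clip)            # shape (3*256,)
```

      In the full method, such a high-dimensional histogram would then be reduced with random projection and encoded with sparse coding before SVR regression, as described in the abstract.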

      Physiological Signal and Psychological Analysis

    • Noncontact pulse signal extraction based on multiview neural network

      Changchen Zhao, Feng Ju, Yuanjing Feng
      Vol. 25, Issue 11, Pages: 2428-2438(2020) DOI: 10.11834/jig.200415
      Noncontact pulse signal extraction based on multiview neural network
      Abstract: Objective: Remote photoplethysmography (rPPG) has recently attracted considerable research attention because it can measure the blood volume pulse from video recordings using computer vision techniques without any physical contact with the subject. Extracting pulse signals from video requires the simultaneous consideration of spatial and temporal information. However, these two kinds of information are commonly processed separately with different methods, which can result in inaccurate modeling and low measurement accuracy. A multiview 2D convolutional neural network for pulse extraction from video is proposed to model the intra- and interframe correlation of video data from three points of view. This study aims to investigate an effective spatiotemporal modeling method for rPPG and improve pulse measurement accuracy.
      Method: The proposed network, called the multiview heart rate network (MVHRNet), contains three pathways. It performs 2D convolutions on a given video segment from three views of the input data, namely, height-width (H-W), height-time (H-T), and width-time (W-T), and then integrates the complementary spatiotemporal features of the three views to obtain the final pulse signal. MVHRNet consists of two normal (H-W) convolutional blocks and three multiview (H-W, H-T, and W-T) 2D convolutional blocks. Each convolutional block (except the last) includes dropout, convolutional, pooling, and batch normalization layers. The input and output of the network are a video clip and a predicted pulse signal, respectively. Multiview 2D convolution is a natural generalization of single-view 2D convolution to all three viewpoints of volumetric data. Take the normal 2D convolution in the H-W view as an example: H-W filters sweep one image from left to right and top to bottom, move to the next frame (slice), and repeat the process, so the filters learn the spatial correlation within each H-W slice. Similarly, performing the same process on each slice in the H-T view lets the filters learn the correlation within that view, which carries part of the temporal information of the clip, and the convolution in the W-T view learns the temporal information within the W-T slices. Compared with existing rPPG methods, the proposed method simultaneously models spatiotemporal information, preserves the original structure of the video, and exploits complementary spatiotemporal features through the three-view 2D convolution.
      Result: Extensive experiments are conducted on two datasets (the public pulse rate detection dataset (PURE) and a self-built rPPG dataset (Self-rPPG)), including an ablation study, comparison experiments, and cross-dataset testing. The signal-to-noise ratio (SNR) of the signal extracted by the proposed network is 3.92 dB and 1.92 dB higher than that of traditional methods on the two datasets, respectively, and 2.93 dB and 3.2 dB higher than that of the single-view network. We also evaluate the impact of the window length of the input video clip on the quality of the extracted signal: the SNR increases and the mean absolute error (MAE) decreases as the window length increases, and both tend to saturate when T is greater than 120. The training times of the multiview and single-view networks have the same order of magnitude.
      Conclusion: Spatiotemporal correlation in videos can be effectively modeled using multiview 2D convolution. Compared with traditional rPPG methods (the plane-orthogonal-to-skin (POS) and chrominance-based (CHROM) methods), the SNR of the pulse signals extracted by the proposed method on the two datasets increases by 52.9% and 42.3%. Compared with an rPPG algorithm based on a single-view 2D convolutional neural network (CNN), the proposed network extracts pulse signals with less noise and fewer low-frequency components, generalizes better, and has a nearly equal computational cost. This study demonstrates the effectiveness of multiview 2D CNNs in rPPG pulse extraction; the proposed network outperforms existing methods in extracting pulse signals of subjects in complex environments.
      Keywords: heart rate measurement; neural network; remote photoplethysmography (rPPG); multiview convolution; spatiotemporal feature
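      As a rough illustration of the multiview idea described above, the sketch below applies one 2D convolution per view (H-W, H-T, W-T) by permuting a clip tensor and fuses the results by summation. It is a minimal, assumption-laden sketch in PyTorch, not the published MVHRNet architecture; the channel counts, fusion by addition, and block layout are hypothetical.

```python
# Minimal sketch of a multiview 2D convolution block (assumptions: PyTorch,
# hypothetical layer sizes; the published MVHRNet is more elaborate).
import torch
import torch.nn as nn

class MultiViewBlock(nn.Module):
    """Apply 2D convolutions over the H-W, H-T, and W-T views of a clip."""
    def __init__(self, channels):
        super().__init__()
        self.conv_hw = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_ht = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_wt = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # H-W view: treat every frame as an image
        hw = self.conv_hw(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        hw = hw.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # H-T view: slice the clip along the width axis
        ht = self.conv_ht(x.permute(0, 4, 1, 3, 2).reshape(b * w, c, h, t))
        ht = ht.reshape(b, w, c, h, t).permute(0, 2, 4, 3, 1)
        # W-T view: slice the clip along the height axis
        wt = self.conv_wt(x.permute(0, 3, 1, 4, 2).reshape(b * h, c, w, t))
        wt = wt.reshape(b, h, c, w, t).permute(0, 2, 4, 1, 3)
        return hw + ht + wt                    # fuse the complementary views

# Usage: a hypothetical 60-frame RGB face clip at 64x64 resolution
clip = torch.randn(2, 3, 60, 64, 64)
out = MultiViewBlock(3)(clip)                  # same shape as the input
```

      Stacking blocks of this kind and regressing a 1D signal from the fused features would correspond to the pathway design sketched in the abstract.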
    • Face detection and tracking algorithm for remote photoplethysmography

      Changchen Zhao, Peiyi Mei, Yuanjing Feng
      Vol. 25, Issue 11, Pages: 2439-2450(2020) DOI: 10.11834/jig.200314
      Face detection and tracking algorithm for remote photoplethysmography
      Abstract: Objective: Remote photoplethysmography (rPPG) is a video-based noncontact heart rate measurement method. It tracks the skin area of the face, extracts the periodic subtle color variations in the video data, and estimates the heart rate from the color signals. It has broad applications in medical healthcare and daily living. Currently, facial landmark-based tracking methods are widely used to track regions of interest (ROIs) because they can quickly and accurately locate face contours. The Dlib library, trained with the cascaded regression tree method, is widely used. In practice, however, it suffers from problems such as irregular jitter of the landmarks during tracking, and existing work does not consider the effect of subject shaking. As a result, the extracted color signal is inaccurate and the heart rate estimate is poor. To overcome these problems, we first use a threshold method to stabilize the landmarks, then rotate the image to correct the shaking face, and finally extract the region of interest and the color signal to estimate the heart rate.
      Method: When Dlib is applied to a frame, it detects the face bounding box, fits a set of average landmark points from the model to the detected face as the first prediction, and updates the landmarks through a cascade of regression trees. In each regression tree, a node decides the splitting direction from the difference in intensity between two pixels and a threshold, and the offset is accumulated until the last layer. When the detected face position differs between frames or the offset produced by a certain tree differs, a deviation appears between the landmarks of the two frames, that is, the landmarks jitter irregularly. Dlib suffers from this landmark jitter, and in some head-lowered scenes the jitter is particularly large, even though the landmark contours detected in the two frames are similar. Nevertheless, Dlib's facial landmark detection is more accurate than most object detection and tracking algorithms. Accordingly, the proposed landmark stabilization method is based on thresholding. First, we use the Euclidean distance and the standard deviation of the landmark displacements as the threshold to determine the current movement state of the subject. A large standard deviation indicates a large difference between the landmarks of the two frames, whereas a small standard deviation indicates a small difference, that is, the subject may be stationary. In this paper, the landmarks of the previous frame are reused when the subject is stationary or the landmark jitter is strong; otherwise, the landmarks are updated normally. Second, to handle motion shake and keep the face upright in the image, a rotation correction mechanism is proposed. It computes the rotation angle from the midpoints of the left- and right-eye landmarks, rotates the image, maps the landmarks onto the rotated image, and finally extracts an ROI so that the ROI remains consistent.
      Result: We evaluate the performance of the tracking method for rPPG pulse extraction using the signal-to-noise ratio (SNR), which reflects the quality of the pulse signal estimated from the color signals extracted from the tracked area. The UBFC-RPPG (Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy) and PURE (Pulse Rate Detection Dataset) datasets are used to test the method. Compared with Dlib, the proposed method improves the SNR by 0.425 dB and decreases the root mean squared error (RMSE) by 0.645 3 bpm, although the mean absolute error (MAE) increases by 0.291 5 bpm, on the UBFC-RPPG dataset. On the PURE dataset, the MAE decreases by 0.065 2 bpm and the RMSE decreases by 0.271 8 bpm, although the SNR decreases by 0.041 1 dB.
      Conclusion: Compared with Dlib, the proposed method effectively improves the stability of the tracking frame and can track the same ROI whether the subject is still or moving. It is a tracking method well suited to rPPG applications.
      Keywords: remote photoplethysmography (rPPG); heart rate measurement; object tracking; facial landmark; rotation correction
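      The stabilization and rotation-correction steps described above reduce to two small computations: a jitter test on inter-frame landmark displacements and an in-plane roll angle estimated from the eye midpoints. The sketch below illustrates both under stated assumptions (NumPy only, a hypothetical jitter threshold); it is not the paper's exact thresholding rule.

```python
# Minimal sketch of landmark stabilisation and rotation correction
# (assumptions: NumPy only; the jitter threshold value is hypothetical).
import numpy as np

def stabilise_landmarks(prev_pts, curr_pts, jitter_thresh=1.5):
    """Reuse the previous frame's landmarks when inter-frame motion looks like jitter.

    prev_pts, curr_pts: (68, 2) arrays of facial landmark coordinates.
    """
    dists = np.linalg.norm(curr_pts - prev_pts, axis=1)   # per-landmark Euclidean distance
    # A small spread of displacements suggests the subject is still and the
    # differences are detector jitter, so keep the previous landmarks.
    if dists.std() < jitter_thresh:
        return prev_pts
    return curr_pts

def roll_angle_from_eyes(left_eye_pts, right_eye_pts):
    """In-plane face rotation (degrees) from the midpoints of the two eye landmark groups."""
    left_c, right_c = left_eye_pts.mean(axis=0), right_eye_pts.mean(axis=0)
    dx, dy = right_c - left_c
    return np.degrees(np.arctan2(dy, dx))   # rotate the frame by -angle to level the eyes
```

      The returned angle would then drive an image rotation, after which the landmarks are mapped onto the rotated frame and the ROI is extracted, as described in the abstract.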
    • Multimodal human-computer interactive technology for emotion regulation

      Kaile Zhang, Tingting Liu, Zhen Liu, Yin Zhuang, Yanjie Chai
      Vol. 25, Issue 11, Pages: 2451-2464(2020) DOI: 10.11834/jig.200251
      Multimodal human-computer interactive technology for emotion regulation
      Abstract: Objective: Emotion is closely related to human social life. People increasingly encounter psychological problems because of population aging and the accelerating pace of social life. If people's negative emotions are not regulated in a timely manner, social harmony and stability may be adversely affected. Face-to-face professional counseling can regulate the emotions of people with psychological problems, but the number of treatment centers and counselors able to carry out psychotherapy is insufficient, and some people are unwilling to confide their psychological problems to others in order to protect their privacy. The development of artificial intelligence, virtual reality, and human-computer interaction (HCI) makes emotion regulation possible through a suitable human-computer emotional interaction system. Because existing human-computer emotional interaction methods operate in a single mode and do not consider machine learning algorithms, a multimodal emotional interaction model is proposed in this study to achieve an improved regulation effect. The proposed model integrates several emotional interaction channels, including text dialogue, somatosensory interaction, and expression recognition, and provides a new strategy for regulating negative emotions.
      Method: The proposed model uses expression recognition and text dialogue to detect user emotions and designs a three-dimensional realistic agent, which can express itself through facial expression, body posture, voice, and text, to interact with users. A traditional support vector machine (SVM) is used to recognize user expressions, and a data-driven method based on production rules is utilized to realize the text dialogue. An emotion dictionary, syntactic analysis, and production rules are combined to analyze the sentiment of input text containing emotional words, and a Seq2Seq model with emotional factors is used for input text without emotional words. In addition, multimodal HCI scenarios, including conversations, birthdays, and interactive games (playing basketball), are used to achieve emotion regulation. Hand gestures and body movements assist the interaction: Felzenszwalb histogram of oriented gradients (FHOG) features are used to extract and recognize gestures, the MediaFlow tracking algorithm is applied to track gestures, and the user's body movement is assessed according to changes in joint positions. As a companion, the agent can improve the user's experience. The collected expression, posture, and text information is used to comprehensively assess the user's emotional state. A reinforcement learning algorithm is then used to regulate emotions further and improve the user's feelings by automatically adjusting the difficulty of the game according to the user's emotional feedback. Accordingly, a prototype multimodal interactive system for emotion regulation is implemented on a computer: an ordinary camera is used for expression and gesture recognition, Kinect is utilized for body motion recognition, and iFLYTEK is applied to convert the user's voice input into text.
      Result: The regulation effects of single-mode and multimode HCI are compared. The results show that the interaction between the agent and the user is limited in single-mode HCI: the agent can neither fully understand why the user has a particular emotion nor take appropriate measures to regulate it, so a user who is not properly regulated may feel disappointed. By contrast, in multimodal interaction the agent can fully understand the user's emotion through multiple channels and provide a reasonable adjustment scheme. The user has additional interactions with the agent and participates in the regulation, and emotions can be regulated through both language and exercise. This comprehensive and natural interaction is effective in achieving enhanced emotion regulation.
      Conclusion: A multimodal HCI method is proposed and a prototype system for emotion regulation is implemented in this study. An agent with autonomous emotion expression is constructed, which can reasonably identify user emotions through expression, text dialogue, and gesture and realize regulation according to this information. Our method can easily be promoted because expensive hardware is unnecessary. The proposed method provides a computable scheme for regulating negative emotions and can be useful for monitoring and regulating the emotions of people living at home and socially isolated in the postepidemic period.
      Keywords: human-computer interaction (HCI); emotion regulation; machine learning; affective computing; multimodality
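      The abstract states that a reinforcement learning algorithm adjusts game difficulty from the user's emotional feedback but does not detail the formulation. The sketch below shows one simple way such feedback-driven adjustment could look, as a bandit-style value update over difficulty levels; the reward mapping, learning rate, and difficulty set are invented for illustration and are not the authors' design.

```python
# Minimal sketch of reward-driven game-difficulty adjustment (assumptions:
# a bandit-style tabular update with a made-up emotion-to-reward mapping;
# the paper does not specify its reinforcement-learning formulation).
import random

DIFFICULTIES = ["easy", "medium", "hard"]
EMOTION_REWARD = {"happy": 1.0, "neutral": 0.0, "frustrated": -1.0}  # hypothetical mapping

q_table = {d: 0.0 for d in DIFFICULTIES}   # estimated value of each difficulty level

def choose_difficulty(epsilon=0.2):
    """Epsilon-greedy choice over the difficulty levels."""
    if random.random() < epsilon:
        return random.choice(DIFFICULTIES)
    return max(q_table, key=q_table.get)

def update(difficulty, detected_emotion, lr=0.1):
    """Move the value of the chosen difficulty toward the observed emotional reward."""
    reward = EMOTION_REWARD.get(detected_emotion, 0.0)
    q_table[difficulty] += lr * (reward - q_table[difficulty])

# One interaction round: pick a difficulty, observe the user's emotion, update the estimate.
level = choose_difficulty()
update(level, detected_emotion="happy")
```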