
Ranking
- Current Issue
- All Issues
- 1301
- 2253
- 3243
- 4221
- 5
Object detection techniques based on deep learning for aeria...
174 - 6
Lightweight object detection model in remote sensing image b...
171
- 1135833
- 222917
- 321477
- 419656
- 519513
- 617553
About the Journal
Journal of image and Graphics(JIG) is a peer-reviewed monthly periodical, JIG is an open forum and platform which aims to present all key aspects, theoretical and practical, of a broad interest in computer engineering, technology and science in China since 1996. Its main areas include, but are not limited to, state-of-the-art techniques and high-level research in the areas of image analysis and recognition, image interpretation and computer visualization, computer graphics, virtual reality, system simulation, animation, and other hot topics to meet different application requirements in the fields of urban planning, public security, network communication, national defense, aerospace, environmental change, medical diagnostics, remote sensing, surveying and mapping, and others.
- Current Issue
- Online First
Cover&Content
intelligent methods for object detection under complex scene
- Recent advances in drone-view object detection Leng Jiaxu, Mo Mengjingcheng, Zhou Yinghua, Ye Yongming, Gao Chenqiang, Gao Xinbodoi:10.11834/jig.220836
20-09-2023
221
222
Abstract:Given the support of artificial intelligence technology, drones have initially acquired intelligent sensing capabilities and have demonstrated efficient and flexible data collection in practical applications.Drone-view object detection, which aims to locate specific objects in aerial images, plays an irreplaceable role in many fields and has important research significance.For example, drones with highly mobile and flexible deployment have remarkable advantages in accident handling, order management, traffic guidance, and flow detection, making them irreplaceable in traffic monitoring.As for disaster emergency rescue, drones with aerial vision and high mobility can achieve efficient search and safe rescue in large areas, locate people quickly and accurately in distress, and help rescuers control the situation, thereby ensuring the safety of people in distress.This study provides a comprehensive summary of the challenges in object detection based on the unmanned aerial vehicle(UAV)perspective to portray further the development of drone-view object detection.The existing algorithms and related datasets are also introduced.First, this study briefly introduces the concept of object detection in drone view and summarizes the five imbalance challenges in object detection in drone view, such as scale imbalance, spatial imbalance, class imbalance, semantic imbalance, and objective imbalance.This study analyzes and summarizes the challenges of drone-view object detection based on the aforementioned imbalances by using quantitative data analysis and visual qualitative analysis.1)Object scale imbalance is the most focused challenge in current research.It comes from the unique aerial view of drones.The changes in the drone's height and angle bring drastic changes to the object scale in the acquired images.The distance of the lens from the photographed object under the drone view is often far.This scenario results in numerous small objects in the image and makes capturing useful features for object detection difficult for the existing detectors.2)Different regions of drone-view images have great differences, and most objects are concentrated in the minor area of images, i.e., the spatial distribution of objects is enormously uneven.On the one hand, the clustering of dense objects in small areas generates occlusion.The detection model needs to devote considerable attention to this occlusion to distinguish different objects effectively.On the other hand, treating equally different areas wastes many computational resources in vanilla areas, limiting the improvement of object detection performance.3)The problem of class imbalance in the drone view is divided into two categories.One is the positive-negative sample imbalance problem caused by the gap between the front and rear views shared in the image.The other is the imbalanced numbers of different categories caused by the number of samples in the real world.4)The semantic pieces of information defined by different category labels in the drone-view object detection dataset are often similar, resulting in only subtle differences between different categories.However, significantly different representations of objects exist in the same category, which together form the semantic imbalance problem.5)Drone-view object detection often faces the problem of unbalanced optimization targets, i.e., the contradiction between the high computational demand for high-resolution images and the limited computing power of low-power chips is difficult to balance.These 
unbalanced problems bring enormous challenges to object detection from the UAV viewpoint.However, even the most advanced object detection algorithms currently available can hardly achieve an average accuracy rate of 40% on aerial images, which is far below the performance of general object detection tasks.Therefore, many scholars have conducted many studies.These research methods can be summarized as optimization ideas to solve these imbalance problems.In this study, we collect relevant research works, which are sorted and analyzed according to the countries of authors, institutions, published journals or conferences, years, the category of methods, and the solved problem.The present study presents the challenging problems solved by previous research and the development trends of existing methods.This study also focuses on the methods of improving drone-view object detection performance in terms of data augmentation, multiscale feature fusion, region searching strategies, multitask learning, and lightweight model.The advantages and disadvantages of these methods for different problems are systematically summarized and analyzed.Besides introducing existing methods, the present study compiles and introduces the applications of drone-view object detection in practical scenarios, such as traffic monitoring, power inspection, crop analysis, and disaster rescue.These applications further emphasize the significance of object detection in drone view.Then, this study collects and organizes UAV datasets suitable for object detection tasks.These datasets are present from various perspectives, such as year, published journals or conferences, annotation information, and number of citations.In particular, the present study provides the performance evaluation of the existing algorithms on two commonly used public datasets.The presentation of these performance data is expected to help researchers understand the current state of development of drone-view object detection and promote further development in this field.Finally, this study provides an outlook on the future direction of drone-view object detection by considering the aforementioned imbalance problems.The promising research includes the following:1)data augmentation:providing the network with enough high-quality learning samples by considering the specific characteristics of drone-view images based on the conventional data augmentation strategy is a good idea;2)multiscale representation:how to avoid the interference of background noise in feature fusion and effectively extract information at different scales using an efficient fusion strategy is an urgent problem to be solved;3)visual inference:using information unique to the viewpoint of drones, mining contextual information from images to facilitate image recognition, and using easy-to-detect objects to improve the performance of difficult-to-detect objects are directions worthy of deep consideration.
- Survey of small object detection Pan Xiaoying, Jia Ningxing, Mu Yuanzhen, Gao Xuanrongdoi:10.11834/jig.220455
20-09-2023
243
175
Abstract:In recent years, object detection has attracted increasing attention because of the rapid development of computer vision and artificial intelligence technology.Early traditional object detection methods, such as histogram of oriented gradient(HOG)and deformable parts model(DPM)usually adopt three steps:region selection, manual feature extraction, and classification regression.However, manual feature extraction has great limitations for small object detection.The object detection algorithm based on the convolutional neural network can be divided into two-stage and one-stage detection algorithms.Two-stage detection algorithms, such as faster region with convolutional neural network(Faster RCNN)and cascade region with convolutional neural network(Cascade RCNN), select candidate regions through the region proposal network.Then, they classify and regress these regions to obtain the detection results.However, the problem of low accuracy still exists in small object detection.One-stage detection algorithms, such as single shot MultiBox detector(SSD)and you only look once(YOLO), can directly locate the object and output the category detection information of the object, thereby improving the speed of object detection to a certain extent.However, small object detection has always been a huge challenge in the field of object detection because of the small proportion of small object pixels, little semantic information, and small objects that are easily disturbed by complex scenes.In particular, the challenges in object detection are as follows:First, the characteristics of small objects are few.Given the small scale of small objects and the small coverage area in data images, extracting favorable semantic feature information in network training is difficult.Second, small object detection is susceptible to interference.Most of the small objects have low resolution, blurred images, and little visual information.Thus, they are easily disturbed during difficult feature extraction.Thus, the detection model cannot easily locate and identify small objects accurately.Moreover, many false detections and missed detections exist.Third, a shortage of small object datasets exists.At present, most of the mainstream object datasets, such as PASCAL VOC and MS-COCO, are aimed at normal-scale objects.In particular, the proportion of small-scale objects is insufficient, and the distribution is uneven.However, some datasets mentioned in this study that can be used for small object detection are all aimed at specific scenes or tasks.These datasets include DOTA remote sensing object detection dataset, face detection dataset and benchmark, which are not universal for small object detection.Fourth, small objects are easy to gather and block.A serious occlusion problem occurs when small objects gather.After many downsampling and pooling operations, quite a lot of feature information is lost, resulting in some detection difficulties.At present, visual small object detection is increasingly important in all fields of life.Aiming at the problems in small object detection, this study combs the research status and achievements of small object detection at home and abroad to promote the development of small object detection further, improve the speed and accuracy of small object detection, and optimize its algorithm model.The methods of small object detection are analyzed and summarized from the aspects of data enhancement, super resolution, multiscale feature fusion, contextual semantic information, anchor frame 
mechanism, attention, and specific detection scenarios.Data enhancement is the method proposed for solving the problems of a few general small object datasets, a small number of small objects in public datasets, and uneven distribution of small objects in images.The earliest data enhancement strategy is to increase the number of object training and improve the performance of object detection by deforming, rotating, scaling, cutting, and translating object instances.Then, other effective data augmentation methods emerged, which included oversampling the images containing small objects in the experiment, scaling and rotating the small objects, and copying the objects to any new position in order to augment the data.Data enhancement helps improve the robustness of a model to a certain extent.Moreover, it solves the problems of unobvious visual features of small objects and less object information.It also achieves good results in the final detection performance.However, the improper design of data enhancement strategy in practical applications may lead to new noise, impairing the performance of feature extraction.This scenario also brings some challenges to the design of the algorithm.The small object detection method based on multiscale fusion needs to make full use of the detailed information in the image because the characteristic information of small-scale objects is little.In the existing convolutional neural network(CNN)model of general object detection, multiscale detection can help the model to obtain accurate positioning information and discriminating feature information by using a low-level feature layer.This scenario is conducive to the detection and recognition of small-scale objects.First, a feature pyramid network(FPN)with strong semantic features at all scales is introduced.Then, an fpn-based path aggregation network(PANet), which not only achieved good results in case segmentation but also improved the detection of small objects.In feature fusion, the residual feature enhancement method extracts the context information with a constant ratio to reduce the information loss of the highest pyramid feature map.At present, many methods are based on multiscale feature fusion, which uses the low-level highresolution and high-level strong feature semantic information of the network to improve the accuracy of small objects.In small object detection, the target's feature expression ability is weak.Thus, the network structure must be deepened to learn considerable feature information.Introducing an attention mechanism can often make the network model pay considerable attention to the channels and areas related to the task.In the object detection network, the shallow feature map lacks the contextual semantic information of small objects.By incorporating attention mechanisms into the SSD model, irrelevant information in feature fusion is suppressed, leading to an improvement in the detection accuracy of small objects.In general, the attention mechanism can reasonably allocate the used resources, quickly find the region of interest, and ignore disturbing information.However, the improper design in use increases the cost of network calculation and affects the extraction of object features by the model.Finally, the future research direction of small object detection is prospected.Visual small object detection is becoming increasingly important in all fields of life, and it will develop in other directions in the future.
- Object detection techniques based on deep learning for aerial remote sensing images:a survey Shi Zhenghao, Wu Chenwei, Li Chengjian, You Zhenzhen, Wang Quan, Ma Chengchengdoi:10.11834/jig.221085
20-09-2023
174
132
Abstract:Given the successful development of aerospace technology, high-resolution remote-sensing images have been used in daily research.The earlier low-resolution images limit researchers'interpretation of image information.In comparison, today's high-resolution remote sensing images contain rich geographic and entity detail features.They are also rich in spatial structure and semantic information.Thus, they can greatly promote the development of research in this field.Aerial remote sensing image object detection aims to provide the category and location of the target of interest in aerial remote sensing images and present evidence for further information interpretation reasoning.This technology is crucial for aerial remote sensing image interpretation and has important applications in intelligence reconnaissance, target surveillance, and disaster rescue.The early remote sensing image object detection task mainly relies on manual interpretation.The interpretation results are greatly affected by subjective factors, such as the experience and energy of the interpreters.Moreover, the timeliness is low.Various remote sensing image object detection methods based on machine learning technology have been proposed with the progress and development of machine learning technology.Traditional machine learning-based object detection techniques generally use manually designed models to extract feature information, such as feature spectrum, gray value, texture, and shape of remote sensing images, after generating sliding windows.Then, they feed the extracted feature information into classifiers, such as support vector machine(SVM)and adaptive boosting(AdaBoost), to achieve object detection in remote sensing images.These methods design the corresponding feature extraction models for specific targets with strong interpretability but weak feature expression capability, poor generalization, time-consuming computation, and low accuracy.These features make meeting the needs of accurate and efficient object detection tasks challenging in complex and variable application scenarios.In recent years, the research on the application of deep learning in remote sensing image processing has received considerable attention and become a hotspot because of the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial neural networks, in the fields of natural image object detection, classification, and recognition, and the excellent performance in the task of large-scale natural scene image object detection.Thus, many excellent works have emerged.Object detection in aerial remote sensing images mainly faces challenges, such as large-size and high-resolution images, interference from complex backgrounds, target direction diversity, dense targets, dramatic scale changes, and small targets.At present, these challenges have corresponding model improvement methods.For large-scale natural scene image object detection, high-resolution aerial remote sensing images are used because the target scale in the image is widely distributed.This approach ensures the integrity of small target detail information.Thus, the most commonly used detection and recognition method involves segmenting the image during data preprocessing;that is, the large image is segmented into regular image sizes and sent to the object detection algorithm for detection and recognition in turn.In the subsequent processing, all the detection results are finally stitched together and reset to complete the 
detection of the whole image.Moreover, the aerial remote sensing image with the ultrahigh resolution has a complex background.The target to be detected is easily interfered with by various similar objects, and the similar targets to be detected present different characteristics.Thus, false detection quickly occurs during detection.Therefore, the usual methods for solving complex background interference can be divided into two types:extracting the contextual information in the image and improving the attention mechanism.The targets to be detected in the images for the complex multidirectional and multitarget situations are multidirectional because the aerial remote sensing images are all top-down images.Moreover, the aspect ratio range of the targets to be detected is more diverse than that of the targets in the natural images.Thus, the interference between the targets is serious, thereby affecting the accuracy of the final target localization and classification.At present, three practical improvement ideas are available for the problems of directional diversity and dense arrangement distribution of targets to be detected:image rotation enhancement, design of rotation invariant module, and design of an accurate position regression method.The designed model needs to have good scale invariance, i.e., the model has high recognition ability even under the drastic changes of multiple scales of multiple targets, to meet the challenge of drastic changes in the target scales in aerial remote sensing images.Thus, the common improvement scheme is the multiscale feature fusion.For the small target detection in aerial remote sensing images, the current algorithms are mainly improved from feature enhancement, multilevel feature map detection, and the design of precise positioning strategies.In summary, the challenges and difficulties of object detection in aerial remote sensing imagery do not exist independently.For example, the large size and high resolution of aerial remote sensing images inevitably lead to a complex background in the images and a sharp increase in the category and number of small targets to be detected.Moreover, most of the small targets are susceptible to strong interference from the complex background.This phenomenon results in localization and classification recognition accuracy.In addition, the improvements for one challenge also apply to other difficulties, e.g., the improvements for multiscale target feature enhancement benefit almost all challenges.Therefore, the problems in the field must be analyzed and improved from a global perspective.Based on the full study of the latest reviews and related research works, this study systematically compares and summarizes deep learning object detection algorithms for aerial remote sensing images, particularly the research methods at home and abroad in the past three years, to provide appropriate object detection research for aerial remote sensing images and help scholars comprehensively understand and grasp the latest progress in aerial remote sensing image object detection research based on deep learning.First, the present study introduces the deep-learning-based image object detection model.Then, it systematically composes the deep-learning-based aerial remote sensing image detection methods, introduces the publicly available datasets for aerial remote sensing image object detection, and compares the performances of typical methods through experiments.Finally, the problems in the current research of aerial remote sensing image object 
detection are presented, and future research and development trends are prospected.
- Image-level labeled weakly supervised object detection:a survey Chen Zhenyuan, Wang Zhendong, Gong Chendoi:10.11834/jig.220854
20-09-2023
136
129
Abstract:Object detection is a fundamental problem in computer vision and image processing.From the perspective of supervision, it can be divided into fully-supervised, semi-supervised, and weakly-supervised.In recent years, object detection has played an important role in various areas and shown great application value.Precise object detection depends on the accurate region or instance-level image labeling during detector training.However, the complexity of the background and the diversity of objects in real scenes make accurate image labeling extremely time-consuming and laborious.In particular, traditional fully supervised object detection algorithms need to mark the position and category of each object in the image manually with a minimum rectangular box.Thus, the cost of acquiring a training label is increased.By contrast, weakly-supervised object detection(WSOD)algorithms only require the category labels of the whole image for training.Thus, a large number of training samples can be easily obtained by searching the category labels on some image websites.WSOD has received increasing attention and achieved encouraging progress because of its ability to reduce the labor cost of labeling remarkably.Therefore, researchers focus on WSOD algorithms based on image-level coarse labeling.These algorithms slightly depend on supervised information.Compared with other supervised object detection tasks, WSOD aims to localize and classify objects in an image by using only image-level category annotations.The present study starts with the research significance of WSOD.First, the definition, basic framework, and main challenges of WSOD are introduced:1) WSOD is performed in the training and test phases with standard detectors.The whole problem of WSOD can be understood as learning a mapping relationship from several candidate boxes contained in an image to image category markers.2)The problem setup of WSOD is consistent with that of multi-example learning in weakly supervised learning.Thus, WSOD can be treated as a learning problem by taking each candidate box and the image containing all the candidate boxes as an example and a "package" itself, respectively.For each category, if the image contains at least one target object of this category, the image is a positive packet;otherwise, it is a negative packet.Therefore, detector parameters can be learned based on candidate boxes in images.If an image is predicted to be a positive packet of a certain class, then the image contains the target of this class.Thus, the target can be identified using a rectangular candidate box.3)WSOD faces three major problems:local dominance problem, instance ambiguity problem, and conspicuous memory consumption problem.Afterward, advanced WSOD algorithms are classified into three categories according to the network architectures:optimization-candidate-box-generation-based algorithms, segmentation-based algorithms, and self-training-based algorithms.Among them, the core of the optimized-candidate-box-generation-based algorithms is the improved candidate box generator in the basic framework.The core of segmentation-based and self-training-based algorithms is the improved detector in the basic framework.The difference is that the former algorithms aim to add a segmentation branch and guide detection through segmentation, whereas the latter algorithms aim to optimize the detection network.Furthermore, the detection results of various WSOD algorithms are compared under several evaluation metrics through extensive 
experiments.This study selects and compares the current mainstream WSOD algorithms on PASCAL visual object class 2017(VOC2007)and VOC2012 datasets.All algorithms use the Visual Geometry Group(VGG)network 16 pretrained on the ImageNet LargeScale Visual Recognition Challenge(ILSVRC)dataset as the backbone for feature extraction to ensure the fairness of comparison.Moreover, only the performance of the model itself is evaluated without considering the effect of fully supervised models, such as Fast R-CNN.In the mean average precision (mAP) comparison on the VOC2007 dataset, multiple instance self-training(MIST)is considered the best, with the single model obtaining 54.9% mAP.The mAP of the existing advanced WSOD algorithms is between 50% and 60%.Compared with the mAP of the online instance classifier refinement(OICR)algorithm, which is often used as the baseline method, the mAP of MIST is improved by less than 15%.This finding indicates that this field still has a large room for improvement.The comparison of mAP and correct localization (CorLoc)on the VOC2012 dataset indicates that negative deterministic information weakly supervised object detection (NDI-WSOD)achieves good performance, reaching 53.9%, which is 16% higher than the OICR performance.The best algorithm for the CorLoc is pyramidal multiple instance detection network(P-MIDN), and its performance reaches 73.3%.This value is 11.2% higher than that reached by OICR.In addition, various algorithms are adopted for comparison on Microsoft common objects in context(MS COCO)datasets.The algorithm with the highest ValAP50 is still P-MIDN, which achieves 27.4%.MIST combines optimized pseudo notation generation, regularization technique, and bounding box regression in the self-training process.Thus, it can continue to be superior to its competitors on different datasets.The research of the WSOD algorithm based on image-level labeling has made a great breakthrough because of the vigorous development of deep learning.However, WSOD still faces many challenges, and a certain gap between it and fully supervised object detection exists.Finally, some valuable future research directions in this field are discussed:1)generating a few candidate boxes with high quality, 2)designing a reasonable and efficient cooperative framework for detection and segmentation, 3)designing a reasonable strategy or digging out many improved positive samples through the network itself, and 4) designing lightweight network models that can be applied to mobile terminals.
- Survey on Transformer for image classification Shi Zhenghao, Li Chengjian, Zhou Liang, Zhang Zhijun, Wu Chenwei, You Zhenzhen, Ren Wenqidoi:10.11834/jig.220799
20-09-2023
169
123
Abstract:Image classification is an important research direction in the field of image processing and computer vision.It aims to identify the specific category of the object in the image and has important practical application value.However, the classification effect of the existing methods is always unsatisfactory because of the diversity of the shape and type of image objects and the complexity of the imaging environment.Moreover, the existing problems, such as low classification accuracy and high false positives, seriously affect the application of image classification in the subsequent image and computer vision-related tasks.Therefore, improving image classification accuracy through postprocessing algorithms is highly desirable.Given the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial neural networks, in the field of natural image object detection, the research on the application of deep learning techniques in image classification has received great attention and become a research hotspot in the field of image processing and computer vision in recent years.Moreover, many excellent works have been born.As a rising star, visual Transformer(ViT)gains an increasing interest in image processing tasks, particularly because of its strong ability of remote modeling and parallel sequence processing.Several technical review articles on the Transformer have been recently published.Moreover, ViT and its variants have been systematically summarized from different angles, and the application of the Transformer in different visual tasks has been introduced.This scenario provides appropriate help for people studying and tracking the research progress of image classification technology.Compared with traditional convolutional neural network (CNN), ViT achieves global modeling and parallel processing of the image by dividing the input image into patches.Thus, the image classification ability of the model is greatly improved.However, many problems, such as poor scalability, high computational overhead, slow convergence, and attention collapse, still exist because of the complexity of image classification problems and the diversity of the development of ViT technology.These problems can be solved using the ViT variants in image processing tasks.Moreover, the reviews that can help scholars comprehensively understand and grasp the latest progress of ViT for image processing tasks from a global perspective are very few.Therefore, the present study systematically compares and summarizes the ViT algorithms for image classification based on the full study of the latest reviews and related research to help scholars understand and grasp the latest progress of image classification research based on ViT.Unlike the existing review papers, our work is particularly focused on the research methods at home and abroad in the past 2 years(between January 2021 and December 31, 2022).We begin by describing the basic concept, principle, and structure of the traditional Transformer model for easy understanding.First, we introduce the attention mechanism and multihead attention mechanism.Then, the feed-forward neural network and position coding are described.Finally, the model structure of the traditional Transformer is presented.Afterward, the evolution of the Transformer model and its applications in image processing in recent years are figured.Then, the concept, principle, and structure of ViT are briefly introduced.Various vision Transformer models and 
applications in image classification are described in detail according to the problems faced by ViT.Different solutions, including scalable location coding, low complexity, low computing cost, local and global information fusion, and deep ViT model, are described one by one.Experiments on ImageNet, Canadian Institute for Advanced Research(CIFAR-10), and CIFAR-100 are provided, and many evaluations are presented to demonstrate the classification performance of the ViT and its variants for image classification.Two indicators are adopted, namely, accuracy and parameter quantity, to evaluate experimental results.Floating point operation(FLOPs)per second is also used to analyze the performance of the model comprehensively.Given that the Transformer has also been widely used in remote sensing image classification in recent years, the present study compares and analyzes the remote sensing image classification methods based on the Transformer.The experiments are performed on the hyperspectral image datasets of Indian Pines, Trento, and Salinas to evaluate the Transformer for the remote sensing image classification.Three indicators, namely, overall accuracy(OA), average accuracy(AA), and Kappa coefficient, are employed in this work.Finally, the problems and challenges faced by the current application of ViT in image classification are presented.Future research and development trends are also prospected.
- NSPDet:real-time nearby-aware pedestrian detection algorithm for multi-scene surveillance at night Gong An, Li Zhonghao, Liang Chenhongdoi:10.11834/jig.220834
20-09-2023
136
90
Abstract:Objective Pedestrian detection is a widely concerned topic in computer vision tasks.It is also a basic and critical technology in automatic driving assistance systems, visual surveillance, and behavior recognition.In the traffic environment, pedestrians and cyclists belong to the "vulnerable groups on the road".The World Health Organization(WHO)statistics show that approximately half of all fatalities in road accidents involve pedestrians.Unlike conventional detection objects(e.g., automobiles)with relatively stable structural characteristics, different limb activities of pedestrians exhibit the nonrigid characteristic of structural instability, thereby complicating pedestrian detection.Moreover, the night scene is difficult to navigate.However, insufficient domestic and international research on night pedestrian detection is currently lacking.Given insufficient illumination and local overexposure, pedestrian recognition algorithms are vulnerable to accuracy restrictions, leading to missing and incorrect detections.Therefore, nighttime pedestrian detection technology has important research and social value for ensuring pedestrian safety.Method The monitoring conditions at night are constrained by uneven and insufficient lighting.Thus, the acquired photos have inadequate exposure, which reduces the effectiveness of pedestrian detection.The present study suggests adding a low-light enhancement module(Zero-DCE)to the detector to boost the model's nighttime detection performance and address the issue.We feed the regression loss of the detector and the detection location information to the low-light enhancement module for the joint training of the low-light image enhancement and pedestrian detection tasks to make the low-light image enhancement act as a positive gain for the pedestrian detection task.This approach maintains the regional continuity of pedestrian features in the image and avoids the degradation of detection accuracy caused by the pixel-level low-light enhancement operation that destroys the features in the pedestrian region.Pedestrian detection has a long history.In recent years, pedestrian detection strategies using histograms of oriented gradients(HOG)to model human features with a support vector machine(SVM)as a feature classifier have been widely studied.However, the traditional pedestrian detection methods are based on feature engineering.Moreover, the hand-crafted features have low accuracy and are not generalizable.In recent years, deep learning algorithms have started to be used for pedestrian detection tasks.The convolutional neural network(CNN)can extract high-level features and gradually becomes the mainstream pedestrian detection method.On the basis of whether the detection algorithm is based on region proposal, deep learning-based pedestrian detection algorithms can be broadly divided into two-stage and one-stage methods.Two-stage methods first use sliding windows to find preselected regions in the image.Then, the regions and the representative are classified and regressed.The representative methods are R-CNN and Faster R-CNN.The detection algorithm based on the region proposal can capture rich features.Thus, the detection accuracy is high.However, problems, such as redundancy of preselected regions and slow inference speed, exist.One-stage methods do not base on region proposal.However, they directly regress the target's position in the image, thereby simplifying the detection process and accelerating inference speed.The representative methods are single 
shot multibox detector(SSD), you only look once v3(YOLOv3), and YOLOX, proposed by MEGVII.In this study, the one-stage method YOLOX is finally selected as the baseline model for the consideration of detection accuracy and inference speed.The targeted optimization is performed for night scenes on the baseline.Additionally, a significant issue with pedestrian detection is the missing and incorrect detection brought on by interclass occlusion and dense crowds.The original non-maximum suppression(NMS)algorithm is susceptible to falsely deleting the detection box when numerous pedestrians are present and their distribution is concentrated.This scenario leads to pedestrian missing detection.Aiming at this problem, the present study reconsiders the NMS strategy in the model reasoning stage and introduces a nonmaximum suppression algorithm(nearby object hallucinatory(NOH))that adds the distribution information of nearby pedestrian targets.We eliminate the dependence of NOH on region proposals, allowing it to be ported to the one-stage target detection algorithm.The bounding box features predicted by YOLOX are pooled into the same feature space.Then, we use a simple full connection module to build the location distribution and density information of nearby pedestrians required by NOH.The improved NOH module is combined with the original YOLOHead as Pedestrian-Head to obtain the final pedestrian detection information.We determine through experiments that adding such a full connection module effectively reduces the missing detection problem caused by occlusion, and the reasoning speed is slightly improved.However, full connection modules inevitably bring redundant parameters to the network.Therefore, this study further investigates the reduction of model volume.Deep separable convolution is also used in the lightweight model to maintain the accuracy of model detection and reduce the computational power required for reasoning.The floating-point computation of the lightweight model is reduced to 22.4 GFLOPs.In theory, our algorithm can meet the needs of real-time reasoning of mobile devices.Result We divided the ablation experiments into three groups for verification on the NightSurveillance dataset.Compared with the baseline model(YOLOX), NSPDet increased the average precision(AP)and the average recall(AR)indices by 10.1 and 7.2, respectively.In addition, the parameters of the lightweight NSPDet model are reduced by 16.4 M.The AP attenuation and AR attenuation are 7.6 and 6.2, respectively.However, the lightweight NSPDet model is still better than the baseline model.The comparison experiments of other methods on Caltech, CityPersons, and NightOwls datasets show that the night pedestrian detection algorithm proposed in this study has a low average false detection rate.Conclusion The NSPDet algorithm proposed in this study improves the accuracy of the baseline model for pedestrian detection at night.The proposed algorithm also has the performance of real-time reasoning.This study optimizes the accuracy of the baseline model for pedestrian detection in various complex nighttime scenes, including low light, strong light interference, image blur, occlusion, and rainy weather.It has an important application value for promoting research in autonomous driving and intelligent transportation.
- Lightweight object detection model in remote sensing image by combining rotation box and attention mechanism Li Zhaohui, An Jintang, Jia Hongyu, Fang Yandoi:10.11834/jig.220839
20-09-2023
171
134
Abstract:Objective Remote sensing image object detection plays an important role in military security, maritime traffic supervision, intelligent monitoring, and other fields.Remote sensing images are different from natural images.Most remote sensing images are taken at altitudes ranging from several kilometers to tens of thousands of meters.Therefore, the scale of target objects in remote sensing images is large.Most of the target objects are small, such as small vehicles.The other target objects are huge, such as ships.The angles of the objects in the remote sensing images are distributed arbitrarily because of the shooting angle.Therefore, this scenario is a huge challenge for the feature extraction network in remote sensing image target detection, particularly in complex backgrounds.Given the continuous improvement in the computing power of hardware devices and the rapid development of deep learning theory, large and ultralarge object detection networks have been continuously proposed in recent years to improve detection accuracy.Although these detection networks have strong representation learning capabilities, they ignore the cost-effectiveness gained from the relationship of detection accuracy with model calculation amount and the number of parameters.Moreover, real-time detection requirements are difficult to achieve, and the number of parameters and amount of calculation are very limited in model deployment.In addition, most of the general target detection models are designed for natural field datasets.The detection effect in remote sensing image target detection is unsatisfactory, particularly for densely arranged objects.The traditional horizontal box object detection cannot achieve precise detection, such as ships in port and cars in parking lots.Aiming at the above problems, a lightweight rotating box remote sensing image object detection model (YOLO-RMV4)is designed.Method In the experiment, the open-source datasets DOTA2.0, FAIR1M, and HRSC2016 are used as the basic datasets.Moreover, four common vehicles, including a ship, a plane, a small vehicle, and a large vehicle, are selected as objects.A aerial images of vehicle ship and plane(AVSP)dataset is prepared after preprocesses, such as filtering, segmentation, conversion, and relabeling, are performed.This dataset contains 19 406 images of 1 024×1 024 and 637 466 object instances.The AVSP data labels are divided into HBB and OBB(HBB is the horizontal box annotation, and OBB is the rotating box annotation), where OBB is represented by eight parameters.YOLO-RMV4 is improved based on the MobileNetv3 network.Adding an efficient channel attention(ECA)mechanism module with excellent performance in the feature extraction network, appropriately expanding the network scale, adding the SPPF module after the feature extraction network, and adding the path aggregation network(PANet)result in multiscale fusion of the extracted features of the backbone network, thereby providing the network with rich and reliable target features.In the network detection head, multiscale detection technology is used to deal with target objects of different sizes.More than half of the objects in the dataset are small targets.Thus, the detection after four times of downsampling is added, resulting in 4, 8, 16, and 32 times of downsampling.Moreover, the small target loss is given a high weight.The smooth circular label is added to the angle prediction in the detection head, which converts the angle regression problem into a classification problem.Thus, the 
distance between the predicted angle and the real angle can be measured, and the angle periodicity problem is solved.This scenario results in a precise bounding box positioning.Moreover, the anchor size is designed according to the characteristics of the dataset.We use random cropping, flipping, mosaic technique, and other data augmentation approaches in the training.Result In this study, we conduct comparative experiments, and ablation experiments are carried out on the AVSP dataset.We also conduct comparative experiments on seven mainstream lightweight network models to verify the effectiveness of the model.we used average recall(AR), mean average precision(mAP), parameter count, and detection speed(frames per second, FPS)as evaluation metrics.Each model's parameters, such as mAP, AR, and FPS, are also compared.The size of YOLO-RMV4(5.3 M)is only 1/8 of that of RYOLOv5l(45.3 M).Compared with the mAP and AR of RYOLOv5l, those of YOLO-RMV4 are increased by 1.2% and 1.6%, respectively.Moreover, the mAP and AR of YOLO-RMV4 are much higher than those of other lightweight network models(EfficientNet and ShuffleNet).We also compress and prune YOLO-RMV4 to obtain YOLO-RMV4S, whose size is only 4.5 M.YOLO-RMV4S is also better than common lightweight network models in terms of detection precision and recall.Ablation experiments were also conducted on each improved module to verify the improvement degree of model performance by different modules.The mAP increases by 8.4% after the addition of PANet.PANet fuses the features of different layers.This phenomenon largely makes up for the defect of the limited feature extraction capability of the lightweight network.After the rotation detection head is added, the mAP increases by 16.8%, greatly increasing the detection performance of the model.After the ECA module is added, the mAP increases by 1.6%.The ECA module can accurately stimulate the backbone feature extraction network to utilize the limited capacity and the limited amount of parameters and learn the feature information of the target object.After the addition of four times of downsampling, the mAP increases by 3.0%.The addition of four times of downsampling greatly enhances the performance of small target objects.One of the modules is also eliminated based on YOLO-RMV4.The performance degradation degree of the model is compared to reflect the unique role of each module in the model.Finally, the detection accuracy of each category is analyzed.The mAP and AR of the plane are the highest.Those of the ship and the large vehicle are the second, whereas those of the small vehicle are the lowest.Conclusion YOLO-RMV4 is supplemented by multiscale fusion and rotating box detection under the lightweight network structure.Thus, the model can achieve real-time inference and high-precision detection under extremely limited parameters, thereby making it very cost-effective.
- Open-set object detection based on annular prototype space optimization Sun Xuhao, Shen Yang, Wei Xiushen, An Pengdoi:10.11834/jig.220992
20-09-2023
122
102
Abstract:Objective In the close-set setup, object detection identifies objects in a set of images or data in other modalities that belong to the same class in both the training and test phases.Under this setting, modern object detectors have achieved impressive progress.However, the images to be detected in practical tasks usually contain objects of unknown categories.For example, specifying that some fish that meet the size requirements can be caught whereas others that do not meet the requirements are prohibited is common in offshore fishing.Object detectors usually produce two types of errors:The first involves classifying the objects of interest as another object or background, i.e., identifying a known class as a background class or an unknown class.The second occurs when a background sample or an unknown object is mistaken as one of the classes of interest, i.e., identifying a background region or an unknown object region as a known class.Most of the previous detection methods under closed-set conditions can identify unknown and background classes in the open-set setup to some extent after unknown class thresholds are added for screening.However, adjusting these thresholds in real scenarios is challenging for us.Therefore, this study explores the open-set object detection(OSOD)task to improve the robustness of the model in real-world detection tasks.In the open-set environment, the model needs to distinguish not only the known objects contained in the training data but also other objects not contained in the training set.Moreover, the model must delineate the background classes that are neither known nor unknown objects.Method The existing approaches within the OSOD domain typically group background classes and unknown classes into feature sparse classes and classify them as one class.This approach leaves the task of dividing the background class from the unknown class entirely to the final classifier.It is contrary to the original intention of the region proposal networks(RPN)layer to filter the inclusion of object candidate regions.Therefore, we propose a new OSOD framework.On the one hand, we improve the design of the classifier therein through an annular prototype space.Thus, the classifier can focus on identifying known and unknown classes.In particular, the detector can layer known classes, unknown classes, and background classes.Thus, known classes become dense in the high-dimensional space through prototype learning optimization, whereas background classes become sparse in the high-dimensional space.This scenario helps improve the detection performance.On the other hand, we filter out the background classes by randomly masking the existing proposal regions, thereby improving the robustness of the RPN layer while retaining the advantage of proposing object candidates with the RPN layer.Moreover, the need for the additional step of background class sampling is eliminated.In particular, the feature vectors generated for the regions belonging to the unknown category change considerably after a small random mask sampling.However, the feature vectors generated for the regions belonging to the background category do not change considerably after a small random mask sampling.Thus, the module corrects the regions identified as unknown categories.Result The proposed method is experimented with on the OSOD benchmark, which consists of PASCAL Visual Object Classes(PASCAL VOC)and Microsoft common objects in context(MS COCO).The train-val set of VOC is used for close-set training.Moreover, 
20 VOC and 60 non-VOC sets in COCO are used to evaluate the proposed method under different open-set conditions.The comparison methods contain Faster-CNN(FR-CNN), placeholders for open-set recognition(PROSER), open world object detector (ORE), dropout sampling(DS), and open-set detector(OpenDet).OpenDet is currently the state-of-the-art method in the field of OSOD.In particular, we adopt two settings to prove the effectiveness of our method.For setting one, we gradually increase the number of unknown classes and build three joint datasets called Visual Object Classes-Common Objects in Context-20 (VOC-COCO-20), Visual Object Classes-Common Objects in Context-40 (VOC-COCO-40), and Visual Object Classes-Common Objects in Context-60(VOC-COCO-60).The proposed method outperforms other methods by a large margin in all targets and achieves new state-of-the-art results in OSOD.For example, our method gains approximately 26%, 32%, and 15.88 on wilderness impact(WI), absolute open-set error(AOSE)and APU, respectively, without compromising the mAPK(58.85% vs.58.45%)on the VOC-COCO-20 dataset.Compared with the state-of-the-art method, our method gains approximately 8%, 5%, and 15% on WI, AOSE and APU, respectively, on average on the three compared datasets.For setting two, we gradually increase the frequency of frames that may have unknowns, named the wilderness ratio, to construct three joint datasets:Visual Object Classes-Common Objects in Context-0.5n(VOC-COCO-0.5n), Visual Object Classes-Common Objects in Context-n(VOC-COCO-n), and Visual Object Classes-Common Objects in Context-4n(VOC-COCO-4n).The proposed method achieves new state-of-the-art results in 10 out of 12 targets from three comparison experiments in open-set object detection.The ablation study also demonstrates the effectiveness of each module in the proposed method.Conclusion In this study, the OSOD framework improved by the annular prototype space is adaptable to the OSOD problem.The comparison of the effects of baseline methods, the current state-of-the-art method, and our proposed method on the OSOD benchmark settings show that the proposed method can accurately detect open-set categories and background categories without changing the performance of the close-set object detection of the vanilla backbone.In future work, we hope to investigate further the correlation between known and unknown class detection performance and extend the categories to be detected to research areas, such as out-of-distribution and fine-grained image analysis.
- Mitosis detection by appearance and motion pattern perception Lin Fanchao, Xie Hongtao, Liu Chuanbin, Zhang Yongdongdoi:10.11834/jig.220901
20-09-2023
121
68
Abstract:Objective In the processes of medical research and diagnosis, such as cancer screening and drug development, mitosis detection under phase-contrast microscopy image provides a very important biological criterion.Manual counting of mitotic cells takes a lot of time and labor.Thus, automatic mitosis detection is more efficient and economic than the manual process.On the one hand, the distributions of mitosis images are significantly different under various culture conditions.Moreover, the increment of cell density makes screening out the cell regions difficult for conventional preprocessing methods.On the other hand, the cells at different stages have similar appearances and blurred motion processes.They also require the model to have a strong ability to discriminate cell types and states.Recent deep-learning-based works use threedimensional convolutions or temporal networks to obtain context information from the sequence images.However, an explicit supervision process for learning cell states is lacking, making effective pattern information of target regions difficult to achieve.As a result, these methods are not fully capable of distinguishing different cells and background areas from feature encoding, and their performance and generalization ability are limited.Therefore, this study explores a detection framework based on cell appearance and motion pattern perception to solve the above problems.An accurate prediction under complex scenes is also achieved through effective preprocessing and discriminative learning of cell patterns.Method The proposed method consists of three stages.The first stage aims to extract regions of interest as candidates.This stage serves as the preprocessing for finding the notable areas and facilitating the later detection.The original electron microscope image is divided into local slices.An instance segmentation network is also trained to segment roughly all the candidate regions that may contain mitosis.Then, a candidate region refinement algorithm is designed based on a concise spatiotemporal hypothesis to refine the candidates and reduce the redundant results.In the second stage, two encoding networks are pretrained to maintain the feature encoding of both appearance and motion information by building proxy learning processes.In particular, an image classification task is conducted for the appearance encoding network training, which learns to predict the cell categories from the spatial context of a single patch.Moreover, an image reconstruction task is conducted for the motion encoding network training, which considers patches from adjacent frames and learns the information of interframe changes by recovering the raw patches.These two processes complement each other to help model the cell states from different aspects.Finally, in the third stage, the whole spatiotemporal model is trained end-to-end by classifying the candidate patch sequences.The spatial modules are initialized with the pretrained parameters of encoding networks in the second stage, thereby allowing them to be aware of the cell patterns at the beginning of the training.Given the appropriate spatial context, the temporal modules are optimized to combine the interframe information and make the final prediction.The overall model provides a confidence score for each patch.The position with the highest score is regarded as a mitosis point.Result We conduct experiments on the public C2C12-16 benchmark.The experimental results demonstrate the superior detection ability of the proposed 
method.On the C2C12-16 validation set, the mean precision reaches 85.3%, the mean recall reaches 89.3%, and the mean F-score is 87.2%.On the C2C12-16 test set, the mean precision reaches 86.4%, the mean recall reaches 86.1%, and the F-score is 86.2%.The proposed method demonstrates high performance and can generate stable predictions under various conditions.The mean temporal bias of the proposed method in all groups is only 0.221 ±0.536 frames, and the mean spatial bias is 3.321 ±2.461 pixels, both of which are much lower than those obtained by the counterpart method.Conclusion This study explores a new framework to tackle the hard cases under complex scenes in mitosis detection.The preprocessing strategy effectively extracts candidate regions and substantially improves detection efficiency.The pre-training of the feature encoding network based on proxy tasks fully enhances the model's ability to learn the appearance and motion patterns of the candidate regions.With the preprocessing and pretraining designs, our framework can distinguish the discrepancy of visual patterns between mitosis cells, common cells, and background noises, overcome the interference of complex scenes and cell patterns in the microscope image, and achieve both accurate and stable mitosis detection from spatiotemporal dimensions.
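As an illustration of how the reported precision, recall, and F-score relate, the following Python sketch computes these detection metrics from matched predictions; the matching of predictions to ground-truth mitosis events (within spatial and temporal tolerances) follows the usual detection-evaluation convention and is not taken from the paper's code.

```python
# Illustrative only: relation between precision, recall, and F-score for
# mitosis detection. The counts below are placeholders, not the paper's data.

def detection_scores(num_true_positive: int, num_predictions: int, num_ground_truth: int):
    """Return (precision, recall, F-score) for a set of matched detections."""
    precision = num_true_positive / max(num_predictions, 1)
    recall = num_true_positive / max(num_ground_truth, 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f_score

# Example: values close to the reported validation precision (85.3%) and
# recall (89.3%) yield an F-score of roughly 87%.
p, r, f = detection_scores(num_true_positive=853, num_predictions=1000, num_ground_truth=955)
print(f"precision={p:.3f}, recall={r:.3f}, F-score={f:.3f}")
```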
Review
- The development,application,and future of LLM similar to ChatGPT Yan Hao, Liu Yuliang, Jin Lianwen, Bai Xiangdoi:10.11834/jig.230536
20-09-2023
148
251
Abstract:Generative artificial intelligence(AI)technology has achieved remarkable breakthroughs and advances in its intelligence level since the release of ChatGPT several months ago, especially in terms of its scope, automation, and intelligence.The rising popularity of generative AI attracts capital inflows and promotes the innovation of various fields.Moreover, governments worldwide pay considerable attention to generative AI and hold different attitudes toward it.The US government maintains a relatively relaxed attitude to stay ahead in the global technological arena, while European countries are conservative and are concerned about data privacy in large language models(LLMs).The Chinese government attaches great importance to AI and LLMs but also emphasizes the regulatory issues.With the growing influence of ChatGPT and its competitors and the rapid development of generative AI technology, conducting a deep analysis of them becomes necessary.This paper first provides an in-depth analysis of the development, application, and prospects of generative AI.Various types of LLMs have emerged as a series of remarkable technological products that have demonstrated versatile capabilities across multiple domains, such as education, medicine, finance, law, programming, and paper writing.These models are usually fine-tuned on the basis of general LLMs, with the aim of endowing the large models with additional domainspecific knowledge and enhanced adaptability to a specific domain.LLMs(e.g., GPT-4)have achieved rapid improvements in the past few months in terms of professional knowledge, reasoning, coding, credibility, security, transferability, and multimodality.Then, the technical contribution of generative AI technology is briefly introduced from four aspects:1) we review the related work on LLMs, such as GPT-4, PaLM2, ERNIE Bot, and their construction pipeline, which involves the training of base and assistant models.The base models store a large amount of linguistic knowledge, while the assistant models acquire stronger comprehension and generation capabilities after a series of fine-tuning.2)We outline a series of public LLMs based on LLaMA, a framework for building lightweight and memory-efficient LLMs, including Alpaca, Vicuna, Koala, and Baize, as well as the key technologies for building LLMs with low memory and computation requirements, consisting of low-rank adaptation, Self-instruct, and automatic prompt engineer.3)We summarize three types of existing mainstream image -text multimodal techniques:training additional adaptation layers to align visual modules and language models, multimodal instruction fine-tuning, and LLM serving as the center of understanding.4)We introduce three types of LLM evaluation benchmarks based on different implementation methods, namely, manual evaluation, automatic evaluation, and LLM evaluation.Parameter optimization and fine-tuning dataset construction are crucial for the popularization and innovation of generative AI products because they can significantly reduce the training cost and computational resource consumption of LLMs while enhancing the diversity and generalization ability of LLMs.Multimodal capability is the future trend of generative AI because multimodal models have the ability to integrate information from multiple perceptual dimensions, which is consistent with human cognition.Evaluation benchmarks are the key methods to compare and constrain the models of generative AI, given that they can efficiently measure and optimize the performance 
and generalization ability of LLMs and reveal their strengths and limitations.In conclusion, improving parameter optimization, highquality dataset construction, multimodal, and other technologies and establishing a unified, comprehensive, and convenient evaluation benchmark will be the key to achieving further development in generative AI.Furthermore, the current challenges and possible future directions of the related technologies are discussed in this paper.Existing generative AI products have considerable creativity, understanding, and intelligence and have shown broad application prospects in various fields, such as empowering content creation, innovating interactive experience, creating "digital life, " serving as smart home and family assistants, and realizing autonomous driving and intelligent car interaction.However, LLMs still exhibit some limitations, such as lack of high-quality training data, susceptibility to hallucinations, output factual errors, uninterpretability, high training and deployment costs, and security and privacy issues.Therefore, the potential research directions can be divided into three aspects:1)the data aspect focuses on the input and output of LLMs, including the construction of general tuning instruction datasets and domain-specific knowledge datasets.2)The technical aspect improves the internal structure and function of LLMs, including the training, multimodality, principle innovation, and structure pruning of LLMs.3)The application aspect enhances the practical effect and application value of LLMs, including security enhancement, evaluation system development, and LLM application engineering implementation.The advancement of generative AI has provided remarkable benefits for economic development.However, it also entails new opportunities and challenges for various stakeholders, especially the industry and the general public.On the one hand, the industry needs to foster a large pool of researchers who can conduct systematic and cutting-edge research on generative AI technologies, which are constantly improving and innovating.On the other hand, the general public needs to acquire and apply the skills of prompt engineering, which can enable them to utilize existing LLMs effectively and efficiently.
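The low-rank adaptation technique listed above among the key technologies for building memory-efficient LLMs can be illustrated with a minimal sketch; the layer size, rank, and scaling below are illustrative defaults rather than values from any surveyed model.

```python
# A minimal sketch of low-rank adaptation (LoRA): a frozen pretrained weight is
# augmented with a trainable low-rank update delta_W = B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen base weight (stands in for a layer of the pretrained model).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors; B starts at zero so training begins from
        # the pretrained behaviour.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(768, 768)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trained
```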
- Overview of the computational intelligence method in 3D point cloud registration Wu Yue, Yuan Yongzhe, Xiang Benhua, Sheng Jinlong, Lei Jiayi, Hu Congying, Gong Maoguo, Ma Wenping, Miao Qiguangdoi:10.11834/jig.220727
20-09-2023
139
136
Abstract: Point cloud data collected by lidar, structured light sensors, and stereo cameras have attracted widespread attention with the maturity and popularization of 3D data acquisition equipment. On this basis, many algorithms for point cloud registration, classification, segmentation, and tracking have been developed, which promote research progress in the field of point clouds. Point cloud registration is an important research direction in point cloud data processing. It aims to find the motion parameters of a rigid transformation such that, after acting on the source point cloud, the source is aligned with the reference point cloud. Most traditional point cloud registration methods are sensitive to initial poses and outliers. In comparison, computational intelligence methods can effectively solve point cloud registration problems and can also handle the partially overlapping problem. In these cases, computational intelligence methods show strong robustness and generalization. These methods do not depend on the characteristics of the problem itself nor require the establishment of an accurate model; they only require an approximate solution in place of the true solution, thereby greatly reducing the amount of computation. The applications of computational intelligence methods in point cloud registration fall into three main categories: deep learning, evolutionary computing, and fuzzy logic. The deep learning methods in point cloud registration can be divided into two types according to whether a correspondence relationship exists: the corresponding point cloud registration method and the noncorresponding point cloud registration method. Research on the former is based on the traditional iterative closest point framework; that is, the network framework is divided into four parts: feature extraction, feature matching, outlier elimination, and motion parameter estimation. In comparison, noncorresponding point cloud registration estimates the motion parameters from the difference between the global features of the two point clouds. It includes two important steps: 1) extracting global features sensitive to the pose of the point cloud and 2) using the differences in global features to solve for the motion parameters. The mainstream approach is the correspondence-based point cloud registration method. Global registration methods describe the point cloud with feature descriptors; these descriptors encode the neighborhood information of the feature points, which can effectively overcome the problem that the traditional iterative closest point method is sensitive to the initial pose and easily falls into a local minimum. The correspondence-based point cloud registration method comprises four modules: feature extraction, feature matching, outlier removal, and motion parameter estimation. Feature extraction is the primary task in point cloud registration, and the quality of the extracted features directly affects global performance. In correspondence-based point cloud registration, the global features of all points in the two point cloud sets are first extracted to generate a correspondence mapping, after which the transformation matrix is solved. Feature extraction mainly includes voxel-based feature extraction and feature
extraction based on raw data.Feature matching can find the corresponding points in the overlapping area, thereby evaluating the transformation matrix.Compared with the traditional point cloud, deep learningbased point cloud registration uses the network to generate the corresponding points.Outliers greatly impact the point cloud registration performance.The weights of the points can be solved through the neural network, and the corresponding points with large weights can be selected through the maximum pool for registration.The outliers can also be removed.Motion parameter estimation is the last task in point cloud registration.It solves the rotation matrix and translation vector by mainly using regression and singular value decomposition.Evolutionary computing methods in point cloud registration mainly include two categories:evolutionary algorithms and swarm intelligence.The genetic algorithm and differential evolution algorithm are mainly used in the evolutionary algorithm and point cloud registration.The genetic algorithm constructs the population, evaluates the individual, and performs crossover mutation according to the fitness to evolve the population until the population meets the termination condition and obtains the optimal solution.The differential evolution algorithm is a heuristic global search algorithm.It encodes the parameters in the point cloud registration.Then, it initializes the population, performs mutation crossover and selection according to different strategies, and finally finds the optimal transformation parameters according to the iterative search.For point cloud registration, the point cloud registration method based on swarm intelligence is also mainly divided into two types:particle swarm optimization algorithm and ant colony optimization algorithm.The particle swarm algorithm in point cloud registration first designs an appropriate objective function as a fitness function.Then, it encodes the parameters to generate an initialization particle swarm and updates the individual best position and the global best position according to fitness.It iterates until the termination condition is met.The evolutionary algorithm has robustness, parallelism, and self-adaptation.These features are in good agreement with the characteristics of point cloud registration.The fuzzy logic method in point cloud registration is mainly used in two ways:1)reducing the number of point clouds and 2)point cloud registration based on fuzzy clustering.When the point cloud is reduced, the quality of the point cloud can be improved by dividing the point cloud input space into several fuzzy sets and defining fuzzy rules and membership functions of fuzzy variables.The fuzzy clustering method has three main steps:converting the point cloud input into a fuzzy matrix, establishing a fuzzy similarity matrix, and relying on the fuzzy matrix for classification.This method can evaluate point cloud registration quality without ground truth.The present article discusses in detail the above point cloud registration methods and the advantages and disadvantages of each method to summarize the related research on point cloud registration comprehensively and clearly.
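The motion parameter estimation step described above, which solves the rotation matrix and translation vector mainly through singular value decomposition, can be sketched as follows for a given set of matched correspondences; this is the standard closed-form solution, not the code of any surveyed method.

```python
# A minimal sketch of SVD-based motion parameter estimation: given matched
# corresponding points, recover the rigid rotation R and translation t that
# align the source point cloud with the reference point cloud.
import numpy as np

def estimate_rigid_transform(src: np.ndarray, ref: np.ndarray):
    """src, ref: (N, 3) arrays of corresponding points. Returns (R, t)."""
    src_center = src.mean(axis=0)
    ref_center = ref.mean(axis=0)
    # Cross-covariance of the centred correspondences.
    H = (src - src_center).T @ (ref - ref_center)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    # Reflection correction keeps R a proper rotation (det = +1).
    if np.linalg.det(R) < 0:
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = ref_center - R @ src_center
    return R, t
```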
- Deep-learning-based image captioning:analysis and prospects Zhao Yongqiang, Jin Zhi, Zhang Feng, Zhao Haiyan, Tao Zhengwei, Dou Chengfeng, Xu Xinhai, Liu Donghongdoi:10.11834/jig.220660
20-09-2023
146
118
Abstract:The task of image captioning is to use a computer in automatically generating a complete, smooth, and suitable corresponding scene's caption for a known image and realizing the multimodal conversion from image to text.Describing the visual content of an image accurately and quickly is a fundamental goal for the area of artificial intelligence, which has a wide range of applications in research and production.Image captioning can be applied to many aspects of social development, such as text captions of images and videos, visual question answering, storytelling by looking at the image, network image analysis, and keyword search of an image.Image captions can also assist individuals born with visual impairments, making the computer another pair of eyes for them.The accuracy and inference speed of image captioning algorithms have been greatly improved with the wide application of deep learning technology.On the basis of extensive literature research we find that image captioning algorithms based on deep learning still have key technical challenges, i.e., delivering rich feature information, solving the problem of exposure bias, generating the diversity of image captions, realizing the controllability of image captions, and improving the inference speed of image captions.The main framework of the image captioning model is the encoder-decoder architecture.First, the encoder-decoder architecture uses an encoder to convert an input image into a fixed-length feature vector.Then, a decoder converts the fixed-length feature vector into an image caption.Therefore, the richer the feature information contained in the model is, the higher the accuracy of the model is, and the better the generation effect of the image caption is.According to the different research ideas of the existing algorithms, the present study reviews image captioning algorithms that deliver rich feature information from three aspects:attention mechanism, pretraining model, and multimodal model.Many image captioning algorithms cannot synchronize the training and prediction processes of a model.Thus, the model obtains exposure bias.When the model has an exposure bias, errors accumulate during word generation.Thus, the following words become biased, seriously affecting the accuracy of the image captioning model.According to different problem-solving methods, the present study reviews the related research on solving the exposure bias problem in the field of image captioning from three perspectives:reinforcement learning, nonautoregressive model, and curriculum learning and scheduled sampling.Image captioning is an ambiguity problem because it may generate multiple suitable captions for an image.The existing image captioning methods use common high-frequency expressions to generate relatively "safety" sentences.The caption results are relatively simple, empty, and lack critical detailed information, easily causing a lack of diversity in image captions.According to different research ideas, the present study reviews the existing image captioning methods of generative diversity from three aspects:graph convolutional neural network, generative adversarial network, and data augmentation.The majority of current image captioning models lack controllability, differentiating them from human intelligence.Researchers have proposed an algorithm to solve the problem by actively controlling image caption generation, which is mainly divided into two categories:content-controlled image captions and style-controlled image 
captions. Content-controlled image captions aim to control the described image content, such as different areas or objects of the image, so that the model can describe the image content in which the users are interested. Style-controlled image captions aim to generate captions of different styles, such as humorous, romantic, and antique. In this study, the related algorithms of content-controlled and style-controlled image captions are reviewed. The existing image captioning models are mostly encoder-decoder architectures. The encoder stage uses a convolutional neural network-based visual feature extraction method, whereas the decoder stage uses a recurrent neural network-based method. According to the different existing research ideas, the methods for improving the inference speed of image captioning models are divided into three categories. The first category uses nonautoregressive models to improve the inference speed. The second category uses the grid-based visual feature method to improve the inference speed. The third category uses a convolutional-neural-network-based decoder to improve the inference speed. In addition, this study provides a detailed introduction to general datasets and evaluation metrics in image captioning. General datasets mainly include Flickr8K, Flickr30K, MS COCO (Microsoft common objects in context), TextCaps, Localized Narratives, and Nocaps. The evaluation metrics mainly include the following: bilingual evaluation understudy (BLEU); recall-oriented understudy for gisting evaluation (ROUGE); metric for evaluation of translation with explicit ordering (METEOR); consensus-based image description evaluation (CIDEr); semantic propositional image caption evaluation (SPICE); compact bilinear pooling; text-to-image grounding for image caption evaluation; relevance, extraness, omission; and fidelity and adequacy ensured. Finally, this study deeply discusses the problems to be solved and the future research directions in the field of image captioning, i.e., how to improve the performance of visual feature extraction in image captioning, how to improve the diversity of image captions, how to improve the interpretability of deep learning models, how to realize the transfer between multiple languages in image captioning, how to automatically generate or design the optimal network architecture, and how to study the datasets and evaluation metrics that are suitable for image captioning. Image captioning research is a popular hot spot in computer vision and natural language processing. At present, many algorithms for solving different problems are proposed annually, and other research directions will be developed in the future.
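A schematic encoder-decoder captioner of the kind described above (a CNN encoder producing a fixed-length feature vector, followed by a recurrent decoder) might look like the following; the layer sizes and vocabulary are placeholders rather than those of any reviewed model.

```python
# A toy encoder-decoder captioning model: the encoder maps an image to one
# fixed-length feature vector, and an autoregressive GRU decoder produces
# word logits conditioned on that vector.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size: int = 1000, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        h0 = self.encoder(image).unsqueeze(0)   # (1, B, feat_dim) initial hidden state
        x = self.embed(tokens)                  # (B, T, feat_dim) word embeddings
        y, _ = self.decoder(x, h0)
        return self.out(y)                      # (B, T, vocab_size) word logits

logits = TinyCaptioner()(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```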
- Survey on knowledge distillation and its application Si Zhaofeng, Qi Honggangdoi:10.11834/jig.220273
20-09-2023
117
101
Abstract:Deep learning is an effective method in various tasks, including image classification, object detection, and semantic segmentation.Various architectures of deep neural networks(DNNs), such as Visual Geometry Group network (VGGNet), residual network(ResNet), and GoogLeNet, have been proposed recently.All of which have high computational costs and storage costs.The effectiveness of DNNs mainly comes from their high capacity and architectural complexity, which allow them to learn sufficient knowledge from datasets and generalize well to real-world scenes.However, high capacity and architectural complexity can also result in a drastic increase in storage and computational costs, thereby complicating the implementation of deep learning methods on devices with limited resources.Given the increasing demand for deep learning methods on portable devices, such as mobile phones, the cost of DNNs must be urgently reduced.Researchers have developed a series of methods called model compression to solve the aforementioned problem.These methods can be divided into four main categories:network pruning, weight quantization, weight decomposition, and knowledge distillation.Knowledge distillation is a comparably new method first introduced in 2014.It attempts to transfer the knowledge learned by a cumbersome network(teacher network)to a lightweight network(student network), thereby allowing the student network to perform similarly to the teacher network.Thus, compression can be achieved by using the student network for inference.Traditional knowledge distillation works by providing softened labels to the student network as the training target instead of allowing the student network to learn ground truth directly.The student network can learn about the correlation among classes in the classification problem by learning from softened labels.This approach can be taken as extra supervision while training.The student network trained by knowledge distillation should ideally approximate the performance of the teacher network.In this way, the computational and storage costs in the compressed network are reduced with minor degradation compared with those in the uncompressed network.However, this situation is almost unreachable when the compression rate is large enough to be comparable with the compression rates of other model compression methods.On the contrary, knowledge distillation can be taken as a measure of enhancing the performance of a deep learning model.Thus, this model can perform better than other models of similar size.Moreover, knowledge distillation is a method of model compression.In this study, we aim to review the knowledge distillation methods developed in recent years from a new perspective.We sort the existing methods according to their target by dividing them into performance-oriented methods and compression-oriented methods.Performance-oriented methods emphasize the improvement of the performance of the student network, whereas compression-oriented methods focus on the relationship between the size of the student network and its performance.We further divide these two categories into specific ideas.In performance-oriented methods, we describe state-of-the-art methods in two aspects:the representation of knowledge and ways of learning knowledge.The representation of knowledge has been widely studied in recent years.The researchers attempt to derive knowledge from the teacher network instead of outputting vectors to enrich the knowledge while training.Other forms of knowledge include a 
middle-layer feature map, representation extracted from the middle layer, and structural knowledge.The student network can learn about the teacher network's behavior while forward propagating by combining this extra knowledge with the soft target in traditional knowledge distillation.Thus, the student network acts similarly to the teacher network.Studies on the way of learning knowledge attempt to explore distillation architectures on the basis of the teacher-student architecture.Moreover, architectures includingonline distillation, self-distillation, multiteacher distillation, progressive knowledge distillation, and generative adversarial network(GAN)-based knowledge distillation are proposed.These architectures focus on the effectiveness of distillation and different use cases.For example, online distillation and self-distillation can be applied when the teacher network with high capacity is unavailable.In compression-oriented knowledge distillation, researchers try to combine neural architecture search(NAS)methods with knowledge distillation to balance the relationship between the performance and the size of the student network.Many studies on the impact of the size difference between the teacher network and the student network on distillation performance are also available.They concluded that a wide gap between the teacher and the student can cause performance degradation.Then, bridging the gap between the teacher and the student with several middle-sized networks was proposed in these studies.We also formalize different kinds of knowledge distillation methods.The corresponding figures are shown uniformly to help researchers understand the basic ideas comprehensively and learn about recent works on knowledge distillation.One of the most notable characteristics of knowledge distillation is that the architectures of the teacher network and the student network stay intact during training.Thus, other methods for different tasks can be incorporated easily.In this study, we introduce recent works on different knowledge distillation tasks, including object detection, face recognition, and natural language processing.Finally, we summarize the knowledge distillation methods mentioned before and propose several possible ideas.Recent research on knowledge distillation has mainly focused on enhancing the performance of the student network.The major problem of the student network lies in finding a feasible source of knowledge from the teacher network.Moreover, compression-based knowledge distillation suffers from the problem of searching space when NAS adjusts network architecture.On the basis of the analysis above, we propose three possible ideas for researchers to study:1)obtaining knowledge from various tasks and architectures in the form of knowledge distillation, 2)developing a searching space for NAS when combined with knowledge distillation and adjusting the teacher network while searching for the student network, and 3)developing a metric for knowledge distillation and other model compression methods to evaluate both task performance and compression performance.
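The traditional softened-label distillation described above can be written as a short loss function; the temperature and loss weighting below are illustrative choices, not values prescribed by the surveyed methods.

```python
# A minimal sketch of the softened-label distillation loss: the student matches
# temperature-softened teacher outputs in addition to the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy with the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```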
Image Processing and Coding
- Efficient tone mapping via macro and micro information enhancement and color correction Zhu Zhongjie, Cui Weifeng, Bai Yongqiang, Jing Weiyi, Jin Minhongdoi:10.11834/jig.220460
20-09-2023
146
189
Abstract:Objective The traditional 8-bit images cannot accurately store and represent the real natural scene because the brightness variations in reality are very wide, ranging from faint starlight to direct sunlight with more than nine orders of magnitude.High dynamic range(HDR)imaging technology adopts floating-point numbers to address this deficiency.This technology can accurately represent the fidelity of a real scene with abundant brightness and chroma information.However, HDR images cannot be rendered directly on conventional display devices.Tone mapping(TM)technology aims to convert HDR images into traditional images while preserving the natural scene without losing information.Many excellent TM operators have emerged and have been widely used in business.However, the scene information is inevitably lost in different degrees because of the large-scale transformation and compression of the brightness range.In particular, even the state-ofart TM operators for complex scenes still have some problems, such as blurred details, edge halation, brightness imbalance, and color distortion, which seriously affect the subjective feeling of human eyes.Hence, a novel TM algorithm is proposed in this study via macro and micro information enhancement and color correction.Method Targeted algorithm structures with different strategies for the brightness and chroma domains are constructed in this study based on the human visual perception mechanism.First, an HDR image is converted from RGB color space to HSV color space, and the independent luminance information and chrominance information can be separated effectively.Thus, the subsequent processing can be performed smoothly without mutual interference.Second, different processing and optimization strategies are adopted for the brightness and chroma channels, respectively.For the former, the brightness range is greatly compressed to meet the demand of low dynamic range images while enhancing the detailed information perceived by human eyes from the macro and micro points of view.In particular, the brightness channel is divided into the basic and detail layers through the weighted guidance filter.The basic layer is compressed and combined with the macro statistical information to reduce the brightness contrast of the image and ensure the authenticity and integrity of the image background information and the overall structure.Subsequently, the salient region of the real scene is extracted by the gray-level co-occurrence matrix based on the human eye attention mechanism.According to the saliency information distribution, the texture information of the detail layer is enhanced.The edge halation is further eliminated by adjusting the scaling factor.Finally, the compressed base layer and enhanced detail layer are linearly fused to the targeted brightness channel while ensuring macro consistency and micro significance with the HDR image.For the chroma channel, a saturation migration model is designed with integrating brightness compression.This model can adaptively adjust the information saturation with brightness variety while keeping the hue information unchanged.According to the principle of color constancy, people's perception of the color of the object's surface remains unchanged when the color light irradiating the object's surface changes.Moreover, experience shows that different saturation levels directly affect people's subjective perception of color, even if the hue information remains unchanged.Therefore, a median shift model is constructed to 
adjust the image chromaticity saturation adaptively by combining the changes in the statistical information of brightness compression.Thus, the constancy of object surface color can be ensured, and the subjective color distortion caused by information compression of the luminance channel can be effectively avoided.The main experiments include the establishment of a database containing nearly 200 HDR images with different brightness dynamic ranges, light and dark area distribution, and detail richness.This database is used to verify the feasibility and generalization of the proposed algorithm.Result Experimental results show that the proposed algorithm is superior to the existing TM algorithms in subjective and objective evaluations.In terms of objective evaluation, the images are scored using the TM quality index(TMQI).The comprehensive evaluation score is obtained by calculating the naturalness and structural fidelity.Compared with the algorithms in the reviewed studies, the proposed algorithm exhibits a comprehensive TMQI score reaching the highest score of 0.862 9.The proposed algorithm is also superior to most of the existing methods in terms of naturalness and structural fidelity.For the subjective evaluation, we refer to the international mean opinion score standard, with scores ranging from 1 to 5, indicating the worst to the best.The subjects score the test images according to their personal preferences, which are combined with the images'texture details, edge halation, brightness imbalance, and color distortion.The scores of 20 subjects, including 10 men and 10 women, are counted.Results show that the subjects give four points to most images mapped by the proposed algorithm, and a few images achieve the best five points.In particular, the average score of the proposed algorithm reaches the highest 4.3 points.Conclusion In this study, a brightness perception compression model with macro consistency and micro significance is constructed in the brightness channel.Thus, the drawbacks of the existing TM algorithms, such as the loss of detail texture information, edge halation, and brightness imbalance, can be effectively solved.Moreover, a saturation migration model is designed by integrating brightness compression in the chroma channel, effectively solving the color distortion caused by brightness compression.The experimental comparison results indicate that the TM algorithm via macro and micro information enhancement and color correction proposed in this study is better than the existing TM algorithms.Moreover, the proposed algorithm provides a beneficial condition for us to conduct high dynamic image generation and high dynamic video coding.
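A schematic version of the brightness-channel pipeline described above (decompose the luminance into base and detail layers, compress the base, enhance the detail, and recombine) is sketched below; a Gaussian filter stands in for the paper's weighted guided filter, and the compression and enhancement factors are illustrative assumptions.

```python
# A toy base/detail tone-mapping pipeline in the log-luminance domain.
import numpy as np
from scipy.ndimage import gaussian_filter

def tone_map_luminance(lum: np.ndarray, compression: float = 0.6, detail_gain: float = 1.5):
    """lum: HDR luminance channel (positive floats). Returns an LDR luminance in [0, 1]."""
    log_lum = np.log1p(lum)
    base = gaussian_filter(log_lum, sigma=5)   # smooth base layer (overall structure)
    detail = log_lum - base                    # detail layer (texture and edges)
    mapped = compression * base + detail_gain * detail
    mapped = np.expm1(mapped)
    return (mapped - mapped.min()) / (mapped.max() - mapped.min() + 1e-8)

ldr = tone_map_luminance(np.random.rand(64, 64) * 1000.0)
```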
- Low-light image enhancement and denoising with internal and external priors Du Shuangli, Dang Hui, Zhao Minghua, Shi Zhenghaodoi:10.11834/jig.220707
20-09-2023
120
179
Abstract:Objective Low-light image enhancement has been studied extensively in the past few decades as one of the most challenging image processing problems.The images taken in low-light conditions usually contain extremely dark areas and unexpected noise.Many impressive methods, including cognition-based and learning-based approaches, have been proposed to improve image brightness and recover image details and color information.Remarkable enhancement results have been achieved by deep-learning-based techniques.Low-light and norm-light image pairs are required for enhancement methods based on supervised learning.However, no unique or well-defined norm-light ground truth exists.In addition, the models trained by a direct image-to-image transformation manner, even with generative adversarial networks, tend to show a bias toward a certain range of luminance values and scenes.Approaches based on the retina cortex(Retinex)represent a branch of cognition-based methods.However, they tend to amplify the noises hidden in dark images.Some attempts for noise suppression have been introduced.They focus on utilizing the internal prior in the input image to distinguish the noise from the image texture.The denoising performance is limited, and the image texture is often removed together with noise.This scenario leads to a blurry background.Additionally, most of these enhancement methods are designed for generally low-light images.If they are used for images with extremely low light, insufficient brightness improvement and obvious color deterioration are produced.This study proposes a low-light image enhancement and denoising method to address these issues by combining the internal and external priors.Method We regard extremely low-light image enhancement as a two-stage illumination correction task.First, the global illumination in a scene is estimated based on the well-known dark channel prior.If the global illumination is lower than 0.5, the input image is regarded as an extremely low-light image, and an initial brightness correction is performed for the image.If the global illumination is greater than or equal to 0.5, the input image is a generally low-light image;thus, no further processing is required.Second, a sequential Retinex decomposition model is proposed to decompose a low-light image into an illumination component multiplied by a reflectance component.An L1-norm regularization term on the illumination gradient is applied under the assumption that it is spatially piecewise smooth.Unlike approximating the illumination layer to a pre-estimation, our method aims to approximate the illumination layer to the low-light image in the RGB color space.Then, all noises are supposed to be contained in the reflectance layer.Based on the Retinex decomposition result, the enhanced noise image is produced with Gamma correction.Finally, a denoising technique is proposed based on a dual, complementary prior constraint.This technique utilizes a nonlocal selfsimilarity property to construct the internal prior for the reflection component.The deep learning technique is also utilized to construct the external prior constraint for the enhanced noise image.Then, the internal and external priors restrict each other.The proposed denoising model can be solved by an alternating optimization strategy.Result We compare the proposed method with six existing enhancement algorithms, including two Retinex-based traditional approaches, two deep learning approaches, and two Retinex-based learning approaches, to verify its 
effectiveness. We select 140 generally low-light images (global illumination > 0.5) from the commonly used datasets, including DICM, LIME, and ExDark. We also select 162 extremely low-light images (global illumination < 0.15) from the LOL dataset for testing. For the generally low-light images, no well-exposed normal-light image exists for reference. Both visual evaluation and quantitative evaluation are provided. Three no-reference quality assessment metrics, including the blind tone-mapped quality index (BTMQI), the no-reference image quality metric for contrast distortion (NIQMC), and the natural image quality evaluator (NIQE), and two full-reference metrics, namely, the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), are utilized for evaluation. The visual comparisons show that our method has advantages in brightness improvement, color fidelity, and denoising. For the generally low-light images, the quantitative comparisons show that our method achieves the second-best results for BTMQI and NIQE. For NIQMC, the result of our method is close to the results of the two Retinex-based traditional methods. For extremely low-light images, our method achieves the best results for NIQMC, PSNR, and SSIM. The PSNR values obtained by the other algorithms range from 8 to 18.35 dB, and their SSIM values range from 0.3 to 0.78. In comparison, our algorithm reaches 18.94 dB and 0.82 for PSNR and SSIM, respectively, showing noticeable advantages. Qualitative and quantitative experimental results show that the proposed algorithm can enhance low-light images under different illumination conditions, effectively remove the noise hidden in them, and maintain relatively stable performance. Conclusion This paper proposes a novel Retinex-based low-light image enhancement and denoising method, which can be used for both generally and extremely low-light images. The irreconcilable conflict between brightness increase and color distortion in the extremely low-light enhancement task is effectively resolved by transforming an extremely low-light image into a generally low-light image. A dual, complementary constraint is constructed based on the internal and external priors to remove the amplified noise. The experiments demonstrate that the constraint can balance noise removal and texture preservation, keeping the edges of the enhanced image clear.
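The first-stage decision described above, estimating the global illumination from the dark channel prior and comparing it with the 0.5 threshold, might be sketched as follows; the patch size and the exact mapping from the dark channel to a global-illumination score are assumptions made only for illustration.

```python
# A toy dark-channel-based check for extremely low-light input.
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    """img: (H, W, 3) RGB in [0, 1]. Per-pixel minimum over channels and a local patch."""
    min_rgb = img.min(axis=2)
    return minimum_filter(min_rgb, size=patch)

def is_extremely_low_light(img: np.ndarray, threshold: float = 0.5) -> bool:
    # Illustrative global-illumination proxy: mean of the brightest 0.1% of
    # dark-channel values (an assumption, not the paper's exact estimator).
    dc = dark_channel(img)
    top_k = max(dc.size // 1000, 1)
    global_illumination = float(np.mean(np.sort(dc.ravel())[-top_k:]))
    return global_illumination < threshold

print(is_extremely_low_light(np.random.rand(128, 128, 3) * 0.1))  # dark input -> True
```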
Image Analysis and Recognition
- Infrared target tracking algorithm based on attention mechanism enhancement and target model update Ji Qingbo, Chen Kuicheng, Hou Changbo, Li Ziqi, Qi Yufeidoi:10.11834/jig.220459
20-09-2023
168
212
Abstract:Objective Most target tracking algorithms are designed based on visible sight scenes.However, in some cases, infrared target tracking has advantages that visible light does not have.Infrared equipment uses the radiation of an object itself to image and does not require additional lighting sources.It can display the target in weak light or dark scenes and has a certain penetration ability.However, infrared images have defects, such as unclear boundaries between targets and backgrounds, blurred images, and cluttered backgrounds.Moreover, some infrared dataset images are rough, negatively impacting the training of data-driven-based deep learning algorithms to a certain extent.Infrared tracking algorithms can be divided into traditional methods and deep learning methods.Traditional methods generally take the idea of correlation filtering as the core.Deep learning methods are mainly divided into the method of a neural network providing target features for correlation filters and the method of calculating the similarity of the image area with the framework of the Siamese network.The feature extraction ability of traditional methods for infrared targets is far inferior to that of deep learning methods.Moreover, the filters trained online cannot adapt to fast-moving or blurred targets, resulting in poor tracking accuracy in scenes with complex backgrounds.At present, most deep-learning-based infrared target tracking methods still lack the use of detailed information on infrared targets in infrared scenes with weak contrast and noise.Most trackers cannot effectively update the tracked target when the tracking scene has similar targets and cluttered background.This scenario results in poor robustness in long-term tracking.Therefore, an infrared target tracking algorithm based on attention and template adaptive update is proposed to solve the problems mentioned.Method The Siamese network tracking algorithm takes the target in the first frame as the template and performs similarity calculation on the search area of the subsequent frames to obtain the position of the target with the maximum response.The method has a simple structure and high tracking efficiency.However, most algorithms currently use the anchor-based mechanism, and the preset anchor requires tedious manual debugging to adapt to changes in the scale and aspect ratio of the target.The anchor-free design of the Siamese box adaptive network(SiamBAN)avoids the hyperparameters related to the candidate box.These hyperparameters are flexible and general.Therefore, this study is based on the SiamBAN tracking framework.Then, a fast attention enhancement module designed for infrared tracking scenes is added to process infrared images in parallel.This module mainly includes two parts:The first part is the contrast limited adaptive histogram equalization;the second part is the efficient channel attention module.A three-layer convolutional network connects the two parts to form a residual structure.This structure can improve the difference between the infrared target and the background.It can also enhance the detailed information of the target without losing the original information.The extracted features are proportionally fused into the middle layer of the backbone network to achieve rapid utilization.The target adaptive update network is used to learn the feature change trend of the infrared target while dynamically updating the middle- and high-level features of the target.The target adaptive update network uses the target information of 
the first frame as the initial template. Then, it superimposes the historical accumulation template and the template of the current frame to calculate the best template of the target in the next frame, thereby realizing the continuous use of the historical information of the target. Result We compare our infrared target tracking algorithm with 10 state-of-the-art trackers on four infrared target tracking evaluation benchmarks, namely, the large-scale thermal infrared object tracking benchmark (LSOTB-TIR), the thermal infrared pedestrian tracking benchmark (PTB-TIR), thermal infrared visual object tracking (VOT-TIR2015), and VOT-TIR2017. On the LSOTB-TIR dataset, the proposed algorithm ranks first with a precision of 79.0%, and its normalized precision and success rate are 71.5% and 66.2%, which are 4.0% and 4.6% higher than those of the second-ranked algorithm. On the PTB-TIR dataset, the proposed algorithm again ranks first, with a precision of 85.1% and a success rate of 66.9%, which are 1.3% and 3.6% higher than those of the second-ranked algorithm. The expected average overlap on the VOT-TIR2015 dataset is 0.344 and the accuracy is 0.73; the results of the same test on the VOT-TIR2017 dataset are 0.276 and 0.71. The algorithm achieves the highest ranking on the first three benchmarks. The ablation study on the LSOTB-TIR dataset shows that the algorithm provides an obvious gain over the baseline tracker. Finally, the qualitative analysis of the experimental results on the LSOTB-TIR dataset shows that the proposed algorithm is highly robust under the attributes of background clutter, fast motion, intensity variation, scale variation, occlusion, out-of-view, deformation, low resolution, and motion blur. It also shows that the fast attention enhancement module and the target adaptive update network positively affect the improvement of the tracking success rate. Conclusion Our algorithm improves the ability of the backbone to capture the features of the infrared target and adaptively adjusts the characteristic state of the target through its historical change information. Thus, the problem that infrared target tracking is susceptible to interference in complex environments is solved, and the precision and success rate of long-term infrared target tracking are improved.
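The template update described above, superimposing the initial template, the accumulated historical template, and the current-frame template, can be illustrated with a fixed linear blend; the learned update network of the paper is replaced here by constant weights, so this is a stand-in rather than the actual method.

```python
# A toy adaptive template update: blend the first-frame template, the
# accumulated history, and the current frame into the next template.
import numpy as np

def update_template(initial: np.ndarray, accumulated: np.ndarray, current: np.ndarray,
                    w_init: float = 0.3, w_hist: float = 0.5, w_curr: float = 0.2) -> np.ndarray:
    """All inputs are target feature templates of the same shape; returns the new template."""
    return w_init * initial + w_hist * accumulated + w_curr * current

template0 = np.random.rand(256, 7, 7)          # features of the first-frame target
accumulated = template0.copy()
for _ in range(10):                            # per-frame features of the tracked target
    current = np.random.rand(256, 7, 7)
    accumulated = update_template(template0, accumulated, current)
```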
- Human similar action recognition by fusing saliency image semantic features Bai Zhongyu, Ding Qichuan, Xu Hongli, Wu Chengdongdoi:10.11834/jig.220028
20-09-2023
114
69
Abstract:Objective Human action recognition is a valuable research area in computer vision.It has a wide range of applications, such as security monitoring, intelligent monitoring, human-computer interaction, and virtual reality.The skeleton-based action recognition method first extracts the specific position coordinates of the major body joints from the video or image by using a hardware method or a software method.Then, the skeleton information is used for action recognition.In recent years, skeleton-based action recognition has received increasing attention because of its robustness in dynamic environments, complex backgrounds, and occlusion situations.Early action recognition methods usually use hand-crafted features for action recognition modeling.However, the hand-crafted feature methods have poor generalization because of the lack of diversity in the extracted features.Deep learning has become the mainstream action recognition method because of its powerful automatic feature extraction capabilities.Traditional deep learning methods use constructed skeleton data as joint coordinate vectors or pseudo-images, which are directly input into recurrent neural networks(RNNs) or convolutional neural networks(CNNs)for action classification.However, the RNN-based or CNN-based methods lose the spatial structure information of skeleton data because of the limitation set by the European data structure.Moreover, these methods cannot extract the natural correlation of human joints.Thus, distinguishing subtle differences between similar actions becomes difficult.Human joints are naturally structured as graph structures in non-Euclidean space.Several works have successfully adopted graph convolutional networks(GCNs)to achieve state-of-the-art performance for skeletonbased action recognition.In these methods, the subtle differences between the joints are not explicitly learned.These subtle differences are crucial to recognizing similar actions.Moreover, the skeleton data extracted from the video shield the object information that interacts with humans and only retain the primary joint coordinates.The lack of image semantics and the reliance only on joint sequences remarkably challenge the recognition of similar actions.Method Given the above factors, the saliency image feature enhancement based center-connected graph convolutional network(SIFE-CGCN)is proposed in this work for skeleton-based similar action recognition.The proposed model is based on GCN, which can fully utilize the spatial and temporal dependence information between human joints.First, the CGCN is proposed for skeletonbased similar action recognition.For the spatial dimension, a center-connection skeleton topology is designed to establish connections between all human joints and the skeleton center to capture the small difference in joint movements in similar actions.For the temporal dimension, each frame is associated with the previous and subsequent frames in the sequence.Therefore, the number of adjacent nodes in the frame is fixed at 2.The regular 1D convolution is used on the temporal dimension as the temporal graph convolution.A basic co-occurrence graph convolution unit includes a spatial graph convolution, a temporal graph convolution, and a dropout layer.For training stability, the residual connection is added for each unit.The proposed network is formed by stacking nine graph convolution basic units.The batch normalization(BN)layer is added before the beginning of the network to standardize the input data, and a global 
average pooling layer is added at the end to unify the feature dimensions. The dual-stream architecture is used to utilize the joint and bone information of the skeleton data simultaneously and extract data features from multiple angles. Given the different roles of each joint in different actions, an attention map is added to focus on the main motion joints in an action. Second, the saliency image in the video is selected using the Gaussian mixture background modeling method. Each image frame is compared with the real-time updated background model to segment the image areas with considerable changes, and the background interference is eliminated. The effective extraction of semantic feature maps from saliency images is the key to distinguishing similar actions. The Visual Geometry Group network (VGG-Net) can effectively extract the spatial structure features of objects from images. In this work, the feature map is extracted through a pre-trained VGG-Net, and a fully connected layer is used for feature matching. Finally, the feature map matching result is used to strengthen and revise the recognition result of the CGCN and improve the recognition ability for similar actions. In addition, a similarity calculation method for skeleton sequences is proposed, and a similar action dataset is established in this work. Result The proposed model is compared with the state-of-the-art models on the proposed similar action dataset and the Nanyang Technological University RGB+D (NTU RGB+D) 60/120 datasets. The methods used for comparison include CNN-based, RNN-based, and GCN-based models. On the cross-subject (X-Sub) and cross-view (X-View) benchmarks of the proposed similar action dataset, the recognition accuracy of the proposed model reaches 80.3% and 92.1%, which are 4.6% and 6.0% higher than the recognition accuracies of the suboptimal algorithm, respectively. The recognition accuracy of the proposed model on the X-Sub and X-View benchmarks of the NTU RGB+D 60 dataset reaches 91.7% and 96.9%, improvements of 1.4% and 0.6% over the suboptimal algorithm. Compared with the suboptimal model, the feedback graph convolutional network (FGCN), the proposed model improves the recognition accuracy by 1.7% and 1.1% on the X-Sub and cross-setup (X-Set) benchmarks of the NTU RGB+D 120 dataset, respectively. In addition, we conduct a series of comparative experiments to clearly show the effectiveness of the proposed CGCN, the saliency image extraction method, and the fusion algorithm. Conclusion In this study, we propose SIFE-CGCN to solve the confusion that arises when recognizing similar actions, which is caused by the ambiguity of skeleton features and the lack of image semantic information. The experimental results show that the proposed method can effectively recognize similar actions, and the overall recognition performance and robustness of the model are improved.
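The center-connected skeleton topology described above, which links every joint to the skeleton center on top of the natural bone connections, can be sketched as an adjacency-matrix construction; the joint indices and bone list below are placeholders rather than an actual skeleton layout.

```python
# A minimal sketch of a center-connected skeleton adjacency matrix with the
# symmetric degree normalisation that is usual in graph convolution.
import numpy as np

def center_connected_adjacency(num_joints: int, bones, center: int) -> np.ndarray:
    A = np.eye(num_joints)                     # self-connections
    for i, j in bones:                         # natural skeleton edges
        A[i, j] = A[j, i] = 1.0
    A[:, center] = A[center, :] = 1.0          # connect every joint to the center joint
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

# Placeholder 5-joint chain with joint 2 as the center.
A_hat = center_connected_adjacency(num_joints=5, bones=[(0, 1), (1, 2), (2, 3), (3, 4)], center=2)
print(A_hat.shape)  # (5, 5)
```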
Image Understanding and Computer Vision
- Complex gesture pose estimation network fusing multiscale features Jia Di, Li Yuyang, An Tong, Zhao Jinyuandoi:10.11834/jig.220636
20-09-2023
144
78
Abstract:Objective Hand pose estimation aims to identify and localize key points of human hands in images.It has a wide range of applications in computer vision.Hand pose estimation methods can be categorized as depth- or RGB-based methods.Depth-based methods estimate the hand pose by extracting depth features.They require specific devices to constrain the user environment.Scholars use RGB images for hand pose estimation.However, this approach is difficult in an occluded environment.In particular, hand pose estimation based on a single RGB image has low accuracy because of the complexity of the pose, local self-similarity of finger features, and occlusion.Edge information is usually ignored in hand pose estimation.However, this information is important in extracting the information of occluded parts.Moreover, fingertips are small, thereby complicating the recognition of the joints at the fingertips.However, many existing RGB-based gesture estimation methods do not make good use of edge information.A multiscale feature fusion network for monocular vision gesture pose estimation is proposed to address this problem.Method Gesture pictures usually contain complex detailed features.A strong correlation between fingers and joints is present.Therefore, the use of a single feature for hand pose estimation tends to ignore diverse feature information, thereby complicating the accurate extraction of gesture information.Multiscale feature fusion network(MS-FF)aims to estimate the hand pose through a single RGB image.The feature maps of different resolutions are extracted from RGB images through the ResNet50 module.Feature maps are fed into the channel conversion module to learn the dependencies between channels explicitly, thereby enhancing important information and downplaying minor information.The level of feature information depends on the resolution of a feature map.Thus, the global regression module obtains high-resolution feature maps containing semantic information.These maps are separately input in the local optimization module to extract deep information.The Gaussian heatmap of hand joints is obtained to improve the spatial generalization ability of the model.Thus, accurate joint locations can be obtained.We take the feature map with the smallest resolution from the channel conversion module, through which the handedness and relative depth information between the wrist joints are obtained.The above results are combined to estimate the hand pose.Result The PyTorch framework was used for training.The hand image was resized to 256×256 pixels and input to the network.In the experiment, the batch size was set to 16.The network was trained for 20 epochs with an NVIDIA 3090 GPU.The initial learning rate was set to 0.000 1 and reduced by a factor of 10 at the 15th and 17th epochs to optimize the network output.The proposed method achieved better metrics than other methods on different test sets.InterHand2.6M(H+M)was selected as the training set.Compared with the evaluation metrics obtained by InterNet, the mean relative root position error, mean per joint position error of single hand sequences, and mean per joint position error of interacting hand sequences obtained by MS-FF had low errors of 30.92, 11.10, and 15.14, respectively.These values were 5.1%, 8.3%, and 5.8% lower than those obtained by InterNet.We also found that each finger achieved a low error.MS-FF also possesses few model parameters and low computational complexity while improving recognition accuracy.However, the running rate of 
MS-FF(28 frame/s)is lower than that of InterNet(53 frame/s).The picture shows the hand pose with finger self-occlusion and mutual occlusion of hands.Thus, estimating this interacting hand pose is more difficult than predicting a single hand pose.In the result obtained by our method, the hand joint positions and hand pose estimations are correctly predicted under occlusion.Moreover, our method can accurately predict hand joint positions and hand poses in case of occlusion.The proposed method achieves good recognition results in occluded gestures.Conclusion This study proposes an MS-FF for monocular visual hand pose estimation.MS-FF can extract information of different levels from feature maps of different resolutions to process the detailed information of occluded edges and fingertips effectively and estimate hand poses accurately.MS-FF accurately estimates hand poses in an RGB image and copes well with complex application scenarios.Thus, it can deal with difficult-to-recognize joints and inaccurate gesture recognition in occlusion scenes.Channels contain various implicit information.We need to focus on the information that is important for recognizing gestures.A channel conversion module adjusts the weights of channels to enhance important information.Fingertips occupy a small percentage of an image.They are also relatively difficult to identify.A global regression module generates different resolutions with rich semantic information to utilize image edge details and deep information effectively.This module is important in estimating finger poses.The global regression module may not accurately identify occluded joints.A local optimization module is designed with deep information in the feature map.It fuses all-level feature maps by correcting joints that do not return to the correction position.Thus, these maps can be applied well to the occlusion scene.Our method can effectively estimate single and interacting hand poses.It can also avoid errors caused by occlusion to a certain extent.High accuracy and robustness are achieved using the proposed method.However, the running rate of MS-FF is slower than that of the InterNet method because of the complex construction process of the MS-FF method.This scenario increases serial wait, kernel startup, and synchronization time overhead.In future work, we will continue to optimize our model, reduce the running rate of the model while ensuring recognition accuracy, and achieve a fast recognition speed to pave the way for fast and accurate gesture recognition in real scenes.
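The Gaussian heatmaps of hand joints mentioned above can be generated as follows; the heatmap resolution and standard deviation are illustrative choices, not the values used by MS-FF.

```python
# A minimal sketch of joint Gaussian heatmaps: each hand joint is encoded as a
# 2D Gaussian centred at its pixel location.
import numpy as np

def joint_heatmaps(joints_xy: np.ndarray, size: int = 64, sigma: float = 2.0) -> np.ndarray:
    """joints_xy: (K, 2) pixel coordinates in a size x size map. Returns (K, size, size)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmaps = np.zeros((len(joints_xy), size, size), dtype=np.float32)
    for k, (x, y) in enumerate(joints_xy):
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

hm = joint_heatmaps(np.array([[32, 32], [10, 50]]))
print(hm.shape)  # (2, 64, 64); each map peaks at its joint location
```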
- Meta-transfer learning in cross-domain image classification with few-shot learning Du Yandong, Feng Lin, Tao Peng, Gong Xun, Wang Jun doi:10.11834/jig.220664
20-09-2023
Abstract: Objective Few-shot learning image classification aims to recognize images with limited labeled samples. At present, few-shot learning methods are roughly divided into three categories: gradient optimization, metric learning, and transfer learning. Gradient optimization methods usually consist of two loop stages. In the inner loop stage, the base model quickly adapts to new tasks with limited labeled samples; in the outer loop stage, the meta-model learns cross-task knowledge for good generalization performance. The metric learning method first maps the samples to a high-dimensional embedding space and then classifies the unknown samples according to a similarity measure. The transfer learning method first pretrains a high-quality feature extractor with a large amount of annotated data and then fine-tunes the classifier so that the model suits the current task. The existing few-shot learning methods based on meta-learning assume that the training and test tasks are the same or have similar distributions. However, these methods face cross-domain classification challenges, such as weak generalization ability and poor classification accuracy. The few-shot learning methods based on transfer learning do not consider the inconsistency of sample categories between the training and testing stages and fail to leave enough feature embedding space for new category samples. On the basis of the idea of integrating transfer learning and meta-learning, we propose a compressed meta-transfer learning (CMTL) model to improve the cross-domain ability of few-shot learning. Method The method is mainly composed of two aspects. On the one hand, for meta-learning, the prior knowledge generated by meta-training is used to complete the classification of the target task. When the source and target tasks have different data distributions, for example, when the base class data used for training come from the source-domain natural dataset mini-ImageNet while the novel class data used for testing come from the target-domain medical dataset Chest-X, the meta-knowledge acquired from the source task cannot be quickly generalized to the target task because of its lack of universality, which further leads to poor cross-domain classification effects. In this study, new auxiliary tasks with strategies such as random cropping and gamma transformation were constructed on the support set of the target domain during the testing process. These auxiliary tasks fine-tune the meta-trained parameters to improve task adaptability. On the other hand, for transfer learning, the sample categories are assumed to be consistent during training and testing. Thus, the feature embedding space available for the novel class samples is small if the deep learning model is optimized with the traditional softmax loss function, which further leads to unsatisfactory feature extraction ability and poor classification accuracy. Given the above problems, this study proposes a self-compression loss function in the pretraining stage. This loss function adjusts the distribution of the base-class prototypes to make the base-class samples concentrated in the embedding space and to reserve part of the embedding space for the novel classes. In the fine-tuning stage, the novel classes with a large domain gap are guided to obtain expressive features. Existing studies on cross-domain few-shot learning show that meta-learning methods perform well when the data distributions of the target and source tasks are similar; conversely, when the distributions differ, transfer learning methods perform more effectively. The ensemble of the prediction scores of the above two strategies is regarded as the final classification result to take full advantage of both methods. Result This study compares the proposed model with several state-of-the-art cross-domain few-shot image classification models, such as the graph convolutional network (GCN), adversarial task augmentation (ATA), self-training to adapt representations to unseen problems (STARTUP), and other classic methods. Compared with the current state-of-the-art cross-domain few-shot methods, CMTL has advantages, as shown in the experimental results. In the testing phase, extensive experiments are performed on the 5-way 1-shot and 5-way 5-shot settings to validate the model and ensure a fair comparison with advanced methods. In these experiments, mini-ImageNet is used as the source-domain dataset for training, and the effectiveness of CMTL is tested on the EuroSAT, ISIC, CropDisease, and Chest-X datasets; the accuracy rates reach 68.87%/87.74%, 34.47%/49.71%, 74.92%/93.37%, and 22.22%/25.40% on the 5-way 1-shot and 5-way 5-shot settings, respectively. Compared with meta-learning and transfer learning models, our model achieves competitive results on the 5-way 1-shot and 5-way 5-shot settings on all cross-domain tasks. Conclusion This study proposes a cross-domain few-shot image classification model based on meta-transfer learning. The proposed model improves the generalization ability of few-shot learning. The experimental results show that the proposed CMTL combines the advantages of meta-learning and transfer learning methods and has significant effects on cross-domain few-shot tasks.
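The abstract's final step, ensembling the prediction scores of the meta-learning and transfer-learning branches, can be illustrated with a minimal sketch. The equal-weight averaging and the function name `ensemble_predict` are assumptions for illustration; CMTL's actual combination rule may differ.

```python
# Illustrative sketch: blend per-class scores from two branches and pick the best class.
import torch

def ensemble_predict(meta_logits: torch.Tensor,
                     transfer_logits: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Average the softmax scores of the two branches and return class indices."""
    meta_probs = meta_logits.softmax(dim=-1)
    transfer_probs = transfer_logits.softmax(dim=-1)
    blended = alpha * meta_probs + (1.0 - alpha) * transfer_probs
    return blended.argmax(dim=-1)

# Toy usage on a 5-way episode with 10 query images (random logits as stand-ins):
meta = torch.randn(10, 5)
trans = torch.randn(10, 5)
print(ensemble_predict(meta, trans))
```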
- Visual-semantic dual-disentangling for generalized zero-shot learning Han Ayou, Yang Guan, Liu Xiaoming, Liu Yang doi:10.11834/jig.220486
20-09-2023
Abstract: Objective Traditional deep learning models are widely adopted in many application scenarios and perform effectively. However, they rely on a large number of training samples, which are difficult to collect in practical applications. Moreover, such models can identify only the classes already present in the training phase (seen classes), and processing classes never seen in the training phase (unseen classes) remains a challenge. Zero-shot learning (ZSL) provides a good solution to this challenge. Zero-shot learning aims to classify unseen classes for which no training samples are available during the training phase. However, another problem arises from the complexity of the real world: in practice, both seen and unseen classes can be encountered. Therefore, generalized zero-shot learning (GZSL) is proposed. This new setting is more realistic and universal; as a generalized method, generalized zero-shot learning can sample test sets from both seen and unseen classes. The existing generalized zero-shot learning methods can be subdivided into two categories, namely, embedding-based and generation-based methods. The former learns a projection or embedding function that associates the visual features of the seen classes with the corresponding semantics, whereas the latter learns a generative model to generate visual features for the unseen classes. In previous studies, the visual features extracted using pretrained deep models (e.g., ResNet101) are not specifically extracted for the generalized zero-shot learning task, and not all dimensions of the extracted visual features are semantically related to the predefined attributes. This scenario makes the model incline toward the seen classes. Most methods also ignore useful feature-related information present in the semantics during classification, which remarkably affects the final classification. In this paper, we propose a new generalized zero-shot learning method, called the visual-semantic dual-disentangling framework for generalized zero-shot learning (VSD-GZSL), to disentangle the relevant visual features and semantic information. Method Conditional variational auto-encoders (VAEs) are combined with a disentanglement network and trained in an end-to-end manner. The proposed disentanglement network is an encoder-decoder structure. The visual features and semantics of the seen classes are first used to train the conditional variational auto-encoders and the disentanglement network. Once the network has converged, the trained generative network generates visual features for the unseen classes. The real features of the seen classes and the generated features of the unseen classes are fed into a visual feature disentanglement network to disentangle the semantic-consistent and semantic-irrelevant features. The semantics are likewise fed into a semantic disentanglement network to be disentangled into feature-relevant and feature-irrelevant semantic information. The components disentangled by the two disentanglement networks are fed into the decoder and reconstructed back to the corresponding spaces by using a reconstruction loss to prevent information loss during the disentanglement stage. A total correlation penalty module is designed to measure the independence between the latent variables disentangled by the disentanglement network. A relational network is designed to maximize the compatibility score between the components disentangled by the visual disentanglement network and the corresponding semantics and to learn the semantic consistency of the visual features. The semantic information related to the visual features, disentangled by the semantic disentanglement network, is fed into the visual disentanglement decoder for cross-modal reconstruction to measure the feature relevance of the semantics. Finally, the semantic-consistent features and feature-related semantics disentangled by the two disentanglement networks are jointly used to learn a generalized zero-shot classifier for classification. Result The proposed method was validated in several experiments on four generalized zero-shot learning open datasets (AwA2, CUB, SUN, and FLO). The proposed method achieved better results than the baseline, with a 3.8% improvement in the unseen class accuracy, a 0.2% improvement in the seen class accuracy, and a 1.6% improvement in the harmonic mean on the AwA2 dataset. The unseen class accuracy improved by 3.8%, the seen class accuracy improved by 2.4%, and the harmonic mean improved by 3.2% on the CUB dataset. The unseen class accuracy improved by 10.1%, the seen class accuracy improved by 4.1%, and the harmonic mean improved by 6.2% on the SUN dataset. Moreover, the seen class accuracy improved by 9.1%, and the harmonic mean improved by 1.5% on the FLO dataset. The proposed method was also compared with 10 recently proposed generalized zero-shot learning methods. Compared with f-CLSWGAN, VSD-GZSL exhibited improved harmonic means by 10%, 8.4%, 8.1%, and 5.7% on the four datasets. Compared with cycle-consistent adversarial networks for zero-shot learning (CANZSL), VSD-GZSL exhibited improved harmonic means by 12.2%, 5.6%, 7.5%, and 4.8% on the four datasets. Compared with leveraging invariant side GAN (LisGAN), VSD-GZSL exhibited improved harmonic means by 8.1%, 6.5%, 7.3%, and 3% on the four datasets. Compared with cross- and distribution-aligned VAE (CADA-VAE), VSD-GZSL exhibited improved harmonic means by 6.5%, 5.7%, 6.9%, and 10% on the four datasets. Compared with f-VAEGAN-D2, VSD-GZSL exhibited improved harmonic means by 6.9%, 4.5%, 6.2%, and 6.7% on the four datasets. Compared with CycleCLSWGAN, VSD-GZSL exhibited improved harmonic means by 5.1%, 8.1%, and 6.2% on the CUB, SUN, and FLO datasets, respectively. Compared with feature refinement (FREE), VSD-GZSL exhibited improved harmonic means by 3.3%, 0.4%, and 5.8% on the AwA2, CUB, and SUN datasets, respectively. The experimental results show that the proposed method achieves excellent results, demonstrating its effectiveness. Conclusion The proposed VSD-GZSL method demonstrates its superiority over traditional models. Our method can disentangle the semantically consistent features in the visual features and the feature-related information in the semantics, and a final classifier is then learned from these two mutually consistent decomposed components. Compared with several related methods, VSD-GZSL achieves a remarkable performance improvement on multiple datasets.
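A minimal sketch of the visual-feature disentangling idea described above follows: it splits a feature vector into semantic-consistent and semantic-irrelevant components and reconstructs the input to avoid information loss. The layer sizes and the omission of the total correlation penalty and relation network are simplifications; the class name `FeatureDisentangler` is hypothetical.

```python
# Illustrative sketch only; dimensions and architecture are assumptions.
import torch
import torch.nn as nn

class FeatureDisentangler(nn.Module):
    def __init__(self, feat_dim: int = 2048, latent_dim: int = 512):
        super().__init__()
        self.enc_consistent = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
        self.enc_irrelevant = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(2 * latent_dim, feat_dim)

    def forward(self, x):
        zc = self.enc_consistent(x)                      # semantically consistent component
        zi = self.enc_irrelevant(x)                      # semantically irrelevant component
        recon = self.decoder(torch.cat([zc, zi], dim=-1))
        return zc, zi, recon

# Toy usage with stand-in ResNet101-sized features.
model = FeatureDisentangler()
feats = torch.randn(4, 2048)
zc, zi, recon = model(feats)
recon_loss = nn.functional.mse_loss(recon, feats)        # discourages information loss
```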
Medical Image Processing
- Vessel segmentation of OCTA images based on latent vector alignment and Swin Transformer Xu Cong, Hao Huaying, Wang Yang, Ma Yuhui, Yan Qifeng, Chen Bang, Ma Shaodong, Wang Xiaogui, Zhao Yitian doi:10.11834/jig.220482
20-09-2023
Abstract: Objective Optical coherence tomography angiography (OCTA) is a noninvasive, emerging technique that has been increasingly used to image the retinal vasculature at capillary-level resolution. OCTA technology can demonstrate the microvascular information around the macula and has remarkable advantages in retinal vascular imaging. Fundus fluorescence angiography can also visualize the retinal vascular system, including capillaries. However, that technique requires the intravenous injection of a contrast agent; the process is relatively time-consuming and may have serious side effects. In clinical practice, doctors can examine different layers of vascular structures through OCTA images and analyze changes in vascular structures to determine the presence of related diseases. In particular, any abnormality in the microvasculature distributed in the macula often indicates the presence of diseases such as early-stage glaucomatous optic neuropathy, diabetic retinopathy, and age-related macular degeneration. Therefore, the automatic segmentation and extraction of retinal vascular structures in OCTA are vital for the quantitative analysis and clinical decision-making of many ocular diseases. However, the OCTA imaging process usually produces images with a low signal-to-noise ratio, posing a great challenge for the automatic segmentation of vascular structures. Moreover, variations in vessel appearance, motion and shadowing artifacts in different depth layers, and underlying pathological structures remarkably increase the difficulty of accurately segmenting retinal vessels. Therefore, this study proposes a novel segmentation method for retinal vascular structures that fuses latent vector alignment and the Swin Transformer to achieve accurate segmentation of vascular structures. Method In this study, the ResU-Net network is used as the base network (the encoder and decoder layers consist of residual blocks and pooling layers), and the Swin Transformer is introduced into ResU-Net to form a new encoder structure. The encoding step of the feature encoder consists of four stages. Each stage comprises two parts: a Transformer layer consisting of several stacked Swin Transformer blocks and a residual structure. The Swin Transformer encoder can acquire rich feature information, and the feature maps output from each Swin Transformer layer are combined with the upsampled feature maps in the decoder via skip connections. A feature alignment loss function based on latent vectors is also designed in this study. This feature alignment loss differs from the classical pixel-level loss functions: it can optimize segmentation results in terms of feature dimensions, enhance the encoder's ability to extract the structural features of OCTA image vessels, and optimize the network at the latent-space level by constraining the consistency of labels and images in the latent space, thereby improving segmentation performance. Result Experimental results on three OCTA datasets (including two public datasets and one private dataset) show that our method is ahead of other comparative methods and has the best overall segmentation performance. In particular, the area under the curve (AUC) of this method reaches 94.15%, 94.87%, and 97.63%, and the accuracy (ACC) reaches 91.57%, 90.03%, and 91.06%, respectively. Compared with the classical medical image segmentation network U-Net, the proposed method improves the AUC, Kappa, false discovery rate (FDR), and Dice by approximately 4.06%, 10.18%, 23.16%, and 7.87%, respectively, on the OCTA-O dataset. In addition, ablation experiments are conducted to verify the validity of each component of the proposed model. The results show that each component plays a positive role. Conclusion An end-to-end vascular segmentation network is proposed in this study to address the challenges of complex retinal vascular structures and low overall image contrast present in OCTA. In this study, ResU-Net is used as the backbone network to mitigate the interference of scattering noise and artifacts on segmentation through multi-fusion image input. Moreover, the Swin Transformer module is used as the encoding structure to obtain rich features. A novel latent vector alignment loss function that can optimize the network at the latent-space level is also designed in this study. Thus, the gap between the segmentation results and the labels is reduced, and the segmentation performance is improved. The experimental results demonstrate that the proposed method achieves the best segmentation performance on all three OCTA datasets and outperforms other comparative methods.
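The latent vector alignment idea can be illustrated roughly as follows: encode both the image and its label mask and penalize dissimilar latent codes, in addition to the usual pixel-wise loss. The shared encoder, the cosine distance, and the function name `latent_alignment_loss` are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch: a latent alignment term that complements a pixel-wise loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_alignment_loss(encoder: nn.Module,
                          image: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Encode the image and its label mask, then penalize dissimilar latent vectors."""
    z_img = encoder(image).flatten(1)                    # (N, D) latent code of the image
    with torch.no_grad():                                # treat the label code as a fixed target
        z_lbl = encoder(mask.expand_as(image)).flatten(1)
    return 1.0 - F.cosine_similarity(z_img, z_lbl, dim=1).mean()

# Toy usage with a stand-in encoder.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1))
img = torch.randn(2, 3, 64, 64)
msk = torch.randint(0, 2, (2, 1, 64, 64)).float()
loss = latent_alignment_loss(encoder, img, msk)
```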
Remote Sensing Image Processing
- Key sub-region feature fusion network for fine-grained ship detection and recognition in remote sensing images Zhang Lei, Chen Wen, Wang Yuehuan doi:10.11834/jig.220671
20-09-2023
Abstract: Objective The ocean has great economic and military value. With the development of human society, ocean activities have an increasing impact on the development of a country. The sea is an important carrier of marine activities. Thus, the recognition and monitoring of ship targets in key sea areas through remote sensing images are crucial to national defense and economic development. Fine-grained ship detection and recognition in high-resolution remote sensing images refer to the identification of specific types of ships on the basis of ship detection. A precise and detailed classification is valuable in practical application fields, such as sea surveillance and intelligence gathering. Instead of coarse-grained categories, such as warships and merchant ships, specific ship types, such as the Arleigh Burke-class destroyer, Nimitz-class aircraft carrier, container ship, and car carrier, are necessary. However, the overall color, shape, and texture of different types of ship targets are similar; ships of different types but similar uses have similar structures, and the coating color of military ships is monotonous. These characteristics complicate the classification of these targets. The existing ship detectors are designed to focus on locating targets. The design of the classification branch of these detectors is relatively simple; they only use the features of whole targets for classification, which significantly decreases performance on fine-grained labeled datasets. The existing ship classification methods, which mainly classify targets on pre-cropped image patches, are separated from the detection process. This approach is unsatisfactory for practical applications for two reasons: 1) the whole backbone of these neural-network-based methods must be executed on every proposal to extract features, and remote sensing images of a harbor usually include several ships, so the computational cost increases sharply; 2) the detection and classification networks are optimized separately, and even if the parameters of both networks are individually optimized, the whole process cannot reach the optimal solution because the locations of the proposals obtained by the detection method differ from the pre-cropped image patches. Therefore, we utilize prior knowledge of ships and propose the key sub-region feature fusion network (KSFFN), which fuses the features of discriminative sub-regions with the whole-target feature and combines detection and fine-grained recognition into one framework. Method KSFFN uses ResNet-50 as the backbone network to extract features and constructs a proposal locating network by combining Faster R-CNN with the region of interest (ROI) Transformer to obtain proposal locations. Then, all proposals are ranked according to the probability of being targets, and the proposals with low probability are filtered out. Next, the multi-level feature fusion recognition network (MLFFRN) is proposed to extract features from the proposals generated by the proposal locating network and to classify them. First, each proposal is separated into several sub-regions along its axis, and the overall features and sub-region features are extracted from different levels of the feature pyramid. Then, the self-supervision mechanism in the navigator-teacher-scrutinizer network (NTS-Net) finds the key sub-region that may contain parts important for fine-grained recognition. Because of the limitations of image quality and target characteristics, not all targets have a highly discriminative sub-region, and the self-supervision mechanism in NTS-Net cannot reflect this. Therefore, the information from all sub-regions in the proposal is utilized to calculate the discriminant significance of each sub-region, which reflects its influence on target recognition. Based on the discriminant significance, the weight of each sub-region is calculated, and the key sub-region features are fused with the overall features according to these weights. The combined feature is used to obtain the final classification result, thereby improving the accuracy of fine-grained recognition of ship targets. Result The public high resolution ship collection 2016 (HRSC2016) dataset L3 task and the self-built fine-grained ships in aerial images dataset (FGSAID) are used to evaluate the model. The HRSC2016 dataset contains 1 061 images with 2 886 ships divided into 19 types. The FGSAID dataset contains 1 690 images with 5 410 ships divided into 45 types. The average precision (AP) is used as the evaluation metric, and the intersection over union threshold is set to 0.5 to determine whether a prediction box matches the ground truth. On the HRSC2016 dataset L3 task, the proposed method achieves an AP of 77.3%, and MLFFRN improves the AP by 6.3%. On the FGSAID dataset, our method achieves an AP of 71.5%. A series of ablation experiments is conducted on the HRSC2016 dataset L3 task to show the effectiveness of the different parts of the proposed method. In addition, the proposed method is compared with state-of-the-art deep-learning-based ship detection frameworks on the two datasets. The experimental results show that our model outperforms all other methods on both datasets. Compared with the single-shot alignment network (S2ANet), the proposed method increases the AP by 7.8% and 8.9% on HRSC2016 and FGSAID, respectively. In particular, the AP of the proposed method increases by 16.7%, 11.1%, and 1.1% for aircraft carrier/amphibious assault ships, other warships, and merchant ships, respectively, on the FGSAID dataset. Conclusion In this study, the end-to-end fine-grained ship detection and recognition network KSFFN is proposed. It extracts the overall features and sub-region features of the proposals and fuses them according to their discriminant significance. The proposed method combines detection and fine-grained recognition into one framework, thereby greatly improving the processing speed while performing excellently. Thus, KSFFN has great application value. The proposed method has a more powerful classification framework and can achieve more accurate results than the existing detection methods. The experimental results show that our method outperforms several state-of-the-art deep-learning-based ship detection frameworks, proving the effectiveness of KSFFN.
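The discriminant-significance weighting and fusion step can be sketched roughly as follows in PyTorch; the linear scoring head, the concatenation-based fusion, and the class name `SubRegionFusion` are illustrative assumptions rather than the exact KSFFN design.

```python
# Illustrative sketch: weight sub-region features and fuse them with the whole-target feature.
import torch
import torch.nn as nn

class SubRegionFusion(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)              # significance score per sub-region
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, whole_feat: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # whole_feat: (N, D) feature of the whole proposal; region_feats: (N, R, D).
        weights = self.score(region_feats).softmax(dim=1)       # (N, R, 1)
        key_feat = (weights * region_feats).sum(dim=1)          # weighted sum of sub-regions
        return self.fuse(torch.cat([whole_feat, key_feat], dim=-1))

# Toy usage: fuse 4 axis-wise sub-regions with the overall feature before classification.
fusion = SubRegionFusion()
fused = fusion(torch.randn(2, 256), torch.randn(2, 4, 256))
```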
- Interferometric phase denoising combining global context and fused attention Zeng Qingwang, Dong Zhangyu, Yang Xuezhi, Chong Fating doi:10.11834/jig.220562
20-09-2023
Abstract: Objective Interferometric phase noise is introduced by three types of inherent factors: 1) system noise, such as thermal noise and synthetic aperture radar (SAR) speckle noise; 2) decoherence problems, including baseline, temporal, and spatial decoherence; and 3) signal processing errors, such as misregistration. The existence of noise increases the difficulty of phase unwrapping and can even cause the process to fail, thereby seriously interfering with the final interferometric result. Therefore, interferometric phase denoising is a key link in interferometric SAR (InSAR) technology, and its effect has an important influence on the accuracy of the measurement results. The existing interferometric phase denoising algorithms still have many defects. First is an insufficient ability to capture global contextual information. Some algorithms ignore global context information or focus only on local context derived from a few pixels; this lack of global context manifests as unstable detail preservation in the denoising results. Second, many researchers pay attention only to the influence of either the spatial dimension or the channel dimension of the image on the denoising result when improving denoising networks, but they do not use the spatial and channel dimensions in combination. Third, the high-level features extracted from the deep layers of a convolutional neural network have rich semantic information but ambiguous spatial details, whereas the low-level features extracted from the shallow layers of the network contain considerable pixel-level noise information; these features are isolated from one another and thus cannot be fully exploited. Method Most of the existing interferometric phase denoising methods focus on local features and have many limitations in feature extraction. A phase denoising network called GCFA-PDNet is proposed to solve these problems while balancing the relationship between denoising and structure preservation. The proposed phase denoising network combines global context and fused attention. The method separates the interferometric phase into its real and imaginary parts and inputs them into the network. First, shallow features are extracted from the noisy phase. Then, they are mapped to the feature enhancement module, which is composed of the global context extraction module and the fused attention module. The shallow features extracted by the network are concurrently fused with the deep features. Finally, a denoised image is generated through global residual learning. Four global context extraction modules and four fused attention modules are used in the whole network. The core of the global context extraction module is the global context block, which can extract global context information and has the advantages of nonlocal methods. The fused attention module fuses the features extracted by its two submodules, the channel attention block and the spatial attention block; it emphasizes key features and efficiently extracts noise information hidden in complex backgrounds. Result We present the experimental results of six methods: Goldstein, the nonlocal interferogram estimator (NL-InSAR), InSAR block-matching 3D (InSAR-BM3D), a deep learning framework for SAR interferometric phase restoration and coherence estimation (DeepInSAR), a phase filtering network, and GCFA-PDNet. Different evaluation indicators are selected for different datasets to evaluate the advantages and disadvantages of the various algorithms objectively. For the experiments on simulated images, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are selected as evaluation indicators. A large PSNR indicates a small difference between the filtered phase and the clean phase; however, PSNR does not consider the correlation between pixels in the image. Therefore, SSIM is also employed to evaluate the overall quality of the denoised images. Compared with the comparative methods, the proposed method improves the average PSNR and SSIM indicators of the simulated data results by 5.72% and 2.94%, respectively. For the real interferometric phase, the above two indicators cannot be used to evaluate the denoising performance because no noise-free image is available as a reference. The number of residues (NOR) and the phase standard deviation (PSD) can instead be used as objective evaluation indicators of the denoising performance on real interferometric phase images. NOR reflects the ability of a filtering method to suppress noise: the smaller the NOR of the filtered interferometric phase is, the stronger the noise suppression ability is. PSD measures the dispersion of the noise distribution: the smaller the PSD value is, the more concentrated the noise distribution is and the better the quality of the interferogram is. On the real data, compared with the comparative methods, the proposed method improves the average residual-point reduction percentage and the PSD indicator by 2.01% and 3.57%, respectively. Visual observation of the experimental results of the various algorithms shows that the proposed method achieves the best denoising results. The qualitative and quantitative analyses also indicate that the proposed method outperforms the five other types of phase denoising methods. Conclusion The phase denoising network designed in this study combines global context and fused attention. Thus, it has certain advantages over other related algorithms and a more powerful feature extraction ability than other methods. The network focuses on global context information and emphasizes key features. Thus, it can preserve the original phase details while enhancing the denoising ability, thereby achieving the best denoising results.
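The input representation and global residual learning described in the Method section can be sketched as follows: the wrapped phase is split into cosine (real) and sine (imaginary) channels, passed through a small convolutional body, and combined with a global residual before being converted back to a wrapped phase. The tiny three-layer body and the class name `TinyPhaseDenoiser` are placeholders, not GCFA-PDNet itself.

```python
# Illustrative sketch of real/imaginary phase input with global residual learning.
import math
import torch
import torch.nn as nn

class TinyPhaseDenoiser(nn.Module):
    def __init__(self, width: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 2, 3, padding=1),
        )

    def forward(self, noisy_phase: torch.Tensor) -> torch.Tensor:
        # noisy_phase: (N, 1, H, W), wrapped phase in radians.
        x = torch.cat([torch.cos(noisy_phase), torch.sin(noisy_phase)], dim=1)
        x = x + self.body(x)                              # global residual learning
        return torch.atan2(x[:, 1:2], x[:, 0:1])          # back to a wrapped phase

# Toy usage on a random wrapped-phase patch in [-pi, pi).
net = TinyPhaseDenoiser()
denoised = net(torch.rand(1, 1, 128, 128) * 2 * math.pi - math.pi)
```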