Latest Issue

    Vol. 29, Issue 8, 2024

      Fusion and Intelligent Interpretation for Multi-source Remote Sensing Data

    • Cao Liqin, Wang Du, Xiong Haiyang, Zhong Yanfei
      Vol. 29, Issue 8, Pages: 2089-2112(2024) DOI: 10.11834/jig.230738
      Abstract: Longwave infrared (LWIR) hyperspectral remote sensing images offer a wealth of spectral information alongside land surface temperature (LST) data, which make them invaluable for discerning solid-phase materials and gases. This capability holds significant implications across diverse domains, including mineral identification, environmental monitoring, and military applications. However, the underlying phenomenology and environmental interactions of emissive regions within the LWIR spectrum significantly diverge from those observed in reflective regions. This divergence impacts various facets of thermal infrared (IR) hyperspectral image (HSI) analysis, which span from sensor design considerations to data exploitation methodologies. Compounding this complexity are the intertwined influences of factors such as LST, emissivity, atmospheric profiles, and instrumental noise, which lead to challenges such as subtle distinctions between background noise and target signals, as well as inaccuracies in signal separation within thermal IR hyperspectral observation data. Consequently, the effective extraction of thermal IR HSI information poses formidable challenges for practical application implementation. In this study, we systematically review methods for LWIR HSI information extraction by drawing upon ongoing research progress and addressing prevailing challenges in LWIR hyperspectral remote sensing. Our examination encompasses four primary areas: 1) LST and emissivity inversion: LWIR hyperspectral remote sensing serves as a potent tool for large-scale LST and land surface emissivity (LSE) monitoring. However, the accurate retrieval of LST and LSE is fraught with complexity owing to their intricate coupling with atmospheric components, as delineated by the radiative transfer equation. We discuss two broad approaches, namely, the two-step method and the integration method, for mitigating this ill-posed problem. The former entails atmospheric compensation (AC) and temperature and emissivity separation (TES), where AC filters out atmospheric influences to isolate ground-leaving radiance from at-sensor radiance. Subsequently, TES methods are employed to estimate LST and LSE. Given the propensity for inaccurate AC to introduce accumulation errors and compromise retrieval accuracy, integration methods capable of simultaneous AC and TES are also reviewed, with deep learning-driven methods exemplifying a typical integration approach. 2) LWIR hyperspectral mixed spectral decomposition: spectral mixture analysis (SMA) involves identifying and extracting endmember spectra in a scene to determine the abundance of each endmember within each pixel. Most applications of LWIR SMA focus on mineral detection and classification. Unlike the reflectance of a mixed pixel, which is defined as a linear combination within the pixel, the emissivity of a mixed pixel is not as straightforward to define because the measured radiance depends on the emissivity and temperature of each material. The mixed spectral decomposition methods for isothermal and non-isothermal pixels are discussed. When a pixel is isothermal, the isothermal mixture model is identical to the mixture for reflectance after removing the temperature component. However, as a pixel becomes non-isothermal, unmixing methods are necessary to handle the nonlinearity resulting from temperature variations. This study summarizes these methods and highlights the challenges associated with SMA. 3) Classification: classification tasks in the LWIR domain involve successfully modeling and classifying background materials, such as minerals and vegetation mapping. However, the scarcity of prominent spectral features in the LWIR spectrum complicates the remote differentiation of natural land surfaces. For instance, various materials such as paints, water, soil, road surfaces, and vegetation exhibit spectral emissivities ranging between 0.8 and 0.95. Moreover, although spectral emissivity variations exist among different materials, they are less conspicuous compared with the reflective region. In addition, the uncertainty associated with retrieved emissivity challenges the classification of different materials. This study reviews traditional machine learning classification methods, including spectral-based, spatial-based, and spectral-spatial integration-based methods, alongside deep learning approaches, by summarizing the advancements in these processing methods. 4) Target detection: target detection encompasses solid- and gas-phase targets. The algorithms employed for LWIR HSI analysis, which are similar to those used for visible-near IR and shortwave IR HSI, have reached a mature stage and are reviewed in this work, including matched filter-type algorithms, among others. Challenges in target detection mirror those encountered in classification tasks, given that spectral emissivity variations in the LWIR tend to be smaller than the corresponding spectral reflectance variations observed in the reflective region. Consequently, the performance of solid-phase target detection algorithms in real-world applications is impacted, particularly by their sensitivity to target-background model mismatch arising from similar emissivities and other errors. By contrast, gas detection in the LWIR domain relies on selective absorption and emission phenomena, particularly by chemical vapors, which exhibit narrow spectral features. Although gas plumes may span a large number of pixels, their detection depends on factors such as concentration, signature strength, and temperature contrast with the background materials. Consequently, chemical detection applications necessitate rigorous physical processes in airborne (down-looking) and standoff (side-looking) configurations. In addition to employing strict physical and statistical models, gas detection methodologies are increasingly integrating deep learning models. This trend reflects the recognition of the potential of deep learning in enhancing the capabilities of gas detection algorithms. Our discourse concludes with a discussion on the future trajectory and research direction of extracting information from thermal IR hyperspectral remote sensing images. Despite the continual integration of new technical methodologies such as deep learning, the computational intricacies inherent in LWIR hyperspectral remote sensing underscore the necessity of approaches that combine physical mechanisms with machine learning models. These hybrid methodologies hold promise in addressing the multifaceted challenges associated with LWIR HSI analysis, which paves the way for enhanced information extraction and practical application implementations.
      Keywords: infrared hyperspectral; land surface parameter retrieval; spectral unmixing; classification; target detection
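      As background for the retrieval discussion in this abstract, the at-sensor radiance in the emissive LWIR region is commonly modeled with the thermal radiative transfer equation below (a standard textbook formulation given for orientation; the notation is ours, not necessarily that of the paper):

      L_s(\lambda) = \tau(\lambda)\left[\varepsilon(\lambda)\,B(\lambda, T_s) + \bigl(1 - \varepsilon(\lambda)\bigr)\,L_d^{\downarrow}(\lambda)\right] + L_u^{\uparrow}(\lambda)

      Here \tau(\lambda) is the atmospheric transmittance, \varepsilon(\lambda) the land surface emissivity, B(\lambda, T_s) the Planck blackbody radiance at surface temperature T_s, and L_d^{\downarrow}, L_u^{\uparrow} the downwelling and upwelling atmospheric radiances. Atmospheric compensation estimates \tau, L_u^{\uparrow}, and L_d^{\downarrow} to recover the ground-leaving radiance; temperature and emissivity separation must then resolve N + 1 unknowns (N band emissivities plus one temperature) from N spectral measurements, which is why the retrieval is ill posed.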
      Published: 2024-08-13
    • Zhang Tao, Wang Binfeng, Fu Ying, Liu Songrong, Ye Jichao, Shan Peihong, Yan Chenggang
      Vol. 29, Issue 8, Pages: 2113-2136(2024) DOI: 10.11834/jig.230747
      Abstract: The goal of spectral image super-resolution technology is to recover images with high spatial resolution and spectral resolution from images with low spatial resolution and spectral resolution. Images of high spatial and spectral resolution are widely used in remote sensing fields such as vegetation survey, geological exploration, environmental protection, anomaly detection, and target tracking. With the rise of deep learning, spectral image super-resolution algorithms based on deep learning have emerged. In particular, the emergence of technologies such as end-to-end neural networks, generative adversarial networks, and deep unfolding networks has made a qualitative leap in spectral image super-resolution performance. This study comprehensively discusses and analyzes cutting-edge deep learning algorithms under different spectral image super-resolution task scenarios. First, we introduce the basic concepts of spectral image super-resolution and the definitions of different super-resolution scenarios. Focusing on the two major scenarios of single-image super-resolution and fusion super-resolution, the basic concepts of various methods are elaborated from multiple perspectives such as super-resolution dimensions, super-resolution data types, basic frameworks, and supervision methods, and their characteristics are discussed. Second, this study summarizes the limitations of various algorithms and proposes directions for further improvement. Furthermore, the commonly used datasets in different fusion scenarios are briefly introduced, and the specific definitions of various evaluation indicators are clarified. For each super-resolution task, this study comprehensively compares the performance of representative algorithms from multiple perspectives such as qualitative evaluation and quantitative evaluation. Finally, this study summarizes the research results and discusses some serious challenges faced in the field of spectral image super-resolution, while also looking forward to possible future research directions. First, from the perspective of super-resolution scenarios, the existing spectral image super-resolution algorithms can be divided into two categories, namely, single-image super-resolution and fusion-based super-resolution. Specifically, single spectral image super-resolution is designed to generate high-resolution output images from a single low-resolution input image. According to the direction of super-resolution, single-image super-resolution can be divided into spatial super-resolution, spectral super-resolution, and spatial-spectral super-resolution. Fusion-based spectral image super-resolution is designed to fuse images of different modes into a single image with high spatial and spectral resolution. According to the different modes of fusion images, fusion-based spectral image super-resolution can be divided into pansharpening and multispectral and hyperspectral image fusion. Moreover, deep learning-based spectral image super-resolution methods can be categorized into end-to-end neural network-based (E2EN-based), generative adversarial network-based (GAN-based), and deep unfolding network-based (DUN-based) spectral image super-resolution frameworks according to the network architecture. The E2EN-based spectral image super-resolution framework designs various network structures to mine nonlinear mapping relationships between low-resolution and high-resolution images. According to the basic computing unit of the network structure, it can be divided into convolutional neural network-based methods and Transformer-based methods. The GAN-based spectral image super-resolution framework realizes spectral image super-resolution through the game between the generator and the discriminator. The DUN-based spectral image super-resolution framework combines traditional optimization algorithms and deep learning, and it unfolds iterative optimization steps to form deep neural networks. From the perspective of supervision paradigm, the deep learning algorithms can also be classified into unsupervised and supervised categories. The supervised approaches minimize the distance between the super-resolved spectral image and the ground truth, while unsupervised algorithms design loss functions through the similarity between super-resolved and input images or through the game of the generator and the discriminator. Our critical review describes the main concepts and characteristics of each approach for different spectral image super-resolution tasks according to the network architecture and supervision paradigm. Second, we introduce the representative datasets and evaluation metrics. We divide the datasets into categories of single spectral image super-resolution datasets and fusion-based spectral image super-resolution datasets. Furthermore, the evaluation metrics can be grouped into full-reference metrics and no-reference metrics. Some full-reference metrics are widely used for the quantitative evaluation of spectral image super-resolution, including peak signal-to-noise ratio, structural similarity, spectral angle mapper, and relative dimensionless global error in synthesis. Third, we provide the quantitative and qualitative experimental results of different spectral image super-resolution tasks. Finally, we summarize the challenges and problems in the study of deep learning-based spectral image super-resolution and conduct forecasting analysis, such as high-quality spectral image super-resolution datasets, model-driven and deep learning combined spectral image super-resolution methods, real-time spectral image super-resolution, and comprehensive evaluation metrics. The methods and datasets mentioned are linked at https://github.com/ColinTaoZhang/DL-based-spectral-super-resolution.
      Keywords: deep learning; super-resolution; spectral image; single image super-resolution; fusion-based super-resolution
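      As a concrete reference for the full-reference metrics named in this abstract, the sketch below computes two of them, peak signal-to-noise ratio and spectral angle mapper, for H x W x C image cubes; it is a minimal illustration with our own function names, not code from the surveyed repository.

      import numpy as np

      def psnr(ref, est, data_range=1.0):
          """Peak signal-to-noise ratio between a reference and a super-resolved cube."""
          mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
          return 10.0 * np.log10(data_range ** 2 / mse)

      def sam(ref, est, eps=1e-12):
          """Mean spectral angle mapper (degrees) over all pixels of an H x W x C cube."""
          r = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
          e = est.reshape(-1, est.shape[-1]).astype(np.float64)
          cos = np.sum(r * e, axis=1) / (np.linalg.norm(r, axis=1) * np.linalg.norm(e, axis=1) + eps)
          return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())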
      Published: 2024-08-13
    • Zhu Bai, Ye Yuanxin
      Vol. 29, Issue 8, Pages: 2137-2161(2024) DOI: 10.11834/jig.230737
      Abstract: The advent of new infrastructure construction and the era of intelligent photogrammetry have facilitated the rapid development of global aerospace and aviation remote sensing technology. Numerous multi-sensors integrating stereoscopic observation facilities have been launched from spaceborne, airborne, and terrestrial platforms, and the types of sensors have also developed from traditional single-mode sensors (e.g., optical sensors) to a new generation of multimodal sensors (e.g., multispectral, hyperspectral, light detection and ranging (LiDAR), and synthetic aperture radar (SAR) sensors). These advanced sensor devices can dynamically provide multimodal remote sensing images with different spatial, temporal, and spectral resolutions. They can obtain more reliable, comprehensive, and accurate observation results than single-modal sensors through joint processing of spaceborne, airborne, and terrestrial multimodal data. Therefore, investigating multimodal remote sensing image registration has great scientific significance. Multi-level and multi-perspective Earth observation can be effectively achieved only by fully integrating and utilizing various multimodal remote sensing images. To promote the development of multimodal remote sensing image registration research technology, we systematically sort out, analyze, introduce, and summarize the current mainstream registration methods for multimodal remote sensing images. We first sort out the research development and evolution process from single-modal to multimodal remote sensing image registration. We then analyze the core ideas of representative algorithms among area-based, feature-based, and deep-learning-based pipelines, while the contribution of the author team in the field of multimodal remote sensing image registration is introduced. The area-based registration (template matching) pipeline mainly includes two types: information theory-based and structural feature-based registration methods. The structural feature-based method consists of sparse structural features and dense structural features. From the perspective of the robustness and efficiency of comprehensive registration, dense-structure-feature-based methods have obvious effectiveness and advantages in handling significant nonlinear radiation differences between multimodal remote sensing images and can meet many current application needs. By contrast, the area-based registration pipeline generally relies on geo-referencing of remote sensing images to predict the rough range of template matching. Feature-based registration methods can be refined into three categories: feature registration based on gradient optimization, local self-similarity (LSS), and phase consistency. The feature registration of gradient optimization usually designs consistent gradients for specific multimodal images. The generalization of this type of method based on gradient optimization is generally poor, and it has difficulty maintaining the same performance on other types of multimodal images. The feature registration of LSS also has limitations, given that the relatively low discriminative power of LSS descriptors may result in the inability to maintain robust matching performance in the presence of complex nonlinear radiation differences. The feature registration of phase consistency has high computational complexity, and the registration process is generally time consuming. The feature-based registration pipeline utilizes the local spatial relationship between adjacent pixels to construct a high-dimensional information feature vector for each feature point. Compared with template matching methods, feature-based methods usually face a heavy computational burden, and serious outliers are prone to occur in matching, especially in multimodal registration situations where scale, rotation, and radiation differences exist simultaneously. In general, the registration robustness of feature-based methods is not as stable as that of area-based methods. The deep-learning-based pipeline can be divided into modular and end-to-end registration methods. The most common strategy for modular registration methods is to embed deep networks into feature-based or region-based methods. This approach takes advantage of the complete data-driven and high-dimensional deep feature extraction capability of deep learning to generate more robust features or more effective descriptors or similarity measures, which improves the robustness of image registration. Modular registration methods can be subdivided into three categories: learning-based template matching, learning-based feature matching, and style transfer-based modal unification. Modular registration methods are easy to train and have strong flexibility, but they have difficulty avoiding the error accumulation problem that easily occurs in multi-stage tasks and may fall into local optima. The end-to-end registration methods directly estimate the geometric transformation parameters or deformation field to achieve image registration by constructing an end-to-end neural network structure. The training objectives of the end-to-end network are consistent, so a globally optimal solution can be obtained. However, some problems arise, such as high training difficulty and poor interpretability. Moreover, no complete and comprehensive database containing all types of multimodal remote sensing image pairs is available to date, and the lack of training and testing data greatly limits the development of deep learning-based registration methods. Furthermore, we share existing public registration datasets of multimodal remote sensing images, supplemented by a small number of registration datasets from the field of computer vision. Finally, the existing problems and challenges in the current research on high-precision registration of multimodal remote sensing images are analyzed. A forward-looking outlook on the development trend of future research is given, which aims to promote further breakthroughs and innovations in the field of multimodal remote sensing image registration.
      Keywords: remote sensing; sensors; multimodal images; image registration; registration datasets
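      To make the area-based (template matching) pipeline concrete, the toy sketch below scores candidate offsets with zero-normalized cross-correlation; this is an illustrative baseline under our own assumptions, whereas the structural-feature methods reviewed above replace raw intensities with dense structural descriptors to cope with nonlinear radiation differences between modalities.

      import numpy as np

      def zncc(template, window, eps=1e-12):
          """Zero-normalized cross-correlation between a template and an equal-sized window."""
          t = template - template.mean()
          w = window - window.mean()
          return float((t * w).sum() / (np.linalg.norm(t) * np.linalg.norm(w) + eps))

      def match_template(search, template):
          """Slide the template over the search image and return the best-scoring offset."""
          th, tw = template.shape
          best_score, best_offset = -np.inf, (0, 0)
          for r in range(search.shape[0] - th + 1):
              for c in range(search.shape[1] - tw + 1):
                  score = zncc(template, search[r:r + th, c:c + tw])
                  if score > best_score:
                      best_score, best_offset = score, (r, c)
          return best_offset, best_score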
      Published: 2024-08-13
    • Wang Shengke, Wang Xiandong, Qu Liang, Yao Fengqin, Liu Yingying, Li Conghui, Wang Yuzhen, Zhong Guoqiang
      Vol. 29, Issue 8, Pages: 2162-2174(2024) DOI: 10.11834/jig.230670
      Abstract: Objective: The coastal ecosystem is a natural system composed of biological communities and their interactions with the environment, including typical coastal ecosystems such as mangroves, salt marshes, coral reefs, seagrass beds, oyster reefs, and sandy shores, as well as complex ecosystems like estuaries and bays. These ecosystems play a crucial role in maintaining high-quality ecological environments and fostering rich marine biodiversity. A healthy coastal ecosystem is not only a crucial support for the sustainable economic development of China but also an essential component of the ecological security of the country. The application of semantic segmentation techniques in remote sensing imagery has provided an effective means for the precise monitoring of coastal ecosystems, which offers scientists, ecologists, and decision makers clear and highly comprehensive information to understand the current state and changing trends of coastal ecosystems. However, a significant challenge persists: a specialized, comprehensive, and fine-grained data support system for coastal ecosystems is lacking, which causes difficulty in accurately understanding the distribution, area, and changes in ecosystems such as salt marshes, seagrass beds, and reed beds. This challenge has become a pressing issue in the current national marine ecological conservation efforts. Currently, the monitoring of coastal ecosystems relies primarily on satellite remote sensing and traditional surveying methods. Satellite remote sensing, with its unique advantages of all-weather, all-day, large-scale, and long-time observation, is widely used for monitoring marine ecology and resources through the analysis of satellite data. However, the spatial resolution of satellite remote sensing images has limitations, which introduces errors in cases that require fine-scale monitoring. For example, in narrow rivers, small wetland areas, or islands, the limited spatial resolution may result in unclear visibility of small features in the images. Traditional surveying methods often require professionals to conduct field surveys, which poses safety hazards in complex and risky environments. In addition, these methods are susceptible to human and natural factors, which leads to challenges in precise positioning and depiction, coupled with extended and inefficient monitoring cycles. Field surveys and mapping work for coastal ecosystems therefore face significant challenges using traditional methods.
      Method: This study utilized unmanned aerial vehicles (UAVs) to capture, collect, and annotate data from typical coastal ecosystems in real time for addressing urgent issues in marine ecological conservation. This effort led to the establishment of the OUC-UAV-SEG dataset, which includes various typical vegetation types found in coastal ecosystems, such as reed, seagrass beds, and spartina, and, crucially, addresses a key marine event: oil spills.
      Result: In contrast to previous research that predominantly focused on individual dataset scenarios, such as spartina or oil spills, using traditional remote sensing methods for data analysis, this study stands out by using statistical methods to conduct a detailed quantitative analysis of OUC-UAV-SEG, which covers various categories and their respective quantities. The dataset poses several challenges, including the discrete clustered morphology of eelgrass, interlaced spiderweb-like oil spills, fragile tubular Sargassum, and mottled tufted seagrass beds, and existing segmentation algorithms exhibit suboptimal performance in handling these distinctive features. Finally, classical visual semantic segmentation algorithms were applied to evaluate OUC-UAV-SEG by conducting benchmark tests to assess the performance of currently available semantic segmentation algorithms on this dataset and revealing their limitations.
      Conclusion: The establishment of the OUC-UAV-SEG dataset marks a significant step forward in providing a novel resource for monitoring coastal ecosystems. This dataset empowers scientists, ecologists, and decision makers with a more comprehensive understanding of the current conditions and evolving trends in coastal ecosystems. These insights are crucial for furnishing more accurate information to support marine ecological conservation and management efforts, which holds positive significance. Future efforts will focus on expanding the diversity of the dataset by collecting data on marine ecological environments under more extensive geographic and meteorological conditions to comprehensively reflect ecosystem diversity. Furthermore, by regularly organizing workshops and challenges, the aim is to advance cutting-edge research in the field of coastal ecosystem monitoring. Finally, active use of social media and other channels will be employed to engage with the public for raising awareness of marine environmental issues and inspiring more people to participate in the monitoring and protection of coastal ecosystems. For comprehensive details about this dataset, please refer to the project link: https://github.com/OucCVLab/OUC-UAV-SEG.
      Keywords: coastal ecosystem; remote sensing; unmanned aerial vehicle (UAV); benchmark dataset; semantic segmentation
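      The abstract does not spell out the benchmark metrics, so as an illustrative assumption the sketch below computes the mean intersection-over-union that semantic segmentation benchmarks of this kind typically report; the names and conventions are ours.

      import numpy as np

      def confusion_matrix(gt, pred, num_classes):
          """Accumulate a num_classes x num_classes confusion matrix from label maps."""
          mask = (gt >= 0) & (gt < num_classes)
          idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
          return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

      def mean_iou(conf):
          """Per-class IoU and its mean over classes that appear in either map."""
          inter = np.diag(conf).astype(np.float64)
          union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
          iou = inter / np.maximum(union, 1)
          valid = union > 0
          return iou, float(iou[valid].mean())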
      Published: 2024-08-13
    • Ma Xiaorui, Ha Lin, Shen Dunbin, Mei Liang, Wang Hongyu
      Vol. 29, Issue 8, Pages: 2175-2187(2024) DOI: 10.11834/jig.230503
      Abstract: Objective: A hyperspectral image is a 3D data cube of “space-spectrum integration”. Its high-resolution spectral information can realize fine-grained land cover identification, and its wide-coverage spatial information can complete accurate land cover mapping. Therefore, hyperspectral images with spatial information and spectral information are widely used in tasks related to land cover classification. Thus, hyperspectral image classification is an important research topic in the remote sensing field and a key supporting technology for many Earth observation tasks, such as smart cities, precision agriculture, and modern national defense. In recent years, many classification methods for hyperspectral images have been proposed to mine spatial and spectral information based on different deep networks, and they have achieved unprecedented high-precision classification results. However, due to various factors such as changes in the acquisition environment and differences in imaging sensors, the feature distribution of different hyperspectral images is shifted, which leads to the difficulty of cross-dataset classification. For this reason, existing classification methods usually retrain the model to deal with a new hyperspectral image dataset, which is label intensive and time consuming. In the era of remote sensing big data, developing classification methods for cross-dataset hyperspectral images is important. Therefore, this study investigates classification methods for cross-dataset hyperspectral images to achieve large-scale Earth observation missions.
      Method: This study proposes an unsupervised classification method for cross-dataset hyperspectral images based on feature optimization. The proposed method consists of three main modules. First, a feature balancing strategy is proposed to optimize the intra-dataset features independently. During the adversarial domain adaptation process, the transferability and discriminability of features are contradictory, and most existing methods sacrifice the feature discriminability of the target dataset, which results in blurred class boundaries and affects classification performance. In the proposed method, a regularization term based on the singular values of the feature vectors from the source and target datasets is minimized to enhance the transferability and discriminability of the learned features. By extracting better features, this method achieves more accurate classification results on the target dataset. Second, a feature matching strategy is proposed to optimize the inter-dataset features collaboratively. No labeled sample is available in the target dataset, and feature discrepancies are obvious between the source dataset and the target dataset. Thus, the model cannot accurately match the two datasets, which leads to inadequate generalization. In the proposed method, an implicit feature augmentation strategy is performed to guide the source features to the target space, which improves the generalization performance of the model. By utilizing the underlying relationships between different datasets, this method adapts better to the target dataset and improves the overall performance of the classification model. Finally, an adversarial learning framework based on an implicit discriminator is designed to optimize inter-dataset class-level features. Existing adversarial learning methods often construct an additional discriminator or use a binary classifier as a discriminator. The former focuses only on feature confusion between datasets and ignores class-level information, and the latter considers only class-level differences, which leads to ambiguous predictions. In the proposed method, by reusing the task classifier as an implicit discriminator, inter-dataset alignment and cross-dataset class recognition are achieved. By further optimizing the adversarial learning method, this approach can further enhance the classification performance of the model on the target dataset.
      Result: All experiments in this study are executed on a desktop computer with an Intel Core i7 4.0 GHz CPU, a GeForce GTX 1080Ti GPU, and 32 GB memory. PyTorch, which is a widely used deep learning framework, is used in the experiment. The experiments are conducted on the Pavia and HyRANK datasets. The evaluation indexes include overall accuracy (OA), average accuracy (AA), and the kappa coefficient. In addition, the classification results are intuitively represented by classification maps. The experimental results are compared with various recent classification methods for cross-dataset hyperspectral images trained with all labeled source samples and unlabeled target samples. The proposed method is optimized by a mini-batch SGD optimizer with a momentum of 0.9. The learning rates for the Pavia and HyRANK datasets are set to 0.0001 and 0.001, respectively. The maximum number of iterations is set to 2000. On the Pavia datasets, OA, AA, and kappa values increase by 1.75%, 3.55%, and 2.17%, respectively, compared with the model with the second-best performance. On the HyRANK datasets, OA, AA, and kappa values increase by 6.58%, 13.10%, and 7.96%, respectively, compared with the second-ranked model. Experimental results show that, compared with other methods, the classification maps produced by the proposed method are closer to the ground truth, and the evaluation indexes of some categories are significantly improved. Moreover, an ablation experiment is conducted to study the effect of each module of the proposed method, which proves that each module is effective in improving the cross-dataset classification effect of hyperspectral images.
      Conclusion: In this study, a feature-optimized classification method for cross-dataset hyperspectral images is proposed. The proposed method provides a novel solution for unsupervised classification of cross-dataset hyperspectral images. By combining feature equalization, feature matching, and adversarial learning techniques, the method improves the generalization capability and classification performance of the model. Thus, it is an effective approach for cross-dataset image classification tasks. The proposed method is verified on two hyperspectral datasets, and the experimental results show that the proposed method can significantly improve the accuracy of cross-dataset hyperspectral image classification under unsupervised conditions compared with related methods.
      Keywords: hyperspectral image classification; cross-dataset classification; feature optimization; domain adaptation; unsupervised classification; domain adversarial network
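      The feature balancing strategy is only described at a high level above; the PyTorch sketch below shows one plausible form of a singular-value regularizer over source and target feature batches, in the spirit of batch spectral penalization. The function name, the top-k choice, and the loss weighting are our assumptions, not the exact formulation of the paper.

      import torch

      def singular_value_penalty(feat_src, feat_tgt, k=1):
          """Penalize the top-k singular values of the (batch x dim) source and target
          feature matrices so that no single direction dominates, one way to trade off
          transferability against discriminability."""
          penalty = feat_src.new_zeros(())
          for feat in (feat_src, feat_tgt):
              s = torch.linalg.svdvals(feat)  # singular values in descending order
              penalty = penalty + (s[:k] ** 2).sum()
          return penalty

      # hypothetical use: total = cls_loss + adv_loss + lam * singular_value_penalty(f_s, f_t)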
      Published: 2024-08-13
    • Jia Meng, Zhao Qin, Lu Xiaofeng
      Vol. 29, Issue 8, Pages: 2188-2204(2024) DOI: 10.11834/jig.230497
      Abstract: Objective: Heterogeneous remote sensing images from different sensors are quite different in imaging mechanism, radiation characteristics, and geometric characteristics. Thus, they reflect the physical properties of the ground target at different levels. Therefore, no direct relationship exists between the observed values of the same object, which usually leads to “pseudo changes”. As a result, the change detection task has more difficulty obtaining accurate change information of the observed ground objects. Efforts have been made toward unsupervised detection of changes in heterogeneous remote sensing images by designing various methods to obtain change information. However, traditional image difference operators based on the difference or ratio of radiation measurements are no longer applicable. Therefore, transforming the bitemporal images into a common space is a convenient way to calculate differences. Considering the excellent and flexible feature learning capability of deep neural networks, they have been widely applied in change detection tasks for heterogeneous images to effectively alleviate the influence of “pseudo changes”. Moreover, by fully utilizing the characteristics of deep neural networks, they can be designed to transform heterogeneous remote sensing images into the same feature domain. Then, the change information can be accurately represented. Inspired by the paradigm of image translation, heterogeneous images are transformed into a common domain with consistent feature representations to enable direct comparisons of data. The key point of this task is learning a suitable one-to-one mapping to build a relationship between distinct appearances of images and exclude the effect of interference factors. Therefore, this study proposes a bipartite adversarial autoencoder network with clustering (BAACL) to detect changes between heterogeneous remote sensing images.
      Method: A bipartite adversarial autoencoder network is constructed to reconstruct the bitemporal images and achieve the transformation of heterogeneous images to a common domain. Appropriate reconstruction loss regularization should be applied to obtain a good data mapping from the constructed neural network optimization. Next, the bipartite convolutional autoencoders can be jointly trained to encode the input images and reconstruct them with high fidelity in output. Meanwhile, for the image-to-image translation task, additional structural consistency loss and adversarial loss terms are designed to constrain the network training process for converting heterogeneous images to the common data domain. The structural consistency loss term is designed to describe the internal structural relationships of images before and after translation. For this purpose, affinity matrices are adopted to express the structural self-similarity within an image. The heterogeneous images can be conveniently compared in a new affinity space. Notably, based on the paradigm of image translation, our model can be viewed as learning two “transformation functions”, which are adopted to transform each of the heterogeneous images to the opposite data space. Furthermore, such a setup can be regarded as a special case of an “adversarial mechanism”, which is formulated by an adversarial loss to train the autoencoders for matching the opposite image style. Considering the adverse effect of changed pixels on the adversarial loss term in network optimization, the pseudo-difference image generated by the two pairs of homogeneous images mapped in the common data domain is analyzed by clustering in an unsupervised manner. Thereafter, the obtained semantic information is adopted to further constrain the adversarial loss term. The overall performance of the BAACL network is illustrated on four sets of publicly available datasets of heterogeneous remote sensing images. Five popular traditional and deep learning-based detection methods for changes in heterogeneous remote sensing images are selected and compared with this method to verify its effectiveness.
      Result: Results obtained from the Italy, California, Tianhe, and Shuguang datasets illustrate the performance of the proposed BAACL, with overall detection accuracies of up to 0.9705, 0.9382, 0.9947, and 0.9826, respectively. Meanwhile, the proposed method is superior to the five compared methods in terms of visual results of the final change map. A set of ablation experiments is designed to verify the influence of the improved adversarial loss term on network optimization performance. The performance of the BAACL network does not change dramatically with various proportions of sample changing regions due to the semantic regularization, which verifies the effectiveness of the proposed semantic information-based adversarial loss term for network optimization.
      Conclusion: The semantic information-based adversarial loss term is designed to narrow the distance of the unchanged regions of the bitemporal images for image style consistency. The reason is that, for an adversarial loss term without semantic regularization, a larger proportion of the change region in the current training sample will result in a worse trained network effect. Therefore, an accurate definition of the network constraint term will result in good network optimization and further affect the change detection performance. In view of the difficulty and high false alarm rate of detection methods for changes in heterogeneous remote sensing images caused by factors such as seasons and data heterogeneity, the proposed bipartite adversarial autoencoder network for the detection of changes in heterogeneous remote sensing images not only can fully utilize network characteristics and semantic information to improve image style consistency but also can realize a completely unsupervised change detection process. The change detection performance can be greatly improved as well.
      Keywords: domain transformation; cluster analysis; semantic information; unsupervised change detection; heterogeneous remote sensing images
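      The structural consistency term above is described in terms of affinity matrices; the sketch below is a minimal reading of that idea, using Gaussian affinities over the pixels of a flattened patch. The kernel choice and loss form are our assumptions, not necessarily those of BAACL.

      import torch

      def affinity_matrix(patch, sigma=1.0):
          """Pairwise Gaussian affinity between the pixels of a flattened patch
          (shape: n_pixels x n_channels), expressing its internal self-similarity."""
          d2 = torch.cdist(patch, patch) ** 2
          return torch.exp(-d2 / (2.0 * sigma ** 2))

      def structural_consistency_loss(patch_before, patch_after):
          """Encourage a patch to keep its self-similarity structure after translation."""
          return torch.mean((affinity_matrix(patch_before) - affinity_matrix(patch_after)) ** 2)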
      Published: 2024-08-13
    • Xue Jie, Huang Hong, Pu Chunyu, Yang Yinming, Li Yuan, Liu Yingxu
      Vol. 29, Issue 8, Pages: 2205-2219(2024) DOI: 10.11834/jig.230699
      Abstract: Objective: In recent years, the development of remote sensing technology has enabled the acquisition of abundant remote sensing images and large datasets. Scene classification tasks, as key areas in remote sensing research, aim to distinguish and classify images with similar scene features by assigning fixed semantic labels to each scene image. Various scene classification methods have been proposed, including handcrafted feature- and deep learning-based methods. However, handcrafted feature-based methods have limitations in describing scene semantic information due to the high requirements for feature descriptors. Meanwhile, deep learning-based methods for scene classification of remote sensing images have shown powerful feature extraction capabilities and have been widely applied in scene classification. However, current scene classification methods mainly focus on remote sensing images with high spatial resolution, which are mostly three-channel images with limited spectral information. This limitation often leads to confusion and misclassification in visually similar categories such as geometric structures, textures, and colors. Therefore, integrating spectral information to improve the accuracy of scene classification has become an important research direction. However, existing methods have some shortcomings. For example, convolutional operations have translation invariance and are sensitive to local information, which causes difficulty in capturing remote contextual information. Meanwhile, although Transformer methods can extract long-range dependency information, they have limited capability in learning local information. Moreover, combining convolutional neural networks (CNNs) and Transformer methods incurs high computational complexity, which hinders the balance between inference efficiency and classification accuracy. This study proposes a hyperspectral scene classification method called the spatial-spectral model distillation (SSMD) network to address the aforementioned issues.
      Method: In this study, we utilize spectral information to improve the accuracy of scene classification and overcome the limitations of existing methods. First, we propose a spatial-spectral joint self-attention mechanism based on ViT (SSViT) to fully exploit the spectral information of hyperspectral images. SSViT integrates spectral information into the Transformer architecture. By exploring the intrinsic relationships between pixels and between spectra, SSViT extracts richer features. In the spatial-spectral joint mechanism, SSViT leverages the spectral information of different categories to identify the differences between them, which enables fine-grained classification of land cover and improves the accuracy of scene classification. Second, we introduce the concept of knowledge distillation to further enhance the classification performance. In the framework of teacher-student models, SSViT is used as the teacher model, and a pretrained model, that is, Visual Geometry Group 16 (VGG16), is used as the student model to capture contextual information of complex scenes. The teacher model extracts spectral information and global features among samples, while the student model focuses on capturing local features. The student model can learn and mimic the prior knowledge of the teacher model, which improves the discriminative ability of the student model. The joint training of the teacher-student models enables comprehensive extraction of land cover features, which improves the accuracy of scene classification. Specifically, the image is divided into 64 image patches in the spatial dimension and 32 spectral bands in the spectral dimension. Each patch and band can be regarded as a token. Each patch and band are flattened into row vectors and mapped to a specific dimension through a linear layer. The learned vectors are concatenated with the embedded samples for the final prediction of image classification of the teacher model. A position vector is generated and directly concatenated with the tokens mentioned above as the input to the Transformer. The multi-head attention mechanism outputs encoded representations containing information from different subspaces to model global contextual information, which improves the representation capacity and learning effectiveness of the model. Finally, feature integration is performed through a multi-layer perceptron and a classification layer to achieve classification. The process of knowledge distillation consists of two stages. The first stage optimizes the teacher and student models by minimizing the loss function with distillation coefficients. In the second stage, the student model is further adjusted using the loss function, which leverages the supervision from the performance-excellent complex model to train the simple model. This adjustment aims for higher accuracy and better classification performance. The complex model is referred to as the teacher model, while the simpler model is referred to as the student model. The training mode of knowledge distillation provides the student model with more informative content, which allows it to directly learn the generalization ability of the teacher model.
      Result: We compare our model with 10 models, including 5 traditional CNN classification methods and 5 recent scene classification methods, on 3 public datasets, namely, OHID-SC (Orbita hyperspectral image scene classification dataset), OHS-SC (another Orbita hyperspectral scene classification dataset), and HSRS-SC (hyperspectral remote sensing dataset for scene classification). The quantitative evaluation metrics include overall accuracy and standard deviation, and the confusion matrices on the three datasets are provided to clearly display the classification results of the proposed algorithm. Experimental results show that our model outperforms all other methods on the OHID-SC, OHS-SC, and HSRS-SC datasets, and the classification accuracies on the OHID-SC, OHS-SC, and HSRS-SC datasets are improved by 13.1%, 2.9%, and 0.74%, respectively, compared with the second-best model. Meanwhile, comparative experiments on the OHID-SC dataset show that the proposed algorithm can effectively improve the classification accuracy of hyperspectral scenes.
      Conclusion: In this study, the proposed SSMD network not only effectively utilizes the target spectral information of hyperspectral data but also explores the feature relationship between global and local levels, synthesizes the advantages of traditional and deep learning models, and produces more accurate classification results.
      Keywords: hyperspectral scene classification; convolutional neural network (CNN); Transformer; spatial-spectral joint self-attention mechanism; knowledge distillation (KD)
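      As a reference point for the two-stage distillation described above, the sketch below gives the generic temperature-scaled distillation loss that teacher-student training commonly minimizes; the temperature and weighting values are illustrative assumptions, not the exact coefficients used by SSMD.

      import torch.nn.functional as F

      def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
          """Weighted sum of a soft KL term against the teacher's temperature-scaled
          outputs and a hard cross-entropy term against the ground-truth labels."""
          soft = F.kl_div(
              F.log_softmax(student_logits / T, dim=1),
              F.softmax(teacher_logits / T, dim=1),
              reduction="batchmean",
          ) * (T * T)
          hard = F.cross_entropy(student_logits, labels)
          return alpha * soft + (1.0 - alpha) * hard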
      Published: 2024-08-13
    • You Xueer, Su Yuanchao, Jiang Mengying, Li Pengfei, Liu Dongsheng, Bai Jinying
      Vol. 29, Issue 8, Pages: 2220-2235(2024) DOI: 10.11834/jig.230393
      Abstract: Objective: In hyperspectral remote sensing, mixed pixels often exist due to the complex surface of natural objects and the limited spatial resolution of instruments. Mixed pixels typically refer to the situation where a pixel in a hyperspectral image contains multiple spectral features, which hinders the application of hyperspectral images in various fields such as target detection, image classification, and environmental monitoring. Therefore, the decomposition (unmixing) of mixed pixels is a main concern in the processing of hyperspectral remote sensing images. Spectral unmixing aims to overcome the limitations of image spatial resolution by extracting pure spectral signals (endmembers) representing each land cover class and their respective proportions (abundances) within each pixel. It is based on a spectral mixing model at the sub-pixel level. The rise of deep learning has brought many advanced modeling theories and architecture tools to the field of hyperspectral mixed pixel decomposition and has also spawned many deep learning-based unmixing methods. Although these methods have advantages over traditional methods in terms of information mining and generalization performance, deep networks often need to combine multiple stacked network layers to achieve optimal learning outcomes. Therefore, deep networks may damage the internal structure of the data during the training process, which leads to the loss of important information in hyperspectral data and affects the accuracy of unmixing. In addition, most existing deep learning-based unmixing methods focus only on spectral information, and the exploitation of spatial information is still limited to surface-level processing stages such as filtering and convolution. In recent years, the autoencoder has been one of the research hotspots in the field of deep learning, and many variant networks based on autoencoder networks have emerged. The Transformer is a novel deep learning network with an autoencoder-like structure. It has garnered considerable attention in various fields such as natural language processing, computer vision, and time series analysis due to its powerful feature representation capability. The Transformer, as a neural network primarily based on the self-attention mechanism, can better explore the underlying relationships among different features and more comprehensively aggregate the spectral and spatial correlations of pixels, which enhances abundance learning and improves the accuracy of unmixing. Although the Transformer network has recently been used to design unmixing methods, using unsupervised Transformer models directly to obtain features can lose many local details and cause difficulty in effectively exploiting the long-range dependency properties of Transformers.
      Method: To address these limitations, this study proposes a deep embedded Transformer network (DETN) based on the Transformer-in-Transformer architecture. This network adopts an autoencoder framework that consists of two main parts: node embedding (NE) and blind signal separation. In the first part, the input hyperspectral image is first uniformly divided twice, and the divided image patches are mapped into sub-patch sequences and patch sequences through linear transformation operations. Then, the sub-patch sequences are processed through an internal Transformer structure to obtain pixel spectral information and local spatial correlations, which are then aggregated into the patch sequences for parameter and information sharing. Finally, the local detail information in the patch sequences is retained, and the patch sequences are processed through an external Transformer structure to obtain and output pixel spectral information and global spatial correlation information containing local information. In the second part, the input NE is first reconstructed into an abundance map and smoothed during this process using a single 2D convolution layer to eliminate noise. A SoftMax layer is used to ensure the physical meaning of the abundances. Finally, a single 2D convolution layer is used to reconstruct the image, which optimizes and estimates the endmembers in the convolution layer.
      Result: To evaluate the effectiveness of the proposed method, experiments are conducted using simulated datasets and several real hyperspectral datasets, including the Samson dataset, the Jasper Ridge dataset, and a part of the real hyperspectral farmland data in Nanchang City, Jiangxi Province, obtained by the Gaofen-5 satellite provided by Beijing Shengshi Huayao Technology Co., Ltd. In addition, resources from the ZY1E satellite provided by Beijing Shengshi Huayao Technology Co., Ltd. are used to obtain partial hyperspectral data of the Marseille Port in France for comparative experiments with different methods. The experimental results are quantitatively analyzed using spectral angle distance (SAD) and root mean square error (RMSE). The proposed DETN is compared with fully constrained least squares (FCLS) and several state-of-the-art deep learning-based unmixing algorithms: deep autoencoder networks for hyperspectral unmixing (DAEN), the autoencoder network for hyperspectral unmixing with adaptive abundance smoothing (AAS), the untied denoising autoencoder with sparsity (uDAS), hyperspectral unmixing using deep image prior (UnDIP), and hyperspectral unmixing using Transformer network (DeepTrans-HSU). Results demonstrate that the proposed method outperforms the compared methods in terms of SAD, RMSE, and other evaluation metrics.
      Conclusion: The proposed method effectively captures and preserves the spectral information of pixels at local and global levels, as well as the spatial correlations among pixels. This results in accurate extraction of endmembers that match the ground truth spectral features. Moreover, the method produces smooth abundance maps with high spatial consistency, even in regions with hidden details in the image. These findings validate that the DETN method provides new technical support and theoretical references for addressing the challenges posed by mixed pixels in hyperspectral image unmixing.
      Keywords: remote sensing image processing; hyperspectral remote sensing; hyperspectral unmixing; deep learning; Transformer network
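      The blind signal separation part described above maps the node embedding to abundances and re-mixes them to reconstruct the image; the PyTorch sketch below is our reading of that decoder, with illustrative layer sizes and kernel choices rather than the exact DETN configuration.

      import torch
      import torch.nn as nn

      class UnmixingDecoder(nn.Module):
          """Abundances are smoothed by a 2D convolution, constrained by a softmax, and
          re-mixed by a 1x1 convolution whose weights act as the endmember matrix."""
          def __init__(self, n_endmembers=4, n_bands=156):
              super().__init__()
              self.smooth = nn.Conv2d(n_endmembers, n_endmembers, kernel_size=3, padding=1)
              self.remix = nn.Conv2d(n_endmembers, n_bands, kernel_size=1, bias=False)

          def forward(self, embedding):
              abundance = torch.softmax(self.smooth(embedding), dim=1)  # nonnegative, sum-to-one
              reconstruction = self.remix(abundance)                    # linear mixing model
              return abundance, reconstruction

      # after training, endmember spectra can be read from decoder.remix.weight (n_bands x n_endmembers x 1 x 1)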
      Published: 2024-08-13

      Review

    • Xu Yuxiong, Li Bin, Tan Shunquan, Huang Jiwu
      Vol. 29, Issue 8, Pages: 2236-2268(2024) DOI: 10.11834/jig.230476
      Abstract: Speech deepfake technology, which employs deep learning methods to synthesize or generate speech, has emerged as a critical research hotspot in multimedia information security. The rapid iteration and optimization of artificial intelligence-generated content technologies have significantly advanced speech deepfake techniques. These advancements have markedly enhanced the naturalness, fidelity, and diversity of synthesized speech. However, they have also presented great challenges for speech deepfake detection technology. To address these challenges, this study comprehensively reviews recent research progress on speech deepfake generation and its detection techniques. Based on an extensive literature survey, this study first introduces the research background of speech forgery and its detection and compares and analyzes previously published reviews in this field. Second, this study provides a concise overview of speech deepfake generation, especially speech synthesis (SS) and voice conversion (VC). SS, which is commonly known as text-to-speech (TTS), analyzes text and generates speech that aligns with the provided input by applying linguistic rules for text description. Various deep models are employed in TTS, including sequence-to-sequence models, flow models, generative adversarial network models, variational auto-encoder models, and diffusion models. VC involves modifying acoustic features, such as emotion, accent, pronunciation, and speaker identity, to produce speech resembling human-like speech. VC algorithms can be categorized as single, multiple, and arbitrary target speech conversion depending on the number of target speakers. Third, this study briefly introduces commonly used datasets in speech deepfake detection and provides relevant access links to open-source datasets. It also briefly introduces two commonly used evaluation metrics in speech deepfake detection: the equal error rate and the tandem detection cost function. This study then analyzes and categorizes the existing speech deepfake detection techniques in detail. The pros and cons of different detection techniques are studied and compared in depth, focusing primarily on data processing, feature extraction and optimization, and learning mechanisms. Notably, this study summarizes the experimental results of existing detection techniques on the ASVspoof 2019 and 2021 datasets in tabular form. Within this context, the primary focus of this study is to investigate the generality of current detection techniques in the field of speech deepfake detection without focusing on specific forgery attack methods. Data augmentation involves a series of transformations on the original speech data. These include speech noise addition, mask enhancement, channel enhancement, and compression enhancement, each aiming to simulate complex real-world acoustic environments more effectively. Among them, one of the most common data processing methods is speech noise addition, which aims to interfere with the speech signal by adding noise to simulate the complex acoustic environment of a real scenario as much as possible. Mask enhancement is the masking operation on the time or frequency domain of speech to achieve noise suppression and enhancement of the speech signal for improving the accuracy and robustness of speech detection techniques. Transmission channel enhancement focuses on solving the problems of signal attenuation, data loss, and noise interference caused by changes in the codec and transmission channel of speech data. Compression enhancement techniques address the problem of degradation of speech quality during data compression. In particular, the main data compression methods are MP3, M4A, and OGG. From the perspective of feature extraction and optimization, speech deepfake detection can be divided into handcrafted feature-, hybrid feature-, deep feature-, and feature fusion-based methods. Handcrafted features refer to speech features extracted with the help of certain prior knowledge, which mainly include the constant-Q transform, linear frequency cepstral coefficients, and Mel-spectrogram. By contrast, hybrid feature-based forgery detection methods utilize the domain knowledge provided by handcrafted features to mine richer information about speech representations through deep learning networks. End-to-end forgery detection methods directly learn feature representation and classification models from raw speech signals, which eliminates the need for handcrafted feature extraction and allows the model to discover discriminative features from the input data automatically. Moreover, these detection techniques can be trained using a single feature. Alternatively, feature-level fusion forgery detection can be employed to combine multiple features, whether they are identical or different. Techniques such as weighted aggregation and feature concatenation are used for feature-level fusion. The detection techniques can capture richer speech information by fusing these features, which improves performance. For the learning mechanism, this study explores the impact of different training methods on forgery detection techniques, especially self-supervised learning, adversarial training, and multi-task learning. Self-supervised learning plays an important role in forgery detection techniques by automatically generating auxiliary targets or labels from speech data to train models. Fine-tuning a self-supervised pretrained model can effectively distinguish between real and forged speech. Adversarial training-based forgery detection enhances the robustness and generalization of the model by adding adversarial samples to the training data. In contrast to binary classification tasks, forgery detection based on multi-task learning captures more comprehensive and useful speech feature information from different speech-related tasks by sharing the underlying feature representations. This approach improves the detection performance of the model while effectively utilizing speech training data. Although speech deepfake detection techniques have achieved excellent performance on some datasets, their performance is less satisfactory when tested on speech data from natural scenarios. Analysis of the existing research shows that the main future research directions are to establish diversified speech deepfake datasets, study adversarial samples or data enhancement methods for enhancing the robustness of speech deepfake detection techniques, establish generalized speech deepfake detection techniques, and explore interpretable speech deepfake detection techniques. The relevant datasets and code mentioned can be accessed from https://github.com/media-sec-lab/Audio-Deepfake-Detection.
      关键词:speech deepfake;speech deepfake detection;speech synthesis (SS);voice conversion (VC);artificial intelligence-generated content (AIGC);self-supervised learning;adversarial training
      发布时间:2024-08-13
    • Hu Shiyu,Zhao Xin,Huang Kaiqi
      Vol. 29, Issue 8, Pages: 2269-2302(2024) DOI: 10.11834/jig.230498
      摘要:Single object tracking (SOT) task, which aims to model the human dynamic vision system and accomplish human-like object tracking ability in complex environments, has been widely used in various real-world applications like self-driving, video surveillance, and robot vision. Over the past decade, the development in deep learning has encouraged many research groups to work on designing different tracking frameworks like correlation filter (CF) and Siamese neural networks (SNNs), which facilitate the progress of SOT research. However, many factors (e.g., target deformation, fast motion, and illumination changes) in natural application scenes still challenge the SOT trackers. Thus, algorithms with novel architectures have been proposed for robust tracking and to achieve better performance in representative experimental environments. However, several poor cases in natural application environments reveal a large gap between the performance of state-of-the-art trackers and human expectations, which motivates us to pay close attention to the evaluation aspects. Therefore, instead of the traditional reviews that mainly concentrate on algorithm design, this study systematically reviews the visual intelligence evaluation techniques for SOT, including four key aspects: the task definition, evaluation environments, task executors, and evaluation mechanisms. First, we present the development direction of task definition, which includes the original short-term tracking, long-term tracking, and the recently proposed global instance tracking. With the evolution of the SOT definition, research has shown a progress from perceptual to cognitive intelligence. We also summarize challenging factors in the SOT task to help readers understand the research bottlenecks in actual applications. Second, we compare the representative experimental environments in SOT evaluation. Unlike existing reviews that mainly introduce datasets based on chronological order, this study divides the environments into three categories (i.e., general datasets, dedicated datasets, and competition datasets) and introduces them separately. Third, we introduce the executors of SOT tasks, which not only include tracking algorithms represented by traditional trackers, CF-based trackers, SNN-based trackers, and Transformer-based trackers but also contain human visual tracking experiments conducted in interdisciplinary fields. To our knowledge, none of the existing SOT reviews have included related works on human dynamic visual ability. Therefore, introducing interdisciplinary works can also support the visual intelligence evaluation by comparing machines with humans and better reveal the intelligence degree of existing algorithm modeling methods. Fourth, we review the evaluation mechanism and metrics, which encompass traditional machine–machine and novel human–machine comparisons, and analyze the target tracking capability of various task executors. We also provide an overview of the human–machine comparison named visual Turing test, including its application in many vision tasks (e.g., image comprehension, game navigation, image classification, and image recognition). Especially, we hope that this study can help researchers focus on this novel evaluation technique, better understand the capability bottlenecks, further explore the gaps between existing methods and humans, and finally achieve the goal of algorithmic intelligence. 
Finally, we indicate the evolution trend of visual intelligence evaluation techniques: 1) designing more human-like task definitions, 2) constructing more comprehensive and realistic evaluation environments, 3) including human subjects as task executors, and 4) using human abilities as a baseline to evaluate machine intelligence. In conclusion, this study summarizes the evolution trend of visual intelligence evaluation techniques for the SOT task, further analyzes the existing challenging factors, and discusses possible future research directions.
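      A common machine–machine evaluation measure in SOT is the success rate: the fraction of frames whose predicted box overlaps the ground truth above an IoU threshold. The sketch below is a minimal illustration of that computation under our own assumptions (hypothetical box arrays), not the exact protocol of any benchmark discussed above.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x, y, w, h)."""
    xa1, ya1, xa2, ya2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    xb1, yb1, xb2, yb2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rates(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """One success value per IoU threshold, averaged over all frames."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return [(overlaps > t).mean() for t in thresholds]
```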
      关键词:intelligence evaluation technique;competitions and datasets;visual tracking ability;single object tracking (SOT);object tracking algorithms   
      发布时间:2024-08-13
    • Yu Liu,Wu Xiaoqun
      Vol. 29, Issue 8, Pages: 2303-2318(2024) DOI: 10.11834/jig.230478
      摘要:Texture optimization is an important task in the field of computer graphics and computer vision, and it plays an essential role in creating realistic 3D reconstructed scenes, which aims to enhance the quality of texture mapping for 3D reconstructed scenes. It has various applications in digital entertainment, heritage restoration, smart cities, virtual/augmented reality, and other fields. To achieve complete texture mapping, a single image is often insufficient, and multiple angle color images are required. By combining these images and projecting them onto the 3D scene, high-quality texture mapping can be achieved, ideally with consistent luminosity across all images. However, the accuracy of 3D reconstruction can be impacted by errors in camera pose estimation and geometry, which result in misaligned projected images. This limitation restricts the use of 3D scenes in various fields, which makes 3D scene texture optimization a crucial task. The texture mapping process involves projecting a 2D image onto a 3D surface to create a realistic representation of the scene. However, this process can be challenging due to the complexity of the 3D data and the inherent inaccuracies in the 3D reconstruction process. Texture optimization algorithms attempt to reduce the errors in texture mapping and improve the visual quality of the resulting scene. Texture optimization algorithms have two main types: traditional and deep learning-based optimization algorithms. Traditional optimization algorithms typically aim to reduce the accumulated error of camera pose and the impact of reconstruction geometry accuracy on texture mapping quality. These algorithms often involve techniques such as image fusion, image stitching, and joint texture and geometry optimization. Image fusion-based optimization algorithms attempt to reduce blurring and artifacts in texture mapping by optimizing the camera pose and adding deformation functions to the images before weighting the color samples of multiple texture images from different views using a mixed weight function. This approach can help improve the visual quality of the scene and reduce the impact of reconstruction errors on texture mapping. Image stitching-based optimization algorithms select an optimal texture image for each triangular slice on the model and then deal with the seams caused by stitching multiple images. This approach can help reduce the impact of reconstruction errors on texture mapping and improve the visual quality of the resulting scene. Joint texture and geometry optimization algorithms further improve the quality of texture mapping by jointly optimizing the camera pose, geometric vertex position, and texture image color. This approach can help minimize the impact of reconstruction errors on texture mapping and achieve more realistic texture mapping. However, traditional optimization algorithms can be computationally expensive due to the complexity of 3D data. To address this challenge, researchers have developed deep learning-based optimization algorithms, which use neural networks to optimize 3D scene textures. These algorithms can be further classified into convolutional neural network-based optimization algorithms, generative adversarial network (GAN)-based optimization algorithms, neural textures, and text-driven diffusion model optimization algorithms. 
Convolutional neural network-based optimization algorithms use deep neural networks to learn the texture features of the input images and generate high-quality texture maps for 3D reconstructed scenes. GAN-based optimization algorithms use a GAN to generate high-quality texture maps that are visually indistinguishable from real images. Neural textures use a neural network to synthesize new textures that can be applied to 3D reconstructed scenes. Text-driven diffusion model optimization algorithms use a diffusion model to optimize the texture of large missing 3D scenes based on a given text description. Deep learning-based optimization algorithms have shown promising results in improving the visual quality of texture mapping and reducing computational costs. They can effectively combine prior knowledge with neural networks to infer the texture of large missing 3D scenes. This feature not only reduces computation but also ensures texture details and greatly enhances visual realism. In addition to texture optimization algorithms, various datasets and evaluation metrics have been developed to evaluate the performance of these algorithms. The commonly used datasets include the bundle fusion dataset and the ScanNet dataset. The evaluation metrics used to evaluate the performance of texture optimization algorithms include peak signal-to-noise ratio, structural similarity index, and mean squared error. In conclusion, texture optimization of 3D reconstructed scenes is a challenging research topic in the fields of computer graphics and computer vision. Traditional and deep learning-based optimization algorithms have been developed to address this challenge. The future trend of texture optimization is likely to be deep learning-based optimization algorithms, which can effectively reduce computational costs and enhance visual realism. However, many challenges and opportunities exist in this field, and researchers need to continue exploring new approaches to improve the quality of texture mapping for 3D reconstructed scenes.  
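      For reference, the peak signal-to-noise ratio and mean squared error cited above as evaluation metrics can be computed as in the following sketch; it assumes 8-bit images and hypothetical array names, and is provided only to make the metrics concrete.

```python
import numpy as np

def mse(reference, rendered):
    """Mean squared error between a reference image and a rendered texture view."""
    return np.mean((reference.astype(np.float64) - rendered.astype(np.float64)) ** 2)

def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(reference, rendered)
    return float("inf") if err == 0 else 10.0 * np.log10(max_value ** 2 / err)
```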
      关键词:scene reconstruction;texture optimization;image fusion;image stitching;joint optimization   
      发布时间:2024-08-13
    • Li Qi,Duan Pengsong,Cao Yangjie,Zhang Dalong,Yang Xiaohan,Wang Yujing
      Vol. 29, Issue 8, Pages: 2319-2332(2024) DOI: 10.11834/jig.230310
      摘要:The popularization and development of Internet technology have facilitated extensive research on the protection of users' data and privacy. Cyberspace security defense has developed from passive defense to active defense in recent years, and the performance and success rate of the new defense technologies have been significantly improved. Typical passive defense techniques include access control, firewalls, and virtual local area networks; typical active defense techniques include honeypot technology, digital watermarking, intrusion detection, and flow cleaning. However, traditional passive defense and active defense are shell defenses in which function and security are loosely coupled, and their defense performance against unknown attacks is poor. Their defects can be summarized as the “impossible triangle”, which means that a traditional defense system cannot simultaneously meet the three defense elements of dynamics, variety, and redundancy. The three elements can be combined in pairs to form a defensive domain. The typical technical representative of the DV domain is moving target defense, that of the DR domain is dynamic isomorphic redundancy, and that of the VR domain is the non-similar redundancy architecture. Our research aims to find a defense technology that can reach the DVR domain. Cyberspace mimic defense (CMD) was proposed by Academician Wu Jiangxing in 2016. It aims to address the issue of cyberspace mimic security, and it is an implementation form of network endogenous security developed from traditional cybersecurity defense methods. Its core architecture is a dynamic heterogeneous redundant architecture, which mainly consists of four parts: a set of heterogeneous execution entities, a distributor, a mimetic transformer, and a voter. It also takes the three theorems of CMD and the theorem of network security incomplete intersection as its theoretical foundation. Among them, the heterogeneity of the system is increased through heterogeneous execution entities, and the voting algorithm determines which individuals among the heterogeneous execution entities go online and offline. Heterogeneous strategies can be divided into four categories: single source closed, single source open, multi source closed, and multi source open. This classification depends on whether the system is open source and whether the source code has been modified. In the selection of heterogeneous components, similarity should be avoided as much as possible. Thus, system redundancy will be improved to prevent collaborative attacks from breaking through the mimic defense and causing damage to the system. The hybrid heterogeneous method can serve as a direction for further research on heterogeneous methods. It utilizes cloud computing resources to break through the limitations of single-computer software and hardware, and it consolidates the diversity and reliability of heterogeneous systems. The core idea of the mimic voting method is that the mimic system needs to monitor the “process data and process element resources” of the execution entities, discover the attacked execution entity through voting, and determine the final result value output by the system to the user I/O. The evolution of voting algorithms is mainly reflected in the use of diverse modules to repeatedly verify the voting results to improve their credibility, and multimodal adjudication is also an important guarantee for the dynamics of mimic systems. 
At the end of the mimic defense process, the scheduling algorithm completes the online and offline switching of the execution entities in the system. Scheduling algorithms can be divided into two categories according to whether the system uses historical data: open-loop external feedback algorithms and closed-loop self-feedback algorithms. A positive external feedback scheduling algorithm can improve performance to a certain extent. However, the lack of analysis of the historical state of a system will reduce its sensitivity to attacks that have already occurred, which weakens the dynamics of the mimic system. Therefore, scheduling strategies with self-feedback algorithms show better effectiveness and performance in adversarial experimental results. This study starts from the historical evolution of cyberspace security, compares the differences between traditional defense methods and mimic defense, focuses on introducing the specific implementation forms of heterogeneous strategies, scheduling strategies, and voting strategies in the mimic architecture, and lists application examples that integrate mimic defense ideas in practice. The mainstream mimic defense applications are the mimic router, mimic Web server, mimic distributed application, and mimic Internet of Things. Mimic defense has now gained a wide application foundation in various fields, and research based on this foundation can advance the existing network security system to a new stage.  
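      To make the voter's role concrete, the sketch below implements a plain majority vote over the outputs of heterogeneous executors. It is a simplified illustration of the adjudication idea only, under our own assumptions, and not the multimodal adjudication algorithms surveyed in the paper; all identifiers are hypothetical.

```python
from collections import Counter

def majority_vote(executor_outputs):
    """Return the consensus output and the executors that disagree with it.

    executor_outputs: dict mapping executor id -> response for one request.
    Executors whose response deviates from the consensus are candidates for
    being taken offline and rescheduled.
    """
    counts = Counter(executor_outputs.values())
    consensus, _ = counts.most_common(1)[0]
    suspects = [eid for eid, out in executor_outputs.items() if out != consensus]
    return consensus, suspects

# Hypothetical responses from three heterogeneous executors.
outputs = {"executor_a": "200 OK", "executor_b": "200 OK", "executor_c": "500 ERROR"}
result, offline_candidates = majority_vote(outputs)
print(result, offline_candidates)   # 200 OK ['executor_c']
```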
      关键词:network security;endogenous security;mimic defense;redundancy;dynamic isomerism   
      发布时间:2024-08-13

      Image Processing and Coding

    • Xie Bin,Xu Yan,Wang Guanchao,Yang Shumin,Li Yanwei
      Vol. 29, Issue 8, Pages: 2333-2349(2024) DOI: 10.11834/jig.230392
      摘要:Objective: Image decolorization converts a color image into a grayscale one. It aims to maintain important features of the original color image as much as possible. It is normally used in the preprocessing stage of pattern recognition and image processing systems, such as fingerprint feature extraction, face recognition, and medical imaging. Given that image decolorization maps three-dimensional color images into one-dimensional grayscale images, it is essentially a dimensionality reduction process. Thus, the loss of information is inevitable. Accordingly, the major task for researchers can be expressed as preserving important features such as the contrast, detail information, and hierarchy of the original color image as much as possible. In the process of acquiring grayscale images, the most basic method is to extract the luminance channel of the color image directly. However, acquiring only the luminance information tends to lose the contrast details and appearance features of the input image. In recent years, researchers have focused more on whether grayscale images match the color perception of the human visual system, and many classical image decolorization algorithms have been proposed. These color-to-gray conversion methods generally fall into two main categories: local mapping methods and global mapping methods. In the local mapping method, the mapping function is spatially varying, that is, the mapping functions of pixels at different spatial locations often differ. The grayscale image obtained by the local mapping method can keep part of the image structure information and local contrast information. However, such methods tend to lose the overall information of the image. They can also produce certain phenomena such as halos and noise enhancement. Some researchers have proposed the global mapping method to address the shortcomings of the abovementioned methods. In the global mapping method, the same mapping function is used for the entire image, and the grayscale value is only related to the color of the original color image. However, many of these methods focus excessively on pixels with large amplitude values, which tends to produce visual overfitting and causes difficulty in retaining information in areas with small contrast. A novel image decolorization method is proposed, in combination with t-distributed stochastic neighbor embedding (t-SNE), to address the problems of insufficient contrast maintenance, blurred details, and lack of hierarchy in grayscale images obtained by traditional color-to-gray conversion methods. This method is simpler and more efficient. Method: In this study, considering that t-SNE can better preserve the important features and intrinsic connections of the original high-dimensional spatial data points, we introduce the idea of t-SNE dimensionality reduction into the process of color-to-gray conversion. A new color-to-gray conversion model is also designed based on t-SNE maximization. In this model, the similarity between any two pixels in the original color image is represented by the joint Gaussian probability, as is the similarity between the corresponding pixels in the grayscale image. The energy function is maximized such that the low-contrast areas of the original color image can be appropriately enlarged or maintained after conversion to grayscale. Accordingly, the grayscale image can better maintain the contrast characteristics and the sense of hierarchy of the original color image. 
In addition, an adaptive contrast retention strategy is designed in this study and introduced into the new model to better preserve the details and contrast information of the original color image. The strategy defines a contrast retention coefficient that can adaptively adjust the graying intensity of different regions of the original color image based on the color contrast information. Given that the proposed maximization model is highly nonlinear, solving it with traditional methods such as gradient descent and iterative methods would be particularly complex and time consuming. For this reason, we adopt an efficient discrete search strategy to solve the model. It transforms the continuous parameter optimization problem into a search over a discrete parameter space to quickly obtain grayscale images, which greatly improves the overall efficiency of the model. Result: To demonstrate the effectiveness of the proposed method, we conducted an experimental comparison with various other traditional color-to-gray conversion methods on three different types of datasets, namely, Cadik, the complex scene decolorization dataset (CSDD), and Color250. Experimental results show that the proposed method can better maintain the contrast, detail features, and hierarchy of the original color image. For the color contrast preserving ratio (CCPR) metric, the average CCPR values of the proposed method over contrast thresholds ranging from 1 to 15 are 0.874, 0.862, and 0.864 on the three datasets, respectively, which are better than the average CCPR values of the conventional grayscale methods. Comparing the efficiency of different grayscale methods on the same hardware shows that the proposed method has the shortest running time when the input is a raw color image of size 396 × 386 pixels, which corresponds to real-time speed. In addition, volunteers were randomly invited to complete preference experiments and accuracy experiments to more comprehensively evaluate the visual effects of grayscale images obtained by different grayscale methods. The experiments showed that the volunteers preferred the grayscale images obtained by the proposed method. Therefore, the proposed method is more consistent with the visual perception of human eyes. Conclusion: In this study, we analyze the problems of traditional color-to-gray conversion methods and propose an adaptive decolorization method based on t-SNE maximization. The experimental data on three datasets show that the proposed method not only maintains the contrast, detail characteristics, and hierarchy of the original color image better but also performs better in subjective and objective evaluations.  
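      The discrete search strategy described above can be illustrated with a much-simplified stand-in: exhaustively scoring linear RGB weights drawn from a coarse grid and keeping the combination that best preserves pairwise contrast. This sketch only shows the discrete-parameter-search idea under our own assumptions; it is not the paper's t-SNE-based energy function, and the scoring proxy and grid step are hypothetical choices.

```python
import itertools
import numpy as np

def contrast_preservation(rgb, gray, n_pairs=2000, seed=0):
    """Proxy score: correlation between color differences and gray differences
    over randomly sampled pixel pairs."""
    rng = np.random.default_rng(seed)
    h, w, _ = rgb.shape
    idx = rng.integers(0, h * w, size=(n_pairs, 2))
    flat_rgb = rgb.reshape(-1, 3).astype(np.float64)
    flat_gray = gray.reshape(-1).astype(np.float64)
    color_diff = np.linalg.norm(flat_rgb[idx[:, 0]] - flat_rgb[idx[:, 1]], axis=1)
    gray_diff = np.abs(flat_gray[idx[:, 0]] - flat_gray[idx[:, 1]])
    return np.corrcoef(color_diff, gray_diff)[0, 1]

def discrete_search(rgb, step=0.1):
    """Search weights (wr, wg, wb) on a coarse grid with wr + wg + wb = 1."""
    best_w, best_score = None, -np.inf
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for wr, wg in itertools.product(grid, grid):
        wb = 1.0 - wr - wg
        if wb < -1e-9:
            continue
        gray = rgb[..., 0] * wr + rgb[..., 1] * wg + rgb[..., 2] * wb
        score = contrast_preservation(rgb, gray)
        if score > best_score:
            best_w, best_score = (wr, wg, wb), score
    return best_w, best_score
```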
      关键词:color-to-gray conversion;t-distributed stochastic neighbor embedding (t-SNE);contrast preserving;discrete search;detail preserving   
      发布时间:2024-08-13

      Image Analysis and Recognition

    • Xu Ke,Liu Xinpu,Wang Hanyun,Wan Jianwei,Guo Yulan
      Vol. 29, Issue 8, Pages: 2350-2363(2024) DOI: 10.11834/jig.230495
      摘要:Objective: In recent years, considerable attention has been given to the object detection algorithm that utilizes the fusion of visible and infrared dual-modal images. This algorithm serves as an effective approach for addressing object detection tasks in complex scenes. The process of object detection algorithms can be roughly divided into three stages. The first stage is feature extraction, which aims to extract geometric features from the input data. Next, the extracted features are fed into the neck network for multi-scale feature fusion. Finally, the fused features are input into the detection network to output object detection results. Similarly, dual-modal detection algorithms follow the same process to achieve object localization and classification. The difference lies in the fact that traditional object detection focuses on single-modal visible images, while dual-modal detection focuses on visible and infrared image data. The dual-modal detection algorithm aims to simultaneously utilize information from infrared and visible images. It merges these images to obtain more comprehensive and accurate target information, which enhances the accuracy and robustness of the object detection process. Traditional fusion methods encompass pixel-level fusion and feature-level fusion. Pixel-level fusion employs a straightforward weighted overlay technique on the two types of images, which enhances the contrast and edge information of the targets. Meanwhile, feature-level fusion extracts features from the infrared and visible images and combines them to enhance the representation capability of the targets. However, the feature fusion process of existing dual-modal detection algorithms faces two major issues. First, the feature fusion methods employed are relatively simple, which involves the addition or parallel operation of individual feature elements. Consequently, these methods yield unsatisfactory fusion effects that limit the performance of subsequent object detection. Second, the algorithm structure solely focuses on the feature fusion process, which neglects the crucial feature selection process. This deficiency results in the inefficient utilization of valuable features. Method: In this study, we introduce a visible and infrared image fusion object detection algorithm that employs dynamic feature selection to address the two issues mentioned above. Overall, we propose enhancements to the conventional YOLOv5 detector through modifications to its backbone, neck, and detection head components. We select CSPDarkNet53 as the backbone, which possesses an identical structure for visible and infrared image branches. The algorithm incorporates two innovative modules: dynamic fusion layer and dynamic selection layer. The proposed algorithm includes embedding the dynamic fusion layer in the backbone network, which utilizes the Transformer structure for multiple feature fusions in multi-source image feature maps to enrich feature expression. Moreover, it employs the dynamic selection layer in the neck network, which uses three attention mechanisms (i.e., scale, space, and channel) to improve multi-scale feature maps and screen useful features. These mechanisms are implemented with SENet and deformable convolutions. In line with standard practices in target detection algorithms, we utilize the detection head of YOLOv5 to generate detection results. 
The loss function employed for algorithm training is the combined sum of bounding box regression loss, classification loss, and confidence loss, which are implemented with generalized intersection over union, cross entropy, and squared-error functions, respectively. Result: In this study, we validate our proposed algorithm through experimental evaluation on three publicly available datasets: FLIR, visible-infrared paired dataset for low-light vision (LLVIP), and vehicle detection in aerial imagery (VEDAI). We use the mean average precision (mAP) for evaluation. Compared with the baseline model that adds features individually, our algorithm achieves improvements of 1.3%, 0.6%, and 3.9% in mAP50 scores and 4.6%, 2.6%, and 7.5% in mAP75 scores. In addition, our algorithm demonstrates enhancements of 3.2%, 2.1%, and 3.1% in mAP scores on the respective datasets, which effectively reduces the probability of object omission and false alarms. Moreover, we conduct ablation experiments on two innovative modules: the dynamic fusion layer and the dynamic selection layer. The complete algorithm model, which incorporates the two layers, achieves the best performance on all three test datasets. This performance validates the effectiveness of our proposed algorithm. We also compare the network model size and computational efficiency of these state-of-the-art algorithms, and experiments show that our algorithm can significantly improve detection performance while only slightly increasing parameters and computation. Furthermore, we visualize the attention weight matrices of the three dynamic fusion layers in the backbone to better reveal the mechanism of the dynamic fusion layer. The visual analysis confirms that the dynamic fusion layer effectively integrates the feature information from visible and infrared images. Conclusion: In this study, we propose a visible and infrared image fusion-based object detection algorithm using a dynamic feature selection strategy. This algorithm incorporates two innovative modules: the dynamic fusion layer and the dynamic selection layer. Through extensive experiments, we demonstrate that our algorithm effectively integrates feature information from visible and infrared image modalities, which enhances the performance of object detection. However, the proposed algorithm slightly increases computational complexity and requires pre-registration of the input visible and infrared images, which limits some application scenarios of the algorithm. Research on lightweight fusion modules and on algorithms capable of processing unregistered dual-light image pairs will be the focus of future work in the field of multimodal fusion target detection.  
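      As background for the channel attention mentioned in the dynamic selection layer, the following PyTorch sketch shows a generic SENet-style channel re-weighting block. It is a minimal illustration of the underlying mechanism under our own assumptions (class name, channel count, and reduction ratio are hypothetical), not the authors' module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SENet-style channel attention: squeeze with global average pooling,
    excite with a small bottleneck MLP, then re-weight the input channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # (b, c) per-channel descriptors
        return x * weights.view(b, c, 1, 1)     # re-weighted feature map

# Hypothetical fused visible/infrared feature map.
fused = torch.randn(2, 256, 40, 40)
selected = ChannelAttention(256)(fused)
```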
      关键词:infrared image;object detection;attention mechanism;feature fusion;deep neural network   
      发布时间:2024-08-13
    • Zhang Hongying,Liu Tengfei,Luo Qian,Zhang Tao
      Vol. 29, Issue 8, Pages: 2364-2376(2024) DOI: 10.11834/jig.230523
      摘要:Objective: Person re-identification (ReID) is an important task in computer vision, and it aims to accurately identify and associate the same person across multiple visual surveillance cameras by extracting and matching pedestrian features under different scenarios. Occluded person ReID is a challenging and specialized task among existing person ReID problems. In real-world settings, occlusion is a common issue, and it impacts the practical application of person ReID techniques to a certain extent. Recently, occluded person ReID has gradually attracted the attention of many researchers, and several methods have been proposed to address the issue of occlusion, which achieve impressive results. Currently, these methods primarily focus on the visible regions in images. Concretely, they first locate the visible regions in the image and then specially design a model to extract discerning feature information from these regions, which achieves accurate person matching. These methods typically remove features coming from the occluded areas and then exploit discriminative features from the non-occluded regions for matching. Although these methods achieve impressive results, the influence of occluded regions and background interference in images is ignored, which results in the aforementioned solutions failing to effectively address the misclassification issue resulting from similar appearances in non-occluded regions. Consequently, merely relying on visible regions for the subsequent recognition task leads to a sharp performance drop of the model, and the interference coming from image backgrounds also affects the further improvement in recognition accuracy. Some methods have been proposed to recover the occluded regions in images for overcoming the abovementioned issues. Specifically, these methods restore the occluded parts by utilizing the unobstructed image information at the image level. However, the restoration approaches may cause image distortion and introduce an excessive number of parameters. Method: We propose a person ReID method based on pose guidance and multi-scale feature fusion to alleviate the aforementioned issues. This method can enhance the feature representation capability of the model and obtain more discriminative features. First, a feature restoration module is constructed to restore the occluded image features at the feature level while effectively reducing the parameters of the model. The module uses spatial contextual information from the non-occluded regions to predict the features of adjacent occluded regions, which restores the semantic information of the occluded regions in the feature space. The feature restoration module mainly consists of two subparts: the adaptive region division unit and the feature restoration unit. The adaptive region division unit divides the image into six regions adaptively according to the predicted localization points to facilitate the clustering of similar feature information in different regions. The adaptive division in the module can effectively alleviate the misalignment caused by fixed division methods, and it can achieve more accurate position alignment. The feature restoration unit comprises an encoder and a decoder. The encoder encodes the feature information coming from the divided regions of the image with similar appearances or close positions into a cluster. Meanwhile, the decoder assigns the cluster information to the occluded body parts in the image, which completes the feature restoration of missing body parts. 
Second, a pose estimation network is employed to extract pedestrian pose information. The pose estimation network is responsible for guiding the generation of keypoint heatmaps for the restored complete image features. Then, it implements the prediction of body keypoints with the heatmaps to obtain pose information. The pretrained pose estimation guidance model performs fusion learning on the global non-occluded regions and the restored regions to obtain more distinctive pedestrian feature information for more accurate pedestrian matching. Finally, a feature enhancement module is proposed to extract salient features from the image for eliminating the interference coming from background information while enhancing the learning capability for effective information. This module not only makes the network pay close attention to the valid semantic information in the feature maps but also reduces the interference coming from background noise, which can effectively alleviate the failure of feature learning caused by occlusion. Result: We conducted several comparative experiments and ablation experiments on three publicly available datasets to validate the effectiveness of our method. We employed mean average precision (mAP) and Rank-1 accuracy as our evaluation metrics. Experiment results demonstrate that our method achieves mAP and Rank-1 of 88.8% and 95.5% on the Market1501 dataset, respectively. The mAP and Rank-1 are 79.2% and 89.3%, respectively, on the Duke multi-target multi-camera re-identification (DukeMTMC-reID) dataset. On the occluded Duke multi-target multi-camera re-identification (Occluded-DukeMTMC) dataset, the mAP and Rank-1 reach 51.7% and 60.3%, respectively. Moreover, our method outperforms PGMA-Net by 0.4% in mAP on the Market1501 dataset, by 0.8% in mAP and 0.7% in Rank-1 on the DukeMTMC-reID dataset, and by 1.2% in mAP on the Occluded-DukeMTMC dataset. At the same time, the ablation experiments confirm the effectiveness of the three proposed modules. Conclusion: Our proposed method, pose-guided and multi-scale feature fusion (PGMF), can effectively recover the features of missing body parts, alleviate the issue of background interference, and achieve accurate pedestrian matching. Therefore, the proposed model effectively alleviates the misidentification caused by occlusion, improves the accuracy of person ReID, and exhibits robustness.  
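      For context on the metrics above, the sketch below computes Rank-1 accuracy and a simplified mean average precision from a query–gallery distance matrix. It is a minimal illustration that ignores the camera-ID filtering used in standard ReID protocols, and all arrays are hypothetical.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """dist: (num_query, num_gallery) distance matrix; smaller = more similar.
    query_ids and gallery_ids are NumPy arrays of identity labels."""
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                     # gallery sorted by similarity
        matches = (gallery_ids[order] == query_ids[q]).astype(np.float64)
        if matches.sum() == 0:
            continue                                    # no ground-truth match in gallery
        rank1_hits.append(matches[0])
        precision_at_hits = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        average_precisions.append((precision_at_hits * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))
```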
      关键词:person re-identification (ReID);occlusion;pose guidance;feature fusion;feature restoration   
      发布时间:2024-08-13
    • Lu Lidan,Xia Haiying,Tan Yumei,Song Shuxiang
      Vol. 29, Issue 8, Pages: 2377-2387(2024) DOI: 10.11834/jig.230410
      摘要:Objective: When communicating face to face, people use various methods to convey their inner emotions, such as conversational tone, body movements, and facial expressions. Among these methods, facial expression is the most direct means of observing human emotions. People can convey their thoughts and feelings through facial expressions, and they can also use them to recognize others' attitudes and inner world. Therefore, facial expression recognition is one of the research directions in the field of affective computing. It can be applied to many fields, such as fatigue driving detection, human–computer interaction, students' listening state analysis, and intelligent medical services. However, in complex natural situations, facial expression recognition suffers from direct occlusion issues such as masks, sunglasses, gestures, hairstyles, or beards, as well as indirect occlusion issues such as different lighting, complex backgrounds, and pose variation. All these concerns pose great challenges to facial expression recognition in natural scenes, where extracting discriminative features is difficult. Thus, the final recognition results are poor. Therefore, we propose an attention-guided joint learning method for local features in facial expression recognition to reduce the interference of occlusion and pose variation problems. Method: Our method is composed of a global feature extraction module, a global feature enhancement module, and a joint learning module for local features. First, we use ResNet-50 as the backbone network and initialize the network parameters using the MS-Celeb-1M face recognition dataset. We think that the rich information available in the face recognition model can be used to complement the contextual information needed for facial expression recognition, especially the middle-layer features such as eyes, nose, and mouth. Thus, the global feature extraction module is used to extract the global features of the middle layer, which consists of a 2D convolutional layer and three bottleneck residual convolutional blocks. Second, most facial expression features are concentrated in localized key regions such as the eyes, nose, and mouth. Accordingly, the overall face information can be ignored and the expression categories can be directly recognized correctly with the help of local key information. Given that face recognition requires overall facial information, the face recognition pretraining model introduces some features that are unimportant for expression recognition. Therefore, we utilize a global feature enhancement module to suppress the redundant features (e.g., features in the nose region) brought by the pretrained face recognition model and enhance the semantic information of the global face image that is most relevant to the emotion. This module is implemented by the efficient channel attention (ECA) mechanism, which strengthens the channel features that contribute to the classification and weakens the weights of the channel features that are detrimental to the classification through cross-channel interactions between high-level semantic channel features. Finally, we divide the output features of the global feature enhancement module into four non-overlapping local regions uniformly along the spatial dimensions. This division distributes the eye and mouth regions of most face images across the four sub-image blocks. The global facial expression analysis problem is thus split into multiple local regions for calculation. 
Then, the fine-grained salient features of different localized regions of the face are learned through the mixed-attention mechanism. The local feature joint learning module learns information from complementary contexts, which reduces the negative effects of occlusion and pose variations. Considering that our method integrates four classifiers for local feature learning, a decision-level fusion strategy is used for the final prediction. That is, after summing the output probability results of the four classifiers, the category corresponding to the maximum probability is the model prediction category. Result: Relevant experimental validation was performed on two in-the-wild expression datasets, namely, the real-world affective faces database (RAF-DB) and the face expression recognition plus (FERPlus) dataset. The results of the ablation experiments show that the gains of our method compared with the base model on the two datasets are 1.89% and 2.47%, respectively. On the RAF-DB dataset, the recognition accuracy is 89.24%, which is a performance improvement of 0.84% compared with the global multi-scale and local attention network (MA-Net). On the FERPlus dataset, the recognition accuracy is 90.04%, which is comparable to the performance of the FER framework with two attention mechanisms (FER-VT). Therefore, our method has good robustness. We also test the model trained on the RAF-DB dataset on the FED-RO dataset, which contains real occlusions, and achieve an accuracy of 67.60%. In addition, we use Grad-CAM++ to visualize the attention heatmap of the proposed model to demonstrate the effectiveness of the proposed method more intuitively. The visualization of the joint learning module for local features illustrates that the module can direct the overall model to focus on the features in each individual local image block that are useful for classification. Conclusion: In general, the proposed method is guided by the attention mechanism, which enhances the global features first and then learns the salient features in the local regions. This approach effectively reduces the interference of the local occlusion problem through the learning sequence of global enhancement followed by local refinement. Experiments on two natural scene datasets and the occlusion test set prove that the model is simple, effective, and robust.  
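      The decision-level fusion step described above (summing the probability outputs of the four local-region classifiers and taking the most probable class) can be written in a few lines. The sketch below assumes each classifier already returns a softmax probability vector; the numbers are hypothetical.

```python
import numpy as np

def fuse_predictions(probability_vectors):
    """probability_vectors: list of per-classifier softmax outputs, each of shape (num_classes,).
    Returns the index of the expression class with the highest summed probability."""
    summed = np.sum(np.stack(probability_vectors, axis=0), axis=0)
    return int(np.argmax(summed))

# Hypothetical outputs of the four local-region classifiers over 7 expression classes.
p = [np.array([0.10, 0.60, 0.05, 0.05, 0.10, 0.05, 0.05]),
     np.array([0.20, 0.50, 0.05, 0.05, 0.10, 0.05, 0.05]),
     np.array([0.15, 0.55, 0.05, 0.05, 0.10, 0.05, 0.05]),
     np.array([0.10, 0.40, 0.10, 0.10, 0.20, 0.05, 0.05])]
print(fuse_predictions(p))  # 1
```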
      关键词:facial expression recognition;attention mechanism;partial occlusion;local salient feature;joint learning   
      发布时间:2024-08-13

      Image Understanding and Computer Vision

    • Zhu Zhongjie,Zhang Rong,Bai Yongqiang,Wang Yuer,Sun Jiamin
      Vol. 29, Issue 8, Pages: 2388-2398(2024) DOI: 10.11834/jig.230430
      摘要:Objective: Point cloud semantic segmentation is a computer vision task that aims to segment 3D point cloud data and assign a corresponding semantic label to each point. Specifically, according to the location and other attributes of each point, point cloud semantic segmentation assigns every point to predefined semantic categories, such as ground, buildings, vehicles, and pedestrians. Existing methods for point cloud semantic segmentation can be broadly categorized into three types: projection-, voxel-, and raw point cloud-based methods. Projection-based methods project the 3D point cloud onto a 2D plane (e.g., an image) and then apply standard image-based segmentation techniques. Voxel-based methods divide the point cloud space into regular voxel grids and assign semantic labels to each voxel. Both methods require data transformation, which inevitably leads to some loss of feature information. By contrast, raw point cloud-based methods directly process the point cloud without any transformation, which ensures that the original point cloud data are fed into the network intact. The geometric and semantic feature information of each point in the point cloud scene needs to be fully considered and utilized to achieve accurate semantic segmentation. Existing methods for point cloud semantic segmentation generally extract, process, and utilize geometric and semantic feature information separately, without considering their correlation. This approach leads to less precise local fine-grained segmentation. Therefore, this study proposes a new algorithm for point cloud semantic segmentation based on bilateral cross-enhancement and self-attention compensation. It not only fully utilizes the geometric and semantic feature information of the point cloud but also constructs offsets between them as a medium for information interaction. In addition, the fusion of local and global feature information is achieved, which enhances feature completeness and overall segmentation performance. This fusion process enhances the integrity of features and ensures the full representation and utilization of local and global contexts during the segmentation process. By considering the overall information of the point cloud scene, this algorithm demonstrates better performance in segmenting local fine-grained details and larger-scale structures. Method: First, the original input point cloud data are preprocessed to extract geometric contextual information and initial semantic contextual information. The geometric contextual information is represented by the original coordinates of the point cloud in 3D space, while the initial semantic contextual information is extracted using a multilayer perceptron. Next, a spatial aggregation module is designed, which consists of bilateral cross-enhancement and self-attention mechanism units. In the bilateral cross-enhancement units, local geometric and semantic contextual information is preliminarily extracted by constructing local neighborhoods for the preprocessed geometric contextual information and initial semantic contextual information. Then, offsets are constructed to facilitate cross-learning and enhancement of the local geometric and semantic contextual information by mapping it onto a common space. Finally, the enhanced local geometric and semantic contextual information is aggregated into local contextual information. 
Next, using the self-attention mechanism, global contextual information is extracted and fused with the local contextual information to compensate for the singularity of the local contextual information, which results in a comprehensive feature map. Finally, the multi-resolution feature maps obtained at different stages of the spatial aggregation module are fed into the feature fusion module for multi-scale feature fusion, which produces the final comprehensive feature map. Thus, high-performance semantic segmentation is achieved. Result: Experimental results on the Stanford 3D indoor spaces dataset (S3DIS) show a mean intersection over union (mIoU) of 70.2%, a mean class accuracy (mAcc) of 81.7%, and an overall accuracy (OA) of 88.3%, which are 2.4%, 2.0%, and 1.0% higher than those of the existing representative algorithm RandLA-Net. Meanwhile, for Area 5 of the S3DIS, the mIoU is 66.2%, which is 5.0% higher than that of RandLA-Net. In addition, visualizations of the segmentation results are provided on the Semantic3D dataset. Conclusion: By utilizing the spatial aggregation module, the proposed algorithm maximizes the utilization of geometric and semantic contextual information, which enhances the details of local contextual information. In addition, the integration of local and global contextual information through the self-attention mechanism ensures comprehensive feature representation. As a result, the proposed algorithm achieves a significant improvement in the segmentation accuracy of fine-grained details in point clouds. Visual analysis further validates the effectiveness of the algorithm. Compared with baseline algorithms, the proposed algorithm demonstrates clear superiority in the fine-grained segmentation of local regions in point cloud scenes. This result serves as partial evidence, which confirms the effectiveness of the proposed algorithm in addressing challenges related to point cloud segmentation tasks. In conclusion, the spatial aggregation module and its fusion of local and global contextual information significantly improve the segmentation accuracy of local details in point clouds. This approach offers a promising solution to enhance the segmentation accuracy of fine-grained details in local regions of point clouds.  
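      The mIoU, mAcc, and OA figures quoted above are all derived from a per-class confusion matrix. A minimal sketch of that bookkeeping follows, assuming hypothetical integer label arrays for predictions and ground truth.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """pred, target: 1D integer label arrays over all points in the evaluation set."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (target, pred), 1)                  # confusion matrix (rows = ground truth)
    tp = np.diag(cm).astype(np.float64)
    iou = tp / (cm.sum(1) + cm.sum(0) - tp + 1e-12)   # per-class IoU
    acc = tp / (cm.sum(1) + 1e-12)                    # per-class accuracy
    return {"mIoU": iou.mean(), "mAcc": acc.mean(), "OA": tp.sum() / cm.sum()}
```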
      关键词:point cloud;semantic segmentation;bilateral cross enhancement;self-attention mechanism;feature fusion   
      发布时间:2024-08-13
    • Zhou Hao,Qi Honggang,Deng Yongqiang,Li Juanjuan,Liang Hao,Miao Jun
      Vol. 29, Issue 8, Pages: 2399-2412(2024) DOI: 10.11834/jig.230568
      摘要:Objective: Perception systems are integral components of modern autonomous driving systems. They are designed to accurately estimate the state of the surrounding environment and provide reliable observations for prediction and planning. 3D object detection can intelligently predict the location, size, and category of key 3D objects near the autonomous vehicle, and it is an important part of the perception system. In 3D object detection, common data types include images and point clouds. Compared with an image, a point cloud is a dataset composed of many points in 3D space, and the position of each point is represented by coordinates in a 3D coordinate system. In addition to position information, information such as reflection intensity is usually included. In the field of computer vision, point clouds are often used to represent the shape and structure information of 3D objects. Therefore, 3D object detection methods based on point clouds have access to more real spatial information and often have advantages in detection accuracy and speed. However, the point cloud is often converted into a 3D voxel grid due to its unstructured nature. Each voxel in the voxel grid is regarded as a 3D feature vector. Then, a 3D convolutional network is used to extract voxel features, which completes the 3D object detection task based on the voxel features. In voxel-based 3D object detection algorithms, the voxelization of the point cloud leads to the loss of data information and structural information of part of the point cloud, which affects the detection effect. We propose a method that combines point cloud depth information to solve this problem. Our method uses point cloud depth information as fusion information to complement the information lost in the voxelization process. It also uses the efficient YOLOv7-Net network to extract fusion features, improve the detection performance and feature extraction capabilities for multi-scale objects, and effectively increase the accuracy of 3D object detection. Method: The point cloud is first converted into a depth image through spherical projection to reduce the information loss of the point cloud during the voxelization process. The depth image is a grayscale image generated from the point cloud, which reflects the distance from each point to the origin of the coordinate system in 3D space. The pixel gray value is used to represent the depth information of the point cloud. Therefore, the depth image of the point cloud can provide a rich feature representation for the point cloud, and the depth information of the point cloud can be used as fusion information to complement the information lost in the voxelization process. Thereafter, the depth image is fused with the feature map extracted by the 3D object detection algorithm to complement the information lost in the voxelization process. Given that the fusion features at this stage are closer in form to pseudo-images, a more efficient backbone feature extraction network is selected to extract the fusion features. The backbone feature extraction network in YOLOv7 uses an adaptive convolution module, which can adaptively adjust the size of the convolution kernel and the size of the receptive field according to the scale. This improves the detection performance of the network for multi-scale objects. 
At the same time, the feature fusion module and feature pyramid pooling module of YOLOv7-Net further enhance the feature extraction ability and detection performance of the network. Therefore, we choose to use YOLOv7-Net to extract fusion features. Finally, the classification and regression network is designed, and the extracted fusion features are sent to the classification and regression network to predict the category, position, and size of the object. Result: Our method is tested on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) 3D object detection dataset and the DAIR-V2X object detection dataset. Using average precision (AP) as the evaluation index, PP-Depth achieves improvements of 0.84%, 2.3%, and 1.77% in the categories of cars, pedestrians, and bicycles on the KITTI dataset compared with PointPillars. Using the easy difficulty of bicycles as an example, PP-YOLO-Depth achieves improvements of 5.15%, 1.1%, and 2.75% compared with PointPillars, PP-YOLO, and PP-Depth, respectively. On the DAIR-V2X dataset, PP-Depth achieves improvements of 17.46%, 20.72%, and 12.7% in the categories of cars, pedestrians, and bicycles compared with PointPillars. Using the easy difficulty of cars as an example, PP-YOLO-Depth achieves improvements of 13.53%, 5.59%, and 1.08% compared with PointPillars, PP-YOLO, and PP-Depth, respectively. Conclusion: Experimental results show that our method achieves good performance on the KITTI 3D object detection dataset and the DAIR-V2X object detection dataset. It reduces the information loss of the point cloud during the voxelization process and improves both the ability of the network to extract fusion features and its multi-scale object detection performance. Thus, it obtains more accurate object detection results.  
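      To make the spherical projection step concrete, the sketch below converts an (N, 3) LiDAR point array into a depth (range) image whose pixel intensities encode the distance to the sensor origin. The image size and vertical field-of-view values are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def spherical_projection(points, height=64, width=1024, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) array of x, y, z coordinates in the sensor frame."""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = abs(fov_up) + abs(fov_down)

    depth = np.linalg.norm(points, axis=1)                 # range of each point
    yaw = -np.arctan2(points[:, 1], points[:, 0])          # azimuth angle
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-8))

    u = 0.5 * (yaw / np.pi + 1.0) * width                  # column index
    v = (1.0 - (pitch + abs(fov_down)) / fov) * height     # row index
    u = np.clip(np.floor(u), 0, width - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, height - 1).astype(np.int32)

    image = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(depth)[::-1]                        # nearer points overwrite farther ones
    image[v[order], u[order]] = depth[order]
    return image
```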
      关键词:autonomous driving;3D point cloud object detection;depth information fusion;point cloud voxelization;KITTI dataset   
      发布时间:2024-08-13
    • Jia Di,Cai Peng,Wu Si,Wang Qian,Song Huilun
      Vol. 29, Issue 8, Pages: 2413-2425(2024) DOI: 10.11834/jig.230575
      摘要:Objective: In recent years, the use of neural networks for stereo matching tasks has become a major topic in the field of computer vision. Stereo matching is a classic and computationally intensive task in computer vision. It is commonly used in various advanced visual processing applications such as 3D reconstruction, autonomous driving, and augmented reality. Given a pair of distortion-corrected stereo images, the goal of stereo matching is to match corresponding pixels along the epipolar lines and compute their horizontal displacement, known as the disparity. In recent years, many researchers have explored deep learning-based stereo matching methods, achieving promising results. Convolutional neural networks are often used to construct feature extractors for stereo matching. Although convolution-based feature extractors have yielded significant improvements in performance, neural networks are still constrained by the fundamental operation unit of “convolution”. By definition, convolution is a linear operator with a limited receptive field. Achieving sufficiently broad contextual representation requires stacking layers of convolutions in deep architectures. This limitation becomes particularly pronounced in stereo matching tasks. In stereo matching tasks, captured stereo image pairs inevitably contain large areas of weak texture. Substantial computational resources are required to obtain comprehensive global feature representations through repeated convolutional layer stacking. We build a dense feature extraction Transformer (FET) for stereo matching tasks, which incorporates Transformer and convolution blocks, to address the abovementioned issue. Method: In the context of stereo matching tasks, FET exhibits three key advantages. First, when addressing high-resolution stereo image pairs, the inclusion of a pyramid pooling window within the Transformer block allows us to maintain linear computational complexity while obtaining a sufficiently broad context representation. This addresses the issue of feature scarcity caused by local weak textures. Second, we utilize convolution and transposed convolution blocks to implement subsampling and upsampling overlapping patch embeddings, which ensures that the features of all nearby points are captured as comprehensively as possible to facilitate fine-grained matching. Third, we experiment with employing a skip-query strategy for feature fusion between the encoder and decoder to efficiently transmit information. Finally, we incorporate the attention-based pixel matching strategy of the stereo Transformer (STTR) to realize a purely Transformer-based architecture. This strategy truncates the summation of matching probabilities within fixed regions to output more reasonable occlusion confidence values. Result: In the experimental section, we implemented our model using the PyTorch framework and trained it on an NVIDIA RTX 3090. We employed mixed precision during the training process to reduce GPU memory consumption and improve training speed. However, training a pure Transformer architecture in mixed precision proved to be unstable. The model experienced loss divergence errors after only a few iterations. To address this issue, we modified the order of computation for attention scores to suppress the related overflows. We also restructured the attention calculation method based on the additivity invariance of the softmax operation. Ablation experiments were conducted on the Scene Flow dataset. 
Results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and an overlap-prediction intersection over union of 98%. Additional comparative experiments on the KITTI-2015 dataset validate the effectiveness of the model in real-world driving scenarios; there, the proposed method achieves an average outlier percentage of 1.78%, outperforming mainstream methods such as STTR. Moreover, in tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets, the proposed model demonstrates strong generalization. Because publicly available datasets offer only a limited definition of weak-texture levels, we then used a clustering approach to filter images from the Scene Flow test set: each pixel is treated as a sample with its RGB values as features, and clustering quantifies the number of distinct pixel categories in each image, which serves as a measure of its texture richness. The images were then categorized into “difficult”, “moderate”, and “easy” cases according to the number of clusters. In this comparison, our approach consistently outperformed existing methods across the three categories, with a particularly notable improvement in the “difficult” cases. Conclusion: For the stereo matching task, we propose a feature extractor based on the Transformer architecture. First, we transplant the Transformer encoder-decoder architecture into the feature extractor, which effectively combines the inductive bias of convolutions with the global modeling capability of the Transformer; this Transformer-based feature extractor captures a broader range of contextual representations and partially alleviates the region ambiguity caused by locally weak textures. Furthermore, we introduce a skip-query strategy between the encoder and decoder for efficient information transfer, which mitigates the semantic discrepancy between them, and we design a spatial pooling window strategy that reduces the heavy computational burden of overlapping patch embeddings and keeps the attention computation within linear complexity. Experimental results demonstrate significant improvements in weak-texture region prediction, occluded region prediction, and domain generalization compared with related methods.
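      Editor's note: the abstract mentions reordering the attention-score computation and exploiting the invariance of softmax to an added constant, but gives no details. The sketch below shows one common way such an fp16 overflow fix looks in PyTorch, namely subtracting the per-row maximum score before the softmax; the function name and tensor layout are illustrative assumptions, not the authors' implementation.

      import torch

      def stable_attention(q, k, v, scale=None):
          """Scaled dot-product attention with a max-subtraction step so the
          softmax stays finite under fp16 mixed-precision training.
          q, k, v: (batch, heads, seq, dim) tensors (assumed layout)."""
          scale = scale if scale is not None else q.shape[-1] ** -0.5
          # Scale the queries *before* the matmul so the raw scores stay small.
          scores = (q * scale) @ k.transpose(-2, -1)
          # softmax(x) == softmax(x - c): subtracting the row maximum changes
          # nothing mathematically but keeps exp() from overflowing in fp16.
          scores = scores - scores.amax(dim=-1, keepdim=True).detach()
          attn = torch.softmax(scores, dim=-1)
          return attn @ v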
      Keywords: stereo matching; low-texture target; Transformer; spatial pooling windows; skip queries; truncated summation; Scene Flow; KITTI-2015
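      Editor's note: the clustering-based texture grading described in the abstract above could be reproduced along the following lines. The choice of MeanShift (which estimates the number of clusters itself), the pixel subsampling, and the cluster-count thresholds are all assumptions for illustration; the abstract does not specify them.

      import numpy as np
      from sklearn.cluster import MeanShift

      def texture_difficulty(image, bandwidth=16.0, thresholds=(8, 20)):
          """Grade an image's texture richness by clustering its pixels in RGB space.
          `image` is an H x W x 3 uint8 array; `bandwidth` and `thresholds`
          are illustrative values, not the paper's settings."""
          pixels = image.reshape(-1, 3).astype(np.float32)
          # Subsample pixels to keep the clustering tractable on large frames.
          rng = np.random.default_rng(0)
          idx = rng.choice(len(pixels), size=min(5000, len(pixels)), replace=False)
          n_clusters = len(np.unique(MeanShift(bandwidth=bandwidth).fit(pixels[idx]).labels_))
          if n_clusters <= thresholds[0]:
              return "difficult"   # few colour clusters -> weak texture
          if n_clusters <= thresholds[1]:
              return "moderate"
          return "easy"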
      Published: 2024-08-13

      Medical Image Processing

    • Wang Zhen,Huo Guanglei,Lan Hai,Hu Jianmin,Wei Xian
      Vol. 29, Issue 8, Pages: 2426-2438(2024) DOI: 10.11834/jig.230595
      Abstract: Objective: Retinal fundus images have important clinical applications in ophthalmology; they are used to screen for and diagnose various ophthalmic diseases such as diabetic retinopathy, macular degeneration, and glaucoma. However, image acquisition in real scenarios is often affected by lens defocus, poor ambient lighting, patient eye movement, and camera performance. These issues lead to quality problems such as blurriness, unclear details, and unavoidable noise in fundus images. Such poor-quality images hinder ophthalmologists' diagnostic work: for example, blurred images lose detailed information about the morphological structure of the retina, which makes it difficult for physicians to accurately localize and identify abnormalities, lesions, exudates, and other conditions. Existing enhancement methods for fundus images have made progress in improving image quality, but problems remain, including residual blurring, artifacts, missing high-frequency information, and amplified noise. Therefore, in this study we propose a convolutional dictionary diffusion model that combines convolutional dictionary learning with a conditional diffusion model. The algorithm aims to address the abovementioned problems of low-quality images and to provide an effective tool for fundus image enhancement. Our approach can improve the quality of fundus images and thereby help physicians increase diagnostic confidence, improve assessment accuracy, monitor treatment progress, and provide better patient care. It also contributes to ophthalmic research and creates more opportunities for prospective healthcare management and medical intervention, with a positive impact on patients' ocular health and overall quality of life. Method: The algorithm consists of two parts: the simulated forward diffusion process and the reverse denoising process. First, random noise is gradually added to the input image until a purely noisy image is obtained. A neural network is then trained to gradually remove the noise until a clear image is recovered. The blurred fundus image is used as conditional information to better preserve the fine-grained structure of the image. Because collecting paired blurred and clear fundus images is difficult, synthetic fundus datasets are widely used for training; accordingly, a Gaussian filtering algorithm is designed to simulate defocus-blurred images. During training, the conditional information and the noisy image are first concatenated and fed into the network, and abstract image features are extracted by progressively reducing the image size through downsampling, which significantly reduces the time and space complexity of the sparse representation calculation. A convolutional network then implements convolutional dictionary learning to obtain the sparse representation of the image. Because the self-attention mechanism can capture non-local similarity and long-range dependency, we add self-attention to the convolutional dictionary learning module to improve reconstruction quality. Finally, hierarchical feature extraction is achieved by feature concatenation, which fuses information across levels and makes better use of local features in the image, and the downsampled features are restored to the original image size by transposed convolution layers.
The model minimizes a negative log-likelihood loss, which measures the difference between the probability distributions of the generated image and the original image. After training, a clear fundus image is generated by gradually removing the noise from a noisy image, with the blurred image as conditional input. Result: The proposed method was evaluated on the EyePACS dataset, and further experiments were performed on the synthetic datasets DRIVE (digital retinal images for vessel extraction), CHASEDB1 (child heart and health study in England), and ROC (retinopathy online challenge), as well as the real-world datasets RF (real fundus) and HRF (high-resolution fundus), to demonstrate the generalizability of the model. Experimental results show that the peak signal-to-noise ratio (PSNR) and the learned perceptual image patch similarity (LPIPS) improve on average by 1.992 9 and 0.028 9, respectively, compared with the original diffusion model (learning enhancement from degradation, Led). Moreover, the proposed approach was used as a preprocessing module for downstream tasks: a retinal vessel segmentation experiment shows that our approach benefits downstream tasks in clinical applications. Segmentation results on the DRIVE dataset show that all metrics improve over the original diffusion model; specifically, the area under the curve (AUC), accuracy (Acc), and sensitivity (Sen) improve on average by 0.031 4, 0.003 0, and 0.073 8, respectively. Conclusion: The proposed method provides a practical tool for fundus image deblurring and a new perspective for improving the quality and accuracy of diagnosis. It has a positive impact on patients and ophthalmologists and is expected to promote further development of interdisciplinary research between ophthalmology and computer science.
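      Editor's note: as a rough illustration of the synthetic degradation described above, the sketch below generates a defocus-blurred fundus image from a clear one with OpenCV's Gaussian filter. The sigma range and the optional additive-noise term are assumed values, not the authors' settings.

      import numpy as np
      import cv2

      def simulate_defocus(clear_img, sigma_range=(1.0, 4.0), noise_std=0.0, rng=None):
          """Create a synthetic low-quality fundus image from a clear one by
          Gaussian blurring (a defocus surrogate) plus optional additive noise."""
          rng = rng or np.random.default_rng()
          sigma = rng.uniform(*sigma_range)
          # ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
          blurred = cv2.GaussianBlur(clear_img, ksize=(0, 0), sigmaX=sigma)
          if noise_std > 0:
              noise = rng.normal(0.0, noise_std, size=blurred.shape)
              blurred = np.clip(blurred.astype(np.float32) + noise, 0, 255).astype(clear_img.dtype)
          return blurred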
      Keywords: fundus image enhancement; convolutional dictionary learning; sparse representation; diffusion model; conditional diffusion model
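      Editor's note: the following sketch illustrates how a blurred fundus image can serve as conditional input by channel-wise concatenation with the noisy image, as the abstract describes. The DDPM-style noise schedule, the epsilon-prediction objective, and the `denoiser` interface are generic assumptions rather than the paper's exact negative log-likelihood formulation.

      import torch
      import torch.nn.functional as F

      def conditional_diffusion_step(denoiser, clear, blurred, alphas_cumprod):
          """One DDPM-style training step for blur-conditioned restoration.
          `denoiser` is assumed to take a 6-channel input (noisy clear image
          concatenated with the blurred condition) plus a timestep tensor."""
          b = clear.shape[0]
          t = torch.randint(0, len(alphas_cumprod), (b,), device=clear.device)
          a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
          noise = torch.randn_like(clear)
          # Forward (diffusion) process: gradually corrupt the clear image.
          noisy = a_bar.sqrt() * clear + (1 - a_bar).sqrt() * noise
          # Condition on the blurred image by channel-wise concatenation.
          pred_noise = denoiser(torch.cat([noisy, blurred], dim=1), t)
          return F.mse_loss(pred_noise, noise)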
      Published: 2024-08-13