Current Issue

    Volume 28, Issue 11, 2023

      Review

    • Chen Jianwen, Zhao Lili, Ren Lancao, Sun Zhuoqun, Zhang Xinfeng, Ma Siwei
      Vol. 28, Issue 11, Pages: 3295-3319(2023) DOI: 10.11834/jig.221076
      Deep learning-based quality enhancement for 3D point clouds: a survey
      摘要:With the development of 3D detection technologies, point clouds have gradually become one of the most common data representations of 3D objects or scenes that are widely used in many applications, such as autonomous driving, augmented reality (AR), and virtual reality (VR). However, due to limitations in hardware, environment, and occlusion, the acquired point clouds are usually sparse, noisy, and uneven, hence imposing great challenges to the processing and analysis of point clouds. Therefore, point cloud quality enhancement techniques, which aim to process the original point cloud to obtain a dense, clean, and structurally complete point cloud, are of great significance. In recent years, with the development of hardware and machine learning technologies, deep-learning-based point cloud quality enhancement methods, which have great potential to extract the features of point clouds, have attracted the attention of scholars at home and abroad. Related works mainly focus on point cloud completion, point cloud upsampling (also known as super-resolution), and point cloud denoising. Point cloud completion fills the incomplete point clouds to restore the complete point cloud information, while point cloud upsampling increases the point number of the original point cloud to obtain a denser point cloud, and point cloud denoising removes the noisy points in the point cloud to obtain a cleaner point cloud. This paper systematically reviews the existing point cloud quality enhancement methods based on deep learning to offer a basis for subsequent research. First, this study briefly introduces the fundamentals and key technologies that are widely used in point cloud analysis. Second, three types of point cloud quality enhancement technologies, namely, upsampling, completion, and denoising, are introduced, classified, and summarized. According to the types of input data, point cloud completion methods can be divided into voxel- and point-based algorithms, with the latter being further sub-divided into two types depending on whether the encoder-decoder structure is exploited or not. The encoder-decoder-structure-based algorithms can be further divided according to whether the generative adversarial network (GAN) structure is used. Point cloud upsampling methods can be classified into convolutional neural network (CNN)-based algorithms, GAN-based algorithms, and graph convolutional network (GCN)-based algorithms. Point cloud denoising methods can also be divided into two types based on whether the encoder-decoder structure is exploited or not. Third, the commonly used datasets and evaluation metrics in point cloud quality enhancement tasks are summarized. The performance evaluation metrics for geometry reconstruction mainly include chamfer distance, earth mover’s distance, Hausdorff distance, and point-to-surface distance. This paper then compares the state-of-the-art algorithms of point cloud completion and upsampling on common datasets and identifies the reasons for the differences in their performance. The recent progress and challenges in the field are then summarized, and future research trends are proposed. The findings are summarized as follows: 1) the point cloud features extracted by existing deep learning-based algorithms are highly global, which means that the local features related to the detailed structure cannot be captured well, thus resulting in poor detail reconstruction. 
Traditional geometric algorithms are known to effectively represent data features based on geometric information. Therefore, how to combine geometric algorithms with deep learning for point cloud quality enhancement is worth exploring. 2) Most algorithms are for dense point clouds of single objects, and only a few studies have focused on sparse LiDAR point clouds containing large-scale outdoor scenes. 3) Most of the related studies only consider the point cloud processing of a single frame and ignore the temporal correlation of point cloud sequences. Therefore, how to utilize the spatial-temporal correlation to improve quality enhancement performance warrants further investigation. 4) In existing methods, the proposed network models are often complex and the inference speed is relatively slow, which fail to meet the real-time requirements of several applications. Therefore, how to further reduce the scale of the model parameters and improve the inference speed is a research direction worth exploring. 5) Most of the existing methods only process the geometric information (3D coordinates) of point clouds and ignore the attribute information (e.g., color and intensity). Therefore, how to simultaneously enhance the quality of geometric and attribute information needs to be explored. Project page: https://github.com/LilydotEE/Point_cloud_quality_enhancement.  
      Keywords: point cloud completion; point cloud upsampling; point cloud denoising; quality enhancement; deep learning
      Published: 2023-12-07
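The survey above lists chamfer distance and earth mover's distance among the standard geometry reconstruction metrics. As a minimal illustration only (not code from any of the surveyed works), one common form of the symmetric chamfer distance between two point sets can be computed with NumPy as follows; the array shapes and point counts are assumptions.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric chamfer distance between point sets p (N, 3) and q (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Average nearest-neighbor distance in both directions.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Example: a dense reference cloud vs. a sparser, noisier version of it.
ref = np.random.rand(2048, 3)
test = ref[::4] + 0.01 * np.random.randn(512, 3)
print(chamfer_distance(test, ref))
```

Definitions vary across papers (sum vs. mean, squared vs. unsquared distances), so reported numbers are only comparable under the same convention.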
    • Zhao Shenlu, Zhang Qiang
      Vol. 28, Issue 11, Pages: 3320-3341(2023) DOI: 10.11834/jig.220451
      Progress in multi-modal image semantic segmentation based on deep learning
      摘要:Unlike some low-level vision tasks, such as image deraining, dehazing, and deblurring, semantic segmentation aims to decompose a visual scene into different semantic category entities and achieve category prediction for each pixel in an image, which plays an indispensable role in many scene understanding systems. Most existing semantic segmentation models use visible red-green-blue(RGB) images to perceive the scene contents. However, visible cameras have poor robustness to changing illumination and are unable to penetrate through smoke, fog, haze, rain, and snow. Limited by their imaging mechanism, visible cameras hardly capture sufficient and effective scene information under poor lighting and horrible weather conditions. Furthermore, they cannot provide the spatial structures and 3D layouts of various scenes, thus preventing them from handling complex scenes with similar target appearances or multiple changing scene areas. In recent years, with the continuous development of sensor technologies, thermal infrared and depth cameras have been widely used in military and civil fields. Compared with visible cameras, depth cameras can acquire the physical distances between the objects and the optical centers of sensors in the scenes, while thermal infrared cameras reflect the thermal radiations of objects whose temperatures exceed absolute zero (-273 ℃) under various lighting and weather conditions, thus providing rich contour and semantic cues. However, depth and thermal infrared images usually lack colors, textures, and other details. Given the difficulty for unimodal images to provide complete information about complex scenes, multi-modal image semantic segmentation aims to combine the complementary characteristics of images from different modalities (i.e., the images acquired by sensors based on different imaging mechanisms) to achieve comprehensive and accurate predictions. At present, there are many leading-edge works for multi-modal image semantic segmentation based on deep learning, but comprehensive reviews are scarce. In this paper, we provide a systematic review of the recent advances in multi-modal image semantic segmentation, including red-green-blue-thermal (RGB-T) and red-green-blue-depth (RGB-D) semantic segmentation. First, we summarize and analyze the current mainstream deep-learning-based algorithms for RGB-T and RGB-D semantic segmentation. Specifically, according to the different emphases of these algorithms, RGB-T semantic segmentation models based on deep learning are divided into three categories, namely, image-feature-enhancement-based methods, multi-modal-image-feature-fusion-based methods, and multi-level-image-feature-interaction-based methods. In image-feature-enhancement-based methods, unimodal or multi-modal fused image features are directly or indirectly enhanced by employing some attention mechanisms and embedding some auxiliary information. These models aim to mitigate the influence of interference information and mine such highly discriminative information from unimodal or multi-modal fused image features, thus effectively improving the semantic segmentation accuracy. Multi-modal-image-feature-fusion-based methods mainly focus on how to effectively exploit the complementary characteristics between RGB and thermal infrared features to give full play to the advantages of multi-modal images. Unlike unimodal image semantic segmentation, multi-modal image feature fusion is unique to multi-modal image semantic segmentation. 
Therefore, most of the existing RGB-T semantic segmentation methods are dedicated to designing fusion modules for integrating multi-modal image features. Receptive fields of different scales can extract the information of objects with different sizes in the scenes. With this in mind, the interactions among multi-level image features can help capture rich multi-scale contextual information that can significantly boost the performance of semantic segmentation models, especially in scenes containing multi-scale objects. Multi-level-image-feature-interaction-based methods have been widely used in unimodal image semantic segmentation, such as non-local networks and DeepLab. Similarly, some works have adopted these methods in RGB-T semantic segmentation and achieved satisfying results. Alternatively, according to the exploitation of depth information, RGB-D semantic segmentation models based on deep learning are divided into depth-information-extraction-based methods and depth-information-guidance-based methods, with the former being further subdivided into multi-modal-image-feature-fusion-based methods and contextual-information-mining-based methods. Similar to RGB-T semantic segmentation methods, depth-information-extraction-based methods regard the depth and RGB images as two separate input data that capture the discriminative information within RGB and depth features by extracting and fusing unimodal image features. In depth-information-guidance-based methods, depth information is embedded into the feature extraction of RGB images. By doing so, depth-information-guidance-based methods can make full use of the 3D information provided by depth images and reduce their model sizes to some extent. Second, we introduce some widely used evaluation criteria and public datasets for RGB-D/RGB-T semantic segmentation and compare and analyze the performance of various models. Specifically, for RGB-T semantic segmentation, graded-feature multilabel-learning network(GMNet) and multiscale feature fusion and enhancelment network(MFFENet) achieve the best performance in terms of mean intersection over union per class (mIoU) (57.3%) and mean accuracy per class (mAcc) (74.3%), respectively, on the MFNet dataset. Meanwhile, on the PST900 dataset, GMNet still achieves the best performance in mIoU (84.12%), but EGFNet achieves the best performance in mAcc (94.02%). For RGB-D semantic segmentation, GLPNet achieves the best performance in mIoU (54.6%) and mAcc (66.6%) on the NYUD v2 dataset. On the SUN-RGBD dataset, GLPNet achieves the best performance in mAcc (63.3%), but Zig-Zag achieves the best results in mIoU (51.8%). We also point out some future development directions for multi-modal image semantic segmentation.  
      Keywords: multi-modal image; semantic segmentation; feature enhancement; feature fusion; feature interaction; depth information extraction; depth information guidance
      Published: 2023-12-07
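The comparison above is reported in mIoU (mean intersection over union per class) and mAcc (mean accuracy per class). A minimal sketch of how these two metrics are typically computed from a per-class confusion matrix is shown below; it is illustrative only and not taken from any of the cited models.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    gt_per_class = conf.sum(axis=1)            # ground-truth pixels of each class
    pred_per_class = conf.sum(axis=0)          # pixels predicted as each class
    iou = tp / (gt_per_class + pred_per_class - tp)   # per-class IoU
    acc = tp / gt_per_class                            # per-class accuracy
    return np.nanmean(iou), np.nanmean(acc)            # mIoU, mAcc

# Toy 3-class confusion matrix.
conf = np.array([[50, 2, 3],
                 [4, 40, 6],
                 [1, 5, 30]])
miou, macc = segmentation_metrics(conf)
print(f"mIoU={miou:.3f}, mAcc={macc:.3f}")
```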
    • Yan Yi, Deng Chao, Li Lin, Zhu Lingkun, Ye Biao
      Vol. 28, Issue 11, Pages: 3342-3362(2023) DOI: 10.11834/jig.220292
      Survey of image semantic segmentation methods in the deep learning era
      摘要:Introduced by Ohta in 1980, image semantic segmentation assigns each pixel in an image with a pre-defined label that represents its semantic category. Aiming to understand the different scenes of images, image semantic segmentation has received much research attention in the field of computer vision. In recent years, many research laboratories around the world have carried out research work on image semantic segmentation based on deep learning. Academic conferences in the fields of automation, artificial intelligence, and pattern recognition also reported research results on semantic segmentation. At the same time, semantic segmentation serves as the premise and basis of many computer vision tasks and has important application value in virtual reality, such as automatic driving and human-computer interaction. With the rapid development of deep learning technology, especially the emergence of convolutional neural networks, image semantic segmentation technology has made great progress and has far outperformed traditional methods in terms of accuracy and efficiency. First, this paper introduces the concept of semantic segmentation along with its background and basic process. In general, image semantic segmentation based on deep learning goes through three processing modules, namely, the feature extraction, semantic segmentation, and refinement processing modules. Second, this paper summarizes the open source 2D, RGB-D, and 3D datasets that have been used in recent years and their corresponding segmentation methods. The semantic segmentation methods for 2D data are divided into method based on candidate region, method based on fully supervised learning, and method based on weakly supervised learning. As RGB-D and 3D date, only a few semantic segmentation methods need to be classified, thus no further classification is performed. This paper describes in detail the network structure of several classical algorithms, the segmentation characteristics, advantages, and disadvantages of different networks, and their segmentation accuracy. Through this summary, this study reveals that most segmentation methods are based on fully supervised learning, which is an effective training method. Third, this paper introduces several authoritative performance evaluation indexes of algorithms, such as mean average precision (mAP) and mean intersection over union (mIoU), and tests the segmentation accuracy and computing performance of the semantic segmentation method when applied in 2D-data-related experiments. The Experimental section shows that the DeepLab-V3+ network has good segmentation accuracy and speed, which attest to its high application value. The semantic segmentation performance for 2.5D and 3D data is also compared. The following key problems are highlighted in this section: some algorithms are not tested on authoritative datasets; some algorithms are not open source; and some experiments do not describe the relevant experimental parameters in detail. Therefore, considering the current situation of research at home and abroad, this paper highlights several challenges and proposes some new directions for future research. First, segmentation algorithms tend to prioritize either accuracy or real time while ignoring the other. Second, a segmented network usually needs large amounts of memory to realize reasoning and training, hence making it unsuitable for some devices. 
Third, the design of the segmentation algorithm adapted to 3D data is a current research focus, but high-quality 3D datasets are generally lacking, and the existing 3D datasets are patchwork datasets. Fourth, only a few segmentation algorithms are available for RGB-D and 3D data (particularly for 3D data), and open-source algorithms generally have low accuracy. Fifth, sequence data have temporal consistency. Sixth, some methods solve the problem of video or sequence segmentation, while others do not use time series information to improve accuracy or segmentation efficiency. Seventh, some papers have proposed that face detection can be realized without training deep neural network and examined whether semantic segmentation can be realized without a training network. Through summary and analysis, this paper hopes to provide some valuable reference for future research on image semantic segmentation.  
      Keywords: deep learning; image semantic segmentation (ISS); convolutional neural network (CNN); supervised learning; DeepLab-V3+ network
      Published: 2023-12-07
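The survey above notes that most segmentation methods are fully supervised. The core of such training is a per-pixel cross-entropy loss between predicted class logits and dense labels; the PyTorch sketch below illustrates that single step only. The tiny convolutional network, shapes, and optimizer settings are placeholders, not any of the surveyed models.

```python
import torch
import torch.nn as nn

# Placeholder network: any model mapping (B, 3, H, W) images to (B, C, H, W) class logits fits here.
num_classes = 21
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, num_classes, 1))
criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 marks unlabeled pixels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

images = torch.randn(2, 3, 64, 64)                   # dummy image batch
labels = torch.randint(0, num_classes, (2, 64, 64))  # dummy dense per-pixel labels

logits = model(images)             # (B, C, H, W)
loss = criterion(logits, labels)   # per-pixel cross-entropy, averaged over labeled pixels
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```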
    • Luo Chunlong, Zhao Yi
      Vol. 28, Issue 11, Pages: 3363-3385(2023) DOI: 10.11834/jig.221094
      Review of deep learning methods for karyotype analysis
      摘要:Chromosomal abnormalities can lead to serious diseases, such as chronic myeloid leukemia and down syndrome. Karyotyping can count chromosomes in metaphase images, segment them from the background, arrange them according to certain rules, and observe and issue diagnostic results. Therefore, karyotype analysis has been widely used in many modern clinical fields and scientific research. However, even an experienced cytogeneticist requires much time to complete karyotyping. Although machine learning or traditional geometric methods have tried to automate karyotype analysis, most of them have shown poor performance and do not satisfy clinical requirements, which means that cytogeneticists still require much time for manual intervention. While many deep-learning-based methods have been proposed, systematic reviews are lacking. This paper reviews the recent literature and summarizes them into chromosome counting, chromosome segmentation, chromosome cluster classification, chromosome preprocessing, chromosome classification, and chromosome anomaly. First, the chromosome counting methods are summarized based on bounding box detection to accurately identify each chromosome on the metaphase images. Specifically, these methods need to find candidate object proposals, classify them into different classes, and refine the locations. However, they must solve self-similarity problems, over-deletion problems, and inaccurate localization problems resulting from overlapping chromosomes. Researchers have also attempted to accelerate model inference speed through lightweight backbones. Methods for the chromosome segmentation task can be divided into semantic and instance segmentation methods. On the one hand, semantic segmentation methods can only solve the problem of segmenting chromosome clusters formed by two or more overlapping chromosomes, and some post-processing should be introduced to splice chromosomes. On the other hand, instance segmentation methods can automate chromosome segmentation, and additional supervision information, such as key points or orientation information, can further improve its performance. Given that some chromosome segmentation methods can only solve a specific type of chromosome cluster, the types of clusters should be identified. Existing methods roughly classify chromosome clusters according to two criteria, namely, based on the number of overlapping chromosomes and based on the interrelationship between the touching and overlapping chromosomes. However, from the methodological perspective, previous studies are mostly based on simple convolution neural networks (CNNs). Therefore, further innovative studies on chromosome cluster classification are required. As for the chromosome preprocessing task, existing methods mainly address the two preprocessing tasks of metaphase image denoising and chromosome straightening. The metaphase image denoising task is solved in a segmentation manner, where the chromosomes are regarded as a whole area that needs to be segmented from the background and impurities present in an image. The existing chromosome straightening methods rely on generative adversarial networks to straighten curved chromosomes and generally follow the image translation or motion transformation framework. Benefiting from the booming development of deep-learning-based image classification networks, the chromosome classification task has also received much attention and development in karyotype-analysis-related tasks. 
According to their properties, the available methods can be divided into 1) simple CNN-based methods, which redesign the network aiming at chromosome instances instead of directly using the famous CNN model proposed for the ImageNet dataset; 2) feature-contrastive-based methods, which extract representative features in a contrastive manner and then classify them through a simple classifier; 3) image-preprocessing-based methods, where super-resolution methods are applied before classification to unify the size of chromosome images or enhance the banding pattern features using different filters; 4) global- and local-feature-fusion-based methods, which explicitly crop and extract features of the local but important image parts and then fuse them for final classification; and 5) complex-strategy-based methods, which solve the chromosome classification task by detecting chromosomes from metaphase images and improve performance using the ensemble learning framework. The final reviewed task is chromosome anomaly that includes detection and generation subtasks. Despite being a subject of concern for clinical experts, previous studies can only detect a specific type of chromosome anomaly through basic CNN or roughly discriminate between normal and abnormal chromosomes using the generative adversarial network framework. Meanwhile, the available approaches for generation subtasks are based on generative adversarial networks. At the end of this paper, the various tasks and main methodologies are summarized and reviewed, and then feasible future developments are proposed. First, to fulfill these tasks, multiple advanced solution paradigms, such as multi-modality and image question answering, should be introduced. Second, chromosomal abnormality diagnosis has not been addressed because it involves the extraction of band-level features and relational reasoning. Third, pretraining models in a self-supervised learning manner are worth further research. Despite the unavailability of high-quality labeled data for chromosomes, a large amount of clinically unlabeled data can still reduce the cost of data labeling and improve the performance of downstream tasks through the self-supervised learning paradigm. In sum, deep-learning-based automatic karyotyping methods should be reviewed further to draw additional research interest.  
      Keywords: deep learning; computer aided diagnosis; chromosome karyotype; chromosome classification; chromosome segmentation
      Published: 2023-12-07

      Dataset

    • Yin Chengxi, Zhang Bolin, Luo Junwei, Zhu Chuntao, Fu Jingqiao, Lu Wei
      Vol. 28, Issue 11, Pages: 3386-3399(2023) DOI: 10.11834/jig.220508
      SSRGFD: stereo super-resolution image general forensic dataset
      摘要:ObjectiveWith the rapid development of computers and networks, images have become an important component of information transmission and sharing. However, the popularization of image editors reduces the cost of tampering with image content and destroying image semantics. To avoid the threat posed by malicious tampering images to social stability and security, the integrity and authenticity of images should be detected. The popularization of dual cameras and the development of stereo super-resolution methods have introduced stereo super-resolution images into many fields, such as photography, remote sensing, and intelligent robots. However, the characteristics of these images differ from those of monocular images. Faced with new imaging devices and algorithms, the effectiveness of existing image tampering detection algorithms needs to be revalidated, and the security of stereo super-resolution images should be further examined. Nevertheless, stereo super-resolution image tampering datasets are currently lacking, hence posing difficulties in meeting the needs of image forensics research. To this end, a high-quality stereo super-resolution image general forensic dataset (SSRGFD) is constructed in this paper.MethodForgery images in the SSRGFD dataset are all stereo super-resolution images generated using Adobe Photoshop and deep learning algorithms. Three common categories of forgeries are considered, namely, copy-move, splicing, and inpainting, of which inpainting includes removal and restoration. Stereo super-resolution images are initially generated from the Flickr1024 dataset using the stereo super-resolution method PASSRnet. To enrich the image content and conceal visible traces of tampering, several tampering standards are designed for different forgery categories to make the tampering images in line with real scenes. Afterward, the copy-move and splicing stereo super-resolution tampering images are tampered using Adobe Photoshop based on these standards. In the copy-move stage, one or more duplicated regions are copied and moved to other appropriate regions on the image. In the splicing stage, one or more regions are replaced with duplicated regions from different images to synthesize the tampering image. Meanwhile, image inpainting involves the removal of some objects and the recovery of the missing region on an image. When removing image objects, the pixel in a certain range is picked up according to the image content to simulate the image background and cover the target object using Photoshop tools. Moreover, the HiFill deep learning method is used to recover the missing regions on these images. Due to the differences between the training images and the images to be tampered with, tampering images with obvious traces of inpainting are removed manually. For each forgery category, various preprocessing and post-processing methods may be applied to cover visible traces of tampering according to the proposed standards. The masks of the tampered region corresponding to each tampering image are also provided.ResultThe subjective visual quality of the SSRGFD dataset is evaluated using the double stimulus continuous quality scale method, and the difference in the scores given by the evaluator between the tampering and real images is less than 1.5. Among them, copy-move has the lowest score difference due to the fact that the duplicated and tampering regions come from the same image and that only few tampering traces are visible. 
Three no-reference image quality evaluation methods, namely, blind/referenceless image spatial quality evaluator(BRISQUE), natural image quality evaluator(NIQE), and parent institute for quality education(PIQE) are also applied for an objective visual quality evaluation. The evaluation results for the real and forgery images of the SSRGFD dataset are very similar. These experiments illustrate that forgery images have promising visual quality. To demonstrate the challenge posed by stereo super-resolution images to existing image tampering detection and localization methods, experiments are conducted using four effective models, namely, manipulation tracing network(MantraNet), ringed residual U-Net(RRU-Net), QMPPNet, and dense fully convolutional network(DenseFCN). These experiments use precision, recall, pixel-level F1 score, Matthews correlation coefficient (MCC), and intersection over union (IoU) as evaluation parameters for the methods proposed above. QMPPNet performs best in all metrics by combining semantic segmentation and edge features. The results of QMPPNet in precision, recall, pixel-level F1 score, MCC, and IoU are 0.507 9, 0.899 8, 0.614 8, 0.643 3, and 0.480 2, respectively. However, compared with the monocular images forgery dataset, the performance of these methods on the SSRGFD dataset is greatly reduced.ConclusionThis paper constructs a tampering dataset on stereo super-resolution images. These images greatly affect the performance of the existing image tampering localization methods and dramatically reduce the capability of these methods on stereo super-resolution images. Several experiment results have verified the necessity of new image forensic methods adapted for stereo super-resolution images. However, the existing monocular datasets cannot meet research needs. The SSRGFD dataset with affluent content and considerable visual quality not only brings great challenges to image forensic methods but also provides effective data support for the image tampering detection and localization of stereo super-resolution images. The SSRGFD dataset is available at https://github.com/YL1006/SSRGFD.  
      Keywords: digital image forensic; image tampering detection; stereo super-resolution image tampering dataset; copy-move; splicing; inpainting; image visual quality evaluation
      Published: 2023-12-07
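The benchmark above scores tampering localization with precision, recall, pixel-level F1, MCC, and IoU. A small sketch of these metrics on binary masks is given below; it is a generic illustration, not the evaluation code used for SSRGFD, and the example masks are made up.

```python
import numpy as np

def localization_metrics(pred, gt):
    """pred, gt: boolean arrays where True marks pixels flagged as tampered."""
    tp = np.logical_and(pred, gt).sum(dtype=float)
    fp = np.logical_and(pred, ~gt).sum(dtype=float)
    fn = np.logical_and(~pred, gt).sum(dtype=float)
    tn = np.logical_and(~pred, ~gt).sum(dtype=float)
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    iou = tp / (tp + fp + fn + 1e-12)
    mcc = (tp * tn - fp * fn) / (np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + 1e-12)
    return precision, recall, f1, mcc, iou

gt = np.zeros((256, 256), dtype=bool); gt[64:128, 64:128] = True      # ground-truth tampered region
pred = np.zeros((256, 256), dtype=bool); pred[70:130, 60:120] = True  # predicted region
print(localization_metrics(pred, gt))
```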

      Image Processing and Coding

    • Chen Jian, Wan Jiaze, Lin Li, Li Zuoyong
      Vol. 28, Issue 11, Pages: 3400-3414(2023) DOI: 10.11834/jig.220939
      Self-adaptive semantic awareness network for blind image quality assessment
      摘要:ObjectiveThe rapid development of imaging technology has been accompanied by continuous updates in acquisition equipment and related technologies over the past few decades. However, the quality of images is susceptible to interferences from various stages, including acquisition, processing, transmission, and storage, which eventually introduce different types (e.g., JPEG2000 compression, JPEG compression, white Gaussian noise, Gaussian blur, fast fading distortion, and contrast distortion) and degrees of distortions that degrade image quality. Therefore, blind image quality assessment (BIQA) has practical significance in the field of image quality control and is helpful for subsequent image processing and analysis. Although many other methods have achieved reasonable results in the blind image quality assessment of degraded images, their image quality assessment accuracy warrants further improvement when dealing with the distortions of natural images. The challenges in assessing natural image distortions include the following: 1) natural image distortions are much more complex compared with synthetic image distortions because the former contains not only global distortion (e.g., out of focus and Gaussian noise) but also local distortion (e.g., overexposure and motion blur), which increases the difficulty of image quality assessment; 2) among the different semantic features extracted by deep convolutional neural network (DCNN), the lower-level semantic features contain less semantic information and cannot provide a comprehensive overview and understanding of the image information, thereby hindering networks from coping with the distortions of natural images with diverse contents; and 3) although the high-level semantic features obtained by DCNN contain rich semantic information, the lack of local detail information of the image easily makes the whole network overlook the local distortions. To address these problems, this paper proposes a blind image quality evaluation method called self-adaptive semantic awareness network (SSA-Net).MethodFirst, images from different databases are not uniform in size and are prone to be large, and deep-learning-based networks usually require a fixed size for input images. Therefore, all input images are randomly cropped 25 times to represent the content of the original image. Second, to enable the network to extract rich semantic features, a 50-layer deep residual network (ResNet-50) with pre-trained weights obtained from ImageNet is leveraged for feature extraction and is used to capture the semantic features of the images at each stage. Third, a multi-head position attention (MPA) module is designed to address the content diversity of naturally degraded images, which would improve the understanding of image content and the accuracy of the subsequent perceptions of distortion types by adding absolute position encoding into the multi-head position attention to acquire fixed distortion position information. Fourth, the self-adaptive feature awareness (SFA) module is presented to address the diversity of distortion types in naturally degraded images. This module combines the understanding of image content and the use of pooling kernels with different sizes to capture the global and local distortions in images. 
Fifth, a multi-level supervision regression (MSR) network with learnable parameters that uses lower-level semantic features to assist the higher-level semantic features is proposed to derive prediction scores that are in line with the human visual system.ResultExperiments are conducted on 7 databases with 11 different methods for comparison. The proposed method achieves the best performance on four natural distortion image databases with Spearman rank order correlation coefficient (SRCC) values of 0.867, 0.877, 0.913, and 0.915 for LIVE in the Wild Image Quality Challenge (LIVEC) database, blurred image database (BID), Konstanz authentic image quality 10k database (KonIQ-10k), and smartphone photography attribute and quality (SPAQ) database, respectively. This method also obtains the highest Pearson linear correlation coefficient (PLCC) values of 0.886, 0.881, 0.923, and 0.921 on these databases. This method also obtains the top two SRCC values in two synthetic distortion image databases, including the laboratory for image & video engineering (LIVE) database and categorical subjective image quality (CSIQ) database. In the cross-validation, SSA-Net achieves competitive results in several natural distortion image quality databases and reasonable evaluation results in synthetic/natural image quality evaluation databases. SSA-Net also shows more desirable generalization performance than the self-adaptive hyper network and visual compensation restoration network on Waterloo Exploration database. Experimental results show that the proposed method outperforms the state-of-the-art methods in natural distortion image quality assessment databases and demonstrate stronger generalization performance.ConclusionThe proposed method acquires accurate image distortion information by combining the understanding of the image content with the perception of different distortion types. The network can fuse information from different stages through an improved deep supervision mechanism and by setting learnable parameters that can efficiently adapt to the distortion of natural images and subsequently improve the image quality assessment accuracy.  
      Keywords: image quality assessment (IQA); blind image quality assessment (BIQA); deep learning; self-adaptive semantic awareness network (SSA-Net); multi-level supervision regression (MSR)
      Published: 2023-12-07
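Performance above is reported as SRCC and PLCC between predicted scores and subjective quality ratings. These correlations can be computed with SciPy as in the sketch below; the score arrays are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical predicted quality scores and subjective mean opinion scores (MOS).
pred = np.array([62.1, 48.3, 75.0, 30.2, 55.7, 68.9, 41.5, 80.3])
mos  = np.array([60.0, 50.5, 78.2, 28.9, 57.1, 70.4, 39.8, 83.0])

srcc, _ = stats.spearmanr(pred, mos)   # rank-order (monotonicity) agreement
plcc, _ = stats.pearsonr(pred, mos)    # linear agreement
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```

In IQA practice, PLCC is often computed after a nonlinear logistic mapping of the predictions to the subjective scale; that step is omitted here.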
    • Yan Xiaoyang, Wang Huake, Hou Xingsong, Dun Yujie
      Vol. 28, Issue 11, Pages: 3415-3427(2023) DOI: 10.11834/jig.221028
      Dual-branch low-light image enhancement network via YCbCr space divide-and-conquer
      摘要:ObjectiveThe images acquired at night or backlight conditions always have poor visibility and have details that are hidden in the dark. Moreover, due to insufficient lighting and limited exposure time, the number of incident photons on these images decreases, thereby resulting in a large amount of non-negligible noise. Therefore, improving the contrast and removing noise from low-light images present a challenge. The existing low-light image enhancement algorithms usually enhance the contrast and suppress the noise in the RGB color space by way of enhancing and then denoising. However, due to the complex coupling relationship between brightness distortion and noise in the RGB space, enhancing-and-then-denoising methods usually amplify the noise that is originally hidden in the dark, thus increasing the difficulty of the denoising task, affecting the aesthetic quality of images, and constraining subsequent image processing tasks on the aspects of image classification, object detection, and recognition. To effectively deal with brightness distortion and noise, a dual-branch low-light image enhancement network based on YCbCr space is proposed in this paper to yield enhanced images with minimal color distortion and less noise.MethodThe YCbCr space can separate luminance information from chrominance information. Experiments show that brightness distortion is mostly observed in luminance information and that chrominance information is heavily polluted with noise. This paper designs a corresponding network structure based on the divide-and-conquer method to deal with different degradation modes, where different modules can effectively deal with a specific distortion and reduce the difficulty of network learning. A novel dual-branch network in YCbCr space is then developed to realize the decoupling of luminance distortion and noise, which can enhance the luminance information and denoise the chrominance information. First, the low-light images are transformed from the RGB space to the YCbCr space. Second, the images in the YCbCr space are inputted into the proposed dual-branch enhancement network for low-light image enhancement. This network comprises a luminance enhancement module, a noise removal module, and a fusion module. The luminance enhancement module performs contrast enhancement on the luminance information, the noise removal module denoises the chrominance information, and the fusion module combines the features of luminance and chrominance to obtain the final enhanced results. The luminance channel Y and chrominance channels Cb and Cr serve as inputs of the luminance enhancement and noise removal modules, respectively. Given that the luminance distortion is often non-local, the luminance enhancement module adopts a U-shaped network that can obtain the rich brightness context information of the image via an encoder and decoder. Specifically, in the U-Net network, the encoder expands the receptive fields of convolutions through a pooling operation, and at the bottleneck layer, the large receptive field can extract non-local luminance information for contrast enhancement. This non-local information can then be extended to the global scale by up-sampling at the decoder layer by layer. Given that noise damages the texture of low-light images, the noise removal module uses a multi-scale denoising network to enrich the details of the image and effectively learn its multi-scale features. 
The noise removal module consists of a multi-residual channel attention block, pixel detection, and pixel shuffle. The residual channel attention block, which is commonly found in several networks, is suitable for feature extraction. Instead of up-sampling and down-sampling, the pixel unshuffled and shuffle operations are used to obtain multi-scale input images, remove noise at multi-scale features, fuse multiple features, and eventually obtain valuable features for denoising while avoiding information loss in the images. The brightness supervision and chrominance supervision modules are also used to strengthen the functions of the brightness enhancement and noise removal modules, which can help boost contrast and remove noise effectively. The supervision modules ensure the features of brightness enhancement and noise removal. Several simple convolution operations are then applied to concatenate the outputs of the brightness enhancement and noise removal modules and obtain the final results in the YCbCr space. An additional transform operation is applied to convert the images from the YCbCr space to the RGB space.ResultThe effectiveness of the proposed algorithm is tested on several paired and unpaired public low-light image enhancement datasets, such as the low-light dataset (LOL), vision enhancement in the low-light condition dataset, multi-exposure image fusion dataset, and naturalness preserved enhancement dataset. The proposed method is also compared with several classical low-light enhancement algorithms that have achieved excellent results in low-light image enhancement. These algorithms are implemented using public codes, and their parameters are set by referring to their original papers. These methods are then objectively evaluated based on their visual results and by using several quality metrics, including peak signal-to-noise ratio (PSNR), structure similarity index, natural image quality evaluator, and learned perceptual image patch similarity. The images generated by the proposed method have richer details, more realistic color, and less noise compared with those generated by the other algorithms. Moreover, on the LOL dataset, compared with kindling the darkness (KinD++), the PSNR of the proposed method increases by 3.09 dB, while compared with the Retinex-based deep unfolding network that uses Transformer as its basic module, the PSNR of the proposed method increases by 2.74 dB.ConclusionThe proposed spatial decoupling method effectively decouples the brightness distortion and noise of low-light images. The designed dual branch network can enhance the brightness and remove noise, thus effectively solving the complex coupling problem of brightness and noise in low-light images and obtaining noise-free brightness-enhanced images. Extensive experiments demonstrate the effectiveness and generalization of this method, which outperforms the state-of-the-art both qualitatively and quantitatively.  
      Keywords: low-light image enhancement; YCbCr color space; dual-branch network; noise removal; divide and conquer
      Published: 2023-12-07
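The method above separates luminance (Y) from chrominance (Cb, Cr) before enhancing and denoising them in separate branches. The sketch below illustrates only that color-space split, using the BT.601 full-range conversion on a normalized RGB image; the input image is a placeholder and the conversion convention is an assumption, not the paper's exact preprocessing.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """img: float RGB array in [0, 1], shape (H, W, 3). Returns Y, Cb, Cr (BT.601, full range)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

low_light = np.random.rand(256, 256, 3)   # placeholder low-light image
y, cb, cr = rgb_to_ycbcr(low_light)
# y would feed the luminance-enhancement branch; cb and cr the noise-removal branch.
```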
    • Xiao Dinghan, Yu Simin, Wang Qianxue
      Vol. 28, Issue 11, Pages: 3428-3439(2023) DOI: 10.11834/jig.220807
      Chaotic image encryption algorithm based on elliptic curve and adaptive DNA coding
      摘要:ObjectiveWith the rapid development of computer networks, digital images, as an important part of information transmission, are also widely transmitted through the Internet. Digital images used in national administration, military defense, commercial intelligence, and other fields may contain some sensitive private information. Therefore, during its transmission, the security of an image must be considered. Chaotic systems are characterized by ergodicity, sensitivity to initial conditions, and long-term unpredictability. Therefore, many scholars use these systems in designing image encryption algorithms. To resist chosen-plaintext attacks, most traditional image encryption algorithms based on the chaotic system adopt symmetric encryption methods related to plaintext information in the process of key generation or encryption. This kind of algorithm uses the same key to encrypt and decrypt information. Before transmission, the relevant key needs to be transmitted to the receiver through a secret channel. This one-time pad mode means that the number of keys that need to be stored and transmitted increases along with the number and frequency of communications. These keys do not contain information and are redundant data that bring unnecessary burden to users. In addition, the high-frequency transmission of these keys can greatly increase the risk of exposing secret channels, thus leading to adverse effects. To solve these problems, this paper proposes a chaotic image encryption algorithm based on elliptic curve and adaptive DNA coding.MethodThe proposed algorithm adopts the public key cryptosystem of an elliptic curve to make the communication parties reach the key consensus without transmitting the secret private key. To reach the goal of the key consensus, the consensus element is transformed into the initial values of the improved four-dimensional hyperchaotic Lorenz system and is then combined with the four-dimensional hyperchaotic Lorenz system to generate the consensus chaotic key sequence for adaptive DNA coding encryption. This encryption process embeds dynamic diffusion and adaptive permutation structures in the diffusion process of DNA coding and decoding. The operation rules of dynamic diffusion are dynamically selected by a chaotic key sequence, and the intermediate ciphertext state is fed back in the process. Adaptive permutation reveals the characteristics of the intermediate ciphertext state, scrambles the chaotic key sequence, and then performs bit-level permutation. Therefore, this algorithm can resist segmentation and chosen-plaintext attacks. At the same time, the intermediate ciphertext state in the encryption process can be adaptively synchronized at the decryption end without additional transmission, thereby avoiding the problem of key redundancy.ResultThrough the simulation of three test images of different sizes, this paper tests and analyzes the security of the proposed algorithm in terms of key space, key sensitivity, adjacent pixel correlation, information entropy, NIST randomness, number of pixels change rate (NPCR), and unified averaged changed intensity (UACI). Analysis results show that the key space of the algorithm reaches 2256, which is sufficient to resist an exhaustive attack. The ciphertext image is extremely sensitive to the key, and a weak correlation is observed among the adjacent pixels of the ciphertext image. The information entropy of the three ciphertexts are 7.997 5, 7.999 3, and 7.999 8, which are very close to the ideal value of 8. 
The proposed algorithm passes all 15 sub-tests of NIST SP800-22. The test results of NPCR and UACI are also within the 0.05 confidence interval, thereby suggesting that the proposed algorithm can resist statistical and differential attacks. This algorithm is also compared with some latest chaotic image encryption algorithms. The relevant simulation tests and comparative analysis show that the chaotic image encryption algorithm has high practicability and security.ConclusionThe proposed algorithm combines the public key cryptosystem and the chaotic system of an elliptic curve and improves the symmetric encryption model of the traditional chaotic image encryption algorithm into an asymmetric encryption model. The special design of this algorithm solves the problem of key redundancy, greatly improves its feasibility, and ensures its security. The security of this algorithm is verified by various experimental simulations, and the results are compared with those of other algorithms. This algorithm is suitable for encrypting privacy gray images of various sizes to ensure information and data security in the process of information communication.  
      Keywords: image encryption; elliptic curve; hyperchaotic system; adaptive DNA coding; dynamic diffusion
      Published: 2023-12-07
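The NPCR and UACI values quoted above measure how strongly a one-pixel change in the plaintext alters the ciphertext. A small sketch of both metrics for two 8-bit ciphertext images follows; the images here are random placeholders rather than outputs of the proposed cipher.

```python
import numpy as np

def npcr_uaci(c1, c2):
    """c1, c2: uint8 ciphertexts of equal shape, from plaintexts differing in one pixel."""
    diff = c1 != c2
    npcr = diff.mean() * 100                                              # % of differing pixels
    uaci = (np.abs(c1.astype(int) - c2.astype(int)) / 255).mean() * 100   # avg intensity change (%)
    return npcr, uaci

c1 = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
c2 = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
print(npcr_uaci(c1, c2))   # ideal values are roughly 99.6% and 33.5% for 8-bit images
```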
    • Lan Zhi, Yan Caiping, Li Hong, Zheng Yadan
      Vol. 28, Issue 11, Pages: 3440-3452(2023) DOI: 10.11834/jig.220919
      HDA-GAN: hybrid dual attention generative adversarial network for image inpainting
      摘要:ObjectiveImage inpainting has been extensively examined as a basic topic in the field of image processing over the past two decades. Image inpainting attempts to fill in the missing or corrupted parts of an image with satisfactory and reasonable content. Given their inability to generate semantically compliant images, traditional techniques can succeed in certain straightforward situations but fall short when the missing region is large or complex. Image inpainting methods based on deep learning and adversarial learning have produced increasingly promising results in recent years. However, most of these methods produce distorted structures and hazy textures when the missing region is large. One primary cause of this problem is that these methods do not consider global or long-range structural information due to the locality of vanilla convolution operations, even with dilated convolution that enlarges the local receptive field.MethodTo overcome this issue, this study proposes a novel image inpainting network called hybrid dual attention generative adversarial network (HDA-GAN), which captures both global structural information and local detailed textures. Specifically, HDA-GAN integrates two types of cascaded attention propagation modules, namely, cascaded channel-attention propagation and cascaded self-attention propagation, into different convolutional layers of the generator network. For the cascaded channel-attention propagation module, several multi-scale channel-attention blocks are cascaded into shallow layers to learn features from low-level details to high-level semantics. The multi-scale channel-attention block adopts the split-attention-merge strategy and residual-gated operations to aggregate multiple channel attention correlations for enhancing high-level semantics while preserving low-level details. For the cascaded self-attention propagation module, several positional-separated self-attention blocks are stacked into middle and deep layers. These blocks also adopt the same split-attention-merge strategy and residual-gated operations as the multi-scale channel-attention blocks but with some changes. The purpose of this design is to use the positional-separated self-attention blocks to maintain the details while learning long-range semantic information interaction. The design of these blocks also further reduces the computational complexity compared with original self-attention.ResultNumerous tests using the Paris Street View and CelebA-HQ datasets demonstrate that HDA-GAN can produce superior image inpainting in terms of quality and quantity with better restoration effects compared with several state-of-the-art algorithms. The Paris Street View dataset includes 15 000 street images of Paris, 14 900 training images, and 100 test images, while the CelebA-HQ dataset contains 30 000 high-quality human face images. The fine-grained texture synthesis of models may be evaluated using the high-frequency features of the hair and skin. Following a standard configuration, 28 000 images are used for training, and 2 000 are used for testing. In both training and testing, free-form masks are employed while adhering to the standard settings. Free-form masks are highly applicable to real-world settings and thus are used in many inpainting techniques. Following a standard setting, all images are resized to 512 × 512 pixels or 256 × 256 pixels for training and testing depending on the datasets. 
The mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) are introduced to evaluate the performance of different methods in filling holes with different hole-to-image region ratios. In the Paris Street View dataset, the PSNR of the proposed method increases by 1.28 dB, 1.13 dB, 0.93 dB, and 0.80 dB, while its SSIM increases by 5.2%, 8.2%, 10.6%, and 13.1% compared with the Edge-LBAM method as the hole-to-image region ratios increase. Meanwhile, in the CelebA-HQ dataset, the MSE value of the proposed method decreases by 2.2%, 5.4%, 11.1%, 18.5%, and 28.1%, while its PSNR increases by 0.93 dB, 0.68 dB, 0.73 dB, 0.84 dB, and 0.74 dB compared with the AOT-GAN method as the hole-to-image region ratios increase. These experimental results show that the proposed method quantitatively and qualitatively outperform the other algorithms.ConclusionThis study proposes a novel hybrid attention generative adversarial network for image inpainting called HDA-GAN that can generate reasonable and satisfactory content for a distorted image by fusing two carefully designed attention propagation modules. Using the cascaded attention propagation module in skip-connect layers can significantly improve the global structure and local texture captured by the generator, which is crucial for inpainting, particularly when filling complex missing regions or large holes. The cascaded attention propagation modules will be applied to other vision tasks, such as image denoising, image translation, and single image super-resolution, in future work.  
      Keywords: image inpainting; generative adversarial network (GAN); cascaded channel attention propagation module; cascaded self-attention propagation module; large area inpainting
      Published: 2023-12-07
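The comparison above uses MSE, PSNR, and SSIM at several hole-to-image ratios. A minimal MSE/PSNR sketch for an inpainted result against its ground truth is given below (SSIM is omitted for brevity); the images and the simulated hole are placeholders, not data from the cited experiments.

```python
import numpy as np

def mse_psnr(result, reference, max_val=255.0):
    """Mean squared error and PSNR between an inpainted result and the ground truth."""
    err = (result.astype(np.float64) - reference.astype(np.float64)) ** 2
    mse = err.mean()
    psnr = 10 * np.log10(max_val ** 2 / mse) if mse > 0 else float("inf")
    return mse, psnr

reference = np.random.rand(256, 256, 3) * 255
result = reference.copy()
result[96:160, 96:160] = np.clip(result[96:160, 96:160] + 5, 0, 255)  # pretend the filled hole deviates slightly
print(mse_psnr(result, reference))
```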

      Image Analysis and Recognition

    • Dai Yunshu, Fei Jianwei, Xia Zhihua, Liu Jianan, Weng Jian
      Vol. 28, Issue 11, Pages: 3453-3470(2023) DOI: 10.11834/jig.221006
      Local similarity anomaly for general face forgery detection
      摘要:ObjectiveIn recent years, the development of DeepFake has made great progress, and the highly realistic forged face images created by such technology are posing a great threat not only to people’s privacy and security but also to the international political situation. Therefore, detection methods with good generalization ability need to be developed. In their early stages of development, forged faces had low fidelity with obvious defects. Therefore, traditional digital forensic algorithms and deep learning models could achieve good detection performances. However, with the development of DeepFake, these forged faces become increasingly realistic, thus posing a challenge to detection algorithms. Researchers have focused on the essential differences between real and forged faces to improve the detection performances of their algorithms. The process of DeepFake can be decomposed into the following steps: 1) detect and crop the face in the target image; 2) forge the face using a forgery algorithm; 3) paste the forged face back to the original image and use image fusion technology to eliminate the boundary defects and improve the visual effect. Step 3 often results in easily detectable local forgery traces, which are important cues for distinguishing real faces from fake ones. Many researchers have attempted to build models that can learn such traces to improve accuracy or to implement tampering localization. However, given that both the local traces and the image fusion methods involved in different forgery techniques widely differ, the detection algorithms for different forgery techniques have limited generalization ability. Therefore, although the local traces caused by Step 3 above are universal, directly learning such features for real and forged face recognition contributes little to generalizability.MethodThis paper proposes a DeepFake detection method based on local similarity anomalies to achieve high generalizability. Instead of directly learning local forgery traces to distinguish real faces from fake ones, this method transforms the learning objective into the similarity of local features. Specifically, the face region of the forged face image has source features that differ from the background region, and although these two types of regions have uniform source features internally, the fusion boundary between the face and background contains conflicting source features and thus has low level of local similarity. These local similarity anomalies are independent of both the specific forgery algorithm and the fusion algorithm and can be regarded as heterogeneous features that are highly consistent with the essential difference between real and fake faces. To cache these traces, this paper proposes the local similarity predicator module. By decomposing the local depth features of face images into horizontal and vertical groups, the learning objective is converted from recognizing specific forgery traces to predicting the similarity of source features within the image by calculating the similarity of local depth features and their neighbors so as to capture the essential differences between real and fake faces in a general way. In addition, previous studies find that frequency domain features contain important clues for distinguishing real from fake faces. 
The proposed method draws on the domain knowledge of steganalysis and constructs a learnable convolutional pyramid module based on the spatial rich model(SRM), which compensates for the limited ability to express true and false features in the RGB space and improves the in-domain detection performance. This study also proposes the spatial rich model convolutional pyramid, which inherits the high-frequency noise features extracted by the spatial rich model convolutional pyramid (SRMCP) kernel, can be continuously updated during the training, and can be extended to a pyramid architecture with different receptive fields to effectively capture high-frequency noise features at different scales.ResultThe overall results of FF++ are compared under three compression factors. The proposed method, which uses ResNet18 as its backbone, achieves extremely high detection accuracy on both raw and compressed datasets. This method not only significantly outperforms the classical digital forensic algorithms but also surpasses some of the recently proposed advanced algorithms for deep forgery detection. Specifically, the proposed method achieves 99.72%, 98.34%, and 90.73% accuracies on RAW, C23, and C40, respectively, and its average accuracy is 2.31% and 13.33% (20.26% on the C40 dataset) higher than those of Xception and MesoNet, respectively. The proposed method also outperforms a metric learning method published in CVPR 2021 that incorporates the frequency and space domains. Specifically, the proposed method achieves 0.29%, 1.63%, and 1.22% higher accuracies on RAW, C23, and C40, respectively, compared with this metric learning method. Overall, the proposed method takes the lead in terms of accuracy. Experimental results reveal that the local similarity module can effectively capture the inherent features of forged faces, thus substantially improving detection accuracy and achieving high accuracy even with a simple ResNet18 as the backbone. The average cross-domain area under curves (AUCs) of the proposed method reach 91.40%, 96.03%, 99.08%, and 96.05% on the four subsets of FF++, which are 15.41%, 16.47%, 21.11%, and 14.7% higher than those of Xception, respectively. In addition, the average accuracies of the proposed method are improved by 0.77%, 5.59%, 6.11%, and 4.28%, respectively, compared with state-of-the-art methods. The cross-domain results on Celeb-DF show that the proposed method outperforms the existing methods with the help of ResNet18. Although recently introduced methods have made significant progress in cross-domain detection with an average accuracy exceeding 70%, the cross-domain accuracies of the proposed method are 1.11%, 3.73%, and 5.17% higher compared with those of state-of-the-art methods.ConclusionThe method proposed in this paper can greatly improve the detection performance of lightweight convolutional neural networks and achieves better generalization and robustness compared with other recently proposed methods. The local similarity learning module will be further optimized in future work to ensure that it can predict local anomalies with different types of forged faces to further improve its generalizability on unknown forged faces.  
      关键词:deep face forgery detection;spatially rich model (SRM);convolutional pyramid;local similarity learning;multi-task learning   
    • Zhu Jinlei,Li Yanfeng,Chen Houjin,Sun Jia,Pan Pan
      Vol. 28, Issue 11, Pages: 3471-3484(2023) DOI: 10.11834/jig.220838
      Cross-domain unsupervised Re-ID algorithm based on neighbor adversarial and consistency loss
      摘要:ObjectiveThe purpose of pedestrian re-identification is to determine whether the people appearing in different camera scenes belong to the same person. This process can be regarded as a sub-problem of image retrieval and is widely used in intelligent video surveillance, criminal investigation, safety production, and other fields. Most of the pedestrian re-identification algorithms are designed with the supervised method based on known labels. These data are high expensive and are sometimes impossible to obtain. Most of the existing unsupervised pedestrian re-identification methods are based on loss functions, such as triplet loss, but have poor ability to distinguish similar identities. Compared with supervised pedestrian recognition, unsupervised pedestrian recognition technology has greater application prospects. Although the image of pedestrians is partly affected by the shooting angle, light, camera parameters, pedestrian clothing, and other factors, pedestrian features also have strong regularity, such as intra-class feature convergence, inter-class feature divergence, and intra class feature consistency. Different scenes face different data distributions, and a large domain difference can be observed in real applications. The aforementioned problems lead to performance degeneration when transfer learning the model. Due to the great differences between the source and target domain data in image acquisition conditions and application scenarios, applying the source domain training model directly to the target domain will result in poor performance. Unsupervised domain adaptive (UDA) person re-identification aims to adapt the model trained on a labeled source domain to an unlabeled target domain. For pseudo-label-based UDA methods, pseudo label noise is the main problem for model degradation, while the cross-camera problem is one of the main factors that cause this noise.MethodAiming at the poor discriminative ability of similar pedestrians caused by pseudo-label noise, a cross-domain unsupervised pedestrian re-identification method based on neighbor optimization is proposed in this paper. To address the incorrect selection of the hardest positive and negative samples in triplet loss caused by the cross-camera problem, a camera-pseudo-label-based triplet loss is designed. Triplet-based loss does not fully explore the sample similarities within the target domain, which highly depends on the pseudo labels. To enhance the identification ability of high-similarity pedestrians, a neighborhood adversarial loss (NAL) function is designed. By constructing the sample pair between any sample and other samples, the confrontation between sample pairs of the strongest certainty and uncertainty is implemented. To make the intra-class features converge in the same direction, a neighborhood consistency loss (NCL) function is designed. The feature distance curve is processed by center normalization, and the feature distances of the k-nearest samples are narrowed while maintaining the inherent difference of the feature curve. Unlike the migration mechanism of ordinary semi-supervised learning methods, the proposed algorithm focuses on the structure and loss function of the unsupervised learning model in the target domain. First, the input target domain samples are classified based on the pre-training model, and the pseudo labels are assigned to the clustering results. Second, triple hard loss is used to control the introversion of intra-class features and the divergence of inter-class features. 
To enhance the ability to distinguish similar identities, this paper designs an adversarial loss function in which the group with the closer feature distance in the class antagonize with the group having a longer feature distance. Furthermore, to ensure consistency in the convergence direction of class features, the feature consistency loss function is designed to measure the continuity of various sample features in the batch group. Finally, the above three loss functions are weighted and added to form the final loss function.ResultExperimental results on the Market-1501 and DukeMTMC-reID datasets show that the proposed method has certain advantages over state-of-the-art methods. Ablation experiments reveal the effectiveness of each part of the algorithm loss function. Analysis of the ablation experimental results shows that the three loss functions have certain complementarities in clustering. When considering the intra-class and inter-class divergence of features, further considering the consistency of feature convergence direction can comprehensively improve the performance of the pedestrian re-recognition algorithm. Comparative experiments show that the performance of the algorithm is significantly improved compared with existing methods, while the parameter experiments highlight the influence of different super-parameter values on recognition performance. In the comparative experiments, the proposed method obviously outperforms the existing methods. Rank-1/mean average precision (mAP) achieves 92.8%/84.1% and 83.9%/71.1% on the Market-1501 and DukeMTMC-reID datasets, respectively. Experimental results further show that similar people are prone to be given a pseudo noise label when clustering and that the proposed method can control the label noise by using the NAL loss function. Complementary with NCL, the NAL loss function controls the consistency of features of the k-nearest samples. Under the action of the NAL and NCL loss functions, the noise is effectively controlled, and the unsupervised learning effect is improved on the target domain.ConclusionThe proposed method can improve the adaptability of the network model via unsupervised training in the target domain. Through the neighbor adversarial loss and neighbor consistency loss functions, this method can easily distinguish similar people, thus effectively improving the performance and robustness of pedestrian re-identification. Ablation and comparative experiments are carried out on public datasets, and results show that the performance of this algorithm is significantly improved compared with existing methods.  
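      The camera-pseudo-label triplet loss is described only at a high level above; the sketch below shows one plausible way to make hard-triplet mining camera-aware in PyTorch. The mining rule (hardest positive from a different camera, hardest negative from the same camera), the function name, and the margin are assumptions and may differ from the paper's design.

```python
import torch
import torch.nn.functional as F

def camera_aware_triplet(feats, pids, cams, margin=0.3):
    """Hedged sketch of camera-aware hard triplet mining.

    feats: (N, D) embeddings, pids: (N,) pseudo identity labels, cams: (N,) camera ids.
    For each anchor, the hardest positive is mined from a different camera and the
    hardest negative from the same camera, so that cross-camera appearance shifts do
    not dominate the selection.
    """
    dist = torch.cdist(feats, feats)                       # (N, N) pairwise distances
    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)
    same_cam = cams.unsqueeze(0) == cams.unsqueeze(1)
    not_self = ~torch.eye(len(pids), dtype=torch.bool)

    pos_mask = same_id & ~same_cam & not_self              # positives seen by another camera
    neg_mask = ~same_id & same_cam                         # negatives from the same camera

    large = dist.max().item() + 1.0
    hardest_pos = dist.masked_fill(~pos_mask, -large).max(dim=1).values
    hardest_neg = dist.masked_fill(~neg_mask, large).min(dim=1).values

    valid = pos_mask.any(dim=1) & neg_mask.any(dim=1)      # anchors with usable pairs
    if not valid.any():                                    # batch contains no usable triplet
        return feats.new_zeros(())
    return F.relu(hardest_pos[valid] - hardest_neg[valid] + margin).mean()
```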
      关键词:pedestrian re-identification (Re-ID);unsupervised learning;cross-domain learning;neighbor adversarial loss (NAL);neighbor consistency loss (NCL)   
    • Liang Huagang,Zhao Huixia,Liu Lihua,Yue Peng,Zheng Zhenyu
      Vol. 28, Issue 11, Pages: 3485-3496(2023) DOI: 10.11834/jig.220789
      Combining cross-layer feature fusion with cascade detectors for anti-vibration hammer defects detection
      摘要:ObjectiveAnti-vibration hammers can reduce the periodic vibration of transmission lines to minimize line fatigue damage. Therefore, a regular inspection of these hammers is necessary. Previous studies on anti-vibration hammers for transmission lines have mostly focused on the identification or fault classification of anti-vibration hammers with only one case of deficiency. However, the detection methods for such small-scale targets are limited, and their leakage and false detection rates remain very high, which cannot guarantee the safe operation of transmission lines. Therefore, in view of the complex background of the aerial images obtained from the current unmanned aerial vehicle (UAV) inspection of transmission lines, the different types, shapes, and characteristics of anti-vibration hammers occupying a small pixel area in these aerial images can lead to problems, such as inability to determine the types of defects and the low detection accuracy in the anti-vibration hammer detection process. In this paper, we propose an anti-shock hammer defect detection method based on cross-layer feature fusion and cascade detector.MethodGiven the lack of a public anti-vibration hammer dataset at home and abroad for targeted research, this paper takes the aerial images taken via the UAV inspection of anti-vibration hammer components as the original data and then expands the dataset by geometric and contrast transformation to ensure the equalization of the number of different types of anti-vibration hammers in the dataset and to establish an anti-vibration hammer defect detection dataset. The anti-vibration hammer defects are then refined and classified into four categories, namely, normal, corroded, broken, and collision, to serve as the data basis for the algorithm research designed in this paper. The anti-shock hammer defect detection algorithm is then designed. The proposed anti-shock hammer defect detection method is mainly divided into two major parts, namely, the feature extraction network and classification location prediction network. The feature extraction network extracts accurate anti-seismic hammer features by fusing the Visual Geometry Group 16-layer network (VGG16). The main idea is to insert a convolutional kernel of size 1 × 1 into the last convolutional layer of the first, third, and fifth layers, after which the last convolutional layer of the first layer is connected to the maximum pooling layer and fused with the third layer using a deconvolution operation after the fifth layer. These two layers are then fused to form the final feature map, which balances the semantic information and spatial features. To reduce the impact of the intersection over union (IoU) threshold on network performance, in the localization and classification network part, the classification and location prediction network proposed in this paper uses three cascade detectors to gradually increase the IoU threshold and improve the quality of samples and the training effect of the network. The non-maximal suppression method is then replaced by the soft non-maximum suppression (Soft-NMS) algorithm to remove the bounding box.ResultThe main contributions of this paper are as follows: 1) the dataset part expands the data using the aerial images taken via the UAV inspection of anti-vibration hammer components to establish the anti-vibration hammer defect detection dataset. 
To ensure the validity of this dataset, part of those samples that cannot be easily separated even through the naked eye are reasonably removed; 2) the network model is based on VGG16, feature fusion is performed on 1, 3, and 5 layers of features to effectively obtain more features, and three cascade detectors are used to classify the target to reduce the impact of the IoU threshold on network performance and improve the detection capability of the algorithm. The average accuracy of the network model was improved by 13.5%, 3.4%, and 5.8% over the fast region-based convolutional network (Fast R-CNN), Faster R-CNN, and you only look once version 4 (YOLOv4), respectively, and by 9.5%, 8.5%, and 8% over single shot MultiBox detector 300 (SSD300), YOLOv3, and RetinaNet, respectively, when compared with six other advanced algorithms based on deep learning. These results may be explained by the incorporation of multiple layers of features, which enable the model to obtain more information about low- and high-level image features. Moreover, the uses of cascade detectors reduce the impact of the IoU thresholds on network performance and eventually improve the detection accuracy of small-scale targets. Compared with Faster R-CNN, the false detection rate of the proposed method is reduced by 5.61%, whereas the missed detection rate is reduced by 3.01%. Therefore, the proposed method improves its accuracy while effectively reducing its false detection and missed detection rates and works particularly well in practical applications.ConclusionThe proposed anti-shock hammer defect detection method obtains good detection results for different backgrounds, illuminations, angles, scales, and types of anti-shock hammers. Experimental results show that this method not only efficiently extracts the characteristics of anti-shock hammers but also improves the localization accuracy of the network, thus effectively improving the accuracy of the algorithm and satisfying the actual detection requirements of anti-shock hammer inspection work while showing improved robustness and effectiveness.  
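      The cross-layer fusion of VGG16 features can be sketched as follows (PyTorch assumed). The 1 × 1 reduction convolutions, the pooling/deconvolution factors, and the channel sizes are illustrative choices based on standard VGG16 feature maps, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class CrossLayerFusion(nn.Module):
    """Fuse shallow (block 1), middle (block 3), and deep (block 5) VGG-style features."""
    def __init__(self, out_ch=256):
        super().__init__()
        self.reduce1 = nn.Conv2d(64, out_ch, kernel_size=1)
        self.reduce3 = nn.Conv2d(256, out_ch, kernel_size=1)
        self.reduce5 = nn.Conv2d(512, out_ch, kernel_size=1)
        self.pool1 = nn.MaxPool2d(kernel_size=4, stride=4)                 # block-1 map down to block-3 size
        self.up5 = nn.ConvTranspose2d(out_ch, out_ch, kernel_size=4, stride=4)  # block-5 map up to block-3 size

    def forward(self, c1, c3, c5):
        shallow = self.pool1(self.reduce1(c1))     # spatial detail from the shallow layer
        deep = self.up5(self.reduce5(c5))          # semantics from the deep layer
        return self.reduce3(c3) + shallow + deep   # fused map balancing detail and semantics

# toy shapes mimicking VGG16 on a 320 x 320 input
c1 = torch.randn(1, 64, 320, 320)
c3 = torch.randn(1, 256, 80, 80)
c5 = torch.randn(1, 512, 20, 20)
print(CrossLayerFusion()(c1, c3, c5).shape)   # torch.Size([1, 256, 80, 80])
```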
      关键词:shockproof hammer defects;deep learning;small-scale object detection;cross-layer feature fusion;cascade detector   
    • Yan Guangwei,Zhou Xiangjun,Jiao Runhai,He Hui
      Vol. 28, Issue 11, Pages: 3497-3508(2023) DOI: 10.11834/jig.221077
      Defect detection of tower bolts by fusion of priori information and feature constraints
      摘要:Objective: Bolts, as fasteners, connect and fix key components in power towers. Adverse environments, mechanical vibration, and material aging may cause bolt cotter pins to fall off or come loose, affecting the normal operation of other parts on the transmission line. Monitoring, testing, and maintenance of bolt conditions therefore play crucial roles in ensuring the uninterrupted operation of a power system. However, the timely and efficient automatic detection of transmission line bolt defects remains a challenge. Aiming at the problems of intra-class diversity and inter-class similarity in automatic bolt defect detection, a faster regions with convolutional neural network (Faster R-CNN) model training method based on prior information and feature constraints is proposed in this paper. Method: In the pre-processing of aerial inspection images, this paper designs a region-of-interest extraction algorithm based on prior information that extracts the context region of the identified object, which reduces the amount of data in the training stage, helps the model focus on the key areas, and improves its feature extraction ability. In the model training stage, the output features of the Faster R-CNN model are first constrained by the Fisher constraint to reduce the intra-class distances of sample features while increasing their inter-class intervals, thereby improving the separability of sample features. Afterward, the K-nearest neighbor algorithm is applied to the sample features to obtain a k-nearest-neighbor probability, which serves as an indicator of difficult samples so that the model pays more attention to them in subsequent training. Result: All data used in this paper are real aerial inspection images from a power company in Central China. Bolts on a power tower can be divided into bolts with and without pins. Bolts with pins connect and fix the key components and need to bear a large force; compared with bolts without pins, they are more prone to defects. Therefore, this paper treats bolts with pins as the research object. According to the state of the cotter pin, bolts with pins are divided into normal and defective bolts, where defective bolts have fallen or loose cotter pins. The dataset contains 28 887 images, each with a resolution of 5 000 × 3 000 pixels. The training, validation, and test sets are divided according to a ratio of 8∶1∶1. The proposed model is then tested on this dataset. Compared with the baseline model, the proposed model improves the mean average precision of bolt recognition by 6.4%, increases the average precision of normal bolt identification by 0.9%, and increases the average precision of defective bolt identification by 12%. Conclusion: This paper explores the problems of intra-class diversity and inter-class similarity in bolt defect detection and proposes a transmission tower bolt defect detection method based on the fusion of prior information and feature constraints. The method improves the recognition of bolt defects and lays a good foundation for the automatic detection of bolt defects in transmission lines.  
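      A minimal sketch of the Fisher-style feature constraint described above is given below (PyTorch assumed): intra-class scatter is minimized while the distance between class centers is maximized. The exact formulation, its weighting, and its combination with the k-nearest-neighbor hard-sample indicator in the paper may differ.

```python
import torch

def fisher_feature_loss(feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pull samples toward their class center and push class centers apart.

    feats: (N, D) features from the detection head, labels: (N,) class ids
    (e.g., normal vs. defective bolts).  Assumes the batch contains at least
    two classes.
    """
    centers, intra = [], []
    for c in labels.unique():
        cls_feats = feats[labels == c]
        center = cls_feats.mean(dim=0)
        centers.append(center)
        intra.append(((cls_feats - center) ** 2).sum(dim=1).mean())
    intra_class = torch.stack(intra).mean()                 # intra-class scatter (minimize)
    inter_class = torch.pdist(torch.stack(centers)).mean()  # distance between centers (maximize)
    return intra_class - inter_class

feats = torch.randn(32, 128)
labels = torch.randint(0, 2, (32,))
print(fisher_feature_loss(feats, labels).item())
```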
      关键词:power inspection;bolt defect detection;intra-class diversity;inter-class similarity;prior information;feature constraints;faster regions with convolutional neural network(Faster R-CNN)   
    • Du Yanling,Wu Tianyu,Chen Kuo,Chen Gang,Song Wei
      Vol. 28, Issue 11, Pages: 3509-3519(2023) DOI: 10.11834/jig.220944
      Small object detection for ocean eddies using contextual information and attention mechanism
      摘要:ObjectiveOcean eddies are responsible for most of the material transportation and energy transfers in the ocean. The accurate detection of these eddies serves as the basis for revealing the evolution of ocean eddies and their interactions with other marine phenomena. However, small-scale objects and dense distribution are often observed in the active area of ocean eddies, which leads to problem of low detection accuracy. Traditional detection methods are limited by the poor generalizability of the artificial parameter design. These methods also have poor ocean eddy detection accuracy compared with deep learning methods. However, a deep learning model with high sampling rate loses the underlying details and contour information in the process of small target detection. The target detection contour is located far from the real contour of the target. To address the low detection accuracy caused by the loss of low-level detail information and contour information of small-scale ocean eddy targets, this paper proposes an improved U-Net network.MethodBased on the U-shaped progressive sampling network, a context feature fusion module is added to fuse the features of each coding layer, and a residual attention mechanism is added to the target features before the feature fusion in order for the model to pay attention to the contour information of the ocean eddies. A data augmentation method is then introduced to reduce the overfitting problem of the model. Feature fusion is carried out through the context feature fusion module, which takes the three-layer feature map of the U-shaped structure coding layer of the U-Net network as input, the lowest-level feature map as the target feature, and the last two-layer feature map as the context and target features. The context feature map is initially upsampled to the same size as the lowest-level feature through the deconvolution structure, and the number of channels is reduced to 1/2 of the lowest-level feature in order to prevent the amount of information of the context feature from exceeding that of the target feature. L2 norm and ReLU are then used to achieve the fusion of context and target features. The proposed model uses two contextual feature fusion modules, which take the first to third layer feature maps of the encoding layer as input and the second to fourth layer feature maps as input, respectively. The residual attention mechanism consists of two processing channels. The first channel has a residual structure (batch norm, conv of 1 × 1 kernel and multiple concatenation of ReLU) that prevents gradient disappearance and extracts certain contour information, while the second channel comprises a down-up sampling layer and a sigmoid layer to extract high-level semantic information. To effectively reduce the over-fitting phenomenon, random region sampling and random mask processing are used for data augmentation. In the experiment, the model is trained in the NVIDIA GTX 1080Ti GPU environment, where its initial learning rate is set to 1 × 10-3, the loss function is optimized by the Adam optimizer, the batch size of the model training is set to 16, and the number of iterations is set to 200.ResultThe satellite sea surface height dataset of the South Atlantic is used for the experiments. Ablation experiments are carried out to test the influence of each module on the performance of the ocean eddies detection model. 
The effects of adding the context feature fusion module, adding the attention mechanism module, and adding both modules at the same time are compared, and the detection effect after adding the data augmentation method is analyzed. In the ablation experiment, due to the introduction of the contextual feature fusion module and the residual attention mechanism, the model can fuse the contextual features of the ocean eddies in different feature layers, and the network can extract additional low-level spatial details of the ocean eddies. Each module improves the detection performance of the model, and the optimal detection accuracy of the model after using the data augmentation method reaches 93.27%. Compared with other deep learning models, the proposed model has a detection accuracy of up to 93.24%, and its detected number of ocean eddies is closer to the truth, thereby verifying its excellent performance in small target detection. Meanwhile, compared with the fully convolutional network (FCN) model, the proposed model can detect more small-scale ocean eddy targets, and the detected ocean eddy target contour is closer to the truth, thereby verifying the positive effect of progressive sampling on small target detection.ConclusionThe proposed model significantly outperforms the other deep learning models in detecting ocean eddies. Compared with the state-of-the-art, the proposed model achieves a higher small target detection accuracy, and the detected contour of ocean eddies is closer to the truth.  
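      The context feature fusion step can be illustrated with the following sketch (PyTorch assumed): the coarser context map is deconvolved to the target resolution, its channel count is kept at half of the target's, and both maps are L2-normalized before a ReLU-activated fusion. The channel numbers and the concatenation-plus-1 × 1-convolution fusion operator are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextFeatureFusion(nn.Module):
    """Fuse a fine-resolution target feature with one coarser context feature."""
    def __init__(self, target_ch: int, context_ch: int):
        super().__init__()
        # deconvolve the context map to the target size with half the target's channels
        self.up = nn.ConvTranspose2d(context_ch, target_ch // 2, kernel_size=2, stride=2)
        self.merge = nn.Conv2d(target_ch + target_ch // 2, target_ch, kernel_size=1)

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        ctx = self.up(context)                  # match the target's spatial size
        target = F.normalize(target, dim=1)     # L2 norm keeps magnitudes comparable
        ctx = F.normalize(ctx, dim=1)
        return F.relu(self.merge(torch.cat([target, ctx], dim=1)))

target = torch.randn(1, 64, 128, 128)    # lowest-level (finest) encoder feature
context = torch.randn(1, 128, 64, 64)    # next, coarser encoder feature
print(ContextFeatureFusion(64, 128)(target, context).shape)   # torch.Size([1, 64, 128, 128])
```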
      关键词:ocean eddy;small object detection;semantic segmentation;attention mechanisms;feature fusion   

      Image Understanding and Computer Vision

    • Jin Shuai,Li Xuanpeng,Yang Feng,Zhang Weigong
      Vol. 28, Issue 11, Pages: 3520-3535(2023) DOI: 10.11834/jig.220986
      3D object detection in road scenes by pseudo-LiDAR point cloud augmentation
      摘要:ObjectiveLight detection and ranging(LiDAR) is one of the most commonly used sensors in autonomous driving that has strong structure sensing ability and whose point cloud provides accurate object distance information. However, LiDAR has a limited number of laser lines. As the distance from the object to the LiDAR increases, the object feedback point cloud area becomes sparse, and the effective information is greatly reduced, thereby reducing the detection accuracy of distant small objects. At the same time, due to the complex and changeable road environment, vehicles cannot rely on a single sensor, hence necessitating the use of multi-source data fusion to improve their perception capabilities. This paper proposes a pseudo-LiDAR point cloud augmentation technology that fuses an image and point cloud to improve its 3D object detection performance in road scenes.MethodFirst, a stereo image is used as the input of the depth estimation network to predict the depth image. The LiDAR point cloud is mapped to the plane to obtain the point cloud depth map, which is sent to the depth correction module together with the depth map of the image. Afterward, the depth correction module builds a directed k-nearest neighbor graph among the pseudo-LiDAR point clouds, finds the part of the pseudo-LiDAR point cloud that is closest to the LiDAR point cloud, uses the precise depth of the LiDAR point cloud to correct the depth of this part of the pseudo-LiDAR point cloud, and retains the shape and structure of the original pseudo-LiDAR point cloud to generate a corrected depth image. Second, the semantic segmentation network is applied to the image to obtain the foreground area of the vehicle. The semantic segmentation map and corrected depth map are simultaneously processed by the foreground segmentation module. The depth map corresponding to the foreground area is then mapped into the 3D space to generate the pseudo-LiDAR point cloud. Only the foreground points of the vehicle are retained. The point cloud at this time is called the foreground pseudo-LiDAR point cloud. Finally, the foreground pseudo-LiDAR point cloud performs 16-, 32-, and 64-line down-sampling in intervals of 0~20 m, 20~40 m, and 40~80 m, respectively, and fuses with the original point cloud to form a fusion point cloud. The fused point cloud has more foreground points than the original point cloud. For distant objects, the number of point clouds is greatly increased, thus improving the sparsity of small object point clouds.ResultIn this paper, the depth estimation network adopts a pre-trained model of pyramid stereo matching network based on the SceneFlow dataset (a large dataset for training convolutional networks for disparity, optical flow, and scene flow estimation), and the semantic segmentation network adopts a pre-trained model of high-resolution representations for labeling pixels and regions (HRNet) based on the Cityscapes dataset. Five latest object detection networks, including sparsely embedded convolutional detection (SECOND), are used as benchmark models for training and testing on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago dataset. Experimental results show that the proposed method can improve the small object detection accuracy of several network frameworks, with the largest improvement of each metric for the SECOND algorithm, multi-modal voxelnet for 3D object detection (MVX-Net) algorithm, and Voxel-RCNN algorithm. 
The average precision under the intersection over union of 0.7 is used to evaluate and compare the experimental results. Under the hard difficulty condition, the 3D detection accuracies for SECOND, MVX-Net, and Voxel-RCNN improve by 8.65%, 7.32%, and 6.29%, respectively, and the maximum improvement in the bird’s-eye view detection accuracy is 7.05%. Meanwhile, most of the other object detection networks obtain better 3D detection accuracy than the original method under the easy and moderate difficulty conditions, and all networks obtain a better bird’s-eye view detection accuracy than the original method under the easy, moderate, and hard difficulty conditions. Ablation experiments are also conducted using the network SECOND as the benchmark model, and experimental results show that the depth correction module, foreground segmentation module, and sampling module designed in this paper all contribute to the improvement of the results, among which the sampling module improves the results the most. Specifically, the introduction of this module improves the 3D detection accuracy in the easy, moderate, and hard difficulty conditions by about 2.70%, 3.69%, and 10.63%, respectively.ConclusionThis paper proposes a pseudo-LiDAR point cloud augmentation technique that uses the accurate depth information of the point cloud to correct the image depth map and uses the dense pixel information to compensate for the sparsity of the point cloud. This method effectively addresses the poor detection accuracy of small objects caused by the sparsity of the point cloud. Using the semantic segmentation module greatly increases the proportion of the number of foreground points. The sampling module is also adopted to compensate for those pseudo-point clouds with different line numbers according to the observation distance, thus greatly reducing the number of pseudo-point clouds. This method is applicable to all object detection networks with point cloud as input and significantly improves the 3D object detection and bird's-eye view detection performance of multiple object detection networks in road scenes. This method is proven effective and general, hence presenting a new idea for multi-modal fusion 3D object detection.  
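      The distance-banded resampling of foreground pseudo-LiDAR points might look like the sketch below (NumPy assumed). The keep ratios stand in for the 16-, 32-, and 64-line resampling over the 0~20 m, 20~40 m, and 40~80 m bands and are illustrative only.

```python
import numpy as np

def distance_banded_downsample(points: np.ndarray, rng=None) -> np.ndarray:
    """Keep more pseudo-LiDAR points for distant objects.

    points: (N, 3) foreground pseudo-LiDAR points in the LiDAR frame.
    Nearby regions are already well covered by the real point cloud, so they
    are thinned aggressively; far regions keep a larger fraction.
    """
    rng = np.random.default_rng() if rng is None else rng
    dist = np.linalg.norm(points[:, :2], axis=1)      # range in the ground plane
    keep_ratio = np.select(
        [dist < 20.0, dist < 40.0],                   # 0~20 m and 20~40 m bands
        [16 / 64, 32 / 64],                           # thin the near bands the most
        default=1.0,                                  # 40~80 m: keep everything
    )
    keep = rng.random(len(points)) < keep_ratio
    return points[keep]

pts = np.random.uniform(low=[0, -40, -2], high=[80, 40, 2], size=(10000, 3))
print(distance_banded_downsample(pts).shape)
```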
      关键词:pseudo-LiDAR(point cloud);depth estimation;semantic segmentation;fusion algorithm;3D object detection   
    • Sun Meiting,Dai Longquan,Tang Jinhui
      Vol. 28, Issue 11, Pages: 3536-3549(2023) DOI: 10.11834/jig.211237
      Transformer-based multi-style information transfer in image processing
      摘要:Objective: Style transfer renders a content image in the visual style of a reference image: the stylized output preserves the structure of the original content while sharing the stylistic characteristics of the style image, and it has become one of the essential branches of image processing. Conventional style transfer methods rely on low-level image information for style rendering. More recently, deep convolutional neural networks (CNNs) have been adopted in this domain: a CNN extracts content features and style features, which are then balanced and re-integrated. However, because the receptive field of convolutional layers is limited, CNNs capture mainly local associations. The Transformer network, originally proposed in natural language processing (NLP), can capture long-distance dependencies and is therefore suitable for modeling global information, but learning correlations between all input elements incurs a high computational cost, and the lack of an image prior slows its convergence. Given the similarity between image style transfer and sentence translation, we develop a hybrid network that combines CNNs with the Transformer architecture. Method: The proposed network consists of four parts: an encoding network, a style transformation network, a decoding network, and a discriminative network. In the encoding network, convolutional layers extract high-level image features while reducing image size. The extracted features are sensitive to semantic information, such as the specific objects and content in the image, whereas low-level information such as lines and textures better reflects stylistic characteristics; residual connections are therefore added to the encoding network to enrich the feature representation. The style transformation network is built on a Transformer structure and consists of three subparts: a content encoder, a style encoder, and a decoder. The content and style encoders add global information to the content and style features, respectively, and the decoder refines the original content features with a weighted sum of the style features to generate stylized features. The decoding network, which is symmetric to the encoding network, uses interpolation-based up-sampling to restore the stylized features to the original image size and generate the final stylized image. The discriminative network distinguishes the generated images from natural style images. Result: Qualitative and quantitative comparisons are carried out against several flexible style transfer methods. The qualitative analysis consists of two parts, a visual comparison and a user study: the visual comparison shows that the proposed network generates smoother and clearer stylized images, and the user study shows that its results are preferred by users. In the quantitative comparison, the stylization speed of the proposed network is within an acceptable range. Additionally, the speed remains stable as the image size grows from 256 pixels to 512 pixels. An ablation experiment verifies the effectiveness of the discriminative network; its results show that introducing the discriminative network improves the feature extraction ability of the network and yields more realistic images. To demonstrate its flexibility, the trained network is also applied to other related style transfer tasks, including content-style tradeoff, style interpolation, and region painting. Conclusion: A hybrid network that combines the complementary advantages of CNNs and the Transformer is presented. Experimental results show that the proposed network improves transfer speed, and the stylized images preserve the content structure and stylistic features well, particularly for smaller image sizes (e.g., 256 pixels).  
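      The core of the style transformation network, in which content tokens attend to style tokens, can be sketched as a standard cross-attention block (PyTorch assumed). The dimensions, head count, normalization, and feed-forward layer follow common Transformer practice and are not taken from the paper.

```python
import torch
import torch.nn as nn

class StyleCrossAttention(nn.Module):
    """Content tokens query style tokens; attended style information is added back."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # queries come from the content tokens, keys/values from the style tokens
        attended, _ = self.attn(content, style, style)
        x = self.norm1(content + attended)       # content enriched with attended style
        return self.norm2(x + self.ffn(x))

content = torch.randn(1, 32 * 32, 256)   # flattened content feature map
style = torch.randn(1, 32 * 32, 256)     # flattened style feature map
print(StyleCrossAttention()(content, style).shape)   # torch.Size([1, 1024, 256])
```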
      关键词:computer vision;image processing;multi-style image information transfer;attention mechanism;Transformer   
    • Li Shaofan,Gao Shangbing,Zhang Yingying
      Vol. 28, Issue 11, Pages: 3550-3561(2023) DOI: 10.11834/jig.220835
      Pose-guided instance-aware learning for driver distraction recognition
      摘要:ObjectiveDistracted driving is the main cause of traffic accidents. Data from the Department of Transportation show that about 2 million traffic accidents occur every year, of which over 80% are caused by distracted driving. In recent years, the advanced driver assistance system (ADAS) has been employed to collect data and to detect and recognize static and dynamic objects inside and outside vehicles. Driving behavior detection is the key technology in ADAS that effectively reminds drivers to avoid traffic accidents. Therefore, driver distraction recognition has broad research prospects in the fields of computer vision and autonomous driving. With the rapid development of deep learning and computer vision, many researchers have explored distracted driving detection in various ways. In recent years, deep learning has been widely used in detecting driver distraction. Compared with the traditional algorithms, deep learning methods shows great improvements in their performance and accuracy. Image-based driver distraction recognition can be considered a secondary image sub-classification problem. Unlike traditional image classification, the differences in driver distraction recognition tasks are very small, which means that a small area in an image determines the action class of that image. To solve this problem, this paper proposes a pose-guided instance-aware network for driver behavior recognition.MethodFirst, the human body is detected by the you only look once version 5 (YOLOv5) object detector, and then the recognizable hand-related area is obtained by using the human body pose estimation high-resolution network (HRNet). The features of the human body and the hand area are then used as instance-level features, and an instance-aware module is designed to fully obtain the contextual semantic information at different levels. Second, a dual-channel interaction module is constructed using the hand-related features to characterize key spatial information and to optimize visual features. In this way, a multi-branch deep neural network is formed, and the scores of different branches are fused. ResNet 50 is used as a backbone network, and the backbone convolutional networks are initialized with the pre-trained ImageNet weights. The input size of the image is scaled to 224 × 224. Network training uses the cross-entropy loss function to update the weight of the network model. The initial learning rate is set to 1E-2, and the batch size of the algorithm training is set to 64. The stochastic gradient descent (SGD) optimizer with a momentum of 0.99 is applied to the cross-entropy loss function. For the SGD, a multi-step learning rate with an initial value of 0.01 is reduced by a weight decay of 0.1 after 20 epochs. The model is trained using NVIDIA Tesla V100 (16 GB) in Centos 8.0. The implementation is based on Python 3.8 and the PyTorch 1.8 toolbox.ResultExperimental results show that the proposed method has test accuracies of 96.17% and 96.97% on the American University in Cairo (AUC) and self-built large vehicle datasets, respectively. Compared with the model without instance-aware module and channel interaction, the accuracy of the proposed method is significantly improved, particularly in complex environments. Several experiments are also performed to analyze the effectiveness of the components of this method on two datasets. The highest accuracy is reported when combining the human, hand, and spatial branches. 
The accuracy has improved by 7.5% and more than 3% on the self-built large-scale vehicle driver dataset and the public AUC dataset, respectively. Results of ablation study also confirm the effectiveness of the proposed component in improving recognition accuracy.ConclusionThis study proposes a pose-guided instance-aware network for driver distraction recognition. Combined with object detection and human pose estimation, the human body and hand regions are treated as instance-level features, and an instance-aware module is established. A dual-channel interaction module is then constructed using the hand-related regions to characterize key spatial information. Experimental results show that the proposed method outperforms the other models in terms of accuracy on both self-built complex environment datasets and public datasets. Compared with the traditional RGB-based model, the pose-guided method shows significant improvements in its performance in complex environments and effectively reduces the impact of complex backgrounds, different viewpoints, illuminations, and changes in human body scales. This method also reduces the interference of complex environments, shows high accuracy, assists drivers in driving safely, and reduces the occurrence of traffic accidents.  
      关键词:distraction detection;pose estimation;object detection;instance level feature;multi-stream network   
    • Wang Feng,Shi Fangyu,Zhao Jia,Zhang Xuesong,Wang Xuefeng
      Vol. 28, Issue 11, Pages: 3562-3574(2023) DOI: 10.11834/jig.211137
      Answer mask-fused visual question answering model
      摘要:Objective: Visual question answering (VQA) has become an essential task in artificial intelligence in recent years, as it lies at the intersection of natural language processing and computer vision. A VQA model must process text and image information simultaneously and fuse the two modalities to infer the answer. Popular VQA models, trained with deep neural networks on datasets such as VQA v2.0, tend to exploit language priors: they learn superficial correlations between questions and answers and answer questions from these statistics alone. Because the answer distribution is uneven, such models generalize poorly and perform badly on the VQA-CP v2.0 dataset. In particular, language priors lead to prediction errors in which the predicted answer is irrelevant to the question. To alleviate this irrelevance and improve generalization, we develop an answer-mask method that covers irrelevant answers in the prediction results, forcing the model to learn a deeper relationship between the question and the answer and improving its prediction accuracy. Method: The prediction results of a baseline model are masked with an answer mask. All candidate answers are first clustered so that each answer type contains relatively few answers, allowing the mask to cover most irrelevant answers in the prediction results while preserving accurate classification. Because the answers consist of non-contextual words and phrases, conventional Word2Vec and GloVe embeddings are not well suited to encoding them; CLIP is therefore used as the encoder to extract answer features, and the k-means algorithm clusters the extracted feature vectors. After clustering, the original dataset is modified so that each answer is relabeled with its cluster-level answer type, and a different answer-mask vector is generated for each answer type. The answer-mask vector consists of 0s and 1s: positions corresponding to answers contained in the given type are set to 1 and all others to 0, which eliminates the influence of irrelevant answers on the final prediction of the baseline model. We also design an answer-type recognition model, pre-trained on questions and their answer types, which predicts the answer type corresponding to an input question. Its accuracy reflects the quality of the clustering, and its predictions determine which answer mask is selected. The baseline model encodes the image and text, fuses the image and text features with a deep neural network, and obtains preliminary predictions through a classifier. The corresponding answer-mask vector is first selected according to the prediction of the answer-type recognition model; the preliminary predictions of the baseline model are then multiplied by this mask so that the distribution over irrelevant answers is suppressed, and the final predictions are produced. In this way, the model is trained to learn the correspondence between question types and answer types. Result: UpDn, RUBi, LMH, and CSS are selected as baseline models, and experiments are carried out on three large public datasets. On VQA-CP v2.0, the proposed method improves the accuracy of the UpDn, LMH, and CSS models by 2.15%, 2.29%, and 2.02%, respectively; the CSS-based model reaches the highest accuracy of 60.14%. Moreover, the accuracy of our model is preserved on VQA v2.0, where debiasing methods typically suffer a drop. The experimental results on VQA v2.0 show that the accuracy of most baseline models is further improved, with the CSS model improving by 3.18%. To demonstrate generalization, comparative experiments are also carried out on VQA-CP v1.0; the results show that the proposed method benefits most baseline models, which reflects its generalization ability. Furthermore, an ablation experiment on VQA-CP v2.0 shows that the answer mask itself improves accuracy. Conclusion: We develop an answer-mask method that covers irrelevant answers in the model's predictions and thereby alleviates their influence on the final result. The model is guided to learn the correspondence between the question and the answer type, which mitigates, to a certain extent, the problem of predicting answers that are irrelevant to the question and improves both the generalization and the accuracy of the model.  
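      Applying the answer mask to a baseline classifier reduces to an element-wise multiplication, as in the sketch below (PyTorch assumed). The answer-vocabulary size, the number of answer clusters, and the function name are illustrative assumptions.

```python
import torch

def apply_answer_mask(logits: torch.Tensor, type_probs: torch.Tensor,
                      type_masks: torch.Tensor) -> torch.Tensor:
    """Mask a VQA classifier's predictions with an answer-type mask.

    logits:     (B, A) baseline VQA scores over the answer vocabulary.
    type_probs: (B, T) predicted distribution over answer types.
    type_masks: (T, A) 0/1 matrix, 1 where an answer belongs to a type.
    """
    predicted_type = type_probs.argmax(dim=1)      # answer type chosen per question
    mask = type_masks[predicted_type]              # (B, A) 0/1 mask for each question
    return torch.sigmoid(logits) * mask            # irrelevant answers are suppressed

logits = torch.randn(4, 3129)                            # e.g., a 3 129-way answer space
type_probs = torch.softmax(torch.randn(4, 50), dim=1)    # e.g., 50 answer clusters
type_masks = (torch.rand(50, 3129) < 0.02).float()
print(apply_answer_mask(logits, type_probs, type_masks).shape)   # torch.Size([4, 3129])
```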
      关键词:visual question answering (VQA);language priors;answer clustering;answer mask;answer type recognition   

      Medical Image Processing

    • Hua Yong,Li Zhenzhen,Pan Jianhong,Yang Xuan
      Vol. 28, Issue 11, Pages: 3575-3589(2023) DOI: 10.11834/jig.221025
      Boundary-preserving multi-scale glomerulus segmentation for full-stained kidney slice
      摘要:ObjectiveMedical image segmentation is a key issue in determining whether medical images can provide reliable information in treatment and clinical diagnosis. The accurate segmentation of glomeruli plays a key role in diagnosing and quantitatively analyzing diseases in renal pathology. Traditional methods used in glomerular image segmentation include traditional pattern recognition and machine learning-based recognition. However, these methods required hand-crafted features. Segmentation methods based on convolutional neural networks (CNNs) have shown strong generalization performance with features learned by networks. Early diagnosis is conducive to treating kidney disease. However, a full-stained kidney slice suffers from significant variations in the scale, shape, and texture of objects. Moreover, the high image resolution brings challenges to prediction efficiency. Therefore, CNN-based glomerulus segmentation plays an important role in clinical applications.MethodThis paper proposes a method for glomerular segmentation in full-stained kidney slices. A multi-granularity spatial attention mechanism is designed to deal with the diverse appearances of the glomerulus. This mechanism generates multiple scales and shape-changing feature maps for each pixel to focus on its context area instead of a fixed rectangular area as in traditional networks. For glomerulus with different sizes, the features of these feature maps should be fused at different scales, and the spatial information of features should be extracted by networks. Multi-granularities context feature maps are generated to pay attention to small objects using the context-based spatial attention mechanism, which can control the receptive field to obtain multi-granularities information and reduce background interference. The problem of high image resolution is addressed by cutting the original image into image patches. To detect the glomerulus located on the edge of two image patches, a padding strategy is formulated based on an augmented path. The disadvantages of the zero-padding strategy in standard convolution operation are then analyzed, and the contribution shifting effect is highlighted. The proposed padding strategy ensures that the boundary information of image patches is transferred to high levels of the network without information loss. Furthermore, given the very large resolution of the complete stained kidney slice, a window should be sliced along the image to predict objects. However, small objects are sensitive to the positions, thereby leading to different predictive probabilities. Sliding a window in an image also involves high computation complexity. To address these problems, this paper proposes a sliding window strategy that uses probability accumulation to fill those objects that are missing in stitching image patches. This strategy has high computation efficiency and can improve the detection accuracy of small objects in full-stained kidney slices.ResultThe proposed method achieves a higher segmentation accuracy on the mouse kidney cell and human biomolecular atlas program(HuBMAP) human kidney datasets compared with state-of-art methods. Specifically, the segmentation accuracy increases by 1% in Dice compared with U-Net after using a multi-granularities context-based spatial attention mechanism, and the number of missing and false objects is also reduced. 
The padding strategy based on an augmented path improves predictive accuracy with only a few additional FLOPs (floating-point operations). The probabilistic cumulative strategy is also compared with the non-probabilistic cumulative sliding window strategy. Results show that the probabilistic cumulative sliding window strategy saves 52.83% of the time in the first layer of the network and 49.98% in the second layer compared with the non-probabilistic sliding window strategy. Overall, the proposed method increases the prediction speed by about 50%. Conclusion: The probabilistic accumulation sliding window strategy improves prediction efficiency and glomerulus segmentation accuracy compared with state-of-the-art methods. The proposed multi-granularity context spatial attention mechanism fuses information at multiple scales through multi-granularity receptive fields to enhance the relevant features and suppress the irrelevant features of the glomerulus. The proposed padding strategy based on an augmented path can deal with the information attenuation and contribution-shifting issues of traditional zero padding and effectively preserves information when objects are located on the boundary of patches. Combining multi-grained context features with the proposed padding strategy also improves object segmentation in fully stained kidney images. During network inference, the proposed sliding window with probability accumulation reuses features to significantly increase prediction efficiency; it is also beneficial for detecting small objects that are sensitive to position. Experimental results on different datasets show that the proposed method outperforms state-of-the-art methods and is both stable and robust. Meanwhile, the sliding window with probability accumulation improves segmentation accuracy and greatly reduces the calculation time. The role of local window size learning in the multi-granularity spatial attention mechanism will be explored in future work. In addition, given that some objects in a glomerulus slice are too small to predict, additional training samples will be generated with a generative adversarial network to further improve prediction accuracy.  
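      The probability-accumulation sliding window can be sketched as follows (NumPy assumed): overlapping windows add their predicted probabilities, and the sum is normalized by the per-pixel coverage count. The window size, stride, and 0.5 threshold are illustrative, and the feature-reuse optimization mentioned above is not modeled here.

```python
import numpy as np

def sliding_window_probability(slide: np.ndarray, predict, win=512, stride=256):
    """Stitch patch predictions over a large slide with probability accumulation.

    slide:   (H, W) image, H and W assumed to be multiples of the stride.
    predict: callable mapping a (win, win) patch to a (win, win) probability map.
    """
    h, w = slide.shape[:2]
    acc = np.zeros((h, w), dtype=np.float32)     # summed probabilities
    cover = np.zeros((h, w), dtype=np.float32)   # how many windows covered each pixel
    for y in range(0, max(h - win, 0) + 1, stride):
        for x in range(0, max(w - win, 0) + 1, stride):
            acc[y:y + win, x:x + win] += predict(slide[y:y + win, x:x + win])
            cover[y:y + win, x:x + win] += 1.0
    prob = acc / np.maximum(cover, 1.0)          # average probability per pixel
    return prob > 0.5                            # final glomerulus mask

slide = np.random.rand(1024, 1024)
dummy_predict = lambda patch: np.full(patch.shape, 0.6, dtype=np.float32)
print(sliding_window_probability(slide, dummy_predict).sum())
```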
      关键词:convolution neural network (CNN);medical image segmentation;glomerular image;multi-scale contextual feature;padding   
    • Yang Heng,Gu Chenliang,Hu Houmin,Zhang Jing,Li Kang,He Ling
      Vol. 28, Issue 11, Pages: 3590-3601(2023) DOI: 10.11834/jig.220933
      Cephalometric landmark keypoints localization based on convolution-enhanced Transformer
      摘要:ObjectiveAccurate and reliable cephalometric image measurement and analysis, which usually depend on the correlation among anatomical landmark points, play essential roles in orthodontic diagnosis, preoperative planning, and treatment evaluation. However, manual annotation hinders the speed and accuracy of measurement to a certain extent. Therefore, an automatic cephalometric landmark detection algorithm for daily diagnosis needs to be developed. However, the size of anatomical landmarks accounts for a small proportion of an image, and the structures at different positions may share similar radians, shapes, and surrounding soft tissue information that are difficult to distinguish. The current methods based on convolutional neural networks (CNNs) extract depth features by applying down-sampling to facilitate the building of a global connection, but these methods may suffer from spatial information loss and inefficient context modeling, hence preventing them from meeting accuracy requirements in clinical applications. Transformer has advantages in long-term dependency modeling but is not good at capturing local features, hence explaining the insufficient accuracy of models based on pure Transformer for key point localization. Therefore, an end-to-end model with global context modeling and better local spatial feature representation must be built to solve these problems.MethodTo detect the anatomical landmarks efficiently and effectively, a U-shaped architecture based on convolution-enhanced Transformer called CETransNet is proposed in this manuscript to locate the key points of lateral cephalometric images. The overwhelming success of UNet lies in its ability to analyze the local fine-grained nature of an image at the deep level, but this method suffers from global spatial information loss. By improving and introducing the Transformer module into the U-shaped structure, the ability of convolutional networks to obtain local information is retained while establishing global context connection. In addition, to efficiently regress and predict the heatmaps, an exponential weighted loss function is proposed so that the loss value near the landmark pixels can receive more attention in the supervised learning process and the loss of distant pixels can be suppressed. Each image is rescaled to 768 × 768 pixels and maintains a fixed aspect ratio corresponding to its original ratio via a zero padding operation, and data augmentation is performed via random rotation, Gaussian noise addition, and elastic transformation. During the training phase, experiments are conducted on a server using Tesla V100 SXM3-32 GB GPUs. The model is optimized by an Adam optimizer with a batch size of 2, and the initial learning rate is set to 0.000 1 and decreased by 0.75 times every 5 epochs.ResultTo demonstrate its strengths, CETransNet is compared with the most advanced methods, and ablation studies are performed to confirm the contribution of each component. Experiments were performed on a public X-ray cephalometric dataset. Quantitative results show that CETransNet obtains mean radial error (MRE) values of 1.09 mm and 1.43 mm in the two test datasets, respectively, and the accuracies within a clinically accepted 2 mm error are 87.16% and 76.08%. A total of 9 key points in Test1 achieve a 100% successful detection rate (SDR) value, and in the clinically allowable 2.0 mm region, the detection accuracy reaches 90% with up to 12 landmarks. 
In Test2, although only 9 points satisfy the SDR accuracy of 90%, 10 points within 4 mm are completely detected. Compared with the best competing method, CETransNet improves the MRE by 2.7% and 2.1% on the two datasets, respectively. CETransNet also outperforms other popular vision Transformer methods on the benchmark Test1 dataset and achieves a 2.16% SDR improvement within 2 mm compared with the sub-optimal model. Meanwhile, the analysis of the influence of the backbone network on the model performance reveals that ResNet-101 reaches the minimal MRE, while ResNet-152 obtains the best SDR within 2 mm. Results of ablation studies show that the convolution-enhanced Transformer can decrease MRE by 0.3 mm and improve SDR in 2.0 mm by 7.36%. Meanwhile, the proposed EWSmoothL1 further reduces the MRE to 1.09 mm. Benefitting from these components, CETransNet can detect the position of anatomical landmarks quickly, accurately, and robustly.ConclusionThis paper proposes a cephalometric landmark detection framework with a U-shaped architecture that embeds the convolution-enhanced Transformer in each residual layer. By fusing the advantages of both Transformer and CNNs, the proposed framework effectively captures the long-term dependence and local natures and thus obtains the special position and structure information of key points. To address the ambiguity caused by other similar structures in an image, an exponential weighted loss function is proposed in order for the model to focus on the loss of the target area than the other parts. Experimental results show that CETransNet achieves the best MRE and SDR performance compared with advanced methods, especially in the clinically allowable 2.0 mm region. A series of ablation experiments also prove the effectiveness of the proposed modules, thereby confirming that CETransNet shows a competent performance in anatomical landmark detection and possesses great potential to solve the problems in cephalometric analysis and treatment planning. In future work, other lightweight models with better robustness will be designed.  
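      The exponentially weighted loss can be sketched as a weighted Smooth-L1 over heatmaps (PyTorch assumed): pixels close to a landmark, where the target heatmap is large, receive exponentially larger weights. The weighting constant and the exact form of EWSmoothL1 in the paper may differ.

```python
import torch
import torch.nn.functional as F

def exp_weighted_smooth_l1(pred: torch.Tensor, target: torch.Tensor,
                           alpha: float = 10.0) -> torch.Tensor:
    """Exponentially weighted Smooth-L1 loss for heatmap regression.

    pred, target: (B, K, H, W) predicted and ground-truth heatmaps in [0, 1].
    Pixels near the landmark (large target values) dominate the loss, while
    far-away background pixels are suppressed.
    """
    weight = torch.exp(alpha * target)                         # emphasis grows toward the landmark
    loss = F.smooth_l1_loss(pred, target, reduction="none")    # element-wise Smooth-L1
    return (weight * loss).sum() / weight.sum()

pred = torch.rand(2, 19, 192, 192)     # 19 cephalometric landmarks (dataset convention)
target = torch.rand(2, 19, 192, 192)
print(exp_weighted_smooth_l1(pred, target).item())
```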
      关键词:cephalometric measurement;landmark keypoints localization;vision Transformer;attention mechanism;heatmap regression;convolutional neural network (CNN)   
    • Yin Jing,Liu Zhe,Song Yuqing,Qiu Chengjian
      Vol. 28, Issue 11, Pages: 3602-3617(2023) DOI: 10.11834/jig.220973
      Pancreas segmentation based on 3D path aggregation high-resolution network
      摘要:Objective: Accurate pancreas segmentation is an important prerequisite for the detection, identification, and analysis of pancreatic cancer. However, due to the small proportion of the pancreas in the input CT volume and the large variations in its position and shape, accurate pancreas segmentation has always been a challenging task. Most existing mainstream deep learning pancreas segmentation networks are based on the encoder-decoder structure, which first reduces the resolution of the input image through continuous down-sampling in the encoder to capture strong semantics over a large receptive field and identify the complete pancreas, and then gradually restores the lowest-resolution encoder features to obtain the predicted segmentation results. However, the continuous down-sampling in the encoder leads to the loss of location and detail information. Method: To alleviate this problem, this paper proposes a 3D path aggregation high-resolution network (3DPAHRNet) for pancreas segmentation. First, to capture additional 3D feature context information, the 2D convolution operations in the high-resolution network are extended to 3D convolution operations. Second, this paper proposes a full-resolution path aggregation module that uses five consecutive nonlinear transformations to reduce the semantic difference between the full-resolution input and the output of the segmentation head network, while reducing the impact on the segmentation results of the location and detail information lost through the continuous down-sampling of the stem network. Finally, this paper proposes a multi-scale feature path aggregation module that uses a progressive feature channel compression and fusion strategy so that the multi-scale features output by the high-resolution network can adaptively adjust the features in the network, avoiding the information loss caused by the excessive compression of multi-scale low-resolution feature channels. Result: To verify the effectiveness of the proposed method, extensive experiments are conducted on a public pancreas dataset. First, the segmentation results are compared with those of mainstream pancreas segmentation networks, including 3D U-Net, AttentionUNet, VNet, and 3D HRNet. Compared with the state-of-the-art segmentation results, the proposed method improves the Dice similarity coefficient, Jaccard index, precision, and recall by 1.41%, 2.09%, 2.35%, and 0.49%, respectively. Second, the effectiveness of the proposed modules is verified through three ablation studies. Experimental results show that reducing the number of down-sampling operations in the stem subnetwork of 3D HRNet and adding either the full-resolution or the multi-scale feature path aggregation module each significantly improve the average segmentation accuracy. Finally, the proposed method is compared with representative pancreas segmentation methods, and the comparison shows that it improves the state-of-the-art segmentation accuracy by 1.1%. Conclusion: This paper proposes 3DPAHRNet for pancreas segmentation. Unlike the use of HRNet on natural images, the proposed method not only keeps high-resolution features in the network but also enables the network to retain additional location and detail features of the full-resolution input, thus significantly improving the performance of existing pancreas segmentation networks. The open-source code is available at https://github.com/qiuchengjian/PAHRNet3D.  
关键词:pancreas segmentation;convolutional networks;3D path aggregation high-resolution network;full-resolution feature;multi-scale feature
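To make the full-resolution path aggregation idea in the Method section above concrete, the following is a minimal PyTorch sketch under assumed settings: five consecutive nonlinear transformations (3D convolution, normalization, activation) refine the full-resolution input, and the result is fused with the up-sampled segmentation-head output so that location and detail lost in the stem down-sampling can be reintroduced. Class names, channel sizes, and the normalization choice are illustrative assumptions rather than the authors' exact configuration; their implementation is available at the repository linked above.

```python
# Minimal sketch of a full-resolution path aggregation module (hypothetical
# layout, not the 3DPAHRNet reference implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullResolutionPathAggregation(nn.Module):
    def __init__(self, in_channels: int, head_channels: int, mid_channels: int = 16):
        super().__init__()
        # Five consecutive nonlinear transformations on the full-resolution path.
        layers, c_in = [], in_channels
        for _ in range(5):
            layers += [
                nn.Conv3d(c_in, mid_channels, kernel_size=3, padding=1, bias=False),
                nn.InstanceNorm3d(mid_channels),
                nn.ReLU(inplace=True),
            ]
            c_in = mid_channels
        self.full_res_path = nn.Sequential(*layers)
        # Fuse the refined full-resolution features with the segmentation-head output.
        self.fuse = nn.Conv3d(mid_channels + head_channels, head_channels, kernel_size=1)

    def forward(self, full_res_input: torch.Tensor, head_features: torch.Tensor) -> torch.Tensor:
        detail = self.full_res_path(full_res_input)
        # Up-sample the segmentation-head features back to full resolution before fusion.
        head_up = F.interpolate(head_features, size=full_res_input.shape[2:],
                                mode="trilinear", align_corners=False)
        return self.fuse(torch.cat([detail, head_up], dim=1))


# Usage sketch: a single-channel CT sub-volume and 32-channel head features at 1/4 resolution.
x = torch.randn(1, 1, 64, 96, 96)
head = torch.randn(1, 32, 16, 24, 24)
out = FullResolutionPathAggregation(in_channels=1, head_channels=32)(x, head)
print(out.shape)  # torch.Size([1, 32, 64, 96, 96])
```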
Published: 2023-12-07
    • Lu Ling,Qi Weimin
      Vol. 28, Issue 11, Pages: 3618-3628(2023) DOI: 10.11834/jig.221084
      Spine CT image segmentation based on Transformer
摘要:Objective: The incidence of spine diseases has increased in recent years and is increasingly affecting younger patients, so their diagnosis and treatment are particularly critical. With 3D reconstruction technology and computer-aided diagnosis, segmenting the spine region from the background in spine computed tomography (CT) images can help physicians clearly observe lesion areas and provides theoretical support for surgical path simulation and surgical planning. The accuracy of spine CT image segmentation is critical for restoring the actual position and physiological shape of the patient's vertebrae as faithfully as possible, thus allowing physicians to understand the distribution of lesions. However, spine segmentation is made difficult by the complex structure of the spine and by the poorly displayed tissue structure, poor contrast, and noise interference in spine CT images. Manual annotation of spine images relies on physicians' prior knowledge and clinical experience; it is time consuming, its results are highly subjective, and long working hours may lead to deviations that affect the patient's diagnosis. Traditional computer-assisted segmentation methods mainly rely on low-level image features such as texture, shape, and color, often achieve only semi-automatic segmentation, do not fully exploit the image information, and yield segmentation accuracy too low to meet the demand for real-time segmentation. Segmentation methods based on deep learning can realize automatic segmentation, effectively extract image features, and improve segmentation accuracy. In computer vision, medical image segmentation algorithms based on convolutional neural networks (CNNs) have been proposed one after another and have become a mainstream research direction in medical image analysis. Among them, the characteristics of the U-Net structure and the relatively fixed, multi-modal structure of medical images give U-Net strong performance in medical image segmentation and make it a benchmark for the task. However, the inherent limitations of the convolutional structure lead to problems such as limited long-distance interaction. By contrast, Transformer, a non-CNN architecture, integrates a global self-attention mechanism to capture long-range feature dependencies and is widely used in natural language processing tasks such as machine translation and text classification. In recent years, researchers have introduced Transformer into computer vision and achieved advanced results in tasks such as image classification and image segmentation. This paper therefore combines the advantages of the CNN architecture and Transformer and proposes a hybrid CNN-Transformer segmentation model, Transformer attention gate U-Net (TransAGUNet), that realizes efficient and automated segmentation of spine CT images. Method: The proposed model combines Transformer, U-Net, and the attention gate (AG) mechanism to form an encoding-decoding structure. The encoder uses a hybrid Transformer and CNN architecture consisting of a combination of the ResNet50 and ViT models. For the sliced spine CT images, low-level features are first extracted by ResNet50 and the feature maps of three down-sampled stages are retained; patch embedding and position embedding are then performed, and the resulting patches are input to the Transformer encoder to learn long-term contextual dependencies and extract global features. The decoder adopts a CNN structure that recovers the image size layer by layer through 2× 2D bilinear upsampling. The AG structure is incorporated into the bottom-up three layers of the skip connections: the attention map corresponding to the down-sampled features is obtained, concatenated with the upsampled features of the next layer, and decoded by two ordinary convolutions and one 1 × 1 convolution, after which a pixel-wise binary classifier distinguishes foreground from background to produce the spine segmentation prediction map. The AG adds few parameters, is easily integrated into CNN models, and automatically learns the shape and size of the target to highlight salient features and suppress feature responses in irrelevant regions; it replaces an explicit localization module with probability-based soft attention, eliminating the need to delineate regions of interest, and improves the sensitivity and accuracy of the model at a small computational cost. The experiments use Dice loss summed with a weighted cross-entropy loss as the loss function to address the uneven distribution of positive and negative samples. Result: The proposed algorithm is tested on the VerSe2020 dataset, and its Dice coefficient improves by 4.47%, 2.09%, 2.44%, and 2.23% over the mainstream CNN segmentation networks U-Net, Attention U-Net, U-Net++, and U-Net3+, respectively, and by 2.25% and 1.08% over the strong Transformer-CNN hybrid segmentation models TransUNet and TransNorm, respectively. To verify the validity of the proposed model, several ablation experiments are performed: compared with TransUNet, the Dice coefficient of the designed decoding structure improves by 0.75%, and by 1.5% after adding AG. To explore the effect of the number of AG connections on model performance, experiments are conducted with different numbers of AGs; the Dice coefficient obtained without AG is the smallest, and the best performance is achieved by adding AGs to the three skip connections at the 1/2, 1/4, and 1/8 resolution scales. Conclusion: Compared with the above CNN segmentation models and Transformer-CNN hybrid segmentation models, the proposed algorithm achieves the best segmentation results on spine CT images, effectively improving the segmentation accuracy of spine CT images while maintaining good real-time performance.
      关键词:spine CT image;medical image segmentation;deep learning;Transformer;attention gate (AG)   
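As a rough illustration of how the attention gates on the skip connections described above work, here is a small PyTorch sketch of the standard additive soft-attention gate: the coarser decoder features gate the encoder (skip) features before concatenation, suppressing responses in irrelevant regions at little extra cost. Channel sizes and tensor shapes below are illustrative assumptions, not the TransAGUNet configuration.

```python
# Minimal sketch of an attention gate (AG) on a skip connection, following the
# standard additive soft-attention formulation (illustrative shapes only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGate(nn.Module):
    def __init__(self, skip_channels: int, gate_channels: int, inter_channels: int):
        super().__init__()
        self.theta_x = nn.Conv2d(skip_channels, inter_channels, kernel_size=1, bias=False)
        self.phi_g = nn.Conv2d(gate_channels, inter_channels, kernel_size=1, bias=False)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Resize the gating signal to the skip resolution, add, and squash to [0, 1].
        g = F.interpolate(gate, size=skip.shape[2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(torch.relu(self.theta_x(skip) + self.phi_g(g))))
        return skip * attn  # suppress feature responses in irrelevant regions


# Usage sketch for one decoder step: gate the skip features, then concatenate them
# with the 2x bilinearly up-sampled decoder features for further decoding.
skip = torch.randn(1, 64, 128, 128)    # encoder features at 1/2 resolution
coarse = torch.randn(1, 128, 64, 64)   # coarser decoder features
gated = AttentionGate(skip_channels=64, gate_channels=128, inter_channels=32)(skip, coarse)
up = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
fused = torch.cat([gated, up], dim=1)  # fed to two ordinary convolutions and a 1 x 1 convolution
print(fused.shape)  # torch.Size([1, 192, 128, 128])
```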
Published: 2023-12-07