Latest Issue

    Vol. 28, No. 7, 2023

      Review

    • Jiang Junjun,Cheng Hao,Li Zhenyu,Liu Xianming,Wang Zhongyuan
      Vol. 28, Issue 7, Pages: 1927-1964(2023) DOI: 10.11834/jig.220130
      Deep learning based video-related super-resolution technique: a survey
      Abstract: Video super-resolution (VSR) aims to reconstruct high-resolution video frames from their low-resolution counterparts and is widely applied in satellite remote sensing, video surveillance, medical imaging, and related electronics applications. Conventional VSR methods estimate motion and blur-kernel parameters to reconstruct high-resolution frames, but they generalize poorly across heterogeneous scenes. Because they can fully exploit the spatio-temporal information of a video to recover realistic, natural textures, deep learning based VSR algorithms have developed rapidly. This survey systematically reviews the current state of deep learning based video super-resolution. First, popular YCbCr datasets such as YUV25, YUV21, and ultra video group (UVG) are introduced, together with RGB datasets such as Vid4 (video 4), REDS (realistic and dynamic scenes), and Vimeo90K. For each dataset, the name, year of publication, number of videos, number of frames, and resolution are summarized. Commonly used evaluation metrics are then described in detail, including peak signal-to-noise ratio (PSNR), structural similarity (SSIM), video quality model for variable frame delay (VQM_VFD), and learned perceptual image patch similarity (LPIPS). The difference between video super-resolution and single-image super-resolution is also clarified: the former can exploit motion information shared across video frames, whereas processing a video frame by frame with a single-image method produces a large number of artifacts in the reconstructed video. Deep learning based VSR methods face two key technical challenges, image alignment and feature integration. Because the alignment module varies greatly from method to method, existing methods are categorized into alignment and non-alignment approaches; multi-frame information is then integrated by network structures such as generative adversarial networks (GAN), recurrent neural networks (RNN), and Transformer. Alignment-based methods use motion estimation and motion compensation modules to align neighboring frames with the target frame, and they can be further divided into optical flow based, kernel based, and deformable convolution based alignment. Optical flow alignment estimates the motion between two frames from the temporal changes of pixel intensities, and the neighboring frames are then warped by a motion compensation module. According to the structure of the deep convolutional neural network (CNN), optical flow based methods are further divided into 2D convolution, RNN, GAN, and Transformer categories. For 2D convolution methods, we mainly introduce the video efficient sub-pixel convolutional network (VESPCN) and its improvements to the flow estimation and motion compensation networks, such as ToFlow and the spatial-temporal transformer network (STTN). For RNN methods with optical flow alignment, we analyze the residual recurrent convolutional network (RRCN), the recurrent back-projection network (RBPN), and other methods that align neighboring frames at the image level with optical flow, which must overcome the limitations of sliding-window processing. We then focus on BasicVSR (basic video super-resolution), IconVSR (information-refill mechanism and coupled propagation video super-resolution), and other networks that warp neighboring frames at the feature level to obtain excellent reconstruction performance. The optical flow aligned TecoGAN (temporal coherence via self-supervision for GAN-based video generation) and the VSR Transformer are introduced in detail as well. Because kernel based and deformable convolution based alignment methods are few, they are not further classified by network structure. The reconstruction performance of kernel based alignment is relatively poor because the kernel size limits the range of motion that can be estimated. Deformable convolution improves the sampling of conventional convolution but still suffers from high computational complexity and difficult convergence. Non-alignment methods instead exploit the correlation between video frames through the network structure itself; we review methods based on 3D convolution, RNN, GAN, and non-local modules. The non-alignment RNN methods, including recurrent latent space propagation (RLSP), the recurrent residual network (RRN), and omniscient video super-resolution (OVSR), demonstrate that a good balance can be achieved between reconstruction speed and visual quality. For non-local methods, we focus on improved non-local modules that reduce computational cost. All models are tested with 4× downsampling under two degradations, bicubic interpolation (BI) and blur downsampling (BD). Quantitative results and speed comparisons on multiple datasets, including REDS4, UDM10, and Vid4, are summarized. Overall, the reconstruction performance of these networks continues to improve while model parameters shrink and training and inference become faster, but the application of deep learning to video super-resolution still has room to grow, and network adaptability in particular needs further improvement. Finally, nine promising research directions are outlined: network training and optimization, ultra-high-resolution video super-resolution, compressed video super-resolution, video rescaling, self-supervised video super-resolution, video super-resolution at various scales, space-time video super-resolution, auxiliary-task-guided video super-resolution, and scenario-customized video super-resolution. (A schematic code sketch of the optical-flow warping step follows this entry.)
      Keywords: deep learning; video super-resolution (VSR); image alignment; motion estimation; motion compensation
      Published online: 2024-05-07
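The alignment step described in the survey above warps a neighboring frame toward the target frame with an estimated optical flow before reconstruction. Below is a minimal, illustrative PyTorch sketch of that warping (motion compensation) operation; the function name, tensor shapes, and the zero flow in the usage lines are assumptions for demonstration, not code from any surveyed method.

```python
# Minimal sketch of flow-based frame alignment (motion compensation) as used by
# optical-flow-aligned VSR methods. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def warp_frame(neighbor: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a neighboring frame toward the target frame with a dense optical flow.

    neighbor: (N, C, H, W) low-resolution neighboring frame
    flow:     (N, 2, H, W) per-pixel displacement (dx, dy) from target to neighbor
    """
    n, _, h, w = neighbor.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(neighbor.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                 # shifted sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)           # (N, H, W, 2)
    return F.grid_sample(neighbor, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Usage: align a neighboring frame before feeding the frame pair to the reconstruction network.
neighbor = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)    # would come from a flow-estimation sub-network
aligned = warp_frame(neighbor, flow)
```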
    • Kong Yinghui,Qin Yinfeng,Zhang Ke
      Vol. 28, Issue 7, Pages: 1965-1989(2023) DOI: 10.11834/jig.220436
      Deep learning based two-dimension human pose estimation: a critical analysis
      Abstract: In computer vision, human pose estimation locates the human skeleton in an image or video; the resulting pose information can be used to recognize a specific pose or action from the positional relationships between key regions of the human body. Action recognition and pose tracking built on human pose estimation have been developing rapidly. Conventional pose estimation pipelines consist of two stages, object detection and pose estimation. The detection stage relies on segmentation, matching, or statistical learning, which struggles to separate targets from backgrounds in complex scenes, remains sensitive to prior information, and requires time-consuming, labor-intensive construction of training sample libraries and classifiers. The pose estimation stage uses model-based or model-free methods, which suffer from errors propagated from detection and require considerable hand-crafted constraint information, so their efficiency is still limited. Deep learning has markedly improved both the precision and the speed of human pose estimation. In general, the task is divided into two-dimensional and three-dimensional pose estimation, and a strong two-dimensional model also benefits three-dimensional estimation when handling crowding and occlusion. However, most network models derive from convolutional neural networks (CNN), whose depth limits their speed, so lightweight two-dimensional pose estimation networks are increasingly important for deployment on edge devices. This paper reviews the development and optimization trends of deep learning based two-dimensional human pose estimation models, which fall into three categories: single-person pose estimation, multi-person pose estimation, and lightweight pose estimation. Single-person pose estimation, the basis of multi-person estimation, can be realized by keypoint regression or heatmap detection, and a trend toward combining the two has emerged. Multi-person pose estimation models can be divided into top-down, bottom-up, and other approaches. Top-down models are more precise but less time-efficient, especially for crowded inputs: the more human bodies the input contains, the longer the estimation takes. Bottom-up models trade a small loss in precision for much higher efficiency, and their runtime is independent of the number of human bodies in the input. The two approaches are essentially dual. A top-down method first uses a body detector to locate every person and then performs pose estimation on each of them; some top-down methods must accurately crop each single-person region and move it to the center of the input for every estimation. A bottom-up method first detects all body keypoints in the input and then assigns these keypoints to individual persons. The appearance of single-stage networks also means that researchers must pay more attention to the computational cost of the model. A small number of networks combine top-down and bottom-up strategies and have achieved good results. We summarize the CNN models used in various pose estimation methods, analyze the characteristics of the network architectures, and compare the performance of representative methods. The structural design of deep CNN models is becoming increasingly diverse, yet every family of models still has limitations for pose estimation. We discuss the techniques adopted by two-dimensional pose estimation models and their remaining problems, and we outline possible future research directions. First, pre-processing of the input data deserves attention: the clarity of the input directly affects the estimation results, so effective image or video pre-processing may become a new way to improve precision and efficiency. Second, most existing methods operate on static frames cut from videos and are therefore still essentially image-based; real-time pose estimation on video is essential for pose tracking and action recognition, and a few methods already combine deep pose estimation with temporal information such as optical flow, pose flow, and long short-term memory. Third, real application images involve heavier crowding and more severe occlusion, which remain open problems. Finally, lightweight model design is promising and can become one of the key directions for pose estimation. (A schematic sketch of heatmap-based keypoint decoding follows this entry.)
      Keywords: deep learning; human pose estimation; model structure; model optimization; lightweight
      Published online: 2024-05-07
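As a companion to the heatmap-detection branch discussed in the abstract above, here is a minimal sketch of how keypoint coordinates are typically decoded from predicted heatmaps; the array shapes, the argmax rule, and the scaling back to image coordinates are illustrative assumptions, not the procedure of any specific model reviewed.

```python
# Minimal sketch of heatmap-based keypoint decoding, the detection-style alternative
# to direct keypoint regression. The heatmap tensor stands in for a 2D pose network output.
import numpy as np

def decode_heatmaps(heatmaps: np.ndarray, img_h: int, img_w: int) -> np.ndarray:
    """heatmaps: (K, h, w), one map per keypoint. Returns (K, 3) rows of [x, y, score]."""
    k, h, w = heatmaps.shape
    keypoints = np.zeros((k, 3), dtype=np.float32)
    for i in range(k):
        idx = np.argmax(heatmaps[i])          # location of the strongest response
        y, x = divmod(idx, w)
        # Scale the heatmap coordinates back to the input image size.
        keypoints[i] = (x * img_w / w, y * img_h / h, heatmaps[i, y, x])
    return keypoints

# Usage with a placeholder 17-keypoint heatmap stack.
kps = decode_heatmaps(np.random.rand(17, 64, 48), img_h=256, img_w=192)
```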
    • Zhu Yi,Li Xiu
      Vol. 28, Issue 7, Pages: 1990-2010(2023) DOI: 10.11834/jig.211021
      A survey of medical image captioning technique: encoding, decoding and latest advance
      Abstract: Writing medical image reports is a labor-intensive daily task for radiologists, and deep medical image captioning can generate such reports automatically. This survey aims to 1) give readers a clear and well-organized picture of the field, 2) characterize the medical image captioning task itself, and 3) analyze the methods proposed so far. We first state the aims and scope, and then review the literature on deep medical image captioning up to 2021, covering the latest methods, datasets, and evaluation metrics, together with a comparison between medical image captioning and generic image captioning. Deep image captioning techniques are introduced according to their network structures. Current deep medical image captioning is mainly built on the encoder-decoder structure, extended with retrieval-based methods, template-matching methods, attention mechanisms, reinforcement learning, and knowledge graphs. In the encoder-decoder structure, a convolutional neural network (CNN) extracts image features and a recurrent neural network (RNN) generates the caption, and the two networks are linked by an intermediate vector called the context vector. Some models adopt a CNN-RNN-RNN structure, known as a hierarchical RNN or hierarchical long short-term memory (LSTM), in which two stacked RNNs first generate a topic vector and then generate captions supervised by that topic vector. Retrieval-based and template-matching methods are comparatively simple but exploit the fact that medical captions have a high repetition rate and special sentence patterns. Attention mechanisms let the decoder focus on specific parts of the image and sentence while generating each word, making the length of the contextual representation variable. Reinforcement learning (RL) alleviates the mismatch between gradient-descent training and the discrete metrics used to evaluate language generation; it can also act as a multi-agent controller that guides the decoder before it generates output, yielding better-balanced and more logical medical content. Knowledge graphs integrate prior expert knowledge into the model: diseases with similar features occupy neighboring nodes, disease information is updated through graph convolution, and integrating a medical knowledge graph effectively improves the clinical accuracy of the generated report. These techniques are compatible with each other; for example, template matching and attention-based RL can be used simultaneously. In addition, Transformer-based structures have developed rapidly as a new backbone beyond RNNs and CNNs: the Transformer, or self-attention block, can be trained in parallel and captures long-distance dependencies between tokens, making it a stronger feature extractor. The most popular datasets for deep medical image captioning are IU X-Ray and MIMIC-CXR, in which frontal and lateral chest X-ray images are paired with multi-sentence reports. Medical annotations such as medical subject headings (MeSH) or unified medical language system (UMLS) keywords help generate more accurate reports because they serve as extra information, and classifying these tags can be used as a pre-training task. Generic natural language generation metrics are applied to evaluate the generated reports; besides the established BLEU-n, ROUGE, METEOR, and CIDEr scores, newer metrics such as SPICE, SPIDEr, and BERTScore are emerging. Finally, four future directions are identified: 1) more diverse and more accurate datasets, including other modalities such as magnetic resonance imaging (MRI) and color Doppler ultrasound, because current datasets mostly contain chest X-ray images limited to a single body part and modality, and broader data would make models more robust and adaptable; 2) evaluation metrics that are more accurate and more clinically meaningful than generic metrics such as BLEU or ROUGE, which save radiologists' effort but are not the best measures in medicine; 3) unsupervised and semi-supervised methods that lower the cost of building datasets, for example by building on pre-trained models such as ViLBERT and VL-BERT; 4) integrating more prior knowledge into the model and moving toward more detailed, multi-round conversational report generation. (A schematic sketch of the encoder-decoder structure follows this entry.)
      Keywords: deep learning (DL); medical image captioning; automatic radiology report generation; encoder-decoder; image captioning
      Published online: 2024-05-07
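The encoder-decoder structure with an intermediate context vector described above can be illustrated with a toy model: a small CNN encoder produces a context vector that initializes an LSTM decoder over word tokens. All layer sizes, the vocabulary size, and the class name are placeholders; this is a schematic sketch, not the architecture of any surveyed system.

```python
# Minimal sketch of a CNN encoder / RNN decoder captioner linked by a context vector.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                        # CNN image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.to_context = nn.Linear(64, hidden_dim)          # context vector linking CNN and RNN
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, tokens):
        ctx = self.to_context(self.encoder(images).flatten(1))   # (N, hidden)
        h0 = ctx.unsqueeze(0)                                    # initialize decoder state with context
        c0 = torch.zeros_like(h0)
        feats, _ = self.decoder(self.embed(tokens), (h0, c0))
        return self.out(feats)                                   # word logits at every step

# Usage with placeholder images and token ids.
logits = TinyCaptioner()(torch.rand(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
```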

      Image Processing and Coding

    • Wang Hongxia,Wu Jiali,Chen Deshan
      Vol. 28, Issue 7, Pages: 2011-2025(2023) DOI: 10.11834/jig.220098
      A complex background-related binarization method for document-contextual information processing
      Abstract: Objective: Binarization is essential for optical character recognition (OCR) of documents with complex backgrounds. To recognize text in an image faster and more accurately, document image binarization segments the captured color image, or the grayscale image generated from it, and outputs an image that contains only the text information. In the era of big data, enormous amounts of text must be converted from hard copies to electronic form, yet a huge volume of textual information is still stored on paper, and turning it into electronic storage still depends heavily on manual input. Document image binarization therefore plays an important role in digitizing text carriers, and deep learning has accelerated progress on binarizing text images: several end-to-end convolutional neural network (CNN) models have been applied to this task. Method: Compared with traditional threshold-based document image binarization, deep learning based methods incorporate the semantic distribution of text pixels, and CNN-based binarization is considerably more accurate. However, these methods still struggle with text images that have complex backgrounds, producing many false positives, and the available training data is insufficient. The network easily overfits, the intermediate layers are not always activated during training, and the CNN tends to extract only low-level semantic features. Binarization fundamentally relies on low-level cues such as pixel color and contrast, so text-like low-level structures in complex backgrounds must be filtered out. We propose a two-stage binarization method for document images in complex scenes, divided into confusing-pixel screening and binary segmentation, with a separate network of a different structure for each stage. The first network strengthens the recognition and segmentation of confusing pixels in complex backgrounds. Pseudo labels for false-positive regions are generated by maximum inter-class variance (Otsu) thresholding and combined with the real binarization labels so that the network learns to separate text pixels from background pixels that are easily mistaken for text. The second network concentrates on accurately predicting text pixels; its improved encoder-decoder structure drives the model to recover more precise text pixel boundaries. The two networks are applied in sequence to text images with complex backgrounds, which improves the final binarization while each network keeps a small number of parameters and performs only its own task. This division of labor also reduces labeling cost: the labeled data can be reused to train both structures, training remains efficient even with limited data, and one problem is handled by several specialized structures instead of one ever larger network. To further enhance the processing of text details, an asymmetric encoder-decoder structure is introduced. It increases the number of convolution operations per encoder stage to three or four, enlarging the receptive field and allowing the encoder to learn more non-linear features, while the decoder is streamlined. The asymmetric design balances model size and improves the binarization effect with only a modest increase in parameters. Result: The method is compared with recent methods on three datasets from the competition on document image binarization (DIBCO): DIBCO2016, DIBCO2017, and DIBCO2018. The F-measure reaches 92.35% on DIBCO2018, 93.46% on DIBCO2017, and 92.13% on DIBCO2016. Conclusion: The experiments show that the asymmetric encoder-decoder structure improves the binary segmentation metrics, and the proposed dual-network design handles complex backgrounds well: overfitting is effectively suppressed, the intermediate layers are properly trained, and the model extracts higher-level semantic features that separate text pixels from false-positive regions. The complex-background-oriented dual binarization approach improves the binarization effect on the DIBCO datasets. The code is available at https://github.com/wjlbnw/Mask_Detail_Net. (A schematic sketch of Otsu pseudo-label generation follows this entry.)
      Keywords: semantic segmentation; U-Net; document image recognition; binarization; complex background; encoder-decoder structure; multistage segmentation
      Published online: 2024-05-07
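The screening network above uses maximum inter-class variance (Otsu) thresholding to build pseudo labels for false-positive regions. A minimal NumPy sketch of that idea follows; the function names and the "dark pixels are text" assumption are illustrative, and the actual method's label construction may differ.

```python
# Minimal sketch of Otsu (maximum inter-class variance) pseudo-label generation:
# pixels that Otsu marks as text but the ground truth does not are likely confusing background.
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """gray: uint8 image. Returns the threshold that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def false_positive_pseudo_label(gray: np.ndarray, gt_text_mask: np.ndarray) -> np.ndarray:
    """Mark pixels Otsu calls text (dark) that the ground-truth text mask does not contain."""
    otsu_text = gray < otsu_threshold(gray)
    return np.logical_and(otsu_text, ~gt_text_mask.astype(bool))

# Usage with placeholder data.
pseudo = false_positive_pseudo_label(
    np.random.randint(0, 256, (64, 64), dtype=np.uint8), np.zeros((64, 64)))
```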
    • Ding Yuehao,Wu Hao,Kong Fengling,Xu Dan,Yuan Guowu
      Vol. 28, Issue 7, Pages: 2026-2036(2023) DOI: 10.11834/jig.211020
      A dual of real image noise-related blind denoising technique
      Abstract: Objective: Image denoising is one of the fundamental low-level vision problems. Many image processing tasks, such as image segmentation, fine-grained detection, and flaw detection, require high-quality images and therefore need denoising as a preprocessing step. In real scenes, noise is unavoidable because images are corrupted by the acquisition device itself, which degrades visual perception. Research on image denoising originated with the removal of additive white Gaussian noise, but real noise differs greatly from Gaussian noise: its causes are diverse, its distribution is complex, and it usually cannot be described by a single distribution, so methods designed for Gaussian noise perform poorly on real noise. Moreover, traditional denoising methods require manually tuned parameters, and many deep learning based algorithms also remove real noise poorly; those that perform better often rely on complex, time-consuming network models without a proportional gain in denoising quality. Removing real image noise end to end is much harder than non-blind denoising. We therefore propose a two-stage deep learning model for real image denoising that uses a simplified network structure yet achieves good quantitative results and visual quality on image denoising datasets. Method: The proposed blind denoising model, the two-stage modular denoising network (TMNet), consists of a noise level estimation stage and a non-blind denoising stage. For noise level estimation, a four-layer fully convolutional network estimates the noise level of the input image. No pooling layers are used; every layer is a convolution with kernel size 3, followed by padding so that the feature maps of each layer keep the same size. Because attention can effectively extract features and improve performance, a squeeze-and-excitation (SE) module is placed before the last convolutional layer to process the estimated features and adaptively re-weight the channels. The estimated noise level and the input image together form the output of this stage. The non-blind denoising stage takes the estimated noise level and the image as input, turning real image denoising into denoising with a known noise level. At this stage, a multi-scale structure splits the image features into two branches: one uses dilated convolution and the other ordinary convolution. Dilated convolution enlarges the receptive field by inserting holes into the kernel without adding parameters, so image characteristics are captured under different receptive fields. Skip connections are used in the multi-scale structure because they are effective in image denoising and reduce the loss of information during parameter transmission as the network deepens. All convolutions in the non-blind stage use kernel size 3 with matching padding, so feature sizes do not change throughout the network. Because the original images of the smartphone image denoising dataset (SIDD) used for training have very high resolution, each image is divided into 256 patches, and 4 000 epochs of training are performed with random sampling. The loss function is the L1 loss, and all experiments run on an NVIDIA RTX 2080 Ti GPU. Result: Four datasets commonly used in image denoising, the Darmstadt noise dataset (DND), SIDD, Nam, and the Hong Kong Polytechnic University (PolyU) dataset, are used to verify the algorithm, with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as metrics. The average PSNR values on DND, SIDD, Nam, and PolyU are 39.23 dB, 38.54 dB, 40.45 dB, and 37.34 dB, respectively. Compared with traditional algorithms, the denoising quality and visual appearance improve significantly, and with fewer layers the results are comparable to those of deep learning denoisers. Ablation experiments further verify the effectiveness of the skip connections, channel attention, and noise level estimation. Conclusion: We design a convolutional neural network (CNN) for removing noise from real images. The network structure is simple, most convolutions use kernel size 3, and training is fast. The network obtains a richer residual structure for denoising and uses multiple convolutions to extract more effective feature information. Experiments show that it removes real image noise effectively. (A schematic sketch of the SE and two-branch blocks follows this entry.)
      Keywords: deep learning; real image denoising; attention mechanism; noise level estimation; multiscale module
      Published online: 2024-05-07
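Two building blocks mentioned above, the squeeze-and-excitation (SE) channel attention used in the noise-estimation stage and the parallel ordinary/dilated convolution branches of the non-blind stage, can be sketched as below. Channel counts, the 1 × 1 fusion convolution, and the residual connection are assumptions for illustration rather than the exact TMNet design.

```python
# Minimal sketches of an SE channel-attention block and a two-branch
# ordinary/dilated convolution block (multi-scale idea). Dimensions are illustrative.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                     # squeeze: global average pooling
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(self.pool(x).flatten(1)).view(x.size(0), -1, 1, 1)
        return x * w                                            # excite: re-weight channels

class TwoBranchBlock(nn.Module):
    """Parallel ordinary and dilated 3x3 convolutions fused by a 1x1 convolution."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.plain = nn.Conv2d(channels, channels, 3, padding=1)
        self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        # Concatenate the two receptive-field views, fuse, and add a skip connection.
        return self.fuse(torch.cat((self.plain(x), self.dilated(x)), dim=1)) + x

y = TwoBranchBlock()(SEBlock(64)(torch.rand(1, 64, 32, 32)))
```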

      Image Analysis and Recognition

    • Li Xiaoyan,Kan Meina,Liang Hao,Shan Shiguang
      Vol. 28, Issue 7, Pages: 2037-2053(2023) DOI: 10.11834/jig.220141
      ProMIS: probability-based multi-object image synthesis-relevant weakly supervised object detection method
      Abstract: Objective: Fully supervised object detectors based on neural networks are the dominant way to obtain high detection performance and are reliable enough for many real-world applications. However, they require huge amounts of annotated data, and labeling bounding boxes is labor-intensive across the many categories and application scenarios involved, so collecting large-scale detection training sets for every application is impractical. Weakly supervised object detectors are therefore trained from image-level category annotations only. Most recent weakly supervised detectors build on multiple-instance learning (MIL): object proposals are classified and aggregated into an image classification result, and objects are detected by selecting, among all proposals, the bounding box that contributes most to the aggregated classification. However, because instance-level annotations are missing, such detectors have difficulty distinguishing a complete instance from a part of an instance or from a cluster of several instances of the same category. Our method teaches the detector to distinguish instances by inserting confidently detected objects into an input image, generating augmented images together with pseudo bounding box annotations for training. A naive random augmentation does not immediately improve detection, for two reasons: 1) over-fitting, because the generated data is used to train the very detection head that produced it; 2) infeasible augmentation, because when the insertion hyper-parameters are all sampled from uniform distributions, the spatial distribution of the generated objects differs considerably from real data. Method: To resolve these issues, we propose ProMIS, a probability-based multi-object image synthesis method for weakly supervised object detection, built from two iterative and interactive modules: an image augmentation module and a weakly supervised detection module. In each training iteration, objects are detected in the original input image with the weakly supervised detector (pre-trained with its baseline method to ensure accuracy early in training), and highly confident detections are stored in an object pool for later augmentation. The image augmentation module inserts one or more objects sampled from the pool into the input image, producing an augmented training image with pseudo bounding box annotations. To make the augmented image plausible, the category, position, and scale of each inserted object are sampled from posterior probability maps conditioned on the objects already detected in the image. ProMIS maintains three kinds of posterior probabilities, describing the category, spatial, and scale relations between an object and a reference object. These posteriors are estimated online from the objects detected in previous training iterations, and the hyper-parameters of newly inserted objects are assumed to follow them. The detection training module then uses the augmented image and its pseudo annotations to train the weakly supervised detector. To avoid over-fitting to detected false positives, a new parallel detection branch is added to the baseline weakly supervised detection head: the augmented bounding box annotations supervise only the new branch, while the original weakly supervised head, trained with image-level labels only, continues to generate the augmented data. At inference time, only the branch trained with the augmented annotations is kept to produce the test results, preserving the efficiency of the weakly supervised detector. The augmentation and detection modules run iteratively and interactively, steadily improving the detector's ability to distinguish instances. ProMIS is an online augmentation method that needs no images or annotations beyond the original weakly supervised training data, and because it is independent of the particular detector, the augmentation paradigm generalizes to any detector architecture. Result: Experiments verify the effectiveness of the proposed parallel detection branch and the posterior probability maps, which improve naive random augmentation by 5.2% and 2.2%, respectively. Applied to several weakly supervised detectors, including online instance classifier refinement (OICR), the segmentation-detection collaborative network (SDCN), and OICR with a deep residual network (OICR-DRN), ProMIS achieves average improvements of 2.9% and 4.2% over the baselines on the Pascal VOC (pattern analysis, statistical modeling and computational learning visual object classes) 2007 and 2012 datasets, respectively. Ablation analysis further shows that ProMIS reduces the "ground truth in hypothesis" and "hypothesis in ground truth" error modes. Conclusion: ProMIS makes fewer mistakes when distinguishing an instance from its parts or from multiple instances of the same category. (A schematic sketch of posterior-based insertion sampling follows this entry.)
      Keywords: weakly supervised object detection; multi-object data augmentation; image synthesis; probability map sampling; posterior probability estimation
      Published online: 2024-05-07
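The augmentation module above samples the category, position, and scale of each inserted object from estimated posterior probabilities instead of uniform distributions. The sketch below shows that sampling step with toy probability tables; the function signature and the placeholder distributions are assumptions, not the paper's estimated posteriors.

```python
# Minimal sketch of sampling insertion hyper-parameters (category, position, scale)
# from posterior probabilities rather than uniform distributions.
import numpy as np

rng = np.random.default_rng(0)

def sample_insertion(category_probs, position_map, scale_probs, scales):
    """category_probs: (C,)  P(category | reference object)
       position_map:   (H, W) unnormalized P(center location | reference object)
       scale_probs:    (S,)  probabilities over candidate relative scales."""
    category = rng.choice(len(category_probs), p=category_probs)
    flat = position_map.ravel() / position_map.sum()
    y, x = np.unravel_index(rng.choice(flat.size, p=flat), position_map.shape)
    scale = scales[rng.choice(len(scales), p=scale_probs)]
    return category, (x, y), scale

# Usage with toy tables (a same-category insertion is made most likely here).
cat, center, scale = sample_insertion(
    np.array([0.7, 0.2, 0.1]),          # category posterior
    np.ones((32, 32)),                  # would be the estimated spatial posterior map
    np.array([0.25, 0.5, 0.25]),        # scale posterior
    [0.5, 1.0, 2.0])
```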
    • Yu Ye,Chen Weixiao,Chen Fengxin
      Vol. 28, Issue 7, Pages: 2054-2067(2023) DOI: 10.11834/jig.220122
      RIC-NVNet: night-time vehicle enhancement network for vehicle model recognition
      Abstract: Objective: Recognizing vehicle models from night-time images is difficult because of weak exposure, unevenly distributed lighting, and low contrast. Vehicle images captured at night suffer from noise, over-exposure, under-exposure, and interference from additional light sources, which makes manual annotation with the naked eye and recognition by artificial intelligence systems difficult and directly degrades the performance of intelligent transportation systems. Conventional low-light methods brighten the whole image directly, using histogram equalization or the correlation of adjacent pixels, and leave room for improvement in noise suppression and color fidelity. Since the Retinex theory was proposed, subsequent work has decomposed the input image into illumination and reflectance components and enhanced each component in a traditional fashion; these methods perform well for low-light enhancement but are limited by their need for prior knowledge. Deep learning has advanced low-light image enhancement considerably: most methods adopt a U-Net backbone and design a wide variety of loss functions for better convergence, and several categories of datasets have been built to support data-driven methods. However, few methods address night-time vehicles in real scenes, so they generalize poorly to night-time vehicle enhancement, especially under vehicle-light interference or under-exposure of distinctive vehicle parts. This paper therefore proposes RIC-NVNet, a night-time vehicle image enhancement network based on reflectance and illumination components, which enhances the distinctive features of vehicles to improve both overall enhancement quality and vehicle model recognition accuracy. Method: RIC-NVNet consists of an information extraction module, a reflectance enhancement module, and an illumination enhancement module. First, the information extraction module, built on a U-Net structure, takes the night-time vehicle image combined with its grayscale image as input and extracts the reflectance and illumination components of the night-time image. The reflectance enhancement module, with a skip-connection structure, then corrects the color distortion and additional noise of the night-time reflectance component to obtain an enhanced reflectance. Next, the illumination enhancement module, based on a generative adversarial structure and an adaptive weight coefficient matrix, generates a day-time illumination component from the extracted night-time illumination component. Finally, following the Retinex theory, the enhanced reflectance and the generated day-time illumination are multiplied to obtain the enhanced image. To train RIC-NVNet effectively, we improve the constraint loss on the illumination component to strengthen the extraction network, use color restoration loss, structure consistency loss, and RGB channel loss to constrain the reflectance enhancement module, and adopt a generative adversarial loss to constrain the illumination enhancement module and improve its robustness. In summary, RIC-NVNet effectively improves both the quality and the recognizability of night-time vehicle images. Result: RIC-NVNet is evaluated on the simulated night-time vehicle dataset (SNV) and the real night-time vehicle dataset (RNV) proposed in this paper. Using RIC-NVNet for low-light enhancement yields higher Top-1 and Top-5 recognition rates with ResNet-50 (residual neural network-50) than other low-light enhancement methods. On the SNV dataset, the Top-1 and Top-5 recognition rates of RIC-NVNet are 82.68% and 94.92%, about 2% higher than those of the zero-reference deep curve estimation (Zero-DCE) method. The image quality metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), also improve over the other methods. Conclusion: The experimental results show that the proposed method alleviates the low recognition rates of night-time vehicle images caused by weak exposure and multiple interfering light sources. Combining the information extraction, reflectance enhancement, and illumination enhancement modules, it outperforms other low-light enhancement methods in objective recognition rate, image quality metrics, and the subjective overall quality of the enhanced night-time vehicle images. (A schematic sketch of the Retinex recomposition follows this entry.)
      Keywords: vehicle model recognition; low light enhancement; image decomposition; generative adversarial network (GAN); Retinex model
      Published online: 2024-05-07
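The final step described above recombines the enhanced reflectance with the generated day-time illumination following the Retinex model. A minimal sketch follows, with placeholder tensors standing in for the outputs of the enhancement modules; the single-channel illumination and clamping to [0, 1] are illustrative assumptions.

```python
# Minimal sketch of Retinex-style recomposition: enhanced image = reflectance * illumination.
import torch

def retinex_compose(reflectance: torch.Tensor, illumination: torch.Tensor) -> torch.Tensor:
    """reflectance: (N, 3, H, W) in [0, 1]; illumination: (N, 1, H, W), broadcast over RGB."""
    return (reflectance * illumination).clamp(0.0, 1.0)

# Usage with placeholder module outputs.
enhanced = retinex_compose(torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128))
```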
    • Li Fengyong,Ye Bin,Qin Chuan
      Vol. 28, Issue 7, Pages: 2068-2080(2023) DOI: 10.11834/jig.211127
      Mutual attention mechanism-driven lightweight semantic segmentation network
      Abstract: Objective: Fusing low-level detail features with high-level semantic features is a central problem in image semantic segmentation, which is widely used in medical image analysis, remote sensing mapping, automatic driving, and related fields. Lightweight semantic segmentation networks can meet the requirements of real applications only if they balance speed and accuracy. A segmentation network uses a convolutional neural network (CNN) to extract texture, contour, color, gray-level, and other low-level features of objects and to build semantic features on top of them, so an efficient mechanism for fusing detail and semantic features is essential. In lightweight networks, the fusion effect is restricted by the limited number of channels available for feature representation, and fusion modules designed for a specific architecture lack scalability and universality. Attention mechanisms are widely used in segmentation networks, but current work focuses on self-attention, which is confined to a single feature map and cannot easily exchange information between different feature maps. To meet the needs of feature fusion, we design a mutual attention module that models the correlation between feature maps: attention is introduced in the fusion stage so that information can be exchanged between different feature maps. Method: First, we reconstruct the attention computation of the non-local module and change the mapping among query, key, and value to obtain a mutual attention module that operates on a detail feature map and a semantic feature map. A single attention pass builds an association model between every position of the detail feature map and every position of the semantic feature map. Guided by this association model, features on the semantic map are aggregated and added to the corresponding positions of the detail map, so semantic information is fused into the detail features. The same operation in the opposite direction fuses detail information into the semantic map, completing the mutual-attention-based feature fusion. Because the fused features have the same type as the input features, the module can easily be embedded into existing semantic segmentation models. The mutual attention module can also share its queries and keys across the two directions, which reduces model complexity, strengthens the connection between branches, and coordinates the flow of information to further improve performance. Because the association model guides feature fusion and the cross-shared design splices the two directions together, the representation capability of the fused features is significantly enhanced at a modest computational cost. Result: Comparative experiments based on the bilateral segmentation network BiSeNet V2 verify the feature fusion ability of the mutual attention module, and five semantic segmentation models trained on the public CamVid dataset verify its versatility. Averaged over five runs, we compare floating-point operations, memory usage, number of parameters, and mean intersection over union (mIoU) when the mutual attention module is added to each original network. For BiSeNet V2, the modified network is more lightweight, with reductions of 8.6%, 8.5%, and 2.6% in these costs, while the mIoU still improves. The mIoU of the other four networks also improves after the module is embedded, with maximum gains of 1.14% and 0.74%. We further analyze how the number of channels of queries, keys, and values and the number of attention operations affect complexity and the generation of the association model. Conclusion: A mutual attention mechanism-driven lightweight semantic segmentation network is presented. The proposed mutual attention guides feature fusion through the correlation between feature maps. Experiments on public datasets show that the module can be embedded into different lightweight segmentation models and improves their segmentation accuracy, and its plug-and-play design gives it good universality across network models. (A schematic sketch of the mutual attention computation follows this entry.)
      Keywords: image semantic segmentation; lightweight network; mutual attention module; feature fusion; association model
      Published online: 2024-05-07
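The mutual attention idea above can be sketched as cross-attention in both directions between a detail feature map and a semantic feature map, with shared query/key projections. Channel sizes, the equal spatial resolution of the two maps, and the residual addition are simplifying assumptions; this is not the exact module proposed in the paper.

```python
# Minimal sketch of mutual (cross) attention between two feature maps:
# queries come from one map, keys and values from the other, in both directions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualAttention(nn.Module):
    def __init__(self, channels: int = 64, inner: int = 32):
        super().__init__()
        # Shared query/key projections strengthen the link between the two branches.
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def fuse(self, target, source):
        n, c, h, w = target.shape
        q = self.query(target).flatten(2).transpose(1, 2)       # (N, HW, inner)
        k = self.key(source).flatten(2)                         # (N, inner, HW)
        v = self.value(source).flatten(2).transpose(1, 2)       # (N, HW, C)
        attn = F.softmax(q @ k / (q.size(-1) ** 0.5), dim=-1)   # association model
        return target + (attn @ v).transpose(1, 2).view(n, c, h, w)

    def forward(self, detail, semantic):
        # Fuse in both directions: semantic -> detail and detail -> semantic.
        return self.fuse(detail, semantic), self.fuse(semantic, detail)

d, s = MutualAttention()(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
```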
    • Yang Zhimin,Song Wei
      Vol. 28, Issue 7, Pages: 2081-2092(2023) DOI: 10.11834/jig.220052
      Selecting and fusing coarse-and-fine granularity features for fine-grained image recognition
      Abstract: Objective: With the rapid development of deep learning techniques, the machine intelligence of general image recognition has reached or even surpassed the human level, encouraging researchers to tackle more complicated tasks in fine-grained vision. As one of the classic tasks in this field, fine-grained image recognition (FGIR) subdivides subordinate categories such as bird species, car models, and aircraft types. FGIR is more challenging than general image recognition because subordinate species often have smaller inter-class differences (e.g., similar geometry or texture) and larger intra-class variations (e.g., illumination or pose variations). Exploring subtle differences and reducing intra-class variation across the diverse parts of an object is therefore the core challenge of FGIR. Most current FGIR methods focus on locating discriminative parts at coarse and fine granularity, with the goal of generating powerful part representations for recognition. Recent studies have shown that part-based methods that locate diverse parts can distinguish different subclasses and thereby enhance FGIR performance; their success largely comes from selecting and locating multiple clearly different parts for downstream recognition. Strongly supervised part-based methods use part annotations to establish the connections between parts, but the manual labeling they require is time-consuming and inefficient. By contrast, weakly supervised part-based methods show that the complementary relations between parts can be learned. Methods based on part-level features, however, may not be robust to appearance distortion, e.g., the poses of bird heads are uncontrollable in real scenes, and such pose variations may weaken the validity of spatial features. To compensate for this deficiency, multi-granularity methods have been adopted for feature learning, yet little effort has gone into determining at which granularities the diverse parts are most discriminative or how information across granularities can be effectively fused for recognition accuracy. We therefore argue that FGIR needs not only to efficiently select coarse-and-fine granularity features but also to effectively fuse them through part relations, and we propose an FGIR method that selects and fuses coarse-and-fine granularity features. Method: Our method extracts basic convolutional features with a pretrained convolutional neural network and carries out feature selection and fusion through three modules. First, a fine-grained feature selection module highlights fine-grained discriminative part features through spatial and channel selection: because the parts in a fine-grained image are usually spatially connected and activated in most feature channels, spatial and channel selection discards the background and highlights the informative channels, yielding discriminative image representations. Second, a coarse-grained feature selection module focuses on the subtle features of parts. Based on the parts selected by the fine-grained module, it mines the semantic and positional relationships among parts to generate coarse-grained diverse features that provide context for the fine-grained parts. Most previous methods used only local subtle features, ignoring the relationship between parts and their coarse-grained context; they usually fed each part into a subnetwork individually or simply spliced part features for prediction. Yet these relationships, especially the semantic and positional relations between parts and their coarse-grained contexts, carry valuable information for feature learning and recognition. Inspired by the self-attention mechanism, this module models the relationship between each part of the object and its context from the two perspectives of semantic modeling and spatial modeling, and attentively selects the information related to the object, providing coarse-grained diversity features for the fine-grained parts. Lastly, a coarse-and-fine granularity feature fusion module establishes communication between fine- and coarse-grained features and fuses them into complementary coarse-and-fine granularity representations, improving recognition accuracy. Result: Extensive experiments verify the effectiveness of the proposed method. We compare it with seven state-of-the-art FGIR methods on three public benchmarks: Caltech-UCSD Birds-200-2011 (CUB-200-2011), Stanford Cars, and fine-grained visual classification aircraft (FGVC-Aircraft). The quantitative evaluation metrics are accuracy, mean average precision, and precision-recall curves (higher is better), and the parameters and floating-point operations (lower is better) of several methods are reported for comparison. Our method outperforms all compared FGIR methods, with recognition accuracies of 90.3%, 95.6%, and 94.8% on CUB-200-2011, Stanford Cars, and FGVC-Aircraft, respectively. Compared with progressive multi-granularity (PMG), the best of the compared methods, accuracy improves by 0.7%, 0.5%, and 1.4% on the three datasets, while floating-point operations decrease by 17.8 G and parameters by 17.2 M. A series of comparative experiments shows the contribution of each module, and the widely used class activation mapping method is introduced to visualize the recognition results of our method and other high-performing methods, enabling a fair comparison and a qualitative analysis of its successes and failures. Conclusion: For the task of fine-grained recognition, we propose an FGIR method that selects and fuses coarse-and-fine granularity features. The experimental results show that it outperforms several state-of-the-art FGIR methods, extracting both coarse- and fine-grained visual features with good discrimination and diversity. (A schematic sketch of channel selection follows this entry.)
      Keywords: fine-grained recognition; coarse-and-fine granularity; feature selection; feature fusion; discrimination; diversity
      Published online: 2024-05-07
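The channel-selection part of the fine-grained feature selection module described above can be illustrated by keeping only the most strongly activated channels. The top-k criterion below is a stand-in assumption for whatever selection rule the method actually learns; shapes and names are illustrative.

```python
# Minimal sketch of channel selection: suppress all but the most informative channels.
import torch

def select_channels(features: torch.Tensor, keep: int) -> torch.Tensor:
    """features: (N, C, H, W). Zero out all but the `keep` channels with highest mean activation."""
    scores = features.mean(dim=(2, 3))                         # (N, C) channel importance
    topk = scores.topk(keep, dim=1).indices
    mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)     # (N, C) binary channel mask
    return features * mask.unsqueeze(-1).unsqueeze(-1)

# Usage with a placeholder backbone feature map.
selected = select_channels(torch.rand(2, 512, 14, 14), keep=128)
```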
    • Dong Yangyang,Song Beibei,Sun Wenfang
      Vol. 28, Issue 7, Pages: 2093-2104(2023) DOI: 10.11834/jig.220079
      Local feature fusion network-based few-shot image classification
      Abstract: Objective: Convolutional neural network (CNN) based deep learning now benefits image recognition, detection, segmentation, and related fields. However, the learning ability of a CNN usually depends on a large number of labeled samples: the model over-fits when some categories have too few labeled samples, and collecting labels is time-consuming and costly. Human perception, by contrast, can learn from very few samples; given only a few pictures of a new category, a person easily recognizes other images of that category. To give CNN models a similar ability, few-shot learning has attracted increasing attention: it classifies new image categories from a very limited amount of annotated data. Metric-based meta-learning is currently one of the most effective families of few-shot methods. However, it typically relies on global features, which cannot adequately represent image structure; more local feature information is needed, because local features provide discriminative and transferable information across categories. Some metric methods obtain pixel-level deep local descriptors as the local representation by removing the last global average pooling layer of the CNN, but such descriptors sacrifice the contextual information of the image, which limits classification. Moreover, with so few labeled instances, the feature extraction network struggles to learn a good representation that generalizes to new categories. To exploit the local features of the image and improve generalization, we propose a few-shot classification method based on local feature fusion. Method: First, to obtain local features, the input image is divided into H × W local blocks that are fed to the feature extraction network; this representation keeps both the local information of the image and its context, and multi-scale grid blocks are used as well. Second, to learn and fuse the relationships among the local feature representations, we design a local feature fusion module based on the Transformer architecture, because its self-attention mechanism can effectively capture and fuse the relationships among input sequences. After fusion, each local feature carries information from the other local features and therefore global context, and the fused local features of each image are concatenated as the final output. This strengthens the representation of the original input image and improves the generalization ability of the model. Finally, the query image is classified by the Euclidean distance between its embedding and each support-class prototype. Training proceeds in two steps, pre-training and meta-training. In the pre-training stage, the backbone network with an attached Softmax layer classifies all images of the training set, and random cropping, horizontal flipping, and color jittering are used as data augmentation to improve generalization. After pre-training, the backbone is initialized with the pre-trained weights and the other components are fine-tuned. In the meta-training stage, the episodic training strategy is adopted. For a fair comparison with other few-shot classification methods, ResNet-12 is used as the backbone feature extractor, and the cross-entropy classification loss is optimized with stochastic gradient descent (SGD). The initial learning rate is 5 × 10^-4; training runs for 100 epochs, the learning rate is halved every 10 epochs, and each epoch contains 100 training episodes and 600 validation episodes. Because the TieredImageNet dataset is larger and its domain differences are greater, more iterations are needed for convergence, so training on it runs for 200 epochs and the learning rate is halved every 20 epochs. In the test stage, 5 000 episodes are randomly sampled from the test set to evaluate the average classification accuracy. Result: The method is compared on three few-shot classification benchmarks. On MiniImageNet, the average classification accuracy improves by 2.96% and 2.9% under the 5-way 1-shot and 5-way 5-shot settings, respectively. On the CUB dataset, the accuracy increases by 3.22% and 1.77%. On TieredImageNet, the method is on par with the state-of-the-art in average classification accuracy. Extensive ablation experiments further verify the effectiveness of the proposed method. Conclusion: We propose a local feature fusion based method for few-shot classification. It makes full use of local features and improves both the feature extraction and the generalization ability of the model. The Transformer-based local feature fusion module further enhances feature representation and can potentially be embedded into other few-shot classification methods. (A schematic sketch of prototype-based classification follows this entry.)
      关键词:few-shot learning;metric learning;local feature;Transformer;feature fusion   
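      The metric step described above reduces to comparing query embeddings with class prototypes. Below is a minimal sketch of that idea in PyTorch; the tensor shapes, helper name, and toy episode are illustrative assumptions, not the authors' implementation, which additionally fuses multi-scale local features with a Transformer before this step.

```python
import torch

def prototype_logits(support_feats, support_labels, query_feats, n_way):
    """Classify queries by negative Euclidean distance to class prototypes.

    support_feats: (n_support, d) fused embeddings of the support images
    support_labels: (n_support,) integer labels in [0, n_way)
    query_feats:   (n_query, d) embeddings of the query images
    """
    # Prototype = mean embedding of each support class
    prototypes = torch.stack(
        [support_feats[support_labels == c].mean(dim=0) for c in range(n_way)]
    )                                                    # (n_way, d)
    # Squared Euclidean distance between every query and every prototype
    dists = torch.cdist(query_feats, prototypes, p=2) ** 2   # (n_query, n_way)
    return -dists                                        # higher logit = closer prototype

# Toy 5-way 1-shot episode with 64-dimensional embeddings
support = torch.randn(5, 64)
labels = torch.arange(5)
queries = torch.randn(15, 64)
pred = prototype_logits(support, labels, queries, n_way=5).argmax(dim=1)
```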

      Image Understanding and Computer Vision

    • Zhu Wei,Zhang Yuhang,Ying Yue,Zheng Yayu,He Defeng
      Vol. 28, Issue 7, Pages: 2105-2119(2023) DOI: 10.11834/jig.220047
      A dense residual structure and multi-scale pruning-relevant point cloud compression network
      摘要:Objective: Point clouds are now widely used in autonomous driving, virtual reality, 3D measurement, and related fields, since they represent 3D data more densely and realistically than meshes. Their high resolution, however, makes the data very large: dense point clouds contain millions of points together with complex attribute information, so transmission and storage consume considerable network bandwidth and storage resources. Point cloud compression methods with a high compression ratio and low distortion are therefore necessary. Method: First, point clouds are represented as sparse tensors in COO (coordinate) format instead of voxels. Sparse convolution (SC) and submanifold sparse convolution (SSC) replace regular convolution: SSC preserves sparsity while extracting features and strengthens the network's ability to extract local features, whereas SC has a larger receptive field that compensates for the limited receptive field of SSC. Second, because point clouds are sparse and unorganized in space, their channel information tends to be more informative than spatial information. By combining channel attention with the dense residual network that performs well in image super-resolution, we construct a three-dimensional dense residual module with channel attention (3D-RDB-CA), which captures cross-channel features of high-dimensional information and improves compression performance. Furthermore, existing point cloud compression networks reconstruct high-resolution point clouds from low-resolution features through stacked de-convolution layers, which may produce a checkerboard effect. To mitigate this effect and reduce the dynamic memory footprint during compression, a pruning layer is added after the multi-scale up-sampling layer in the decoder. Using the side information saved during encoding, this layer removes feature points that contribute little to reconstruction accuracy, which improves the dynamic memory usage and convergence speed of model training. Finally, a color compression scheme built on the compressed geometry is designed to broaden the applicable scope of the compression network. Result: For geometry compression, three conventional point cloud compression algorithms (G-PCC (octree), G-PCC (trisoup), and V-PCC) and two deep learning based algorithms (pcc_geo_cnn_v2 and learned_pcgc) are compared with the proposed network. The peak signal-to-noise ratio is computed for both D1-p2point (D1 PSNR) and D2-p2plane (D2 PSNR), and the corresponding rate-distortion curves are drawn. For G-PCC and V-PCC, the bit-rate range and parameters are configured according to the MPEG common test conditions. Using the proposed network as the baseline over the corresponding bit-rate range, the BD-Rate and BD-PSNR of the other methods are calculated for comparison. Compared with the MPEG point cloud compression standard V-PCC, the BD-Rate gains exceed 41%, 54%, and 33% on the respective datasets, and the encoding runtime of the proposed network is comparable to G-PCC and only 2.8% of that of V-PCC. For color compression, G-PCC (octree) serves as the baseline. By setting different octree bit depths, quantization ratios, and color qualities, the color compression distortion is obtained at different bit rates; computing the YUV-PSNR of both methods under the corresponding bit rates and geometric distortion yields rate-distortion curves for comparison. The experiments show that at low bit rates the YUV-PSNR of the proposed network is better than the octree-based color compression in G-PCC. Conclusion: The proposed network performs well for both geometry and color compression, preserving more of the original point cloud information at a lower bit rate, and it can support geometry- and color-compression applications of deep learning based point cloud compression.
      关键词:deep learning;point cloud compression;auto-encoder;sparse convolution;attention mechanism in point cloud;dense residual structure;multi-scale pruning   
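      The 3D-RDB-CA module above combines channel attention with dense residual features. A minimal sketch of the channel-attention idea on a dense 3D feature volume is shown below; the paper applies it to sparse tensors inside a dense residual block, so the dense (B, C, D, H, W) layout, class name, and reduction ratio here are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style channel attention over a 3D feature volume."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        squeeze = x.mean(dim=(2, 3, 4))           # global average pool -> (B, C)
        weights = self.fc(squeeze).view(b, c, 1, 1, 1)
        return x * weights                         # re-weight channels

feat = torch.randn(2, 32, 8, 8, 8)
out = ChannelAttention3D(32)(feat)                 # same shape, channel-reweighted
```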
    • Ju Yakun,Jian Muwei,Rao Yuan,Zhang Shu,Gao Feng,Dong Junyu
      Vol. 28, Issue 7, Pages: 2120-2134(2023) DOI: 10.11834/jig.220050
      MASR-PSN: a low-resolution photometric stereo images-relevant deep learning model for high-resolution surface normal reconstruction
      摘要:Objective: Three-dimensional (3D) reconstruction is a central topic in computer vision. Photometric stereo recovers the pixel-wise surface normals of a fixed scene from images captured under varying illumination, using shading cues. Unlike binocular and multi-view stereo, which triangulate sparse 3D points, it recovers a dense per-pixel normal map and handles weakly textured objects better, so it is commonly used in high-precision 3D reconstruction tasks such as cultural relic reconstruction and industrial defect detection. High-resolution surface normals provide richer and more effective 3D information for resolving complex 3D structure and alleviating blur in normal reconstruction. However, high-resolution linear-response cameras are expensive, so recovering high-resolution surface normals from photometric stereo images remains difficult, and reconstructing them from low-resolution photometric stereo images is an urgent problem. Method: We propose a deep learning based super-resolution photometric stereo algorithm that recovers accurate high-resolution surface normals from low-resolution photometric stereo images. First, a normalization operation is applied to the corresponding pixels across the complete set of low-resolution photometric stereo images, which alleviates the effects of strongly varying surface reflectance and oversaturated specular reflection; this pre-processing allows objects with steep color changes to be treated like homogeneous surfaces during training. We then design a multi-level aggregation super-resolution photometric stereo network (MASR-PSN) with a novel deep-and-shallow fusion max-pooling aggregation framework. This framework enhances the feature representation and preserves multi-scale information, because deep and shallow features carry different receptive fields. To learn features across scales effectively, a weight-shared feature regressor is developed that learns and reconstructs the surface normal from the multi-scale features: it takes the multi-scale features as input and outputs 4 × 4 super-resolution features, which are fused in the following step. In the regressor, convolution kernels of different sizes are arranged in parallel. The 3 × 3 kernels preserve smooth spatial transitions, but excessive spatial smoothing loses resolution details and causes blur, so a parallel design of 3 × 3 and 1 × 1 convolution layers is adopted to preserve consistent details in the super-resolution surface normal. In addition, a joint loss function optimizes MASR-PSN under both normal-gradient and normal-angle constraints. The normal-angle constraint alone only reduces the average error of the predicted normal, sacrificing surface details and producing blur; the normal-gradient constraint focuses on the changes between adjacent pixels, capturing more details and keeping the recovered super-resolution surface normal sharp. Result: Extensive ablation experiments demonstrate the effectiveness of the proposed deep-and-shallow aggregation layer and parallel shared-weight regressor, which significantly reduce the mean angle error (MAE) of the generated surface normals. The method requires only half-resolution photometric stereo images compared with other methods, yet accurately reconstructs the structure of complex surfaces in a high-resolution normal map. Quantitative comparisons are carried out on the DiLiGenT benchmark, and qualitative comparisons on the Light Stage Data Gallery and Gourd datasets. On the DiLiGenT benchmark (using only half-resolution input compared with other methods), MASR-PSN achieves an average angle error of 7.31 degrees with 96 dense input images, an improvement of 0.12 degrees, and an average angle error of 9.00 degrees with 10 sparse input images, an improvement of 0.43 degrees. Additional qualitative experiments on the Light Stage Data Gallery and Gourd datasets confirm its robustness and effectiveness. Conclusion: MASR-PSN predicts super-resolution surface normals with clearer details from low-resolution photometric stereo images and further improves the accuracy of surface normal reconstruction.
      关键词:3D reconstruction;photometric stereo;surface normal recovery;deep learning;super resolution   
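      The pre-processing step above normalizes each pixel's observations across the whole photometric stereo stack to suppress albedo variation and saturated highlights. Below is a minimal sketch of one common way to do this (dividing by the per-pixel norm over all lights); the exact formulation used by MASR-PSN may differ, and the array shapes and function name are assumptions for illustration.

```python
import numpy as np

def normalize_observations(images, eps=1e-8):
    """Per-pixel normalization across a photometric stereo image stack.

    images: (n_lights, H, W, 3) low-resolution observations of one object.
    Dividing each pixel's observation vector by its norm across all lights
    suppresses spatially varying reflectance, which is the spirit of the
    pre-processing described above.
    """
    obs = images.reshape(images.shape[0], -1)             # (n_lights, H*W*3)
    norm = np.linalg.norm(obs, axis=0, keepdims=True)     # per-pixel norm over lights
    return (obs / (norm + eps)).reshape(images.shape)

stack = np.random.rand(32, 16, 16, 3)                     # toy 32-light stack
normalized = normalize_observations(stack)
```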
    • Liu Suyi,Chi Jianning,Wu Chengdong,Xu Fang
      Vol. 28, Issue 7, Pages: 2135-2150(2023) DOI: 10.11834/jig.220154
      Recurrent slice networks-based 3D point cloud-relevant integrated segmentation of semantic and instances
      摘要:Objective: The growth of GPU computing power has benefited 3D computer vision, and 3D point cloud segmentation supports research areas such as robots and manipulators. Point cloud segmentation comprises two main tasks, semantic segmentation and instance segmentation; both aim to detect specific regions in the scene represented by minimal unit sets, parsing the sampled scene point cloud into groups of points in which each group corresponds to a separate instance or object class. Although each task can be optimized on its own, making the two tasks benefit each other remains difficult, and the accuracy of feature extraction from 3D point clouds is still limited. Because instance segmentation accuracy is tightly coupled to semantic segmentation performance, incorrect instance predictions distort the semantic segmentation and classification results, causing problems such as semantic classification errors and fuzzy instance edges. We therefore develop a recurrent slice network for joint instance and semantic segmentation of 3D point clouds. Method: The backbone consists of an improved recursive slice feature extraction network and a feature integration network. First, the slice-pooling layer of the recursive slice feature extraction network slices the input point cloud along each of the three spatial directions, and max pooling handles the unordered point sequences. Second, a plain bidirectional recurrent neural network (RNN) cannot update earlier input information effectively, leading to insufficient learning ability and vanishing gradients; a bidirectional long short-term memory (BiLSTM) network is therefore used to exchange local information between different slices and to obtain a feature matrix that encodes both local and global features. Third, the extracted features are decoded by two parallel branches for semantic segmentation and instance segmentation. Feature fusion over multiple receptive fields is applied in each branch before the semantic and instance features are fused with each other: semantic-aware information is drawn from the high-dimensional semantic features and combined with the instance features to obtain a semantic-aware instance segmentation model, and, to realize instance-embedded semantic segmentation, k-nearest neighbor (KNN) clustering in the instance embedding space finds a fixed number of neighboring points for each point, so that points of the same instance are close together and points of different instances are far apart. Hyperparameters filter out some outliers to preserve the generalization ability of the model. Result: To verify point cloud segmentation performance, two public datasets, the Stanford 3D indoor semantics dataset (S3DIS) and the ShapeNet dataset, are used for comparative analysis against state-of-the-art semantic, instance, and joint segmentation approaches. In the 6-fold cross-validation experiment on S3DIS, the proposed algorithm reaches 73% mean intersection over union (mIoU), 82.3% mean accuracy (mAcc), and 89.3% overall accuracy (oAcc), which are 4.4%, 10.2%, and 1.9% higher than the position adaptive convolution (PAConv) algorithm; the mean instance coverage (mCov) and mean instance weighted coverage (mWCov) of the instance segmentation reach 64.1% and 65.3% on Area 5, which are 0.6% and 0.7% higher than the PointGroup algorithm. In the semantic segmentation experiment on S3DIS, our algorithm achieves the best results in 8 of the 13 categories. On the ShapeNet dataset, the semantic segmentation accuracy reaches 89.2% mIoU, 4.6% higher than PAConv. Conclusion: Focusing on joint semantic and instance segmentation of 3D point clouds, we develop a feature slice network based fusion algorithm: instance features are fused into the semantic branch, and semantic features are passed to the instance segmentation branch. Experiments on S3DIS and ShapeNet show that the joint approach compares favorably with other point cloud segmentation algorithms.
      关键词:3D point cloud;semantic segmentation;instance segmentation;recurrent slice network (RSNet);semantic feature;instance feature;feature fusion   
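      The instance-embedding step above groups each point with a fixed number of nearest neighbors in the learned embedding space. A minimal brute-force sketch of that neighborhood search is given below; the embedding dimension, k, and the outlier filtering of the actual method are not reproduced, so treat the names and values as assumptions.

```python
import torch

def knn_in_embedding_space(embeddings, k=16):
    """Return the indices of the k nearest neighbours of every point.

    embeddings: (N, d) per-point instance embeddings; points of the same
    instance are trained to lie close together, so a point's neighbourhood
    approximates its instance.
    """
    dists = torch.cdist(embeddings, embeddings)             # (N, N) pairwise distances
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop the point itself
    return knn                                               # (N, k)

emb = torch.randn(1024, 5)              # toy embeddings for 1 024 points
neighbours = knn_in_embedding_space(emb, k=16)
```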

      Virtual Reality and Augmented Reality

    • Pan Xiaokun,Liu Haomin,Fang Ming,Wang Zheng,Zhang Yong,Zhang Guofeng
      Vol. 28, Issue 7, Pages: 2151-2166(2023) DOI: 10.11834/jig.210632
      Dynamic 3D scenario-oriented monocular SLAM based on semantic probability prediction
      摘要:Objective: Visual simultaneous localization and mapping (vSLAM) is essential in computer vision and robotics, recovering the camera poses and 3D scene structure from input images. Because moving 3D objects in the real environment seriously affect system stability, most existing vSLAM systems rely on a static-scene assumption, which limits their use in dynamic environments. Geometry-based methods alleviate the negative effect of dynamic objects by checking geometric constraints such as the epipolar constraint and the re-projection error. Recent deep learning based semantic segmentation provides additional useful information for SLAM, because the static and dynamic parts of a scene are often closely related to their semantics: by lifting image information from the pixel level to the semantic level, a vSLAM system can in principle run stably in dynamic environments. However, some semantic SLAM schemes directly remove the segmented semantic objects without considering their motion states. In many common real scenes this discards regions that would provide stable visual features, and the lack of sufficient observations severely affects the stability of the SLAM system. A feasible path is to analyze the motion state of each semantic cluster and then apply suitable strategies to reduce the influence of moving objects on the visual localization module; the problem of updating the map in dynamic scenes also needs to be addressed. Method: We propose a new monocular vSLAM algorithm based on semantic probability prediction that combines semantic segmentation with a robust estimation algorithm. A learning-based semantic segmentation module is adopted to simplify training and generalization. First, the input image is segmented with reference to the segmentation results of existing frames, so that each cluster is tracked in the 2D image space and the temporal consistency of segmentation is kept. Next, robust estimation and geometric constraints, together with the temporal consistency information, are used to detect the motion state of these clusters. Simply discarding all possibly moving objects would reduce the number of observations and weaken system robustness; instead, the cluster state is used to represent the static/dynamic probability of each 2D observation, which incorporates the uncertainty at the edges of dynamic objects into the image space and reduces the interference of observations on dynamic objects with camera pose estimation. Besides down-weighting moving observations, a spatially consistent sparse map benefits the stability of monocular vSLAM in dynamic scenes: dynamic 3D objects introduce invalid map points, making the map inconsistent with the real 3D structure and posing a risk to long-duration operation. We address this from a probabilistic perspective: a map point is judged invalid when the probability derived from the motion probabilities of its previous 2D observations falls below a threshold, so the system removes invalid map points in time and runs stably in highly dynamic scenes over long periods. Result: The proposed method is evaluated both quantitatively and qualitatively. Compared with conventional monocular vSLAM systems, it achieves lower absolute trajectory error (ATE) on the TUM-RGBD highly dynamic sequences, a metric that reflects effectiveness for real-time applications such as AR-oriented visual localization. Because the motion range of the dynamic object (a person) in those sequences is small and the scenes are similar, we also record our own challenging dynamic dataset with a VICON motion capture system. In these scenes with complicated object and viewpoint motion, our method shows a clear advantage in ATE and tracking completeness. We also qualitatively compare mapping quality and an augmented reality (AR) application in highly dynamic scenes: the sparse map of our method changes synchronously with the moving objects, maintaining consistency with the real environment, which benefits map-based downstream applications such as robot path planning; in the AR application, a virtual cube stays fixed in the 3D scene and does not drift when disturbed by dynamic objects. Conclusion: By integrating semantic information and robust estimation in dynamic scenes, the proposed method performs data association and motion state detection on the segmented regions, represents the motion state of 2D observations as probabilities, and removes invalid map points in time, which significantly improves camera pose accuracy, mapping quality, and the robustness of monocular vSLAM in highly dynamic environments.
      关键词:visual-based simultaneous localization and mapping (vSLAM);semantic segmentation;dynamic environment;robust estimation;probability prediction   
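      The map-maintenance step above turns per-observation motion probabilities into a decision about whether a map point is still valid. The sketch below shows one simple way such a rule could look; the averaging, the threshold value, and the dictionary layout are assumptions for illustration, not the paper's formulation.

```python
def update_map_point_validity(map_points, threshold=0.3):
    """Flag map points whose accumulated static probability is too low.

    Each map point stores the per-frame probabilities that its 2D
    observations were static; averaging them is one simple way to turn the
    cluster-level motion state into a point-level validity score.
    """
    for p in map_points:
        probs = p["obs_static_probs"]
        static_prob = sum(probs) / max(len(probs), 1)
        p["valid"] = static_prob >= threshold
    return [p for p in map_points if p["valid"]]

points = [
    {"obs_static_probs": [0.9, 0.8, 0.95]},   # mostly static -> kept
    {"obs_static_probs": [0.1, 0.2, 0.05]},   # mostly dynamic -> removed
]
kept = update_map_point_validity(points)
```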
    • Zhang Chen,Jiang Wenying,Chen Siyuan,Zhou Wen,Yan Fengting
      Vol. 28, Issue 7, Pages: 2167-2181(2023) DOI: 10.11834/jig.211239
      Multi-agent path planning based on improved double DQN
      摘要:Objective: Evacuation drills such as fire escape drills are organized to improve rehearsal training and firefighting awareness, but running multiple drills is costly for the organizers: they depend on the drill venue, the physical condition of the participants, and real-time position information. Virtual reality technology can guide virtual fire escape at lower cost and risk with higher reliability, and multi-agent path planning is needed to simulate emergency drills in virtual scenarios. Method: We develop an improved double deep Q network (DQN) framework. The virtual scenario is built by collecting campus information, including multiple agents, obstacles, exits, fire-affected areas, and other factors. Since all agents are assumed to be on the same plane, the scene is converted into two-dimensional grid maps by gridding and coordinate transformation, and different grids in the 2D grid plane m are colored to represent obstacles, fire-affected areas, exits, and agent locations. According to the locations of the agents in the virtual scene, the grid plane m is layered into grid planes m1 and m2 of sizes 64 × 100 and 48 × 100, respectively. The double deep Q network uses two structurally identical networks, Q1 and Q2, each consisting of convolution layers and fully connected layers, whose inputs match the grid planes of the same sizes as m1 and m2 after environmental stratification. For these grid planes, trainable planes m1t' and m2t' are obtained by randomly assigning the same number of 1 × 1 black blocks to represent possible obstacle locations and by generating planes for all different starting positions to represent all states of the agents in the scene; these are used to initialize experience pools D1 and D2 and to train Q1 and Q2. In an actual drill the crowd does not evacuate completely independently: because of human sociality, there are social relationships among the participants and a common “gathering and following” behavior, and organizers usually arrange a number of guiders at different locations to assist the participants. Our framework therefore adds guiding agents to the virtual scenario and applies a multi-agent grouping strategy based on an improved k-medoids algorithm: agents are grouped according to their locations and relationships, the corresponding guiding agent of each group is selected, and the other agents in the group are led during evacuation, combined with the improved double deep Q network path planning described above, which further improves the reliability and efficiency of evacuation. Result: Extensive experiments validate the proposed method. During training, the network Q3 of the traditional DQN method converges after about 24 000 batches, whereas the Q1 and Q2 networks converge after about 3 000 batches, showing that the proposed method converges significantly faster than the traditional DQN and is more stable. The average health evacuation value (AHEP) is used to evaluate the evacuation effect in terms of efficiency and safety in fire scenarios: the proposed method is about 84% and 104% higher than the traditional A-STAR and DIJKSTRA path planning methods, 30% and 21% higher than the extended A-STAR and the Dijkstra-ACO hybrid algorithm designed for changeable fire scenes, and 20% higher than the DQN algorithm, so evacuation efficiency and safety are improved and the planned paths are better. To study the effect of grouping, AHEP values are compared for group sizes of 4, 5, 6, and 7; a group size of 6 gives the highest value, 17%, 13%, and 6% higher than group sizes of 4, 5, and 7, respectively, showing that appropriate grouping of multiple agents improves evacuation efficiency. Conclusion: The proposed method improves evacuation efficiency and safety to a useful extent.
      关键词:virtual reality;fire drill;multi-agent;deep reinforcement learning;grouping strategy   
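      The framework above relies on the standard double-DQN update, in which one network selects the next action and the other evaluates it. Below is a minimal sketch of that target computation in PyTorch; the grid-plane inputs, the m1/m2 layering, and the grouping strategy of the paper are not reproduced, and the toy networks and gamma value are assumptions.

```python
import torch

def double_dqn_targets(q1, q2, rewards, next_states, dones, gamma=0.99):
    """Compute double-DQN targets: Q1 selects the action, Q2 evaluates it."""
    with torch.no_grad():
        best_actions = q1(next_states).argmax(dim=1, keepdim=True)        # action selection
        next_values = q2(next_states).gather(1, best_actions).squeeze(1)  # action evaluation
        targets = rewards + gamma * (1.0 - dones) * next_values
    return targets

# Toy usage: state dimension 4, 5 discrete actions, batch of 8 transitions
q1, q2 = torch.nn.Linear(4, 5), torch.nn.Linear(4, 5)
next_states = torch.randn(8, 4)
y = double_dqn_targets(q1, q2, torch.zeros(8), next_states, torch.zeros(8))
```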

      Medical Image Processing

    • Ding Yi,Zheng Wei,Geng Ji,Qiu Luyi,Qin Zhiguang
      Vol. 28, Issue 7, Pages: 2182-2194(2023) DOI: 10.11834/jig.211197
      Multi-level parallel neural networks based multimodal human brain tumor image segmentation framework
      摘要:Objective: Clinical magnetic resonance imaging (MRI) is the preferred tool for analyzing human brain tumors, using MRI brain images in the Flair, T1, T1c, and T2 modalities. Glioma is the most common type of brain tumor in adults, and because of its anatomical structure such lesions can be observed visually in MRI images. Differences in access to medical treatment and the scarcity of medical expertise call for computer-assisted diagnosis and treatment, and deep learning has demonstrated great potential for brain tumor image segmentation. To further improve segmentation accuracy, the current literature focuses on strengthening network feature extraction through specially designed structures and on exploiting information in brain tumor images such as multi-resolution information, spatial multi-view information, post-processing, and symmetry. Various deep neural network (DNN) models, such as the Visual Geometry Group network (VGGNet), GoogLeNet, ResNet, and DenseNet, have been developed for computer vision in recent years and support deep learning based brain tumor diagnosis. To extract feature information in brain tumor images more effectively, we propose a multimodal brain tumor image segmentation framework based on multi-level parallel neural networks. Method: To enhance feature extraction and expression, the framework extends an existing network backbone and extracts and adaptively fuses brain tumor features through a multi-level parallel feature extraction module and a parallel up-sampling module. Deeper features are extracted in the depth of the network, and multiple backbone branches are iterated in parallel for feature extraction. The layer-by-level connection not only broadens the width of the neural network but also exploits its depth. As a result, the multi-level parallel feature extraction structure has stronger and richer nonlinear representation capability than a single-level structure and can fit more complex mapping transformations and image features. The hierarchical parallel structure also provides enough network width to extract various attributes of the images, such as colors, shapes, spatial relationships, and textures, preserving the richness of the features. Furthermore, inspired by the long connections of U-Net, a multi-level pyramid long-connection module is added to the network to integrate input features of different sizes and to improve the transmission efficiency of feature information, which enhances feature richness. The input end of the module fuses information between layers of different sizes, alleviating the loss and deformation of image information and improving the propagation efficiency of same-size features at both ends of a long connection, which ultimately benefits the segmentation accuracy of multimodal brain tumor images. Result: The overall performance is first evaluated on the test set of the public brain tumor dataset BraTS2015. The average Dice scores of the proposed algorithm in the whole tumor, tumor core, and enhancing tumor regions reach 84%, 70%, and 60%, respectively, with a segmentation time of less than 5 s. Comparative experiments on the feature extraction, up-sampling, and pyramid long-connection modules verify the effectiveness of each module against the backbone method. On the BraTS2018 validation set, the proposed algorithm achieves average Dice scores of 87%, 76%, and 71%, which are 8.0%, 7.0%, and 6.0% higher than the backbone method. Conclusion: We extend a common network backbone and propose a multimodal brain tumor image segmentation framework based on a multi-level parallel neural network with multi-level parallel expansion. The hierarchical pyramid long-connection module remedies the unclear multi-scale and receptive-field information of the original long-connection design and improves feature richness, and the framework is shown to improve both segmentation accuracy and efficiency.
      关键词:multimodal brain tumor image;multi-level parallelism;deep neural network(DNN);feature fusion;semantic segmentation   
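      The Dice scores reported above are computed per tumor region (whole tumor, tumor core, enhancing tumor). A minimal sketch of the Dice similarity coefficient for binary masks follows; the function name, epsilon, and toy masks are illustrative, not taken from the paper.

```python
import torch

def dice_score(pred_mask, gt_mask, eps=1e-6):
    """Dice similarity coefficient between two binary masks of the same shape."""
    pred = pred_mask.float().flatten()
    gt = gt_mask.float().flatten()
    intersection = (pred * gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)

# Toy example: two overlapping square regions
a = torch.zeros(64, 64); a[16:48, 16:48] = 1
b = torch.zeros(64, 64); b[20:52, 20:52] = 1
print(float(dice_score(a, b)))
```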
    • Yu Dian,Peng Yanjun,Guo Yanfei
      Vol. 28, Issue 7, Pages: 2195-2207(2023) DOI: 10.11834/jig.220078
      Ultrasonic image segmentation of thyroid nodules-relevant multi-scale feature based h-shape network
      摘要:Objective: Early diagnosis of thyroid cancer requires accurately locating thyroid nodules in ultrasound images. Ultrasound imaging is effective for diagnosing thyroid disease and is comparatively low-cost and simple. The thyroid imaging reporting and data system (TI-RADS) is used to evaluate whether nodules are benign or malignant; the higher the level, the more suspicious the nodule. The first step of ultrasound evaluation is to segment the thyroid nodule. At present this is usually done manually, which is labor-intensive and depends heavily on experience. Computer-based medical image analysis can automate the segmentation of nodules in ultrasound images and improve the speed and accuracy of diagnosis. Deep learning has performed well on many visual recognition tasks in recent years, and compared with traditional contour-, shape-, and region-based methods it improves task accuracy; many models based on fully convolutional networks (FCN) and convolutional neural networks (CNN) target specific segmentation tasks. However, the speckle noise of ultrasound images and the uncertainty in the size, shape, and location of a patient's nodules greatly affect segmentation accuracy. Method: First, an h-shape network framework is proposed, consisting of one encoder and two decoders, so the overall shape resembles the letter “h”. Depthwise separable convolution is introduced to shrink the network: the second convolution of each layer is replaced by a depthwise separable convolution to reduce the number of model parameters. The encoder extracts image features, and an enhanced down-sampling module is constructed to alleviate the information loss caused by down-sampling; the module connects max pooling with average pooling and batch normalization with average pooling, enhancing the feature extraction capability of the network. The first decoder provides preliminary segmentation information, and the second decoder enhances the feature representation of the nodules; segmentation accuracy is improved by fusing the information learned by the first decoder. Finally, a fusion convolutional pyramid pooling module is designed, in which atrous spatial pyramid pooling and depthwise separable convolution are integrated to achieve multi-scale feature fusion while keeping the network compact and maintaining the generalization ability of the model. Each of the four decoder blocks of the second decoder passes through its own fusion convolutional pyramid pooling module, and the final prediction is generated after a concatenation operation. Three datasets are used to validate the model: an internal dataset of 3 622 ultrasound images, the digital database thyroid image (DDTI) dataset of 637 ultrasound images, and the TN3K public dataset of 3 493 ultrasound images. The internal dataset and TN3K are divided into training, validation, and test sets in a ratio of 8∶1∶1 to train the model. Because the DDTI dataset contains only a small number of thyroid nodule samples and partitioning it would be prone to over-fitting, the weights trained on the internal dataset are used to test the DDTI dataset directly. The experiments are built on the PyTorch framework and trained on an Nvidia RTX 2080 TI. Adam is used as the optimizer with an initial learning rate of 0.000 1; training runs for 200 epochs with the learning rate halved every 20 epochs, and the batch size is 8. To segment small nodules well while keeping the overall segmentation stable, DiceBCELoss, which combines the BCE loss with the Dice loss, is used as the loss function. The segmentation results are quantitatively evaluated with the Dice similarity coefficient (DSC), Hausdorff distance (HD), sensitivity (SEN), and specificity (SPE). Result: The proposed method is compared with AttentionUNet, marker-guided U-Net (MG-UNet), fully convolutional dense dilated net (FCdDN), DeepLab V3+, SegNet, and the context encoder network (CE-Net). On the internal dataset, the DSC, HD, SEN, and SPE of the proposed model reach 0.872 1, 0.935 6, 0.879 7, and 0.997 3, respectively. The DSC is 15.53% higher than the worst model and 1.2% higher than the second best; the HD is 2.583 6 lower than the worst model and 0.034 1 lower than the second best; the SEN and SPE are 0.32% and 1.17% higher than the second best model and 7.57% and 9.96% higher than the worst model. On the DDTI dataset, the DSC and SPE are 0.758 0 and 0.977 3, which are 9.83% and 15.25% higher than the worst model and 1.02% and 0.71% higher than the best of the compared models. On the TN3K dataset, a DSC of 0.781 5 and an HD of 4.472 6 are obtained, which are 1.27% higher and 0.634 5 lower than the second best model. A series of ablation experiments further demonstrates the effectiveness of the different components of the proposed method. Conclusion: The proposed network improves the segmentation accuracy of thyroid nodules while keeping the computational cost low and the generalization ability of the model strong.
      关键词:deep learning;thyroid nodule;ultrasound segmentation;h-network;enhanced down-sampling;multi-scale   
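      The loss described above sums a binary cross-entropy term and a soft Dice term. A minimal sketch of such a DiceBCELoss is given below; equal weighting of the two terms and the epsilon value are assumptions, and the exact formulation in the paper may differ.

```python
import torch
import torch.nn as nn

class DiceBCELoss(nn.Module):
    """Binary cross-entropy plus soft Dice loss for binary segmentation."""
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()
        self.eps = eps

    def forward(self, logits, targets):
        bce = self.bce(logits, targets)
        probs = torch.sigmoid(logits).flatten()
        t = targets.flatten()
        intersection = (probs * t).sum()
        dice = (2.0 * intersection + self.eps) / (probs.sum() + t.sum() + self.eps)
        return bce + (1.0 - dice)                 # equal weighting assumed

# Toy usage on a batch of 2 single-channel 64x64 predictions
loss = DiceBCELoss()(torch.randn(2, 1, 64, 64),
                     torch.randint(0, 2, (2, 1, 64, 64)).float())
```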

      Remote Sensing Image Processing

    • Wang Rongfang,Wang Liang,Li Chang,Huo Chunlei,Chen Jiawei
      Vol. 28, Issue 7, Pages: 2208-2220(2023) DOI: 10.11834/jig.211159
      IIQ-CNN-based cross-domain change detection of SAR images
      摘要:Objective: Synthetic aperture radar (SAR) is one of the most rapidly developing remote sensing (RS) imaging technologies. Compared with other RS imaging techniques, SAR offers high-resolution, all-weather, all-day observation that does not depend on atmospheric or sunlight conditions. SAR change detection compares multi-temporal SAR images of the same region to analyze land-cover changes and is widely applied in natural disaster analysis, agricultural monitoring, and public security surveillance. A conventional change detection pipeline consists of three steps: preprocessing the images, generating the difference image, and analyzing the difference map to obtain the final binary change map. Its accuracy therefore depends heavily on the preprocessing and on the quality of the difference image; information is lost while preprocessing and generating the difference image, especially in slightly changed regions, which makes subsequent detection of changes at those locations very difficult. Neural networks provide an effective feature learning method, non-local structures extract features that are highly robust to noise and exhibit invariance, and end-to-end deep neural network (DNN) structures can effectively reduce the dependence of change detection results on the difference image. In recent years, many variants of convolutional neural network (CNN) models have been developed, but increasingly complex networks require more computing resources. There are two general ways to reduce the complexity of a neural network. The first works on the structure of the model: it reduces model complexity and reallocates computing resources to new operators so that memory and compute are fully utilized. The second works on the storage of the network parameters: the learned parameters (weights and biases) and the activations are usually 32-bit floating-point values, and converting them into low-bit integer representations reduces model complexity, although doing so remains challenging. In addition, labeled data for SAR change detection is scarce, and collecting it is time-consuming and costly because expertise and prior knowledge are required to compare the corresponding optical images; nevertheless, labels produced by experts are more accurate than the pseudo-labels generated by unsupervised methods. We therefore focus on using existing labeled data for cross-domain change detection and develop an integer-inference-based quantization CNN (IIQ-CNN) for cross-domain change detection of SAR images. Method: The method addresses cross-domain change detection across scenes, where labeled source-domain data is used to detect changes in unlabeled target-domain data. Furthermore, to weaken the dependence of the detection results on the difference image and improve detection accuracy, a sample construction method is designed that combines the bi-temporal SAR images with their corresponding difference image. Finally, to reduce model complexity and accelerate inference, integer inference quantization is introduced to simulate and quantize the deep network model. Result: Comparative experiments are carried out on four real SAR image datasets. 1) Detection performance: compared with other CNN-based methods, the Kappa coefficient of IIQ-CNN improves by 4.23% to 9.07%. 2) Quantization performance: IIQ-CNN is quantized to 16, 8, and 4 bits; the detection results drop significantly only under 4-bit quantization, while with 16-bit and 8-bit quantization the model preserves its detection performance and inference is significantly accelerated. Conclusion: The influence of pseudo-label quality on change detection performance is shown to be manageable, and accelerating inference is compatible with maintaining detection accuracy, which can promote the application of change detection algorithms in embedded devices.
      关键词:synthetic aperture radar (SAR) image;change detection (CD);integer inference quantization;convolutional neural network (CNN);cross-domain detection   
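      The quantization idea above maps 32-bit floating-point parameters to low-bit integers. Below is a minimal sketch of generic uniform affine quantization and dequantization; IIQ-CNN's actual calibration and integer inference procedure are not reproduced, and the function names and bit width are assumptions for illustration.

```python
import numpy as np

def quantize_affine(weights, num_bits=8):
    """Uniform affine quantization of a float array to signed integers.

    Returns the integer array plus the (scale, zero_point) needed to map
    the integers back to approximate float values.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (qmax - qmin) or 1.0        # avoid zero scale
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64).astype(np.float32)
q, s, z = quantize_affine(w, num_bits=8)
print(np.abs(w - dequantize(q, s, z)).max())              # small quantization error
```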