
About the Journal
Journal of Image and Graphics (JIG) is a peer-reviewed monthly periodical. Since 1996, JIG has served as an open forum and platform presenting all key aspects, theoretical and practical, of broad interest in computer engineering, technology, and science in China. Its main areas include, but are not limited to, state-of-the-art techniques and high-level research in image analysis and recognition, image interpretation and computer visualization, computer graphics, virtual reality, system simulation, animation, and other topics that meet application requirements in urban planning, public security, network communication, national defense, aerospace, environmental change, medical diagnostics, remote sensing, surveying and mapping, and other fields.
Cover & Content
- "Journal of Image and Graphics" 2023 Volume28 Number 2 doi:10.11834/jig.23000002
Brain-inspired Vision
- Information of Special Column "Image Brain-inspired Vision" doi:10.11834/jig.2300002
- A literature review for neural networks-based encoding models of biological visual system Zheng Yajing, Yu Zhaofei, Huang Tiejun doi:10.11834/jig.220461
Abstract: The biological visual system, an important part of the brain's nervous system, has evolved over hundreds of millions of years; about 70% of the information humans obtain from the outside world comes from vision. Its sophisticated function arises from the visual pathways and the visual cortex and their underlying mechanisms. In perception and energy efficiency, biological vision still outperforms machine vision on tasks such as real-time sensor data processing, perception, and motion control. Learning effectively from the design of this ingenious biological system therefore remains a challenge on the way to more advanced machine vision paradigms, and research on biological visual systems is one of the key sources of inspiration for computer vision algorithms. The classical contribution of visual neuroscience to computer vision is the notion of hierarchical processing in the visual cortex, and artificial neural networks (ANNs) inherit this hierarchical structural design. The visual system consists mainly of the eyes (retina), the lateral geniculate nucleus, and the visual cortex (including the primary visual cortex and the striate cortex). The human visual cortex and its related areas account for about one third of the cerebral cortex; they extract, process, and integrate visual information and support higher brain functions such as learning, memory, decision-making, and emotion. For object recognition, for example, the human brain can identify thousands of objects effortlessly, whereas this remains a difficult problem for machines. In recent years, deep neural networks (DNNs) have become the dominant tool in computer vision and machine learning; their multi-layer structures are trained with back-propagation to fit increasingly complex functions. The biological visual system can itself be regarded as learning a mapping between external visual information and internal neural representations, and the multi-layer structure of neural networks is derived from it. DNNs are currently the most accurate models for learning the mapping between visual stimuli and neural responses: internal units of an ANN can learn representations resembling those of subunits of the visual system, and hierarchical DNNs can predict visual neural responses in areas such as V1, V2, and the inferior temporal (IT) cortex. The latest unsupervised learning methods have also been applied to modeling the visual cortex. Toward a new generation of artificial general intelligence (AGI), the development of ANNs and the exploration of brain structure and function can benefit each other. This review focuses on neural network-based encoding models of the visual system, from the retina and primary visual cortex to the higher visual cortex (e.g., V4 and the IT area). The literature reviewed covers 1) the concept and definition of visual system models, 2) neural network prediction models of the primary visual system, and 3) goal-driven encoding models of the higher visual cortex. The latest unsupervised learning studies are also reviewed and summarized, and technical challenges and future directions for neural network encoding models are discussed.
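As a rough, editorial illustration of the goal-driven encoding recipe this review surveys (not code from the reviewed papers), measured neural responses are often regressed onto activations of a pre-trained network; the synthetic data, the shapes, and the choice of ridge regression below are all assumptions:

```python
# Minimal sketch of a DNN-feature encoding model: predict neural responses
# from network activations via ridge regression. Data are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 1000 stimuli, 4096-d features (e.g., one CNN layer),
# and responses of 200 recorded units / voxels.
features = rng.standard_normal((1000, 4096))   # DNN activations per stimulus
responses = rng.standard_normal((1000, 200))   # measured neural responses

X_tr, X_te, y_tr, y_te = train_test_split(features, responses, test_size=0.2, random_state=0)

# One linear read-out per unit; the ridge penalty guards against overfitting
# the high-dimensional feature space.
model = Ridge(alpha=1.0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Encoding performance is typically reported as the per-unit correlation
# between predicted and measured responses on held-out stimuli.
corr = [np.corrcoef(pred[:, i], y_te[:, i])[0, 1] for i in range(pred.shape[1])]
print("median held-out correlation:", np.median(corr))
```

In practice the features would come from a specific layer of a pre-trained CNN, and the regularization strength would be tuned per unit by cross-validation.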
- A review of detailed network simulation methods Zhang Yichen, Huang Tiejun doi:10.11834/jig.220266
Abstract:Neurons in brain have complicated morphologies. Those tree-like components are called dendrites. Dendrites receive spikes from connected neurons and integrate all signals-received. Many experiments show that dendrites contain multiple types of ion channels, which can induce high nonlinearity in signal integration. The high nonlinearity makes dendrites become fundamental units in neuronal signal processing. So, understanding the mechanisms and function of dendrites in neurons and neural circuits becomes one core question in neuroscience. However, because of the highly complicated biophysical properties and limited experimental techniques, it's hard to get further insights about dendritic mechanisms and functions in neural circuits. Biophysically detailed multi-compartmental models are typical models for modelling all biophysical details of neurons, including 1) dynamics of dendrites, 2) ion channels, and 3) synapses. Detailed neuron models can be used to simulate the signal integration. Detailed network models can simulate biophysical mechanisms and network functions both, helping scientists explore the mechanisms behind different phenomena. However, detailed multi-compartmental neuron models has high computational complexity in simulation. When we simulate detailed networks, the computational complexity highly burdens current simulators. How to accelerate the simulation of detailed neural networks has been a challenging research topic for both neuroscience and computer science community. During last decades, lots of works try to use parallel computing techniques to achieve higher simulation efficiency. In this study, we review these high performance methods for detailed network simulation. First, we introduce typical detailed neuron simulators and their kernel simulation methods. Then we review those parallel methods that are used to accelerate detailed simulation. We classify these methods into three categories:1) network-level parallel methods; 2) cellular-level parallel methods; and 3) GPU(graphics processing unit)-based parallel methods. Network-level parallel methods parallelize the computation of different neurons in network simulation. The computation inside each neuron is independent from other neurons, so different neurons can be parallelized. Before simulation, network-level methods assign the whole network to multiple processes or threads, and each process or thread simulate a group of neurons. With network-level parallel methods, scientists can use modern multi-core CPUs or supercomputers to simulate detailed network models. Cellular-level parallel methods further parallelize the computation inside each neuron. Before simulation, cell-level parallel methods first split each neuron into several subblocks. The computation of all subblocks is parallelized. With cellular-level parallel methods, scientists can make full use of the parallel capability of supercomputers to further boost simulation efficiency. In recent studies, more works start to use GPU in detailed-network simulation. The strong parallel power of GPU enables efficient simulation of detailed networks, and makes GPU-based parallel methods more efficient than CPU-based parallel methods. GPU-based parallel methods can also be categorized into network-level and cellular-level methods. GPU-based network-level methods compute each neuron with one GPU thread, while GPU-based cellular-level methods compute single neuron with multiple GPU threads. 
In summary, we review and analyze recent detailed network simulation methods and classify all methods into three categories as mentioned above. We further summarize the strength and weakness of these methods, and propose our opinion about future works on detailed network simulation.
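As a toy, hedged illustration of the network-level parallelism described above (not any of the reviewed simulators), neurons can be partitioned across worker processes, each integrating its own subset; the trivial leaky integrator and the omission of inter-process synaptic exchange, which real simulators must handle at every step, are simplifying assumptions:

```python
# Toy network-level parallelism: neurons are partitioned across worker processes
# and each worker integrates its own block. The "neuron" here is a trivial
# leaky integrator, not a detailed multi-compartmental model.
import numpy as np
from multiprocessing import Pool

def simulate_group(args):
    """Integrate a group of simple leaky neurons over all time steps."""
    v, inputs, dt, tau = args
    for t in range(inputs.shape[0]):
        v += dt * (-v / tau + inputs[t])
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_neurons, n_steps = 1000, 200
    v0 = np.zeros(n_neurons)
    drive = rng.normal(0.0, 1.0, (n_steps, n_neurons))

    # Network-level partition: each process owns a contiguous block of neurons.
    n_workers = 4
    blocks = np.array_split(np.arange(n_neurons), n_workers)
    tasks = [(v0[idx].copy(), drive[:, idx], 0.1, 10.0) for idx in blocks]

    with Pool(n_workers) as pool:
        results = pool.map(simulate_group, tasks)

    v_final = np.concatenate(results)
    print(v_final.shape, v_final.mean())
```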
- Review of visual neural encoding and decoding methods in fMRI Du Changde, Zhou Qiongyi, Liu Che, He Huiguang doi:10.11834/jig.220525
Abstract:The relationship between human visual experience and evoked neural activity is central to the field of computational neuroscience. The purpose of visual neural encoding and decoding is to study the relationship between visual stimuli and the evoked neural activity by using neuroimaging data such as functional magnetic resonance imaging (fMRI). Neural encoding researches attempt to predict the brain activity according to the presented external stimuli, which contributes to the development of brain science and brain-like artificial intelligence. Neural decoding researches attempt to predict the information about external stimuli by analyzing the observed brain activities, which can interpret the state of human visual perception and promote the development of brain computer interface (BCI). Therefore, fMRI based visual neural encoding and decoding researches have important scientific significance and engineering value. Typically, the encoding models are based on the specific computations that are thought to underlie the observed brain responses for specific visual stimuli. Early studies of visual neural encoding relied heavily on Gabor wavelet features because these features are very good at modeling brain responses in the primary visual cortex. Recently, given the success of deep neural networks (DNNs) in classifying objects in natural images, the representations within these networks have been used to build encoding models of cortical responses to complex visual stimuli. Most of the existing decoding studies are based on multi-voxel pattern analysis (MVPA) method, but brain connectivity pattern is also a key feature of the brain state and can be used for brain decoding. Although recent studies have demonstrated the feasibility of decoding the identity of binary contrast patterns, handwritten characters, human facial images, natural picture/video stimuli and dreams from the corresponding brain activation patterns, the accurate reconstruction of the visual stimuli from fMRI still lacks adequate examination and requires plenty of efforts to improve. On the basis of summarizing the key technologies and research progress of fMRI based visual neural encoding and decoding, this paper further analyzes the limitations of existing visual neural encoding and decoding methods. In terms of visual neural encoding, the development process of population receptive field (pRF) estimation method is introduced in detail. In terms of visual neural decoding, it is divided into semantic classification, image identification and image reconstruction according to task types, and the representative research work of each part and the methods used are described in detail. From the perspective of machine learning, semantic classification is a single label or multi-label classification problem. Simple visual stimuli only contain a single object, while natural visual stimuli often contain multiple semantic labels. For example, an image may contain flowers, water, trees, cars, etc. Predicting one or more semantic labels of the visual stimulus from the brain signal is called semantic decoding. Image retrieval based on brain signal is also a common visual decoding task where the model is created to "decode" neural activity by retrieving a picture of what a person has just seen or imagined. 
In particular, the reconstruction techniques of simple image, face image and complex natural image based on deep generative models (including variational auto-encoders (VAEs) and generative adversarial networks (GANs)) are introduced in the part of image reconstruction. Secondly, 10 open source datasets commonly used in this field were statistically sorted out, and the sample size, number of subjects, types of stimuli, research purposes and download links of the datasets were summarized in detail. These datasets have made important contributions to the development of this field. Finally, we introduce the commonly used measurement metrics of visual neural encoding and decoding model in detail, analyze the shortcomings of current visual neural encoding and decoding methods, propose feasible suggestions for improvement, and show the future development directions. Specifically, for neural encoding, the existing methods still have the following shortcomings:1) the computational models are mostly based on the existing neural network architecture, which cannot reflect the real biological visual information flow; 2) due to the selective attention of each person in the visual perception and the inevitable noise in the fMRI data collection, individual differences are significant; 3) the sample size of the existing fMRI data set is insufficient; 4) most researchers construct feature spaces of neural encoding models based on fixed types of pre-trained neural networks (such as AlexNet), causing problems such as insufficient diversity of visual features. On the other hand, although the existing visual neural decoding methods perform well in the semantic classification and image identification tasks, it is still very difficult to establish an accurate mapping between visual stimuli and visual neural signals, and the results of image reconstruction are often blurry and lack of clear semantics. Moreover, most of the existing visual neural decoding methods are based on linear transformation or deep network transformation of visual images, lacking exploration of new visual features. Factors that hinder researchers from effectively decoding visual information and reconstructing images or videos mainly include high dimension of fMRI data, small sample size and serious noise. In the future, more advanced artificial intelligence technology should be used to develop more effective methods of neural encoding and decoding, and try to translate brain signals into images, video, voice, text and other multimedia content, so as to achieve more BCI applications. The significant research directions include 1) multi-modal neural encoding and decoding based on the union of image and text; 2) brain-guided computer vision model training and enhancement; 3) visual neural encoding and decoding based on the high efficient features of large-scale pre-trained models. In addition, since brain signals are characterized by complexity, high dimension, large individual diversity, high dynamic nature and small sample size, future research needs to combine computational neuroscience and artificial intelligence theories to develop visual neural encoding and decoding methods with higher robustness, adaptability and interpretability.
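As a minimal, illustrative sketch of the MVPA-style decoding mentioned in this abstract (synthetic data, not any of the cited studies), a linear classifier can be trained on multi-voxel patterns and evaluated by cross-validation:

```python
# Minimal MVPA-style decoding sketch: classify the stimulus category from
# multi-voxel fMRI patterns with a linear classifier. Data are synthetic
# stand-ins; real studies cross-validate across runs or subjects.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_trials, n_voxels = 240, 500
labels = rng.integers(0, 2, n_trials)              # two stimulus categories
patterns = rng.standard_normal((n_trials, n_voxels))
patterns[labels == 1, :20] += 0.5                  # weak category signal in 20 voxels

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, patterns, labels, cv=5)
print("decoding accuracy per fold:", np.round(scores, 3))
```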
- A summary of image recognition-relevant multi-layer spiking neural networks learning algorithms Li Yaxin, Shen Jiangrong, Xu Qi doi:10.11834/jig.220452
Abstract: Spiking neural networks (SNNs), systematically described by Wolfgang Maass and known as the third generation of artificial neural networks, are a further step toward understanding the structure of the human brain. The human brain contains hundreds of millions of neurons and synapses, yet its energy demand is remarkably small. Compared with the first and second generations of artificial neural networks (ANNs), SNNs offer better biological interpretability and lower power consumption: their neurons simulate the internal dynamics of biological neurons, and their weight updates mimic the construction, enhancement, and inhibition of biological synapses. Commonly used spiking neuron models include the Hodgkin-Huxley (HH) model, the leaky integrate-and-fire (LIF) model, and the spike response model (SRM). Differences in ion concentration inside and outside a biological neuron change the membrane potential; when a neuron is stimulated, ions move through channels in the membrane and an action potential may be generated. A spiking neuron model is a mathematical description of this process: a neuron receives spike stimulation from neurons in the preceding layer and, in turn, fires spikes that form a spike train. SNNs transmit information as spike trains, which carry spatiotemporal information and simulate the propagation of signals in biological neural networks. However, the performance of SNNs on pattern recognition tasks is still limited by the immaturity of their deep learning methods. Artificial neurons output real-valued activations, which allows the parameters of deep networks to be trained by global back-propagation, whereas spike trains are binary and discrete, making direct training of SNNs difficult. This survey first reviews recent SNN learning algorithms to clarify the current state of the field and then analyzes the strengths and weaknesses of three main families of methods: 1) supervised learning, 2) unsupervised learning, and 3) ANN-to-SNN conversion. Unsupervised learning is mainly based on spike-timing-dependent plasticity (STDP), in which interconnected synapses are strengthened or weakened according to the relative firing times of pre- and postsynaptic neurons. Such methods have strong biological interpretability and adjust synaptic weights with local optimization, but they scale poorly to complicated, large-scale network structures. To draw on the computational convenience of ANNs, supervised approaches such as gradient-based training and ANN-to-SNN conversion have emerged. Gradient-based learning follows the idea of back-propagation (BP) and adjusts weights according to the error between the output and the target; the non-differentiable nature of discrete spikes is handled with techniques such as surrogate gradients. These methods aim to combine the training advantages of ANNs and SNNs, so that SNN training remains biologically interpretable while staying easy to compute. The ANN-to-SNN approach instead converts the weights of a trained ANN to an SNN, mapping continuous activation values to spike trains; the converted network is fine-tuned according to neuron dynamics to reduce the conversion loss. Because conversion avoids training the SNN directly, it allows SNNs to be applied to complex network structures. ANNs are widely used in image recognition for feature extraction, and SNNs, with their biological interpretability and low power consumption, can achieve strong performance on image recognition tasks as well. Finally, future directions for bio-inspired SNN learning methods are discussed in the context of current mainstream approaches.
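To make the surrogate-gradient idea discussed above concrete, the following minimal sketch (an illustration, not the surveyed algorithms) trains a LIF layer through a rectangular surrogate derivative; the threshold, decay, and window width are arbitrary choices:

```python
# Minimal LIF layer with a surrogate gradient for the spike non-linearity.
# Hyperparameters (threshold, decay, surrogate window) are illustrative.
import torch

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v, threshold=1.0):
        ctx.save_for_backward(v)
        ctx.threshold = threshold
        return (v >= threshold).float()          # binary spike

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate: pass gradients only near the threshold.
        window = (torch.abs(v - ctx.threshold) < 0.5).float()
        return grad_out * window, None

def lif_forward(x_seq, w, decay=0.9, threshold=1.0):
    """x_seq: (T, batch, in_dim) spike inputs; returns (T, batch, out_dim) spikes."""
    T, batch, _ = x_seq.shape
    v = torch.zeros(batch, w.shape[1])
    spikes = []
    for t in range(T):
        v = decay * v + x_seq[t] @ w             # leaky integration of input current
        s = SpikeFn.apply(v, threshold)          # fire where membrane crosses threshold
        v = v - s * threshold                    # soft reset after a spike
        spikes.append(s)
    return torch.stack(spikes)

# Tiny usage example with random spike trains.
w = torch.randn(16, 8) * 0.5
w.requires_grad_(True)
x = (torch.rand(20, 4, 16) < 0.2).float()
out = lif_forward(x, w)
loss = out.mean()
loss.backward()                                  # gradients flow through the surrogate
print(out.shape, w.grad.abs().mean().item())
```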
- Research on spiking neural networks for brain-inspired computing Huo Bingqiang, Gao Yanzhao, Qi Xiaofeng doi:10.11834/jig.220527
Abstract: Brain-inspired computing has become a focus of next-generation artificial intelligence (AI), and the spiking neural network (SNN) has the potential to simulate brain-like processing of complex information involved in learning, memory, reasoning, judgment, and decision-making. With high computational capability and low power consumption, an SNN can effectively simulate the way biological neurons transmit information. This survey is organized as follows. First, the brain-inspired, event-driven SNN is introduced as a new generation of artificial neural network consisting of an input layer, an encoding scheme, spiking neurons, synaptic weights, learning rules, and an output layer; it exhibits biological complexity and brain-like mechanisms in its handling of spatio-temporal information. Second, structural optimization of SNNs is reviewed from five aspects: 1) encoding techniques, 2) improvements to spiking neurons, 3) network topology, 4) training algorithms, and 5) coordinated algorithms. Encoding methods include group coding, burst coding, rate (frequency) coding, and temporal coding. Conventional spiking neuron models include 1) the Hodgkin-Huxley (H-H) model, 2) the integrate-and-fire (IF) model, 3) the leaky integrate-and-fire (LIF) model, 4) the spike response model (SRM), and 5) the Izhikevich model; improved neuron models are further summarized from eight aspects. For network topology, SNNs have three conventional structures, feed-forward, recurrent, and hybrid, and five popular SNN topology models are discussed on this basis. Training algorithms are summarized from four aspects: 1) back-propagation (BP), 2) spike-timing-dependent plasticity (STDP), 3) artificial-neural-network-to-spiking-neural-network (ANN-to-SNN) conversion, and 4) related learning algorithms. Third, the advantages and disadvantages of SNNs are analyzed for both supervised and unsupervised learning. In supervised learning, because spiking neurons are nonlinear and discontinuous, the error function between the actual and desired output spike trains does not satisfy the differentiability required by back-propagation, which prevents a differentiable transfer function and hinders error back-propagation. In unsupervised learning, Hebb's rule and the STDP algorithm are key elements of brain-inspired computing and have evolved into a variety of learning algorithms; this approach is more consistent with the way biological neurons transmit information, but it is difficult to scale to large deep network models. Finally, applications of SNNs to brain-inspired computing and bionic tasks are surveyed. In summary, this paper gives a systematic overview of SNNs in terms of 1) mechanisms, 2) encoding techniques, 3) network structures, and 4) training algorithms. Looking forward, we expect that the third generation of artificial intelligence can be realized through information-transmission and multi-functional brain simulation, together with ultra-large-scale neural network topologies and connection-strength-based information memory, thereby improving the effectiveness of brain-inspired computing models.
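As a toy illustration of the STDP rule referred to in both SNN surveys above (not code from either paper), a pair-based exponential weight update for a single synapse might be sketched as follows; the time constants and learning rates are arbitrary:

```python
# Pair-based STDP sketch: potentiate when the presynaptic spike precedes the
# postsynaptic spike, depress otherwise. Constants are illustrative only.
import numpy as np

def stdp_update(w, pre_spike_times, post_spike_times,
                a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0,
                w_min=0.0, w_max=1.0):
    """Return the synaptic weight after applying pair-based STDP (times in ms)."""
    dw = 0.0
    for t_pre in pre_spike_times:
        for t_post in post_spike_times:
            dt = t_post - t_pre
            if dt > 0:      # pre before post -> long-term potentiation
                dw += a_plus * np.exp(-dt / tau_plus)
            elif dt < 0:    # post before pre -> long-term depression
                dw -= a_minus * np.exp(dt / tau_minus)
    return float(np.clip(w + dw, w_min, w_max))

# Usage: a synapse whose presynaptic spikes tend to precede postsynaptic ones.
print(stdp_update(0.5, pre_spike_times=[10.0, 30.0], post_spike_times=[12.0, 33.0]))
```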
Image Analysis and Recognition
- Flame detection combined with receptive field and parallel RPN Bao Wenxia, Sun Qiang, Liang Dong, Hu Gensheng, Yang Xianjun doi:10.11834/jig.210772
Abstract: Objective Early flame detection is essential for rapid response and for minimizing casualties and damage. Smoke and flame alarms are commonly used indoors, but most traditional physical sensors must be close to the fire source and cannot meet the requirements of outdoor flame detection. Image-based real-time detection has therefore been developed on the basis of image processing and machine learning. However, flame shape, size, and color vary greatly, and natural environments contain many pseudo-fire objects whose colors closely resemble flames, so detection models that can distinguish real flames from pseudo flames precisely are needed. Existing methods fall into three categories: 1) traditional image processing, 2) machine learning, and 3) deep learning. Traditional image processing and machine learning usually rely on manually designed flame features, which are hard to quantify and match complex backgrounds poorly. Deep learning eases flame detection through self-learned features, but two problems remain: small flame regions (smaller than 32 × 32 pixels) lose information on deep feature maps, and objects whose color resembles flames cause misjudgments. To reduce the false alarm rate caused by pseudo-fire objects and the missed detection rate of small early flames, we develop a convolutional neural network (CNN) for flame detection, called R-PRPNet, which combines a receptive field (RF) module with a parallel region proposal network (PRPN). Method R-PRPNet consists of three parts: 1) a feature extraction module, 2) a parallel region proposal network, and 3) a classifier. The feature extraction module is built on the convolutional layers of the lightweight MobileNet, which makes the algorithm run faster without loss of flame detection performance. To extract more discriminative flame features and suppress the high false alarm rate caused by pseudo-fire objects, the RF module is embedded in this module to expand the receptive field and capture richer context. A multi-scale sampling layer is placed after the feature extraction module to connect the PRPN, which fuses features of flames at multiple scales during burning. Furthermore, 3 × 3 and 5 × 5 full convolutions widen the receptive field of the multi-scale anchor points, which improves the detection of multi-scale flames and addresses the missed detection of small flames in the early stage of a fire. The classifier performs classification and regression with softmax and smooth L1 losses and outputs the final flame category and position in the image. Result The method is tested on self-built datasets covering indoor, building, forest, and night-scene flames and pseudo-fire scenes such as lights, sunset glow, burning clouds, and sunshine. Faster R-CNN (region CNN) with MobileNet as the backbone is used as the benchmark. Compared with the benchmark, the network with the RF module learns more discriminative flame features, lowering the missed detection rate by 1.1% and the false alarm rate by 0.43%. Adding the PRPN on top of the RF module further improves the recognition of multi-scale flames: the recall rate increases by 1.7% and the missed detection rate decreases by 1.7%. A negative-sample fine-tuning strategy enriches pseudo-fire features and improves the network's ability to separate real flames from pseudo-fire objects; combined with the two components above, it reduces the false alarm rate by 21%. Comparative analysis is carried out against three kinds of detectors: 1) compared with traditional flame detection methods based on edge gradient information and clustering, R-PRPNet is better on all indexes; 2) compared with classical object detection algorithms, performance is also improved; 3) compared with YOLOX-L, the false alarm rate and missed detection rate are reduced by 22.2% and 5.2%, respectively. The final results reach 98.07% accuracy, a 4.2% false alarm rate, and a 1.4% missed detection rate. Conclusion We design a CNN for flame detection that combines a receptive field module with a parallel RPN. The RF module is embedded in the feature extraction module to expand the receptive field and extract more discriminative, context-aware flame features, which are fused through splicing, down-sampling, and element-wise addition. Experiments comparing the proposed network with classic convolutional neural networks and traditional methods show that it automatically extracts complex flame features in multiple scenarios and filters out pseudo-fire objects accurately.
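The RF module itself is not released with this abstract; as a generic, hedged sketch of the multi-branch dilated-convolution idea it builds on (branch widths and dilation rates are assumptions), one could write:

```python
# Generic receptive-field-style block: parallel branches with different
# dilation rates, concatenated and fused by a 1x1 convolution.
# Channel counts and dilation rates are illustrative assumptions.
import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        branch_ch = out_ch // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(branch_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees the same input with a different effective receptive field.
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

# Usage on a dummy feature map.
block = MultiDilationBlock(64, 96)
print(block(torch.randn(1, 64, 40, 40)).shape)   # -> torch.Size([1, 96, 40, 40])
```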
- Region enhancement and multi-feature fusion for contraband recognition in X-ray images Han Ping, Yang Hui, Fang Cheng doi:10.11834/jig.211134
Abstract: Objective X-ray security screening is widely used in public transportation infrastructure, where real-time security images are generated by X-ray scanning for inspection. Because manual inspection carries hidden risks, intelligent recognition of prohibited items in X-ray security images is needed. Method Convolutional neural network (CNN) based techniques have developed rapidly in computer vision, but a CNN-based recognition model requires a large amount of manually labeled X-ray images for training, and existing models work well only when the training and testing sets share the same data distribution. When the color distribution of X-ray images in the testing set differs from that of the training set, targets become difficult to identify, yet such distribution shifts are common in practical application scenarios. Dual-energy X-ray imaging allows the scanner to render different colors according to an item's effective atomic number, so the heterogeneity of X-ray images manifests mainly in color distribution, and the performance of intelligent recognition algorithms drops sharply when the training and testing distributions are inconsistent. We therefore develop a multi-feature fusion model that alleviates this heterogeneity. First, to reduce the influence of varying color distributions of prohibited items, an attention mechanism is adopted to extract a new pixel-level feature, called the region-enhanced feature, which is trained with respect to the overall feature distribution and improves generalization to X-ray images with different color distributions. Then, a multi-feature fusion strategy enriches the feature information, such as color, shape, and outline: the color, edge, and region-enhanced features are fused in a centralized manner, with balancing weight parameters assigned to the three kinds of features. Multi-feature fusion provides more effective feature information and better robustness when objects in an image are cluttered. Finally, a ternary loss function is formulated over the fusion, edge, and region-enhancement branches, and the weights of the three losses are set to balance the weighted feature parameters and improve the fusion. Result Experiments are conducted on the public SIXray dataset to evaluate both overall performance and generalization (i.e., performance on test samples with the same and with different color distributions). The mean average precision (mAP) of our method is 4.09% and 2.26% higher than that of ResNet18 and ResNet34, respectively. For single-class prohibited items, the average precision reaches 94.25% for guns and 90.89% for pliers. In the generalization test on the SIXray_last101 subset, our method identifies 26 samples, 4.3 times more than the benchmark, demonstrating improved effectiveness on samples with different color distributions. Ablation experiments further verify the effects of the individual features and the hyper-parameter settings: the edge features and region-enhanced features improve overall recognition performance by 1.32 and 1.05 percentage points, respectively. Conclusion A region-enhanced multi-feature fusion method is developed to handle the rich colors, shifted distributions, and cluttered, complex objects of X-ray security images. Region-enhanced features are obtained from the overall feature distribution, a multi-feature fusion strategy integrates color, shape, and contour details, and a ternary loss function improves the fusion and mitigates heterogeneity. The analyses demonstrate improved performance for prohibited item inspection and verify the effectiveness and robustness of the proposed method. The multi-branch structure still needs further development because of its computational cost and limited recognition efficiency.
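The abstract does not give the exact form of its ternary loss; purely as an illustration of weighting three branch losses (the branch names, the multi-label formulation, and the weights are hypothetical), a sketch might look like:

```python
# Illustrative weighted combination of three branch losses, in the spirit of the
# ternary loss described above. Branch names and weights are assumptions.
import torch
import torch.nn.functional as F

def ternary_loss(fusion_logits, edge_logits, region_logits, targets,
                 w_fusion=1.0, w_edge=0.5, w_region=0.5):
    """Multi-label classification loss summed over three feature branches."""
    loss_fusion = F.binary_cross_entropy_with_logits(fusion_logits, targets)
    loss_edge = F.binary_cross_entropy_with_logits(edge_logits, targets)
    loss_region = F.binary_cross_entropy_with_logits(region_logits, targets)
    return w_fusion * loss_fusion + w_edge * loss_edge + w_region * loss_region

# Usage with dummy logits for a batch of 8 images and 5 prohibited-item classes.
targets = torch.randint(0, 2, (8, 5)).float()
print(ternary_loss(torch.randn(8, 5), torch.randn(8, 5), torch.randn(8, 5), targets))
```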
- An adaptive occlusion-aware multiple targets tracking algorithm for low viewpoint Yue Yingying, Xu Dan, He Kangjian, Zhang Hao doi:10.11834/jig.210853
Abstract:Objective Multi-target tracking technique is essential for the computer vision-relevant applications like video surveillance, smart cities, and intelligent public transportation. The task of multi-target tracking is required to better location for multiple targets of each frame through the context information of the video sequence. To generate the motion trajectory of each target, its identity information (ID) is required to keep in consistency. So, we focus on low viewpoint-based multi-target tracking with no high viewpoint involved. For low viewpoint tracking scenes, the occlusion can be as a key factor to optimize tracking performance. The occlusion-completed is restricted by the target-captured issues temporarily, which is challenged for target tracking. The partial-occluded target is still challenged to be captured because the visual information of the occluded target is contaminated and the extracted target features are incomplete, and it will cause tracking drift as well. Method To resolve occlusion problem, we develop a low viewpoint-based adaptive occlusion-relevant multiple targets tracking algorithm. The proposed algorithm is composed of three main aspects as following:1) An adaptive anti-occlusion feature is illustrated in terms of the occlusion degree of each frame. To enhance its adaptability for occlusion, global occlusion information is used to adjust feature-related structure dynamically. 2) When the occlusion occurs, the target will disappear temporarily. When it reappears again after occlusion, it is often transferred to a new target and the tracking ID switch occurs. Therefore, a cascade screening mechanism is melted into for new target problem-identified. Due to the intensive change of occlusion-based target features, high-level and low-level features are employed both to prevent the virtual phenomenon for new target. 3) A large amount of target-occluded noise will be introduced into the template library if they are updated into the template library with no clarification. Therefore, an adaptive anti-interference template update mechanism is proposed for that. Multiple weights are given to the target templates-profiled of different occlusion states based on the local occlusion information of all targets, and the weights-based adaptive template-updated is then performed, which can alleviate the interference of severe-occluded targets to the template library. Result Our algorithm is experimented on the low viewpoint tracking videos-selected of MOT16, which includes special tracking scenes like 1) partial occlusion, 2) short-term full occlusion, and 3) long-term full occlusion. The experimental results show that the tracking performance of our algorithm has been improved, achieve improvement of 3.67%, 1.57%, 2.77%, 5.71%, and 3.07% on MOTA (multiple object tracking accuracy) respectively than STAM (spatial-temporal attention mechanism), ATAF (aggregate tracklet appearance features), STRN (spatial-temporal relation network), BLSTM_MTP_O (bilinear long short-term memory with multi-track pooling) and IADMR (instance-aware tracker and dynamic model refreshment). Furthermore, the ablation experiment shows that our anti-occlusion feature proposed can achieve an increase of 1.9% compared to the hybrid feature, an improvement of 1.8% compared to the appearance feature, and an optimization of 13.6% compared to the motion feature on MOTA. 
Compared with the weighted update strategy, the adaptive anti-interference update strategy proposed has achieved an improvement of 10.7% on MOTA, and an improvement of 17.7% compared with the conventional update strategy. Moreover, compared with the weighted update strategy, the number of ID switching times is significantly reduced from 244 to 119, which shows that our anti-interference update strategy can optimize the cleanliness of the template library and the accuracy of data association. Additionally, to validate the effectiveness of the update strategy we proposed, more indicators are improved obviously, such as Rcll (recall), FN (false negatives), MT (mostly lost tracklets), ML (mostly lost tracklets), and Frag (fragments). Conclusion The low viewpoint-based adaptive occlusion-relevant multiple targets tracking algorithm can be used to enhance the perception and balancing capabilities of the features-used in data association, reduces the impact of severe-occluded target templates beyond template library-profiled on the multi-tracking performance. Limitation and recommendation our proposed algorithm have no motion and speed-related estimation-specific mechanism for the rigid motion of the camera. Our data association-based algorithm is still cohesive to target detection algorithm severely. Therefore, the trajectory has to be disturbed and crossed when the target is missed or falsely detected. The future work can be focused on improving the tracking adaptability to actual tracking scenarios and the immunity of detection errors further.
- Sparse constraint and spatial-temporal regularized correlation filter for UAV tracking Tian Haodong, Zhang Jinpu, Wang Yuehuan doi:10.11834/jig.210611
Abstract:Objective Correlation filter (CF)-based methods have demonstrated their potential in visual object tracking for unmanned aerial vehicle (UAV) applications. Current discriminative CF trackers can be used to learn a multifaceted feature filter in the sample region. However, more occlusion or deformation-derived features may distort the filter and degrade the discriminative ability of the model. To mitigate this problem, we develop a novel sparse constraint and spatio-temporal regularized correlation filter to ignore those distractive features adaptively. Method By imposing a spatial (bowl-shaped) elastic net constraint on the objective function of the correlation filter, our algorithm can restrict the sparsity of the filter values corresponding to the target region instead of the whole sample region and adaptively suppress the distorted features during tracking. In addition, a temporal regularization term in spatial-temporal regularized correlation filter (STRCF) is integrated to enhance the filter's ability to suppress distortion. Our research treats the object tracking task as a convex optimization problem and provides an efficient global optimization method through alternating direction method of multipliers (ADMM). First, the objective function is required to meet Eckstein-Bertsek condition. Thus, it can converge to the global optimal solution by an unconstrained augmented Lagrange multiplier formulation. Next, ADMM is used to transform the Lagrange multiplier formulation into two sub-problems with closed-form solution. To improve computational efficiency, we convert the sub-problems into the Fourier domain according to Parseval's theorem. Our algorithm can converge quickly within a few iterations. Result Several evaluation metrics like center location error and bounding box overlap ratio are used to test and compare the proposed method against other existing methods. The center location error measures the accuracy of the tracking algorithm's estimation for the target location. It computes the average Euclidean distance between the ground truth and the center location of the tracked target in all frames. The center location error can represent the location accuracy of the tracking algorithm. But, the sensitivity of the different size targets to the center location error is different because scale and aspect ratio are not taken into consideration. Another commonly used evaluation metric is the overlap rate, which is defined as the intersection over union between the target box prediction and the ground truth. We compare our approach with several state-of-the-art algorithms on well-known benchmarks, such as DTB70 (drone tracking benchmark), UAVDT (unmanned aerial vehicle benchmark:object detection and tracking) and UAV123_10 fps. The experiment results show that our model outperforms all other methods on DTB70 benchmark. The average accuracy rate and the average success rate are 0.707 and 0.477, which are 5.8% and 4% higher than STRCF. For UAVDT benchmark, the average accuracy rate and the average success rate are 0.72 and 0.494, respectively, which are 8.4% and 3.8% higher than STRCF. For UAV123_10 fps benchmark, the average accuracy rate and average success rate are 0.667 and 0.577, respectively, which are 5% and 3.3% higher than STRCF. Furthermore, an ablation experiment demonstrates that the proposed strategy improves the tracking speed by about 25% without affecting the tracking accuracy, and the running speed can reach 50 frame/s on a single CPU. 
Conclusion Compared with the current popular methods, the proposed sparse constraint and spatio-temporal regularized correlation filter achieves leading performance. Due to the introduction of sparse constraints and spatial-temporal regularization, our algorithm improves the tracking effect and has strong robustness in complex scenes such as occlusion and deformation.
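The abstract describes an elastic-net-constrained, spatio-temporally regularized correlation filter without reproducing its objective; a generic form consistent with the STRCF family (the symbols and the exact placement of the bowl-shaped mask are assumptions) is

$$
\min_{\mathbf{f}}\ \frac{1}{2}\Big\|\sum_{k=1}^{K}\mathbf{x}^{k}\ast\mathbf{f}^{k}-\mathbf{y}\Big\|_{2}^{2}
+\lambda_{1}\sum_{k=1}^{K}\big\|\mathbf{m}\odot\mathbf{f}^{k}\big\|_{1}
+\frac{\lambda_{2}}{2}\sum_{k=1}^{K}\big\|\mathbf{m}\odot\mathbf{f}^{k}\big\|_{2}^{2}
+\frac{\mu}{2}\big\|\mathbf{f}-\mathbf{f}_{t-1}\big\|_{2}^{2}
$$

where $\mathbf{x}^{k}$ and $\mathbf{f}^{k}$ are the $k$-th feature channel of the sample and filter, $\mathbf{y}$ is the desired Gaussian response, $\mathbf{m}$ is a bowl-shaped spatial mask restricting sparsity to the target region, the combined $\ell_1$ and $\ell_2$ terms form the elastic-net constraint, and the last term is the temporal regularization toward the previous filter $\mathbf{f}_{t-1}$; as the abstract states, such a problem can be solved with ADMM in the Fourier domain.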
- Global and spatial multi-scale contexts fusion for vehicle re-identification Wang Zhenxue, Xu Zheming, Xue Yangyang, Lang Congyan, Li Zun, Wei Lili doi:10.11834/jig.210849
Abstract: Objective Vehicle re-identification aims to identify images of the same vehicle captured by multiple cameras with non-overlapping views, and it has growing applications in computer vision, such as intelligent transportation systems and public traffic security. Early sensor-based methods use hardware detectors as the information source for re-identification, but they struggle to capture effective information about vehicle attributes such as color, length, and shape. Most appearance-based methods rely on hand-crafted features built from edges, colors, and corners, yet such cues are hard to match under camera view variation, low resolution, and occlusion of the captured vehicle images. With the rise of deep learning, vehicle re-identification methods have developed rapidly and can be divided into two categories: 1) feature learning and 2) metric learning. However, because most approaches in both categories rely on vehicle visual features from the originally captured views or on additional annotations such as vehicle attributes, spatio-temporal information, or vehicle orientation, they suffer from the loss of multi-scale contextual information and lack the ability to select discriminative features. We therefore develop a novel global and spatial multi-scale contexts fusion method for vehicle re-identification (GSMC). Method GSMC exploits global contextual information and multi-scale spatial information and is composed of two main modules: 1) a global contextual selection module and 2) a multi-scale spatial contextual selection module. A residual network is used as the backbone to extract the global feature as the original feature. The global contextual selection module divides the original feature map into several parts along the spatial dimension, applies a 1 × 1 convolution layer for dimension reduction, and uses a softmax layer to obtain the weight of each part, which represents its contribution to the re-identification task; the re-weighted feature is then fused with the original feature to extract more discriminative vehicle information. In addition, to obtain a more discriminative representation, the feature outputs of this module are divided into multiple horizontal local features that replace the global feature in classification learning, and adjacent local features overlap by a length of 1 to alleviate feature loss at region boundaries. The multi-scale spatial contextual selection module obtains multi-scale spatial features through different down-sampling rates and refines them to generate a foreground response map of the vehicle image, which enhances the ability of GSMC to perceive the vehicle's spatial location. An adaptively larger weight is assigned to the vehicle foreground and a smaller weight to the background, which selects more robust spatial context and suppresses background interference. Finally, the features from the two modules are fused as the final vehicle representation. To obtain a fine-grained feature space, GSMC is trained with the label-smoothed cross-entropy loss and the triplet loss, and a warm-up learning strategy is applied in the first 5 epochs to keep training stable and speed up convergence. Result We evaluate GSMC against state-of-the-art methods on two public benchmarks, VehicleID and VeRi-776 (vehicle re-identification-776), using mean average precision (mAP) and the cumulative matching characteristic (CMC), which represents the probability that the probe identity appears in the retrieved list. Compared with methods that use additional non-visual information and multi-view learning, GSMC surpasses PNVR (part-regularized near-duplicate vehicle re-identification) by a large margin: on VehicleID, rank-1 improves by 5.1%, 4.1%, and 0.8% and rank-5 by 4.4%, 5.7%, and 4.5% on the three test subsets of different sizes; on VeRi-776, GSMC gains 2.3% and 2.0% in mAP and rank-1, respectively. The accuracy at lower CMC ranks shows that our method promotes the ranking of roughly captured multi-view vehicle images. With re-ranking as a post-processing step on VeRi-776, the mAP, rank-1, and rank-5 scores improve further. Ablation experiments verify the necessity of each module, showing whether a single branch can extract discriminative features and confirming the effectiveness of fusing the two modules: adding the modules sequentially improves mAP, rank-1, and rank-5 by a large margin. The experimental results, attention heat map visualizations, and foreground response maps together indicate that the proposed modules are effective and can pull images of the same vehicle identity closer while pushing different vehicles apart. Conclusion To address vehicle re-identification, we develop a model built on a global contextual selection module and a multi-scale spatial contextual selection module, and extensive experiments on the two popular public datasets demonstrate its effectiveness.
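As a hedged sketch of the training objective mentioned above, a label-smoothed cross-entropy and a triplet loss on the embeddings can be combined as follows; the margin, smoothing factor, and batch-hard mining are assumptions rather than the authors' exact settings:

```python
# Sketch of a re-identification objective: label-smoothed cross-entropy plus a
# batch-hard triplet loss on the embeddings. Margin/smoothing are assumptions.
import torch
import torch.nn.functional as F

def label_smoothed_ce(logits, labels, eps=0.1):
    n_cls = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_p, eps / (n_cls - 1))
    smooth.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return -(smooth * log_p).sum(dim=1).mean()

def batch_hard_triplet(emb, labels, margin=0.3):
    dist = torch.cdist(emb, emb)                       # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same, inf, dist).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

# Usage with dummy embeddings/logits for a batch of 8 vehicle images, 4 identities.
emb = torch.randn(8, 256)
logits = torch.randn(8, 4)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = label_smoothed_ce(logits, labels) + batch_hard_triplet(emb, labels)
print(loss.item())
```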
Image Understanding and Computer Vision
- LiDAR point cloud semantic segmentation combined with sparse attention and instance enhancement Liu Sheng, Cao Yifeng, Huang Wenhao, Li Dingda doi:10.11834/jig.210787
Abstract: Objective Outdoor perception is essential for mobile robots and autonomous driving, and LiDAR point cloud semantic segmentation has been developed for this purpose. Three-dimensional (3D) LiDAR acquires range information quickly and accurately and is insensitive to illumination, so semantic segmentation of LiDAR point clouds can provide an overall understanding of scene elements such as roads, vehicles, pedestrians, and plants. Deep learning (DL) has advanced two-dimensional (2D) computer vision rapidly, but LiDAR point cloud data are unstructured, unordered, sparse, and of non-uniform density, unlike structured 2D images, so extracting semantic information from LiDAR data effectively remains challenging. DL-based methods can be divided into three categories: 1) point-based, 2) projection-based, and 3) voxel-based. Because of the unstructured nature of point clouds, many existing methods project the irregular points onto structured 2D images, but the resulting loss of geometric information prevents high-precision segmentation. In addition, uneven data distribution limits segmentation accuracy for object classes with few samples. To address these problems, we develop a LiDAR point cloud segmentation method based on sparse attention and instance enhancement, which effectively improves semantic segmentation accuracy. Method An end-to-end network based on sparse convolution is proposed. To counter the uneven class distribution of the training data, instance injection is used to augment the point clouds: instance points of classes such as pedestrians, vehicles, and bicycles are extracted and injected into appropriate positions of each frame during training. Recent visual semantic segmentation work emphasizes enlarging the receptive field and attention mechanisms, but an encoder-decoder network alone cannot achieve a sufficiently wide receptive field, so a lightweight Transformer module is introduced to widen it. The Transformer module builds interconnections between all non-empty voxels to capture global information and is placed in the bottleneck layer of the network to save memory. A spatial attention module based on sparse convolution is also proposed to highlight key positions of the feature map. Additionally, a new TVloss is adopted to emphasize semantic boundaries and suppress noise within each predicted region, sharpening the edges of the various point cloud objects. Result The model is evaluated on both the SemanticKITTI and nuScenes datasets. It achieves 64.6% mean intersection over union (mIoU) in the single-frame evaluation of SemanticKITTI and 75.6% mIoU on nuScenes. Ablation experiments show that instance injection improves mIoU by 1.2%; the sparse-convolution-based spatial attention module and the Transformer module contribute improvements of 1.0% and 0.7%, respectively, and together bring a total gain of 1.5%; TVloss adds a final gain of 0.2%. All modules combined improve the result by 3.1% over the benchmark. Conclusion A new end-to-end network based on sparse convolution is developed for LiDAR point cloud semantic segmentation. Instance injection addresses the unbalanced data distribution, the proposed Transformer module widens the receptive field, a sparse-convolution-based spatial attention mechanism extracts key locations of the feature map, and the added TVloss sharpens object edges in the point cloud. Comparative experiments against recent state-of-the-art (SOTA) methods, including projection-based and point-based methods, show that the proposed method improves the network's ability to segment point cloud details and is effective for point cloud segmentation.
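As a toy, hedged sketch of the instance-injection augmentation described above (the placement rule and class labels are simplifications, not the authors' implementation), rare-class instance points can be pasted into a scene like this:

```python
# Toy instance injection for point cloud augmentation: paste points of rare-class
# instances (e.g., pedestrians, bicycles) into a scene at a shifted position.
# The placement rule here is a simplification; no collision checks are done.
import numpy as np

def inject_instances(scene_pts, scene_labels, bank, rng, max_instances=3):
    """scene_pts: (N, 3) xyz; scene_labels: (N,); bank: list of (pts, label) instances."""
    pts_out, labels_out = [scene_pts], [scene_labels]
    for _ in range(rng.integers(1, max_instances + 1)):
        inst_pts, inst_label = bank[rng.integers(len(bank))]
        # Re-center the instance and drop it at a random ground-plane location.
        offset = np.array([rng.uniform(-20, 20), rng.uniform(-20, 20), 0.0])
        shifted = inst_pts - inst_pts.mean(axis=0) + offset
        pts_out.append(shifted)
        labels_out.append(np.full(len(shifted), inst_label, dtype=scene_labels.dtype))
    return np.concatenate(pts_out), np.concatenate(labels_out)

# Usage with random stand-ins for a scene and an instance bank.
rng = np.random.default_rng(0)
scene = rng.uniform(-50, 50, (1000, 3))
labels = np.zeros(1000, dtype=np.int64)
bank = [(rng.normal(0, 0.3, (80, 3)), 7), (rng.normal(0, 0.5, (120, 3)), 8)]
new_pts, new_labels = inject_instances(scene, labels, bank, rng)
print(new_pts.shape, np.unique(new_labels))
```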
- Multi-scale context information fusion for instance segmentation Wan Xinjun, Zhou Yiyun, Shen Mingfei, Zhou Tao, Hu Fuyuan doi:10.11834/jig.211090
Abstract:Objective Instance segmentation is one of the essential tasks in image and video scene understanding. Accurate instance segmentation is widely used in real-world applications such as autonomous driving, medical image analysis, and video surveillance. The task classifies and localizes multiple targets in an image and produces a pixel-level mask for each instance. However, the targets in an image often span multiple scales. For large targets, the receptive field may cover only a local region, which can lead to missed detections or incomplete and inaccurate masks. For small targets, the receptive field is easily dominated by background noise, so they tend to be misjudged as background and missed. Recognition and segmentation accuracy is also lower at target boundaries and under occlusion. Most existing instance segmentation methods improve the overall pipeline without a solution dedicated to multi-scale targets. To further improve segmentation accuracy, we develop an instance segmentation network based on mask region-based convolutional neural network (Mask R-CNN) with an improved feature pyramid network (FPN) and multi-scale context information. Method First, an attention-guided feature pyramid network (AgFPN) is proposed, which improves the fusion of adjacent FPN levels through an adaptive adjacent-layer feature fusion module (AFAFM). To learn multi-scale features effectively, AgFPN upsamples features with content-aware reconstruction and applies a channel attention mechanism to weight channels before adjacent levels are fused. Then, an attention feature fusion module (AFFM) and a global context module (GCM) are designed on the basis of multi-scale channel attention. By adding multi-scale context information to region of interest (RoI) features, they enhance the multi-scale feature representation of the mask prediction branch and of the classification and regression branch, which improves the quality of mask prediction for multi-scale objects. The overall pipeline is as follows. AgFPN first extracts multi-scale features. The region proposal network (RPN) then generates and filters candidate bounding boxes, while multi-scale context information is extracted from the AgFPN outputs by the AFFM and GCM. Next, the RoIAlign operation maps each RoI to a fixed-size feature map, which is fused with the multi-scale context information. Finally, bounding-box regression and mask prediction are performed on the fused features. The algorithm is implemented in the deep learning framework PyTorch on Ubuntu 16.04, and 4 NVIDIA 1080Ti graphics processing units (GPUs) are used for acceleration. ResNet-50/101 is used as the backbone network, and weights pre-trained on ImageNet initialize the network parameters. On the Microsoft common objects in context 2017 (MS COCO 2017) dataset, the model is optimized with stochastic gradient descent (SGD) for 160 000 iterations with an initial learning rate of 0.002 and a batch size of 4; the learning rate is divided by 10 at 130 000 and 150 000 iterations. On the Cityscapes dataset, the batch size is 4, the initial learning rate is 0.005, and training runs for 48 000 iterations, with the learning rate reduced to 0.000 5 at 36 000 iterations. The weight decay is set to 0.000 5 and the momentum to 0.9. The loss function and the related hyperparameters are initialized according to the strategy described above. Result The effectiveness of the method is evaluated with comprehensive experiments on MS COCO 2017 and Cityscapes. On the COCO dataset, the proposed algorithm improves on the Mask R-CNN baseline by 1.7% and 2.5% with ResNet-50 and ResNet-101 backbones, respectively. On Cityscapes, with ResNet-50 as the backbone, it is 2.1% and 2.3% higher than Mask R-CNN on the validation set and test set, respectively. Ablation studies show that AgFPN performs well on its own and is easy to integrate into multiple detectors. The attention feature fusion module and the global context module improve the average accuracy by 0.6% and 0.7%, respectively, and their combination improves the baseline by 1.7%. Visualization results show that the method localizes multi-scale targets more accurately and that segmentation is clearly improved in cases of mutual occlusion and at the boundaries of multiple targets. Conclusion The experimental results show that the proposed algorithm exploits the overall multi-scale context of targets and improves their feature representation, thereby increasing detection and segmentation accuracy for targets at different scales.
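To make the adjacent-level fusion idea concrete, the following PyTorch sketch gates two neighbouring FPN levels with squeeze-and-excitation style channel attention before upsampling and adding them. It is only a minimal illustration of channel-attention weighted adjacent-layer fusion; the module structure, the reduction ratio, and the nearest-neighbour upsampling are assumptions and do not reproduce the exact AFAFM design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gating used to weight channels
    before adjacent FPN levels are fused (illustrative, not the AFAFM internals)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))           # global average pool -> (N, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)  # reweight channels

class AdjacentLevelFusion(nn.Module):
    """Fuse a coarse FPN level into the next finer one: gate both inputs,
    upsample the coarse map, and add."""
    def __init__(self, channels=256):
        super().__init__()
        self.att_fine = ChannelAttention(channels)
        self.att_coarse = ChannelAttention(channels)

    def forward(self, fine, coarse):
        coarse_up = F.interpolate(self.att_coarse(coarse),
                                  size=fine.shape[-2:], mode="nearest")
        return self.att_fine(fine) + coarse_up

# toy usage: a coarse level (P4) fused into the next finer level (P3)
p3, p4 = torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32)
fused = AdjacentLevelFusion()(p3, p4)   # -> (1, 256, 64, 64)
```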
- Leading weight-driven re-position relation network for figure question answering Li Ying, Wu Qingfeng, Liu Jiatong, Zou Jialong doi:10.11834/jig.211026
18-02-2023
154
93
Abstract:Objective Figure question answering (Q&A) aims to learn representations of statistical charts in real scenes and to reason over them jointly with the question text in order to produce an answer; it is a widely studied multi-modal learning task. Existing methods generally fall into two categories. 1) End-to-end neural network frameworks model the task directly: a convolutional neural network processes the chart to obtain an image feature map, a recurrent neural network encodes the question text into a sentence-level embedding vector, and a fusion and inference model produces the answer. In recent years, attention mechanisms have been used to feed the image feature matrix into the text encoder so as to capture an overall representation of the fused multi-modal features. However, interactions among relation features in the multi-modal scene strongly hinder the extraction of effective semantic features. 2) Multi-module frameworks decompose the task into several steps: different modules first extract feature information, the outputs are passed to subsequent modules, and the final result is produced by the later modules. This type of method, however, relies on additional annotations to train the individual modules, and its complexity is considerably higher. We therefore develop a weight-guided re-position relation network model based on fused semantic feature extraction. Method The overall framework consists of three modules: image feature extraction, an attention-based long short-term memory (LSTM) encoder, and a joint weight-guided re-position relation network. 1) The image feature extraction module combines convolutional layers and up-sampling layers. To make the extracted image features better suited to the chart Q&A task, we fuse a convolutional neural network with a U-Net architecture so that both low-level and high-level semantic features of the image can be extracted. 2) The attention-based LSTM module produces the question-side reasoning features. An LSTM alone only propagates the influence of earlier words to later ones; the attention mechanism captures additional contextual information and yields a better sentence vector. 3) The joint weight-guided re-position relation network module introduces a pairwise matching mechanism that guides the matching of relation features in the relation network: the inner product between the feature vector of each pixel and the feature vectors of all pixels gives their similarities, and the result for a pixel is obtained by averaging over the whole group. Although the relation-feature matching sequence obtained in this way reduces the high complexity of the original pairwise pairing, it ignores the balance over all relations that the original pairing provides. A re-position operation is therefore applied to restore this balance: 1) the relation feature that pairs a pixel with itself is removed from the relation feature set; 2) the positions in each pixel's relation feature list are swapped iteratively according to a constant-one exchange rule; and 3) the location information of the pixels and the sentence-level embedding are added. In particular, each relation feature is composed of three parts: a) the feature vectors of the two pixels, b) the coordinate values of the two pixels, and c) the embedding of the question text. Result The model is compared with 6 recent methods on 2 datasets. 1) On the FigureQA dataset (an annotated figure dataset for visual reasoning), the overall accuracy is 26.4%, 8.1%, and 0.46% higher than IMG+QUES (image+questions), relation networks (RN), and ARN (appearance and relation network), respectively. 2) On a single validation set, the accuracy is 2.3% and 2.0% higher than LEAF-Net (locate, encode and attend for figure network) and FigureNet, respectively. 3) On the DVQA (understanding data visualization via question answering) dataset, the overall accuracy is 8.6%, 0.12%, and 2.13% higher than SANDY (SAN with dynamic encoding model), ARN, and RN, respectively. 4) On the Oracle version, the overall accuracy is 23.3%, 7.09%, and 4.8% higher than SANDY, LEAF-Net, and RN, respectively. Conclusion The proposed model outperforms the baseline models on the two large open-source statistical-chart Q&A datasets.
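The pairwise matching step builds on the classical relation network of Santoro et al. (2017), in which every pair of pixel features, tagged with coordinates and concatenated with the question embedding, is scored by a shared MLP and the scores are aggregated. The sketch below shows that generic pairing only; the re-position operation and the exact feature dimensions of the paper are not reproduced.

```python
import torch
import torch.nn as nn

class PairwiseRelation(nn.Module):
    """Relation-network style pairing: every ordered pair of pixel features,
    tagged with normalized coordinates and the question embedding, is scored
    by a shared MLP and the pair scores are averaged."""
    def __init__(self, feat_dim, q_dim, hidden=256):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * (feat_dim + 2) + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, feat_map, q_emb):
        n, c, h, w = feat_map.shape
        pix = feat_map.flatten(2).transpose(1, 2)                  # (N, HW, C)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([xs, ys], -1).reshape(1, h * w, 2).expand(n, -1, -1)
        obj = torch.cat([pix, coords.to(pix)], dim=-1)             # (N, HW, C+2)
        o_i = obj.unsqueeze(2).expand(-1, -1, h * w, -1)           # first of each pair
        o_j = obj.unsqueeze(1).expand(-1, h * w, -1, -1)           # second of each pair
        q = q_emb.view(n, 1, 1, -1).expand(-1, h * w, h * w, -1)   # question embedding
        rel = self.g(torch.cat([o_i, o_j, q], dim=-1))             # (N, HW, HW, hidden)
        return rel.mean(dim=(1, 2))                                # aggregate over pairs

# toy usage: an 8x8 feature map with 64 channels and a 128-d question embedding
rel_feat = PairwiseRelation(64, 128)(torch.randn(2, 64, 8, 8), torch.randn(2, 128))
```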
- Visual localization system of integrated active and passive perception for indoor scenes Xie Ting, Zhang Xiaojie, Ye Zhichao, Wang Zihao, Wang Zheng, Zhang Yong, Zhou Xiaowei, Ji Xiaopeng doi:10.11834/jig.210603
18-02-2023
174
140
Abstract:Objective Visual localization estimates the position and motion of objects from readily available RGB images. In traditional computer vision pipelines, the features extracted by hand-crafted algorithms often fail to meet the requirements of the task, whereas the feature abstraction and representation ability of deep learning has made pose estimation an active research topic in computer vision. The development of depth cameras and laser-based sensors provides additional options, but these sensors impose constraints on the shape and material of the object, usually require a structured environment, and multi-camera systems are often difficult to install and calibrate. In contrast, visual sensors are low-cost, impose few restrictions, and are easy to extend to a variety of unstructured scenarios. Indoor scenes, however, contain interferences such as object occlusion and weak-texture regions, which easily lead to incorrect keypoint estimates and severely degrade localization accuracy. Depending on how the cameras are deployed, visual object pose estimation methods can be divided into two categories. 1) Monocular object localization with cameras fixed in the scene: the target is detected in the images and its position is derived from the detections. The results are stable, but they are easily affected by lighting and image blur, and occlusion cannot be handled because the observation angle is fixed. 2) Scene reconstruction-based pose estimation with a camera mounted on the target itself: the pose is obtained by detecting feature points in the scene and matching them against a 3D scene model built in advance. This scheme depends on the texture of the scene: in scenes with rich texture and clear features it gives accurate results, whereas in texture-less or weak-texture regions such as walls the results are unstable and additional sensors such as an inertial measurement unit (IMU) are needed to aid positioning. To localize moving objects in indoor scenes more precisely, we propose a visual localization system that integrates active and passive perception and combines the advantages of the fixed and moving viewpoints. Method First, a plane-prior object pose estimation method is proposed. Built on a keypoint-detection monocular localization framework, it uses a plane constraint to optimize the 3-DoF (degree of freedom) pose of the object and improves localization stability under the fixed view. Second, a data fusion framework based on the unscented Kalman filter is designed: the passive perception output from the fixed view and the active perception output from the moving view are fused to improve the reliability of the pose estimate of the moving target. The integrated active and passive indoor visual positioning system consists of three modules: 1) a passive positioning module, 2) an active positioning module, and 3) an active-passive fusion module. The passive positioning module takes as input the RGB images captured by the fixed indoor camera and outputs the pose of the target contained in the image. The active positioning module takes as input the RGB images captured from the viewpoint of the target to be localized and outputs the position and pose of the target in the 3D scene. The active-passive fusion module integrates the results of the two modules and outputs a more accurate localization of the target in the indoor scene. Result The average localization error of the proposed system reaches 2-3 cm on the iGibson simulation dataset, and the proportion of localization errors within 10 cm reaches 99%. In real scenes, the average localization error reaches 3-4 cm, and the proportion of errors within 10 cm is above 90%. The experiments show that the proposed system achieves centimeter-level positioning accuracy. The real-scene results further show that the active-passive fusion system effectively reduces the external interference that affects the passive positioning algorithm under a fixed view, such as limited viewing angle and object occlusion, and mitigates the insufficient stability and large random error of single-frame positioning. Conclusion The proposed visual localization system integrates the advantages of passive and active methods and achieves high-precision positioning in indoor scenes at low cost, and it remains robust under complex interference such as occlusion and temporary loss of the target. We build an unscented Kalman filter based active-passive fusion indoor visual positioning framework for indoor mobile robot operation. Compared with existing visual positioning algorithms, it delivers high-precision target positioning in indoor scenes with lower equipment cost and maintains robust, centimeter-level performance under occlusion and other complex environmental factors. The performance is tested and validated in both simulation and a physical environment, and the results confirm the high accuracy and robustness of the positioning system in multiple scenarios.
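The unscented Kalman filter handles the nonlinear fusion in the paper; as a much simpler illustration of why fusing the two views helps, the snippet below combines a passive (fixed-camera) and an active (on-board camera) position estimate by weighting each with the inverse of its covariance. The covariances and the static, linear setting are assumptions for illustration only, not the system's actual filter.

```python
import numpy as np

def fuse_measurements(z_passive, R_passive, z_active, R_active):
    """Covariance-weighted fusion of two position estimates of the same target
    (a linear, static simplification of the unscented-Kalman fusion in the paper)."""
    P1, P2 = np.linalg.inv(R_passive), np.linalg.inv(R_active)
    P = np.linalg.inv(P1 + P2)                 # fused covariance
    z = P @ (P1 @ z_passive + P2 @ z_active)   # fused estimate
    return z, P

# toy usage: the fixed-camera estimate is noisier than the on-board one
z, P = fuse_measurements(np.array([1.00, 2.10]), np.diag([0.04, 0.04]),
                         np.array([1.05, 2.02]), np.diag([0.01, 0.01]))
```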
- Image retrieval based on transformer and asymmetric learning strategy He Chao, Wei Hongxi doi:10.11834/jig.210842
18-02-2023
217
114
Abstract:Objective Image retrieval is one of the essential tasks in computer vision. Most deep learning-based image retrieval methods adopt a symmetric learning strategy: training images are organized into pairs and fed into a convolutional neural network (CNN) for feature extraction, and a similarity loss is used to learn hash codes for the images. This symmetric scheme achieves reasonable performance. In recent years, to improve performance further, CNNs have been widened or deepened considerably, but such structures are complicated and time-consuming on large-scale image datasets. Recently, the Transformer has been introduced into computer vision and has substantially improved image classification. Because the Transformer is well suited to large-scale datasets such as ImageNet-21k and JFT-300M, we bring it into large-scale image retrieval. In symmetric methods, the entire image dataset has to take part in the training phase, and query images must also be placed into pairs for training, which is time-consuming. The hash function between training images and query images is learned from similarity computation, and the supervision comes only from the similarity matrix, which is insufficient. An asymmetric learning scheme instead trains on only a subset of the images: the hash function is learned from a hash loss, the feature representations of the remaining images can still be derived, classification constraints can be imposed on the query images, and the corresponding classification loss can be optimized by alternating learning. To address the problems of long training time and insufficient supervision, we develop a deep supervised hashing method for image retrieval that integrates a Transformer with an asymmetric learning strategy. Method For the training images, a designed Transformer generates their hash representations, and a hash loss drives these representations toward the true hash values. The original Transformer takes one-dimensional input, so each image is first divided into multiple blocks, each block is mapped to a one-dimensional vector, and the vectors of all blocks of one image are concatenated to form the input. The Transformer block is composed of 1) two normalization layers, 2) a multi-head attention module, 3) a fully connected module, and 4) a hash layer. The input vector first passes through a normalization layer, and the output is fed into the multi-head attention layer with 16 heads, which captures multiple local features of the image. A residual connection integrates the original input vector with the output of the multi-head attention layer so that the global features of the image are preserved. The representation vector of each image is finally obtained through the fully connected module and the hash layer, and this process is repeated 24 times. For the remaining images, the classification loss is used as a constraint to learn their hash representations in an asymmetric way, which uses the supervised information effectively and improves training efficiency; the query images do not need to participate in training. The model is trained with an alternating learning strategy: the hash representations of the query images and the classification weights are first initialized randomly, and the model parameters are then optimized by stochastic gradient descent. After each training epoch, the classification weights are updated by the trained model, and the hash representations of the remaining images improve gradually. In this way, the hash codes of the remaining images can be obtained directly from the well-trained model, which improves training efficiency. Finally, similar images are retrieved quickly by computing Hamming distances in the hash space. Result The method is compared with five symmetric methods and two asymmetric methods on two large-scale image retrieval datasets. Its mean average precision (mAP) increases by 5.06% and 4.17% on CIFAR-10 and NUS-WIDE, respectively. Ablation experiments verify that the classification loss pushes the images closer to their true hash representations, and the hyper-parameters of the classification loss are also tested to obtain appropriate values. Conclusion The Transformer-based method extracts image features effectively for large-scale image retrieval, and combining the hash loss with the classification loss further improves model training for the retrieval task.
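The retrieval side of such a pipeline can be sketched with two generic deep-hashing components: a hash layer that relaxes binary codes with tanh during training and binarizes with sign at inference, and Hamming-distance ranking over the database codes. The dimensions and the simple linear hash layer are placeholders, not the authors' 24-block Transformer.

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Generic deep-hashing head: tanh relaxation during training,
    sign binarization at inference."""
    def __init__(self, in_dim=768, bits=64):
        super().__init__()
        self.proj = nn.Linear(in_dim, bits)

    def forward(self, feats, binary=False):
        h = torch.tanh(self.proj(feats))
        return torch.sign(h) if binary else h

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (codes in {-1, +1})."""
    bits = db_codes.shape[1]
    dist = 0.5 * (bits - db_codes @ query_code)   # inner product -> Hamming distance
    return torch.argsort(dist)

# toy usage: 1 000 database codes, one query
head = HashHead()
db = head(torch.randn(1000, 768), binary=True)
q = head(torch.randn(1, 768), binary=True).squeeze(0)
order = hamming_rank(q, db)
```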
Computer Graphics
- Real-time indirect glossy reflection based on linearly transformed spherical distributions Xia Bo, Liu Yanli, Zeng Wei, Pu Yilei, Zhang Yanci doi:10.11834/jig.211082
18-02-2023
152
84
Abstract:Objective The indirect glossy reflection effect is one of the most commonly used lighting effects today. Its simulation is widely applied in games, movies, animation, virtual reality, and visual simulation; it enhances image quality and is an essential part of rendering in computer graphics. Current instant radiosity (IR) algorithms make real-time rendering of indirect glossy reflection possible. To compute the glossy reflection at the virtual point lights and at the shading points, a lighting model such as the Blinn-Phong model is usually adopted to evaluate the reflected radiance. In recent years, the IR-based GGX stochastic light culling (SLC) algorithm has computed indirect glossy reflection with the GGX bidirectional reflectance distribution function (BRDF) lighting model. However, the GGX BRDF is computationally expensive: it involves trigonometric functions and square roots, and its cost grows linearly with the number of virtual point lights. To obtain convincing indirect glossy reflection, a scene often contains hundreds of thousands or even millions of virtual point lights, which causes very high shading overhead in the GGX SLC algorithm and ultimately limits rendering performance. We therefore analyze in detail the overhead that the GGX SLC algorithm incurs by using the GGX BRDF when rendering indirect glossy reflection, and we improve this IR-based real-time algorithm. Method To lower the computational complexity of the GGX BRDF spherical distribution, we fit it with another spherical distribution by means of linearly transformed spherical distributions, thereby simplifying the GGX BRDF. This low-complexity spherical distribution is applied to point light sources, yielding a fast physically based lighting model for single and multiple point lights whose cost is lower than that of the GGX BRDF. The fast lighting model is used to compute the radiance at both the virtual point lights and the shading points. To further improve rendering efficiency, we also implement a texture sampling strategy for the lighting model. The resulting algorithm improves the rendering efficiency of the indirect glossy reflection effect without loss of rendering quality. Result Several experiments verify the proposed real-time indirect glossy reflection algorithm. First, to validate the lighting model, it is compared with the GGX BRDF lighting model; the root mean squared error (RMSE) of our algorithm is less than 0.002, indicating that the fast lighting model achieves rendering quality similar to the GGX BRDF. Next, rendering quality and efficiency are compared in different scenes. The results show that the rendering quality is very close to that of the GGX SLC algorithm (RMSE less than 0.006), while the rendering efficiency improves by up to 40% in the Sponza scene. Further experiments examine the effect of the number of virtual point lights on both algorithms: the efficiency advantage of our algorithm grows as the number of virtual point lights in the scene increases, with little loss in rendering quality compared with the GGX SLC algorithm; the RMSE is less than 0.003 while the efficiency improvement reaches 30% in the metal ring scene. Conclusion A real-time indirect glossy reflection algorithm based on linearly transformed spherical distributions is developed. A lighting model with lower computational complexity replaces the GGX BRDF lighting model and reduces rendering overhead, and an improved texture sampling scheme further raises rendering efficiency. The experimental results demonstrate that the algorithm improves the rendering efficiency of the indirect glossy reflection effect without sacrificing rendering quality.
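A linearly transformed spherical distribution is evaluated by pulling the query direction back through the inverse linear transform and multiplying the base lobe by the change-of-variables Jacobian, as in the linearly transformed cosines of Heitz et al. The sketch below shows that generic evaluation rule with a clamped-cosine base lobe; the particular base distribution, the fit to GGX, and the GPU implementation used in the paper are not reproduced here.

```python
import numpy as np

def clamped_cosine(w):
    """Base distribution D_o: clamped cosine lobe around +z."""
    return max(w[2], 0.0) / np.pi

def linearly_transformed_pdf(w, M):
    """Evaluate a linearly transformed spherical distribution at direction w:
    D(w) = D_o(M^-1 w / ||M^-1 w||) * |det M^-1| / ||M^-1 w||^3."""
    Minv = np.linalg.inv(M)
    wo = Minv @ w
    norm = np.linalg.norm(wo)
    jacobian = abs(np.linalg.det(Minv)) / norm**3   # change-of-variables factor
    return clamped_cosine(wo / norm) * jacobian

# toy usage: the identity transform recovers the cosine lobe itself
w = np.array([0.0, 0.0, 1.0])
print(linearly_transformed_pdf(w, np.eye(3)))   # == 1/pi
```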
- Circle average nonlinear subdivision curve design with normal constraints Liu Yan, Shou Huahao, Ji Kangsong doi:10.11834/jig.211072
18-02-2023
135
93
Abstract:Objective Subdivision techniques provide efficient, hierarchical, local, and adaptive algorithms for modeling, rendering, and manipulating free-form objects of arbitrary topology, beyond what non-uniform rational B-splines (NURBS) offer. Subdivision starts from a control polygon or mesh: new vertices are inserted and existing vertices may be updated, a new control polygon or mesh is obtained, and the process is repeated to produce the limit curve or surface. Depending on whether the new points are linear combinations of the old points at each iteration, subdivision schemes are divided into two categories: 1) linear and 2) nonlinear. Generally speaking, linear schemes are easier to implement, but their limit curves may exhibit inflection points and cannot represent an exact circle, whereas nonlinear schemes can eliminate inflection points and reproduce a circle exactly. Fitting smooth curves to point clouds is an important problem in computer-aided geometric design (CAGD) and computer graphics (CG). Measurement data of real objects can be acquired by techniques such as laser scanning, structured-light scanners, and X-ray tomography, and these discrete points are then used for data fitting so that the original model or product can be reconstructed. In many applications, such as the design of optical reflectors, the data points carry normal-vector information in addition to positions. To handle such data, we develop two circle average-based nonlinear subdivision schemes: a two-parameter 4-point binary interpolatory scheme and a single-parameter 3-point ternary interpolatory scheme. Method First, a binary nonlinear circle average of two points with normal vectors is introduced; it is called a circle average because the new point lies on the circle constructed from the two original points and their corresponding normal vectors. Second, the linear subdivision schemes are rewritten as repeated binary averages of points, and the linear average is replaced by the circle average. Third, the weighted geodesic average is used to compute the normal vector of each newly inserted vertex. Applying these operations to the linear two-parameter 4-point binary scheme and to the single-parameter 3-point ternary interpolatory scheme yields the two circle average-based nonlinear subdivision schemes. In the circle average-based two-parameter 4-point binary scheme, each subdivision step consists of a displacement step and a tension step; in the circle average-based single-parameter 3-point ternary scheme, each step is composed of left interpolation, interpolation, and right interpolation. The feasibility of the schemes is examined both theoretically and numerically: convergence and smoothness theorems for the two proposed schemes are proved by showing that the normal vectors have a contraction factor and that the data points are contractive and displacement-bounded during subdivision. Result The schemes are implemented in MATLAB, and the influence of the parameters on the two proposed schemes is studied. For the circle average-based two-parameter 4-point binary scheme, the smaller the tension parameter is, the closer the limit curve is to the initial control polygon, and the smaller the displacement parameter is, the closer the limit curve is to the initial control vertices; when the displacement parameter is zero, the scheme becomes interpolatory. For the circle average-based single-parameter 3-point ternary scheme, applying the circle average to the linear ternary interpolatory subdivision makes the limit curve approach the control vertices more quickly; moreover, for the same initial control vertices, changing the normal vector of one fixed control vertex produces different limit curves. The tests show that the parameters and the initial normal vectors can be used to control the shape of the limit curves effectively. Finally, the proposed nonlinear schemes are compared with the corresponding linear schemes. When the initial control vertices are sampled from a circle and the corresponding normal vectors point along the line through the circle center and each vertex, the proposed nonlinear schemes reconstruct the circle, whereas the corresponding linear schemes cannot. Furthermore, three curve models are selected to compare curve reconstruction by the different subdivision schemes: the initial control vertices and their normal vectors are sampled from the curves and subdivided 8 times; the limit curves of the proposed nonlinear schemes are much smoother, while those of the corresponding linear schemes show sharp points. Conclusion It is proved that the two proposed circle average-based nonlinear subdivision schemes are convergent and C1 continuous. The experiments indicate that they improve the modeling ability of the corresponding linear schemes and can reproduce circles, and that the choice of the normal vectors influences the shape of the limit curves to a certain extent.
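For reference, the linear scheme underlying the first construction is the classical 4-point interpolatory scheme with a tension parameter; the paper replaces its linear averages with circle averages of point/normal pairs. The sketch below shows only the linear baseline on a closed polygon, with w = 1/16 as the usual default.

```python
import numpy as np

def four_point_step(pts, w=1/16):
    """One round of the linear 4-point interpolatory scheme on a closed polygon:
    keep every old vertex and insert one new edge point per edge."""
    n = len(pts)
    new = []
    for i in range(n):
        p0, p1, p2, p3 = pts[(i - 1) % n], pts[i], pts[(i + 1) % n], pts[(i + 2) % n]
        new.append(p1)                                      # old vertex is retained
        new.append((0.5 + w) * (p1 + p2) - w * (p0 + p3))   # inserted edge point
    return np.array(new)

# toy usage: refine a square three times toward a smooth closed limit curve
poly = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
for _ in range(3):
    poly = four_point_step(poly)
```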
Medical Image Processing
- Performance evaluation of mainstream chromosome recognition algorithms under deep learning Yi Xusheng, Yin Aihua, Huang Jiesheng, Peng Jing, Chen Hanbiao, Guo Li, Lin Chengchuang, Li Shuangyin, Zhao Gansen doi:10.11834/jig.210669
18-02-2023
193
180
Abstract:Objective Deep learning-based medical image processing is essential for extracting clinical information for disease diagnosis, treatment, and surgical planning. Chromosome recognition is one such medical image processing task; it supports prenatal diagnosis by helping to gather and analyze clinical diagnostic information. In recent years, end-to-end deep learning techniques have developed rapidly, and chromosome recognition has benefited accordingly. Chromosomes are key carriers of genetic information, and the analysis of chromosome-based genetic information is widely used in the study of human genetic diseases. Karyotyping of chromosome images is a common method for diagnosing birth defects and is regarded as the "gold standard" for the clinical diagnosis of genetic diseases. Chromosome classification is a difficult step in karyotype analysis and has strong reference value for prenatal diagnosis. However, most existing chromosome recognition algorithms have been evaluated on heterogeneous data, so there is no common standard for screening algorithms for clinical application. We therefore construct a large-scale clinical chromosome database and carry out comparative experiments on a number of mainstream chromosome recognition models. Method The database is built from chromosome karyotypes provided by Guangdong Maternity and Child Health Hospital and consists of large-scale clinical chromosome data containing 126 453 chromosome samples. Publicly available mainstream chromosome recognition models are then selected, and experiments and performance evaluation are carried out on the clinical chromosome database. Result Stratified random sampling divides the clinical chromosome dataset into a training set (80%), a validation set (10%), and a test set (10%). All selected models are implemented with the PyTorch framework. The training process is summarized as follows. First, all models are initialized with weights pre-trained on the ImageNet classification task. Second, a single-stage one-cycle learning rate (1cycle LR) schedule is used to fine-tune each model on the clinical dataset. The batch size of all experiments is set to 32 (batch_size=32). The loss function is the cross-entropy loss smoothed with a label-smoothing factor of α=0.1. The learning rate is set to 1E-4, and the maximum number of training iterations is set to 500. An early stopping strategy terminates training if the validation loss does not decrease for five consecutive epochs, and the best weights are restored for the corresponding model. The large-scale clinical chromosome dataset proves useful for evaluating existing chromosome classification methods and improving their performance. The CirNet and MixNet models improve on the initial performance and classification results of the original ResNet networks: increasing the depth and width of the network and the number of parameters yields a better classification level, and the large amount of data alleviates over-fitting. The best classification accuracy reaches 98.92%, but a gap remains with respect to high-precision clinical application. Conclusion To bring deep learning-based chromosome classification to the precision required in clinical practice, more refined network structures are needed to cope with the high similarity among chromosome images, and the quality and quantity of chromosome data samples must also be guaranteed.
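A minimal sketch of the reported training recipe (pre-trained backbone, label-smoothed cross entropy, 1cycle learning rate schedule, early stopping after five stagnant epochs) is given below. The dummy tensors, the ResNet-50 choice, the SGD optimizer, the image size, and the 24 chromosome classes are assumptions for illustration; they are not the exact models or data pipeline evaluated in the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Dummy tensors stand in for the clinical chromosome crops; 24 classes is an assumption.
train = DataLoader(TensorDataset(torch.randn(64, 3, 64, 64),
                                 torch.randint(0, 24, (64,))), batch_size=32)
val = DataLoader(TensorDataset(torch.randn(32, 3, 64, 64),
                               torch.randint(0, 24, (32,))), batch_size=32)

model = models.resnet50(weights=None)                   # weights="IMAGENET1K_V1" to transfer from ImageNet
model.fc = nn.Linear(model.fc.in_features, 24)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)    # label smoothing alpha = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
max_epochs = 500                                        # reported training cap
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-4, total_steps=max_epochs * len(train))  # 1cycle LR schedule

best, stall = float("inf"), 0
for epoch in range(max_epochs):
    model.train()
    for x, y in train:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
        scheduler.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val) / len(val)
    if val_loss < best:
        best, stall = val_loss, 0
        torch.save(model.state_dict(), "best.pt")       # best weights restored afterwards
    else:
        stall += 1
        if stall >= 5:                                   # early stopping after 5 stagnant epochs
            break
```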
- Gated recurrent unit method for motor tasks recognition using brain fMRI Yuan Zhen, Hou Yuliang, Du Yuhui doi:10.11834/jig.210607
18-02-2023
166
110
Abstract:Objective In neuroscience, functional magnetic resonance imaging (fMRI) data have been used to explore functions of the human brain and to distinguish brain states under different motor tasks. However, previous studies on brain state classification with task fMRI did not make full use of the temporal characteristics of fMRI data. Here, we propose a method (named TC-GRU) that employs a gated recurrent unit (GRU) to capture fine-grained features from the time courses (TC) of whole-brain regions estimated from fMRI data for the classification of motor tasks. Method The fMRI data are gathered from 100 healthy subjects in the Human Connectome Project (HCP) under 5 body-motion tasks (left hand, right hand, left foot, right foot, and tongue movement) with 2 scanning runs, resulting in 1 000 samples for classifying the 5 motor tasks. First, for each sample, we compute the average fMRI TC of each brain region as its representative TC; the whole brain is divided into 360 regions according to the Glasser brain template. Then, using a 10-fold cross-validation framework (8:1:1 for the training set, the validation set, and the test set) with 100 repetitions, the TC-GRU model is trained and tuned on the training and validation sets, and the trained model is applied to the test set to examine its ability to classify the 5 motor tasks. In TC-GRU, the GRU extracts temporal features from the TCs of the brain regions, and a linear classifier performs the classification based on these features. Specifically, at each time point the inputs of the GRU are the TC amplitudes of the whole-brain regions at that moment together with the temporal features of past moments captured by the GRU; the GRU fuses these inputs and produces the temporal features at the current time. This process continues until the last time point, which yields the temporal features used for classification. We also compare TC-GRU with state-of-the-art methods: long short-term memory (LSTM), graph convolutional network (GCN), and multi-layer perceptron (MLP) models are used to classify the motor tasks based on the TCs of whole-brain regions as well as on brain functional connectivity measures estimated from the fMRI data. Furthermore, we examine the effects of prior feature selection versus no feature selection on the classification performance. A consistent 10-fold cross-validation framework is used for all methods, and the overall classification accuracy is summarized over the 100 cross-validation runs: the overall accuracy is the mean classification accuracy, and performance stability is reflected by the standard deviation of the classification accuracy. Result The TC-GRU method achieves the highest classification accuracy (94.51%±2.4%), followed by the LSTM using TC information (93.73%±2.67%). The MLP based on the TCs of whole-brain regions (92.75%±2.59% with prior feature selection and 92.04%±7.15% without) outperforms the GCN based on the TCs of whole-brain regions (87.14%±3.73%) and the MLP based on brain functional connectivity measures (72.47%±4.47% with prior feature selection and 61.49%±9.97% without). Conclusion To the best of our knowledge, this is the first study to distinguish different human brain motor tasks using a GRU on the time courses of whole-brain regions. Our results show that TC-GRU outperforms six state-of-the-art alternatives on human brain motor task classification because it mines more useful temporal information from the brain fMRI data. In summary, our findings highlight the importance of utilizing the temporal information of fMRI data to decode the complex brain.
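A GRU classifier over region-wise time courses can be sketched in a few lines: the input is a (batch, time, 360 regions) tensor, the last hidden state summarizes the sequence, and a linear layer maps it to the 5 motor classes. The hidden size, the single GRU layer, and the number of time points in the usage line are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TCGRU(nn.Module):
    """GRU over region-wise fMRI time courses:
    (batch, time, 360 regions) -> last hidden state -> 5 motor-task logits."""
    def __init__(self, n_regions=360, hidden=128, n_classes=5):
        super().__init__()
        self.gru = nn.GRU(n_regions, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, tc):
        _, h = self.gru(tc)          # h: (1, batch, hidden), features at the last time point
        return self.cls(h[-1])       # logits over the 5 motor tasks

# toy usage: batch of 8 samples, 284 time points is an assumed sequence length
logits = TCGRU()(torch.randn(8, 284, 360))
```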
- A deep hash retrieval for large-scale chest radiography images Guan Anna, Liu Li, Fu Xiaodong, Liu Lijun, Huang Qingsong doi:10.11834/jig.211012
18-02-2023
167
92
Abstract:Objective Big medical data mainly concerns the storage of data such as electronic health records, medical images, and genetic information, so processing large-scale medical image data efficiently is essential. For large-scale retrieval tasks, deep hashing methods improve on traditional retrieval methods: they map the high-dimensional features of an image into a binary space and generate low-dimensional binary codes, which avoids the curse of dimensionality and improves retrieval efficiency. Deep hashing methods fall into two categories, data-independent and data-dependent. Although deep hashing has great advantages in large-scale image retrieval, the loss of features in key lesion areas caused by redundancy, high noise, and small targets remains to be resolved. We therefore develop a deep hash retrieval network for large-scale chest X-ray images. Method In the feature learning part, ResNet-50 is used as the backbone network to extract initial features from the input image, and a feature refinement block, built from a residual block and an average pooling layer, then produces the global features. To capture the detailed focal regions, we design a spatial attention module based on three descriptors: 1) the maximum element along the channel axis, 2) the average element, and 3) maximum pooling. The key features are fed into this spatial attention module to obtain features that focus on the salient regions, and the local features are then obtained through the feature refinement block. The resulting global and local features are concatenated along the feature dimension, and the concatenation layer is connected to a fully connected layer to produce the hash codes. In the hash code optimization part, a joint loss function defines the target error so as to obtain high-quality hash codes and improve the quality of the ranking results. To generate more discriminative hash codes, we exploit the label information and semantic features of the images through contrastive, regularization, and cross-entropy losses. Finally, the retrieval results are computed with a similarity metric. Result Comparative experiments are carried out on two datasets, ChestX-ray8 and CheXpert. Our method is compared with five classical generic hashing methods on the same task, covering both deep and shallow hashing: the deep hashing methods are deep hashing (DH), deep supervised hashing (DSH), and attention-based triplet hashing (ATH), and the shallow hashing methods are semi-supervised hashing (SSH) and iterative quantization (ITQ). The normalized discounted cumulative gain (nDCG@100) and the mean average precision (mAP) are used as evaluation metrics. The experimental results show that the retrieval performance of our method surpasses the compared deep learning methods: on the ChestX-ray8 dataset, the mAP increases by about 6% and the nDCG@100 by 4%; on the CheXpert dataset, the mAP is higher by about 5% and the nDCG@100 by 3%. Conclusion To deal with the problem that existing hashing methods pay too little attention to salient region features, we present a deep hash retrieval network for large-scale chest X-ray image retrieval. The proposed method focuses on the lesion regions effectively and improves both the accuracy and the ranking quality of image retrieval. The spatial attention module highlights focal-area information and alleviates the lack of attention to salient areas, and the feature fusion module resolves the problem of information loss. Three loss functions make the real-valued output closer to the binary hash codes, which improves the ranking quality of the retrieval results. The network components can also be adapted to the regions of interest (RoI) of concern, the proposed losses can further improve existing hashing methods, and the approach has the potential to distinguish small-sample images.
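The spatial attention idea can be illustrated with a CBAM-style module: the per-pixel maximum and mean over channels are stacked, passed through a convolution, and squashed into a [0, 1] map that reweights the feature. This is an illustrative stand-in for the three-descriptor module described above, not its exact structure.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise max and mean maps are stacked,
    convolved, and passed through a sigmoid to produce a spatial weighting."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        mx, _ = x.max(dim=1, keepdim=True)        # maximum element along the channel axis
        mean = x.mean(dim=1, keepdim=True)        # average element along the channel axis
        attn = torch.sigmoid(self.conv(torch.cat([mx, mean], dim=1)))
        return x * attn                           # emphasize salient (lesion-like) regions

# toy usage: e.g., a ResNet-50 stage-5 feature map
feat = torch.randn(2, 2048, 7, 7)
out = SpatialAttention()(feat)
```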
Remote Sensing Image Processing
- Non-negative sparse component decomposition based modeling and robust unmixing for hyperspectral images Wang Shunqing, Yang Jingxiang, Shao Yuantian, Xiao Liang doi:10.11834/jig.211054
18-02-2023
217
77
Abstract:Objective Because of the low spatial resolution of hyperspectral imagers and the diversity of spectral features in nature, a pixel in a hyperspectral remote sensing image usually records a mixture of the spectra of several pure materials. Such mixed pixels limit the quality of hyperspectral images (HSIs) and hinder the development and application of hyperspectral remote sensing. Hyperspectral unmixing, which decomposes mixed pixels into a set of pure substances (endmembers) and their corresponding component ratios (abundances), is therefore essential to hyperspectral image analysis, and it is challenging because the available information is insufficient and ill-matched. Current hyperspectral unmixing methods can be divided into three categories: 1) geometry-based, 2) statistics-based, and 3) spectral library-based. Spectral library-based unmixing uses a pre-collected library containing the spectra of a large number of pure materials as the endmember dictionary, so endmembers do not need to be extracted from the hyperspectral data, and the error introduced by inaccurate endmember extraction is avoided. The number of endmembers in the spectral library is generally larger than the number of endmembers actually present in a hyperspectral image, which makes the abundance matrix sparse; adding sparsity constraints on the abundance matrix therefore improves unmixing accuracy, and the spatial information of the HSI can help considerably as well. In practice, because the photon energy collected during imaging is limited, hyperspectral data are corrupted by several kinds of noise, such as Gaussian noise, impulse noise, and deadlines, and the noise intensity differs from band to band. However, most unmixing methods consider only Gaussian noise, ignore other noise types such as impulses and deadlines, and assume the same Gaussian noise intensity in all bands. To address these problems, we develop a non-negative sparse component decomposition model and a robust unmixing method based on sparse component analysis (RUnSCA) for hyperspectral images. Method First, the mixed noise in real hyperspectral data and the different noise intensities of the individual bands are taken into account, and a non-negative sparse component model is built under the maximum a posteriori framework: each band is modeled as a linear combination of library spectra corrupted by mixed noise whose intensity varies across bands. Considering the sparse nature of impulse noise and deadline noise, the l1,1 norm is used to constrain these noise components. Since the number of endmembers in the spectral library exceeds the number of endmembers in the hyperspectral image, only a small number of rows of the abundance matrix are non-zero, that is, the abundance matrix has global row sparsity; to exploit this property more effectively, an l2,0 norm constraint is imposed on the abundance matrix. In hyperspectral images, locally adjacent pixels contain the same materials and their spectra are similar, so the image is piecewise smooth. The abundance matrix is regarded as piecewise smooth as well, because the elements in each column represent the proportions of the corresponding materials in a pixel, and a total variation (TV) regularization term is therefore added to the model. Finally, the RUnSCA problem is solved with the alternating direction method of multipliers (ADMM). Result To verify the effectiveness of RUnSCA, two simulated datasets and one real dataset are used in the experiments, and the signal-to-reconstruction error (SRE) is used to evaluate the unmixing performance. Five representative methods are used for comparison: collaborative sparse unmixing by variable splitting and augmented Lagrangian (CLSUnSAL) and collaborative sparse hyperspectral unmixing using the l0 norm (CSUnL0), which are based on collaborative sparsity; sparse unmixing by variable splitting, augmented Lagrangian, and total variation (SUnSAL-TV) and row-sparsity spectral unmixing via total variation (RSSUn-TV); and sparse unmixing of hyperspectral data with a bandwise model (SUBM), in which the variation of Gaussian noise across the bands of the hyperspectral image and the presence of impulse noise or deadlines are also considered. On the two simulated datasets, the experiments show that, compared with the best result of the five popular methods, our SRE improves by 4.11 dB and 6.94 dB on average under various noise intensities, which indicates that the proposed method handles mixed noise more robustly. RUnSCA also achieves good mineral unmixing performance on real data over the Cuprite mine in Nevada, USA. The unmixing performance is further validated on a single mixed spectrum: the reconstructed spectrum is closest to the real spectrum with the lowest RMSE, and RUnSCA accurately analyzes the composition and proportions of the mixed spectrum. The experiments on simulated and real data show that the proposed method is more robust to mixed noise. Conclusion This work demonstrates that robust unmixing of hyperspectral images with sparse component analysis (RUnSCA), built on non-negative sparse component decomposition, is effective: it comprehensively accounts for the influence of both Gaussian random noise and sparse structural noise on the accuracy of linear unmixing, and the resulting problem can be solved efficiently with ADMM.
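Under the notation above, the optimization problem has, schematically, the following form, where Y is the observed data, D the library dictionary, A the non-negative abundance matrix, S the sparse noise (impulses and deadlines), and W a band-wise weighting that accounts for the differing Gaussian noise intensities; the exact weighting of the data term and the precise regularizers used in the paper may differ from this sketch.

```latex
\min_{A \ge 0,\; S}\;\; \tfrac{1}{2}\,\bigl\lVert W \odot \bigl(Y - DA - S\bigr) \bigr\rVert_F^2
\;+\; \lambda_{1}\,\lVert A \rVert_{2,0}
\;+\; \lambda_{2}\,\lVert S \rVert_{1,1}
\;+\; \lambda_{\mathrm{TV}}\,\mathrm{TV}(A)
```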
- Airborne image segmentation via progressive multi-scale causal intervention Zhou Feng, Hang Renlong, Xu Chao, Liu Qingshan, Yang Guowei doi:10.11834/jig.211036
18-02-2023
175
121
Abstract:Objective Airborne image segmentation, which assigns a semantic label to each pixel in an image, is one of the essential tasks in remote sensing, with applications in areas such as land use analysis, urban planning, and environmental surveillance. Most conventional methods rely on hand-crafted features such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG); their performance depends heavily on the chosen features, and they still struggle with complex scenes. Deep convolutional neural networks (DCNNs), originally developed for image classification, have been adapted to pixel-wise classification problems such as airborne image segmentation because they can learn task-adaptive features automatically during training. The fully convolutional network (FCN) improved airborne image segmentation, and FCN-based models such as UNet and SegNet followed, introducing encoder-decoder designs in which fixed-size convolutional kernels capture contextual information for segmentation. Deep learning thus benefits airborne image segmentation, but the learned representations remain single-scale and local. In fact, two challenges must be handled in airborne image segmentation: 1) the multi-scale objects that appear in remote sensing images, and 2) the heterogeneity of multi-source images. The first calls for multi-scale context, and the second calls for extracting more discriminative global information. Methods of these two kinds alleviate the limitations of FCN-based methods and improve performance, but their mutual benefits have not been combined, and the interference of confounders is left unaddressed. We therefore develop a causal intervention-based segmentation method to suppress the interference of confounders. Method In this study, a progressive multi-scale causal intervention model (PM-SCIM) is built. First, PM-SCIM takes ResNet18 as the backbone network to extract convolutional features of airborne images. Then, a de-confounded module measures the average causal effect of confounders on the convolutional features by stratifying the confounders into different cases; in this way, objects can in effect be observed under any context, and the interference of any specific confounder is suppressed. Next, the de-confounded feature from the deepest layer is used to produce the segmentation result at that scale, and segmentation results at the other scales are obtained progressively through a fusion module guided by the de-confounded features from the shallower layers. Finally, all segmentation results are fused by a weighted sum. PM-SCIM is trained on two datasets, Potsdam and Vaihingen. For Potsdam, 24 images are chosen for training and the remaining 14 for testing; for Vaihingen, 16 images are selected for training and the remaining 17 for testing. To make full use of computing resources, a 256×256 sliding window crops the input images to generate training samples; at inference, the same sliding strategy crops tiles from the original test image, and the tiles are processed sequentially. For training, the momentum parameter is set to 0.9, the learning rate to 0.01, and the weight decay to 0.000 01. The SGD (stochastic gradient descent) optimization is accelerated with an NVIDIA GTX TITAN X GPU, and a poly learning rate schedule updates the learning rate after each iteration. Result Our method is compared with 4 popular state-of-the-art deep methods and 7 public benchmark results, using overall accuracy (OA) and the F1 score as quantitative metrics, and several segmentation maps are provided for visual comparison. Specifically, compared with DANet, OA increases by 0.6% and 0.8% (higher is better) and mean F1 by 0.7% and 1% on Potsdam and Vaihingen, respectively. Compared with CVEO2 on Potsdam, OA increases by 1.3% and mean F1 by 0.3%; compared with DLR_10 on Vaihingen, OA increases by 0.5% and mean F1 by 0.5%. The segmentation maps show that our method is particularly strong on small objects (e.g., cars) and ambiguous objects (e.g., trees versus lawn). In addition, a series of ablation studies on Potsdam and Vaihingen clarifies the contribution of each module in PM-SCIM. Conclusion A novel segmentation method is proposed that suppresses the interference of confounders through causal intervention by integrating a de-confounded module and a fusion module into a ResNet18 backbone.
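The de-confounded module is in the spirit of backdoor adjustment: the effect of the context confounder is approximated by averaging over a dictionary of confounder strata rather than conditioning on the one actually observed. The sketch below is a speculative, generic rendering of that idea (a learned confounder dictionary attended by each pixel and averaged under a uniform prior); the dictionary construction, prior, and fusion used in PM-SCIM are not specified here and will differ.

```python
import torch
import torch.nn as nn

class DeconfoundedModule(nn.Module):
    """Backdoor-adjustment style sketch: each pixel feature attends over a
    dictionary of confounder prototypes, the prior-weighted context is added
    back, and a 1x1 conv projects the result (illustrative, not PM-SCIM)."""
    def __init__(self, channels, n_confounders=6):
        super().__init__()
        self.register_buffer("prior", torch.full((n_confounders,), 1.0 / n_confounders))
        self.dictionary = nn.Parameter(torch.randn(n_confounders, channels))  # confounder strata
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # attention of each pixel over the stratified confounders
        logits = torch.einsum("nchw,kc->nkhw", x, self.dictionary) / c ** 0.5
        attn = torch.softmax(logits, dim=1) * self.prior.view(1, -1, 1, 1)
        attn = attn / attn.sum(dim=1, keepdim=True)            # renormalize the weights
        context = torch.einsum("nkhw,kc->nchw", attn, self.dictionary)
        return self.proj(x + context)                          # de-confounded feature

# toy usage: e.g., the deepest ResNet18 feature map
feat = torch.randn(2, 512, 32, 32)
out = DeconfoundedModule(512)(feat)
```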