
About the Journal
Journal of Image and Graphics (JIG) is a peer-reviewed monthly periodical. Since 1996, JIG has been an open forum and platform presenting all key aspects, theoretical and practical, of broad interest in computer engineering, technology, and science in China. Its main areas include, but are not limited to, state-of-the-art techniques and high-level research in image analysis and recognition, image interpretation and computer visualization, computer graphics, virtual reality, system simulation, animation, and other hot topics, meeting application requirements in fields such as urban planning, public security, network communication, national defense, aerospace, environmental change, medical diagnostics, remote sensing, surveying and mapping, and others.
Review
- Image engineering in China: 2021. Zhang Yujin. doi:10.11834/jig.220257
21-04-2022
Abstract:This is the 27th annual survey series of bibliographies on image engineering in China. This statistic and analysis study aims to capture the up-to-date development of image engineering in China, provide a targeted means of literature searching facility for readers working in related areas, and supply a useful recommendation for the editors of journals and potential authors of papers. Specifically, considering the wide distribution of related publications in China, all references (833) on image engineering research and technique are selected carefully from the research papers (2 958 in total) published in all issues (154) of a set of 15 Chinese journals. These 15 journals are considered important, in which papers concerning image engineering have higher quality and are relatively concentrated. The selected references are initially classified into five categories (image processing, image analysis, image understanding, technique application, and survey) and then into 23 specialized classes in accordance with their main contents (same as the last 15 years). Analysis and discussions about the statistics of the results of classifications by journal and by category are also presented. Analysis on the statistics in 2020 shows that image analysis is receiving the most attention, in which the focuses are mainly on object detection and recognition, image segmentation and edge detection, as well as human biometrics detection and identification. In addition, the studies and applications of image technology in various areas, such as remote sensing, radar, sonar and mapping, as well as biology and medicine are continuously active. In conclusion, this work shows a general and up-to-date picture of the various continuing progresses, either for depth or for width, of image engineering in China in 2021. The statistics for 27 years also provide readers with more comprehensive and credible information on the development trends of various research directions.
- A review of human face forgery and forgery-detection technologies. Cao Shenhao, Liu Xiaohui, Mao Xiuqing, Zou Qin. doi:10.11834/jig.200466
21-04-2022
Abstract:Face image synthesis is one of the most important sub-topics in image synthesis. Deep learning methods like the generative adversarial networks and autoencoder networks enable the current generation technology to generate facial images that are indistinguishable by human eyes. The illegal use of face forgery technology has damaged citizens’ portrait rights and reputation rights and weakens the national political and economic security. Based on summarizing the key technologies and critical review of face forgery and forged-face detection, our research analyzes the limitations of current forgery and detection technologies, which is intended to provide a reference for subsequent research on fake-face detection. Our analysis is shown as bellows: 1) the technologies for face forgery are mainly divided into the use of generative confrontation technology to generate a category of new faces and the use of existing face editing techniques. First, our review introduces the development of generative adversarial network and its application in human face image generation, shows the face images generated at different development stages, and targets that generative adversarial network provides the possibility of generating fake face images with high resolution, real look and feel, diversified styles and fine details;furthermore, it introduces face editing technology like face swap, face reenactment and the open-source implementation of the current face swap and face reenactment technology on the aspects of network structure, versatility and authenticity of the generated image. In particular, face exchange and face reconstruction technologies both decompose the face into two spaces of appearance and attributes, design different network structures and loss functions to transfer targeted features, and use an integrated generation adversarial network to improve the reality of the generated results. 2) The technologies for fake face detection, according to the difference of media carriers, can be divided into fake face image detection and fake face video detection. Our review first details the use of statistical distribution differences, splicing residual traces, local defects and other features to identify fake facial image generated from straightforward generative adversarial network and face editing technologies. Next, in terms of the difference analysis of extracting forged features, the fake facial video detection technology is classified into technology based on inter-frame information, intra-frame information and physiological signals. The methodology of extracting features, the design of network structures and the use scenarios were illustrated in detail. The current fake image detection technology mainly uses convolutional neural networks to extract fake features, and realizes the location and detection of fake regions simultaneously, while fake video detection technologies mainly use a integration of convolutional neural networks and recurrent neural networks to extract the same features inter and inner frames; after that, the public data sets of fake-face detection are sorted out, and the comparison results of multiple fake-face detection methods are illustrated for multiple public data sets. 3) The summary and the prospect part analyze the weaknesses of the current face forgery technologies and forged-face detection technologies, and gives feasible directions for improvement. 
Current face video forgery technology mainly modifies the face region locally and has the following defects: single video frames contain forgery traces, such as blurred profile views and missing texture details in facial parts; the correlation between video frames is not considered, so the generated frames are inconsistent, with frame jumps and large displacements of facial key points between consecutive frames; and the generated face videos lack normal biological signals such as blinking and micro-expressions. Current forgery-detection technologies, in turn, generalize poorly to real scenes and are not robust to image and video compression; detectors trained on high-resolution datasets do not transfer well to low-resolution images and videos, and detection methods struggle to keep up with the continuous upgrading and evolution of forgery techniques. Feasible improvements are outlined for both sides. For video generation, adding facial location information to the network could improve the temporal coherence of the generated video. For forgery detection, spatial-domain and frequency-domain forgery features can be fused during feature extraction, and 3D convolution and metric learning can be used to learn feature distributions that separate forged faces from genuine faces. Face forgery is trending toward few-shot learning, strong versatility, and high fidelity, while forged-face detection is moving toward high versatility, strong compression resistance, few-shot learning, and efficient computation.
Dataset
- MTMS300: a multiple-targets and multiple-scales benchmark dataset for salient object detection. Li Chuwei, Zhang Zhilong, Li Shuxin. doi:10.11834/jig.200612
21-04-2022
Abstract:Objective Benchmark dataset is essential to salient object detection algorithms. It conducts quantitative evaluation of various salient object detection algorithms. Most public datasets have a variety of biases like center bias, selection bias and category bias, respectively. 1) Center bias refers to the tendency of the photographer to place the object in the center of the camera’s field of view when shooting the object, which is called camera-shooting bias. 2) Selection bias means the designer has a specific tendency when choosing images in the process of dataset construction, such as simple background option or large object orientation. 3) Category bias refers to the category imbalance in the dataset, which is often in the training process of deep convolutional neural network (DCNN). Based on the dataset bias, current visual saliency algorithms are aimed at daily scenes images. Such images are usually shot from close distance with a single background, and the saliency of the object is related to the object size and position. A salient object algorithm can easily judge its saliency when the large-scale object is located in the center of the image. The current saliency detection benchmark datasets are constrained of these biases. Our demonstration illustrates the statistical differences of several commonly used benchmark datasets quantitatively and proposes a new high-quality benchmark dataset. Method Centers bias, clutter and complexity, and label consistency of benchmark datasets are the crucial to design and evaluate a benchmark dataset. First, the commonly used evaluation metrics are discussed, including average annotation map (AAM), normalized object distance (NOD), super-pixels amount, image entropy, and intersection over union (IoU). Next, a new benchmark dataset is constructed, we split the image acquisition and annotation procedure to avoid dataset bias. In terms of the image acquisition requirement, we yield 6 participants to collect images based on Internet surfing and employ the similarity measurement to conduct similar or replicated images deduction. The image annotation process is divided into two stages to ensure the consistency of the annotation as well. At the beginning, a roughly annotated bounding-box is required 5 participants to label the salient objects in the image with a box and use the IoU to clarify further labeled objects. Next, a pixel-level annotation map generation, which is labeled by 2 participants. The mechanisms of pixel-level labeling are proposed as below: 1) The unoccluded parts of salient objects are only labeled; 2) The adjacent objects are segmented into independent parts based on the targets orientation. For overlapped objects, we do not separate them into dis-continuous parts deliberately; 3) Objects can be identified just by looking at their outlines. We built a benchmark dataset in the context of 300 multi-facets images derived of sea, land and sky and called it multiple targets and multiple scales (MTMS300). Third, we conducted a quantitative analysis of the current benchmark datasets and our dataset based on 6 factors and ranked them according to their difficulty degree. After that, we test and compare 18 representative visual saliency models quantitatively in the span of the public benchmark datasets and our new dataset. We reveal the possibilities of the failure of the models based on benchmark datasets. 
At the end, we leverage a set of images from the benchmark datasets and construct a new benchmark dataset named DSC (difficult scenes in common). Result The demonstration is divided into two parts: statistical analysis of benchmark datasets and quantitative evaluation of visual saliency algorithms. In the first part, we utilize average annotation map and normalized object distance to analyze the dataset center bias. Normalized object size, Chi-square distance of histograms, the number of super-pixels and the image entropy are used to analyze the dataset complexity simultaneously. Compared with other public datasets, MTMS300 dataset has a smaller center bias. MTMS300 dataset is also prominent in terms of object quantity and object size. The priority of the DSC dataset is derived of its small foreground/background difference, large number of super-pixels, and high image entropy. In the second part, two most widely adopted metrics are adopted to evaluate existing visual saliency algorithms. In terms of the experiments and evaluations of 18 algorithms on 11 datasets, we discovered that there is a correlation between the metric score of the algorithm and the difficulty of the dataset. Meanwhile, we analyzed the limitations of the current algorithms running on the new dataset. Conclusion We demonstrate a benchmark dataset for salient object detection, which is characterized by less center bias, balanced distribution of salient object size ratio, diverse image resolution, and multiple objects scenarios. Moreover, the multi-labelers-based annotation guaranteed that most objects in our dataset are clear and consistent. Our MTMS300 dataset includes 300 color images containing multiple targets and multiple scales. We evaluated 18 representative visual saliency algorithms on this new dataset and review the challenging issues of various algorithms. At last, we have found a set of images based on the 9 existing benchmark datasets and construct a dataset named DSC. These two datasets can evaluate the performance of various visual saliency algorithms, and facilitate the development of salient object detection algorithms further, especially for task-specific applications.
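As a concrete illustration of two of the dataset statistics named above, the sketch below computes a bounding-box IoU (the kind of overlap check used to reconcile the annotators' boxes) and a normalized object distance as a center-bias measure. The exact definitions used for MTMS300 may differ; the function names and the half-diagonal normalization are assumptions.

```python
# Sketch of two dataset-analysis metrics named in the abstract: bounding-box IoU
# and a normalized object distance (centroid offset from the image center).
# These follow common definitions; the paper's exact formulas may differ.
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def normalized_object_distance(mask):
    """Distance of the salient-object centroid from the image center,
    normalized by half the image diagonal (0 = perfectly centered)."""
    ys, xs = np.nonzero(mask)
    h, w = mask.shape
    cy, cx = ys.mean(), xs.mean()
    offset = np.hypot(cy - (h - 1) / 2, cx - (w - 1) / 2)
    return offset / (np.hypot(h, w) / 2)

# Example: two annotators' boxes and a synthetic annotation mask.
print(iou((10, 10, 60, 60), (20, 15, 70, 65)))
mask = np.zeros((100, 100), dtype=np.uint8); mask[40:60, 70:90] = 1
print(normalized_object_distance(mask))
```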
Image Processing and Coding
- Light field image re-focusing based on conditional generative adversarial networks. Xie Ningyu, Ding Yuyang, Li Mingyue, Liu Yuan, Lyu Ruimin, Yan Tao. doi:10.11834/jig.200471
21-04-2022
Abstract:Objective Light field images, which record rich spatial and angular information, are widely used in computer vision applications. This information can be exploited to adjust the focal plane and depth of field of an image and thus significantly improve its visual effect. Existing methods fall into two categories. The first increases the angular resolution of a light field image via light field reconstruction, because aliasing arises from the disparity among the sub-aperture views. These methods require high computational cost, may introduce color errors or other artifacts, and can only improve the refocusing quality under the original focal plane and depth of field. The second category applies various filters, driven by a circle of confusion (COC) map, to defocus/render the central sub-aperture view and produce a bokeh rendering effect. Only a rough defocusing effect can be obtained, but this category has low computational cost and can handle both the focal plane and the depth of field. Deep convolutional neural networks (DCNNs) have shown their advantages in bokeh rendering. We therefore propose a novel conditional generative adversarial network (conditional GAN) for bokeh rendering. Method Our method takes a light field image as input and contains three parts. First, it computes the COC maps for different focal planes and depths of field from the disparity map estimated from the input light field image; the obtained COC map and the central sub-view of the light field image are fed into the generator of the conditional GAN. Second, the generator processes the two inputs with two four-layer encoders and fuses the features extracted by the two encoders through four consecutive residual modules. Finally, the refocused image produced by the generator is passed to the discriminator, which judges whether the refocused image is consistent with the given COC map. To enhance the high-frequency details of the refocused/rendered image, a pre-trained Visual Geometry Group 16-layer (VGG-16) network is adopted to compute the style loss and the perceptual loss; L1 loss is used for the generator, and the discriminator adopts the cross-entropy loss. Blender is used to adjust the positions of the focal planes and depths of field and to render the corresponding light field images, and a digital single lens reflex (DSLR) camera plug-in of Blender renders the corresponding refocused images as ground truth. The network is implemented with the Keras framework. The input and output sizes are both 512×512×3. The network is trained on a Titan XP GPU; the number of training epochs is set to 3 500, the initial learning rate to 0.000 2, and training takes about 28 hours. Result Our method is compared with related algorithms on the synthetic dataset and a real-world dataset, including current refocusing algorithms, three light field reconstruction algorithms, and a defocusing algorithm that uses anisotropic filtering with a COC map. Peak signal to noise ratio (PSNR) and structural similarity (SSIM) are used for quantitative evaluation. Qualitatively, the proposed network can produce refocused images with different focal planes and depths of field according to the input COC map. Quantitatively, the average PSNR is improved by 1.82 dB.
The average SSIM is improved by 0.02. Compared with the method that uses a COC map and anisotropic filtering, the average PSNR is improved by 7.92 dB and the average SSIM by 0.08. The reconstruction/super-resolution methods achieve poor PSNR values because of chromatic aberration in their generated sub-views. Conclusion Our algorithm computes the corresponding COC map from the disparity map estimated from the input light field image together with the chosen refocusing plane and depth of field; conditioned on this COC map, the conditional generative adversarial network performs bokeh rendering on the central sub-view image to produce the corresponding refocused image.
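The abstract states that the COC map is computed from the estimated disparity map for a chosen focal plane and depth of field but does not give the formula. The sketch below shows one plausible thin-lens-style approximation; `focus_disparity`, `aperture`, and `dof` are assumed, illustrative parameter names, not the paper's notation.

```python
# Hedged sketch of the first stage described above: turning an estimated disparity
# map plus a chosen focal plane and aperture into a circle-of-confusion (COC) map
# that conditions the generator. A common approximation is |d - d_focus| scaled by
# an aperture term, clipped to zero inside the depth of field.
import numpy as np

def coc_map(disparity, focus_disparity, aperture=1.0, dof=0.0):
    """COC radius per pixel: zero inside the depth of field, growing with the
    disparity offset from the focal plane outside it (names are assumptions)."""
    offset = np.abs(disparity - focus_disparity)
    return aperture * np.clip(offset - dof, 0.0, None)

disparity = np.random.uniform(-2.0, 2.0, size=(512, 512)).astype(np.float32)
coc = coc_map(disparity, focus_disparity=0.5, aperture=3.0, dof=0.2)
print(coc.shape, float(coc.min()), float(coc.max()))
```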
- Restart fast ADMM methods for second-order variational models of image restoration. Song Tiantian, Pan Zhenkuan, Wei Weibo, Li Qing. doi:10.11834/jig.200656
21-04-2022
Abstract:Objective Variational models are widely used in image de-noising, image segmentation, and image restoration, and the variational model of image restoration holds a fundamental position among them. With second-order derivative regularizers, such models can preserve image edges while keeping smooth regions smooth. However, their regular terms are generally nonlinear, non-smooth, or even non-convex, which makes numerical algorithm design difficult, lowers computational efficiency, and restricts the design of fast algorithms. Inertial acceleration methods require well-chosen inertial parameters, but variational image processing models are often only locally strongly convex or completely non-convex, so the optimal inertial parameters are difficult or time-consuming to estimate; as a result, inertial acceleration can cause ripples and fail to achieve the expected acceleration. Monotonic, backtracking, and restart strategies have been developed to suppress such ripples and maintain the convergence rate. This work develops restart fast alternating direction method of multipliers (ADMM) algorithms within a unified framework to explore restart fast algorithms for second-order variational models, taking the total-Laplacian-based (TL) model and the Euler's elastica-based (EE) model as examples. Method We consider second-order variational restoration models with a nonlinear, non-smooth TL regularizer and a nonlinear, non-smooth, non-convex EE regularizer. The restart fast ADMM algorithm is built by combining the alternating direction method of multipliers, Nesterov's inertial acceleration, and a restart rule for suppressing ripples. The TL model is transformed into an equivalent constrained convex optimization problem via auxiliary variables and linear constraint equations; the EE model is transformed into an equivalent constrained optimization problem via auxiliary variables, linear constraint equations, and relaxed nonlinear constraint equations. The restart fast ADMM algorithm decides whether to restart according to the magnitude of the combined residual. The proposed algorithms can serve as a reference for fast algorithms of similar models. In the experiments, the number of iterations, the total CPU running time, and the peak signal-to-noise ratio (PSNR) are recorded, and the energy change curve and convergence curve are plotted for each algorithm. Result The PSNR values of the three algorithms show that the fast algorithms and the original ADMM algorithm achieve the same de-noising effect; the fast algorithms preserve the restoration quality of the original model and of the original ADMM algorithm. In terms of computational efficiency, the three algorithms are compared on the TL model and the EE model. Compared with the original ADMM algorithm, the fast ADMM algorithm improves efficiency by 6%-50% and 14%-54%, respectively, while the restart fast ADMM algorithm improves efficiency by 100%-433% and 100%-900%, respectively. In addition, the number of iterations required by the restart fast ADMM algorithm remains essentially unchanged across the tested settings, and its running time is significantly reduced, which shows that the restart fast ADMM algorithm is very robust.
The energy change curves and convergence curves clearly show that the fast ADMM algorithm produces ripples, which the restart fast ADMM algorithm removes. During the computation, the restart fast ADMM algorithm adaptively adjusts the step size according to the magnitude of the combined residual, eliminating the ripples and improving computational efficiency. Conclusion The sub-problems of the alternating optimization are solved in each iteration by the fast Fourier transform (FFT) or by generalized soft-thresholding formulas. Numerical experiments show that the restart strategy greatly improves the computational efficiency of the original ADMM and makes the algorithm robust to the penalty parameters. This work provides a useful reference for fast algorithms of second-order variational models of image restoration, and the restart fast algorithm can be extended to other variational models of image analysis with second-order derivative regularizers. However, ADMM and its restart fast variants still lack sufficient theoretical support for nonlinear, non-smooth, and non-convex variational models with high-order derivatives; current theory is limited to objective functions composed of two functions. For fast algorithms of non-smooth and non-convex high-order variational models in computer vision, this work is limited to tentative algorithm design and numerical verification.
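The restart rule described above (accelerate with Nesterov momentum while a combined residual keeps decreasing, otherwise drop the momentum) can be sketched on a toy problem. The code below applies it to a small lasso instance in the spirit of Goldstein et al.'s fast ADMM with restart, not to the TL/EE image models, whose sub-problems would instead be solved by FFT and generalized soft-thresholding; the toy objective and all parameter values are assumptions.

```python
# Minimal sketch of a restart fast ADMM loop on a toy lasso problem:
#   min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0.
import numpy as np

def soft(v, t):                        # soft-thresholding (prox of the l1 norm)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def restart_fast_admm(A, b, lam=0.1, rho=1.0, eta=0.999, iters=200):
    m, n = A.shape
    AtA, Atb = A.T @ A, A.T @ b
    inv = np.linalg.inv(AtA + rho * np.eye(n))    # x-update solve (small n only)
    z = u = z_hat = u_hat = np.zeros(n)
    alpha, c_prev = 1.0, np.inf
    for _ in range(iters):
        x = inv @ (Atb + rho * (z_hat - u_hat))
        z_new = soft(x + u_hat, lam / rho)
        u_new = u_hat + x - z_new
        # combined residual used to decide between acceleration and restart
        c = rho * np.sum((u_new - u_hat) ** 2) + rho * np.sum((z_new - z_hat) ** 2)
        if c < eta * c_prev:                       # accelerate (Nesterov momentum)
            alpha_new = (1 + np.sqrt(1 + 4 * alpha ** 2)) / 2
            z_hat = z_new + (alpha - 1) / alpha_new * (z_new - z)
            u_hat = u_new + (alpha - 1) / alpha_new * (u_new - u)
            alpha, c_prev = alpha_new, c
        else:                                      # restart: drop the momentum
            alpha, z_hat, u_hat, c_prev = 1.0, z, u, c_prev / eta
        z, u = z_new, u_new
    return z

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20)); x_true = np.zeros(20); x_true[:3] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
print(np.round(restart_fast_admm(A, b), 2))
```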
- Gradual model reconstruction of Dongba painting based on residual dense structure. Jiang Mengjie, Qian Wenhua, Xu Dan, Wu Hao, Liu Chunyu. doi:10.11834/jig.200523
21-04-2022
Abstract:Objective Dongba painting is an essential part of Na’xi culture in China. It is the fundamental medium to develop Dongba culture in art theory, natural aesthetics, religion and history. However, the current low resolution digital image of Dongba painting has affected the application, inheritance and development of Dongba culture. Super-resolution reconstruction technology is the process of recovering high-resolution image from low-resolution image. Because Dongba painting has a unique artistic style, compared with natural images, there are no dimension, distance and depth factors, and there is no light and shadow effect via natural light. While the existing super-resolution algorithm for natural images applied to Dongba painting straitforward, the reconstruction effect of lines, color blocks and materials of Dongba painting is not ideal. Therefore, the super-resolution reconstruction of Dongba painting is adopted and high-resolution Dongba painting obtained via the reconstruction of low-resolution Dongba painting. Method First, the Dongba painting dataset for network training is constructed, which makes the learning of Dongba painting image characteristics more targeted. The data set contains 298 high-definition Dongba paintings, and each Dongba painting has one dimension with a resolution of 2 K at least. Among them, 278 paintings are used for training, 20 paintings are used for testing, and the training set is randomly cropped for data enhancement. Next, in accordance with the characteristics of Dongba painting image which is rich in high frequency information, a reconstruction network is built and named Dongba super-resolution network (DBSRN): the overall structure of the network uses multi-level sub network cascade to gradually reconstruct high-resolution Dongba painting. During the feedforward process of a large-scale reconstruction, multiple intermediate Super-resolution predictions were produced. The final reconstruction results are constrained by multiple intermediate prediction values, so as to reconstruct Dongba paintings of different scales from small to large gradually. The reconstruction results of each sub network and the corresponding scale tags are calculated at the pixel level, and the tags of different scales jointly guide the reconstruction. The loss of high-frequency details is reduced in the up sampling process of Dongba painting image. To extract and fuse features at different levels, each level of super-resolution sub-network has a shallow feature extraction module, a deep feature extraction module, and a global feature fusion module. In this way, the disappearance of features with the deepening of network can be avoided. The extracted feature maps are input to the up-sampling module for reconstruction. The residual dense structure, which combines residual connection and dense connection, can effectively enhance feature reuse and slow down gradient disappearance. The demonstrated algorithm takes the residual dense structure as the core in deep feature extraction module, extracts the deep feature of Dongba painting for fusion, reduce the feature loss caused by simple chain stacking in the convolution layer; At the end, added-discriminator, perception loss and adversarial loss for adversarial training are implemented to improve the visual quality of Dongba painting based on pixel level loss. This network is named DBGAN (Dongba generative adversarial network). 
Result The Dongba paintings reconstructed by our method achieve better results in both subjective visual quality and objective indicators than bicubic interpolation (Bicubic), the super-resolution convolutional neural network (SRCNN), the super-resolution residual network (SRResNet), and the information multi-distillation network (IMDN). The peak signal to noise ratio (PSNR) / structural similarity index (SSIM) of DBSRN reach 33.46 dB and 0.911 2 for an upsampling factor of 2, 28.54 dB and 0.776 2 for a factor of 4, and 24.61 dB and 0.643 0 for a factor of 8. The objective indicators of DBSRN improve to varying degrees over Bicubic, SRCNN, and SRResNet. Compared with SRResNet, PSNR and SSIM increase by 0.10 dB and 0.000 8 at an upsampling factor of 2, by 0.18 dB and 0.003 2 at a factor of 4, and by 0.23 dB and 0.004 4 at a factor of 8. DBGAN further improves the clarity and fidelity of the reconstruction results and reconstructs Dongba paintings with richer edge and texture details. Conclusion This universal super-resolution network model can effectively improve the resolution and clarity of low-resolution Dongba paintings.
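A minimal sketch of the residual dense structure at the core of the deep feature extraction module: densely connected 3×3 convolutions, a 1×1 local fusion convolution, and a local residual connection. Layer counts and channel widths below are assumptions, not the actual DBSRN configuration.

```python
# Hedged sketch of a residual dense block (dense connections + local fusion +
# local residual), the building block the abstract describes for deep features.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList()
        for i in range(layers):
            self.convs.append(nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
        # local feature fusion: 1x1 conv squeezes the concatenated features back
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))    # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))       # local residual

x = torch.randn(1, 64, 48, 48)
print(ResidualDenseBlock()(x).shape)   # torch.Size([1, 64, 48, 48])
```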
Image Analysis and Recognition
- The cross-view gait recognition analysis based on generative adversarial networks derived of self-attention mechanism. Zhang Hongying, Bao Wenjing. doi:10.11834/jig.200482
21-04-2022
Abstract:Objective Gait is a sort of human behavioral biometric feature, which is clarified as a style of person walks. Compared with other biometric features like human face, fingerprint and iris, the feature of gait is that it can be captured at a long-distance without the cooperation of the subjects. Gait recognition has its potential in surveillance security, criminal investigation and medical diagnosis. However, gait recognition is changed clearly in the context of clothing, carrying status, view variation and other factors, resulting in strong intra gradient changes in the extracted gait features. The relevant view change is a challenging issue as appearance differences are introduced for different views, which leads to the significant decline of cross view recognition performance. The existing generative gait recognition methods focus on transforming gait templates to a specific view, which may decline the recognition rate in a large variation of multi-views. A cross-view gait recognition analysis is demonstrated based on generative adversarial networks (GANs) derived of self-attention mechanism. Method Our network structure analysis is composed of generator G, view discriminator D and identity preserver Φ. Gait energy images (GEI) is used as the input of network to achieve view transformation of gaits across two various views for cross view gait recognition task. The generator is based on the encoder-decoder structure. First, the input GEI image is disentangled from the view information and the identity information derived of the encoder Genc, which is encoded into the identity feature representation f(x) in the latent space. Next, it is concatenated with the view indicator v, which is composed of the one-hot coding with the target view assigned 1. To achieve different views of transformation, the concatenated vector as input is melted into the decoder Gdec to generate the GEI image from the target view. In order to generate a more accurate gait template in the target view for view transformation task, pixel-wise loss is introduced to constrain the generated image at the end of decoder. In the discriminant network, the view discriminator learning distinguishes the true or false of the input images and classifies them to its corresponding view domain. It is composed of four Conv-LeakyReLU blocks and in-situ two convolution layers those are real/fake discrimination and view classification each. For the constraint of the generated images inheriting identity information in the process of gait template view transformation, an identity preserver is introduced to bridge the gap between the target and generated gait templates. The input of identity preserver are three generated images, which are composed of anchor samples, positive samples from other views with the same identity as anchor samples, and negative samples are from the same view in related to different identities. The following Tri-Hard loss is used to enhance the discriminability of the generated image. The GAN-based gait recognition method can achieve the view transformation of gait template but it cannot capture the global, long-range dependency within features in the process of view transformation effectively. The details of the generated image are not clear result in blurred artifacts. The self-attention mechanism can efficiently sort the long-range dependencies out within internal representations of images. 
We yield the self-attention mechanism into the generator and discriminator network, and the self-attention module is integrated into the up-sampling area of the generator, which can involve the global and local spatial information. The self-attention module derived discriminator can clarify the real image originated from the generated. We update parameters of one module while keeping parameters of the other two modules fixed, and spectral normalization is used to increase the stable training of the network. Result In order to verify the effectiveness of the proposed method for cross-view gait recognition, several groups of comparative experiments are conducted on Chinese Academy of Sciences’ Institute of Automation gait database——dataset B (CASIA-B) as mentioned below: 1) To clarify the influence of self-attention module plus to identify positions of the generator on recognition performance, the demonstrated results show that it is prior to add self-attention module to the feature map following de-convolution in the second layer of decoder; 2) Ablation experiment on self-attention module and identity preserving loss illustrates that the recognition rate is 15% higher than that of GaitGAN method when self-attention module and identity preserving loss are introduced simultaneously; 3) The frame-shifting method is used to enhance the GEI dataset on CASIA-B, and the improved recognition accuracy of the method is significantly harnessed following GEI data enhancement. Our illustration is derived of the OU-MVLP (OU-ISIR gait database-multi-view large population dataset) large-scale cross-view gait database, which has a rank-1 average recognition rate of 65.9%. The demonstrated results based on OU-MVLP are quantitatively analyzed, and the gait templates synthesized at four views (0°, 30°, 60°, and 90°) are visualized in the dataset. The results show that the generated gait images are highly similar to the gait images with real and target views even when the difference of views is large. Conclusion A generative adversarial network framework derived of self-attention mechanism is implemented, which can achieve view transformation of gait templates across two optioned views using one uniform model, and retain the gait feature information in the process of view transformation while improving the quality of generated images. The effective self-attention module demonstrates that the generated gait templates in the target view is incomplete, and improves the matching of the generated images; The identity preserver based on Tri-Hard loss constrains the generated gait templates inheriting identity information via input gaits and the discrimination of the generated images is enhanced. The integration of the self-attention module and the Tri-Hard loss identity preserver improves the effect and quality of transformation of gaits, and the recognition accuracy is improved for qualified cross-view gait recognition. As the GEI input of the model, the quality of pedestrian detection and segmentation will intend to the quality loss of the synthesized GEI images straightforward in real scenarios. The further research will focus on problem solving of cross-view gait recognition in complex scenarios.
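A minimal sketch of the kind of self-attention module described above (SAGAN-style non-local attention over spatial positions with a learnable gate), which could be dropped into the decoder's up-sampling path; the reduction ratio and placement here are assumptions.

```python
# Hedged sketch of a spatial self-attention module with a learnable gamma gate,
# as used in the generator/discriminator described above.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key   = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # gamma=0: starts as identity

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (b, hw, c//r)
        k = self.key(x).flatten(2)                          # (b, c//r, hw)
        attn = torch.softmax(q @ k, dim=-1)                 # (b, hw, hw)
        v = self.value(x).flatten(2)                         # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)   # long-range mixing
        return x + self.gamma * out

gei_feat = torch.randn(2, 64, 32, 32)   # e.g. decoder feature maps of a GEI batch
print(SelfAttention2d(64)(gei_feat).shape)
```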
- Dual branch network for human pose estimation in dressing scene. Lyu Zhongzheng, Liu Li, Fu Xiaodong, Liu Lijun, Huang Qingsong. doi:10.11834/jig.200642
21-04-2022
Abstract:Objective Human pose estimation aims at the human joints recognition and orientation in a targeted image of different scenes and the joint point positioning accuracy optimization. Current methods of human pose estimation have a good performance in some targeted dressing scenes where the visibility of body joints was constrained by occasional clothes wearing and failed in some complicated dressing scenes like fashion street shot. There are two main difficulties of human pose estimation in the dressing scene which result in the low accuracy of human body joints positioning and human pose estimation. One aspect is that various styles of clothes wearing leads to human body joints partially occluded and various texture and color information caused the failure of human joint point positioning. Another one is that there are various body postures in dressing scene. A method of dual branch network is required for human pose estimation in dressing scene. Method First, human detection is implemented on the input image to obtain the area of dressed human body. The pose representation branch and the dress part segmentation branch are segmented each. Next, to avoid the interference of the joint point feature extraction in the context of the variety of clothing styles and complex background, the multi-scale loss and feature fusion pose representation branch generate the joint point score map based on the stacked hourglass network. To overcome the problem of human pose with different angles of view in the dressing scene, the pose category loss function is harnessed based on pose clustering. Then, the dress part segmentation branch is constructed based on the shallow connection, deep features of the residual network and feature fusion performance based on the targeted label of dressed part to build the dressed part score map. At the end, in order to resolve the clothing occlusion of joints issue, the dress part segmentation result is used to constrain the position of human body joints, and the final human pose estimation is obtained for pose optimization. Result The illustrated method is validated on the constructed image dataset of the dressed people. Our demonstration show that the constructed pose representation branch improves the positioning accuracy of human body joints effectively, especially the introduced pose category loss function improved the robustness of multi-angles human pose estimation. In terms of the optimization integrated with the semantic segmentation of dressed parts, the estimation accuracy of human body pose is improved to 92.5%. Conclusion In order to handle low accuracy of human pose estimation derived from various clothing styles and various human body postures in dressing scene, a dual-branch network for human pose estimation is facilitated in dressing scene. To improve the positioning accuracy of human body joints, we construct pose representation model to fuse global and local features. A pose category loss is melted to improve the robustness of multi-view angles of human pose estimation. We integrate the semantic segmentation of dressed parts to constrain the position of human body joints which improves the accuracy of human body pose estimation in dressing scene effectively. The constructed image dataset of human dresses demonstrates that the proposed method can improve the estimation accuracy of human body pose in dressing scene. The clear estimation ratio of joint points reaches 92.5%. 
The estimation accuracy is still low when dresses, overcoats, or multi-layer clothing occlude the body joints heavily, and the positioning accuracy of joints also needs improvement when people carry bags or other accessories. Improving pose estimation accuracy in such diverse dressing scenes is left for further work.
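A minimal sketch of the final optimization step described above, in which the dress-part segmentation scores constrain the joint locations; the soft re-weighting scheme and the `alpha` blending factor are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch: dress-part segmentation scores re-weight joint heatmaps before
# the argmax, suppressing responses outside plausible dressed-body regions.
import numpy as np

def refine_joints(heatmaps, part_mask, alpha=0.5):
    """heatmaps: (J, H, W) joint score maps; part_mask: (H, W) in [0, 1],
    the probability that a pixel belongs to a dressed body part."""
    weighted = heatmaps * (alpha + (1 - alpha) * part_mask)   # soft constraint
    joints = []
    for hm in weighted:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        joints.append((x, y))
    return joints

heatmaps = np.random.rand(14, 64, 64)            # e.g. 14 body joints
part_mask = np.zeros((64, 64)); part_mask[8:56, 16:48] = 1.0
print(refine_joints(heatmaps, part_mask)[:3])
```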
- An embedded dual-scale separable CBAM model for masked human face poses classification. Chen Senqiu, Liu Wenbo, Zhang Gong. doi:10.11834/jig.200736
21-04-2022
Abstract:Objective Human face poses classification is one of the key aspects of computer vision and intelligent analysis. It is a potential technology for human behavior analysis, human-computer interaction, motivation detection, fatigue driving monitoring, face recognition and virtual reality. Wearing masks is regarded as an intervened method during the outbreak of the corona virus disease 2019 (COVID-19) pandemic. It is a new challenge to achieve masked face poses classification. The convolutional neural network is widely applied to identify human face information and it is using in the face pose estimation. The convolutional neural network (CNN) research is to achieve face pose estimation based on low resolution, occlusion interference and complicated environment. In terms of the stronger capability of convolutional neural network and its successful application in face pose classification, we implement it to the masked face pose classification. Face pose estimation is one of the mediums of computer vision and intelligent analysis technology, and estimation results are used for subsequent analysis and decision. As an intermediate part of face pose estimation technology, the lightweight and efficient network structure can make the estimation to play a greater role within limited resources. Therefore, our research focuses on an efficient and lightweight convolutional neural network for masked face pose estimation. Method The core of the designed network is an efficient and lightweight dual-scale separable attention convolution (DSAC) unit and we construct the model based on 5 DSAC units stacking. The DSAC unit is constructed via two depthwise separating convolution with 3×3 and 5×5 kernel size in parallel, and we embed convolutional block attention module (CBAM) in these two convolutional ways. But, the embedding method that we proposed is different from the traditional CBAM embedding method. We split CBAM into spatial attention module (SAM) and channel attention module (CAM). We embed SAM following the depthwise (DW) convolution and embed CAM following the pointwise (PW) convolution respectively, and we cascade these two parts at final. In this way, the features of DW convolution and PW convolution can be matched. Meanwhile, we improve the SAM via the result of 1×1 pointwise convolution supplements, which can enhance the utilization of spatial information and organize a more effective attention map, we make use of 5 DSAC units to build the high accuracy network. The convolution channels and the number of units in this network are manipulated in rigor. The full connection layers are discarded because of their redundant number of parameters and the computational complexity, and a integration of point convolutional layer to global average pooling (GAP) layer is used to replace them. Therefore, these operations make the network more lightweight further. In addition, a large scale of human face data collection cannot be achieved temporarily because of the impact of COVID-19. We set the mask images that are properly scaled, rotated, and deformed on the common face pose dataset to construct a semisynthetic dataset. Simultaneously, a small amount of real masked face poses images are collected to construct a real masked face poses dataset. We use the transfer learning method to train the model under the lack of a massive real face poses dataset. 
Result The experimental results show that the accuracy of the proposed model is 2.86%, 6.41%, and 12.16% higher than that of the model embedded with separable CBAM without the improved SAM module, the model embedded with standard CBAM, and the model without CBAM, respectively. The results show that the model embedded with separable CBAM and the improved SAM module has an efficient and lightweight structure: it effectively improves performance while adding few parameters and little computational complexity. Depthwise separable convolution makes the model more compact, and the CBAM attention module makes the model more effective when it performs recognition. In addition, the dual-scale convolution enriches the features and enhances the feature extraction capability within a limited number of convolution units; extracting features at different scales to improve performance avoids the over-fitting and rapid parameter growth caused by simply stacking convolutional layers. Compared with classical convolutional neural networks such as AlexNet, the Visual Geometry Group network (VGGNet), ResNet, and GoogLeNet, the parameters and computational complexity of the proposed model decrease significantly, and its accuracy is 3.57%-30.71% higher than that of AlexNet, VGG16, ResNet18, and GoogLeNet. Compared with classical lightweight convolutional neural networks such as SqueezeNet, MobileNet, ShuffleNet, and EfficientNet, the proposed model has the fewest parameters, the lowest computational complexity, and the highest accuracy: it has only 1.02 M parameters and 24.18 M floating-point operations (FLOPs), and achieves 98.57% accuracy. Conclusion A lightweight and efficient convolution unit is designed to construct the network, which attains high accuracy with few parameters and low computational complexity.
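A minimal sketch of a dual-scale separable attention unit in the spirit described above: two depthwise-separable branches (3×3 and 5×5), spatial attention after the depthwise step, channel attention after the pointwise step, and the branches joined at the end. The channel widths, the concatenation, and the omission of the improved-SAM 1×1 supplement are simplifying assumptions.

```python
# Hedged sketch of a DSAC-like unit: SAM after the depthwise conv, CAM after the
# pointwise conv, in two parallel kernel-size branches.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(channels, channels // reduction, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(channels // reduction, channels, 1))
    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)
    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

def separable_branch(in_ch, out_ch, k):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),  # depthwise
        SpatialAttention(),                                        # SAM after DW
        nn.Conv2d(in_ch, out_ch, 1),                               # pointwise
        ChannelAttention(out_ch))                                  # CAM after PW

class DSACUnit(nn.Module):
    def __init__(self, in_ch=32, out_ch=32):
        super().__init__()
        self.b3 = separable_branch(in_ch, out_ch // 2, 3)
        self.b5 = separable_branch(in_ch, out_ch // 2, 5)
    def forward(self, x):
        return torch.cat([self.b3(x), self.b5(x)], dim=1)

print(DSACUnit()(torch.randn(1, 32, 56, 56)).shape)
```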
- Joint loss optimization based high similarity identification for milch goats. Shang Cheng, Wang Meili, Ning Jifeng, Li Qunhui, Jiang Yu, Wang Xiaolong. doi:10.11834/jig.200619
21-04-2022
Abstract:Objective Quick access to individual tracking information is essential for intelligent agriculture and animal husbandry, and individual identification of animals remains one of the challenging issues in real-time monitoring. Different from traditional, harmful marking methods such as imprinting, a deep learning based image recognition method is adopted, which must handle various animals and humans as well as unclear relationships among multiple features. Method First, a computer-vision method for individual recognition of dairy goats based on deep learning is presented. Images of 26 goats are acquired, covering the head and other body parts, and fancy principal components analysis (fancy PCA) is adopted as the data augmentation method to expand the dataset. In total, 1 040 goat images are randomly selected for training and 260 images are used as an independent test set; the dataset is first preprocessed with a single shot MultiBox detector (SSD) network. A siamese network is used for preliminary learning; although the network structure and learning-rate optimization are tuned, the siamese network turns out to be suited to verifying individuals on this highly similar dataset rather than to individual identity classification. Whole-body images perform better than head-only images, and the result obtained with the Triplet-Loss function alone improves greatly when training moves from head images to whole-body images. The input of the Triplet-Loss function consists of three images. Because the siamese network shows that individual dairy goats are highly similar, and manually assembling sets of different goats' images is complicated, no additional dataset construction is required for the Triplet-Loss function; the Triplet-Loss function on this dataset therefore shows advantages over the siamese network method. A joint loss function and a transfer-learning model, the residual neural network ResNet18, are then used to extract goat information with a deep network structure. Finally, with Adam as the optimizer, the joint loss of the Triplet-Loss function and the CrossEntropy-Loss function with suitable parameters achieves good recognition without mining hard batches (difficult triplets) for the Triplet-Loss function. In addition, using the Triplet-Loss function and the siamese network to compare features of the goat-face region and the whole-goat region shows that recognition based on the goat-face region alone does not reach a high accuracy. Our pipeline not only uses the you only look once (YOLOv3) network and the siamese network to identify goats, but also learns with a transfer-learning model. The siamese network verifies individuals on the highly similar dataset, and the Triplet-Loss function and the CrossEntropy-Loss function are used as the final loss functions to verify the effectiveness of the method. Result The SSD network is used to preprocess the dataset. The results show that the accuracy can be improved from 86% to 93.077% by combining the joint loss function with the Adam algorithm. When the joint loss function is used with the Adam optimizer and the two loss terms are weighted in an appropriate proportion, better recognition can be obtained by tuning the related parameters.
Compared with 74.615% for the Triplet-Loss function alone and 89.615% for the CrossEntropy-Loss function alone, the highest recognition accuracy reaches 93.077%. Conclusion The higher recognition performance for goats is obtained with the deep learning model: recognition is not limited to facial features, and higher accuracy is achieved from the whole goat body. Intelligent analysis of individual goats could further build on segmenting and characterizing each part of the goat body. Moreover, the deep learning model can reduce the high labor cost of building computer-vision-based individual archives.
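A minimal sketch of the joint loss described above: a weighted sum of an identity cross-entropy term and a triplet term on ResNet18 features, optimized with Adam. The 26-class head follows the abstract; the weight `w`, the margin, and the learning rate are illustrative assumptions.

```python
# Hedged sketch of a CrossEntropy + Triplet joint loss on a ResNet18 backbone.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=None)        # transfer learning would load ImageNet weights here
backbone.fc = nn.Linear(backbone.fc.in_features, 26)      # 26 individual goats
trunk = nn.Sequential(*list(backbone.children())[:-1])    # shared conv trunk for embeddings

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.3)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def joint_loss(anchor, positive, negative, labels, w=0.5):
    """Weighted sum of identity cross-entropy (on the anchor) and a triplet term."""
    emb = lambda x: torch.flatten(trunk(x), 1)             # 512-d pooled features
    return ce(backbone(anchor), labels) + w * triplet(emb(anchor), emb(positive), emb(negative))

batch = lambda: torch.randn(4, 3, 224, 224)
loss = joint_loss(batch(), batch(), batch(), torch.randint(0, 26, (4,)))
optimizer.zero_grad(); loss.backward(); optimizer.step()
print(float(loss))
```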
- Unsupervised domain adaptation insulator detection based on adversarial consistency constraints. Li Meiyu, Li Shilin, Zhao Ming, Fang Zhengyun, Zhang Yafei, Yu Zhengtao. doi:10.11834/jig.200418
21-04-2022
Abstract:Objective Insulator is widely used in overhead transmission line nowadays. It is a unique insulation device which can withstand voltage and mechanical stress. In order to reduce the potential safety hazards caused by insulator failures, overhead transmission lines need to be inspected regularly. It is necessary to detect insulators from the inspection images quickly and effectively in order to locate and analyze the defects. The electrical grid insulators applications are mainly divided into two categories: glass insulators and composite insulators. The color and shape are quite different for the two types of insulators, which results in a severe domain bias in the feature space. In most cases, we can only obtain the data for a single type of insulator and train the model by them. The detection of other types of insulators will cause the performance of the trained model to drop sharply due to the domain bias between the source data and the target data. Hence, it is required to improve the generalization ability of the model to maintain good detection performance. Unsupervised domain adaptation is a widely used method for cross-domain detection and recognition. This method uses labeled samples in the source domain and unlabeled samples in the target domain in the training process. A domain-invariant (or domain-aligned) feature representation learning method can effectively release the performance degradation caused by domain bias. Our demonstration illustrates an unsupervised domain adaptation insulator detection method to improve the efficiency of transmission line intelligent inspection and maintenance. Method In order to improve the model’s generalization ability for insulators in the target domain in complicated transmission line images without the target domain labels, an unsupervised domain adaptation insulator detection algorithm is harnessed based on adversarial consistency constraint. The proposed algorithm is divided into two stages including pre-training and adversarial learning. In the pre-training stage, the labeled source domain samples and unlabeled target domain samples are fed into the network to extract features. The extracted two sets of features are input into two classification networks. The unique feature representation of two different types of insulators is obtained based on constraining the two classifiers with binary cross-entropy loss. The feature encoder and two classifiers are trained as well. In the process of adversarial consistency learning, an extra classifier is involved to obtain robustness feature representation. The features obtained by the source domain and target domain samples through the network are sent to a new initialized classification network, and the classifier is trained separately through binary cross-entropy to make the backbone unable to correctly classify the two features. The classifier is then fixed to train the backbone network, and the classification results of the two groups of features are limited to the same label. The network can extract the consistent and robust features of different types of insulators. Result This demonstration illustrates that our method significantly improves the cross-domain insulator detection performance, and the mean average precision (mAP) reaches 55.1% and 23.4% on the two tasks of glass→composite and composite → glass, respectively. The analyzed result of our method is qualified on the public dataset common objects in context (COCO). 
The mAP reaches 61.5%, which verifies the generality and extensibility of the method. In the ablation study, the proposed method improves mAP over the baseline by 11.5% and 6.4% on the glass→composite and composite→glass tasks, respectively. Conclusion The method reduces the domain bias caused by the discrepancy between different types of insulators and improves the generalization of the model in cross-domain insulator detection tasks, which can improve the efficiency of insulator detection in transmission line inspection. The results indicate that the method improves upon existing unsupervised domain adaptation object detection methods. Both proposed loss functions significantly improve the baseline performance, which shows that the model learns a robust feature representation; the COCO dataset is used for further verification.
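A minimal sketch of the adversarial-consistency idea described above: a domain classifier is first trained with binary cross-entropy to separate source (glass) from target (composite) features, then the backbone is updated so both domains receive the same label. The toy backbone, the single-step alternating schedule, and the label choice are assumptions, not the paper's exact training recipe.

```python
# Hedged sketch of the two-step adversarial consistency update.
import torch
import torch.nn as nn

feat = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(1), nn.Flatten())      # toy backbone
dom = nn.Sequential(nn.Linear(16, 1))                             # domain classifier
bce = nn.BCEWithLogitsLoss()
opt_f = torch.optim.SGD(feat.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(dom.parameters(), lr=1e-3)

src = torch.randn(8, 3, 64, 64)    # labeled glass-insulator crops (source)
tgt = torch.randn(8, 3, 64, 64)    # unlabeled composite-insulator crops (target)

# Step 1: train the domain classifier on detached features (source=1, target=0).
d_loss = bce(dom(feat(src).detach()), torch.ones(8, 1)) + \
         bce(dom(feat(tgt).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Step 2: fix the classifier and update the backbone so both domains are pushed
# toward the same label, i.e. the classifier can no longer separate them.
f_loss = bce(dom(feat(src)), torch.ones(8, 1)) + bce(dom(feat(tgt)), torch.ones(8, 1))
opt_f.zero_grad(); f_loss.backward(); opt_f.step()
print(float(d_loss), float(f_loss))
```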
- Attention-mechanism-based light single shot multiBox detector modelling improvement for small object detection on the sea surface. Jia Kexin, Ma Zhenghua, Zhu Rong, Li Yonggang. doi:10.11834/jig.200517
21-04-2022
Abstract:Objective Object detection on the sea surface plays a key role in the development and utilization of marine resources. The sea environment is complex and changeable, and there are many kinds of objects. Considering the factors such as safety and obstacle avoidance, the shooting process of sea surface object detection images will target on the amount of small and medium-sized objects in the image majority, which puts forward higher requirements for accurate detection of objects on the sea surface. Although some regular object detection methods with good detection results have been proposed, they still face the problems of low detection accuracy and slow detection speed. With the rapid development of deep learning theory, the feature extraction capability of deep learning model is gradually mature, and it is widely used in object detection technology. Compared with the original object detection methods, deep-learning-based object detection method has its priority in speed and accuracy. Deep-learning-based object detection method focuses on the construction of deeper network to improve the detection accuracy. The network model usually has the difficulties with too large parameters, which leads to the slow detection speed. Most of the good detection network can only run on high-performance graphics processor unit (GPU), which requires higher computing power equipment. It will also interfere the detection accuracy of the network if the model is compressed. In addition, the initial deep-learning-based object detection method is a detection model designed for the general object dataset. For the small object in the image, the detection effect is not very ideal. In terms of the characteristics of the sea object detection image, the general object detection model will miss the detection of small objects, and the detection effect of some small-targeted object detection models for sea objects needs to be verified. Method The original data of this demonstration is based on the marine obstacle detection dataset 2 (MODD 2), which is mainly composed of boats, buoys and other sea objects. Total 5 050 images of them are used in the illustrated data. To construct the sea surface object dataset, the boats and buoys are calibrated by calibration software called LabelImg, and processed in accordance with the format of visual object class 2007 (VOC2007) dataset. First, on the basis of standard single shot MultiBox detector (SSD) object detection model, Visual Geometry Group network-16 (VGG-16) backbone network is substituted via depth wise separable convolution feature extraction network based on Xception network. The detection effect of different network models is compared based on variables application, including VGG-16-based SSD network, Mobilenet-based SSD network and Xception-based SSD network. In the process of training, the size of the input image is scaled to the RGB image of 300×300 pixels. The following input images are normalized. The trained model is based on the Xception pre-trained model on common objects in context(COCO) dataset. Next, the SSD + Xception object detection model is used as lightweight SSD model based on Xception feature extraction network. The lightweight attention mechanism module is evolved into exit flow layer and Conv1 layer in feature extraction network to improve the detection accuracy, and the detection effect is compared with the model of lightweight attention mechanism module in other layers. 
The model parameters (params), floating-point operations (FLOPs), and the number of images processed per second (frames per second, FPS) are reported; the precision rate and miss rate are used to evaluate the detections, and the mean average precision (mAP) is used to evaluate overall performance. Finally, both the small objects and the normal-sized objects in the sea object dataset are tested with the improved attention-based lightweight SSD model and the comparison models. Result A series of comparative experiments is conducted to prove the effectiveness of the model. First, the parameters and floating-point operations of each model are compared and the source of the lightweight effect is analyzed. Reducing the number of multiplication and addition operations improves memory read/write efficiency and thus lightens the network, but compressing the model also weakens its feature expression ability and therefore affects detection accuracy to a certain extent. The SSD detector with MobileNet as the feature extraction network is the lightest, but its accuracy suffers the most, with the mAP reduced by 2.28%. The SSD + Xception model is therefore chosen as the lightweight SSD sea object detector. It removes only a moderate number of multiplication and addition operations, reducing the parameters by 19.01% and the floating-point operations by 18.40%, which preserves the feature expression ability of the model; the number of images processed per second remains basically unchanged and the accuracy drops only slightly, so the network is lightened while detection accuracy is largely maintained. To further improve the accuracy of the lightweight SSD sea object detector, a lightweight attention mechanism module is added to the lower layers of the model so that it focuses on salient information, which strengthens the semantic information of small-object features. Compared with the standard SSD detector, the average precision of the buoy class increases by 1.1%, its miss rate decreases by 3%, the mAP increases by 0.51%, the parameters decrease by 16.26%, and the floating-point operations decrease by 15.65%. At the same time, the average precision of the boat class is preserved and no additional missed detections are introduced. The model also shows good performance on the small objects used for dataset verification. Conclusion For small object detection in sea-surface images, the proposed model increases detection speed while guaranteeing detection accuracy, achieving a lightweight network. Moreover, it reduces the miss rate for small objects and thus detects small sea objects effectively.
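As a concrete illustration of the kind of building blocks this abstract describes, the following PyTorch sketch shows a depthwise separable convolution block (as used in Xception-style backbones) together with a lightweight squeeze-and-excitation-style channel attention module; the class names, reduction ratio, and placement are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal PyTorch sketch (not the authors' code): a depthwise separable
# convolution block plus a lightweight channel attention module of the kind
# inserted into the backbone's exit-flow/Conv1 layers.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # depthwise: one 3x3 filter per input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        # pointwise: 1x1 convolution mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)   # reweight channels, keep spatial size

x = torch.randn(1, 128, 38, 38)        # a mid-level SSD feature map
y = ChannelAttention(256)(DepthwiseSeparableConv(128, 256)(x))
print(y.shape)                          # torch.Size([1, 256, 38, 38])
```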
- The salient object detection based on attention-guided network He Wei, Pan Chendoi:10.11834/jig.200658
21-04-2022
288
273
Abstract:Objective Salient object detection aims to locate the most attention-grabbing part of an image and segment the shape of the salient objects. Selective attention allows humans to allocate the limited resources of the brain to the most important information in a visual scene, which gives the visual system high efficiency and precision, and salient object detection simulates this attention mechanism of the human brain. It is commonly applied in image editing, visual tracking, and robot navigation. Existing methods based on hand-crafted visual features such as brightness, color, and motion are widely used to detect salient objects, but the lack of high-level semantic information limits their ability to handle complex scenes. The pyramid structure of deep convolutional neural networks (DCNNs) extracts both low-level information and high-level semantic information through repeated convolution and pooling operations, and the feature extraction capability of convolutional neural networks has been widely exploited in computer vision; fully convolutional networks (FCNs) have therefore been adopted for salient object detection. Multi-level feature fusion strategies such as addition and concatenation are commonly used, but they often ignore the different contributions of different features to salient objects and lead to sub-optimal solutions: noisy low-level features and fuzzy boundaries in high-level features reduce detection accuracy. Hence, we design a new model for salient object detection. Our model assigns different weights to attention features, and a variety of attention mechanisms are used to guide the fusion of feature information block by block. Method A feature aggregation network based on attention mechanisms is constructed for salient object detection. The proposed network uses several attention mechanisms to assign different weights to the information in different feature maps, which enables the effective aggregation of deep and shallow features. The network is mainly composed of a feature extraction module (FEM), a channel-spatial attention aggregation module (C-SAAM), and an attention residual refinement module (ARRM), and it is trained by minimizing the pixel position aware (PPA) loss. The FEM obtains rich context information through multi-scale feature extraction. The C-SAAM aggregates the edge information of shallow features with high-level semantic features; unlike addition and concatenation, it uses channel attention and spatial attention to aggregate multi-layer features and alleviates the redundancy introduced by naive fusion. We also design the ARRM as a residual refinement module to further refine the fused output and enhance the feature representation. ResNet-50 is used as the backbone of the encoder, and transfer learning is applied by initializing the network with parameters trained on ImageNet. The DUTS-TR dataset is used for training. In the training stage, the input images and ground-truth masks are resized to 288×288 pixels, and an NVIDIA RTX 2080Ti GPU is used. Mini-batch stochastic gradient descent (SGD) is used to optimize the network, with a learning rate of 0.05, a momentum of 0.9, a weight decay of 5E-4, and a batch size of 24. Without a validation set, the model is trained for 30 epochs, and the whole training process takes 3 hours.
During testing, the inference time for a 320×320 pixel image is 0.02 s (50 frame/s), which meets real-time requirements. Result We compared our model with 13 models on five public datasets. To comprehensively evaluate the proposed model, the precision-recall (PR) curve, the F-measure score and curve, the mean absolute error (MAE), and the E-measure were adopted. On the challenging DUT-OMRON dataset, the F-measure increases by 1.9% and the MAE decreases by 1.9% compared with the second-best model. In addition, the PR curves and F-measure curves on the five datasets are plotted to evaluate the segmented salient objects. Compared with other methods, our F-measure curve remains higher across different thresholds, which demonstrates the effectiveness of the model. The visualized examples show that our model predicts high-quality saliency maps and filters out non-salient areas. Conclusion The proposed aggregation network based on channel-spatial attention guidance effectively extracts and combines high-level and low-level features from the input image.
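The channel-spatial attention aggregation described above can be illustrated with a minimal PyTorch sketch: the module below fuses a shallow and a deep feature map with channel attention on the semantic branch and spatial attention on the edge branch. Its exact structure is an assumption and will differ from the paper's C-SAAM.

```python
# Hedged sketch of channel-spatial attention-guided fusion of a shallow and a
# deep feature map (the paper's C-SAAM will differ in detail).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, shallow, deep):
        # upsample the deep (semantic) feature to the shallow resolution
        deep = F.interpolate(deep, size=shallow.shape[2:], mode='bilinear', align_corners=False)
        deep = deep * self.channel_att(deep)                      # which channels matter
        pooled = torch.cat([shallow.mean(1, keepdim=True),
                            shallow.max(1, keepdim=True)[0]], dim=1)
        shallow = shallow * self.spatial_att(pooled)              # where edges matter
        return shallow + deep

fuse = AttentionFusion(64)
out = fuse(torch.randn(1, 64, 72, 72), torch.randn(1, 64, 36, 36))
print(out.shape)   # torch.Size([1, 64, 72, 72])
```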
- Double template fusion based siamese network for robust visual object tracking Chen Zhiliang, Shi Fanhuaidoi:10.11834/jig.200660
21-04-2022
259
250
Abstract:Objective Visual object tracking (VOT) remains a challenging problem in computer vision research. Current trackers can be roughly divided into two categories: correlation filter trackers and Siamese network based trackers. Correlation filter trackers train a regressor based on circular correlation in the Fourier domain. Siamese network based trackers have improved both the speed and the accuracy achievable with deep features. A Siamese network consists of two branches that implicitly encode the original patches into another space and then fuse them with a specific tensor to generate a single output. However, most Siamese trackers rely on a single fixed template, which makes it difficult to handle occlusion, appearance change, and distractors. We present an efficient and robust Siamese tracker with double template fusion, referred to as Siam-DTF, whose double template mechanism provides better robustness. Method Siam-DTF consists of three branches: the initial template z, the appearance template za, and the search area x. First, we introduce the appearance template search module (ATSM), which fully exploits the information of historical frames to efficiently obtain an appropriate, high-quality appearance template when the initial template is no longer consistent with the current frame. The appearance template, which is flexible and adapts to appearance changes of the object, represents the object well under hard tracking conditions. We choose the frame with the highest confidence among the historical frames to crop the appearance template. To filter out low-quality templates, we drop the appearance template if its predicted box has a low intersection-over-union or if its confidence score is lower than that of the initial template. To balance accuracy and speed, a sparse update strategy is applied to the appearance template. Through theoretical analysis and experimental validation, we show that the change in the tracker's confidence score reflects tracking quality well: when the maximum confidence of the current frame is lower than the average confidence of the historical N frames by a certain margin m, the ATSM is invoked to update the appearance template. Next, a fusion module combines the two templates to obtain more robust results; the initial template branch and the appearance template branch are integrated through both score-map fusion and feature fusion. Result Nine trackers, including correlation filter trackers and Siamese network based trackers, are evaluated on three public tracking benchmarks: object tracking benchmark 2015 (OTB2015), VOT2016, and VOT2018. On OTB2015, the quantitative metrics are the area under curve (AUC) and precision, and the proposed Siam-DTF ranks first in both. Compared with the baseline tracker, Siamese region proposal network++ (SiamRPN++), Siam-DTF improves the AUC by 0.6% and the precision by 1.3%. Because it exploits the deep features of historical frames, Siam-DTF also surpasses the correlation filter tracker with efficient convolution operators (ECO) by 0.8% in precision. On VOT2016 and VOT2018, the quantitative metrics are accuracy (average overlap during successful tracking) and robustness (number of failures).
The overall performance is evaluated with the expected average overlap (EAO), which accounts for both accuracy and robustness. On VOT2016, Siam-DTF achieves the best EAO score of 0.477 and the lowest failure rate of 0.172. In terms of EAO, our method outperforms the baseline tracker SiamRPN++ and the second-best tracker SiamMask_E by 1.6% and 1.1%, respectively. Our method also decreases the failure rate from 0.200 to 0.172 compared with SiamRPN++, indicating that the Siam-DTF tracker is robust. On VOT2018, Siam-DTF obtains a good accuracy of 0.608 and the second-best EAO score of 0.403. As for tracking speed, Siam-DTF not only achieves a substantial improvement in accuracy but also runs efficiently at 47 frames per second (FPS). In summary, these consistent results show the strong generalization ability of the proposed tracker. Conclusion We propose an efficient and robust Siamese tracker with double template fusion, referred to as Siam-DTF. Siam-DTF fully exploits the information of historical frames to obtain an appearance template with good adaptability, and consistent results on all three benchmarks demonstrate its effectiveness.
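The sparse appearance-template update rule (update only when the current confidence drops below the historical average by a margin, then crop from the highest-confidence historical frame) can be sketched as follows. The margin value, history length, and the crop_patch helper are hypothetical, and the intersection-over-union filter mentioned in the abstract is omitted for brevity.

```python
# Illustrative sketch of a sparse appearance-template update rule; thresholds
# and helper names are assumptions, not the authors' released code.
from collections import deque
import numpy as np

def crop_patch(frame, box):
    """Hypothetical helper: crop an (x1, y1, x2, y2) box from an H x W x 3 frame."""
    x1, y1, x2, y2 = map(int, box)
    return frame[y1:y2, x1:x2]

class TemplateUpdater:
    def __init__(self, n_history=20, margin=0.1):
        self.history = deque(maxlen=n_history)   # (confidence, frame, box)
        self.margin = margin

    def step(self, confidence, frame, box, init_confidence):
        """Return a new appearance template patch, or None when no update is triggered."""
        self.history.append((confidence, frame, box))
        avg_conf = sum(c for c, _, _ in self.history) / len(self.history)
        if confidence >= avg_conf - self.margin:   # tracking still reliable: keep current template
            return None
        best_conf, best_frame, best_box = max(self.history, key=lambda t: t[0])
        if best_conf < init_confidence:            # low-quality candidate: drop it
            return None
        return crop_patch(best_frame, best_box)

updater = TemplateUpdater()
frame = np.zeros((360, 640, 3), dtype=np.uint8)
# With a single frame the confidence cannot fall below its own average, so this prints None.
print(updater.step(0.45, frame, (100, 80, 180, 200), init_confidence=0.4))
```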
- Image semantic segmentation based on manifold regularization constraint Xiao Zhenjiu, Zong Jiaxu, Lan Hai, Wei Xian, Tang Xiaoliangdoi:10.11834/jig.200527
21-04-2022
224
294
Abstract:Objective Image semantic segmentation is one of the essential problems in computer vision and image processing. It aims to assign each pixel in an image to a semantic category, i.e., to make pixel-level predictions, and it has been widely used in fields such as scene understanding, autonomous driving, and computer-aided medical diagnosis. Competitive performance is still hindered by challenges such as low contrast, uneven luminance, and complicated scenes, and the performance of semantic segmentation algorithms is mainly constrained by how spatial context information is exploited. Current deep-learning-based methods for semantic segmentation focus on capturing the contextual information between pixels. For instance, the attention mechanism builds an element-wise weight matrix that captures the similarity between pixels and uses it as coefficients to aggregate the input, while probabilistic graphical models introduce the spatial context as a prior to enhance classification confidence. However, these methods require massive computational resources (e.g., GPU memory). We propose a contextual information capturing method based on manifold regularization. By assuming that the data in the input image and the segmentation prediction share the same local geometric structure on a low-dimensional manifold, the method exploits the relationships among pixels in a more efficient way. The resulting manifold regularization algorithm captures spatial context from a geometric perspective and can be embedded into a deep learning framework to improve performance without increasing the number of parameters or the inference time. Method Contextual information in the image is captured effectively through manifold regularization. The DeepLab-v3 architecture, which uses a residual network (ResNet) as the backbone, extracts the image features. The last two down-sampling layers of the model are pruned, and dilated convolution is employed in the subsequent convolutional layers to control the feature resolution. In conventional segmentation, the cost function simply sums the per-pixel cross-entropy between the prediction and the ground truth, without any context information. We design a manifold regularization penalty that combines single-pixel information with neighborhood context information. The geometric intuition is that the initial image data and the segmentation result share the same local geometric shape, i.e., clusters of data points in the input image correspond to clusters of data points in the output: when two input data points are close in the manifold sub-space, the corresponding segmentation result points are also close, and vice versa. Furthermore, the image is divided into sub-image patches to capture the relationships within each patch and customize the constraints between pixels. Hierarchical manifold regularization constraints are obtained by dividing the image into patches of different sizes. When the patch size is minimal, the constraint essentially acts between pixels and the approach behaves like other pixel-wise context-aware algorithms such as the fully connected conditional random field (CRF) model.
On the contrary, when the patch size is maximal and equals the input image size, the approach becomes a semi-supervised learning algorithm based on interconnected samples. The proposed model improves segmentation accuracy and achieves state-of-the-art performance. It is evaluated on two public datasets, Cityscapes and PASCAL VOC 2012 (pattern analysis, statistical modeling and computational learning visual object classes 2012). Performance is measured by the mean intersection-over-union (mIoU) averaged over all classes. The open-source toolbox PyTorch is used to build the model, and stochastic gradient descent (SGD) is adopted as the optimizer. In addition, data augmentation is performed by random cropping and random flipping according to preset probabilities. The operating system of the experimental platform is CentOS 7, with an NVIDIA RTX 2080Ti GPU and an Intel(R) Core(TM) i7-6850 CPU. Result Experiments are conducted to verify the effect of manifold regularization. The algorithm improves the accuracy of the segmentation model without increasing computational complexity at inference time. On the PASCAL VOC 2012 dataset, the ResNet50-backbone model improves by 0.8% with manifold regularization, while the ResNet101-backbone model gains 2.1% mIoU. These results indicate that manifold regularization performs better with larger network models, and the results on the Cityscapes dataset support this inference: the ResNet50 model increases by 0.3% while the ResNet101 model increases by 0.5%. Compared with other context aggregation methods, we achieve an mIoU of 78.0% on the Cityscapes dataset and 69.5% on the PASCAL VOC 2012 dataset. Furthermore, visualization of the segmentation results shows that, with the manifold regularization constraints, the generated segmentations are more accurate at the edges and contain fewer errors. Conclusion This work presents a novel algorithm that introduces contextual information into image semantic segmentation via manifold regularization constraints, which can be incorporated into a deep network to improve segmentation performance without changing the network structure. The results verify that the proposed algorithm has good generalization capability in semantic segmentation.
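A minimal sketch of a manifold (graph Laplacian) regularization penalty over one image patch is given below, assuming a heat-kernel affinity on the input pixels; the paper's exact formulation, patch hierarchy, and weighting are not reproduced here.

```python
# Rough sketch (an assumption, not the paper's implementation) of a manifold
# regularization penalty inside one patch: predictions of pixels that are close
# in the input space are encouraged to stay close as well.
import torch

def manifold_regularization(patch_feats, patch_logits, sigma=1.0):
    """patch_feats: (N, C) input pixel features; patch_logits: (N, K) predictions."""
    # affinity of the input pixels (heat kernel on squared feature distances)
    dist = torch.cdist(patch_feats, patch_feats) ** 2            # (N, N)
    w = torch.exp(-dist / (2 * sigma ** 2))
    # graph Laplacian quadratic form: sum_ij w_ij * ||p_i - p_j||^2
    pdist = torch.cdist(patch_logits, patch_logits) ** 2
    return (w * pdist).sum() / patch_feats.shape[0] ** 2

feats = torch.randn(64, 3)                        # e.g. RGB values of an 8x8 patch
logits = torch.randn(64, 19, requires_grad=True)  # 19 classes, as in Cityscapes
loss = manifold_regularization(feats, logits)
loss.backward()
print(loss.item())
```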
- Real-time semantic segmentation analysis based on cavity separable convolution and attention mechanism Wang Nan, Hou Zhiqiang, Pu Lei, Ma Sugang, Cheng Huanhuandoi:10.11834/jig.200729
21-04-2022
326
346
Abstract:Objective Image semantic segmentation is an essential part of computer vision and supports applications such as autonomous driving, scene recognition, medical image analysis, and unmanned aerial vehicles (UAVs). To acquire global information more effectively, current semantic segmentation models summarize the context of different regions with pyramid pooling modules. Multi-scale feature extraction based on cavity (i.e., dilated) convolution can enlarge the receptive field at different rates without changing the number of parameters, and feature pyramid networks extract features with a multi-scale pyramid structure. Both approaches improve segmentation accuracy, but practical applications are constrained by network size and inference speed, so designing a small, fast, and efficient real-time semantic segmentation network remains challenging. To meet both the accuracy and the real-time requirements of semantic segmentation, a real-time method is proposed based on a cavity separable convolution module and an attention mechanism. Method First, depthwise separable convolution is combined with cavity convolutions of different rates to design a cavity separable convolution module. Next, a channel attention module and a spatial attention module are added at the end of the network to enhance the representation of channel and spatial information; these are integrated with the original features to obtain the final fused features and further improve the feature representation capability. Finally, the fused features are up-sampled to the size of the original image to predict per-pixel categories and complete the segmentation. The implementation can be divided into a feature extraction stage and a feature enhancement stage. In the feature extraction stage, the input image is processed by the cavity separable convolution module for intensive feature extraction. The module first applies a channel split operation that divides the channels in half into two branches. In each branch, the standard convolution is replaced by depthwise separable convolution to extract features more efficiently and reduce the number of parameters, and cavity convolutions with different rates are used in the convolutional layers of the two branches to expand the receptive field and obtain multi-scale context information. In the feature enhancement stage, the extracted features are re-integrated to strengthen the feature representation as follows: first, channel attention and spatial attention branches are added to enhance the expression of channel and spatial information; next, a global average pooling branch is added to incorporate global context and further improve segmentation performance; finally, all branch features are fused and up-sampled to the resolution of the input image. Result The method is evaluated on the Cityscapes and CamVid datasets. The segmentation accuracy reaches 70.4% on Cityscapes and 67.8% on CamVid, the running speed is 71 frame/s, and the model has only 0.66 M parameters.
Compared with the original method, our method improves the segmentation accuracy by 1.2% on both datasets without sacrificing speed. Conclusion To meet the accuracy and real-time requirements of semantic segmentation, a real-time method is proposed based on the cavity separable convolution module and the attention mechanism. The redesigned module combines depthwise separable convolution with cavity convolution, using a different cavity rate in each separable branch to obtain receptive fields of different sizes, and the channel attention and spatial attention modules are incorporated. As a result, the method reduces the number of model parameters while learning richer feature information, and it achieves real-time semantic segmentation comparable to deeper network models with context aggregation modules.
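The channel-split module with dilated ("cavity") depthwise separable branches can be sketched roughly as follows; the branch layout, dilation rates, and residual connection are assumptions about the general structure rather than the authors' code.

```python
# Minimal PyTorch sketch of a channel-split block whose two branches use
# depthwise separable convolutions with different dilation rates.
import torch
import torch.nn as nn

class DilatedSeparableBlock(nn.Module):
    def __init__(self, channels, rate1=1, rate2=2):
        super().__init__()
        half = channels // 2
        self.branch1 = self._branch(half, rate1)
        self.branch2 = self._branch(half, rate2)

    @staticmethod
    def _branch(ch, rate):
        return nn.Sequential(
            # depthwise convolution with dilation "rate" keeps the spatial size
            nn.Conv2d(ch, ch, 3, padding=rate, dilation=rate, groups=ch, bias=False),
            nn.Conv2d(ch, ch, 1, bias=False),    # pointwise mixing
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True))

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)          # channel split into two halves
        out = torch.cat([self.branch1(a), self.branch2(b)], dim=1)
        return out + x                           # residual connection (an assumption)

x = torch.randn(1, 64, 128, 256)
print(DilatedSeparableBlock(64)(x).shape)        # torch.Size([1, 64, 128, 256])
```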
- Automatic art analysis based on adaptive multi-task learning Yang Bing, Xiang Xueqin, Kong Wanzeng, Shi Yan, Yao Jinliangdoi:10.11834/jig.200648
21-04-2022
202
281
Abstract:Objective Multi-task learning aims to improve learning efficiency and prediction accuracy by tackling multiple tasks jointly, under the assumption that generic features shared across tasks are learned before task-specific ones. Multi-task learning has been applied to a variety of computer vision problems, including object detection and tracking, object recognition, human identification, and facial attribute classification. The worldwide digitization of artwork has brought art research into the scope of computer vision and further facilitates cultural heritage preservation. Automatic artwork analysis addresses the style, the content, or other attributes of paintings for art research. Our multi-task learning approach to automatic art analysis is based on historical, social, and artistic information. Existing multi-task joint learning methods learn multiple tasks with a weighted sum of losses whose weights must be tuned manually, which is labor-intensive and time-consuming. Our method provides art classification and art retrieval tools for digital art museum applications, helping researchers understand the connotation of artworks and further supporting traditional cultural heritage research. Method A multi-objective learning method is developed based on Bayesian theory. Following the Bayesian analysis, we use the correlation between tasks and introduce task clustering to constrain the model. We then formulate a multi-task loss function by maximizing the Gaussian likelihood derived from homoscedastic uncertainty, i.e., task-dependent uncertainty in Bayesian modeling. Result For the art classification and art retrieval tasks, we adopt the SemArt dataset, a recent multi-modal benchmark for understanding the semantics of art, which was designed for cross-modal retrieval of art paintings and can readily be adapted to painting classification. The dataset contains 21 384 painting images, randomly split into training, validation, and test sets of 19 244, 1 069, and 1 069 samples, respectively. First, we conduct art classification experiments on the SemArt dataset and evaluate performance by classification accuracy, i.e., the proportion of correctly predicted paintings among all paintings in the test set. The classification results show that our adaptive multi-task learning model outperforms the previous multi-task learning model in which the weight of each task is fixed; for example, in the "Timeframe" classification task, the improvement is about 4.43% with respect to the previous model. The previous model is also limited by requiring two forward-backward passes to calculate the task-specific weights. The classification results further validate the importance of introducing weighting constraints into our model. Next, we evaluate our model on cross-modal art retrieval tasks. Experiments follow the Text2Art Challenge evaluation, in which painting samples are ranked by their similarity to a given text, and vice versa. Ranking results are evaluated by the median rank and the recall rate at K, with K being 1, 5, and 10, on the test set.
The median rank is the value separating the higher half of the relevant ranking positions among all samples, whereas the recall at K represents the proportion of samples whose relevant image appears in the top K positions of the ranking. Compared with the most recent knowledge-graph-based model, the improvement on the author attribute is about 9.91% on average, which is consistent with the classification results. Finally, we compare our model with human evaluators. Given an artistic text containing a comment, title, author, type, school, and timeframe, participants are asked to pick the most appropriate painting out of a collection of 10 images. The task has two difficulty levels: at the easy level, the 10 painting images are randomly selected from the test set, while at the difficult level the 10 images share the same attribute category (e.g., portraits or landscapes). Each participant completes the task for 100 artistic texts at each level, and performance is reported as the proportion of correct choices over all responses. The results show that our model's accuracy is close to that of the human evaluators. Conclusion We propose an adaptive multi-task learning method that weights multiple loss functions based on Bayesian theory for automatic art analysis. We conduct extensive experiments on a publicly available art dataset, covering both art classification and art retrieval tasks.
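The homoscedastic-uncertainty weighting described above is commonly implemented by learning one log-variance per task, as in the following sketch; the exact loss form and the task-clustering constraint used in the paper are not reproduced here.

```python
# Sketch of homoscedastic-uncertainty loss weighting, in the spirit of the
# Bayesian formulation described above (the paper's exact form may differ).
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, num_tasks):
        super().__init__()
        # learn log(sigma^2) per task; initialised to 0, i.e. sigma = 1
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])     # 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]
        return total

weighter = UncertaintyWeightedLoss(num_tasks=3)
losses = [torch.tensor(0.7), torch.tensor(1.2), torch.tensor(0.3)]
print(weighter(losses))
```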
- An attention mechanism based inter-reflection compensation network for immersive projection system Lei Qinghua, Yang Ting, Cheng Pengdoi:10.11834/jig.200608
21-04-2022
197
247
Abstract:Objective Immersive projection systems are a focus of current virtual reality and augmented reality research. In such systems, inter-reflection strongly affects the quality of the projected images and the fidelity of the rendered scenes. Inter-reflection refers to the brightness redundancy caused by the overlap of projector light and light reflected from the screen in an immersive projection system, and it severely degrades imaging quality. Because of the complexity of light transport in an immersive environment, eliminating inter-reflection optically is challenging. Method A new and simple image prior, the inner-reflection channel (IRC) prior, and a new attention-guided neural network, Pair-Net, are used to generate high-quality inter-reflection-compensated projection images for immersive projection systems. The IRC prior is a statistic of projection images in immersive projection systems: most inter-reflection-affected projection images contain regions of high-intensity pixels, and these high-intensity local patches, which are the ones affected by inter-reflection, can serve as an attention map for training the compensation network. Guided by the IRC prior, Pair-Net learns the complex reflection and compensation functions of the immersive projection environment. Result Our experiments show improvements in region-of-interest (ROI) metrics and in human visual perception compared with four existing methods. Pair-Net learns the complex inter-reflection information and attends to the regions with strong inter-reflection, and its results are superior to end-to-end projection compensation methods both qualitatively and quantitatively. Conclusion Our method demonstrates its qualitative and quantitative effectiveness by a significant margin. Immersive projection systems have been widely used in large-scale virtual reality scenes, but inter-reflection exists in almost all of them and can heavily degrade the quality of the projected image and the fidelity of the scene; these problems often become bottlenecks for the adoption of projector systems and block the implementation of virtual reality projects. Inter-reflection compensation aims to adjust the projector input image so as to enhance projection quality and reduce the effect of inter-reflection. A typical compensation system consists of an in-situ projector-camera (pro-cam) pair and a curved screen, and geometric modeling is used to recover the light transport and reflection function. However, solving the light transport and reflection function of an immersive projection environment requires inverting a potentially large matrix, and traditional compensation solutions struggle to produce visually high-quality results because mathematical errors are inevitable. In addition, current solutions tend to compensate the whole image and ignore the fact that a single image contains regions of different intensities. Convolutional neural network (CNN) based photometric compensation algorithms have recently advanced the photometric compensation domain. We therefore introduce the IRC prior and Pair-Net for inter-reflection compensation: Pair-Net attends to different patches of the image in multi-intensity immersive projection scenarios, adopting attention mechanisms for compensating regions of different intensities and using the IRC prior to obtain the attention map.
We design Pair-Net with two sub-nets that pay different attention to the higher-intensity and lower-intensity regions of a single image. The two auto-encoder sub-nets encourage rich multi-level interaction between the camera-captured projection image and the ground-truth image, thereby capturing the reflection information of the projection screen, and the IRC prior drives the two sub-nets to attend to regions of different intensity in the immersive projection scenario. In summary, we first propose an attention-guided inter-reflection compensation network, Pair-Net, for immersive projection systems, and the IRC prior is used to generate the initial attention map.
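A hypothetical sketch of turning an inner-reflection-channel-style statistic into an attention map is shown below: the per-patch maximum intensity marks the regions most affected by inter-reflection. The patch size and the exact definition of the IRC prior are assumptions, not the paper's formula.

```python
# Hypothetical sketch: build an attention map from a bright-channel statistic
# (the paper's IRC prior may be defined differently).
import torch
import torch.nn.functional as F

def irc_attention_map(image, patch_size=15):
    """image: (B, 3, H, W) in [0, 1]; returns a (B, 1, H, W) map in [0, 1]."""
    # maximum over colour channels, then over a local patch (max pooling)
    channel_max = image.max(dim=1, keepdim=True)[0]
    pad = patch_size // 2
    return F.max_pool2d(channel_max, patch_size, stride=1, padding=pad)

img = torch.rand(1, 3, 240, 320)
att = irc_attention_map(img)
print(att.shape)   # torch.Size([1, 1, 240, 320])
```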
Remote Sensing Image Processing
- Spatial divide and conquer based remote sensing image quick matching Wei Chunyang, Qiao Yanyoudoi:10.11834/jig.200768
21-04-2022
229
329
Abstract:Objective Image matching is crucial to remote sensing, and many feature matching algorithms have been designed to establish correspondences and filter outliers. Existing remote sensing image matching methods are either region based or feature based. Feature-based methods show greater robustness and accuracy when processing complex scenarios such as brightness changes, homogeneous textures, and geometric distortions. They are typically implemented in two stages: first, robust feature points are extracted and the nearest-neighbor distance ratio (NNDR) is used to establish putative matches; second, a geometric model or spatial structure is adopted to filter false matches. The initial putative matching step is time-consuming if mismatches are to be eliminated quickly while keeping a high inlier ratio, and improving the speed of this step remains challenging. Our new quick matching method decreases matching time significantly while maintaining a high inlier ratio. Method First, scale-invariant feature transform (SIFT) features are extracted from the images, the feature points are sorted by scale, and initial matches are established from the top 10% of features by scale using an NNDR threshold. Top-scale SIFT features are chosen for the initial matches because they are few in number and of high quality, so reliable matches can be obtained from them. Next, regularly spaced feature coordinates are sampled in the query image, and an affine model estimated from the initial matches transforms the sampled points into the target image. Because the initial matches extracted from top-scale feature points are highly accurate, the estimated affine model is also accurate. Each transformed sample point is treated as a virtual center point (VCP), i.e., the center of a set of neighboring feature points; a VCP does not itself correspond to an extracted feature position and serves only as a search center for the neighboring feature points. A set of pairwise neighbors is obtained by searching for feature points within an inner rectangular window around each pair of sample points. Finally, feature matching is performed independently within each pair of windows. Given the sets of VCP neighbors retrieved with a range tree, correspondences are established within the pairwise windows. Because most windows contain only a small number of features, correspondences can be established rapidly by traditional brute force (BF) matching, so BF matching is used within the windows instead of a k-dimensional tree (kd-tree). The number of points assumed for each window strongly affects the performance of the spatial divide and conquer (SDC) algorithm: it determines the number of VCPs and the window size, and consequently regulates the average number of feature points per window, which matters for the BF matching performed within the windows. To analyze the effect of window size on matching time and inlier ratio, a group of images with resolutions between 795×530 and 3 976×2 652 pixels is selected. The results indicate that the window size is inversely related to both running speed and inlier ratio. Result Multi-sensor remote sensing data covering various regions of China are used.
The test images are as follows: 1) Landsat 8 images of western Sichuan, China; 2) SPOT satellite images of Beijing; 3) GF-3 synthetic aperture radar (SAR) data for the southeast of Wuhan; and 4) ZY-3 satellite data for the Qingdao area of Shandong province. The images range in size from 816×716 to 2 976×2 273 pixels, and the maximum relative rotation angle is 30°. They cover several types of geographical environment, including mountains, cities, coastal plains, forests, and farmland. Extensive experiments on remote sensing images of various sizes and orientations demonstrate that the proposed method is highly accurate and reduces matching time by 1-2 orders of magnitude compared with traditional and state-of-the-art methods. Conclusion A quick image matching method based on top-scale SIFT features is presented and analyzed. Because the matching procedures for separate windows are independent, parallel computing can further improve the speed of the algorithm.
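The first stage (top-scale SIFT selection followed by NNDR-based initial matching) can be illustrated with OpenCV as below; the 10% keep ratio follows the abstract, while the NNDR threshold and the helper names are assumptions.

```python
# Illustrative OpenCV/NumPy sketch of the initial matching stage only: keep the
# top 10% of SIFT features by scale and apply an NNDR test.
import cv2
import numpy as np

def top_scale_initial_matches(img1, img2, keep_ratio=0.1, nndr=0.8):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    def keep_top_scale(kp, des):
        # keypoint.size is the feature scale (diameter); keep the largest ones
        order = np.argsort([-k.size for k in kp])[: max(1, int(len(kp) * keep_ratio))]
        return [kp[i] for i in order], des[order]

    kp1, des1 = keep_top_scale(kp1, des1)
    kp2, des2 = keep_top_scale(kp2, des2)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = []
    for m, n in matcher.knnMatch(des1, des2, k=2):
        if m.distance < nndr * n.distance:      # nearest-neighbor distance ratio test
            matches.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return matches
```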
Chinagraph 2020
- Gate recurrent unit and generative adversarial networks for scene text removal Wang Chaoqun, Quan Weize, Hou Shiyu, Zhang Xiaopeng, Yan Dongmingdoi:10.11834/jig.200764
21-04-2022
179
289
Abstract:Objective Textual information in digital images is ubiquitous in daily life. While it delivers valuable information, it also carries the risk of leaking private information; for example, when photos are taken or data are collected, private details such as phone numbers inevitably appear in some images. Image text removal technology protects privacy by removing sensitive information from images, and it can also be widely used in image and video editing, text translation, and other related tasks. Tursun et al. added a binary mask as auxiliary information to make the model focus on the text area, which brought clear improvements over earlier scene text removal methods. However, this binary mask is redundant because it covers a large amount of background between text strokes, which means the removed area (indicated by the binary mask) is larger than what actually needs to be removed (i.e., the text strokes), so there is room for further improvement. Considering the unclean text removal and the poor visual quality of existing text removal methods, we propose a gate recurrent unit (GRU)-based generative adversarial network (GAN) framework to remove text effectively and obtain high-quality results. Method Our framework is fully end-to-end. Taking the image with text and the binary mask of the corresponding text area as inputs, the stroke-level binary mask of the input image is first obtained accurately through our detection module composed of multiple GRUs. Then, the GAN-based text removal module combines the input image, the text area mask, and the stroke-level mask to remove the text in the image. Meanwhile, we propose a brightness loss function to further improve visual quality, based on the observation that human eyes are more sensitive to changes in image brightness: the output image is transferred from RGB to the YCrCb color space, and the difference in the brightness channel between the output image and the ground truth is minimized. A weighted text loss function is also used to make the model focus more on the text area. Using the proposed weighted text loss and brightness loss effectively improves text removal performance. Our method applies inverted residual blocks instead of standard convolutions to obtain a high-efficiency text removal model and to balance model size against inference performance. The inverted residual structure first uses a 1×1 point convolution to expand the dimension of the input feature map, which prevents too much information from being lost after the activation function because of low dimensionality; then a depthwise convolution with a 3×3 kernel extracts features, and a 1×1 point convolution compresses the number of channels of the feature map. Result We conduct extensive experiments and evaluate 1 080 groups of real-world data obtained through manual processing and 1 000 groups of synthetic data generated with the SynthText method to validate the proposed method. We compare our method with several state-of-the-art text removal methods, using two kinds of evaluation measures for quantitative evaluation.
The first kind of evaluation indicator comprises PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index), which measure the difference between the text-removed results and the corresponding ground truth. The second kind comprises recall, precision, and F-measure, which measure the model's ability to remove text. The experimental results show that our method consistently performs better in terms of PSNR and SSIM. In addition, we qualitatively compare the results of our method with state-of-the-art (SOTA) methods, and our method achieves better visual quality. The inverted residual blocks reduce the floating-point operations (FLOPs) by 72.0% with only a slight reduction in performance. Conclusion We propose a high-quality and efficient text removal method based on the gate recurrent unit, which takes the image with text and the binary mask of the text area as inputs and produces the text-removed image in an end-to-end manner. Compared with existing methods, our method not only effectively alleviates the problems of unclean text removal and inconsistency between the removed region and the background, but also reduces the model parameters and FLOPs.
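The brightness loss described above (comparing only the luma channel of the YCrCb representation) can be sketched as follows; the BT.601 luma weights are standard, but the norm and weighting used in the paper are not reproduced here.

```python
# Sketch of a brightness (luma) loss: convert RGB to the Y channel of YCrCb
# and penalise the difference against the ground truth.
import torch

def brightness_loss(pred_rgb, gt_rgb):
    """pred_rgb, gt_rgb: (B, 3, H, W) tensors in [0, 1]."""
    weights = torch.tensor([0.299, 0.587, 0.114], device=pred_rgb.device).view(1, 3, 1, 1)
    y_pred = (pred_rgb * weights).sum(dim=1)   # ITU-R BT.601 luma
    y_gt = (gt_rgb * weights).sum(dim=1)
    return torch.abs(y_pred - y_gt).mean()

loss = brightness_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
print(loss.item())
```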
- Dual auto-encoder network for human skeleton motion data optimization Li Shujie, Zhu Haisheng, Wang Lei, Liu Xiaopingdoi:10.11834/jig.200780
21-04-2022
185
195
Abstract:Objective Human motion data are widely used in virtual reality, human-computer interaction, computer games, sports, and medical applications. Motion capture techniques aim to obtain highly precise human motion data. Motion capture (MoCap) systems such as Vicon and Xsens provide high-precision motion data but are expensive and inconvenient for users to wear. Low-cost motion capture technologies, including depth-sensor-based and camera-based approaches, have been developed as alternatives for capturing human motion. However, the raw 3D skeleton motion data captured by these low-cost sensors suffer from calibration error, sensor noise, poor sensor resolution, and occlusion caused by body parts or clothing. Thus, the raw MoCap data must be optimized, i.e., missing data filled in and noise removed, as a preprocessing stage. The accuracy of data optimized by a convolutional auto-encoder (CAE) depends on the noise characteristics and amplitudes, and raw MoCap data such as Kinect skeleton data contain mixed noise of different types and amplitudes because of scene changes or self-occlusion during capture. The bi-directional recurrent auto-encoder (BRA) has therefore been used to optimize raw motion data with heterogeneous mixed noise. BRA yields higher position accuracy, whereas CAE produces much smoother results. Hence, we present an optimized dual auto-encoder network named BCH, which consists of a BRA followed by a CAE: the BRA gives the optimized data high position accuracy, and the CAE gives it better smoothness. Method First, a perceptual auto-encoder is pre-trained with high-precision motion capture data. Its loss function consists of three terms: position loss, bone-length loss, and smoothness loss. Next, the dual auto-encoder is trained for optimization with paired "noisy-clean" data. The perceptual auto-encoder is composed of a convolutional encoder and a convolutional decoder; the encoder contains convolutional, max-pooling, and activation layers, and the decoder contains inverse-pooling and convolutional layers. The dual auto-encoder consists of the BRA and the CAE, where the convolutional auto-encoder has a structure similar to the perceptual auto-encoder described above. The BRA has two components, a bidirectional recurrent encoder and a bidirectional recurrent decoder; the encoder consists of two fully connected layers followed by one bidirectional long short-term memory (LSTM) cell, and the decoder is symmetric with the encoder. Together, the encoder and decoder recover corrupted motion data through projection and inverse projection. A hidden-unit constraint, defined using the perceptual auto-encoder, is imposed when training the dual auto-encoder. Adam stochastic gradient descent is used to minimize the loss functions of the two networks. The batch size is set to 16 and the learning rate to 0.000 01. To avoid overfitting, a dropout of 0.2 is used. The perceptual auto-encoder is trained for 200 epochs and the dual auto-encoder for 300 epochs.
Result Experiments are conducted on a synthetic noise dataset (the Carnegie Mellon University (CMU) Graphics Lab Motion Capture Database) and a raw motion dataset (captured synchronously by Kinect and the NOITOM MoCap system) for verification. Beyond the three deep learning baselines (CAE, BRA, and BRA with a perceptual constraint, denoted BRA-P), ablation studies verify each component of our approach on the two datasets. The quantitative metrics are position loss (mean square error, MSE), bone-length loss, and smoothness loss, and we compare motion data optimization with and without the hidden-unit constraint. The results show that, on both the synthetic noise dataset and the raw motion dataset, our network outperforms the three existing deep learning networks in terms of position loss, bone-length loss, and smoothness loss. The ablation studies on the two datasets confirm that both the dual auto-encoder and the hidden-unit constraint contribute to the refined motion data. In addition, we analyze the time performance of the proposed method on the raw motion test set. The results show that the time consumed by BCH for motion data refinement is approximately the sum of the optimization times of the BRA and CAE methods, which is close to that of the BRA method alone. Conclusion We propose an optimized network with a dual auto-encoder and a hidden-unit constraint. The results on synthetic noise data and raw motion data demonstrate that the proposed network and hidden-unit constraint yield optimized data with higher position accuracy and better smoothness while maintaining the bone-length consistency of the motion data.
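The three pre-training loss terms (position, bone length, temporal smoothness) can be sketched as follows; the weights and the bone list are illustrative assumptions.

```python
# Hedged sketch of the three loss terms used to pre-train the perceptual
# auto-encoder; weights and bone connectivity are assumptions.
import torch

def motion_losses(pred, target, bones, w_pos=1.0, w_bone=0.5, w_smooth=0.1):
    """pred, target: (T, J, 3) joint positions; bones: list of (parent, child) joint indices."""
    pos = ((pred - target) ** 2).mean()                           # position (MSE) loss
    def lengths(x):
        return torch.stack([(x[:, c] - x[:, p]).norm(dim=-1) for p, c in bones], dim=1)
    bone = ((lengths(pred) - lengths(target)) ** 2).mean()        # bone-length consistency
    smooth = ((pred[1:] - pred[:-1]) ** 2).mean()                 # penalise frame-to-frame jitter
    return w_pos * pos + w_bone * bone + w_smooth * smooth

pred = torch.randn(16, 25, 3, requires_grad=True)   # 16 frames, 25 joints (Kinect-style)
target = torch.randn(16, 25, 3)
bones = [(0, 1), (1, 2), (2, 3)]                    # illustrative subset of the skeleton
motion_losses(pred, target, bones).backward()
```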
- Action recognition using ensembling of different distillation-trained spatial-temporal graph convolution models Yang Qingshan, Mu Taijiangdoi:10.11834/jig.200791
21-04-2022
208
237
Abstract:Objective Skeleton-based action recognition, an intensively studied field, aims to classify human actions represented by a sequence of selected key points of the human body into action categories. It has a wide range of applications, including human-computer interaction, elderly monitoring, and video understanding. Recognition accuracy has improved significantly in recent years thanks to deep learning, but few studies have focused on the number of parameters and the robustness of the model. Previous skeleton-based methods use convolutions with large kernels to extract spatial and temporal features and obtain a broad receptive field, which increases the number of parameters and complicates the computation. Many studies have confirmed that graph convolution performs well on skeletal data; however, graph convolution operators are designed manually, and their versatility and robustness are insufficient. We therefore design a lightweight temporal convolution module that preserves a large receptive field for temporal feature learning. In the spatial dimension, we aim for better robustness by constructing the spatial convolution module from two kinds of graph convolutions. We also use data augmentation to increase the diversity of the input data and improve the generalization of the model across viewpoints. To this end, a distillation training method that improves the accuracy of lightweight models is used for training, and a multi-stream spatiotemporal graph convolutional ensemble model is constructed to improve on current methods and increase the accuracy of skeleton-based action recognition. Method We propose a skeleton-based multi-stream ensemble model composed of six sub-networks for action recognition. These sub-networks are of two types: the directed graph convolutional sub-network (DGCNNet) and the adaptive graph convolutional sub-network (AGCNNet). Each sub-network is constructed from temporal convolution modules, spatial convolution modules, and attention modules. The temporal convolution module consists of a 2D depthwise group convolution layer with a 9×1 kernel and a normal convolution layer with a 1×1 kernel. Two types of graph convolution, directed graph convolution and adaptive graph convolution, are used in the spatial convolution module to extract spatial features and enhance the robustness of the model. Three self-attention modules placed between the spatial and temporal convolution modules are applied over the channel, spatial, and temporal dimensions of the features to emphasize informative features. We also introduce a cross-modal distillation training method that uses trained teacher models together with the ground truth to train lighter and more accurate student sub-networks. Distillation training consists of two steps: teacher model training and student model training. The teacher model is trained with the training data and its weights are frozen after training is completed; the student model is then trained with the feature vectors encoded by the frozen teacher model and the ground-truth labels of the training data.
Two previous methods, 2s-AGCN (two-stream adaptive graph convolutional networks for skeleton-based action recognition) and the directed graph neural network (DGNN), are used as teacher models to train the DGCNNet and AGCNNet sub-networks, respectively, according to the type of graph convolution in the spatial convolution module. Cross-entropy loss is used for teacher model training, and a combination of mean squared error loss and cross-entropy loss is used as the final loss for student model training. The student model trained by distillation is not only lighter but also more accurate than the corresponding teacher model. In addition to joints, we also take bones and affine-transformation-augmented data as input to train the student models. Finally, a multi-stream spatiotemporal graph convolutional ensemble model is constructed from the six lightweight sub-networks, yielding better robustness and higher accuracy. The accuracies of our model on the cross-subject and cross-view benchmarks of the NTU RGB+D dataset are 90.9% and 96.5%, respectively, higher than many of the current best approaches. Result We compare our model with 14 of the best models to date on the widely used NTU RGB+D dataset. Our model achieves 90.9% cross-subject accuracy and 96.5% cross-view accuracy on the benchmark. Compared with the teacher model 2s-AGCN, the accuracy increases by 2.4% and 1.4%; compared with the other teacher model, DGNN, it increases by 1.0% and 0.4%; and compared with the baseline, spatial temporal graph convolutional networks (ST-GCN), the accuracies are 9.4% and 8.2% higher, respectively. Extensive experiments also indicate the effectiveness of knowledge distillation on this task, and we explore the effect of different combinations of input modalities on the final accuracy. Conclusion We propose a new multi-stream ensemble model that contains six sub-models trained with the distillation method, each constructed from spatial convolution modules, temporal convolution modules, and attention modules. The experimental results indicate that our model outperforms several state-of-the-art skeleton-based action recognition approaches and that the ensemble algorithm improves performance.
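The student-side objective (cross-entropy on the labels plus a mean-squared-error term against the frozen teacher's features) can be sketched as follows; the balancing weight and feature dimensionality are assumptions.

```python
# Sketch of the student-side distillation objective described above.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, student_feat, teacher_feat, labels, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)             # supervised term
    mse = F.mse_loss(student_feat, teacher_feat.detach())    # teacher is frozen
    return ce + alpha * mse

logits = torch.randn(8, 60, requires_grad=True)   # 60 action classes in NTU RGB+D
s_feat = torch.randn(8, 256, requires_grad=True)  # assumed feature dimension
t_feat = torch.randn(8, 256)
labels = torch.randint(0, 60, (8,))
distillation_loss(logits, s_feat, t_feat, labels).backward()
```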
- A feature visualization method for time-varying volume data Liu Lidoi:10.11834/jig.200781
21-04-2022
214
279
Abstract:Objective Scientific phenomena such as combustion, ocean currents, and hurricanes are inherently time-varying processes that can be represented as data fields with time variables, often referred to as time-varying volume data. Studying the dynamic aspects of scientific phenomena that change over time is critical to the solution of many scientific problems. With the rapid advancement of computing technologies, time-varying volume data are being created to simulate many physical and chemical processes in their spatial and temporal domains with unprecedented accuracy and complexity. Time-varying volumes usually have large sizes (millions or even billions of voxels), long durations (hundreds or even thousands of timesteps), and multiple variables. The challenge of presenting such data provides a powerful impetus for research on the visualization of time-varying volume data: it is important to first present the data efficiently and then allow scientists to interact with the data directly and glean insights into the simulated scientific phenomena. The ability of scientists to visualize time-varying phenomena is essential to ensuring correct interpretation and analysis, fostering insights, and communicating those insights to others. Rendering time-varying volume data to achieve interactive visualization has long been of interest to the visualization community. Methods for visualizing time-varying volume data can be classified broadly into two types: time-independent and time-dependent. Time-independent algorithms process each timestep or multiple timesteps of time-varying data independently and display a sequence of timesteps as an animation. Such methods generally include encoding the data to make it more manageable (e.g., down-sampling in the time domain, data compression, contour extraction), preselecting transfer functions for direct volume rendering, and interactive hardware-accelerated volume rendering. Time-independent algorithms, which do not rely on domain and expert knowledge, have the advantages of easy operation and good flexibility, but they fail to consider the dynamic, time-varying characteristics of the data and cannot highlight the information most important to scientific discovery. Different from time-independent methods, time-dependent methods, usually referred to as "feature-based visualization" or "feature visualization", focus on the features of the data and track their variation by exploiting the consistency of feature movements and interactions between adjacent timesteps. In this context, a feature can be defined from two aspects: 1) regions of interest that can be extracted from the original data, such as shapes, structures, variations, and phenomena, and 2) subsets of interest in the original data. Using techniques from image processing and mathematical morphology, feature visualization algorithms extract amorphous regions from the scalar or vector fields of the data and create correspondences between consecutive timesteps with certain matching criteria. A major advantage of feature visualization over other methods is that it exploits the data coherence between consecutive timesteps and focuses only on the regions of interest, so that users can ignore redundant, unimportant, or uninteresting regions.
The resulting significant reduction in the storage requirement of the data and the rendering cost of visualization tasks makes feature visualization well suited for investigating temporal-spatial variations and motion processes. Feature visualization of time-varying volume data generally includes four major steps: 1) defining features of the data according to domain knowledge or research needs; 2) extracting and quantifying features from the data; 3) tracking the extracted features step by step; and 4) presenting features by isosurface rendering or direct volume rendering. Method In this paper, a method based on feature visualization is proposed to help scientists explore the characteristics and variations of regions of interest in time-varying volume data. The proposed method includes a feature-based data processing part, which combines feature extraction, feature tracking over time, and event query and isolation in one workflow, and three interactive visualizations: feature visualization of a data frame, feature visualization of an individual event, and feature visualization of multiple events in the entire dataset. In the data processing part, features are extracted from every timestep, correspondences are established between features in adjacent timesteps, and the evolution of each feature is classified as birth, continuation, merge, split, or death. With numerous features spanning dozens or even hundreds of timesteps, it is necessary to isolate all occurrences of the same features from the tracking history to help understand the dynamics in the data. In this context, the temporal and spatial evolution of a feature is referred to as an event. A state graph-based event query algorithm is utilized to capture events defined by scientists. After the event query, a list is created to record all isolated events as a sequence of the triplet: timestep, index of the feature in this timestep, and the state in the evolution process. In the visualization part of the method, a web-based viewer is developed to provide a user interface for exploring the feature metadata generated by the feature-based data processing program with the three interactive visualizations. Result The proposed method is applied to four time-varying volume datasets: turbulent vortex, hurricane Isabel, ocean simulation, and hydrothermal plume. The visualization results demonstrate the events of interest from each dataset and further allow users to explore the data from different perspectives, from a single instance to the entire dataset, which confirms the usability and effectiveness of the proposed method.
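To make the extraction-and-tracking step above concrete, the following sketch labels connected regions above a scalar threshold and matches features between two adjacent timesteps by voxel overlap; the threshold and the overlap criterion are illustrative assumptions, not the paper's exact feature definition or matching rules.

```python
# Illustrative sketch of threshold-based feature extraction and overlap-based
# tracking between two adjacent timesteps. The threshold value and the overlap
# matching rule are assumptions for demonstration only.
import numpy as np
from scipy import ndimage

def extract_features(volume: np.ndarray, threshold: float) -> np.ndarray:
    """Label connected regions whose scalar value exceeds the threshold."""
    mask = volume > threshold
    labels, _ = ndimage.label(mask)
    return labels

def match_features(labels_t: np.ndarray, labels_t1: np.ndarray) -> dict:
    """Map each feature at timestep t to the overlapping features at t+1.
    An empty list suggests death; several entries suggest a split."""
    matches = {}
    for fid in range(1, int(labels_t.max()) + 1):
        overlap = labels_t1[labels_t == fid]
        matches[fid] = sorted(set(overlap[overlap > 0].tolist()))
    return matches

# Two synthetic 3D scalar fields standing in for consecutive timesteps.
rng = np.random.default_rng(0)
vol_t = ndimage.gaussian_filter(rng.random((32, 32, 32)), sigma=3)
vol_t1 = ndimage.gaussian_filter(rng.random((32, 32, 32)), sigma=3)
print(match_features(extract_features(vol_t, 0.52), extract_features(vol_t1, 0.52)))
```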
ChinaVR 2020
- Implicit T-spline curve reconstruction with normal constraint Ren Haojie, Shou Huahao, Mo Jiahui, Zhang Hang doi:10.11834/jig.200596
21-04-2022
192
222
Abstract:Objective In computer-aided geometric design and computer graphics, fitting point clouds with smooth curves is a widely studied problem. Measurement data can be taken from real objects using techniques such as laser scanning, structured light source converters, and X-ray tomography. We use these scanned discrete data points for data fitting to perform a general model reconstruction and functional recovery of the original model or product, which is widely used in the fields of geometric analysis and image analysis. The data points used in this paper are unstructured scattered data points. Compared with parametric curves, implicit curves do not need to parameterize scattered data points. Therefore, they are widely studied because of their ability to describe objects with complicated geometry and topology. Because the control points of the conventional implicit B-spline curve need to be arranged regularly over the entire area, a large number of redundant control points are required to satisfy the topological constraints, and local subdivision is limited, which leads to control point redundancy. The T-spline effectively solves this problem. It retains the advantages of B-spline curves and surfaces while admitting the structure of T-nodes, and thus it has many advantages, such as fewer control points and convenient local subdivision. This is the reason why we chose the T-spline to perform implicit curve reconstruction. In some cases, the data points we obtain may contain not only scattered coordinate information but also some shape constraint conditions, such as the processing of data points with normal constraints in the field of optical engineering. Therefore, we not only need to constrain the errors of the data points but also have certain requirements for the normal errors. Hence, an implicit T-spline curve reconstruction algorithm with normal constraints is proposed in this paper. Method We first preprocess the data by adaptively adjusting the density of the sampling points according to the curvature to remove redundant data points, and we add auxiliary points. The step of adding auxiliary points not only avoids singular solutions but also helps to eliminate extra zero level sets. Two-dimensional T-meshes are constructed from the scattered point set by using a binary tree and a subdivision process. Here, we define a maximum number of data points per sub-rectangular block and then count the number of data points in each block. If the count exceeds this maximum, the block is subdivided, and the process repeats until every block contains fewer points than the maximum, which gives the initial T-mesh. Then, an effective curve fitting model is proposed based on the implicit T-spline function. Because the number of equations is far greater than the number of unknowns, we transform the problem of implicit T-spline curve reconstruction into a quadratic optimization problem to obtain the objective function. The objective function of our model is divided into three parts: the fitting error term, the normal term, and the smoothing term. The fitting error term includes the errors of the data points and the auxiliary points. We eliminate extra zero level sets by adding offset points and the smoothing term. We also add the normal term to reduce the normal error of the constructed curve.
According to the optimization principle, we take the partial derivative of the objective function with respect to each of the control coefficients and set it equal to zero. In this way, the original problem is transformed into a system of linear equations. The unknown control coefficients can be obtained by solving this system of linear equations, which solves the problem of implicit curve reconstruction. Finally, we insert control coefficients in the areas with large errors to carry out local subdivision of the T-mesh until the precision requirement is reached, improving the accuracy of the reconstructed implicit curve. Result The proposed method is compared with two existing methods on three datasets, including two concave and convex curves and a complicated hand curve. From the figures in the paper, we can see that although both the proposed method and existing method 1, which constructs implicit equations with normal vector constraints, reconstruct the shape of the implicit T-spline curve, some extra zero level sets appear around the curve produced by method 1, which destroy the quality of the reconstructed curve. The reconstructed results of the proposed method do not have extra zero level sets. The experimental data show that, in terms of the average error and the maximum error of the data points, the algorithm presented in this paper differs little from the two methods; the errors are of the same order of magnitude. However, the proposed algorithm significantly reduces the normal error. In the curves of examples 1 and 2, the proposed algorithm reduces the average normal error from the order of 10^-3 to the order of 10^-4 and the maximum normal error from the order of 10^-2 to the order of 10^-3. Meanwhile, in the curve of example 3, the proposed algorithm can still significantly reduce the normal error, whereas method 2, which uses a least-squares fitting method with added auxiliary points, has the worst normal constraint. In terms of the quality of the reconstructed curve, the extra zero level sets are eliminated by the proposed method, whereas obvious extra zero level sets exist in method 1 and its reconstruction effect is poor. Meanwhile, compared with the implicit B-spline control grid, the number of control points in the T-mesh of the three datasets was only 55.88%, 39.80%, and 47.06% of that in the B-spline grid. Conclusion Experimental results indicate that the proposed algorithm effectively reduces the normal errors under the premise of ensuring the accuracy of the data points. The proposed algorithm also successfully eliminates extra zero level sets and improves the quality of the reconstructed curve. Compared with implicit B-spline curves, the proposed method reduces the number of control coefficients and improves the operation speed. Hence, the proposed method successfully solves the problem of implicit T-spline curve reconstruction with normal constraints.
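The least-squares structure described above can be illustrated with a much simpler basis. The sketch below uses a bivariate monomial basis instead of a T-spline basis (T-spline evaluation is not available in standard numerical libraries) and assembles a data-point term, an auxiliary offset-point term, and a gradient (normal) term into one linear system solved for the coefficients; the basis choice, weights, and offset distance are illustrative assumptions, not the paper's formulation.

```python
# Sketch of an implicit curve fit with data, offset, and normal terms,
# using a monomial basis as a stand-in for the T-spline basis.
# Degrees, weights, and the offset distance are illustrative assumptions.
import numpy as np

def basis(x, y, deg=3):
    """Monomial basis [x^i * y^j for i+j <= deg] and its x/y derivatives."""
    terms, dx, dy = [], [], []
    for i in range(deg + 1):
        for j in range(deg + 1 - i):
            terms.append(x**i * y**j)
            dx.append(i * x**(i - 1) * y**j if i > 0 else 0 * x)
            dy.append(j * x**i * y**(j - 1) if j > 0 else 0 * x)
    return np.stack(terms, -1), np.stack(dx, -1), np.stack(dy, -1)

def fit_implicit(points, normals, offset=0.05, w_normal=0.1):
    x, y = points[:, 0], points[:, 1]
    B, Bx, By = basis(x, y)
    # Auxiliary points offset along the normals keep the trivial zero solution away.
    xo, yo = x + offset * normals[:, 0], y + offset * normals[:, 1]
    Bo, _, _ = basis(xo, yo)
    A = np.vstack([B, Bo, np.sqrt(w_normal) * Bx, np.sqrt(w_normal) * By])
    b = np.concatenate([np.zeros(len(x)), np.full(len(x), offset),
                        np.sqrt(w_normal) * normals[:, 0],
                        np.sqrt(w_normal) * normals[:, 1]])
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs

# Example: points sampled on the unit circle with outward unit normals.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
pts = np.stack([np.cos(t), np.sin(t)], axis=1)
coeffs = fit_implicit(pts, pts.copy())  # on the unit circle the normals equal the points
```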
- Saliency detection algorithm of panoramic images using joint weighting with observer’s attention longitude Sun Yao, Chen Chunyi, Hu Xiaojuan, Li Ling, Xing Qiwei doi:10.11834/jig.200682
21-04-2022
155
245
Abstract:Objective Considerable development in immersive media technologies has taken place with the aim of providing a complete audiovisual experience to users, especially a sense of being present in the visualized scene. These technologies have been used in many fields, such as entertainment, tourism, and exhibitions. The image resolution of virtual reality (VR) panoramic images is much higher than that of traditional images, making the storage and transmission of VR panoramic images very difficult. However, the human visual attention mechanism is selective: when faced with a scene, a person automatically processes the areas of interest and selectively ignores the areas of no interest. In daily tasks, humans face far more information than they can handle, and selective visual attention enables them to process a large amount of information by prioritizing certain aspects of the information while ignoring others. Therefore, it is necessary to detect the saliency of panoramic images to reasonably reduce the redundant information in them. For the saliency detection of panoramic images, current research can be divided into the following directions: 1) improved traditional saliency detection algorithms and 2) deep-learning-based panoramic image saliency algorithms. The improved traditional saliency detection algorithms involve two aspects: projection conversion and equator bias. Because VR panoramic images have multiple projection modes, their saliency detection can be performed in different projection domains. Equator bias refers to the phenomenon that the saliency of panoramic images tends to be concentrated near the equator because of human observation habits. A saliency detection algorithm can therefore weight the saliency according to the latitude position of pixels. Deep-learning-based panoramic image saliency algorithms use neural networks to extract image features and detect the image’s saliency. Because the content of current panoramic image datasets is insufficient, the saliency detection effect of neural network algorithms also needs to be improved, for example by combining equator bias. Although existing algorithms optimize the influence of latitude location attributes by combining equator bias, no research has focused on the influence of longitude location attributes on saliency. Hence, this study proposes a saliency detection algorithm of panoramic images using joint weighting with the observer’s attention longitude. Method First, a spatial saliency prediction network is used to obtain the preliminary saliency images, and then the equator bias is used to increase the accuracy of the saliency detection at different latitudes. The saliency image is weighted by the attention longitude weighting to combine the observer’s behavior with the saliency image. This study first adds up the saliency values of each longitude in the reference saliency images in the dataset to obtain the prime attention longitude weight graph. Then, the center of the prime attention longitude weight graph is aligned with the prime observation center of the original panorama by translating the prime attention longitude weight graph. The prime attention longitude weights are then multiplied with the saliency values. If a strongly salient area is observed outside the prime observation viewport, the most salient part of the predicted panoramic saliency image is used as the secondary observation center, and the converted attention longitude weighting is applied.
There are two differences between the prime attention longitude weighting and the converted attention longitude weighting. One is that the datasets they use are different: we choose images more similar to human viewing habits to obtain the converted attention longitude weighting. The other is that their effects differ: the effect of the converted attention longitude weighting is weaker. The second step is the weighting of different viewports and longitudes. First, the panoramic image is double-cube projected: the panoramic image in ERP (equirectangular projection) format is cube-projected into six squares, then translated by 45 degrees, and cube projection is applied again. Next, the RGB image is converted into LAB format to extract the brightness feature of the panoramic image, and mrharicot-monodepth2 is used to obtain the depth feature. The longitude weights of each viewport are calculated based on the difference between the features of each viewport and those of the other viewports, and the longitude weights of each pixel are calculated based on the difference between the features of each pixel and those of the other pixels. Combining the two weights gives the viewport-and-longitude weights, which are used to weight the saliency image. Finally, by combining the saliency graph with the prime attention longitude weighting and the weighting of different viewports and longitudes, we obtain the final saliency graph. Result This study compared our results with those of other algorithms on a dataset provided by the International Conference on Multimedia & Expo (ICME) 2017 Salient360! Grand Challenge. The other algorithms include “a saliency prediction model based on sparse representation and the human acuity weighted center-surround differences” (CDSR), the “deep autoencoder-based reconstruction network” (AER), and the “panoramic-CNN-360-Saliency” (PC3S) algorithm. CDSR is an improved traditional algorithm, and AER and PC3S are deep learning algorithms. For evaluation, we use various metrics for eye fixation prediction, including the normalized scanpath saliency, the correlation coefficient, the similarity, and the Kullback-Leibler (KL) divergence, on which our algorithm reached 1.979 3, 0.806 2, 0.709 5, and 0.323 9, respectively. The results show that the proposed algorithm is superior to the other algorithms in the four evaluation metrics: the saliency detection results are better, and the detection of saliency at different longitude positions is more accurate. Conclusion In this study, we proposed a saliency detection algorithm of panoramic images using joint weighting with the observer’s attention longitude. This algorithm improves the accuracy of saliency detection at different longitude positions. Experiments show that our algorithm is superior to current algorithms, especially in the detection accuracy of saliency at different longitudes.
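As a rough illustration of the longitude weighting described above, the sketch below accumulates the per-longitude saliency of a set of reference ERP saliency maps into a weight profile, circularly shifts the profile so that its peak aligns with a chosen observation center, and multiplies it into a predicted saliency map; the normalization and the way the observation center is chosen are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of attention-longitude weighting on an ERP saliency map.
# The max-normalization and the fixed observation center are illustrative assumptions.
import numpy as np

def longitude_profile(reference_maps: np.ndarray) -> np.ndarray:
    """reference_maps: (N, H, W) ERP saliency maps -> per-longitude weights of shape (W,)."""
    profile = reference_maps.sum(axis=(0, 1))
    return profile / profile.max()

def apply_longitude_weighting(saliency: np.ndarray, profile: np.ndarray,
                              center_col: int) -> np.ndarray:
    """Shift the profile so its peak sits at `center_col`, then reweight the map."""
    shift = center_col - int(np.argmax(profile))
    aligned = np.roll(profile, shift)
    return saliency * aligned[np.newaxis, :]

# Toy example: 4 reference maps and one predicted map of size 64 x 128 (H x W).
rng = np.random.default_rng(1)
refs = rng.random((4, 64, 128))
pred = rng.random((64, 128))
profile = longitude_profile(refs)
weighted = apply_longitude_weighting(pred, profile, center_col=64)
```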