
Ranking
- Current Issue
- All Issues
- 1301
- 2253
- 3243
- 4221
- 5
Object detection techniques based on deep learning for aeria...
174 - 6
Lightweight object detection model in remote sensing image b...
171
- 1135833
- 222917
- 321477
- 419656
- 519513
- 617553
About the Journal
Journal of image and Graphics(JIG) is a peer-reviewed monthly periodical, JIG is an open forum and platform which aims to present all key aspects, theoretical and practical, of a broad interest in computer engineering, technology and science in China since 1996. Its main areas include, but are not limited to, state-of-the-art techniques and high-level research in the areas of image analysis and recognition, image interpretation and computer visualization, computer graphics, virtual reality, system simulation, animation, and other hot topics to meet different application requirements in the fields of urban planning, public security, network communication, national defense, aerospace, environmental change, medical diagnostics, remote sensing, surveying and mapping, and others.
- Current Issue
- Online First
Cover&Content
intelligent methods for object detection under complex scene
- Recent advances in drone-view object detection Leng Jiaxu, Mo Mengjingcheng, Zhou Yinghua, Ye Yongming, Gao Chenqiang, Gao Xinbodoi:10.11834/jig.220836
20-09-2023
221
222
Abstract:Given the support of artificial intelligence technology, drones have initially acquired intelligent sensing capabilities and have demonstrated efficient and flexible data collection in practical applications.Drone-view object detection, which aims to locate specific objects in aerial images, plays an irreplaceable role in many fields and has important research significance.For example, drones with highly mobile and flexible deployment have remarkable advantages in accident handling, order management, traffic guidance, and flow detection, making them irreplaceable in traffic monitoring.As for disaster emergency rescue, drones with aerial vision and high mobility can achieve efficient search and safe rescue in large areas, locate people quickly and accurately in distress, and help rescuers control the situation, thereby ensuring the safety of people in distress.This study provides a comprehensive summary of the challenges in object detection based on the unmanned aerial vehicle(UAV)perspective to portray further the development of drone-view object detection.The existing algorithms and related datasets are also introduced.First, this study briefly introduces the concept of object detection in drone view and summarizes the five imbalance challenges in object detection in drone view, such as scale imbalance, spatial imbalance, class imbalance, semantic imbalance, and objective imbalance.This study analyzes and summarizes the challenges of drone-view object detection based on the aforementioned imbalances by using quantitative data analysis and visual qualitative analysis.1)Object scale imbalance is the most focused challenge in current research.It comes from the unique aerial view of drones.The changes in the drone's height and angle bring drastic changes to the object scale in the acquired images.The distance of the lens from the photographed object under the drone view is often far.This scenario results in numerous small objects in the image and makes capturing useful features for object detection difficult for the existing detectors.2)Different regions of drone-view images have great differences, and most objects are concentrated in the minor area of images, i.e., the spatial distribution of objects is enormously uneven.On the one hand, the clustering of dense objects in small areas generates occlusion.The detection model needs to devote considerable attention to this occlusion to distinguish different objects effectively.On the other hand, treating equally different areas wastes many computational resources in vanilla areas, limiting the improvement of object detection performance.3)The problem of class imbalance in the drone view is divided into two categories.One is the positive-negative sample imbalance problem caused by the gap between the front and rear views shared in the image.The other is the imbalanced numbers of different categories caused by the number of samples in the real world.4)The semantic pieces of information defined by different category labels in the drone-view object detection dataset are often similar, resulting in only subtle differences between different categories.However, significantly different representations of objects exist in the same category, which together form the semantic imbalance problem.5)Drone-view object detection often faces the problem of unbalanced optimization targets, i.e., the contradiction between the high computational demand for high-resolution images and the limited computing power of low-power chips is difficult to balance.These 
unbalanced problems bring enormous challenges to object detection from the UAV viewpoint.However, even the most advanced object detection algorithms currently available can hardly achieve an average accuracy rate of 40% on aerial images, which is far below the performance of general object detection tasks.Therefore, many scholars have conducted many studies.These research methods can be summarized as optimization ideas to solve these imbalance problems.In this study, we collect relevant research works, which are sorted and analyzed according to the countries of authors, institutions, published journals or conferences, years, the category of methods, and the solved problem.The present study presents the challenging problems solved by previous research and the development trends of existing methods.This study also focuses on the methods of improving drone-view object detection performance in terms of data augmentation, multiscale feature fusion, region searching strategies, multitask learning, and lightweight model.The advantages and disadvantages of these methods for different problems are systematically summarized and analyzed.Besides introducing existing methods, the present study compiles and introduces the applications of drone-view object detection in practical scenarios, such as traffic monitoring, power inspection, crop analysis, and disaster rescue.These applications further emphasize the significance of object detection in drone view.Then, this study collects and organizes UAV datasets suitable for object detection tasks.These datasets are present from various perspectives, such as year, published journals or conferences, annotation information, and number of citations.In particular, the present study provides the performance evaluation of the existing algorithms on two commonly used public datasets.The presentation of these performance data is expected to help researchers understand the current state of development of drone-view object detection and promote further development in this field.Finally, this study provides an outlook on the future direction of drone-view object detection by considering the aforementioned imbalance problems.The promising research includes the following:1)data augmentation:providing the network with enough high-quality learning samples by considering the specific characteristics of drone-view images based on the conventional data augmentation strategy is a good idea;2)multiscale representation:how to avoid the interference of background noise in feature fusion and effectively extract information at different scales using an efficient fusion strategy is an urgent problem to be solved;3)visual inference:using information unique to the viewpoint of drones, mining contextual information from images to facilitate image recognition, and using easy-to-detect objects to improve the performance of difficult-to-detect objects are directions worthy of deep consideration.
- Survey of small object detection Pan Xiaoying, Jia Ningxing, Mu Yuanzhen, Gao Xuanrongdoi:10.11834/jig.220455
20-09-2023
243
175
Abstract:In recent years, object detection has attracted increasing attention because of the rapid development of computer vision and artificial intelligence technology.Early traditional object detection methods, such as histogram of oriented gradient(HOG)and deformable parts model(DPM)usually adopt three steps:region selection, manual feature extraction, and classification regression.However, manual feature extraction has great limitations for small object detection.The object detection algorithm based on the convolutional neural network can be divided into two-stage and one-stage detection algorithms.Two-stage detection algorithms, such as faster region with convolutional neural network(Faster RCNN)and cascade region with convolutional neural network(Cascade RCNN), select candidate regions through the region proposal network.Then, they classify and regress these regions to obtain the detection results.However, the problem of low accuracy still exists in small object detection.One-stage detection algorithms, such as single shot MultiBox detector(SSD)and you only look once(YOLO), can directly locate the object and output the category detection information of the object, thereby improving the speed of object detection to a certain extent.However, small object detection has always been a huge challenge in the field of object detection because of the small proportion of small object pixels, little semantic information, and small objects that are easily disturbed by complex scenes.In particular, the challenges in object detection are as follows:First, the characteristics of small objects are few.Given the small scale of small objects and the small coverage area in data images, extracting favorable semantic feature information in network training is difficult.Second, small object detection is susceptible to interference.Most of the small objects have low resolution, blurred images, and little visual information.Thus, they are easily disturbed during difficult feature extraction.Thus, the detection model cannot easily locate and identify small objects accurately.Moreover, many false detections and missed detections exist.Third, a shortage of small object datasets exists.At present, most of the mainstream object datasets, such as PASCAL VOC and MS-COCO, are aimed at normal-scale objects.In particular, the proportion of small-scale objects is insufficient, and the distribution is uneven.However, some datasets mentioned in this study that can be used for small object detection are all aimed at specific scenes or tasks.These datasets include DOTA remote sensing object detection dataset, face detection dataset and benchmark, which are not universal for small object detection.Fourth, small objects are easy to gather and block.A serious occlusion problem occurs when small objects gather.After many downsampling and pooling operations, quite a lot of feature information is lost, resulting in some detection difficulties.At present, visual small object detection is increasingly important in all fields of life.Aiming at the problems in small object detection, this study combs the research status and achievements of small object detection at home and abroad to promote the development of small object detection further, improve the speed and accuracy of small object detection, and optimize its algorithm model.The methods of small object detection are analyzed and summarized from the aspects of data enhancement, super resolution, multiscale feature fusion, contextual semantic information, anchor frame 
mechanism, attention, and specific detection scenarios.Data enhancement is the method proposed for solving the problems of a few general small object datasets, a small number of small objects in public datasets, and uneven distribution of small objects in images.The earliest data enhancement strategy is to increase the number of object training and improve the performance of object detection by deforming, rotating, scaling, cutting, and translating object instances.Then, other effective data augmentation methods emerged, which included oversampling the images containing small objects in the experiment, scaling and rotating the small objects, and copying the objects to any new position in order to augment the data.Data enhancement helps improve the robustness of a model to a certain extent.Moreover, it solves the problems of unobvious visual features of small objects and less object information.It also achieves good results in the final detection performance.However, the improper design of data enhancement strategy in practical applications may lead to new noise, impairing the performance of feature extraction.This scenario also brings some challenges to the design of the algorithm.The small object detection method based on multiscale fusion needs to make full use of the detailed information in the image because the characteristic information of small-scale objects is little.In the existing convolutional neural network(CNN)model of general object detection, multiscale detection can help the model to obtain accurate positioning information and discriminating feature information by using a low-level feature layer.This scenario is conducive to the detection and recognition of small-scale objects.First, a feature pyramid network(FPN)with strong semantic features at all scales is introduced.Then, an fpn-based path aggregation network(PANet), which not only achieved good results in case segmentation but also improved the detection of small objects.In feature fusion, the residual feature enhancement method extracts the context information with a constant ratio to reduce the information loss of the highest pyramid feature map.At present, many methods are based on multiscale feature fusion, which uses the low-level highresolution and high-level strong feature semantic information of the network to improve the accuracy of small objects.In small object detection, the target's feature expression ability is weak.Thus, the network structure must be deepened to learn considerable feature information.Introducing an attention mechanism can often make the network model pay considerable attention to the channels and areas related to the task.In the object detection network, the shallow feature map lacks the contextual semantic information of small objects.By incorporating attention mechanisms into the SSD model, irrelevant information in feature fusion is suppressed, leading to an improvement in the detection accuracy of small objects.In general, the attention mechanism can reasonably allocate the used resources, quickly find the region of interest, and ignore disturbing information.However, the improper design in use increases the cost of network calculation and affects the extraction of object features by the model.Finally, the future research direction of small object detection is prospected.Visual small object detection is becoming increasingly important in all fields of life, and it will develop in other directions in the future.
- Object detection techniques based on deep learning for aerial remote sensing images:a survey Shi Zhenghao, Wu Chenwei, Li Chengjian, You Zhenzhen, Wang Quan, Ma Chengchengdoi:10.11834/jig.221085
20-09-2023
174
132
Abstract:Given the successful development of aerospace technology, high-resolution remote-sensing images have been used in daily research.The earlier low-resolution images limit researchers'interpretation of image information.In comparison, today's high-resolution remote sensing images contain rich geographic and entity detail features.They are also rich in spatial structure and semantic information.Thus, they can greatly promote the development of research in this field.Aerial remote sensing image object detection aims to provide the category and location of the target of interest in aerial remote sensing images and present evidence for further information interpretation reasoning.This technology is crucial for aerial remote sensing image interpretation and has important applications in intelligence reconnaissance, target surveillance, and disaster rescue.The early remote sensing image object detection task mainly relies on manual interpretation.The interpretation results are greatly affected by subjective factors, such as the experience and energy of the interpreters.Moreover, the timeliness is low.Various remote sensing image object detection methods based on machine learning technology have been proposed with the progress and development of machine learning technology.Traditional machine learning-based object detection techniques generally use manually designed models to extract feature information, such as feature spectrum, gray value, texture, and shape of remote sensing images, after generating sliding windows.Then, they feed the extracted feature information into classifiers, such as support vector machine(SVM)and adaptive boosting(AdaBoost), to achieve object detection in remote sensing images.These methods design the corresponding feature extraction models for specific targets with strong interpretability but weak feature expression capability, poor generalization, time-consuming computation, and low accuracy.These features make meeting the needs of accurate and efficient object detection tasks challenging in complex and variable application scenarios.In recent years, the research on the application of deep learning in remote sensing image processing has received considerable attention and become a hotspot because of the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial neural networks, in the fields of natural image object detection, classification, and recognition, and the excellent performance in the task of large-scale natural scene image object detection.Thus, many excellent works have emerged.Object detection in aerial remote sensing images mainly faces challenges, such as large-size and high-resolution images, interference from complex backgrounds, target direction diversity, dense targets, dramatic scale changes, and small targets.At present, these challenges have corresponding model improvement methods.For large-scale natural scene image object detection, high-resolution aerial remote sensing images are used because the target scale in the image is widely distributed.This approach ensures the integrity of small target detail information.Thus, the most commonly used detection and recognition method involves segmenting the image during data preprocessing;that is, the large image is segmented into regular image sizes and sent to the object detection algorithm for detection and recognition in turn.In the subsequent processing, all the detection results are finally stitched together and reset to complete the 
detection of the whole image.Moreover, the aerial remote sensing image with the ultrahigh resolution has a complex background.The target to be detected is easily interfered with by various similar objects, and the similar targets to be detected present different characteristics.Thus, false detection quickly occurs during detection.Therefore, the usual methods for solving complex background interference can be divided into two types:extracting the contextual information in the image and improving the attention mechanism.The targets to be detected in the images for the complex multidirectional and multitarget situations are multidirectional because the aerial remote sensing images are all top-down images.Moreover, the aspect ratio range of the targets to be detected is more diverse than that of the targets in the natural images.Thus, the interference between the targets is serious, thereby affecting the accuracy of the final target localization and classification.At present, three practical improvement ideas are available for the problems of directional diversity and dense arrangement distribution of targets to be detected:image rotation enhancement, design of rotation invariant module, and design of an accurate position regression method.The designed model needs to have good scale invariance, i.e., the model has high recognition ability even under the drastic changes of multiple scales of multiple targets, to meet the challenge of drastic changes in the target scales in aerial remote sensing images.Thus, the common improvement scheme is the multiscale feature fusion.For the small target detection in aerial remote sensing images, the current algorithms are mainly improved from feature enhancement, multilevel feature map detection, and the design of precise positioning strategies.In summary, the challenges and difficulties of object detection in aerial remote sensing imagery do not exist independently.For example, the large size and high resolution of aerial remote sensing images inevitably lead to a complex background in the images and a sharp increase in the category and number of small targets to be detected.Moreover, most of the small targets are susceptible to strong interference from the complex background.This phenomenon results in localization and classification recognition accuracy.In addition, the improvements for one challenge also apply to other difficulties, e.g., the improvements for multiscale target feature enhancement benefit almost all challenges.Therefore, the problems in the field must be analyzed and improved from a global perspective.Based on the full study of the latest reviews and related research works, this study systematically compares and summarizes deep learning object detection algorithms for aerial remote sensing images, particularly the research methods at home and abroad in the past three years, to provide appropriate object detection research for aerial remote sensing images and help scholars comprehensively understand and grasp the latest progress in aerial remote sensing image object detection research based on deep learning.First, the present study introduces the deep-learning-based image object detection model.Then, it systematically composes the deep-learning-based aerial remote sensing image detection methods, introduces the publicly available datasets for aerial remote sensing image object detection, and compares the performances of typical methods through experiments.Finally, the problems in the current research of aerial remote sensing image object 
detection are presented, and future research and development trends are prospected.
- Image-level labeled weakly supervised object detection:a survey Chen Zhenyuan, Wang Zhendong, Gong Chendoi:10.11834/jig.220854
20-09-2023
136
129
Abstract:Object detection is a fundamental problem in computer vision and image processing.From the perspective of supervision, it can be divided into fully-supervised, semi-supervised, and weakly-supervised.In recent years, object detection has played an important role in various areas and shown great application value.Precise object detection depends on the accurate region or instance-level image labeling during detector training.However, the complexity of the background and the diversity of objects in real scenes make accurate image labeling extremely time-consuming and laborious.In particular, traditional fully supervised object detection algorithms need to mark the position and category of each object in the image manually with a minimum rectangular box.Thus, the cost of acquiring a training label is increased.By contrast, weakly-supervised object detection(WSOD)algorithms only require the category labels of the whole image for training.Thus, a large number of training samples can be easily obtained by searching the category labels on some image websites.WSOD has received increasing attention and achieved encouraging progress because of its ability to reduce the labor cost of labeling remarkably.Therefore, researchers focus on WSOD algorithms based on image-level coarse labeling.These algorithms slightly depend on supervised information.Compared with other supervised object detection tasks, WSOD aims to localize and classify objects in an image by using only image-level category annotations.The present study starts with the research significance of WSOD.First, the definition, basic framework, and main challenges of WSOD are introduced:1) WSOD is performed in the training and test phases with standard detectors.The whole problem of WSOD can be understood as learning a mapping relationship from several candidate boxes contained in an image to image category markers.2)The problem setup of WSOD is consistent with that of multi-example learning in weakly supervised learning.Thus, WSOD can be treated as a learning problem by taking each candidate box and the image containing all the candidate boxes as an example and a "package" itself, respectively.For each category, if the image contains at least one target object of this category, the image is a positive packet;otherwise, it is a negative packet.Therefore, detector parameters can be learned based on candidate boxes in images.If an image is predicted to be a positive packet of a certain class, then the image contains the target of this class.Thus, the target can be identified using a rectangular candidate box.3)WSOD faces three major problems:local dominance problem, instance ambiguity problem, and conspicuous memory consumption problem.Afterward, advanced WSOD algorithms are classified into three categories according to the network architectures:optimization-candidate-box-generation-based algorithms, segmentation-based algorithms, and self-training-based algorithms.Among them, the core of the optimized-candidate-box-generation-based algorithms is the improved candidate box generator in the basic framework.The core of segmentation-based and self-training-based algorithms is the improved detector in the basic framework.The difference is that the former algorithms aim to add a segmentation branch and guide detection through segmentation, whereas the latter algorithms aim to optimize the detection network.Furthermore, the detection results of various WSOD algorithms are compared under several evaluation metrics through extensive 
experiments.This study selects and compares the current mainstream WSOD algorithms on PASCAL visual object class 2017(VOC2007)and VOC2012 datasets.All algorithms use the Visual Geometry Group(VGG)network 16 pretrained on the ImageNet LargeScale Visual Recognition Challenge(ILSVRC)dataset as the backbone for feature extraction to ensure the fairness of comparison.Moreover, only the performance of the model itself is evaluated without considering the effect of fully supervised models, such as Fast R-CNN.In the mean average precision (mAP) comparison on the VOC2007 dataset, multiple instance self-training(MIST)is considered the best, with the single model obtaining 54.9% mAP.The mAP of the existing advanced WSOD algorithms is between 50% and 60%.Compared with the mAP of the online instance classifier refinement(OICR)algorithm, which is often used as the baseline method, the mAP of MIST is improved by less than 15%.This finding indicates that this field still has a large room for improvement.The comparison of mAP and correct localization (CorLoc)on the VOC2012 dataset indicates that negative deterministic information weakly supervised object detection (NDI-WSOD)achieves good performance, reaching 53.9%, which is 16% higher than the OICR performance.The best algorithm for the CorLoc is pyramidal multiple instance detection network(P-MIDN), and its performance reaches 73.3%.This value is 11.2% higher than that reached by OICR.In addition, various algorithms are adopted for comparison on Microsoft common objects in context(MS COCO)datasets.The algorithm with the highest ValAP50 is still P-MIDN, which achieves 27.4%.MIST combines optimized pseudo notation generation, regularization technique, and bounding box regression in the self-training process.Thus, it can continue to be superior to its competitors on different datasets.The research of the WSOD algorithm based on image-level labeling has made a great breakthrough because of the vigorous development of deep learning.However, WSOD still faces many challenges, and a certain gap between it and fully supervised object detection exists.Finally, some valuable future research directions in this field are discussed:1)generating a few candidate boxes with high quality, 2)designing a reasonable and efficient cooperative framework for detection and segmentation, 3)designing a reasonable strategy or digging out many improved positive samples through the network itself, and 4) designing lightweight network models that can be applied to mobile terminals.
- Survey on Transformer for image classification Shi Zhenghao, Li Chengjian, Zhou Liang, Zhang Zhijun, Wu Chenwei, You Zhenzhen, Ren Wenqidoi:10.11834/jig.220799
20-09-2023
169
123
Abstract:Image classification is an important research direction in the field of image processing and computer vision.It aims to identify the specific category of the object in the image and has important practical application value.However, the classification effect of the existing methods is always unsatisfactory because of the diversity of the shape and type of image objects and the complexity of the imaging environment.Moreover, the existing problems, such as low classification accuracy and high false positives, seriously affect the application of image classification in the subsequent image and computer vision-related tasks.Therefore, improving image classification accuracy through postprocessing algorithms is highly desirable.Given the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial neural networks, in the field of natural image object detection, the research on the application of deep learning techniques in image classification has received great attention and become a research hotspot in the field of image processing and computer vision in recent years.Moreover, many excellent works have been born.As a rising star, visual Transformer(ViT)gains an increasing interest in image processing tasks, particularly because of its strong ability of remote modeling and parallel sequence processing.Several technical review articles on the Transformer have been recently published.Moreover, ViT and its variants have been systematically summarized from different angles, and the application of the Transformer in different visual tasks has been introduced.This scenario provides appropriate help for people studying and tracking the research progress of image classification technology.Compared with traditional convolutional neural network (CNN), ViT achieves global modeling and parallel processing of the image by dividing the input image into patches.Thus, the image classification ability of the model is greatly improved.However, many problems, such as poor scalability, high computational overhead, slow convergence, and attention collapse, still exist because of the complexity of image classification problems and the diversity of the development of ViT technology.These problems can be solved using the ViT variants in image processing tasks.Moreover, the reviews that can help scholars comprehensively understand and grasp the latest progress of ViT for image processing tasks from a global perspective are very few.Therefore, the present study systematically compares and summarizes the ViT algorithms for image classification based on the full study of the latest reviews and related research to help scholars understand and grasp the latest progress of image classification research based on ViT.Unlike the existing review papers, our work is particularly focused on the research methods at home and abroad in the past 2 years(between January 2021 and December 31, 2022).We begin by describing the basic concept, principle, and structure of the traditional Transformer model for easy understanding.First, we introduce the attention mechanism and multihead attention mechanism.Then, the feed-forward neural network and position coding are described.Finally, the model structure of the traditional Transformer is presented.Afterward, the evolution of the Transformer model and its applications in image processing in recent years are figured.Then, the concept, principle, and structure of ViT are briefly introduced.Various vision Transformer models and 
applications in image classification are described in detail according to the problems faced by ViT.Different solutions, including scalable location coding, low complexity, low computing cost, local and global information fusion, and deep ViT model, are described one by one.Experiments on ImageNet, Canadian Institute for Advanced Research(CIFAR-10), and CIFAR-100 are provided, and many evaluations are presented to demonstrate the classification performance of the ViT and its variants for image classification.Two indicators are adopted, namely, accuracy and parameter quantity, to evaluate experimental results.Floating point operation(FLOPs)per second is also used to analyze the performance of the model comprehensively.Given that the Transformer has also been widely used in remote sensing image classification in recent years, the present study compares and analyzes the remote sensing image classification methods based on the Transformer.The experiments are performed on the hyperspectral image datasets of Indian Pines, Trento, and Salinas to evaluate the Transformer for the remote sensing image classification.Three indicators, namely, overall accuracy(OA), average accuracy(AA), and Kappa coefficient, are employed in this work.Finally, the problems and challenges faced by the current application of ViT in image classification are presented.Future research and development trends are also prospected.
- NSPDet:real-time nearby-aware pedestrian detection algorithm for multi-scene surveillance at night Gong An, Li Zhonghao, Liang Chenhongdoi:10.11834/jig.220834
20-09-2023
136
90
Abstract:Objective Pedestrian detection is a widely concerned topic in computer vision tasks.It is also a basic and critical technology in automatic driving assistance systems, visual surveillance, and behavior recognition.In the traffic environment, pedestrians and cyclists belong to the "vulnerable groups on the road".The World Health Organization(WHO)statistics show that approximately half of all fatalities in road accidents involve pedestrians.Unlike conventional detection objects(e.g., automobiles)with relatively stable structural characteristics, different limb activities of pedestrians exhibit the nonrigid characteristic of structural instability, thereby complicating pedestrian detection.Moreover, the night scene is difficult to navigate.However, insufficient domestic and international research on night pedestrian detection is currently lacking.Given insufficient illumination and local overexposure, pedestrian recognition algorithms are vulnerable to accuracy restrictions, leading to missing and incorrect detections.Therefore, nighttime pedestrian detection technology has important research and social value for ensuring pedestrian safety.Method The monitoring conditions at night are constrained by uneven and insufficient lighting.Thus, the acquired photos have inadequate exposure, which reduces the effectiveness of pedestrian detection.The present study suggests adding a low-light enhancement module(Zero-DCE)to the detector to boost the model's nighttime detection performance and address the issue.We feed the regression loss of the detector and the detection location information to the low-light enhancement module for the joint training of the low-light image enhancement and pedestrian detection tasks to make the low-light image enhancement act as a positive gain for the pedestrian detection task.This approach maintains the regional continuity of pedestrian features in the image and avoids the degradation of detection accuracy caused by the pixel-level low-light enhancement operation that destroys the features in the pedestrian region.Pedestrian detection has a long history.In recent years, pedestrian detection strategies using histograms of oriented gradients(HOG)to model human features with a support vector machine(SVM)as a feature classifier have been widely studied.However, the traditional pedestrian detection methods are based on feature engineering.Moreover, the hand-crafted features have low accuracy and are not generalizable.In recent years, deep learning algorithms have started to be used for pedestrian detection tasks.The convolutional neural network(CNN)can extract high-level features and gradually becomes the mainstream pedestrian detection method.On the basis of whether the detection algorithm is based on region proposal, deep learning-based pedestrian detection algorithms can be broadly divided into two-stage and one-stage methods.Two-stage methods first use sliding windows to find preselected regions in the image.Then, the regions and the representative are classified and regressed.The representative methods are R-CNN and Faster R-CNN.The detection algorithm based on the region proposal can capture rich features.Thus, the detection accuracy is high.However, problems, such as redundancy of preselected regions and slow inference speed, exist.One-stage methods do not base on region proposal.However, they directly regress the target's position in the image, thereby simplifying the detection process and accelerating inference speed.The representative methods are single 
shot multibox detector(SSD), you only look once v3(YOLOv3), and YOLOX, proposed by MEGVII.In this study, the one-stage method YOLOX is finally selected as the baseline model for the consideration of detection accuracy and inference speed.The targeted optimization is performed for night scenes on the baseline.Additionally, a significant issue with pedestrian detection is the missing and incorrect detection brought on by interclass occlusion and dense crowds.The original non-maximum suppression(NMS)algorithm is susceptible to falsely deleting the detection box when numerous pedestrians are present and their distribution is concentrated.This scenario leads to pedestrian missing detection.Aiming at this problem, the present study reconsiders the NMS strategy in the model reasoning stage and introduces a nonmaximum suppression algorithm(nearby object hallucinatory(NOH))that adds the distribution information of nearby pedestrian targets.We eliminate the dependence of NOH on region proposals, allowing it to be ported to the one-stage target detection algorithm.The bounding box features predicted by YOLOX are pooled into the same feature space.Then, we use a simple full connection module to build the location distribution and density information of nearby pedestrians required by NOH.The improved NOH module is combined with the original YOLOHead as Pedestrian-Head to obtain the final pedestrian detection information.We determine through experiments that adding such a full connection module effectively reduces the missing detection problem caused by occlusion, and the reasoning speed is slightly improved.However, full connection modules inevitably bring redundant parameters to the network.Therefore, this study further investigates the reduction of model volume.Deep separable convolution is also used in the lightweight model to maintain the accuracy of model detection and reduce the computational power required for reasoning.The floating-point computation of the lightweight model is reduced to 22.4 GFLOPs.In theory, our algorithm can meet the needs of real-time reasoning of mobile devices.Result We divided the ablation experiments into three groups for verification on the NightSurveillance dataset.Compared with the baseline model(YOLOX), NSPDet increased the average precision(AP)and the average recall(AR)indices by 10.1 and 7.2, respectively.In addition, the parameters of the lightweight NSPDet model are reduced by 16.4 M.The AP attenuation and AR attenuation are 7.6 and 6.2, respectively.However, the lightweight NSPDet model is still better than the baseline model.The comparison experiments of other methods on Caltech, CityPersons, and NightOwls datasets show that the night pedestrian detection algorithm proposed in this study has a low average false detection rate.Conclusion The NSPDet algorithm proposed in this study improves the accuracy of the baseline model for pedestrian detection at night.The proposed algorithm also has the performance of real-time reasoning.This study optimizes the accuracy of the baseline model for pedestrian detection in various complex nighttime scenes, including low light, strong light interference, image blur, occlusion, and rainy weather.It has an important application value for promoting research in autonomous driving and intelligent transportation.
- Lightweight object detection model in remote sensing image by combining rotation box and attention mechanism Li Zhaohui, An Jintang, Jia Hongyu, Fang Yandoi:10.11834/jig.220839
20-09-2023
171
134
Abstract:Objective Remote sensing image object detection plays an important role in military security, maritime traffic supervision, intelligent monitoring, and other fields.Remote sensing images are different from natural images.Most remote sensing images are taken at altitudes ranging from several kilometers to tens of thousands of meters.Therefore, the scale of target objects in remote sensing images is large.Most of the target objects are small, such as small vehicles.The other target objects are huge, such as ships.The angles of the objects in the remote sensing images are distributed arbitrarily because of the shooting angle.Therefore, this scenario is a huge challenge for the feature extraction network in remote sensing image target detection, particularly in complex backgrounds.Given the continuous improvement in the computing power of hardware devices and the rapid development of deep learning theory, large and ultralarge object detection networks have been continuously proposed in recent years to improve detection accuracy.Although these detection networks have strong representation learning capabilities, they ignore the cost-effectiveness gained from the relationship of detection accuracy with model calculation amount and the number of parameters.Moreover, real-time detection requirements are difficult to achieve, and the number of parameters and amount of calculation are very limited in model deployment.In addition, most of the general target detection models are designed for natural field datasets.The detection effect in remote sensing image target detection is unsatisfactory, particularly for densely arranged objects.The traditional horizontal box object detection cannot achieve precise detection, such as ships in port and cars in parking lots.Aiming at the above problems, a lightweight rotating box remote sensing image object detection model (YOLO-RMV4)is designed.Method In the experiment, the open-source datasets DOTA2.0, FAIR1M, and HRSC2016 are used as the basic datasets.Moreover, four common vehicles, including a ship, a plane, a small vehicle, and a large vehicle, are selected as objects.A aerial images of vehicle ship and plane(AVSP)dataset is prepared after preprocesses, such as filtering, segmentation, conversion, and relabeling, are performed.This dataset contains 19 406 images of 1 024×1 024 and 637 466 object instances.The AVSP data labels are divided into HBB and OBB(HBB is the horizontal box annotation, and OBB is the rotating box annotation), where OBB is represented by eight parameters.YOLO-RMV4 is improved based on the MobileNetv3 network.Adding an efficient channel attention(ECA)mechanism module with excellent performance in the feature extraction network, appropriately expanding the network scale, adding the SPPF module after the feature extraction network, and adding the path aggregation network(PANet)result in multiscale fusion of the extracted features of the backbone network, thereby providing the network with rich and reliable target features.In the network detection head, multiscale detection technology is used to deal with target objects of different sizes.More than half of the objects in the dataset are small targets.Thus, the detection after four times of downsampling is added, resulting in 4, 8, 16, and 32 times of downsampling.Moreover, the small target loss is given a high weight.The smooth circular label is added to the angle prediction in the detection head, which converts the angle regression problem into a classification problem.Thus, the 
distance between the predicted angle and the real angle can be measured, and the angle periodicity problem is solved.This scenario results in a precise bounding box positioning.Moreover, the anchor size is designed according to the characteristics of the dataset.We use random cropping, flipping, mosaic technique, and other data augmentation approaches in the training.Result In this study, we conduct comparative experiments, and ablation experiments are carried out on the AVSP dataset.We also conduct comparative experiments on seven mainstream lightweight network models to verify the effectiveness of the model.we used average recall(AR), mean average precision(mAP), parameter count, and detection speed(frames per second, FPS)as evaluation metrics.Each model's parameters, such as mAP, AR, and FPS, are also compared.The size of YOLO-RMV4(5.3 M)is only 1/8 of that of RYOLOv5l(45.3 M).Compared with the mAP and AR of RYOLOv5l, those of YOLO-RMV4 are increased by 1.2% and 1.6%, respectively.Moreover, the mAP and AR of YOLO-RMV4 are much higher than those of other lightweight network models(EfficientNet and ShuffleNet).We also compress and prune YOLO-RMV4 to obtain YOLO-RMV4S, whose size is only 4.5 M.YOLO-RMV4S is also better than common lightweight network models in terms of detection precision and recall.Ablation experiments were also conducted on each improved module to verify the improvement degree of model performance by different modules.The mAP increases by 8.4% after the addition of PANet.PANet fuses the features of different layers.This phenomenon largely makes up for the defect of the limited feature extraction capability of the lightweight network.After the rotation detection head is added, the mAP increases by 16.8%, greatly increasing the detection performance of the model.After the ECA module is added, the mAP increases by 1.6%.The ECA module can accurately stimulate the backbone feature extraction network to utilize the limited capacity and the limited amount of parameters and learn the feature information of the target object.After the addition of four times of downsampling, the mAP increases by 3.0%.The addition of four times of downsampling greatly enhances the performance of small target objects.One of the modules is also eliminated based on YOLO-RMV4.The performance degradation degree of the model is compared to reflect the unique role of each module in the model.Finally, the detection accuracy of each category is analyzed.The mAP and AR of the plane are the highest.Those of the ship and the large vehicle are the second, whereas those of the small vehicle are the lowest.Conclusion YOLO-RMV4 is supplemented by multiscale fusion and rotating box detection under the lightweight network structure.Thus, the model can achieve real-time inference and high-precision detection under extremely limited parameters, thereby making it very cost-effective.
- Open-set object detection based on annular prototype space optimization Sun Xuhao, Shen Yang, Wei Xiushen, An Pengdoi:10.11834/jig.220992
20-09-2023
122
102
Abstract:Objective In the close-set setup, object detection identifies objects in a set of images or data in other modalities that belong to the same class in both the training and test phases.Under this setting, modern object detectors have achieved impressive progress.However, the images to be detected in practical tasks usually contain objects of unknown categories.For example, specifying that some fish that meet the size requirements can be caught whereas others that do not meet the requirements are prohibited is common in offshore fishing.Object detectors usually produce two types of errors:The first involves classifying the objects of interest as another object or background, i.e., identifying a known class as a background class or an unknown class.The second occurs when a background sample or an unknown object is mistaken as one of the classes of interest, i.e., identifying a background region or an unknown object region as a known class.Most of the previous detection methods under closed-set conditions can identify unknown and background classes in the open-set setup to some extent after unknown class thresholds are added for screening.However, adjusting these thresholds in real scenarios is challenging for us.Therefore, this study explores the open-set object detection(OSOD)task to improve the robustness of the model in real-world detection tasks.In the open-set environment, the model needs to distinguish not only the known objects contained in the training data but also other objects not contained in the training set.Moreover, the model must delineate the background classes that are neither known nor unknown objects.Method The existing approaches within the OSOD domain typically group background classes and unknown classes into feature sparse classes and classify them as one class.This approach leaves the task of dividing the background class from the unknown class entirely to the final classifier.It is contrary to the original intention of the region proposal networks(RPN)layer to filter the inclusion of object candidate regions.Therefore, we propose a new OSOD framework.On the one hand, we improve the design of the classifier therein through an annular prototype space.Thus, the classifier can focus on identifying known and unknown classes.In particular, the detector can layer known classes, unknown classes, and background classes.Thus, known classes become dense in the high-dimensional space through prototype learning optimization, whereas background classes become sparse in the high-dimensional space.This scenario helps improve the detection performance.On the other hand, we filter out the background classes by randomly masking the existing proposal regions, thereby improving the robustness of the RPN layer while retaining the advantage of proposing object candidates with the RPN layer.Moreover, the need for the additional step of background class sampling is eliminated.In particular, the feature vectors generated for the regions belonging to the unknown category change considerably after a small random mask sampling.However, the feature vectors generated for the regions belonging to the background category do not change considerably after a small random mask sampling.Thus, the module corrects the regions identified as unknown categories.Result The proposed method is experimented with on the OSOD benchmark, which consists of PASCAL Visual Object Classes(PASCAL VOC)and Microsoft common objects in context(MS COCO).The train-val set of VOC is used for close-set training.Moreover, 
20 VOC and 60 non-VOC sets in COCO are used to evaluate the proposed method under different open-set conditions.The comparison methods contain Faster-CNN(FR-CNN), placeholders for open-set recognition(PROSER), open world object detector (ORE), dropout sampling(DS), and open-set detector(OpenDet).OpenDet is currently the state-of-the-art method in the field of OSOD.In particular, we adopt two settings to prove the effectiveness of our method.For setting one, we gradually increase the number of unknown classes and build three joint datasets called Visual Object Classes-Common Objects in Context-20 (VOC-COCO-20), Visual Object Classes-Common Objects in Context-40 (VOC-COCO-40), and Visual Object Classes-Common Objects in Context-60(VOC-COCO-60).The proposed method outperforms other methods by a large margin in all targets and achieves new state-of-the-art results in OSOD.For example, our method gains approximately 26%, 32%, and 15.88 on wilderness impact(WI), absolute open-set error(AOSE)and APU, respectively, without compromising the mAPK(58.85% vs.58.45%)on the VOC-COCO-20 dataset.Compared with the state-of-the-art method, our method gains approximately 8%, 5%, and 15% on WI, AOSE and APU, respectively, on average on the three compared datasets.For setting two, we gradually increase the frequency of frames that may have unknowns, named the wilderness ratio, to construct three joint datasets:Visual Object Classes-Common Objects in Context-0.5n(VOC-COCO-0.5n), Visual Object Classes-Common Objects in Context-n(VOC-COCO-n), and Visual Object Classes-Common Objects in Context-4n(VOC-COCO-4n).The proposed method achieves new state-of-the-art results in 10 out of 12 targets from three comparison experiments in open-set object detection.The ablation study also demonstrates the effectiveness of each module in the proposed method.Conclusion In this study, the OSOD framework improved by the annular prototype space is adaptable to the OSOD problem.The comparison of the effects of baseline methods, the current state-of-the-art method, and our proposed method on the OSOD benchmark settings show that the proposed method can accurately detect open-set categories and background categories without changing the performance of the close-set object detection of the vanilla backbone.In future work, we hope to investigate further the correlation between known and unknown class detection performance and extend the categories to be detected to research areas, such as out-of-distribution and fine-grained image analysis.
- Mitosis detection by appearance and motion pattern perception Lin Fanchao, Xie Hongtao, Liu Chuanbin, Zhang Yongdongdoi:10.11834/jig.220901
20-09-2023
121
68
Abstract:Objective In the processes of medical research and diagnosis, such as cancer screening and drug development, mitosis detection under phase-contrast microscopy image provides a very important biological criterion.Manual counting of mitotic cells takes a lot of time and labor.Thus, automatic mitosis detection is more efficient and economic than the manual process.On the one hand, the distributions of mitosis images are significantly different under various culture conditions.Moreover, the increment of cell density makes screening out the cell regions difficult for conventional preprocessing methods.On the other hand, the cells at different stages have similar appearances and blurred motion processes.They also require the model to have a strong ability to discriminate cell types and states.Recent deep-learning-based works use threedimensional convolutions or temporal networks to obtain context information from the sequence images.However, an explicit supervision process for learning cell states is lacking, making effective pattern information of target regions difficult to achieve.As a result, these methods are not fully capable of distinguishing different cells and background areas from feature encoding, and their performance and generalization ability are limited.Therefore, this study explores a detection framework based on cell appearance and motion pattern perception to solve the above problems.An accurate prediction under complex scenes is also achieved through effective preprocessing and discriminative learning of cell patterns.Method The proposed method consists of three stages.The first stage aims to extract regions of interest as candidates.This stage serves as the preprocessing for finding the notable areas and facilitating the later detection.The original electron microscope image is divided into local slices.An instance segmentation network is also trained to segment roughly all the candidate regions that may contain mitosis.Then, a candidate region refinement algorithm is designed based on a concise spatiotemporal hypothesis to refine the candidates and reduce the redundant results.In the second stage, two encoding networks are pretrained to maintain the feature encoding of both appearance and motion information by building proxy learning processes.In particular, an image classification task is conducted for the appearance encoding network training, which learns to predict the cell categories from the spatial context of a single patch.Moreover, an image reconstruction task is conducted for the motion encoding network training, which considers patches from adjacent frames and learns the information of interframe changes by recovering the raw patches.These two processes complement each other to help model the cell states from different aspects.Finally, in the third stage, the whole spatiotemporal model is trained end-to-end by classifying the candidate patch sequences.The spatial modules are initialized with the pretrained parameters of encoding networks in the second stage, thereby allowing them to be aware of the cell patterns at the beginning of the training.Given the appropriate spatial context, the temporal modules are optimized to combine the interframe information and make the final prediction.The overall model provides a confidence score for each patch.The position with the highest score is regarded as a mitosis point.Result We conduct experiments on the public C2C12-16 benchmark.The experimental results demonstrate the superior detection ability of the proposed 
method.On the C2C12-16 validation set, the mean precision reaches 85.3%, the mean recall reaches 89.3%, and the mean F-score is 87.2%.On the C2C12-16 test set, the mean precision reaches 86.4%, the mean recall reaches 86.1%, and the F-score is 86.2%.The proposed method demonstrates high performance and can generate stable predictions under various conditions.The mean temporal bias of the proposed method in all groups is only 0.221 ±0.536 frames, and the mean spatial bias is 3.321 ±2.461 pixels, both of which are much lower than those obtained by the counterpart method.Conclusion This study explores a new framework to tackle the hard cases under complex scenes in mitosis detection.The preprocessing strategy effectively extracts candidate regions and substantially improves detection efficiency.The pre-training of the feature encoding network based on proxy tasks fully enhances the model's ability to learn the appearance and motion patterns of the candidate regions.With the preprocessing and pretraining designs, our framework can distinguish the discrepancy of visual patterns between mitosis cells, common cells, and background noises, overcome the interference of complex scenes and cell patterns in the microscope image, and achieve both accurate and stable mitosis detection from spatiotemporal dimensions.
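As an illustration of how the reported precision, recall, and F-score relate, the following Python sketch computes these detection metrics from matched predictions; the matching of predictions to ground-truth mitosis events (within spatial and temporal tolerances) follows the usual detection-evaluation convention and is not taken from the paper's code.

```python
# Illustrative only: relation between precision, recall, and F-score for
# mitosis detection. The counts below are placeholders, not the paper's data.

def detection_scores(num_true_positive: int, num_predictions: int, num_ground_truth: int):
    """Return (precision, recall, F-score) for a set of matched detections."""
    precision = num_true_positive / max(num_predictions, 1)
    recall = num_true_positive / max(num_ground_truth, 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f_score

# Example: values close to the reported validation precision (85.3%) and
# recall (89.3%) yield an F-score of roughly 87%.
p, r, f = detection_scores(num_true_positive=853, num_predictions=1000, num_ground_truth=955)
print(f"precision={p:.3f}, recall={r:.3f}, F-score={f:.3f}")
```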
Review
- The development,application,and future of LLM similar to ChatGPT Yan Hao, Liu Yuliang, Jin Lianwen, Bai Xiangdoi:10.11834/jig.230536
20-09-2023
148
251
Abstract:Generative artificial intelligence(AI)technology has achieved remarkable breakthroughs and advances in its intelligence level since the release of ChatGPT several months ago, especially in terms of its scope, automation, and intelligence.The rising popularity of generative AI attracts capital inflows and promotes the innovation of various fields.Moreover, governments worldwide pay considerable attention to generative AI and hold different attitudes toward it.The US government maintains a relatively relaxed attitude to stay ahead in the global technological arena, while European countries are conservative and are concerned about data privacy in large language models(LLMs).The Chinese government attaches great importance to AI and LLMs but also emphasizes the regulatory issues.With the growing influence of ChatGPT and its competitors and the rapid development of generative AI technology, conducting a deep analysis of them becomes necessary.This paper first provides an in-depth analysis of the development, application, and prospects of generative AI.Various types of LLMs have emerged as a series of remarkable technological products that have demonstrated versatile capabilities across multiple domains, such as education, medicine, finance, law, programming, and paper writing.These models are usually fine-tuned on the basis of general LLMs, with the aim of endowing the large models with additional domainspecific knowledge and enhanced adaptability to a specific domain.LLMs(e.g., GPT-4)have achieved rapid improvements in the past few months in terms of professional knowledge, reasoning, coding, credibility, security, transferability, and multimodality.Then, the technical contribution of generative AI technology is briefly introduced from four aspects:1) we review the related work on LLMs, such as GPT-4, PaLM2, ERNIE Bot, and their construction pipeline, which involves the training of base and assistant models.The base models store a large amount of linguistic knowledge, while the assistant models acquire stronger comprehension and generation capabilities after a series of fine-tuning.2)We outline a series of public LLMs based on LLaMA, a framework for building lightweight and memory-efficient LLMs, including Alpaca, Vicuna, Koala, and Baize, as well as the key technologies for building LLMs with low memory and computation requirements, consisting of low-rank adaptation, Self-instruct, and automatic prompt engineer.3)We summarize three types of existing mainstream image -text multimodal techniques:training additional adaptation layers to align visual modules and language models, multimodal instruction fine-tuning, and LLM serving as the center of understanding.4)We introduce three types of LLM evaluation benchmarks based on different implementation methods, namely, manual evaluation, automatic evaluation, and LLM evaluation.Parameter optimization and fine-tuning dataset construction are crucial for the popularization and innovation of generative AI products because they can significantly reduce the training cost and computational resource consumption of LLMs while enhancing the diversity and generalization ability of LLMs.Multimodal capability is the future trend of generative AI because multimodal models have the ability to integrate information from multiple perceptual dimensions, which is consistent with human cognition.Evaluation benchmarks are the key methods to compare and constrain the models of generative AI, given that they can efficiently measure and optimize the performance 
and generalization ability of LLMs and reveal their strengths and limitations.In conclusion, improving parameter optimization, highquality dataset construction, multimodal, and other technologies and establishing a unified, comprehensive, and convenient evaluation benchmark will be the key to achieving further development in generative AI.Furthermore, the current challenges and possible future directions of the related technologies are discussed in this paper.Existing generative AI products have considerable creativity, understanding, and intelligence and have shown broad application prospects in various fields, such as empowering content creation, innovating interactive experience, creating "digital life, " serving as smart home and family assistants, and realizing autonomous driving and intelligent car interaction.However, LLMs still exhibit some limitations, such as lack of high-quality training data, susceptibility to hallucinations, output factual errors, uninterpretability, high training and deployment costs, and security and privacy issues.Therefore, the potential research directions can be divided into three aspects:1)the data aspect focuses on the input and output of LLMs, including the construction of general tuning instruction datasets and domain-specific knowledge datasets.2)The technical aspect improves the internal structure and function of LLMs, including the training, multimodality, principle innovation, and structure pruning of LLMs.3)The application aspect enhances the practical effect and application value of LLMs, including security enhancement, evaluation system development, and LLM application engineering implementation.The advancement of generative AI has provided remarkable benefits for economic development.However, it also entails new opportunities and challenges for various stakeholders, especially the industry and the general public.On the one hand, the industry needs to foster a large pool of researchers who can conduct systematic and cutting-edge research on generative AI technologies, which are constantly improving and innovating.On the other hand, the general public needs to acquire and apply the skills of prompt engineering, which can enable them to utilize existing LLMs effectively and efficiently.
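The low-rank adaptation technique listed above among the key technologies for building memory-efficient LLMs can be illustrated with a minimal sketch; the layer size, rank, and scaling below are illustrative defaults rather than values from any surveyed model.

```python
# A minimal sketch of low-rank adaptation (LoRA): a frozen pretrained weight is
# augmented with a trainable low-rank update delta_W = B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen base weight (stands in for a layer of the pretrained model).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors; B starts at zero so training begins from
        # the pretrained behaviour.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(768, 768)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trained
```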
- Overview of the computational intelligence method in 3D point cloud registration Wu Yue, Yuan Yongzhe, Xiang Benhua, Sheng Jinlong, Lei Jiayi, Hu Congying, Gong Maoguo, Ma Wenping, Miao Qiguangdoi:10.11834/jig.220727
20-09-2023
139
136
Abstract: Point cloud data collected by lidar, structured light sensors, and stereo cameras have attracted widespread attention with the maturity and popularization of 3D data acquisition equipment. On this basis, many algorithms for point cloud registration, classification, segmentation, and tracking have been developed, which promote research progress in the field of point clouds. Point cloud registration is an important research direction in point cloud data processing. It aims to find the motion parameters of a rigid transformation such that, after acting on the source point cloud, the source is aligned with the reference point cloud. Most traditional point cloud registration methods are sensitive to initial poses and outliers. In comparison, computational intelligence methods can effectively solve point cloud registration problems and can also handle the partially overlapping problem. In these cases, computational intelligence methods show strong robustness and generalization. These methods do not depend on the characteristics of the problem itself nor require the establishment of an accurate model; they only require an approximate solution in place of the true solution, thereby greatly reducing the amount of computation. The applications of computational intelligence methods in point cloud registration fall into three main categories: deep learning, evolutionary computing, and fuzzy logic. The deep learning methods in point cloud registration can be divided into two types according to whether a correspondence relationship exists: the corresponding point cloud registration method and the noncorresponding point cloud registration method. Research on the former is based on the traditional iterative closest point framework; that is, the network framework is divided into four parts: feature extraction, feature matching, outlier elimination, and motion parameter estimation. In comparison, noncorresponding point cloud registration estimates the motion parameters from the difference between the global features of the two point clouds. It includes two important steps: 1) extracting global features sensitive to the pose of the point cloud and 2) using the differences in global features to solve for the motion parameters. The mainstream approach is the correspondence-based point cloud registration method. Global registration methods describe the point cloud with feature descriptors; these descriptors encode the neighborhood information of the feature points, which can effectively overcome the problem that the traditional iterative closest point method is sensitive to the initial pose and easily falls into a local minimum. The correspondence-based point cloud registration method comprises four modules: feature extraction, feature matching, outlier removal, and motion parameter estimation. Feature extraction is the primary task in point cloud registration, and the quality of the extracted features directly affects global performance. In correspondence-based point cloud registration, the global features of all points in the two point cloud sets are first extracted to generate a correspondence mapping, after which the transformation matrix is solved. Feature extraction mainly includes voxel-based feature extraction and feature
extraction based on raw data.Feature matching can find the corresponding points in the overlapping area, thereby evaluating the transformation matrix.Compared with the traditional point cloud, deep learningbased point cloud registration uses the network to generate the corresponding points.Outliers greatly impact the point cloud registration performance.The weights of the points can be solved through the neural network, and the corresponding points with large weights can be selected through the maximum pool for registration.The outliers can also be removed.Motion parameter estimation is the last task in point cloud registration.It solves the rotation matrix and translation vector by mainly using regression and singular value decomposition.Evolutionary computing methods in point cloud registration mainly include two categories:evolutionary algorithms and swarm intelligence.The genetic algorithm and differential evolution algorithm are mainly used in the evolutionary algorithm and point cloud registration.The genetic algorithm constructs the population, evaluates the individual, and performs crossover mutation according to the fitness to evolve the population until the population meets the termination condition and obtains the optimal solution.The differential evolution algorithm is a heuristic global search algorithm.It encodes the parameters in the point cloud registration.Then, it initializes the population, performs mutation crossover and selection according to different strategies, and finally finds the optimal transformation parameters according to the iterative search.For point cloud registration, the point cloud registration method based on swarm intelligence is also mainly divided into two types:particle swarm optimization algorithm and ant colony optimization algorithm.The particle swarm algorithm in point cloud registration first designs an appropriate objective function as a fitness function.Then, it encodes the parameters to generate an initialization particle swarm and updates the individual best position and the global best position according to fitness.It iterates until the termination condition is met.The evolutionary algorithm has robustness, parallelism, and self-adaptation.These features are in good agreement with the characteristics of point cloud registration.The fuzzy logic method in point cloud registration is mainly used in two ways:1)reducing the number of point clouds and 2)point cloud registration based on fuzzy clustering.When the point cloud is reduced, the quality of the point cloud can be improved by dividing the point cloud input space into several fuzzy sets and defining fuzzy rules and membership functions of fuzzy variables.The fuzzy clustering method has three main steps:converting the point cloud input into a fuzzy matrix, establishing a fuzzy similarity matrix, and relying on the fuzzy matrix for classification.This method can evaluate point cloud registration quality without ground truth.The present article discusses in detail the above point cloud registration methods and the advantages and disadvantages of each method to summarize the related research on point cloud registration comprehensively and clearly.
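The motion parameter estimation step described above, which solves the rotation matrix and translation vector mainly through singular value decomposition, can be sketched as follows for a given set of matched correspondences; this is the standard closed-form solution, not the code of any surveyed method.

```python
# A minimal sketch of SVD-based motion parameter estimation: given matched
# corresponding points, recover the rigid rotation R and translation t that
# align the source point cloud with the reference point cloud.
import numpy as np

def estimate_rigid_transform(src: np.ndarray, ref: np.ndarray):
    """src, ref: (N, 3) arrays of corresponding points. Returns (R, t)."""
    src_center = src.mean(axis=0)
    ref_center = ref.mean(axis=0)
    # Cross-covariance of the centred correspondences.
    H = (src - src_center).T @ (ref - ref_center)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    # Reflection correction keeps R a proper rotation (det = +1).
    if np.linalg.det(R) < 0:
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = ref_center - R @ src_center
    return R, t
```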
- Deep-learning-based image captioning:analysis and prospects Zhao Yongqiang, Jin Zhi, Zhang Feng, Zhao Haiyan, Tao Zhengwei, Dou Chengfeng, Xu Xinhai, Liu Donghongdoi:10.11834/jig.220660
20-09-2023
146
118
Abstract:The task of image captioning is to use a computer in automatically generating a complete, smooth, and suitable corresponding scene's caption for a known image and realizing the multimodal conversion from image to text.Describing the visual content of an image accurately and quickly is a fundamental goal for the area of artificial intelligence, which has a wide range of applications in research and production.Image captioning can be applied to many aspects of social development, such as text captions of images and videos, visual question answering, storytelling by looking at the image, network image analysis, and keyword search of an image.Image captions can also assist individuals born with visual impairments, making the computer another pair of eyes for them.The accuracy and inference speed of image captioning algorithms have been greatly improved with the wide application of deep learning technology.On the basis of extensive literature research we find that image captioning algorithms based on deep learning still have key technical challenges, i.e., delivering rich feature information, solving the problem of exposure bias, generating the diversity of image captions, realizing the controllability of image captions, and improving the inference speed of image captions.The main framework of the image captioning model is the encoder-decoder architecture.First, the encoder-decoder architecture uses an encoder to convert an input image into a fixed-length feature vector.Then, a decoder converts the fixed-length feature vector into an image caption.Therefore, the richer the feature information contained in the model is, the higher the accuracy of the model is, and the better the generation effect of the image caption is.According to the different research ideas of the existing algorithms, the present study reviews image captioning algorithms that deliver rich feature information from three aspects:attention mechanism, pretraining model, and multimodal model.Many image captioning algorithms cannot synchronize the training and prediction processes of a model.Thus, the model obtains exposure bias.When the model has an exposure bias, errors accumulate during word generation.Thus, the following words become biased, seriously affecting the accuracy of the image captioning model.According to different problem-solving methods, the present study reviews the related research on solving the exposure bias problem in the field of image captioning from three perspectives:reinforcement learning, nonautoregressive model, and curriculum learning and scheduled sampling.Image captioning is an ambiguity problem because it may generate multiple suitable captions for an image.The existing image captioning methods use common high-frequency expressions to generate relatively "safety" sentences.The caption results are relatively simple, empty, and lack critical detailed information, easily causing a lack of diversity in image captions.According to different research ideas, the present study reviews the existing image captioning methods of generative diversity from three aspects:graph convolutional neural network, generative adversarial network, and data augmentation.The majority of current image captioning models lack controllability, differentiating them from human intelligence.Researchers have proposed an algorithm to solve the problem by actively controlling image caption generation, which is mainly divided into two categories:content-controlled image captions and style-controlled image 
captions. Content-controlled image captions aim to control the described image content, such as different areas or objects of the image, so that the model can describe the image content in which the users are interested. Style-controlled image captions aim to generate captions of different styles, such as humorous, romantic, and antique. In this study, the related algorithms of content-controlled and style-controlled image captions are reviewed. The existing image captioning models are mostly encoder-decoder architectures. The encoder stage uses a convolutional neural network-based visual feature extraction method, whereas the decoder stage uses a recurrent neural network-based method. According to the different existing research ideas, the methods for improving the inference speed of image captioning models are divided into three categories. The first category uses nonautoregressive models to improve the inference speed. The second category uses the grid-based visual feature method to improve the inference speed. The third category uses a convolutional-neural-network-based decoder to improve the inference speed. In addition, this study provides a detailed introduction to general datasets and evaluation metrics in image captioning. General datasets mainly include Flickr8K, Flickr30K, MS COCO (Microsoft common objects in context), TextCaps, Localized Narratives, and Nocaps. The evaluation metrics mainly include the following: bilingual evaluation understudy (BLEU); recall-oriented understudy for gisting evaluation (ROUGE); metric for evaluation of translation with explicit ordering (METEOR); consensus-based image description evaluation (CIDEr); semantic propositional image caption evaluation (SPICE); compact bilinear pooling; text-to-image grounding for image caption evaluation; relevance, extraness, omission; and fidelity and adequacy ensured. Finally, this study deeply discusses the problems to be solved and the future research directions in the field of image captioning, i.e., how to improve the performance of visual feature extraction in image captioning, how to improve the diversity of image captions, how to improve the interpretability of deep learning models, how to realize the transfer between multiple languages in image captioning, how to automatically generate or design the optimal network architecture, and how to study the datasets and evaluation metrics that are suitable for image captioning. Image captioning research is a popular hot spot in computer vision and natural language processing. At present, many algorithms for solving different problems are proposed annually, and other research directions will be developed in the future.
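A schematic encoder-decoder captioner of the kind described above (a CNN encoder producing a fixed-length feature vector, followed by a recurrent decoder) might look like the following; the layer sizes and vocabulary are placeholders rather than those of any reviewed model.

```python
# A toy encoder-decoder captioning model: the encoder maps an image to one
# fixed-length feature vector, and an autoregressive GRU decoder produces
# word logits conditioned on that vector.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size: int = 1000, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        h0 = self.encoder(image).unsqueeze(0)   # (1, B, feat_dim) initial hidden state
        x = self.embed(tokens)                  # (B, T, feat_dim) word embeddings
        y, _ = self.decoder(x, h0)
        return self.out(y)                      # (B, T, vocab_size) word logits

logits = TinyCaptioner()(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```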
- Survey on knowledge distillation and its application Si Zhaofeng, Qi Honggangdoi:10.11834/jig.220273
20-09-2023
117
101
Abstract:Deep learning is an effective method in various tasks, including image classification, object detection, and semantic segmentation.Various architectures of deep neural networks(DNNs), such as Visual Geometry Group network (VGGNet), residual network(ResNet), and GoogLeNet, have been proposed recently.All of which have high computational costs and storage costs.The effectiveness of DNNs mainly comes from their high capacity and architectural complexity, which allow them to learn sufficient knowledge from datasets and generalize well to real-world scenes.However, high capacity and architectural complexity can also result in a drastic increase in storage and computational costs, thereby complicating the implementation of deep learning methods on devices with limited resources.Given the increasing demand for deep learning methods on portable devices, such as mobile phones, the cost of DNNs must be urgently reduced.Researchers have developed a series of methods called model compression to solve the aforementioned problem.These methods can be divided into four main categories:network pruning, weight quantization, weight decomposition, and knowledge distillation.Knowledge distillation is a comparably new method first introduced in 2014.It attempts to transfer the knowledge learned by a cumbersome network(teacher network)to a lightweight network(student network), thereby allowing the student network to perform similarly to the teacher network.Thus, compression can be achieved by using the student network for inference.Traditional knowledge distillation works by providing softened labels to the student network as the training target instead of allowing the student network to learn ground truth directly.The student network can learn about the correlation among classes in the classification problem by learning from softened labels.This approach can be taken as extra supervision while training.The student network trained by knowledge distillation should ideally approximate the performance of the teacher network.In this way, the computational and storage costs in the compressed network are reduced with minor degradation compared with those in the uncompressed network.However, this situation is almost unreachable when the compression rate is large enough to be comparable with the compression rates of other model compression methods.On the contrary, knowledge distillation can be taken as a measure of enhancing the performance of a deep learning model.Thus, this model can perform better than other models of similar size.Moreover, knowledge distillation is a method of model compression.In this study, we aim to review the knowledge distillation methods developed in recent years from a new perspective.We sort the existing methods according to their target by dividing them into performance-oriented methods and compression-oriented methods.Performance-oriented methods emphasize the improvement of the performance of the student network, whereas compression-oriented methods focus on the relationship between the size of the student network and its performance.We further divide these two categories into specific ideas.In performance-oriented methods, we describe state-of-the-art methods in two aspects:the representation of knowledge and ways of learning knowledge.The representation of knowledge has been widely studied in recent years.The researchers attempt to derive knowledge from the teacher network instead of outputting vectors to enrich the knowledge while training.Other forms of knowledge include a 
middle-layer feature map, representation extracted from the middle layer, and structural knowledge.The student network can learn about the teacher network's behavior while forward propagating by combining this extra knowledge with the soft target in traditional knowledge distillation.Thus, the student network acts similarly to the teacher network.Studies on the way of learning knowledge attempt to explore distillation architectures on the basis of the teacher-student architecture.Moreover, architectures includingonline distillation, self-distillation, multiteacher distillation, progressive knowledge distillation, and generative adversarial network(GAN)-based knowledge distillation are proposed.These architectures focus on the effectiveness of distillation and different use cases.For example, online distillation and self-distillation can be applied when the teacher network with high capacity is unavailable.In compression-oriented knowledge distillation, researchers try to combine neural architecture search(NAS)methods with knowledge distillation to balance the relationship between the performance and the size of the student network.Many studies on the impact of the size difference between the teacher network and the student network on distillation performance are also available.They concluded that a wide gap between the teacher and the student can cause performance degradation.Then, bridging the gap between the teacher and the student with several middle-sized networks was proposed in these studies.We also formalize different kinds of knowledge distillation methods.The corresponding figures are shown uniformly to help researchers understand the basic ideas comprehensively and learn about recent works on knowledge distillation.One of the most notable characteristics of knowledge distillation is that the architectures of the teacher network and the student network stay intact during training.Thus, other methods for different tasks can be incorporated easily.In this study, we introduce recent works on different knowledge distillation tasks, including object detection, face recognition, and natural language processing.Finally, we summarize the knowledge distillation methods mentioned before and propose several possible ideas.Recent research on knowledge distillation has mainly focused on enhancing the performance of the student network.The major problem of the student network lies in finding a feasible source of knowledge from the teacher network.Moreover, compression-based knowledge distillation suffers from the problem of searching space when NAS adjusts network architecture.On the basis of the analysis above, we propose three possible ideas for researchers to study:1)obtaining knowledge from various tasks and architectures in the form of knowledge distillation, 2)developing a searching space for NAS when combined with knowledge distillation and adjusting the teacher network while searching for the student network, and 3)developing a metric for knowledge distillation and other model compression methods to evaluate both task performance and compression performance.
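The traditional softened-label distillation described above can be written as a short loss function; the temperature and loss weighting below are illustrative choices, not values prescribed by the surveyed methods.

```python
# A minimal sketch of the softened-label distillation loss: the student matches
# temperature-softened teacher outputs in addition to the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy with the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```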
Image Processing and Coding
- Efficient tone mapping via macro and micro information enhancement and color correction Zhu Zhongjie, Cui Weifeng, Bai Yongqiang, Jing Weiyi, Jin Minhongdoi:10.11834/jig.220460
20-09-2023
146
189
Abstract:Objective The traditional 8-bit images cannot accurately store and represent the real natural scene because the brightness variations in reality are very wide, ranging from faint starlight to direct sunlight with more than nine orders of magnitude.High dynamic range(HDR)imaging technology adopts floating-point numbers to address this deficiency.This technology can accurately represent the fidelity of a real scene with abundant brightness and chroma information.However, HDR images cannot be rendered directly on conventional display devices.Tone mapping(TM)technology aims to convert HDR images into traditional images while preserving the natural scene without losing information.Many excellent TM operators have emerged and have been widely used in business.However, the scene information is inevitably lost in different degrees because of the large-scale transformation and compression of the brightness range.In particular, even the state-ofart TM operators for complex scenes still have some problems, such as blurred details, edge halation, brightness imbalance, and color distortion, which seriously affect the subjective feeling of human eyes.Hence, a novel TM algorithm is proposed in this study via macro and micro information enhancement and color correction.Method Targeted algorithm structures with different strategies for the brightness and chroma domains are constructed in this study based on the human visual perception mechanism.First, an HDR image is converted from RGB color space to HSV color space, and the independent luminance information and chrominance information can be separated effectively.Thus, the subsequent processing can be performed smoothly without mutual interference.Second, different processing and optimization strategies are adopted for the brightness and chroma channels, respectively.For the former, the brightness range is greatly compressed to meet the demand of low dynamic range images while enhancing the detailed information perceived by human eyes from the macro and micro points of view.In particular, the brightness channel is divided into the basic and detail layers through the weighted guidance filter.The basic layer is compressed and combined with the macro statistical information to reduce the brightness contrast of the image and ensure the authenticity and integrity of the image background information and the overall structure.Subsequently, the salient region of the real scene is extracted by the gray-level co-occurrence matrix based on the human eye attention mechanism.According to the saliency information distribution, the texture information of the detail layer is enhanced.The edge halation is further eliminated by adjusting the scaling factor.Finally, the compressed base layer and enhanced detail layer are linearly fused to the targeted brightness channel while ensuring macro consistency and micro significance with the HDR image.For the chroma channel, a saturation migration model is designed with integrating brightness compression.This model can adaptively adjust the information saturation with brightness variety while keeping the hue information unchanged.According to the principle of color constancy, people's perception of the color of the object's surface remains unchanged when the color light irradiating the object's surface changes.Moreover, experience shows that different saturation levels directly affect people's subjective perception of color, even if the hue information remains unchanged.Therefore, a median shift model is constructed to 
adjust the image chromaticity saturation adaptively by combining the changes in the statistical information of brightness compression.Thus, the constancy of object surface color can be ensured, and the subjective color distortion caused by information compression of the luminance channel can be effectively avoided.The main experiments include the establishment of a database containing nearly 200 HDR images with different brightness dynamic ranges, light and dark area distribution, and detail richness.This database is used to verify the feasibility and generalization of the proposed algorithm.Result Experimental results show that the proposed algorithm is superior to the existing TM algorithms in subjective and objective evaluations.In terms of objective evaluation, the images are scored using the TM quality index(TMQI).The comprehensive evaluation score is obtained by calculating the naturalness and structural fidelity.Compared with the algorithms in the reviewed studies, the proposed algorithm exhibits a comprehensive TMQI score reaching the highest score of 0.862 9.The proposed algorithm is also superior to most of the existing methods in terms of naturalness and structural fidelity.For the subjective evaluation, we refer to the international mean opinion score standard, with scores ranging from 1 to 5, indicating the worst to the best.The subjects score the test images according to their personal preferences, which are combined with the images'texture details, edge halation, brightness imbalance, and color distortion.The scores of 20 subjects, including 10 men and 10 women, are counted.Results show that the subjects give four points to most images mapped by the proposed algorithm, and a few images achieve the best five points.In particular, the average score of the proposed algorithm reaches the highest 4.3 points.Conclusion In this study, a brightness perception compression model with macro consistency and micro significance is constructed in the brightness channel.Thus, the drawbacks of the existing TM algorithms, such as the loss of detail texture information, edge halation, and brightness imbalance, can be effectively solved.Moreover, a saturation migration model is designed by integrating brightness compression in the chroma channel, effectively solving the color distortion caused by brightness compression.The experimental comparison results indicate that the TM algorithm via macro and micro information enhancement and color correction proposed in this study is better than the existing TM algorithms.Moreover, the proposed algorithm provides a beneficial condition for us to conduct high dynamic image generation and high dynamic video coding.
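A schematic version of the brightness-channel pipeline described above (decompose the luminance into base and detail layers, compress the base, enhance the detail, and recombine) is sketched below; a Gaussian filter stands in for the paper's weighted guided filter, and the compression and enhancement factors are illustrative assumptions.

```python
# A toy base/detail tone-mapping pipeline in the log-luminance domain.
import numpy as np
from scipy.ndimage import gaussian_filter

def tone_map_luminance(lum: np.ndarray, compression: float = 0.6, detail_gain: float = 1.5):
    """lum: HDR luminance channel (positive floats). Returns an LDR luminance in [0, 1]."""
    log_lum = np.log1p(lum)
    base = gaussian_filter(log_lum, sigma=5)   # smooth base layer (overall structure)
    detail = log_lum - base                    # detail layer (texture and edges)
    mapped = compression * base + detail_gain * detail
    mapped = np.expm1(mapped)
    return (mapped - mapped.min()) / (mapped.max() - mapped.min() + 1e-8)

ldr = tone_map_luminance(np.random.rand(64, 64) * 1000.0)
```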
- Low-light image enhancement and denoising with internal and external priors Du Shuangli, Dang Hui, Zhao Minghua, Shi Zhenghaodoi:10.11834/jig.220707
20-09-2023
120
179
Abstract:Objective Low-light image enhancement has been studied extensively in the past few decades as one of the most challenging image processing problems.The images taken in low-light conditions usually contain extremely dark areas and unexpected noise.Many impressive methods, including cognition-based and learning-based approaches, have been proposed to improve image brightness and recover image details and color information.Remarkable enhancement results have been achieved by deep-learning-based techniques.Low-light and norm-light image pairs are required for enhancement methods based on supervised learning.However, no unique or well-defined norm-light ground truth exists.In addition, the models trained by a direct image-to-image transformation manner, even with generative adversarial networks, tend to show a bias toward a certain range of luminance values and scenes.Approaches based on the retina cortex(Retinex)represent a branch of cognition-based methods.However, they tend to amplify the noises hidden in dark images.Some attempts for noise suppression have been introduced.They focus on utilizing the internal prior in the input image to distinguish the noise from the image texture.The denoising performance is limited, and the image texture is often removed together with noise.This scenario leads to a blurry background.Additionally, most of these enhancement methods are designed for generally low-light images.If they are used for images with extremely low light, insufficient brightness improvement and obvious color deterioration are produced.This study proposes a low-light image enhancement and denoising method to address these issues by combining the internal and external priors.Method We regard extremely low-light image enhancement as a two-stage illumination correction task.First, the global illumination in a scene is estimated based on the well-known dark channel prior.If the global illumination is lower than 0.5, the input image is regarded as an extremely low-light image, and an initial brightness correction is performed for the image.If the global illumination is greater than or equal to 0.5, the input image is a generally low-light image;thus, no further processing is required.Second, a sequential Retinex decomposition model is proposed to decompose a low-light image into an illumination component multiplied by a reflectance component.An L1-norm regularization term on the illumination gradient is applied under the assumption that it is spatially piecewise smooth.Unlike approximating the illumination layer to a pre-estimation, our method aims to approximate the illumination layer to the low-light image in the RGB color space.Then, all noises are supposed to be contained in the reflectance layer.Based on the Retinex decomposition result, the enhanced noise image is produced with Gamma correction.Finally, a denoising technique is proposed based on a dual, complementary prior constraint.This technique utilizes a nonlocal selfsimilarity property to construct the internal prior for the reflection component.The deep learning technique is also utilized to construct the external prior constraint for the enhanced noise image.Then, the internal and external priors restrict each other.The proposed denoising model can be solved by an alternating optimization strategy.Result We compare the proposed method with six existing enhancement algorithms, including two Retinex-based traditional approaches, two deep learning approaches, and two Retinex-based learning approaches, to verify its 
effectiveness. We select 140 generally low-light images (global illumination > 0.5) from the commonly used datasets, including DICM, LIME, and ExDark. We also select 162 extremely low-light images (global illumination < 0.15) from the LOL dataset for testing. For the generally low-light images, no well-exposed normal-light image exists for reference. Both visual evaluation and quantitative evaluation are provided. Three no-reference quality assessment metrics, including the blind tone-mapped quality index (BTMQI), the no-reference image quality metric for contrast distortion (NIQMC), and the natural image quality evaluator (NIQE), and two full-reference metrics, namely, the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), are utilized for evaluation. The visual comparisons show that our method has advantages in brightness improvement, color fidelity, and denoising. For the generally low-light images, the quantitative comparisons show that our method achieves the second-best results for BTMQI and NIQE. For NIQMC, the result of our method is close to the results of the two Retinex-based traditional methods. For extremely low-light images, our method achieves the best results for NIQMC, PSNR, and SSIM. The PSNR values obtained by the other algorithms range from 8 to 18.35 dB, and their SSIM values range from 0.3 to 0.78. In comparison, our algorithm reaches 18.94 dB and 0.82 for PSNR and SSIM, respectively, showing noticeable advantages. Qualitative and quantitative experimental results show that the proposed algorithm can enhance low-light images under different illumination conditions, effectively remove the noise hidden in them, and maintain relatively stable performance. Conclusion This paper proposes a novel Retinex-based low-light image enhancement and denoising method, which can be used for both generally and extremely low-light images. The irreconcilable conflict between brightness increase and color distortion in the extremely low-light enhancement task is effectively resolved by transforming an extremely low-light image into a generally low-light image. A dual, complementary constraint is constructed based on the internal and external priors to remove the amplified noise. The experiments demonstrate that the constraint can balance noise removal and texture preservation, keeping the edges of the enhanced image clear.
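The first-stage decision described above, estimating the global illumination from the dark channel prior and comparing it with the 0.5 threshold, might be sketched as follows; the patch size and the exact mapping from the dark channel to a global-illumination score are assumptions made only for illustration.

```python
# A toy dark-channel-based check for extremely low-light input.
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img: np.ndarray, patch: int = 15) -> np.ndarray:
    """img: (H, W, 3) RGB in [0, 1]. Per-pixel minimum over channels and a local patch."""
    min_rgb = img.min(axis=2)
    return minimum_filter(min_rgb, size=patch)

def is_extremely_low_light(img: np.ndarray, threshold: float = 0.5) -> bool:
    # Illustrative global-illumination proxy: mean of the brightest 0.1% of
    # dark-channel values (an assumption, not the paper's exact estimator).
    dc = dark_channel(img)
    top_k = max(dc.size // 1000, 1)
    global_illumination = float(np.mean(np.sort(dc.ravel())[-top_k:]))
    return global_illumination < threshold

print(is_extremely_low_light(np.random.rand(128, 128, 3) * 0.1))  # dark input -> True
```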
Image Analysis and Recognition
- Infrared target tracking algorithm based on attention mechanism enhancement and target model update Ji Qingbo, Chen Kuicheng, Hou Changbo, Li Ziqi, Qi Yufeidoi:10.11834/jig.220459
20-09-2023
168
212
Abstract:Objective Most target tracking algorithms are designed based on visible sight scenes.However, in some cases, infrared target tracking has advantages that visible light does not have.Infrared equipment uses the radiation of an object itself to image and does not require additional lighting sources.It can display the target in weak light or dark scenes and has a certain penetration ability.However, infrared images have defects, such as unclear boundaries between targets and backgrounds, blurred images, and cluttered backgrounds.Moreover, some infrared dataset images are rough, negatively impacting the training of data-driven-based deep learning algorithms to a certain extent.Infrared tracking algorithms can be divided into traditional methods and deep learning methods.Traditional methods generally take the idea of correlation filtering as the core.Deep learning methods are mainly divided into the method of a neural network providing target features for correlation filters and the method of calculating the similarity of the image area with the framework of the Siamese network.The feature extraction ability of traditional methods for infrared targets is far inferior to that of deep learning methods.Moreover, the filters trained online cannot adapt to fast-moving or blurred targets, resulting in poor tracking accuracy in scenes with complex backgrounds.At present, most deep-learning-based infrared target tracking methods still lack the use of detailed information on infrared targets in infrared scenes with weak contrast and noise.Most trackers cannot effectively update the tracked target when the tracking scene has similar targets and cluttered background.This scenario results in poor robustness in long-term tracking.Therefore, an infrared target tracking algorithm based on attention and template adaptive update is proposed to solve the problems mentioned.Method The Siamese network tracking algorithm takes the target in the first frame as the template and performs similarity calculation on the search area of the subsequent frames to obtain the position of the target with the maximum response.The method has a simple structure and high tracking efficiency.However, most algorithms currently use the anchor-based mechanism, and the preset anchor requires tedious manual debugging to adapt to changes in the scale and aspect ratio of the target.The anchor-free design of the Siamese box adaptive network(SiamBAN)avoids the hyperparameters related to the candidate box.These hyperparameters are flexible and general.Therefore, this study is based on the SiamBAN tracking framework.Then, a fast attention enhancement module designed for infrared tracking scenes is added to process infrared images in parallel.This module mainly includes two parts:The first part is the contrast limited adaptive histogram equalization;the second part is the efficient channel attention module.A three-layer convolutional network connects the two parts to form a residual structure.This structure can improve the difference between the infrared target and the background.It can also enhance the detailed information of the target without losing the original information.The extracted features are proportionally fused into the middle layer of the backbone network to achieve rapid utilization.The target adaptive update network is used to learn the feature change trend of the infrared target while dynamically updating the middle- and high-level features of the target.The target adaptive update network uses the target information of 
the first frame as the initial template. Then, it superimposes the historical accumulation template and the template of the current frame to calculate the best template of the target in the next frame, thereby realizing the continuous use of the historical information of the target. Result We compare our infrared target tracking algorithm with 10 state-of-the-art trackers on four infrared target tracking evaluation benchmarks, namely, the large-scale thermal infrared object tracking benchmark (LSOTB-TIR), the thermal infrared pedestrian tracking benchmark (PTB-TIR), thermal infrared visual object tracking (VOT-TIR2015), and VOT-TIR2017. On the LSOTB-TIR dataset, the proposed algorithm ranks first with a precision of 79.0%, and its normalized precision and success rate are 71.5% and 66.2%, which are 4.0% and 4.6% higher than those of the second-ranked algorithm. On the PTB-TIR dataset, the proposed algorithm again ranks first, with a precision of 85.1% and a success rate of 66.9%, which are 1.3% and 3.6% higher than those of the second-ranked algorithm. The expected average overlap on the VOT-TIR2015 dataset is 0.344 and the accuracy is 0.73; the results of the same test on the VOT-TIR2017 dataset are 0.276 and 0.71. The algorithm achieves the highest ranking on the first three benchmarks. The ablation study on the LSOTB-TIR dataset shows that the algorithm provides an obvious gain over the baseline tracker. Finally, the qualitative analysis of the experimental results on the LSOTB-TIR dataset shows that the proposed algorithm is highly robust under the attributes of background clutter, fast motion, intensity variation, scale variation, occlusion, out-of-view, deformation, low resolution, and motion blur. It also shows that the fast attention enhancement module and the target adaptive update network positively affect the improvement of the tracking success rate. Conclusion Our algorithm improves the ability of the backbone to capture the features of the infrared target and adaptively adjusts the characteristic state of the target through its historical change information. Thus, the problem that infrared target tracking is susceptible to interference in complex environments is solved, and the precision and success rate of long-term infrared target tracking are improved.
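The template update described above, superimposing the initial template, the accumulated historical template, and the current-frame template, can be illustrated with a fixed linear blend; the learned update network of the paper is replaced here by constant weights, so this is a stand-in rather than the actual method.

```python
# A toy adaptive template update: blend the first-frame template, the
# accumulated history, and the current frame into the next template.
import numpy as np

def update_template(initial: np.ndarray, accumulated: np.ndarray, current: np.ndarray,
                    w_init: float = 0.3, w_hist: float = 0.5, w_curr: float = 0.2) -> np.ndarray:
    """All inputs are target feature templates of the same shape; returns the new template."""
    return w_init * initial + w_hist * accumulated + w_curr * current

template0 = np.random.rand(256, 7, 7)          # features of the first-frame target
accumulated = template0.copy()
for _ in range(10):                            # per-frame features of the tracked target
    current = np.random.rand(256, 7, 7)
    accumulated = update_template(template0, accumulated, current)
```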
- Human similar action recognition by fusing saliency image semantic features Bai Zhongyu, Ding Qichuan, Xu Hongli, Wu Chengdongdoi:10.11834/jig.220028
20-09-2023
114
69
Abstract:Objective Human action recognition is a valuable research area in computer vision.It has a wide range of applications, such as security monitoring, intelligent monitoring, human-computer interaction, and virtual reality.The skeleton-based action recognition method first extracts the specific position coordinates of the major body joints from the video or image by using a hardware method or a software method.Then, the skeleton information is used for action recognition.In recent years, skeleton-based action recognition has received increasing attention because of its robustness in dynamic environments, complex backgrounds, and occlusion situations.Early action recognition methods usually use hand-crafted features for action recognition modeling.However, the hand-crafted feature methods have poor generalization because of the lack of diversity in the extracted features.Deep learning has become the mainstream action recognition method because of its powerful automatic feature extraction capabilities.Traditional deep learning methods use constructed skeleton data as joint coordinate vectors or pseudo-images, which are directly input into recurrent neural networks(RNNs) or convolutional neural networks(CNNs)for action classification.However, the RNN-based or CNN-based methods lose the spatial structure information of skeleton data because of the limitation set by the European data structure.Moreover, these methods cannot extract the natural correlation of human joints.Thus, distinguishing subtle differences between similar actions becomes difficult.Human joints are naturally structured as graph structures in non-Euclidean space.Several works have successfully adopted graph convolutional networks(GCNs)to achieve state-of-the-art performance for skeletonbased action recognition.In these methods, the subtle differences between the joints are not explicitly learned.These subtle differences are crucial to recognizing similar actions.Moreover, the skeleton data extracted from the video shield the object information that interacts with humans and only retain the primary joint coordinates.The lack of image semantics and the reliance only on joint sequences remarkably challenge the recognition of similar actions.Method Given the above factors, the saliency image feature enhancement based center-connected graph convolutional network(SIFE-CGCN)is proposed in this work for skeleton-based similar action recognition.The proposed model is based on GCN, which can fully utilize the spatial and temporal dependence information between human joints.First, the CGCN is proposed for skeletonbased similar action recognition.For the spatial dimension, a center-connection skeleton topology is designed to establish connections between all human joints and the skeleton center to capture the small difference in joint movements in similar actions.For the temporal dimension, each frame is associated with the previous and subsequent frames in the sequence.Therefore, the number of adjacent nodes in the frame is fixed at 2.The regular 1D convolution is used on the temporal dimension as the temporal graph convolution.A basic co-occurrence graph convolution unit includes a spatial graph convolution, a temporal graph convolution, and a dropout layer.For training stability, the residual connection is added for each unit.The proposed network is formed by stacking nine graph convolution basic units.The batch normalization(BN)layer is added before the beginning of the network to standardize the input data, and a global 
average pooling layer is added at the end to unify the feature dimensions. The dual-stream architecture is used to utilize the joint and bone information of the skeleton data simultaneously and extract data features from multiple angles. Given the different roles of each joint in different actions, an attention map is added to focus on the main motion joints in an action. Second, the saliency image in the video is selected using the Gaussian mixture background modeling method. Each image frame is compared with the real-time updated background model to segment the image areas with considerable changes, and the background interference is eliminated. The effective extraction of semantic feature maps from saliency images is the key to distinguishing similar actions. The Visual Geometry Group network (VGG-Net) can effectively extract the spatial structure features of objects from images. In this work, the feature map is extracted through a pre-trained VGG-Net, and a fully connected layer is used for feature matching. Finally, the feature map matching result is used to strengthen and revise the recognition result of the CGCN and improve the recognition ability for similar actions. In addition, a similarity calculation method for skeleton sequences is proposed, and a similar action dataset is established in this work. Result The proposed model is compared with the state-of-the-art models on the proposed similar action dataset and the Nanyang Technological University RGB+D (NTU RGB+D) 60/120 datasets. The methods used for comparison include CNN-based, RNN-based, and GCN-based models. On the cross-subject (X-Sub) and cross-view (X-View) benchmarks of the proposed similar action dataset, the recognition accuracy of the proposed model reaches 80.3% and 92.1%, which are 4.6% and 6.0% higher than the recognition accuracies of the suboptimal algorithm, respectively. The recognition accuracy of the proposed model on the X-Sub and X-View benchmarks of the NTU RGB+D 60 dataset reaches 91.7% and 96.9%, improvements of 1.4% and 0.6% over the suboptimal algorithm. Compared with the suboptimal model, the feedback graph convolutional network (FGCN), the proposed model improves the recognition accuracy by 1.7% and 1.1% on the X-Sub and cross-setup (X-Set) benchmarks of the NTU RGB+D 120 dataset, respectively. In addition, we conduct a series of comparative experiments to clearly show the effectiveness of the proposed CGCN, the saliency image extraction method, and the fusion algorithm. Conclusion In this study, we propose SIFE-CGCN to solve the confusion that arises when recognizing similar actions, which is caused by the ambiguity of skeleton features and the lack of image semantic information. The experimental results show that the proposed method can effectively recognize similar actions, and the overall recognition performance and robustness of the model are improved.
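The center-connected skeleton topology described above, which links every joint to the skeleton center on top of the natural bone connections, can be sketched as an adjacency-matrix construction; the joint indices and bone list below are placeholders rather than an actual skeleton layout.

```python
# A minimal sketch of a center-connected skeleton adjacency matrix with the
# symmetric degree normalisation that is usual in graph convolution.
import numpy as np

def center_connected_adjacency(num_joints: int, bones, center: int) -> np.ndarray:
    A = np.eye(num_joints)                     # self-connections
    for i, j in bones:                         # natural skeleton edges
        A[i, j] = A[j, i] = 1.0
    A[:, center] = A[center, :] = 1.0          # connect every joint to the center joint
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt

# Placeholder 5-joint chain with joint 2 as the center.
A_hat = center_connected_adjacency(num_joints=5, bones=[(0, 1), (1, 2), (2, 3), (3, 4)], center=2)
print(A_hat.shape)  # (5, 5)
```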
Image Understanding and Computer Vision
- Complex gesture pose estimation network fusing multiscale features Jia Di, Li Yuyang, An Tong, Zhao Jinyuandoi:10.11834/jig.220636
20-09-2023
144
78
Abstract:Objective Hand pose estimation aims to identify and localize key points of human hands in images.It has a wide range of applications in computer vision.Hand pose estimation methods can be categorized as depth- or RGB-based methods.Depth-based methods estimate the hand pose by extracting depth features.They require specific devices to constrain the user environment.Scholars use RGB images for hand pose estimation.However, this approach is difficult in an occluded environment.In particular, hand pose estimation based on a single RGB image has low accuracy because of the complexity of the pose, local self-similarity of finger features, and occlusion.Edge information is usually ignored in hand pose estimation.However, this information is important in extracting the information of occluded parts.Moreover, fingertips are small, thereby complicating the recognition of the joints at the fingertips.However, many existing RGB-based gesture estimation methods do not make good use of edge information.A multiscale feature fusion network for monocular vision gesture pose estimation is proposed to address this problem.Method Gesture pictures usually contain complex detailed features.A strong correlation between fingers and joints is present.Therefore, the use of a single feature for hand pose estimation tends to ignore diverse feature information, thereby complicating the accurate extraction of gesture information.Multiscale feature fusion network(MS-FF)aims to estimate the hand pose through a single RGB image.The feature maps of different resolutions are extracted from RGB images through the ResNet50 module.Feature maps are fed into the channel conversion module to learn the dependencies between channels explicitly, thereby enhancing important information and downplaying minor information.The level of feature information depends on the resolution of a feature map.Thus, the global regression module obtains high-resolution feature maps containing semantic information.These maps are separately input in the local optimization module to extract deep information.The Gaussian heatmap of hand joints is obtained to improve the spatial generalization ability of the model.Thus, accurate joint locations can be obtained.We take the feature map with the smallest resolution from the channel conversion module, through which the handedness and relative depth information between the wrist joints are obtained.The above results are combined to estimate the hand pose.Result The PyTorch framework was used for training.The hand image was resized to 256×256 pixels and input to the network.In the experiment, the batch size was set to 16.The network was trained for 20 epochs with an NVIDIA 3090 GPU.The initial learning rate was set to 0.000 1 and reduced by a factor of 10 at the 15th and 17th epochs to optimize the network output.The proposed method achieved better metrics than other methods on different test sets.InterHand2.6M(H+M)was selected as the training set.Compared with the evaluation metrics obtained by InterNet, the mean relative root position error, mean per joint position error of single hand sequences, and mean per joint position error of interacting hand sequences obtained by MS-FF had low errors of 30.92, 11.10, and 15.14, respectively.These values were 5.1%, 8.3%, and 5.8% lower than those obtained by InterNet.We also found that each finger achieved a low error.MS-FF also possesses few model parameters and low computational complexity while improving recognition accuracy.However, the running rate of 
MS-FF(28 frame/s)is lower than that of InterNet(53 frame/s).The picture shows the hand pose with finger self-occlusion and mutual occlusion of hands.Thus, estimating this interacting hand pose is more difficult than predicting a single hand pose.In the result obtained by our method, the hand joint positions and hand pose estimations are correctly predicted under occlusion.Moreover, our method can accurately predict hand joint positions and hand poses in case of occlusion.The proposed method achieves good recognition results in occluded gestures.Conclusion This study proposes an MS-FF for monocular visual hand pose estimation.MS-FF can extract information of different levels from feature maps of different resolutions to process the detailed information of occluded edges and fingertips effectively and estimate hand poses accurately.MS-FF accurately estimates hand poses in an RGB image and copes well with complex application scenarios.Thus, it can deal with difficult-to-recognize joints and inaccurate gesture recognition in occlusion scenes.Channels contain various implicit information.We need to focus on the information that is important for recognizing gestures.A channel conversion module adjusts the weights of channels to enhance important information.Fingertips occupy a small percentage of an image.They are also relatively difficult to identify.A global regression module generates different resolutions with rich semantic information to utilize image edge details and deep information effectively.This module is important in estimating finger poses.The global regression module may not accurately identify occluded joints.A local optimization module is designed with deep information in the feature map.It fuses all-level feature maps by correcting joints that do not return to the correction position.Thus, these maps can be applied well to the occlusion scene.Our method can effectively estimate single and interacting hand poses.It can also avoid errors caused by occlusion to a certain extent.High accuracy and robustness are achieved using the proposed method.However, the running rate of MS-FF is slower than that of the InterNet method because of the complex construction process of the MS-FF method.This scenario increases serial wait, kernel startup, and synchronization time overhead.In future work, we will continue to optimize our model, reduce the running rate of the model while ensuring recognition accuracy, and achieve a fast recognition speed to pave the way for fast and accurate gesture recognition in real scenes.
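The Gaussian heatmaps of hand joints mentioned above can be generated as follows; the heatmap resolution and standard deviation are illustrative choices, not the values used by MS-FF.

```python
# A minimal sketch of joint Gaussian heatmaps: each hand joint is encoded as a
# 2D Gaussian centred at its pixel location.
import numpy as np

def joint_heatmaps(joints_xy: np.ndarray, size: int = 64, sigma: float = 2.0) -> np.ndarray:
    """joints_xy: (K, 2) pixel coordinates in a size x size map. Returns (K, size, size)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmaps = np.zeros((len(joints_xy), size, size), dtype=np.float32)
    for k, (x, y) in enumerate(joints_xy):
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

hm = joint_heatmaps(np.array([[32, 32], [10, 50]]))
print(hm.shape)  # (2, 64, 64); each map peaks at its joint location
```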
- Meta-transfer learning in cross-domain image classification with few-shot learning Du Yandong, Feng Lin, Tao Peng, Gong Xun, Wang Jun doi:10.11834/jig.220664
20-09-2023
Abstract: Objective Few-shot learning image classification aims to recognize images with limited labeled samples. At present, few-shot learning methods are roughly divided into three categories: gradient optimization, metric learning, and transfer learning. Gradient optimization methods usually consist of two loop stages. In the inner loop stage, the base model quickly adapts to new tasks with limited labeled samples; in the outer loop stage, the meta-model learns cross-task knowledge for good generalization performance. The metric learning method first maps the samples to a high-dimensional embedding space and then classifies the unknown samples according to a similarity measure. The transfer learning method first pretrains a high-quality feature extractor with a large amount of annotated data and then fine-tunes the classifier so that the model suits the current task. The existing few-shot learning methods based on meta-learning assume that the training and test tasks are the same or have similar distributions. However, these methods face cross-domain classification challenges, such as weak generalization ability and poor classification accuracy. The few-shot learning methods based on transfer learning do not consider the inconsistency of sample categories between the training and testing stages and fail to leave enough feature embedding space for new category samples. On the basis of the idea of integrating transfer learning and meta-learning, we propose a compressed meta-transfer learning (CMTL) model to improve the cross-domain ability of few-shot learning. Method The method is mainly composed of two aspects. On the one hand, for meta-learning, the prior knowledge generated by meta-training is used to complete the classification of the target task. When the source and target tasks have different data distributions, for example, when the base class data used for training come from the source-domain natural dataset mini-ImageNet while the novel class data used for testing come from the target-domain medical dataset Chest-X, the meta-knowledge acquired from the source task cannot be quickly generalized to the target task because of its lack of universality, which further leads to poor cross-domain classification effects. In this study, new auxiliary tasks with strategies such as random cropping and gamma transformation were constructed on the support set of the target domain during the testing process. These auxiliary tasks fine-tune the meta-trained parameters to improve task adaptability. On the other hand, for transfer learning, the sample categories are assumed to be consistent during training and testing. Thus, the feature embedding space available for the novel class samples is small if the deep learning model is optimized with the traditional softmax loss function, which further leads to unsatisfactory feature extraction ability and poor classification accuracy. Given the above problems, this study proposes a self-compression loss function in the pretraining stage. This loss function adjusts the distribution of the base-class prototypes to make the base-class samples concentrated in the embedding space and to reserve part of the embedding space for the novel classes. In the fine-tuning stage, the novel classes with a large domain gap are guided to obtain expressive features. Existing studies on cross-domain few-shot learning show that meta-learning methods perform well when the data distributions of the target and source tasks are similar; conversely, when the distributions differ, transfer learning methods perform more effectively. The ensemble of the prediction scores of the above two strategies is regarded as the final classification result to take full advantage of both methods. Result This study compares the proposed model with several state-of-the-art cross-domain few-shot image classification models, such as the graph convolutional network (GCN), adversarial task augmentation (ATA), self-training to adapt representations to unseen problems (STARTUP), and other classic methods. Compared with the current state-of-the-art cross-domain few-shot methods, CMTL has advantages, as shown in the experimental results. In the testing phase, extensive experiments are performed on the 5-way 1-shot and 5-way 5-shot settings to validate the model and ensure a fair comparison with advanced methods. In these experiments, mini-ImageNet is used as the source-domain dataset for training, and the effectiveness of CMTL is tested on the EuroSAT, ISIC, CropDisease, and Chest-X datasets; the accuracy rates reach 68.87%/87.74%, 34.47%/49.71%, 74.92%/93.37%, and 22.22%/25.40% on the 5-way 1-shot and 5-way 5-shot settings, respectively. Compared with meta-learning and transfer learning models, our model achieves competitive results on the 5-way 1-shot and 5-way 5-shot settings on all cross-domain tasks. Conclusion This study proposes a cross-domain few-shot image classification model based on meta-transfer learning. The proposed model improves the generalization ability of few-shot learning. The experimental results show that the proposed CMTL combines the advantages of meta-learning and transfer learning methods and has significant effects on cross-domain few-shot tasks.
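The abstract's final step, ensembling the prediction scores of the meta-learning and transfer-learning branches, can be illustrated with a minimal sketch. The equal-weight averaging and the function name `ensemble_predict` are assumptions for illustration; CMTL's actual combination rule may differ.

```python
# Illustrative sketch: blend per-class scores from two branches and pick the best class.
import torch

def ensemble_predict(meta_logits: torch.Tensor,
                     transfer_logits: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Average the softmax scores of the two branches and return class indices."""
    meta_probs = meta_logits.softmax(dim=-1)
    transfer_probs = transfer_logits.softmax(dim=-1)
    blended = alpha * meta_probs + (1.0 - alpha) * transfer_probs
    return blended.argmax(dim=-1)

# Toy usage on a 5-way episode with 10 query images (random logits as stand-ins):
meta = torch.randn(10, 5)
trans = torch.randn(10, 5)
print(ensemble_predict(meta, trans))
```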
- Visual-semantic dual-disentangling for generalized zero-shot learning Han Ayou, Yang Guan, Liu Xiaoming, Liu Yang doi:10.11834/jig.220486
20-09-2023
Abstract: Objective Traditional deep learning models are widely adopted in many application scenarios and perform effectively. However, they rely on a large number of training samples, which are difficult to collect in practical applications. Moreover, such models can identify only the classes already present in the training phase (seen classes), and processing classes never seen in the training phase (unseen classes) remains a challenge. Zero-shot learning (ZSL) provides a good solution to this challenge. Zero-shot learning aims to classify unseen classes for which no training samples are available during the training phase. However, another problem arises from the complexity of the real world: in practice, both seen and unseen classes can be encountered. Therefore, generalized zero-shot learning (GZSL) is proposed. This new setting is more realistic and universal; as a generalized method, generalized zero-shot learning can sample test sets from both seen and unseen classes. The existing generalized zero-shot learning methods can be subdivided into two categories, namely, embedding-based and generation-based methods. The former learns a projection or embedding function that associates the visual features of the seen classes with the corresponding semantics, whereas the latter learns a generative model to generate visual features for the unseen classes. In previous studies, the visual features extracted using pretrained deep models (e.g., ResNet101) are not specifically extracted for the generalized zero-shot learning task, and not all dimensions of the extracted visual features are semantically related to the predefined attributes. This scenario makes the model incline toward the seen classes. Most methods also ignore useful feature-related information present in the semantics during classification, which remarkably affects the final classification. In this paper, we propose a new generalized zero-shot learning method, called the visual-semantic dual-disentangling framework for generalized zero-shot learning (VSD-GZSL), to disentangle the relevant visual features and semantic information. Method Conditional variational auto-encoders (VAEs) are combined with a disentanglement network and trained in an end-to-end manner. The proposed disentanglement network is an encoder-decoder structure. The visual features and semantics of the seen classes are first used to train the conditional variational auto-encoders and the disentanglement network. Once the network has converged, the trained generative network generates visual features for the unseen classes. The real features of the seen classes and the generated features of the unseen classes are fed into a visual feature disentanglement network to disentangle the semantic-consistent and semantic-irrelevant features. The semantics are likewise fed into a semantic disentanglement network to be disentangled into feature-relevant and feature-irrelevant semantic information. The components disentangled by the two disentanglement networks are fed into the decoder and reconstructed back to the corresponding spaces by using a reconstruction loss to prevent information loss during the disentanglement stage. A total correlation penalty module is designed to measure the independence between the latent variables disentangled by the disentanglement network. A relational network is designed to maximize the compatibility score between the components disentangled by the visual disentanglement network and the corresponding semantics and to learn the semantic consistency of the visual features. The semantic information related to the visual features, disentangled by the semantic disentanglement network, is fed into the visual disentanglement decoder for cross-modal reconstruction to measure the feature relevance of the semantics. Finally, the semantic-consistent features and feature-related semantics disentangled by the two disentanglement networks are jointly used to learn a generalized zero-shot classifier for classification. Result The proposed method was validated in several experiments on four generalized zero-shot learning open datasets (AwA2, CUB, SUN, and FLO). The proposed method achieved better results than the baseline, with a 3.8% improvement in the unseen class accuracy, a 0.2% improvement in the seen class accuracy, and a 1.6% improvement in the harmonic mean on the AwA2 dataset. The unseen class accuracy improved by 3.8%, the seen class accuracy improved by 2.4%, and the harmonic mean improved by 3.2% on the CUB dataset. The unseen class accuracy improved by 10.1%, the seen class accuracy improved by 4.1%, and the harmonic mean improved by 6.2% on the SUN dataset. Moreover, the seen class accuracy improved by 9.1%, and the harmonic mean improved by 1.5% on the FLO dataset. The proposed method was also compared with 10 recently proposed generalized zero-shot learning methods. Compared with f-CLSWGAN, VSD-GZSL exhibited improved harmonic means by 10%, 8.4%, 8.1%, and 5.7% on the four datasets. Compared with cycle-consistent adversarial networks for zero-shot learning (CANZSL), VSD-GZSL exhibited improved harmonic means by 12.2%, 5.6%, 7.5%, and 4.8% on the four datasets. Compared with leveraging invariant side GAN (LisGAN), VSD-GZSL exhibited improved harmonic means by 8.1%, 6.5%, 7.3%, and 3% on the four datasets. Compared with cross- and distribution-aligned VAE (CADA-VAE), VSD-GZSL exhibited improved harmonic means by 6.5%, 5.7%, 6.9%, and 10% on the four datasets. Compared with f-VAEGAN-D2, VSD-GZSL exhibited improved harmonic means by 6.9%, 4.5%, 6.2%, and 6.7% on the four datasets. Compared with CycleCLSWGAN, VSD-GZSL exhibited improved harmonic means by 5.1%, 8.1%, and 6.2% on the CUB, SUN, and FLO datasets, respectively. Compared with feature refinement (FREE), VSD-GZSL exhibited improved harmonic means by 3.3%, 0.4%, and 5.8% on the AwA2, CUB, and SUN datasets, respectively. The experimental results show that the proposed method achieves excellent results, demonstrating its effectiveness. Conclusion The proposed VSD-GZSL method demonstrates its superiority over traditional models. Our method can disentangle the semantically consistent features in the visual features and the feature-related information in the semantics, and a final classifier is then learned from these two mutually consistent decomposed components. Compared with several related methods, VSD-GZSL achieves a remarkable performance improvement on multiple datasets.
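A minimal sketch of the visual-feature disentangling idea described above follows: it splits a feature vector into semantic-consistent and semantic-irrelevant components and reconstructs the input to avoid information loss. The layer sizes and the omission of the total correlation penalty and relation network are simplifications; the class name `FeatureDisentangler` is hypothetical.

```python
# Illustrative sketch only; dimensions and architecture are assumptions.
import torch
import torch.nn as nn

class FeatureDisentangler(nn.Module):
    def __init__(self, feat_dim: int = 2048, latent_dim: int = 512):
        super().__init__()
        self.enc_consistent = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
        self.enc_irrelevant = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(2 * latent_dim, feat_dim)

    def forward(self, x):
        zc = self.enc_consistent(x)                      # semantically consistent component
        zi = self.enc_irrelevant(x)                      # semantically irrelevant component
        recon = self.decoder(torch.cat([zc, zi], dim=-1))
        return zc, zi, recon

# Toy usage with stand-in ResNet101-sized features.
model = FeatureDisentangler()
feats = torch.randn(4, 2048)
zc, zi, recon = model(feats)
recon_loss = nn.functional.mse_loss(recon, feats)        # discourages information loss
```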
Medical Image Processing
- Vessel segmentation of OCTA images based on latent vector alignment and Swin Transformer Xu Cong, Hao Huaying, Wang Yang, Ma Yuhui, Yan Qifeng, Chen Bang, Ma Shaodong, Wang Xiaogui, Zhao Yitian doi:10.11834/jig.220482
20-09-2023
Abstract: Objective Optical coherence tomography angiography (OCTA) is a noninvasive, emerging technique that has been increasingly used to image the retinal vasculature at capillary-level resolution. OCTA technology can demonstrate the microvascular information around the macula and has remarkable advantages in retinal vascular imaging. Fundus fluorescence angiography can also visualize the retinal vascular system, including capillaries. However, that technique requires the intravenous injection of a contrast agent; the process is relatively time-consuming and may have serious side effects. In clinical practice, doctors can examine different layers of vascular structures through OCTA images and analyze changes in vascular structures to determine the presence of related diseases. In particular, any abnormality in the microvasculature distributed in the macula often indicates the presence of diseases such as early-stage glaucomatous optic neuropathy, diabetic retinopathy, and age-related macular degeneration. Therefore, the automatic segmentation and extraction of retinal vascular structures in OCTA are vital for the quantitative analysis and clinical decision-making of many ocular diseases. However, the OCTA imaging process usually produces images with a low signal-to-noise ratio, posing a great challenge for the automatic segmentation of vascular structures. Moreover, variations in vessel appearance, motion and shadowing artifacts in different depth layers, and underlying pathological structures remarkably increase the difficulty of accurately segmenting retinal vessels. Therefore, this study proposes a novel segmentation method for retinal vascular structures that fuses latent vector alignment and the Swin Transformer to achieve accurate segmentation of vascular structures. Method In this study, the ResU-Net network is used as the base network (the encoder and decoder layers consist of residual blocks and pooling layers), and the Swin Transformer is introduced into ResU-Net to form a new encoder structure. The encoding step of the feature encoder consists of four stages. Each stage comprises two parts: a Transformer layer consisting of several stacked Swin Transformer blocks and a residual structure. The Swin Transformer encoder can acquire rich feature information, and the feature maps output from each Swin Transformer layer are combined with the upsampled feature maps in the decoder via skip connections. A feature alignment loss function based on latent vectors is also designed in this study. This feature alignment loss differs from the classical pixel-level loss functions: it can optimize segmentation results in terms of feature dimensions, enhance the encoder's ability to extract the structural features of OCTA image vessels, and optimize the network at the latent-space level by constraining the consistency of labels and images in the latent space, thereby improving segmentation performance. Result Experimental results on three OCTA datasets (including two public datasets and one private dataset) show that our method is ahead of other comparative methods and has the best overall segmentation performance. In particular, the area under the curve (AUC) of this method reaches 94.15%, 94.87%, and 97.63%, and the accuracy (ACC) reaches 91.57%, 90.03%, and 91.06%, respectively. Compared with the classical medical image segmentation network U-Net, the proposed method improves the AUC, Kappa, false discovery rate (FDR), and Dice by approximately 4.06%, 10.18%, 23.16%, and 7.87%, respectively, on the OCTA-O dataset. In addition, ablation experiments are conducted to verify the validity of each component of the proposed model. The results show that each component plays a positive role. Conclusion An end-to-end vascular segmentation network is proposed in this study to address the challenges of complex retinal vascular structures and low overall image contrast present in OCTA. In this study, ResU-Net is used as the backbone network to mitigate the interference of scattering noise and artifacts on segmentation through multi-fusion image input. Moreover, the Swin Transformer module is used as the encoding structure to obtain rich features. A novel latent vector alignment loss function that can optimize the network at the latent-space level is also designed in this study. Thus, the gap between the segmentation results and the labels is reduced, and the segmentation performance is improved. The experimental results demonstrate that the proposed method achieves the best segmentation performance on all three OCTA datasets and outperforms other comparative methods.
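The latent vector alignment idea can be illustrated roughly as follows: encode both the image and its label mask and penalize dissimilar latent codes, in addition to the usual pixel-wise loss. The shared encoder, the cosine distance, and the function name `latent_alignment_loss` are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch: a latent alignment term that complements a pixel-wise loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def latent_alignment_loss(encoder: nn.Module,
                          image: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Encode the image and its label mask, then penalize dissimilar latent vectors."""
    z_img = encoder(image).flatten(1)                    # (N, D) latent code of the image
    with torch.no_grad():                                # treat the label code as a fixed target
        z_lbl = encoder(mask.expand_as(image)).flatten(1)
    return 1.0 - F.cosine_similarity(z_img, z_lbl, dim=1).mean()

# Toy usage with a stand-in encoder.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1))
img = torch.randn(2, 3, 64, 64)
msk = torch.randint(0, 2, (2, 1, 64, 64)).float()
loss = latent_alignment_loss(encoder, img, msk)
```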
Remote Sensing Image Processing
- Key sub-region feature fusion network for fine-grained ship detection and recognition in remote sensing images Zhang Lei, Chen Wen, Wang Yuehuan doi:10.11834/jig.220671
20-09-2023
Abstract: Objective The ocean has great economic and military value. With the development of human society, ocean activities have an increasing impact on the development of a country. The sea is an important carrier of marine activities. Thus, the recognition and monitoring of ship targets in key sea areas through remote sensing images are crucial to national defense and economic development. Fine-grained ship detection and recognition in high-resolution remote sensing images refer to the identification of specific types of ships on the basis of ship detection. A precise and detailed classification is valuable in practical application fields, such as sea surveillance and intelligence gathering. Instead of coarse-grained categories, such as warships and merchant ships, specific ship types, such as the Arleigh Burke-class destroyer, Nimitz-class aircraft carrier, container ship, and car carrier, are necessary. However, the overall color, shape, and texture of different types of ship targets are similar; ships of different types but similar uses have similar structures, and the coating color of military ships is monotonous. These characteristics complicate the classification of these targets. The existing ship detectors are designed to focus on locating targets. The design of the classification branch of these detectors is relatively simple; they only use the features of whole targets for classification, which significantly decreases performance on fine-grained labeled datasets. The existing ship classification methods, which mainly classify targets on pre-cropped image patches, are separated from the detection process. This approach is unsatisfactory for practical applications for two reasons: 1) the whole backbone of these neural-network-based methods must be executed on every proposal to extract features, and remote sensing images of a harbor usually include several ships, so the computational cost increases sharply; 2) the detection and classification networks are optimized separately, and even if the parameters of both networks are individually optimized, the whole process cannot reach the optimal solution because the locations of the proposals obtained by the detection method differ from the pre-cropped image patches. Therefore, we utilize prior knowledge of ships and propose the key sub-region feature fusion network (KSFFN), which fuses the features of discriminative sub-regions with the whole-target feature and combines detection and fine-grained recognition into one framework. Method KSFFN uses ResNet-50 as the backbone network to extract features and constructs a proposal locating network by combining Faster R-CNN with the region of interest (ROI) Transformer to obtain proposal locations. Then, all proposals are ranked according to the probability of being targets, and the proposals with low probability are filtered out. Next, the multi-level feature fusion recognition network (MLFFRN) is proposed to extract features from the proposals generated by the proposal locating network and to classify them. First, each proposal is separated into several sub-regions along its axis, and the overall features and sub-region features are extracted from different levels of the feature pyramid. Then, the self-supervision mechanism in the navigator-teacher-scrutinizer network (NTS-Net) finds the key sub-region that may contain parts important for fine-grained recognition. Because of the limitations of image quality and target characteristics, not all targets have a highly discriminative sub-region, and the self-supervision mechanism in NTS-Net cannot reflect this. Therefore, the information from all sub-regions in the proposal is utilized to calculate the discriminant significance of each sub-region, which reflects its influence on target recognition. Based on the discriminant significance, the weight of each sub-region is calculated, and the key sub-region features are fused with the overall features according to these weights. The combined feature is used to obtain the final classification result, thereby improving the accuracy of fine-grained recognition of ship targets. Result The public high resolution ship collection 2016 (HRSC2016) dataset L3 task and the self-built fine-grained ships in aerial images dataset (FGSAID) are used to evaluate the model. The HRSC2016 dataset contains 1 061 images with 2 886 ships divided into 19 types. The FGSAID dataset contains 1 690 images with 5 410 ships divided into 45 types. The average precision (AP) is used as the evaluation metric, and the intersection over union threshold is set to 0.5 to determine whether a prediction box matches the ground truth. On the HRSC2016 dataset L3 task, the proposed method achieves an AP of 77.3%, and MLFFRN improves the AP by 6.3%. On the FGSAID dataset, our method achieves an AP of 71.5%. A series of ablation experiments is conducted on the HRSC2016 dataset L3 task to show the effectiveness of the different parts of the proposed method. In addition, the proposed method is compared with state-of-the-art deep-learning-based ship detection frameworks on the two datasets. The experimental results show that our model outperforms all other methods on both datasets. Compared with the single-shot alignment network (S2ANet), the proposed method increases the AP by 7.8% and 8.9% on HRSC2016 and FGSAID, respectively. In particular, the AP of the proposed method increases by 16.7%, 11.1%, and 1.1% for aircraft carrier/amphibious assault ships, other warships, and merchant ships, respectively, on the FGSAID dataset. Conclusion In this study, the end-to-end fine-grained ship detection and recognition network KSFFN is proposed. It extracts the overall features and sub-region features of the proposals and fuses them according to their discriminant significance. The proposed method combines detection and fine-grained recognition into one framework, thereby greatly improving the processing speed while performing excellently. Thus, KSFFN has great application value. The proposed method has a more powerful classification framework and can achieve more accurate results than the existing detection methods. The experimental results show that our method outperforms several state-of-the-art deep-learning-based ship detection frameworks, proving the effectiveness of KSFFN.
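The discriminant-significance weighting and fusion step can be sketched roughly as follows in PyTorch; the linear scoring head, the concatenation-based fusion, and the class name `SubRegionFusion` are illustrative assumptions rather than the exact KSFFN design.

```python
# Illustrative sketch: weight sub-region features and fuse them with the whole-target feature.
import torch
import torch.nn as nn

class SubRegionFusion(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)              # significance score per sub-region
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, whole_feat: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # whole_feat: (N, D) feature of the whole proposal; region_feats: (N, R, D).
        weights = self.score(region_feats).softmax(dim=1)       # (N, R, 1)
        key_feat = (weights * region_feats).sum(dim=1)          # weighted sum of sub-regions
        return self.fuse(torch.cat([whole_feat, key_feat], dim=-1))

# Toy usage: fuse 4 axis-wise sub-regions with the overall feature before classification.
fusion = SubRegionFusion()
fused = fusion(torch.randn(2, 256), torch.randn(2, 4, 256))
```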
- Interferometric phase denoising combining global context and fused attention Zeng Qingwang, Dong Zhangyu, Yang Xuezhi, Chong Fating doi:10.11834/jig.220562
20-09-2023
Abstract: Objective Interferometric phase noise is introduced by three types of inherent factors: 1) system noise, such as thermal noise and synthetic aperture radar (SAR) speckle noise; 2) decoherence problems, including baseline, temporal, and spatial decoherence; and 3) signal processing errors, such as misregistration. The existence of noise increases the difficulty of phase unwrapping and can even cause the process to fail, thereby seriously interfering with the final interferometric result. Therefore, interferometric phase denoising is a key link in interferometric SAR (InSAR) technology, and its effect has an important influence on the accuracy of the measurement results. The existing interferometric phase denoising algorithms still have many defects. First is an insufficient ability to capture global contextual information. Some algorithms ignore global context information or focus only on local context derived from a few pixels; this lack of global context manifests as unstable detail preservation in the denoising results. Second, many researchers pay attention only to the influence of either the spatial dimension or the channel dimension of the image on the denoising result when improving denoising networks, but they do not use the spatial and channel dimensions in combination. Third, the high-level features extracted from the deep layers of a convolutional neural network have rich semantic information but ambiguous spatial details, whereas the low-level features extracted from the shallow layers of the network contain considerable pixel-level noise information; these features are isolated from one another and thus cannot be fully exploited. Method Most of the existing interferometric phase denoising methods focus on local features and have many limitations in feature extraction. A phase denoising network called GCFA-PDNet is proposed to solve these problems while balancing the relationship between denoising and structure preservation. The proposed phase denoising network combines global context and fused attention. The method separates the interferometric phase into its real and imaginary parts and inputs them into the network. First, shallow features are extracted from the noisy phase. Then, they are mapped to the feature enhancement module, which is composed of the global context extraction module and the fused attention module. The shallow features extracted by the network are concurrently fused with the deep features. Finally, a denoised image is generated through global residual learning. Four global context extraction modules and four fused attention modules are used in the whole network. The core of the global context extraction module is the global context block, which can extract global context information and has the advantages of nonlocal methods. The fused attention module fuses the features extracted by its two submodules, the channel attention block and the spatial attention block; it emphasizes key features and efficiently extracts noise information hidden in complex backgrounds. Result We present the experimental results of six methods: Goldstein, the nonlocal interferogram estimator (NL-InSAR), InSAR block-matching 3D (InSAR-BM3D), a deep learning framework for SAR interferometric phase restoration and coherence estimation (DeepInSAR), a phase filtering network, and GCFA-PDNet. Different evaluation indicators are selected for different datasets to evaluate the advantages and disadvantages of the various algorithms objectively. For the experiments on simulated images, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are selected as evaluation indicators. A large PSNR indicates a small difference between the filtered phase and the clean phase; however, PSNR does not consider the correlation between pixels in the image. Therefore, SSIM is also employed to evaluate the overall quality of the denoised images. Compared with the comparative methods, the proposed method improves the average PSNR and SSIM indicators of the simulated data results by 5.72% and 2.94%, respectively. For the real interferometric phase, the above two indicators cannot be used to evaluate the denoising performance because no noise-free image is available as a reference. The number of residues (NOR) and the phase standard deviation (PSD) can instead be used as objective evaluation indicators of the denoising performance on real interferometric phase images. NOR reflects the ability of a filtering method to suppress noise: the smaller the NOR of the filtered interferometric phase is, the stronger the noise suppression ability is. PSD measures the dispersion of the noise distribution: the smaller the PSD value is, the more concentrated the noise distribution is and the better the quality of the interferogram is. On the real data, compared with the comparative methods, the proposed method improves the average residual-point reduction percentage and the PSD indicator by 2.01% and 3.57%, respectively. Visual observation of the experimental results of the various algorithms shows that the proposed method achieves the best denoising results. The qualitative and quantitative analyses also indicate that the proposed method outperforms the five other types of phase denoising methods. Conclusion The phase denoising network designed in this study combines global context and fused attention. Thus, it has certain advantages over other related algorithms and a more powerful feature extraction ability than other methods. The network focuses on global context information and emphasizes key features. Thus, it can preserve the original phase details while enhancing the denoising ability, thereby achieving the best denoising results.
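The input representation and global residual learning described in the Method section can be sketched as follows: the wrapped phase is split into cosine (real) and sine (imaginary) channels, passed through a small convolutional body, and combined with a global residual before being converted back to a wrapped phase. The tiny three-layer body and the class name `TinyPhaseDenoiser` are placeholders, not GCFA-PDNet itself.

```python
# Illustrative sketch of real/imaginary phase input with global residual learning.
import math
import torch
import torch.nn as nn

class TinyPhaseDenoiser(nn.Module):
    def __init__(self, width: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 2, 3, padding=1),
        )

    def forward(self, noisy_phase: torch.Tensor) -> torch.Tensor:
        # noisy_phase: (N, 1, H, W), wrapped phase in radians.
        x = torch.cat([torch.cos(noisy_phase), torch.sin(noisy_phase)], dim=1)
        x = x + self.body(x)                              # global residual learning
        return torch.atan2(x[:, 1:2], x[:, 0:1])          # back to a wrapped phase

# Toy usage on a random wrapped-phase patch in [-pi, pi).
net = TinyPhaseDenoiser()
denoised = net(torch.rand(1, 1, 128, 128) * 2 * math.pi - math.pi)
```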