Coarse-to-fine multiscale defocus blur detection

Heng Hongjun, Ye Hebin, Zhou Mo, Huang Rui (College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)

Abstract
Objective  Defocus blur detection aims to distinguish sharp pixels from blurred pixels in an image. It is widely used in many fields and is an important research direction in computer vision. When the input image contains complex scenes, existing defocus blur detection methods suffer from limited accuracy and incomplete detection boundaries. This paper proposes a coarse-to-fine multiscale defocus blur detection network that improves detection accuracy by fusing the multilayer convolutional features of images at different scales.

Method  The image is rescaled to different scales; a convolutional neural network extracts multilayer convolutional features from the image at each scale, and convolutional layers fuse the features of corresponding layers across scales. Convolutional long short-term memory (Conv-LSTM) layers then integrate the blur features of the different scales from top to bottom while generating blur detection maps at the corresponding scales, gradually passing deep semantic information to the shallow layers. During this process, deep and shallow features are combined, and the shallow features are used to refine the blur detection results of the deeper layer. A convolutional layer finally fuses the multiscale detection results into the final result. A multilayer supervision strategy is used during training to ensure that every Conv-LSTM layer reaches its optimum.

Result  We train and test on two public blur detection datasets, DUT (Dalian University of Technology) and CUHK (The Chinese University of Hong Kong), comparing against 10 algorithms, including the current best blur detection algorithms BTBCRL (bottom-top-bottom network with cascaded defocus blur detection map residual learning), DeFusionNet (defocus blur detection network via recurrently fusing and refining multi-scale deep features), and DHDE (multi-scale deep and hand-crafted features for defocus estimation). Experimental results show that on the DUT dataset, our model reduces the MAE (mean absolute error) by 38.8% and improves F0.3 by 5.4% relative to DeFusionNet; on the CUHK dataset, it reduces the MAE by 36.7% and improves F0.3 by 9.7% relative to the LBP (local binary pattern) algorithm. These comparisons fully verify the effectiveness of the proposed defocus blur detection model.

Conclusion  The proposed coarse-to-fine multiscale defocus blur detection method fuses the features of images at different scales and uses Conv-LSTM layers to integrate deep semantic information with shallow detail information from top to bottom, enabling the model to obtain more accurate defocus blur detection results across different image scenes.
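A minimal Keras (TensorFlow 2) sketch of the top-down Conv-LSTM refinement described in the Method above. The 32-channel width, the single-time-step use of ConvLSTM2D, and the assumption that consecutive feature levels differ by a factor of two in resolution are illustrative simplifications, not the authors' exact design:

```python
# A sketch of the top-down refinement: one Conv-LSTM step per feature level
# turns the deep, coarse prediction into progressively finer blur maps.
import tensorflow as tf
from tensorflow.keras import layers

def refine_top_down(feats):
    """feats: fused feature maps ordered deep (coarse) -> shallow (fine);
    consecutive levels are assumed to double the spatial resolution."""
    blur_maps, prev = [], None
    for f in feats:
        if prev is not None:
            # upsample the coarser prediction and join it with the finer
            # features, so shallow detail can sharpen deep semantics
            prev = layers.UpSampling2D(2, interpolation="bilinear")(prev)
            f = layers.Concatenate()([f, prev])
        x = layers.Lambda(lambda t: tf.expand_dims(t, 1))(f)  # add a time axis
        x = layers.ConvLSTM2D(32, 3, padding="same")(x)       # assumed width
        pred = layers.Conv2D(1, 1, activation="sigmoid")(x)   # per-level blur map
        blur_maps.append(pred)
        prev = pred
    return blur_maps  # one supervised prediction per level, coarse to fine
```

Returning every per-level prediction matches the multilayer supervision strategy: each intermediate map can receive its own loss signal rather than only the final one.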
Coarse-to-fine multiscale defocus blur detection

Heng Hongjun, Ye Hebin, Zhou Mo, Huang Rui (College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)

Abstract
Objective  Defocus blur detection (DBD) is devoted to distinguishing sharp pixels from blurred ones. It has wide applications and is an important problem in computer vision: DBD results feed many downstream tasks, such as deblurring, blur magnification, and salient object detection. According to the adopted image features, DBD methods can be broadly divided into two categories: traditional methods based on hand-crafted features and methods based on deep features. The former use low-level blur features, such as gradient, frequency, singular value, and local binary pattern, together with simple classifiers, to separate sharp image regions from blurred ones. These low-level features are extracted from image patches, which loses high-level semantic information. Although traditional DBD methods do not need many training exemplars, they perform unsatisfactorily on images with complex scenes, especially in homogeneous and dark regions. More recent DBD methods learn representations of blur and sharpness from a large volume of images to extract task-adaptive features, and blur predictions can be generated by an end-to-end convolutional neural network (CNN), which is more efficient than traditional DBD methods. Owing to its hierarchical, nonlinear composition of convolutional, rectified linear unit, and pooling layers, a CNN can extract multiscale convolutional features that are useful for many vision problems. Generally, bottom layers extract low-level texture features that improve the details of the detection results, whereas top layers extract high-level semantic features that help suppress noise and background clutter. Most existing methods therefore integrate multiscale low-level texture features and high-level semantic features to generate robust defocus blur results. Although existing deep DBD methods achieve better blur detection results than hand-crafted feature-based methods, they still suffer from scale ambiguity and incomplete detection boundaries when processing images with complex scenes. In this paper, we propose a novel DBD framework that extracts multiscale convolutional features from images at different scales, uses four branches of a multiscale result refinement subnetwork to generate blur results at different feature scales, and finally applies a multiscale result fusion layer to produce the final blur result.

Method  The proposed network consists of three parts: a multiscale feature extraction subnetwork (FEN), a multiscale result refinement subnetwork (RRN), and a multiscale result fusion layer (RFL). We use the visual geometry group network (VGG16) as our basic feature extractor, removing its fully connected layers and last pooling layer to increase feature resolution. FEN consists of three basic feature extractors and a feature integration branch that integrates same-layer convolutional features extracted from differently scaled images. RRN is built from five convolutional long short-term memory (Conv-LSTM) layers and generates multiscale blur estimates from the multiscale convolutional features. RFL consists of two convolutional layers with filter sizes of 3×3×32 and 1×1×1. We first resize the input image with different ratios and extract multiscale convolutional features from each resized image with FEN; FEN also integrates the features of corresponding layers to exploit the merits of features extracted from different images.
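As a concrete illustration of the multiscale extraction and integration just described, the following Keras sketch applies a shared VGG16 backbone (fully connected layers dropped; features read before the last pooling layer) to several rescaled copies of the image and fuses same-level features. The resize ratios, the 64-channel integration width, and the chosen feature layers are assumptions for illustration:

```python
# A sketch of FEN: one shared backbone over multiple rescaled inputs, with
# 1x1 convolutions integrating same-level features across scales.
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

LEVELS = ["block1_conv2", "block2_conv2", "block3_conv3",
          "block4_conv3", "block5_conv3"]   # five feature levels; reading
                                            # block5_conv3 skips the last pooling
SCALES = (1.0, 0.8, 0.6)                    # assumed resize ratios

def build_fen(size=320):
    # shared VGG16 backbone without the fully connected layers
    vgg = VGG16(include_top=False, weights="imagenet")
    extractor = Model(vgg.input, [vgg.get_layer(n).output for n in LEVELS])

    image = layers.Input((size, size, 3))
    scale_feats = [extractor(layers.Resizing(int(size * s), int(size * s))(image))
                   for s in SCALES]

    integrated = []
    for lvl in range(len(LEVELS)):
        ref = scale_feats[0][lvl].shape[1]  # align to the full-scale resolution
        same_lvl = [layers.Resizing(ref, ref)(f[lvl]) for f in scale_feats]
        x = layers.Concatenate()(same_lvl)
        integrated.append(layers.Conv2D(64, 1, activation="relu")(x))  # assumed width
    return Model(image, integrated)
```

Sharing one backbone across all scales keeps the parameter count at a single VGG16 while still exposing the network to the same content at several resolutions.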
Then, we feed the highest convolutional features of each FEN branch into RRN to produce coarse blur maps. These coarse maps are robust to noise and background clutter because the highest layers extract semantic features; however, their resolution is low, so they only provide coarse guidance for fine-scale blur estimation. We therefore gradually incorporate the higher-resolution features of lower layers into the Conv-LSTMs to generate more precise blur maps; in each RRN branch, the Conv-LSTMs integrate the multiscale convolutional features from top to bottom. RFL is responsible for fusing the blur maps generated by the four branches: we concatenate the last prediction map of each RRN branch with the integrated first-layer features from FEN as the input of RFL, because shallow-layer features contain abundant structural detail that improves the DBD result. We use a combination of F-measure, precision, recall, mean absolute error (MAE), and cross-entropy as the loss function for network pretraining and training, and we add a supervision signal at each prediction layer, which passes the gradient directly to the corresponding layers and eases network optimization. We randomly select 2 000 images from the Berkeley segmentation dataset, the uncompressed color image database, and Pascal2008 to synthesize blurred images for pretraining. The real training set consists of 1 204 images selected from the Dalian University of Technology (DUT) and The Chinese University of Hong Kong (CUHK) datasets. We augment the real training images by rotation, flipping, and cropping, enlarging the training set 15-fold; this greatly improves network performance. Our network is implemented in Keras. We resize the input images and ground truths to 320×320 pixels, use the adaptive moment estimation (Adam) optimizer, set the learning rate to 1×10⁻⁵, and divide it by 10 every five epochs until it reaches 1×10⁻⁸. We initialize FEN with VGG16 weights trained on ImageNet and the remaining layers with Xavier uniform initialization (a sketch of this recipe follows the abstract). Pretraining and training are conducted on an Nvidia RTX 2080Ti; the whole training takes approximately one day.

Result  We train and test our network on two public blur detection datasets, DUT and CUHK, and compare our method with 10 state-of-the-art DBD methods. On the DUT dataset, our method achieves a 38.8% relative MAE reduction and a 5.4% relative F0.3 improvement over DeFusionNet (DBD network via recurrently fusing and refining multi-scale deep features); it is the only method on this dataset whose F0.3 exceeds 0.87 and whose MAE is below 0.1. On the CUHK dataset, our method achieves a 36.7% relative MAE reduction and a 9.7% relative F0.3 improvement over the local binary pattern (LBP) method. The proposed method performs well in several challenging cases, including homogeneous regions and background clutter, and its detections are more precise at the boundaries. We also conduct ablation studies to verify the effectiveness of our model.

Conclusion  We propose a coarse-to-fine multiscale DBD method that extracts multiscale convolutional features from images resized with different ratios and generates multiscale blur estimates with Conv-LSTMs. The Conv-LSTMs integrate the semantic information of deep layers with the detail information of shallow layers to refine the blur maps, and the final blur map is produced by integrating the blur maps generated from differently sized images with the fused low-level features. Compared with other DBD methods, ours generates more precise DBD results across various image scenes.
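The training recipe stated in the Method can be sketched as follows. Only the Adam optimizer, the learning-rate schedule (1×10⁻⁵, divided by 10 every five epochs, floored at 1×10⁻⁸), and the per-prediction supervision come from the text; the equal weighting of the loss terms and the smoothing constant are assumptions. F0.3 denotes the F-measure F_β = (1+β²)·P·R / (β²·P + R) with β² = 0.3, as is common in this literature:

```python
# A sketch of the stated training recipe; loss-term weights are assumptions.
import tensorflow as tf

def blur_loss(y_true, y_pred, beta2=0.3, eps=1e-7):
    """Cross-entropy + MAE + an F-measure term built from precision and
    recall: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), beta^2 = 0.3."""
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    mae = tf.reduce_mean(tf.abs(y_true - y_pred))
    tp = tf.reduce_sum(y_true * y_pred)
    precision = tp / (tf.reduce_sum(y_pred) + eps)
    recall = tp / (tf.reduce_sum(y_true) + eps)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return bce + mae + (1.0 - f_beta)  # assumed equal weighting of the terms

def lr_schedule(epoch, lr=None, base=1e-5):
    """Start at 1e-5, divide by 10 every five epochs, floor at 1e-8."""
    return max(base / (10 ** (epoch // 5)), 1e-8)

# Supervision at every prediction layer: with one model output per Conv-LSTM
# stage plus the fused map, Keras applies the loss to each output separately.
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
#               loss=[blur_loss] * num_outputs)
# model.fit(..., callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```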