Document image layout analysis based on multi-feature fusion

Ying Zilu, Zhao Yihong, Xuan Chen, Deng Wenbo (School of Intelligent Manufacturing, Wuyi University, Jiangmen 529020, China)

Abstract
Objective In document image layout analysis, mainstream deep learning methods overcome the drawbacks of traditional methods and can locate and classify document layout regions at the same time, but most of them require complex preprocessing and have complicated model structures. In addition, the shortage of document image data prevents layout analysis from achieving good performance with general-purpose deep learning models. To address these problems, a deep learning method based on a multi-feature fusion convolutional neural network is proposed.

Method First, convolution kernels of different sizes extract features from the input image in parallel, and the convolved feature maps are then fused, forming the feature fusion module. Next, the serial-parallel spatial pyramid pooling strategy of DeepLabV3 is adopted, and image-level features are added to further refine the extracted feature maps. Finally, the image is restored by bilinear interpolation, completing the localization and recognition of the document layout targets, namely figures, tables, and formulas.

Result Two metrics, mean intersection over union (mIOU) and pixel accuracy (PA), are used as evaluation criteria; a short code sketch of both follows this abstract. Experiments on the ICDAR 2017 POD document layout object detection dataset show that the proposed algorithm reaches 87.26% mIOU and 98.10% PA, improvements of about 14.66% and 2.22% over fully convolutional networks (FCN), and the proposed feature fusion module alone contributes gains of 1.45% in mIOU and 0.22% in PA.

Conclusion The proposed algorithm locates and recognizes multiple document layout targets within a single network framework, requires no complex image preprocessing during training, and has a simple model structure. The experimental data show that it achieves good recognition results with limited training data, outperforming the FCN and DeepLabV3 methods.
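For readers unfamiliar with the two evaluation criteria, the following is a minimal NumPy sketch of how mIOU and PA are computed from a per-pixel confusion matrix over the four classes (background, figure, table, formula). It illustrates the standard definitions only and is not the authors' evaluation code.

import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=4):
    # Accumulate a num_classes x num_classes per-pixel confusion matrix;
    # rows index the ground-truth class, columns the predicted class.
    mask = (y_true >= 0) & (y_true < num_classes)
    idx = num_classes * y_true[mask].astype(int) + y_pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    # PA: correctly classified pixels over all pixels.
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    # mIOU: per-class intersection over union, averaged over classes.
    # A class absent from both prediction and ground truth counts as 0 here.
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return np.mean(intersection / np.maximum(union, 1))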
Layout analysis of document images based on multifeature fusion

Ying Zilu, Zhao Yihong, Xuan Chen, Deng Wenbo (School of Intelligent Manufacturing, Wuyi University, Jiangmen 529020, China)

Abstract
Objective Document image layout analysis aims to segment a page into regions according to its content and to identify the different regions quickly. Because each type of region is handled differently, a separate strategy must be developed for each layout object, so the layout must be analyzed first to enable subsequent processing. Traditional layout analysis methods are generally based on complex hand-crafted rules. Their locate-then-classify pipeline cannot achieve regional positioning and classification of the document layout simultaneously, and different document images need their own specific strategies, which limits versatility. Compared with the feature representations of traditional methods, deep learning models have powerful representation and modeling capabilities and adapt better to complex target detection tasks. Proposal-based networks, such as the faster region-based convolutional neural network (Faster R-CNN) and the region-based fully convolutional network (R-FCN), and proposal-free networks, such as the single shot multibox detector (SSD) and you only look once (YOLO), are representative object-level detection networks. Pixel-level detection networks, such as fully convolutional networks (FCN) and the DeepLab series, have allowed deep learning to make further breakthroughs in target detection tasks, and both object-level and pixel-level techniques have been applied to document layout analysis. However, most current deep-learning-based methods require complex preprocessing, such as color coding, image binarization, and simple hand-crafted rules, which makes the model structure complex. The complicated preprocessing also discards considerable information from the document image, which degrades recognition accuracy. In addition, common deep learning models are difficult to apply to small datasets. To address these problems, this paper proposes a deep learning method based on a multi-feature fusion convolutional neural network.

Method First, features are extracted from the input image by convolutional layers composed of kernels of different sizes. The parallel extraction stack has three layers containing 3, 4, and 3 parallel convolution kernels, respectively. The first layer uses large-scale kernels of 11×11, 9×9, and 7×7 to enlarge the receptive field and retain more feature information. The second layer uses four kernels of 7×7, 5×5, 3×3, and 1×1 to increase feature extraction while still ensuring coarse extraction. The third layer consists of three kernels of 5×5, 3×3, and 1×1 to extract finer detail. The feature fusion module is a convolutional layer with 1×1 kernels that merges the parallel feature maps, after which another convolutional layer extracts features from the fused maps again; a minimal sketch of these stages follows this paragraph.
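As a rough illustration, the following tf.keras sketch wires up the three parallel extraction layers and the 1×1 fusion described above. The abstract does not specify filter counts, strides, or activations, so those values are assumptions; concatenation followed by a 1×1 convolution is one plausible reading of the fusion step, and element-wise addition of the branches would be another.

import tensorflow as tf
from tensorflow.keras import layers

def fusion_stage(x, kernel_sizes, filters=64):
    # Run parallel convolutions of different kernel sizes over the same
    # input, then fuse the concatenated maps with a 1x1 convolution.
    branches = [layers.Conv2D(filters, k, padding="same", activation="relu")(x)
                for k in kernel_sizes]
    x = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(x)

def build_extractor(input_shape=(513, 513, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = fusion_stage(inputs, [11, 9, 7])   # layer 1: large receptive field
    x = fusion_stage(x, [7, 5, 3, 1])      # layer 2: coarse extraction
    x = fusion_stage(x, [5, 3, 1])         # layer 3: fine detail
    return tf.keras.Model(inputs, x)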
The atrous spatial pyramid pooling (ASPP) strategy of DeepLabV3 is then adopted. ASPP consists of four parallel branches: a standard 1×1 convolution and three 3×3 atrous convolutions with dilation rates of 6, 12, and 18. When the effective size of the sampled kernel approaches the size of the feature map, the 3×3 atrous convolution loses its ability to capture whole-image information and degenerates into a 1×1 convolution; image-level features are therefore added. The role of ASPP is to expand the receptive field of the convolution kernels without losing resolution and to retain the information in the feature map to the utmost extent. Finally, the feature map is restored to the input resolution by bilinear interpolation, completing the document layout task: locating and identifying figures, tables, and formulas (a sketch of the ASPP block and the restoration step follows this abstract). The experimental environment is Ubuntu 18.04, with training under the TensorFlow framework on an NVIDIA 1080 GPU with 16 GB of memory. The experiments use the ICDAR 2017 POD document layout object detection dataset, with 1 600 training images and 812 test images. Input images are uniformly resized to 513×513 pixels during training to reduce the model training parameters.

Result Mean intersection over union (mIOU) and pixel accuracy (PA) are used as evaluation criteria. Experiments on the ICDAR 2017 POD document layout object detection dataset show that the proposed algorithm achieves 87.26% mIOU and 98.10% PA. Compared with fully convolutional networks, the proposed algorithm improves mIOU and PA by about 14.66% and 2.22%, respectively, and the proposed feature fusion module alone improves mIOU and PA by 1.45% and 0.22%, respectively.

Conclusion This paper performs the positioning and recognition of multiple document layout targets under a single network framework; it needs no complex preprocessing of the images and keeps the model structure simple. The experimental data show that the algorithm can efficiently identify the background, illustrations, tables, and formulas and achieves good recognition results with less training data.
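For concreteness, here is a minimal tf.keras sketch of the ASPP block with image-level features and the bilinear restoration step referenced in the Method section above. It follows the published DeepLabV3 design; channel counts, activations, and the wiring to the backbone are assumptions rather than the authors' exact configuration.

import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, filters=256, rates=(6, 12, 18)):
    # Four parallel branches: a standard 1x1 convolution and three 3x3
    # atrous convolutions with dilation rates 6, 12, and 18.
    size = tf.shape(x)[1:3]
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for r in rates:
        branches.append(layers.Conv2D(filters, 3, padding="same",
                                      dilation_rate=r, activation="relu")(x))
    # Image-level features: global average pool, 1x1 convolution, then
    # bilinear upsampling back to the feature-map size.
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(x)
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    branches.append(tf.image.resize(pooled, size, method="bilinear"))
    x = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, activation="relu")(x)

def segmentation_head(features, num_classes=4, output_size=(513, 513)):
    # Per-pixel logits for background, figure, table, and formula,
    # restored to the input resolution by bilinear interpolation.
    x = layers.Conv2D(num_classes, 1)(aspp(features))
    return tf.image.resize(x, output_size, method="bilinear")

# Hypothetical end-to-end wiring with the extractor sketched earlier:
#   extractor = build_extractor()
#   model = tf.keras.Model(extractor.input, segmentation_head(extractor.output))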
