多特征融合的文档图像版面分析
Layout analysis of document images based on multifeature fusion
2020, Vol. 25, No. 2, pp. 311-320
Received: 2019-05-15; Revised: 2019-07-22; Accepted: 2019-07-29; Print published: 2020-02-16
DOI: 10.11834/jig.190190
Objective
In document image layout analysis, mainstream deep learning methods overcome the drawbacks of traditional approaches and can locate and classify layout regions simultaneously, but most of them require complex preprocessing and have complicated model structures. In addition, the shortage of document image data prevents layout analysis from achieving good performance on general deep learning models. To address these problems, this paper proposes a deep learning method based on a multi-feature fusion convolutional neural network.
Method
First, convolution kernels of different sizes extract features from the input image in parallel, and the resulting feature maps are fused to form a feature fusion module. Then the cascaded and parallel spatial pyramid strategy of DeepLabV3 is adopted, with image-level features added to further refine the extracted feature maps. Finally, the image is restored by bilinear interpolation, completing the positioning and recognition of the layout objects, namely figures, tables, and formulas.
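The parallel multi-scale extraction and fusion described above can be sketched in a minimal single-channel NumPy illustration. This is not the paper's trained model: the random branch weights and the 1×1-style fusion weights are placeholders, and only the first layer's 11×11, 9×9, and 7×7 kernel sizes are taken from the text.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 2-D convolution with zero 'same' padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def fusion_block(x, kernel_sizes=(11, 9, 7)):
    """Convolve x with several kernel sizes in parallel, stack the branch
    outputs along a channel axis, and fuse them with 1x1-style weights."""
    rng = np.random.default_rng(0)
    branches = [conv2d_same(x, rng.standard_normal((k, k))) for k in kernel_sizes]
    stacked = np.stack(branches, axis=-1)       # H x W x num_branches
    w = rng.standard_normal(len(kernel_sizes))  # a 1x1 convolution over channels
    return stacked @ w                          # fused H x W feature map

x = np.arange(64, dtype=float).reshape(8, 8)
y = fusion_block(x)
print(y.shape)  # (8, 8): 'same' padding preserves spatial size
```

The large first-layer kernels enlarge the receptive field while the 'same' padding keeps the spatial resolution, so the branch outputs can be concatenated and fused without resizing.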
Result
mIOU (mean intersection over union) and PA (pixel accuracy) are used as evaluation criteria. Experiments on the ICDAR 2017 POD page object detection dataset show that the proposed algorithm reaches 87.26% mIOU and 98.10% PA. Compared with FCN (fully convolutional networks), the proposed algorithm improves mIOU and PA by about 14.66% and 2.22%, respectively, and the proposed feature fusion module alone contributes gains of 1.45% in mIOU and 0.22% in PA.
Conclusion
The proposed algorithm locates and recognizes multiple layout objects within a single network framework, needs no complex image preprocessing for training, and has a simple model structure. Experimental results show that it achieves good recognition performance with limited training data, outperforming the FCN and DeepLabV3 methods.
Objective
Document image layout analysis aims to segment the different regions of a page on the basis of their content and to identify them quickly. Different strategies must be developed for diverse layout objects owing to the varied handling each type of area requires. Therefore, the document image layout must first be analyzed to facilitate subsequent processing. The traditional method of document image layout analysis is generally based on complex rules. The first-position-then-classify approach cannot simultaneously achieve the regional positioning and classification of the document layout, and different document images need their own specific strategies, thereby limiting versatility. Compared with the feature representations of traditional methods, deep learning models have powerful representation and modeling capabilities and adapt well to complex target detection tasks. Proposal-based networks, such as the faster region-based convolutional neural network (Faster R-CNN) and the region-based fully convolutional network (R-FCN), and proposal-free networks, such as the single shot multibox detector (SSD), you only look once (YOLO), and other representative object-level detection networks, have been proposed. The application of pixel-level object detection networks, such as fully convolutional networks and the DeepLab series, has enabled deep learning to make breakthroughs in target detection tasks. Object detection techniques at the object and pixel levels have both been applied to document layout analysis. However, most current deep learning methods require complex preprocessing, such as color coding, image binarization, and simple rules, which makes the model structure complex. Moreover, the document image loses considerable information during the complicated preprocessing, which affects recognition accuracy. In addition, common deep learning models are difficult to apply to small datasets. To address these problems, this paper proposes a deep learning method based on multi-feature fusion convolutional neural networks.
Method
First, feature extraction is performed on the input image by convolution layers composed of kernels of different sizes. The parallel feature-extraction path has three layers, with 3, 4, and 3 convolution kernels, respectively. The first layer uses large-scale kernels of sizes 11×11, 9×9, and 7×7 to enlarge the receptive field and retain additional feature information. The second layer has four kernels of sizes 7×7, 5×5, 3×3, and 1×1 to increase feature extraction while ensuring coarse extraction. The third layer consists of three kernels of sizes 5×5, 3×3, and 1×1 to further extract detailed information. The feature fusion module consists of these convolutional layers and a 1×1 convolution kernel; after fusion, another convolutional layer extracts features again. The atrous spatial pyramid pooling (ASPP) strategy of DeepLabV3 is then adopted. ASPP consists of four parallel convolutions: a standard 1×1 kernel and 3×3 atrous kernels with dilation rates of 6, 12, and 18. When the effective size of the sampled kernel approaches the size of the feature map, the 3×3 atrous kernel loses the capability to capture full-image information and degenerates into a 1×1 kernel; thus, image-level features are added. The role of ASPP is to expand the receptive field of the convolution kernel without losing resolution and to retain the information of the feature map to the utmost extent. Finally, the image is restored by bilinear interpolation, completing the positioning and identification of the layout objects, namely figures, tables, and formulas. The experimental environment is an Ubuntu 18.04 system, trained with the TensorFlow framework on an NVIDIA 1080 GPU with 16 GB of memory. The data are the ICDAR 2017 POD page object detection dataset, with 1 600 training images and 812 test images. Input images are uniformly resized to 513×513 pixels during training to reduce the number of model training parameters.
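The degeneration of the 3×3 atrous kernel mentioned above can be made concrete with a small calculation: a k×k kernel with dilation rate r has an effective size of k + (k − 1)(r − 1), and once the dilated taps fall outside the feature map, only the center weight remains valid. A short sketch (the 8×8 feature-map size is an illustrative assumption, not a value from the paper):

```python
def effective_size(k, rate):
    """Effective receptive field of a k x k atrous kernel with dilation `rate`."""
    return k + (k - 1) * (rate - 1)

def valid_taps(k, rate, feat):
    """Count kernel taps that land inside a feat x feat map when the
    dilated kernel is centered on it (zero padding outside)."""
    center = feat // 2
    taps = 0
    for di in range(-(k // 2), k // 2 + 1):
        for dj in range(-(k // 2), k // 2 + 1):
            i, j = center + di * rate, center + dj * rate
            if 0 <= i < feat and 0 <= j < feat:
                taps += 1
    return taps

# ASPP rates from DeepLabV3: 3x3 kernels dilated by 6, 12, and 18
for rate in (6, 12, 18):
    print(rate, effective_size(3, rate))  # 6 13, 12 25, 18 37

# On a small feature map the dilated taps fall outside the map: only the
# center weight stays valid, so the 3x3 kernel behaves like a 1x1 kernel.
# This is why image-level (global pooling) features are added.
print(valid_taps(3, rate=18, feat=8))  # 1
```

With rate 18 the effective size (37) far exceeds an 8×8 map, which is exactly the degeneration case that motivates adding image-level features.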
Result
Mean intersection over union (mIOU) and pixel accuracy (PA) are used as evaluation criteria. Experiments on the ICDAR 2017 POD document layout object detection dataset show that the proposed algorithm achieves 87.26% mIOU and 98.10% PA. Compared with fully convolutional networks, the proposed algorithm improves mIOU and PA by 14.66% and 2.22%, respectively, and the proposed feature fusion module improves mIOU and PA by 1.45% and 0.22%, respectively.
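Both evaluation metrics can be computed from a per-class confusion matrix. A small NumPy sketch on toy labels (the label arrays are illustrative, not taken from the dataset):

```python
import numpy as np

def metrics(pred, gt, num_classes):
    """Pixel accuracy (PA) and mean IoU (mIOU) from flat label arrays."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for g, p in zip(gt.ravel(), pred.ravel()):
        cm[g, p] += 1                       # rows: ground truth, cols: prediction
    pa = np.trace(cm) / cm.sum()            # correctly labeled pixels / all pixels
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    iou = tp / np.maximum(union, 1)         # guard against empty classes
    return pa, iou.mean()

# Toy 4-class example: background, figure, table, formula
gt   = np.array([0, 0, 1, 1, 2, 2, 3, 3])
pred = np.array([0, 0, 1, 2, 2, 2, 3, 0])
pa, miou = metrics(pred, gt, num_classes=4)
print(round(pa, 3), round(miou, 3))  # → 0.75 0.583
```

PA counts only correct pixels and is dominated by the large background class, which is why it sits near 98% while mIOU, which penalizes both missed and spurious pixels per class, is lower.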
Conclusion
This paper achieves the positioning and recognition of multiple document layout objects within a single network framework. The method needs no complex preprocessing of the image, and the model structure is simple. The experimental data show that the algorithm can efficiently identify the background, figures, tables, and formulas and achieve improved recognition results with less training data.
Arif S and Shafait F. 2018. Table detection in document images using foreground and background features//Proceedings of 2018 Digital Image Computing: Techniques and Applications. Canberra, Australia: IEEE: 1-8 [DOI: 10.1109/DICTA.2018.8615795]
Barakat B K and El-Sana J. 2018. Binarization free layout analysis for Arabic historical documents using fully convolutional networks//Proceedings of the 2nd IEEE International Workshop on Arabic and Derived Script Analysis and Recognition. London: IEEE: 151-155 [DOI: 10.1109/ASAR.2018.8480333]
Chen L C, Papandreou G, Schroff F and Adam H. 2017. Rethinking atrous convolution for semantic image segmentation [EB/OL]. 2017-12-05 [2019-05-05]. https://arxiv.org/pdf/1706.05587.pdf
Clausner C, Antonacopoulos A and Pletschacher S. 2017. ICDAR2017 competition on recognition of documents with complex layouts-RDCL2017//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 1404-1410 [DOI: 10.1109/ICDAR.2017.229]
Dai J F, Li Y, He K M and Sun J. 2016. R-FCN: object detection via region-based fully convolutional networks [EB/OL]. 2016-05-20 [2019-05-05]. https://arxiv.org/pdf/1605.06409.pdf
Eskenazi S, Gomez-Krämer P and Ogier J M. 2017. A comprehensive survey of mostly textual document segmentation algorithms since 2008. Pattern Recognition, 64: 1-14 [DOI: 10.1016/j.patcog.2016.10.023]
Fink M, Layer T, Mackenbrock G and Sprinzl G. 2018. Baseline detection in historical documents using convolutional U-Nets//Proceedings of the 13th IAPR International Workshop on Document Analysis Systems. Vienna: IEEE: 37-42 [DOI: 10.1109/DAS.2018.34]
Gao L C, Yi X H, Jiang Z R, Hao L P and Tang Z. 2017a. ICDAR2017 competition on page object detection//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 1417-1422 [DOI: 10.1109/ICDAR.2017.231]
Gao L C, Yi X H, Liao Y, Jiang Z R, Yan Z Y and Tang Z. 2017b. A deep learning-based formula detection method for PDF documents//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 553-558 [DOI: 10.1109/ICDAR.2017.96]
Garcia-Garcia A, Orts-Escolano S, Oprea S, Villena-Martinez V and Garcia-Rodriguez J. 2017. A review on deep learning techniques applied to semantic segmentation [EB/OL]. 2017-04-22 [2019-05-05]. https://arxiv.org/pdf/1704.06857.pdf
Gilani A, Qasim S R, Malik I and Shafait F. 2017. Table detection using deep learning//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto, Japan: IEEE: 771-776 [DOI: 10.1109/ICDAR.2017.131]
Hao L P, Gao L C, Yi X H and Tang Z. 2016. A table detection method for PDF documents based on convolutional neural networks//Proceedings of the 12th IAPR Workshop on Document Analysis Systems. Santorini, Greece: IEEE: 287-292 [DOI: 10.1109/DAS.2016.23]
Kaddas P and Gatos B. 2018. A deep convolutional encoder-decoder network for page segmentation of historical handwritten documents into text zones//Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition. Niagara Falls, NY: IEEE: 259-264 [DOI: 10.1109/ICFHR-2018.2018.00053]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Oliveira D A B and Viana M P. 2017. Fast CNN-based document layout analysis//Proceedings of 2017 IEEE International Conference on Computer Vision Workshops. Venice: IEEE: 1173-1180 [DOI: 10.1109/ICCVW.2017.142]
Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE: 6517-6525 [DOI: 10.1109/CVPR.2017.690]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Shelhamer E, Long J and Darrell T. 2017. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 640-651 [DOI: 10.1109/TPAMI.2016.2572683]
Tran T A, Oh K, Na I S, Lee G S, Yang H J and Kim S H. 2017. A robust system for document layout analysis using multilevel homogeneity structure. Expert Systems with Applications, 85: 99-113 [DOI: 10.1016/j.eswa.2017.05.030]
Yu C, Levy C C and Saniee I. 2017. Convolutional neural networks for figure extraction in historical technical documents//Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition. Kyoto: IEEE: 789-795 [DOI: 10.1109/ICDAR.2017.134]
Yu F and Koltun V. 2016. Multi-scale context aggregation by dilated convolutions [EB/OL]. 2016-04-30 [2019-05-05]. http://arxiv.org/pdf/1511.07122.pdf