Handwritten text line extraction under a joint regression-clustering framework

Zhu Jianfei, Ying Zilu, Chen Pengfei (School of Information Engineering, Wuyi University, Jiangmen 529020)

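Abstract
Objective Handwritten text line extraction is an important basic step in document image processing. In unconstrained handwritten text images, text lines are skewed, curved, crossing, and touching to varying degrees. Traditional geometric segmentation or clustering methods often cannot guarantee accurate segmentation along text line edges. To address these problems, a handwritten text line extraction method based on a joint text line regression-clustering framework is proposed. Method First, an anisotropic Gaussian filter bank is used to analyze the image at multiple scales and orientations; the smearing effect is used to detect ridge structures and extract the main body area of each text line, which is then skeletonized to obtain the text line regression model. Then, a superpixel representation is built with connected components as the basic image units. To cluster the superpixels, a pixel-superpixel-text line associative hierarchical random field model is established, and energy function optimization is used to cluster the superpixels and label the text line to which each belongs. On this basis, all character blocks that touch across lines are detected, and a regression-line-based k-means clustering algorithm, guided by the regression model, clusters the pixels of the touching characters, which segments them and labels the text line to which each pixel belongs. Finally, text line label switches are used to control the display and targeted extraction of text line pixels, so geometric segmentation is no longer required. Result In text line extraction tests on the HIT-MW offline handwritten Chinese document dataset, the detection rate (DR) is 99.83% and the recognition accuracy (RA) is 99.92%. Conclusion The experiments show that, compared with traditional methods such as piecewise projection analysis, minimum spanning tree clustering, and seam carving, the proposed joint text line regression-clustering framework improves the controllability of text line edges and the segmentation accuracy. It extracts handwritten text lines efficiently while avoiding interference from adjacent text lines as far as possible, and it shows high accuracy and robustness.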
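The anisotropic Gaussian filter bank named in the abstract can be illustrated with a minimal sketch. It only builds the kernels; it does not reproduce the authors' filtering, ridge detection, or skeletonization. The function names, the `elongation` ratio, and the particular scales and angles are assumptions chosen for illustration and do not come from the paper.

```python
import math

def anisotropic_gaussian_kernel(sigma_long, sigma_short, theta, radius=None):
    """Return a normalized 2-D anisotropic Gaussian kernel as a list of rows.

    sigma_long  : standard deviation along the kernel's main axis
    sigma_short : standard deviation across the main axis
    theta       : orientation of the main axis in radians (0 = horizontal)
    """
    if radius is None:
        radius = int(math.ceil(3 * max(sigma_long, sigma_short)))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            # Rotate image coordinates into the kernel's own axes.
            u = x * cos_t + y * sin_t
            v = -x * sin_t + y * cos_t
            row.append(math.exp(-0.5 * ((u / sigma_long) ** 2
                                        + (v / sigma_short) ** 2)))
        kernel.append(row)
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]

def build_filter_bank(scales, angles_deg, elongation=4.0):
    """Build one kernel per (scale, angle) pair.

    The elongation ratio is a hypothetical parameter: wide kernels along the
    writing direction smear characters of one line into a single ridge.
    """
    bank = {}
    for s in scales:
        for a in angles_deg:
            bank[(s, a)] = anisotropic_gaussian_kernel(
                elongation * s, s, math.radians(a))
    return bank

# Hypothetical settings: two scales, three near-horizontal orientations.
bank = build_filter_bank(scales=[1.0, 2.0], angles_deg=[-10, 0, 10])
```

Each kernel is much wider along its main axis than across it, so convolving a page with it blurs the characters of one line together while keeping neighbouring lines apart. That is the smearing behaviour the abstract relies on to find the main body area.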
Keywords
Combination of regression and clustering for handwritten text line extraction

Zhu Jianfei, Ying Zilu, Chen Pengfei (School of Information Engineering, Wuyi University, Jiangmen 529020, China)

Abstract
Objective Handwritten text line extraction is a fundamental step in document image processing. In unconstrained handwritten documents, text lines may be skewed, curved, crossing, or touching because of the unconstrained page layout and free writing style. Traditional geometric segmentation and clustering methods cannot guarantee that the pixels between text lines are classified accurately. In this study, a joint text line regression-clustering framework for handwritten text line extraction is proposed. Method First, an anisotropic Gaussian filter bank is used to filter the handwritten document image at multiple scales and orientations. The main body area (MBA) of each text line is extracted by smearing, and the text line regression model is then obtained by extracting the skeleton of the MBA. Second, a super-pixel representation is constructed with the connected component as the basic image element. To classify and cluster the super-pixels, an approach based on associative hierarchical random fields is presented. A higher-order energy model is established by constructing a hierarchical network of pixels, connected components, and text lines. On the basis of this model, an energy function is built whose minimization yields the text line labels of the connected components. Using these instance labels, the touching characters that join adjacent text lines are detected. Third, the pixels of the touching characters are re-clustered with the k-means algorithm under the constraint of the text line regression model. With the instance labels of the text lines, individual text lines can be displayed or extracted by switching their labels on and off. Therefore, geometric segmentation of the document image is no longer needed, and a bounding box can be used to extract each text line directly. Result Experiments were performed on the HIT-MW document-level dataset. The proposed framework achieved an overall detection rate of 99.83% and a recognition accuracy of 99.92%, which reaches state-of-the-art performance for Chinese handwritten text line extraction. Conclusion Experimental results show that the proposed joint text line regression-clustering framework improves pixel-level segmentation accuracy and makes text line edges more controllable than traditional algorithms such as piecewise projection, minimum spanning tree-based clustering, and seam carving. The proposed system extracts Chinese handwritten text lines efficiently while avoiding interference from adjacent text lines, and it shows high accuracy and robustness.
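The regression-guided re-clustering of touching characters and the label switch can be illustrated with a simplified sketch. It is not the authors' algorithm. It assumes each regression line is a piecewise-linear polyline with strictly increasing x, and it performs only a single nearest-line assignment, using vertical distance, rather than an iterated k-means. All names and sample coordinates are hypothetical.

```python
from bisect import bisect_right

def line_y_at(polyline, x):
    """Linearly interpolate a regression line (sorted (x, y) points) at x."""
    xs = [p[0] for p in polyline]
    if x <= xs[0]:
        return polyline[0][1]
    if x >= xs[-1]:
        return polyline[-1][1]
    i = bisect_right(xs, x)
    (x0, y0), (x1, y1) = polyline[i - 1], polyline[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def assign_pixels(pixels, regression_lines):
    """Label each pixel with the text line whose regression line is closest.

    Assumption: closeness is the vertical distance to the line at the pixel's
    x coordinate; the regression lines play the role of fixed cluster centres.
    """
    labels = {}
    for (x, y) in pixels:
        labels[(x, y)] = min(
            regression_lines,
            key=lambda name: abs(y - line_y_at(regression_lines[name], x)))
    return labels

def extract_line(labels, wanted):
    """Label switch: keep only the pixels that carry the wanted line label."""
    return sorted(p for p, lab in labels.items() if lab == wanted)

# Hypothetical data: two slightly rising text lines and one touching blob.
regression_lines = {1: [(0, 10), (100, 14)], 2: [(0, 40), (100, 44)]}
touching_blob = [(50, 20), (50, 26), (50, 30), (50, 35)]

labels = assign_pixels(touching_blob, regression_lines)
line_one_pixels = extract_line(labels, 1)
```

Because every pixel carries a text line label, selecting a line reduces to filtering by label. This matches the abstract's point that no geometric cut between lines is required.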
Keywords
