Research progress in text analysis and recognition of ethnic minority scripts

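Wang Weilan1, Hu Jinshui2, Wei Hongxi3, Shao Wenyuan4, Bi Xiaojun5, He Jianjun6, Li Zhenjiang7, Yang Zhengyan2, Ding Kai8, Wu Jiajia2, Guo Fengjun8, Zhang Jianshu2, Li Wanying9, Yin Baocai2, Yin Bing2, Liu Cong2, Yang Yaowei9, Jin Lianwen10, Gao Liangcai11 (1. School of Mathematics and Computer Science, Northwest Minzu University; 2. iFLYTEK Research; 3. College of Computer Science, Inner Mongolia University; 4. School of Sociology, Shanghai University; 5. School of Information Engineering, Minzu University of China; 6. School of Information and Communication Engineering, Dalian Minzu University; 7. School of Cyberspace Security, Gansu University of Political Science and Law; 8. Shanghai INTSIG Information Co., Ltd.; 9. School of Computer Science and Technology, Xinjiang University; 10. School of Electronic and Information Engineering, South China University of Technology; 11. Wangxuan Institute of Computer Technology, Peking University)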

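Abstract
China's ethnic scripts differ in structural type, period of creation, and region and scope of use. The historical documents and books written, recorded, and printed in these scripts are vast in number and constitute an extremely valuable resource for studying the history of civilization and development of different ethnic groups. Compared with mainstream languages, research on ethnic minority scripts is often low-resource. The protection and transmission of the intangible cultural heritage of ethnic minorities is highly valued by the state and is of great significance and practical value for preserving non-renewable, diverse cultural resources. Since the start of the 21st century, thanks to the continued development and application of document image analysis and recognition technologies, research on and applications of text analysis and recognition for ethnic minority scripts have attracted wide attention and made considerable progress, gradually becoming a research focus in document analysis and recognition and in artificial intelligence. However, because minority scripts are numerous, application scenarios are broad, and datasets are scarce, many problems in this field remain to be solved. To summarize previous work and support subsequent research, this paper surveys the development history and latest progress of the field in China and abroad, focusing on four sub-tasks for several minority scripts: printed text recognition, handwriting recognition, historical document recognition, and scene text recognition. It first explains the importance and value of text analysis and recognition for ethnic minority scripts and introduces some minority script texts, especially historical documents, and their characteristics. It then reviews the development history and current state of the field, analyzing and summarizing representative results of traditional methods and the progress of deep learning methods. Overall, research objects are expanding in depth and breadth, processing methods have shifted comprehensively to deep neural network models and deep learning methods, recognition performance has improved substantially, and application scenarios continue to expand. Finally, on the basis of this analysis, the paper points out the clear shortcomings in accuracy and generalization and the differences from Chinese text analysis and recognition. In view of the main difficulties and challenges facing the field, it offers an outlook on future research trends and technical development goals.
Keywords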
A survey on text analysis and recognition for multi-ethnic scripts

Wang Weilan1, Hu Jinshui2, Wei Hongxi3, Shao Wenyuan4, Bi Xiaojun5, He Jianjun6, Li Zhenjiang7, Yang Zhengyan2, Ding Kai8, Wu Jiajia2, Guo Fengjun8, Zhang Jianshu2, Li Wanying9, Yin Baocai2, Yin Bing2, Liu Cong2, Yang Yaowei9, Jin Lianwen10, Gao Liangcai11 (1. Northwest Minzu University, Lanzhou; 2. IFLYTEK Co Ltd; 3. Inner Mongolia University, Hohhot; 4. Shanghai University; 5. Minzu University of China; 6. Dalian Minzu University; 7. Gansu University of Political Science and Law; 8. INTSIG Information Co Ltd; 9. Xinjiang University; 10. South China University of Technology; 11. Peking University)

Abstract
China's ethnic scripts differ in structure type, period of creation, and region and scope of use. The historical documents and literary materials written, recorded, and printed in ethnic scripts are voluminous and constitute an invaluable resource for exploring the civilization and development history of different ethnic groups. Compared with mainstream languages, the study of ethnic minority scripts often faces low-resource conditions. In recent years, the protection and inheritance of the intangible cultural heritage of ethnic minorities have received close attention from the state, which is of great significance and application value for protecting non-renewable, diverse cultural resources. Since the beginning of the 21st century, thanks to the continuous development and application of document image analysis and recognition technologies, research on and applications of ethnic script text analysis and recognition have received extensive attention and made remarkable progress, becoming one of the research hotspots in document analysis and recognition and in artificial intelligence. However, because minority scripts are numerous, application scenarios are wide-ranging, and datasets are scarce, a considerable number of problems in minority script text analysis and recognition remain to be solved. To summarize previous work and support subsequent research, this paper reviews the development history and recent progress of the field in China and abroad, focusing on four sub-tasks for several minority scripts: printed text recognition, handwriting recognition, historical document recognition, and scene text recognition. The research covered falls mainly into four areas. 1) Document image preprocessing, in which a series of operations such as binarization, noise removal, skew correction, and image enhancement is applied to the input image.
The goal of preprocessing is to improve the accuracy of subsequent analysis and recognition. 2) Layout analysis, including layout segmentation, text line segmentation, and character segmentation, which helps to reveal the organizational structure of documents and to extract useful information. 3) Text recognition, one of the core tasks of document image analysis, which identifies the text in a document through various technical approaches, ranging from traditional methods such as recognition based on single-character classifiers to end-to-end text line recognition with deep learning. 4) Dataset construction, that is, building datasets for training and evaluating algorithms, such as document image binarization datasets, layout analysis datasets, text line datasets, and character datasets. Comparatively, historical document analysis and recognition is the most difficult, because the rough, degraded, and damaged paper of historical books leads to severe background noise, touching strokes, illegible handwriting, and physical damage in the document images. At present, practical recognition systems for historical documents are lacking. The paper first explains the importance and value of minority script text analysis and recognition and introduces some minority script texts, especially historical documents, and their characteristics. It then reviews the development history and current state of research in the field, analyzing and summarizing representative results of traditional methods and the progress of deep learning methods. Current research objects are expanding in depth and breadth, processing methods are shifting comprehensively to deep neural network models and deep learning methods, recognition performance has improved greatly, and application scenarios continue to expand.
Based on these analyses, the paper points out the obvious deficiencies of ethnic script text recognition in accuracy and generalization ability, as well as its differences from Chinese text recognition. Finally, the main difficulties and challenges facing minority script text recognition are discussed, and future research trends and technical development goals are outlined.
Keywords
