IHCCD: irregular handwritten Chinese character recognition dataset

季佳美, 邵允学, 季倓正(南京工业大学计算机科学与技术学院)

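Abstract
Objective With the rapid development of deep learning, regular handwritten Chinese character recognition (HCCR) has achieved breakthrough progress, but research on recognizing irregularly written Chinese characters is still in its infancy. Because of calligraphic schools and individual writing habits, handwritten characters often differ markedly from printed fonts, so the overall structure of characters in the same category varies greatly, and recognition models trained on existing datasets cannot accurately recognize irregularly written characters. Method To promote research on irregular handwritten Chinese character recognition, this paper collects the first dataset of irregularly written Chinese characters, the Irregular Handwritten Chinese Character Dataset (IHCCD), which currently contains 3755 categories with 30 samples per category. Result This paper reports the baseline performance of classical deep learning models (ResNet, CBAM-ResNet, Vision Transformer, Swin Transformer) on this dataset, laying a foundation for advancing irregular handwritten Chinese character recognition. Dataset download link: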
IHCCD: dataset for identification of irregular handwritten Chinese characters

Ji Jiamei, Shao Yunxue, Ji Tanzheng (School of Computer Science and Technology, Nanjing Tech University)

Objective With the rapid development of deep learning, handwritten Chinese character recognition (HCCR) has made breakthrough progress. Text recognition research initially focused on English characters and digits, but as artificial intelligence has matured, more researchers have turned to Chinese character recognition. In recent years it has been applied widely in scenarios such as bank bill recognition, mail sorting, and office automation. Chinese characters are one of the most widely used writing systems in the world and carry rich information; they are an important carrier of people's communication, so research on Chinese character recognition is of great value. Despite these advances, recognizing irregular handwritten Chinese characters remains challenging. Handwritten characters are often shaped by calligraphic styles and individual writing habits, which leads to significant deviations from regular printed fonts and to considerable differences in overall structure among characters of the same category. As a result, recognition models trained on existing regular-handwriting datasets may struggle to identify the irregularly handwritten characters encountered in real-world scenarios. For example, a picture sent through WeChat may contain sensitive words. If those words are written regularly, a text recognition engine can identify and filter them accurately, but some people deliberately write irregularly to evade the engine and avoid regulation, so the engine fails to recognize the words.
Therefore, research on recognizing irregular handwritten Chinese characters is of great significance and can be applied to information security and information filtering. Method Irregular handwritten Chinese characters can be grouped into several types: missing strokes or wrong stroke order, strokes that are improperly connected or separated, deliberately enlarged or shrunken radicals, severe distortion of the character shape, changes of form, and horizontal and vertical strokes with excessive amplitude. These irregularities misplace the overall spatial structure of a character and easily lead to ambiguity and misreading. To promote research on recognizing irregular handwritten Chinese characters, this paper collects the first Irregular Handwritten Chinese Character Dataset (IHCCD), which currently contains 3755 categories with 30 samples per category. In the experiments, the first 20 samples of each category were used for training and the next 10 for testing. IHCCD was produced by different writers who wrote irregularly by hand on A4 printing paper; a scanner was used as the input device to convert the handwritten samples into digital images. During collection, the writers did not need to follow the regular stroke order of Chinese characters exactly. They could freely adjust stroke thickness, length, and position, enlarge or shrink radicals arbitrarily, and change the tilt of the characters, which distorts the character shapes and misaligns their spatial structure so as to bypass current text recognition engines.
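The per-category split described above (first 20 samples for training, next 10 for testing) can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(label, sample_index, path)`, the function name `split_by_class`, and the file names are all assumptions made for the example, since the abstract does not describe how the dataset is stored.

```python
from collections import defaultdict

TRAIN_PER_CLASS = 20  # from the abstract: first 20 samples per category
TEST_PER_CLASS = 10   # from the abstract: next 10 samples per category


def split_by_class(samples):
    """Split (label, sample_index, path) records into train and test lists.

    Within each category, records are ordered by sample_index; the first 20
    go to the training list and the following 10 go to the test list.
    """
    by_class = defaultdict(list)
    for label, index, path in samples:
        by_class[label].append((index, path))

    train, test = [], []
    for label in sorted(by_class):
        ordered = sorted(by_class[label])
        stop = TRAIN_PER_CLASS + TEST_PER_CLASS
        train += [(label, path) for _, path in ordered[:TRAIN_PER_CLASS]]
        test += [(label, path) for _, path in ordered[TRAIN_PER_CLASS:stop]]
    return train, test


# Hypothetical records: 3755 categories x 30 samples, with made-up file names.
records = [
    (label, i, f"class_{label:04d}/sample_{i:02d}.png")
    for label in range(3755)
    for i in range(30)
]
train_set, test_set = split_by_class(records)
print(len(train_set), len(test_set))  # 75100 37550
```

With 3755 categories this yields 75100 training samples and 37550 test samples, which follows directly from the counts stated in the abstract.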
A series of image processing techniques, including image skew correction, single-character segmentation, Otsu binarization, and character normalization, is applied to construct the IHCCD dataset. Result Detailed experiments were conducted on the IHCCD and CASIA-HWDB1.1 datasets to compare the recognition performance of classical network models, such as ResNet, CBAM-ResNet, Vision Transformer, and Swin Transformer, under different experimental settings. The results show that although the classical models achieve good performance on the regularly written CASIA-HWDB1.1 dataset, models trained on the CASIA-HWDB1.1 training set recognize the IHCCD test set poorly. After the IHCCD training set is added, the recognition performance of all the classical models on the IHCCD test set improves greatly, which shows that the IHCCD dataset is significant for the study of irregular handwritten Chinese character recognition. Existing OCR models still have limitations, and the dataset collected in this paper can effectively improve their generalization. However, even for Swin Transformer, the best-performing model, a large gap remains between recognition accuracy on irregular and on regular handwritten Chinese characters, which calls for further in-depth study. Link to download this dataset: