dc.description.abstract | Historical maps are essential resources for understanding the geography, culture, and socio-political landscapes of the past. However, interpreting these maps manually is labor-intensive and time-consuming, which presents significant challenges for researchers. The difficulty is compounded by the lack of annotated datasets of geographical names for Chinese historical maps, which makes the information they contain even harder to extract and analyze.
Current methods often fall short in handling the complexities of historical maps. Our experiments show that manual annotation can take 1–2 days per Chinese historical map, which not only hampers research productivity but also leads to errors stemming from fatigue and the irregular spacing of geographical names. Additionally, existing Optical Character Recognition (OCR) systems are typically optimized for contemporary texts and struggle with the unique characteristics of historical maps, such as handwritten annotations and grayscale imagery.
To address these shortcomings, this study introduces a five-stage automated process for extracting and recognizing geographical names from Chinese historical maps. The method encompasses character detection, character recognition, character reintegration, and character grouping for geographical names, leveraging OCR to improve both accuracy and efficiency. To expand the training dataset for character detection and recognition, data augmentation techniques, specifically HSV (Hue, Saturation, Value) transformations, are employed. These augmentations improve the model's ability to handle the distinctive features of historical maps.
One major challenge tackled in this research is the irregular spacing of geographical names, which complicates automatic grouping. To resolve this, Delaunay triangulation is used to group spatially related characters into geographical names. We used topographic maps of Hebei, Liaoning, and Shanxi provinces from the 1930s as training datasets. In the overall system evaluation, using topographic maps of Hebei Province from the 1930s as a test dataset, our system achieved 70% accuracy in extracting correct geographical names while also detecting additional scattered character boxes.
In contrast to manual annotation, our proposed Chinese Historical MapOCR system completes the extraction process in just 7–10 minutes, significantly reducing both time and labor costs. This substantial improvement in efficiency makes it a valuable tool for historians and scholars working with large collections of historical maps. | en_US |
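The abstract states that Delaunay triangulation is used for grouping but does not describe how. The following is a minimal sketch of one plausible approach, not the authors' implementation: triangulate the centers of detected character boxes, keep only edges shorter than a threshold, and treat each connected component as a candidate geographical name. The function name `group_characters`, the `max_edge` threshold, and the union-find step are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from scipy.spatial import Delaunay


def group_characters(centers, max_edge):
    """Group character-box centers linked by short Delaunay edges.

    Hypothetical helper: `centers` is a sequence of (x, y) box centers and
    `max_edge` is an assumed distance threshold in the same units.
    Returns a sorted list of groups, each a sorted list of point indices.
    """
    pts = np.asarray(centers, dtype=float)
    parent = list(range(len(pts)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each simplex is a triangle given as three point indices.
    tri = Delaunay(pts)
    for simplex in tri.simplices:
        for a, b in combinations(simplex, 2):
            # Merge the two characters only if the edge is short enough.
            if np.linalg.norm(pts[a] - pts[b]) <= max_edge:
                parent[find(int(a))] = find(int(b))

    groups = {}
    for i in range(len(pts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two made-up place names, three characters each, far apart on the map.
centers = [(0, 0), (10, 1), (20, 0), (100, 50), (110, 51), (120, 50)]
print(group_characters(centers, max_edge=15.0))
```

Because the triangulation only connects neighboring points, the threshold is applied to a small set of candidate edges rather than to every pair of characters. In practice the threshold would likely need to adapt to character size, since the spacing of names on historical maps is irregular.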