Project Objective
The primary goal of this study is to enhance and expand existing datasets of isolated glyph images from Southeast Asian palm leaf manuscripts. Publicly available datasets for scripts like Khmer, Balinese, and Sundanese are often limited in size and diversity, which hinders the training of robust deep learning models for document analysis. Our project addresses this by systematically extracting new collections from existing text-line and word-level datasets, aiming to increase the original dataset sizes by 20-50% and thereby improve the performance of classification and recognition tasks.
Our Data Collection Process
To ensure high-quality and linguistically accurate data, we engaged 15 university students from Cambodia and Indonesia. This collaborative, human-in-the-loop process was divided into two distinct roles:
- Group 1 (Collectors): This group was responsible for the initial extraction. With user-friendly annotation tools, they manually cropped individual characters from high-resolution manuscript images. Their local knowledge was crucial for accurately identifying glyphs, even in degraded or low-quality scans.
- Group 2 (Validators): This group ensured the accuracy of the collected data. They cross-referenced each newly labeled glyph against established character classes and dictionaries. This validation step was critical for maintaining data integrity and correcting inconsistencies.
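The two roles can be illustrated with a minimal sketch. It is not the tooling used in the project: the `GlyphBox` record, the `crop_glyph` and `validate_labels` helpers, and the representation of an image as a list of pixel rows are all assumptions made for illustration. A collector's annotation is modeled as a labeled bounding box, and validation is reduced to checking the label against a set of known character classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphBox:
    """Hypothetical annotation record: a label plus a pixel bounding box."""
    label: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop_glyph(image, box):
    """Return the sub-image covered by `box`.

    `image` is assumed to be a non-empty list of equally long pixel rows.
    """
    height, width = len(image), len(image[0])
    inside = 0 <= box.left < box.right <= width and 0 <= box.top < box.bottom <= height
    if not inside:
        raise ValueError(f"box for {box.label!r} falls outside the image")
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def validate_labels(boxes, known_classes):
    """Split annotations into those whose label is a known class and the rest."""
    accepted = [b for b in boxes if b.label in known_classes]
    rejected = [b for b in boxes if b.label not in known_classes]
    return accepted, rejected
```

In this sketch, rejected annotations would go back to a validator for manual review rather than being discarded.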
Challenges Faced
- Limited Dataset Diversity: The initial datasets lacked the variety needed to train robust models. Acquiring additional high-quality manuscript scans remains an ongoing challenge.
- Complex Character Recognition: The intricate nature of the scripts, with layered characters and stylistic variations, complicated the manual labeling process, requiring deep linguistic knowledge.
- Variability in Manuscript Quality: Inconsistent image quality due to age, damage, or poor scanning practices posed significant difficulties for accurate segmentation and preprocessing.
- Validation Complexity: Cross-referencing and validating labels against dictionaries was time-consuming, and establishing a reliable method for resolving labeling conflicts was essential for data integrity.
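One possible way to resolve labeling conflicts is a majority vote among validators, with unresolved cases escalated for discussion. The sketch below is an assumption, not the method used in this project; the `resolve_label` function and its `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def resolve_label(votes, min_agreement=0.6):
    """Return the majority label, or None when the vote is inconclusive.

    `votes` is a list of labels proposed by different validators for one
    glyph. A tie for first place, or a winner whose share of the votes is
    below `min_agreement`, counts as inconclusive.
    """
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie: needs manual review
    if top_count / len(votes) < min_agreement:
        return None  # winner exists but agreement is too weak
    return top_label
```

Returning `None` instead of picking a label keeps ambiguous glyphs out of the dataset until a person with the relevant linguistic knowledge has reviewed them.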
Results & Future Directions
Through this collaborative effort, we collected and validated 15,000 new glyph images. The expanded collection gives our deep learning models more training data and offers a resource to the broader historical document analysis community. Future work will focus on integrating the expanded dataset into our machine learning frameworks and on exploring data augmentation techniques to further enrich the data and improve model performance.
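As a rough illustration of what data augmentation for glyph images can look like, the sketch below applies a small random translation and sparse pixel noise to a grayscale glyph. The specific techniques are still to be explored, so this is only an assumed example: the `shift` and `augment` functions, their parameters, and the image representation (a list of rows of 0 to 255 values with a white background) are hypothetical.

```python
import random


def shift(image, dx, dy, fill=255):
    """Translate the image by (dx, dy) pixels, padding exposed areas with `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = image[y][x]
    return out


def augment(image, rng, max_shift=2, noise_prob=0.02):
    """Return a randomly shifted copy of the image with sparse inverted pixels."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    shifted = shift(image, dx, dy)
    # Invert each pixel with probability `noise_prob` to mimic scan speckle.
    return [[255 - p if rng.random() < noise_prob else p for p in row]
            for row in shifted]
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the augmented variants reproducible across runs.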