Project Overview
The PALM-SEA project is dedicated to the digital preservation and analysis of historical palm leaf manuscripts from Southeast Asia. These manuscripts are invaluable cultural artifacts, containing centuries of knowledge on topics ranging from religious texts and historical records to literature and traditional medicine. However, their organic nature makes them highly susceptible to degradation, posing a significant risk to this heritage.
Our project addresses this challenge by creating the largest publicly available, multi-script dataset of palm leaf manuscripts, featuring scripts like Sundanese, Balinese, and Khmer. We develop and benchmark advanced computational methods for critical tasks such as document enhancement, isolated glyph classification, and full text recognition. By creating robust digital tools, we aim to unlock the rich information held within these manuscripts for scholars, historians, and future generations.
Project Tasks & Details
Our research is structured around four core tasks, each addressing a specific challenge in the digital analysis of palm leaf manuscripts.
Manuscript Collections
Sundanese Manuscripts
Originating from West Java, these manuscripts are written in the Old Sundanese script. The collection showcases the script's distinct rounded letterforms and complex ligatures. Key challenges include high character shape variability due to natural wear and the presence of overlapping text lines, requiring advanced image enhancement.
Balinese Manuscripts
From Bali and Lombok, these texts cover a rich array of topics. The intricate Balinese script features a mix of base consonants and vowel diacritics, creating complex ligatures and stacked forms. The presence of decorative elements intertwined with the script complicates segmentation and recognition.
Khmer Manuscripts
These manuscripts come from Cambodia and use one of the oldest scripts in Southeast Asia. The Khmer script is known for its curvilinear shapes and its subscript characters, which add complexity to the text structure. Many manuscripts are severely degraded, with faint or fragmented characters that require specialized restoration techniques.
Mixed Script Dataset
This collection combines samples from all three scripts to enable robust multi-script analysis, script identification, and cross-lingual studies. It is curated to include variations in script styles and degradation patterns, simulating real-world challenges and fostering the development of generalized models.
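As a rough illustration of how a balanced multi-script sample might be assembled, the sketch below draws the same number of items from each script and shuffles them together. It is only a minimal sketch: the function name `build_mixed_script_split`, the file names, and the per-script counts are hypothetical and are not taken from the PALM-SEA dataset or its curation procedure.

```python
import random
from collections import Counter


def build_mixed_script_split(samples_by_script, per_script, seed=0):
    """Draw the same number of samples from each script and shuffle them.

    samples_by_script: dict mapping a script name to a list of sample IDs.
    per_script: number of samples to draw from every script.
    Returns a list of (sample_id, script) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    mixed = []
    for script, samples in sorted(samples_by_script.items()):
        if len(samples) < per_script:
            raise ValueError(f"{script} has only {len(samples)} samples")
        # Sample without replacement so no image appears twice.
        for sample_id in rng.sample(samples, per_script):
            mixed.append((sample_id, script))
    rng.shuffle(mixed)
    return mixed


# Hypothetical file names and collection sizes, for illustration only.
samples_by_script = {
    "sundanese": [f"sundanese_{i:03d}.png" for i in range(10)],
    "balinese": [f"balinese_{i:03d}.png" for i in range(25)],
    "khmer": [f"khmer_{i:03d}.png" for i in range(40)],
}

mixed = build_mixed_script_split(samples_by_script, per_script=8)
counts = Counter(script for _, script in mixed)
print(counts)
```

Equal per-script counts keep a script identification model from simply favoring the largest collection. A real curation step would presumably also balance for script style and degradation level, which this sketch does not attempt.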
Publications
Multi-low resource languages in palm leaf manuscript recognition: Syllable-based augmentation and error analysis
Thuon, N., et al. (2025). Pattern Recognition Letters.
A Low-Intervention Dual-Loop Iterative Process for Efficient Dataset Expansion and Classification in Palm Leaf Manuscript Analysis
Thuon, N., et al. (2025). International Journal on Document Analysis and Recognition (IJDAR), Special Issue track ICDAR-IJDAR 2025.
Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement
Thuon, N., et al. (2024). International Journal on Document Analysis and Recognition (IJDAR), Special Issue track ICDAR-IJDAR 2024.
KhmerFormer: Multi-Scale CNNs-Transformer with External Attention for Ancient Khmer Isolated Glyph Classification
Thuon, N., et al. (2024). Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2024).
Improving Isolated Glyph Classification Task for Palm Leaf Manuscripts
Thuon, N., et al. (2022). International Conference on Frontiers in Handwriting Recognition (ICFHR 2022).
Conclusion & Impact
Research Contribution
This research introduces novel methodologies (e.g., PALM-GANs for enhancement, EFF for classification, and SADA for text synthesis) that establish new state-of-the-art benchmarks for the analysis of complex, low-resource historical manuscripts.
Cultural Impact
Beyond technical advancements, this work plays a crucial role in the digital preservation of Southeast Asian cultural heritage. By making the contents of endangered manuscripts accessible, we empower new forms of scholarly inquiry and public engagement.
Future Work
Future directions include expanding the dataset to more scripts (e.g., Javanese, Lontara), integrating multimodal approaches (combining visual and linguistic cues), and developing interactive, AI-assisted tools for historians and linguists.
Acknowledgment
This work was primarily conducted by Nimol Thuon, with funding support from the Chinese Academy of Sciences (CAS), The World Academy of Sciences (TWAS, Italy), and the National Natural Science Foundation of China (NSFC). The author also acknowledges the valuable contributions and collaboration from partners in Cambodia, China, and Indonesia.
References
[1] Kesiman, M. W. A., et al. (2018). ICFHR 2018 competition on document image analysis tasks for Southeast Asian palm leaf manuscripts. In 16th International Conference on Frontiers in Handwriting Recognition (ICFHR).
[2] Valy, D., et al. (2017). A new Khmer palm leaf manuscript dataset for document analysis and recognition: SleukRith Set. In 4th International Workshop on Historical Document Imaging and Processing.
[3] Suryani, M., et al. (2017). The handwritten Sundanese palm leaf manuscript dataset from 15th century. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).
[4] Burie, J. C., et al. (2016). ICFHR2016 competition on the analysis of handwritten text in images of Balinese palm leaf manuscripts. In 15th International Conference on Frontiers in Handwriting Recognition (ICFHR).
[5] Thuon, N., et al. (2024). Generate, transform, and clean: the role of GANs and transformers in palm leaf manuscript generation and enhancement. International Journal on Document Analysis and Recognition (IJDAR).