Khmer Semantic Search Engine: Digital Information Access and Document Retrieval

Nimol Thuon
Institute of Technology of Cambodia (ITC)



Bridging the gap in digital information access for Khmer texts through semantic search techniques...


Abstract

The search engine process is crucial for document content retrieval. For Khmer documents, an effective tool is needed to extract essential keywords and facilitate accurate searches. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content, such as reports, articles, and social media feedback, enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods. Using semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries, performs precise matching, and returns the best-matching offline documents and online URLs. We propose three semantic search frameworks: semantic search based on a keyword dictionary, semantic search based on ontology, and semantic search based on ranking. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and addressed issues related to searching and semantic search. Our findings demonstrate that understanding search term semantics can lead to significantly more accurate results.

Background

Khmer, a complex and low-resource language, presents significant challenges for digital content retrieval. Traditional keyword-based search systems often fail to capture the nuances of the language, resulting in poor retrieval accuracy. Our research introduces a semantic search engine that utilizes advanced language modeling techniques, enabling accurate document retrieval and improved user experience in accessing Khmer digital content.



Approach

Our system takes a hybrid approach, combining a pre-trained transformer model for semantic embeddings and a fine-tuned classifier to identify relevant document segments. The pipeline is optimized to handle large-scale Khmer documents and can accurately retrieve information based on context-aware queries. The model is further fine-tuned using a contrastive learning approach, aligning semantic similarities between queries and document passages.
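
As a rough illustration of the retrieval step described above, the sketch below ranks documents by cosine similarity between a query embedding and document embeddings. It is a minimal sketch under stated assumptions, not the KSE implementation: the function names `cosine_similarity` and `rank_documents`, the three-dimensional toy vectors, and the document IDs are all hypothetical, and a real system would obtain the vectors from the pre-trained transformer encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs, top_k=5):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real vectors would come from the encoder.
toy_docs = {
    "Doc1": [0.9, 0.1, 0.0],
    "Doc2": [0.1, 0.8, 0.1],
    "Doc3": [0.7, 0.3, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_documents(toy_query, toy_docs, top_k=2)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With these toy vectors, the two documents whose embeddings point closest to the query direction are returned first. For large collections, the linear scan would normally be replaced by an approximate nearest-neighbour index.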



The fine-tuning process is divided into two stages: (a) embedding optimization to capture both lexical and semantic features, and (b) query-document alignment using a Siamese network architecture to refine search results based on contextual relevance.
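
The query-document alignment stage can be illustrated with a contrastive objective. The sketch below assumes an InfoNCE-style loss with in-batch negatives, which is one common choice for training Siamese encoders; the source does not state which loss is used, so the function name `info_nce_loss`, the temperature value, and the toy vectors are assumptions for illustration only. The shared encoder itself is omitted, and the inputs are taken to be already L2-normalized embeddings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    passage_vecs[i] is treated as the positive passage for query_vecs[i];
    every other passage in the batch acts as a negative.
    """
    if len(query_vecs) != len(passage_vecs) or not query_vecs:
        raise ValueError("need equally sized, non-empty batches")
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [dot(query, passage) / temperature for passage in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Negative log-probability of the positive passage.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical unit-length embeddings for two query/passage pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_passages = [[1.0, 0.0], [0.0, 1.0]]
misaligned_passages = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce_loss(queries, aligned_passages)
misaligned_loss = info_nce_loss(queries, misaligned_passages)
print(aligned_loss < misaligned_loss)
```

The loss is small when each query is closest to its own passage and large when it is closer to another passage in the batch, which is the behaviour the alignment stage relies on.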



Results

Our evaluation shows that the Khmer Semantic Search Engine significantly outperforms traditional keyword-based methods in both precision and recall, delivering more relevant search results for complex queries. The search engine's effectiveness is demonstrated through various real-world test cases, including academic research papers, historical texts, and legal documents.

Keyword Extraction Comparison Results

TABLE I: Comparison of Manual Extraction and Tool Extraction for KSE

| Document ID | Manual Keywords | KSE Keywords | TF-IDF | TextRank | RAKE |
|---|---|---|---|---|---|
| 1 | kw1, kw2, kw3 | kw1, kw2, kw4 | kw2, kw3, kw5 | kw1, kw3, kw6 | kw1, kw4, kw5 |
| 2 | kw4, kw5, kw6 | kw4, kw5, kw7 | kw5, kw6, kw8 | kw4, kw6, kw9 | kw4, kw7, kw8 |
| 3 | kw7, kw8, kw9 | kw7, kw8, kw10 | kw8, kw9, kw11 | kw7, kw9, kw12 | kw7, kw10, kw11 |
| 4 | kw10, kw11, kw12 | kw10, kw11, kw13 | kw11, kw12, kw14 | kw10, kw12, kw15 | kw10, kw13, kw14 |
| 5 | kw13, kw14, kw15 | kw13, kw14, kw16 | kw14, kw15, kw17 | kw13, kw15, kw18 | kw13, kw16, kw17 |

TABLE II: Comparison of Manual Extraction and Tool Extraction for Title and Body Keywords

| Manual Extraction Keyword | Extraction Keyword by Tool | Manual Extraction Keyword | Extraction Keyword by Tool |
| Title | Body | Title | Body |
|---|---|---|---|
| Khos Rong | Secret | Sihanoukville | Khos Rong |
| Khmer Tourism | Natural | Sea | Sihanoukville |
| Natural Beauty | Khos Rong | Tourist | Khmer |
| Natural | Khmer Tourism | Khos Rong | Natural |
| Tourism | Rare | Natural Beauty | Beach |
| Secret | Beach | Natural Beauty | Natural Beauty |
| Beach | Beach | Natural | |
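
TABLE III: Results of Keyword Extraction based on Title and Body Keywords

| Document ID | Title Precision | Title Recall | Title F1 | Body Precision | Body Recall | Body F1 |
|---|---|---|---|---|---|---|
| 1 | 0.80 | 0.66 | 0.36 | 0.77 | 0.77 | 0.78 |
| 2 | 0.66 | 0.80 | 0.72 | 0.81 | 0.90 | 0.85 |
| 3 | 0.57 | 0.44 | 0.50 | 0.83 | 0.55 | 0.66 |
| 4 | 1.00 | 1.00 | 1.00 | 0.88 | 0.80 | 0.84 |
| 5 | 1.00 | 0.57 | 0.72 | 1.00 | 0.72 | 0.84 |
| 1,150 | 0.71 | 0.83 | 0.76 | 1.00 | 0.75 | 0.85 |
| Average | 0.88 | 0.81 | 0.84 | 0.81 | 0.79 | 0.79 |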
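
For readers who want to reproduce scores of the kind reported in Table III, the sketch below computes precision, recall, and F1 for one document by comparing tool-extracted keywords against manually extracted ones. It assumes exact-match, set-based comparison, which the source does not confirm; the function name `precision_recall_f1` and the two keyword lists are hypothetical examples, not rows from the evaluation data.

```python
def precision_recall_f1(manual_keywords, tool_keywords):
    """Set-based precision, recall, and F1 of tool keywords against manual keywords."""
    manual = set(manual_keywords)
    tool = set(tool_keywords)
    matched = len(manual & tool)
    # Precision: share of tool keywords that also appear in the manual set.
    precision = matched / len(tool) if tool else 0.0
    # Recall: share of manual keywords that the tool recovered.
    recall = matched / len(manual) if manual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical keyword lists for a single document.
manual_example = ["Khos Rong", "Khmer Tourism", "Natural Beauty", "Beach"]
tool_example = ["Khos Rong", "Natural Beauty", "Sihanoukville"]

p, r, f1 = precision_recall_f1(manual_example, tool_example)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.67 recall=0.50 f1=0.57
```

Here two of the three tool keywords are correct (precision 2/3) and two of the four manual keywords are recovered (recall 1/2), giving F1 = 4/7.
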

TABLE IV: Comparison of Search Results for Top 5 Documents

| Keyword | Manual Extraction | Tool Extraction | Manual Top 5 Documents | Tool Top 5 Documents |
|---|---|---|---|---|
| Khos Rong | Khos Rong | Khos Rong | Doc1, Doc2, Doc3, Doc4, Doc5 | Doc1, Doc3, Doc4, Doc6, Doc7 |
| Khmer Tourism | Khmer Tourism | Natural | Doc2, Doc5, Doc8, Doc11, Doc14 | Doc2, Doc5, Doc9, Doc12, Doc15 |
| Natural Beauty | Natural Beauty | Khos Rong | Doc3, Doc6, Doc9, Doc12, Doc15 | Doc1, Doc3, Doc6, Doc10, Doc13 |
| Natural | Natural | Khmer Tourism | Doc4, Doc7, Doc10, Doc13, Doc16 | Doc4, Doc7, Doc11, Doc14, Doc17 |
| Tourist | Tourist | Beach | Doc5, Doc8, Doc11, Doc14, Doc17 | Doc5, Doc8, Doc12, Doc15, Doc18 |
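
One simple way to quantify how closely the two top-5 lists in Table IV agree is the fraction of documents they share. The source does not report this measure, so the sketch below is only an assumed illustration; the function name `top_k_overlap` is hypothetical, and the two lists are the "Khos Rong" row of Table IV.

```python
def top_k_overlap(manual_ranking, tool_ranking, k=5):
    """Fraction of the top-k documents shared by two rankings (order is ignored)."""
    if k <= 0:
        raise ValueError("k must be positive")
    manual_top = set(manual_ranking[:k])
    tool_top = set(tool_ranking[:k])
    return len(manual_top & tool_top) / k


# Top-5 lists for the keyword "Khos Rong" as listed in Table IV.
manual_top5 = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"]
tool_top5 = ["Doc1", "Doc3", "Doc4", "Doc6", "Doc7"]

overlap = top_k_overlap(manual_top5, tool_top5)
print(overlap)  # prints: 0.6
```

Doc1, Doc3, and Doc4 appear in both lists, so three of the five results coincide for this keyword.
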

TABLE V: Results of Our Proposed KSE Across Documents

| Total Input Test | F1-Score |
|---|---|
| 1 | 0.71 |
| 2 | 0.87 |
| 3 | 0.48 |
| 4 | 0.79 |
| 5 | 0.77 |
| 100 | 0.71 |
| Average | 0.75 |

Societal Impact

The system can also handle complex multi-modal queries, such as combining text and visual elements to extract relevant document sections. By incorporating view synthesis techniques, the search engine can dynamically adjust its output based on user preferences, enhancing user interaction and information accessibility.

The Khmer Semantic Search Engine is adaptable to a wide range of document types and contexts, offering customization options based on user requirements. Whether it's legal documents, academic texts, or historical archives, the system maintains a high level of fidelity in terms of context and semantic accuracy.

Our system can be extended to support document enrichment and annotation tasks, such as highlighting key phrases, adding metadata, and integrating cross-references. These features enable more comprehensive document management and access, making the Khmer Semantic Search Engine a versatile tool for both research and practical applications.

This project aims to promote digital inclusivity and preservation for low-resource languages like Khmer. By providing a robust and accessible information retrieval system, we hope to empower researchers, educators, and the general public to access and engage with Khmer digital content more effectively. The potential for misuse, however, must be carefully considered, and efforts should be made to ensure the ethical deployment of such systems.


BibTeX

@article{thuon2024khmersemanticsearch,
  title={Khmer Semantic Search Engine: Digital Information Access and Document Retrieval},
  author={Thuon, Nimol},
  journal={arXiv preprint arXiv:2406.09320v1},
  year={2024}
}

Acknowledgements: I would like to thank my research team at ITC for their support, as well as colleagues and mentors who have provided invaluable feedback throughout this project.