Sentence Embedding Architectures for Cross-Lingual Information Retrieval: Comparative Evaluation of Multilingual LSTM, Subword Segmentation, and Shared Encoder Approaches on Low-Resource Language Pairs
Cross-lingual information retrieval, the task of retrieving documents in one language using queries formulated in another, demands sentence representation architectures that project semantically equivalent content from different languages into an aligned embedding space. The challenge is particularly acute for low-resource language pairs, where parallel training corpora are scarce. This paper presents a systematic comparative evaluation of three cross-lingual sentence embedding architectures: multilingual LSTM encoders with a shared vocabulary, subword segmentation approaches using SentencePiece with joint BPE, and shared encoder transformer architectures pretrained with cross-lingual objectives. We evaluate across 12 language pairs spanning high-resource (English-French, English-German), medium-resource (English-Swahili, English-Yoruba), and low-resource (English-Igbo, English-Hausa) settings, using a standardized retrieval benchmark that we construct from Wikipedia parallel corpora. Shared encoder transformers achieve the highest mean average precision (MAP) across all resource levels, but their performance falls off more steeply than that of subword approaches once fewer than 50,000 parallel sentence pairs are available for training. However, for the Yoruba and Igbo language pairs, subword segmentation with morphologically informed tokenization outperforms shared encoder approaches by 11.4 and 14.7 MAP points respectively, which we attribute to the agglutinative morphological structure of these languages. We release the low-resource African language retrieval benchmark as an open dataset to stimulate further research on underrepresented language families.
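Since every result above is reported as MAP, a small worked example may help make the metric concrete. The sketch below is illustrative only: it assumes documents are ranked by cosine similarity between query and document embeddings, and the function names, the toy two-dimensional vectors, and the relevance sets are all hypothetical stand-ins rather than the encoders or the benchmark evaluated in this paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by decreasing similarity to the query."""
    return sorted(range(len(doc_vecs)),
                  key=lambda j: -cosine(query_vec, doc_vecs[j]))


def average_precision(ranked_doc_ids, relevant_ids):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(query_vecs, doc_vecs, relevant):
    """Average the per-query average precision over all queries."""
    scores = [
        average_precision(rank_documents(q, doc_vecs), rel)
        for q, rel in zip(query_vecs, relevant)
    ]
    return sum(scores) / len(scores)


# Toy data (assumption): queries in one language, documents in another,
# already mapped into a shared 2-dimensional embedding space.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]
relevant = [{0}, {2}]  # indices of the relevant documents for each query

toy_map = mean_average_precision(query_vecs, doc_vecs, relevant)
print(f"MAP = {toy_map:.2f}")
```

In this toy case the first query ranks its relevant document first (average precision 1.0) and the second ranks it second (average precision 0.5), giving a MAP of 0.75. When each query has exactly one relevant document, as is typical when a benchmark is derived from parallel sentences, MAP coincides with mean reciprocal rank.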