Research Archive

Journal Article · Open Access · Natural Language Processing

Sentence Embedding Architectures for Cross-Lingual Information Retrieval: Comparative Evaluation of Multilingual LSTM, Subword Segmentation, and Shared Encoder Approaches on Low-Resource Language Pairs

Cross-lingual information retrieval, the task of retrieving documents in one language using queries formulated in another, demands sentence representation architectures capable of projecting semantically equivalent content from different languages into aligned embedding spaces. This challenge is particularly acute for low-resource language pairs, where parallel training corpora are scarce. This paper presents a systematic comparative evaluation of three cross-lingual sentence embedding architectures: multilingual LSTM encoders with a shared vocabulary, subword segmentation approaches using SentencePiece with joint BPE, and shared encoder transformer architectures pretrained with cross-lingual objectives. We evaluate across 12 language pairs spanning high-resource (e.g., English-French, English-German), medium-resource (e.g., English-Swahili, English-Yoruba), and low-resource (e.g., English-Igbo, English-Hausa) settings, using a standardized retrieval benchmark that we construct from Wikipedia parallel corpora. Shared encoder transformers achieve the highest mean average precision (MAP) across all resource levels, but exhibit a steeper performance cliff than subword approaches when fewer than 50,000 parallel training sentence pairs are available. On individual pairs the picture differs: for the Yoruba and Igbo language pairs, subword segmentation with morphologically informed tokenization outperforms shared encoder approaches by 11.4 and 14.7 MAP points respectively, which we attribute to the agglutinative morphological structure of these languages. We release the low-resource African language retrieval benchmark as an open dataset to stimulate further research on underrepresented language families.

Ngozi Chukwu, Erik Lindstrom, Yuki Inoue, Amina Diallo · Mar 2016 · 302 citations

Journal Article · Open Access · Software Engineering

Continuous Integration and Continuous Delivery Pipelines: Empirical Evidence from Large-Scale Enterprise Adoptions

Continuous Integration and Continuous Delivery (CI/CD) have become foundational practices in modern software engineering, yet their large-scale adoption within enterprise environments remains poorly understood. This paper presents findings from a multi-case study of twelve enterprise organizations spanning the finance, healthcare, and telecommunications sectors, each with more than 200 developers. Through 87 semi-structured interviews, artifact analysis, and longitudinal observation over 18 months, we identify the critical success factors and systemic barriers that determine CI/CD adoption outcomes. Our findings indicate that cultural resistance, legacy system incompatibility, and inadequate test automation maturity are the three most significant impediments. We propose a five-stage CI/CD Maturity Model (CM²) that maps organizational capabilities to adoption readiness, and validate it against our case data. Organizations that progressed beyond Stage 3 of the model reported a 43% reduction in mean time to recovery (MTTR) and a 61% increase in deployment frequency within 12 months. This work contributes both a validated empirical framework and actionable guidance for practitioners navigating enterprise-scale DevOps transformation.

Lena Hartmann, Marcus J. Oduya, Priya Subramaniam, Thomas Beckett · Mar 2016 · 312 citations
