Retrieval-Augmented Generation for Knowledge-Intensive Enterprise Tasks: Dense Retriever Design, Index Freshness Management, and Faithfulness Evaluation in Production RAG Systems
Retrieval-Augmented Generation (RAG) -- the combination of dense retrieval over a knowledge corpus with generative language model synthesis -- has emerged as the leading architectural pattern for grounding large language model outputs in verifiable factual sources, addressing the hallucination and knowledge staleness limitations of parametric LLM knowledge. Despite rapid practitioner adoption, the engineering design space of production RAG systems -- covering retriever architecture selection, embedding model choice, index freshness management, chunking strategy, context window utilization, and faithfulness evaluation -- lacks systematic empirical treatment. This paper presents the first comprehensive engineering study of production RAG systems, evaluating design decisions across 14 dimensions using a standardized enterprise question answering benchmark comprising 8,400 questions across legal, financial, and technical documentation corpora. We find that bi-encoder dense retrievers (DPR, E5-large) outperform BM25 sparse retrieval by 18.4 F1 points on complex multi-hop questions but underperform by 7.2 points on exact keyword lookup queries, motivating hybrid retrieval as the default architecture. Chunk size has the highest sensitivity of any single design parameter -- optimal chunk size varies by 4x across corpora depending on document structure. We introduce the RAG Faithfulness Score (RFS), a composite metric measuring citation accuracy, claim groundedness, and context utilization efficiency, and demonstrate its correlation with downstream user trust ratings (Pearson r=0.74). We release evaluation code, benchmark datasets, and optimal configuration templates for six enterprise RAG deployment profiles.