Deep Learning Architectures for Genomic Variant Pathogenicity Prediction: Evaluation of CNN, LSTM, and Attention-Based Models on ClinVar and gnomAD Population Databases
The clinical interpretation of genomic variants of uncertain significance (VUS) is one of the most pressing bottlenecks in genomic medicine: over 60 percent of variants identified in clinical sequencing are classified as being of uncertain significance in the ClinVar database. Machine learning approaches to variant pathogenicity prediction could reduce this uncertainty, but the relative merits of different deep learning architectures for this task, and the generalizability of published models across ancestrally diverse populations, remain incompletely understood. This paper presents a systematic evaluation of four deep learning architectures for variant pathogenicity prediction: a one-dimensional CNN with nucleotide sequence context, a bidirectional LSTM with epigenomic feature integration, a transformer with self-attention over genomic windows, and a novel hybrid CNN-Transformer architecture we term VariantNet. We evaluate on a benchmark dataset of 48,000 pathogenic and benign variants curated from ClinVar, annotated with gnomAD population frequencies and stratified by variant type (SNV, indel, splice-site) and ancestry group. VariantNet achieves the highest AUC (0.943) on the combined benchmark, with particularly strong performance on splice-site variants (AUC 0.961), where sequential context is most informative. A critical finding is that all models show significant performance degradation on African-ancestry variants (mean AUC drop of 0.041) due to underrepresentation in training data; we address this gap through ancestry-stratified training with transfer learning. We release the VariantNet weights, training code, and curated benchmark dataset as open-source resources for the bioinformatics community.
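The stratified evaluation described above (AUC reported per variant type and per ancestry group) can be illustrated with a minimal sketch. This is not the released evaluation code: the record fields (`variant_type`, `ancestry`, `label`, `score`), the function names, and the toy records are hypothetical and chosen only for illustration. AUC is computed here with the rank-based Mann-Whitney formulation so that the example needs nothing beyond the Python standard library.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC via the Mann-Whitney U statistic, using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC is undefined when a stratum contains only one class
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_auc(records, key):
    """Group records by `key` and compute AUC separately within each group."""
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        labels, scores = groups[rec[key]]
        labels.append(rec["label"])
        scores.append(rec["score"])
    return {group: roc_auc(l, s) for group, (l, s) in groups.items()}


# Toy records (invented for illustration; label 1 = pathogenic, 0 = benign).
records = [
    {"variant_type": "SNV", "ancestry": "EUR", "label": 1, "score": 0.9},
    {"variant_type": "SNV", "ancestry": "EUR", "label": 0, "score": 0.2},
    {"variant_type": "splice-site", "ancestry": "AFR", "label": 1, "score": 0.6},
    {"variant_type": "splice-site", "ancestry": "AFR", "label": 0, "score": 0.7},
    {"variant_type": "indel", "ancestry": "AFR", "label": 1, "score": 0.8},
    {"variant_type": "indel", "ancestry": "EUR", "label": 0, "score": 0.3},
]

by_ancestry = stratified_auc(records, "ancestry")
by_type = stratified_auc(records, "variant_type")
```

Reporting a per-stratum value, rather than only the pooled AUC, is what makes a gap such as the one observed for African-ancestry variants visible. Strata containing a single class return `None` instead of a misleading number.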