Transformer Architecture Optimization for On-Device Inference: Knowledge Distillation, Quantization, and Pruning Strategies for Deploying Large Language Models on Edge Hardware
Deploying transformer-based large language models on edge devices (smartphones, embedded systems, and IoT endpoints) requires model compression techniques that preserve task performance while meeting the memory, compute, and energy constraints of the target hardware. This paper presents a systematic empirical study of three compression paradigms: knowledge distillation, quantization (both post-training quantization and quantization-aware training), and pruning (both structured and unstructured). We apply them to BERT-base, DistilBERT, and GPT-2-medium, evaluate on the GLUE and SQuAD benchmarks, and deploy on Qualcomm Snapdragon 888, Apple A15 Bionic, and STM32H7 microcontroller platforms. We characterize the accuracy-compression trade-off surface for each technique individually and in combination, finding that hybrid pipelines combining 4-bit quantization with structured pruning at 40% sparsity achieve a 6.2x reduction in model size and a 4.1x inference speedup on Snapdragon 888 with less than 3% accuracy degradation on GLUE. On the STM32H7 microcontroller, task-specific distilled models achieve viable inference at 380 ms per token within a 256 KB RAM budget. We introduce the Edge Deployment Efficiency Index (EDEI), which normalizes accuracy retention against inference latency, memory footprint, and energy consumption, and we release a reproducible compression pipeline toolkit supporting all three techniques. To our knowledge, this work is the most comprehensive empirical guide to LLM edge deployment to date.
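As an illustration of the hybrid pipeline described above, the following sketch applies structured pruning at 40% sparsity and then symmetric 4-bit quantization to a toy weight matrix. It is a minimal pure-Python example under stated assumptions, not the released toolkit's implementation: the function names (prune_rows, quantize_int4, dequantize), the row-norm pruning criterion, and the per-row scale factor are all hypothetical choices made for this example.

```python
import random


def prune_rows(weights, sparsity):
    """Structured pruning (assumed criterion): zero the rows with the
    smallest squared L2 norm until `sparsity` of the rows are removed."""
    n_prune = int(len(weights) * sparsity)
    norms = [sum(w * w for w in row) for row in weights]
    order = sorted(range(len(weights)), key=lambda i: norms[i])
    pruned = set(order[:n_prune])
    out = [[0.0] * len(row) if i in pruned else list(row)
           for i, row in enumerate(weights)]
    return out, pruned


def quantize_int4(row):
    """Symmetric 4-bit quantization with one scale per row (assumed scheme).
    Values map to integers in [-7, 7]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-7, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from 4-bit integers."""
    return [v * scale for v in q]


# Toy 10x8 weight matrix standing in for one layer's weights.
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]

# Step 1: remove 40% of the rows (4 of 10).
W_pruned, pruned_rows = prune_rows(W, 0.4)

# Step 2: quantize only the surviving rows to 4 bits.
kept = [row for i, row in enumerate(W_pruned) if i not in pruned_rows]
quantized = [quantize_int4(row) for row in kept]

# Worst-case reconstruction error across the surviving weights.
max_err = max(
    abs(w - w_hat)
    for row, (q, s) in zip(kept, quantized)
    for w, w_hat in zip(row, dequantize(q, s))
)
print(f"pruned rows: {sorted(pruned_rows)}")
print(f"max reconstruction error: {max_err:.4f}")
```

Because whole rows are removed, the surviving matrix stays dense and needs no sparse kernels, which is the usual reason structured pruning is preferred for speed on mobile hardware. The per-row scale bounds the rounding error of each surviving weight by half that row's scale.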