Research Archive

Journal Article Open Access Artificial Intelligence

Transformer Architecture Optimization for On-Device Inference: Knowledge Distillation, Quantization, and Pruning Strategies for Deploying Large Language Models on Edge Hardware

The deployment of transformer-based large language models on edge devices -- smartphones, embedded systems, and IoT endpoints -- requires model compression techniques that preserve task performance while meeting the memory, compute, and energy constraints of target hardware. This paper presents a systematic empirical study of three compression paradigms -- knowledge distillation, quantization (post-training and quantization-aware), and pruning (structured and unstructured) -- applied to BERT-base, DistilBERT, and GPT-2-medium, evaluated on NLP benchmarks (GLUE, SQuAD) and deployed on Qualcomm Snapdragon 888, Apple A15 Bionic, and STM32H7 microcontroller platforms. We characterize the accuracy-compression trade-off surface for each technique individually and in combination, finding that hybrid pipelines combining 4-bit quantization with structured pruning at 40% sparsity achieve a 6.2x model size reduction and a 4.1x inference speedup on Snapdragon 888 with less than 3% accuracy degradation on GLUE. On the STM32H7 microcontroller, task-specific distilled models achieve viable inference at 380 ms per token under a severe 256 KB RAM constraint. We introduce the Edge Deployment Efficiency Index (EDEI), which normalizes accuracy retention against inference latency, memory footprint, and energy consumption, and release a reproducible compression pipeline toolkit supporting all three techniques. This work provides the most comprehensive empirical guide to LLM edge deployment to date.

Chukwuemeka Aneke, Sofia Bergqvist, Keiji Matsumoto, Salma Benali · Apr 2022 · 478 citations

Journal Article Open Access Software Engineering

Chaos Engineering in Production: Systematic Fault Injection as a DevOps Reliability Practice — Evidence from Microservices Deployments at Scale

Chaos engineering, the discipline of deliberately injecting faults into production systems to uncover latent weaknesses before they cause customer-impacting failures, has matured from an experimental practice pioneered by Netflix into a mainstream reliability engineering methodology. Yet its systematic integration into DevOps workflows and its measured effects on system reliability at scale remain understudied in the academic literature. This paper presents findings from a three-year longitudinal study of chaos engineering adoption across five organizations operating microservices platforms at scale (ranging from 80 to 1,400 services). We analyze 2,847 chaos experiments conducted across these organizations, categorized by fault type, blast radius, hypothesis quality, and outcome. Our analysis shows that well-formulated chaos experiments with defined steady-state hypotheses uncovered actionable weaknesses in 67% of cases. Organizations with mature chaos programs (>50 experiments per quarter) exhibited 78% fewer severity-1 incidents per deployment than organizations without chaos programs. We introduce the Chaos Experiment Quality Score (CEQS), a composite metric for assessing experiment design rigor, and demonstrate its correlation with actionable outcome rate. We also identify the three most impactful fault categories (network partition, resource exhaustion, and dependency timeout), which together account for 71% of all discovered weaknesses.

Tunde Afolabi, Ingrid Petersen, Zhou Weiming, Catalina Iorga · Apr 2022 · 481 citations
