Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
On-device LLMs cut latency, keep user data on the device, and lower cloud bills. These benefits matter most for mobile apps, privacy-focused services, and offline products.
Summary TLDR
This 38-page review maps the state of running large language models (LLMs) on edge devices. It organizes methods (quantization, pruning, distillation, low-rank factorization, MoE, parameter sharing), software stacks (llama.cpp, MLC-LLM, vLLM), hardware trends (GPUs, NPUs, PIM/PNM), and deployment patterns (edge-only, edge-cloud sharding). The paper compiles numbers from many recent works (e.g., AWQ 3× speedups, EdgeShard up to 50% latency reduction, LLMCad 9.3× faster token generation) and flags open problems: energy, continual learning, privacy, and hardware-software co-design.
Problem Statement
Cloud-hosted LLMs incur network latency, privacy risk, and recurring cloud costs. Running LLMs on phones and edge devices promises instant replies and local data control, but it is hard because RAM, compute, energy, and thermal budgets are limited. The review asks: which model, compression, and deployment methods make on-device LLMs practical, and what open problems remain?
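To make the RAM constraint concrete, the following is a minimal back-of-the-envelope sketch, not taken from the review. It assumes weight storage is simply parameter count times bits per weight, and it ignores the KV cache, activations, and quantization metadata such as scales. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GiB (1 GiB = 2**30 bytes).

    Assumption: every parameter is stored at the same bit width; KV cache,
    activations, and per-group scales are NOT included.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare a 7B-parameter model at common precisions.
for bits in (16, 8, 4):
    print(f"7B params at {bits}-bit: {weight_memory_gib(7e9, bits):.2f} GiB")
```

Under these assumptions a 7B model needs roughly 13 GiB at 16-bit and a little over 3 GiB at 4-bit for weights alone, which is why low-bit quantization is usually the first step toward fitting a model on a phone.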
Main Contribution
Taxonomy of techniques to make LLMs run on edge: compression, efficient architectures, MoE, and collaborative deployment
Survey of software frameworks and hardware options for on-device inference and tiny training
A curated list of deployed on-device models and manufacturer case studies (Gemini Nano, Octopus, OpenELM, Phi-3-mini, MiniCPM)
Compilation of numerical trade-offs (latency, memory, energy) and pointers to future research directions
Key Findings
Edge AI market projected to grow nearly tenfold to $143.6B by 2032.
Post-training activation-aware weight quantization (AWQ) protects a small fraction of salient weights, identified from activation statistics, and enables large speedups on mobile GPUs.
Collaborative sharding across edge and cloud can sharply cut latency and raise throughput.
Hierarchical generate-then-verify pipelines combine a small local model with a larger verifier to speed token generation.
Memory-centric hardware (PIM/PNM) can cut energy and raise throughput for on-device inference.
Sparse MoE and expert-management designs reduce active compute per token dramatically.
Results
Edge AI market projection
AWQ speedup on mobile GPUs
EdgeShard latency reduction
LLMCad token generation speed
PIM/PNM performance & energy
JetMoE inference compute reduction
Who Should Care
What To Try In 7 Days
Run a small on-device proof-of-concept using llama.cpp or MLC-LLM with a 1–7B model quantized post-training (e.g., with AWQ).
Measure TTFT and energy-per-token on target phones; compare to cloud API baseline.
Prototype a hybrid flow: local fast generator + cloud verifier to balance latency and quality.
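The measurement step above can be sketched as a small harness. This is an illustrative sketch only: it assumes the runtime under test exposes generated tokens as a Python iterable, and that average power draw is measured separately (for example with an external power meter) and passed in by hand. The names `measure_stream` and `fake_stream` are hypothetical and do not belong to llama.cpp or MLC-LLM.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional


def measure_stream(
    token_stream: Iterable[str],
    avg_power_watts: Optional[float] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Consume a token stream and report TTFT, decode speed, and energy per token."""
    start = clock()
    first_token_time = None
    n_tokens = 0
    for _ in token_stream:
        now = clock()
        if first_token_time is None:
            first_token_time = now  # arrival of the first token
        n_tokens += 1
    end = clock()
    if n_tokens == 0:
        raise ValueError("token stream produced no tokens")

    decode_time = end - first_token_time
    total_time = end - start
    result = {
        "tokens": n_tokens,
        "ttft_s": first_token_time - start,
        # Decode speed excludes the first token, whose cost is in TTFT.
        "decode_tokens_per_s": (n_tokens - 1) / decode_time
        if n_tokens > 1 and decode_time > 0 else float("nan"),
        "total_s": total_time,
    }
    if avg_power_watts is not None:
        # Energy (J) = average power (W) * time (s), spread over all tokens.
        result["energy_per_token_j"] = avg_power_watts * total_time / n_tokens
    return result


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Stand-in for a real runtime: sleeps to imitate prefill and decode."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


print(measure_stream(fake_stream(5, prefill_s=0.05, per_token_s=0.01), avg_power_watts=3.0))
```

Running the same harness against a cloud API's streaming response gives a like-for-like baseline, since both paths are then measured from request start to token arrival on the client.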
Agent Features
Memory
- KV cache compression and chunk-wise swap
- Processing-in-Memory (PIM) and Processing-near-Memory (PNM)
Frameworks
- llama.cpp
- MLC-LLM
- vLLM
- OpenLLM
- ExecuTorch
- MNN
- PowerInfer
Architectures
- decoder-only transformer
- MoE
- modular / adapter-based multimodal modules
- parameter-sharing (deep-and-thin) architectures
Collaboration
- edge-cloud model sharding
- hierarchical generator-then-verifier pipelines
- distributed expert execution across devices
Optimization Features
Token Efficiency
- speculative generation (LLMCad)
- token tree generation and verification
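The generate-then-verify idea behind these items can be illustrated with a toy greedy variant. This is a simplified sketch and not LLMCad's actual algorithm: it drafts a single linear run of tokens rather than a token tree, it models each "model" as a hypothetical function from a token sequence to the next token, and it calls the verifier once per position where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # hypothetical model interface


def speculative_greedy(
    prefix: Sequence[int],
    draft: NextToken,
    target: NextToken,
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Generate tokens with a cheap draft model, verified by a target model.

    The output matches plain greedy decoding with `target`, because every
    emitted token is either a draft token the target agrees with or the
    target's own correction.
    """
    out = list(prefix)
    limit = len(prefix) + max_new_tokens
    while len(out) < limit:
        # 1) Draft proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2) Target verifies the proposal position by position.
        for i, token in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != token:
                # Keep the agreed prefix, then the target's correction.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out[len(prefix):]


# Toy models: the target counts up modulo 10; the draft errs after a 4.
def target_model(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


def draft_model(seq: Sequence[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(speculative_greedy([0], draft_model, target_model, max_new_tokens=8))
```

The speedup in practice comes from the verifier checking several drafted tokens per forward pass; the more often the draft agrees with the target, the fewer expensive passes are needed per generated token.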
Infra Optimization
- PIM/PNM near-memory compute
- NPU / TPU acceleration
- FPGA for low-power inference
Model Optimization
- MoE
- parameter sharing and deep-and-thin designs
- low-rank compensation (LoRC)
System Optimization
- edge-cloud sharding and dynamic placement
- memory-aware expert preloading
- any-precision serving engines
Training Optimization
- quantization-aware training (QAT)
- sparse-update / contribution analysis
- adapter-based knowledge distillation
Inference Optimization
- post-training quantization (GPTQ / AWQ)
- generate-then-verify speculative decoding
- KV cache compression and swapping
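As a baseline for what post-training quantization does to weights, here is a minimal sketch of symmetric round-to-nearest quantization with one scale per group. It is deliberately plain round-to-nearest, not GPTQ or AWQ, which add error compensation and activation-aware scaling on top of this basic step. The function names and the tiny group size are assumptions chosen for illustration.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, scale)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize(weights: List[float], group_size: int = 4, bits: int = 4) -> List[Group]:
    """Split weights into fixed-size groups, each with its own scale."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights from integer codes and scales."""
    return [code * scale for codes, scale in groups for code in codes]


weights = [0.12, -0.5, 0.33, 0.07, 2.0, -1.5, 0.01, 0.9]
packed = quantize(weights, group_size=4, bits=4)
restored = dequantize(packed)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {worst:.4f}")
```

Per-group scales keep one large weight from inflating the rounding error of every other weight in the tensor. Real implementations typically use much larger groups than the four-element groups used here.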
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Many reported gains depend on specific hardware and are not universally reproducible.
- Quantization and pruning introduce accuracy trade-offs that vary by model and task.
- Energy and thermal effects limit long interactive sessions on phones.
- Collaborative sharding adds network complexity and privacy risks.
When Not To Use
- If device memory or compute is extremely small (e.g., microcontrollers), prefer server inference.
- When the application needs strictly up-to-date model knowledge that only continually updated cloud models can provide.
Failure Modes
- Accuracy drop after aggressive quantization or pruning on certain tasks.
- Battery drain and device thermal throttling during sustained inference.
- Communication bottlenecks and overhead in edge-cloud sharding.
- Privacy leakage in distributed or collaborative training setups.
Core Entities
Models
- LLaMA
- GPT (GPT-3/4)
- Gemini Nano
- OpenELM
- Phi-3-mini
- Gemma2-9B
- Octopus (Nexa AI)
- MiniCPM-Llama3-V 2.5
- JetMoE
- EdgeMoE
- LLMCad
- MobileLLM
- Qwen2-0.5B
Metrics
- TTFT (Time-to-First-Token)
- tokens/sec
- latency reduction (%)
- throughput (×)
- energy per token (J)
- Accuracy
Datasets
- MMLU
- MT-Bench
- OpenCompass
- OCRBench
- TextVQA
- DocVQA
Benchmarks
- MELT (mobile evaluation)
- MT-Bench
- OpenCompass
- OCRBench
Context Entities
Models
- Llama2
- Mixtral
- Gemini Pro
- GPT-4
- Claude 3
- Qwen-VL
Metrics
- battery life impact (hours)
- memory footprint (RAM/VRAM)
- energy reduction (%)
Datasets
- Dolma / Dolma-scale corpora
- DataComp-LM (training corpora references)
Benchmarks
- MT-Bench
- MMLU

