Survey of techniques, hardware, and trade-offs to run LLMs directly on phones and edge devices

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.8

Citation Count

Authors

Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling

Why It Matters For Business

On-device LLMs cut latency, protect user data, and lower cloud bills—key benefits for mobile apps, privacy-focused services, and offline products.

Summary TLDR

This 38-page review maps the state of running large language models (LLMs) on edge devices. It organizes methods (quantization, pruning, distillation, low-rank factorization, MoE, parameter sharing), software stacks (llama.cpp, MLC-LLM, vLLM), hardware trends (GPUs, NPUs, PIM/PNM), and deployment patterns (edge-only, edge-cloud sharding). The paper compiles numbers from many recent works (e.g., AWQ up to 3× speedup, EdgeShard up to 50% lower latency, LLMCad up to 9.3× faster token generation) and flags open problems: energy, continual learning, privacy, and hardware-software co-design.

Problem Statement

Cloud-hosted LLMs add network latency, expose user data to privacy risk, and incur recurring cloud costs. Running LLMs on phones and edge devices promises instant replies and local data control, but it is hard because RAM, compute, energy, and thermal budgets are all limited. The review asks: which model, compression, and deployment methods make on-device LLMs practical, and what open problems remain?
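To make the RAM constraint concrete, the sketch below estimates the memory needed for model weights alone at a few bit widths. It is a back-of-the-envelope calculation, not a figure from the paper: it ignores the KV cache, activations, and quantization metadata such as scales, and the function name is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Assumption: every weight is stored at the same bit width and no
    extra metadata (scales, zero points) is counted.
    """
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    # Parameter counts in the 1B-7B range, at FP16, INT8, and 4-bit.
    for params in (1e9, 3e9, 7e9):
        row = ", ".join(
            f"{bits}-bit: {weight_memory_gib(params, bits):5.2f} GiB"
            for bits in (16, 8, 4)
        )
        print(f"{params / 1e9:.0f}B params -> {row}")
```

Under these assumptions a 7B-parameter model needs about 13 GiB at 16 bits per weight and roughly a quarter of that at 4 bits, which is why quantization comes first in most on-device recipes.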

Main Contribution

Taxonomy of techniques for running LLMs on edge devices: compression, efficient architectures, MoE, and collaborative deployment

Survey of software frameworks and hardware options for on-device inference and tiny training

A curated list of deployed on-device models and manufacturer case studies (Gemini Nano, Octopus, OpenELM, Phi-3-mini, MiniCPM)

Compilation of numerical trade-offs (latency, memory, energy) and pointers to future research directions

Key Findings

Edge AI market projected to grow nearly tenfold to $143.6B by 2032.

Numbers: 2022 $15.2B → 2032 $143.6B; CAGR 25.9%

Post-training activation-aware quantization (AWQ) preserves a tiny fraction of weights and enables large speedups on mobile GPUs.

Numbers: protects 0.1%–1% of weights; up to 3× speedup vs FP16
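The idea of protecting the few weights that see the largest activations can be illustrated with a toy example. The sketch below is a simplification in the spirit of activation-aware quantization, not the AWQ implementation: it uses a fixed scale factor instead of a searched one, works on a single weight row, and all function names and parameters are hypothetical.

```python
def quantize_row(weights, n_bits=4):
    """Symmetric round-to-nearest quantization of one row, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero step
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def salient_channels(act_magnitude, n_salient):
    """Indices of the input channels with the largest activation magnitude."""
    order = sorted(range(len(act_magnitude)),
                   key=lambda i: act_magnitude[i], reverse=True)
    return set(order[:n_salient])


def awq_style_quantize(weights, act_magnitude, n_salient=1, scale=2.0, n_bits=4):
    """Scale salient channels up before quantization, then divide the scale out.

    Assumption: `scale` is a fixed illustrative value; a real method would
    choose it per channel from calibration data.
    """
    salient = salient_channels(act_magnitude, n_salient)
    scales = [scale if i in salient else 1.0 for i in range(len(weights))]
    scaled = [w * s for w, s in zip(weights, scales)]
    return [q / s for q, s in zip(quantize_row(scaled, n_bits), scales)]


def weighted_error(weights, approx, act_magnitude):
    """Quantization error per channel, weighted by how active that channel is."""
    return sum(abs(w - q) * a for w, q, a in zip(weights, approx, act_magnitude))


if __name__ == "__main__":
    w = [0.9, -0.35, 0.12, 0.55]   # one weight row (made-up values)
    a = [0.1, 8.0, 0.2, 0.1]       # channel 1 carries most of the activation
    print("plain 4-bit error:    ", weighted_error(w, quantize_row(w), a))
    print("salient-scaled error: ", weighted_error(w, awq_style_quantize(w, a), a))
```

In this toy row the salient-scaled variant yields a lower activation-weighted error, because the quantization step stays the same while the protected weight's rounding error is divided by the scale.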

Collaborative sharding across edge and cloud can sharply cut latency and raise throughput.

Numbers: up to 50% latency reduction; up to 2× throughput
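A minimal way to reason about edge-cloud sharding is to pick the layer at which execution moves from the device to the server. The sketch below does that by exhaustive search for a single request. It is not the EdgeShard method; the timing inputs and the function name are hypothetical, and the latency values are made up for illustration.

```python
def best_split(edge_ms, cloud_ms, uplink_ms):
    """Choose how many leading layers to run on the edge device.

    edge_ms[i], cloud_ms[i]: time to run layer i on each side.
    uplink_ms[s]: time to send data to the cloud after s layers ran locally
                  (uplink_ms[0] is the raw input; the last entry is the cost
                  when everything ran on-device).
    Returns (split_point, total_latency_ms).
    """
    n = len(edge_ms)
    best = None
    for s in range(n + 1):
        total = sum(edge_ms[:s]) + uplink_ms[s] + sum(cloud_ms[s:])
        if best is None or total < best[1]:
            best = (s, total)
    return best


if __name__ == "__main__":
    edge = [4, 6, 6, 6]          # ms per layer on the phone (made-up)
    cloud = [1, 1, 1, 1]         # ms per layer on the server (made-up)
    uplink = [30, 5, 10, 10, 0]  # ms to ship data at each candidate split
    print(best_split(edge, cloud, uplink))
```

With these example numbers the cheapest plan runs one layer locally and the rest in the cloud, because the first layer shrinks the data that has to cross the network.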

Hierarchical generate-then-verify pipelines combine a small local model with a larger verifier to speed token generation.

Numbers: LLMCad reports up to 9.3× token generation speedup
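The generate-then-verify pattern can be sketched with two stand-in models: a cheap drafter proposes a few tokens, and a larger verifier keeps the longest prefix it agrees with and corrects the first mismatch. This is a generic greedy speculative-decoding toy, not LLMCad's pipeline; both models are hypothetical callables that map a token list to the next token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]


def generate_then_verify(prompt: List[str], draft: NextToken, verifier: NextToken,
                         max_new_tokens: int, draft_len: int = 4) -> Tuple[List[str], int]:
    """Return (new_tokens, verification_rounds).

    The output always equals what the verifier alone would generate greedily.
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < target_len:
        # 1. The small model drafts a short run of tokens.
        proposal: List[str] = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The large model checks the run. A real system would score all
        #    drafted positions in one batched forward pass; this toy calls the
        #    verifier per position and counts the whole check as one round.
        rounds += 1
        accepted: List[str] = []
        for guess in proposal:
            expected = verifier(tokens + accepted)
            accepted.append(expected)  # keep the verifier's token either way
            if expected != guess:
                break                  # discard the rest of the draft
        tokens.extend(accepted)
    return tokens[len(prompt):], rounds


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()

    def big_model(ctx: List[str]) -> str:      # stand-in for the verifier
        return sentence[len(ctx) % len(sentence)]

    def small_model(ctx: List[str]) -> str:    # stand-in drafter, wrong on one word
        token = big_model(ctx)
        return "cat" if token == "fox" else token

    print(generate_then_verify(["the"], small_model, big_model, max_new_tokens=8))
```

The speedup in practice depends on how often the drafter agrees with the verifier: each accepted draft token saves one sequential step of the large model.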

Memory-centric hardware (PIM/PNM) can cut energy and raise throughput for on-device inference.

Numbers: up to 4.5× performance improvement; 71% energy reduction

Sparse MoE and expert-management designs reduce active compute per token dramatically.

Numbers: JetMoE: 8B params, 2B active per token, ~70% less inference compute than Llama2-7B
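Sparse MoE saves compute because a router sends each token to only a few experts. With the quoted figures, about a quarter of JetMoE's parameters (2B of 8B) are active per token. The sketch below shows generic top-k routing on plain Python lists; the expert count, the value of k, and the function names are illustrative assumptions, not JetMoE's configuration.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # experts that were not selected are never called
        out = [o + weight * v for o, v in zip(out, y)]
    return out


if __name__ == "__main__":
    # Four toy experts; expert i multiplies its input by (i + 1).
    experts = [lambda v, m=i + 1: [m * e for e in v] for i in range(4)]
    logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token
    print(moe_forward([1.0, 2.0], experts, logits, k=2))
```

Only k of the experts run for each token, so per-token compute scales with the active parameters instead of the full parameter count.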

Results

Edge AI market projection

Value: $143.6B by 2032

Baseline: $15.2B in 2022

AWQ speedup on mobile GPUs

Value: up to 3×

Baseline: FP16 implementation

EdgeShard latency reduction

Value: up to 50% lower latency

Baseline: cloud-only deployment

LLMCad token generation speed

Value: up to 9.3× faster

Baseline: single-model token generation

PIM/PNM performance & energy

Value: up to 4.5× perf; 71% energy cut

Baseline: traditional memory architectures

JetMoE inference compute reduction

Value: ≈70% less compute vs Llama2-7B

Baseline: Llama2-7B inference cost

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist, Founder

What To Try In 7 Days

Run a small on-device proof of concept using llama.cpp or MLC-LLM with a 1B–7B model and post-training quantization (for example, AWQ).

Measure time-to-first-token (TTFT) and energy per token on target phones; compare against a cloud API baseline.

Prototype a hybrid flow: local fast generator + cloud verifier to balance latency and quality.
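For the measurement step, a small harness around any streaming generator is enough to get TTFT and decode throughput. The sketch below assumes a hypothetical `generate(prompt)` callable that yields tokens one at a time; `fake_generate` is a stand-in to be replaced with a real on-device or cloud client. Energy per token needs platform-specific power readings and is not covered here.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(generate: Callable[[str], Iterable[str]], prompt: str) -> Dict[str, float]:
    """Time a streaming generation call.

    Returns TTFT, token count, decode throughput (tokens after the first,
    per second), and total wall-clock time.
    """
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in generate(prompt):
        if first is None:
            first = time.perf_counter()  # moment the first token arrived
        n_tokens += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("generator produced no tokens")
    decode_time = end - first
    return {
        "ttft_s": first - start,
        "tokens": n_tokens,
        "decode_tokens_per_s": (n_tokens - 1) / decode_time
        if n_tokens > 1 and decode_time > 0 else 0.0,
        "total_s": end - start,
    }


def fake_generate(prompt: str) -> Iterable[str]:
    """Stand-in model: a slow first token (prefill), then faster decode steps."""
    time.sleep(0.02)
    for i in range(5):
        if i:
            time.sleep(0.002)
        yield f"tok{i}"


if __name__ == "__main__":
    print(measure_stream(fake_generate, "Hello"))
```

Running the same harness against the local model and the cloud API on the same prompts gives a like-for-like baseline; repeating runs after the device has warmed up helps expose thermal throttling.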

Agent Features

Memory

KV cache compression and chunk-wise swap
Processing-in-Memory (PIM) and Processing-near-Memory (PNM)

Frameworks

llama.cpp
MLC-LLM
vLLM
OpenLLM
ExecuTorch
MNN
PowerInfer

Architectures

decoder-only transformer
MoE
modular / adapter-based multimodal modules
parameter-sharing (deep-and-thin) architectures

Collaboration

edge-cloud model sharding
hierarchical generator-then-verifier pipelines
distributed expert execution across devices

Optimization Features

Token Efficiency

speculative generation (LLMCad)
token tree generation and verification

Infra Optimization

PIM/PNM near-memory compute
NPU / TPU acceleration
FPGA for low-power inference

Model Optimization

MoE
parameter sharing and deep-and-thin designs
low-rank compensation (LoRC)

System Optimization

edge-cloud sharding and dynamic placement
memory-aware expert preloading
any-precision serving engines

Training Optimization

quantization-aware training (QAT)
sparse-update / contribution analysis
adapter-based knowledge distillation

Inference Optimization

post-training quantization (GPTQ / AWQ)
generate-then-verify speculative decoding
KV cache compression and swapping

Reproducibility

Code Urls

https://github.com/NexaAI/Awesome-LLMs-on-device

Code Available

Open Source Status

partial

Risks & Boundaries

Limitations

Many reported gains depend on specific hardware and are not universally reproducible.
Quantization and pruning introduce accuracy trade-offs that vary by model and task.
Energy and thermal effects limit long interactive sessions on phones.
Collaborative sharding adds network complexity and privacy risks.

When Not To Use

If device memory or compute is extremely limited (e.g., microcontrollers), prefer server inference.
When the application requires strictly current, continually updated knowledge that cloud-hosted models are better placed to provide.

Failure Modes

Accuracy drop after aggressive quantization or pruning on certain tasks.
Battery drain and device thermal throttling during sustained inference.
Communication bottlenecks and overhead in edge-cloud sharding.
Privacy leakage in distributed or collaborative training setups.

Core Entities

Models

LLaMA
GPT (GPT-3/4)
Gemini Nano
OpenELM
Phi-3-mini
Gemma2-9B
Octopus (Nexa AI)
MiniCPM-Llama3-V 2.5
JetMoE
EdgeMoE
LLMCad
MobileLLM
Qwen2-0.5B

Metrics

TTFT (Time-to-First-Token)
tokens/sec
latency reduction (%)
throughput (×)
energy per token (J)
Accuracy

Datasets

MMLU
MT-Bench
OpenCompass
OCRBench
TextVQA
DocVQA

Benchmarks

MELT (mobile evaluation)
MT-Bench
OpenCompass
OCRBench

Context Entities

Models

Llama2
Mixtral
Gemini Pro
GPT-4
Claude 3
Qwen-VL

Metrics

battery life impact (hours)
memory footprint (RAM/VRAM)
energy reduction (%)

Datasets

Dolma / Dolma-scale corpora
DataComp-LM (training corpora references)

Benchmarks

MT-Bench
MMLU
