Survey of techniques, hardware, and trade-offs to run LLMs directly on phones and edge devices

August 26, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.8

Citation Count

2

Authors

Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling

Why It Matters For Business

On-device LLMs cut latency, protect user data, and lower cloud bills—key benefits for mobile apps, privacy-focused services, and offline products.

Summary TLDR

This 38-page review maps the state of running large language models (LLMs) on edge devices. It organizes methods (quantization, pruning, distillation, low-rank factorization, mixture of experts (MoE), parameter sharing), software stacks (llama.cpp, MLC-LLM, vLLM), hardware trends (GPUs, NPUs, PIM/PNM), and deployment patterns (edge-only, edge-cloud sharding). The paper compiles figures from many recent works (e.g., AWQ: up to 3× speedup; EdgeShard: up to 50% lower latency; LLMCad: up to 9.3× faster token generation) and flags open problems: energy, continual learning, privacy, and hardware-software co-design.

Problem Statement

Cloud-hosted LLMs add network latency, expose user data to privacy risk, and incur recurring cloud costs. Running LLMs on phones and edge devices promises low-latency replies and local control of data, but it is hard because RAM, compute, energy, and thermal budgets are all limited. The review asks: which model, compression, and deployment methods make on-device LLMs practical, and what open problems remain?

Main Contribution

Taxonomy of techniques to make LLMs run on edge: compression, efficient architectures, MoE, and collaborative deployment

Survey of software frameworks and hardware options for on-device inference and tiny training

A curated list of deployed on-device models and manufacturer case studies (Gemini Nano, Octopus, OpenELM, Phi-3-mini, MiniCPM)

Compilation of numerical trade-offs (latency, memory, energy) and pointers to future research directions

Key Findings

Edge AI market projected to grow nearly tenfold to $143.6B by 2032.

Numbers: 2022 $15.2B → 2032 $143.6B; CAGR 25.9%

Activation-aware weight quantization (AWQ), a post-training method, protects a small fraction of salient weights and enables large speedups on mobile GPUs.

Numbers: protects 0.1%–1% of weights; up to 3× speedup vs FP16
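
A minimal sketch of the idea behind activation-aware protection, assuming plain round-to-nearest 4-bit quantization per weight row and a fixed scale factor `s`. The function names, the factor `s`, and the toy data are illustrative assumptions; AWQ itself derives per-channel scales from calibration data, which is not reproduced here.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one weight row (returns dequantized values)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in row) / qmax or 1.0  # fall back to 1.0 for an all-zero row
    return [round(w / scale) * scale for w in row]


def salient_channels(activations, fraction=0.01):
    """Indices of the input channels with the largest mean |activation|."""
    n_in = len(activations[0])
    mean_abs = [sum(abs(x[j]) for x in activations) / len(activations) for j in range(n_in)]
    k = max(1, round(n_in * fraction))
    return sorted(range(n_in), key=lambda j: mean_abs[j], reverse=True)[:k]


def awq_like_quantize(weights, activations, fraction=0.01, s=2.0, bits=4):
    """Scale salient columns up by s before quantization and back down afterwards.

    This shrinks the rounding error on those columns as long as the scaled
    values do not become the new row maximum. s is a hypothetical fixed factor.
    """
    salient = set(salient_channels(activations, fraction))
    out = []
    for row in weights:
        scaled = [w * s if j in salient else w for j, w in enumerate(row)]
        deq = quantize_row(scaled, bits)
        out.append([v / s if j in salient else v for j, v in enumerate(deq)])
    return out


# Toy data: channel 1 carries much larger activations than the others.
weights = [[0.5, 0.16, -0.7, 0.3]]
activations = [[0.1, 8.0, 0.2, 0.1], [0.2, 9.0, 0.1, 0.3]]
plain = [quantize_row(r) for r in weights]
protected = awq_like_quantize(weights, activations, fraction=0.25)
```

In this toy case the weight on the high-activation channel is reconstructed more accurately in `protected` than in `plain`, which is the effect the protection step is after.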

Collaborative sharding across edge and cloud can sharply cut latency and raise throughput.

Numbers: up to 50% latency reduction; up to 2× throughput
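
To make the trade-off concrete, here is a simplified sketch of picking a single split point between one edge device and one cloud server. It is not the partitioning algorithm of EdgeShard or any other surveyed system; `best_split`, the per-layer timings, and the transfer costs are hypothetical.

```python
def best_split(edge_ms, cloud_ms, transfer_ms):
    """Return (k, latency_ms): layers [0, k) run on the edge, layers [k, n) in the cloud.

    edge_ms[i], cloud_ms[i]: latency of layer i on each side.
    transfer_ms[k]: cost of sending the tensor that crosses boundary k (k < n).
    k == n means everything stays on the device, so nothing is transferred.
    """
    n = len(edge_ms)
    best = None
    for k in range(n + 1):
        latency = sum(edge_ms[:k])
        if k < n:
            latency += transfer_ms[k] + sum(cloud_ms[k:])
        if best is None or latency < best[1]:
            best = (k, latency)
    return best


# Hypothetical 4-layer model: the activation after layer 2 is cheap to send.
split, latency = best_split(
    edge_ms=[10, 10, 10, 10],
    cloud_ms=[2, 2, 2, 2],
    transfer_ms=[30, 25, 8, 25],
)
```

The exhaustive scan is fine for a single split; multi-device placement has a much larger search space and needs a smarter search.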

Hierarchical generate-then-verify pipelines combine a small local model with a larger verifier to speed token generation.

Numbers: LLMCad reports up to 9.3× token generation speedup
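
The generate-then-verify pattern can be sketched as greedy speculative decoding with two toy "models". This is a generic illustration, not LLMCad's algorithm; `draft_next` and `target_next` are hypothetical callables that map a token list to the next token.

```python
def speculative_generate(draft_next, target_next, prompt, max_new, k=4):
    """Greedy generate-then-verify loop; returns (tokens, verification_rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new:
        # 1) The small model drafts k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2) The large model checks the draft. A real system would do this in
        #    one batched forward pass; here it is emulated position by position.
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every drafted token matched, so the verifier adds one more.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], rounds


# Toy models over digits: the target counts upward, the draft errs after a 2.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


output, rounds = speculative_generate(draft_next, target_next, prompt=[0], max_new=8)
```

With greedy verification the output matches what the large model would have produced alone; the saving comes from needing fewer large-model rounds than tokens.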

Memory-centric hardware (PIM/PNM) can cut energy and raise throughput for on-device inference.

Numbers: up to 4.5× performance improvement; 71% energy reduction

Sparse MoE and expert-management designs reduce active compute per token dramatically.

Numbers: JetMoE: 8B params, 2B active per token, ~70% less inference compute than Llama2-7B
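
A small sketch of why sparse MoE lowers per-token compute: a router scores all experts, but only the top k run. The function names and toy experts are illustrative assumptions and do not reflect JetMoE's actual router.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts; return {expert_index: softmax weight}."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts only; the others are never called."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


def active_fraction(total_params, active_params):
    """Share of parameters touched per token."""
    return active_params / total_params


# Toy experts standing in for feed-forward blocks.
experts = [lambda x: x * 10, lambda x: x + 1, lambda x: x * 2, lambda x: -x]
y = moe_forward(3.0, experts, router_logits=[0.0, 2.0, 2.0, -1.0], k=2)
share = active_fraction(8e9, 2e9)  # figures quoted above: 8B total, 2B active
```

As a rough cross-check, 2B active parameters against a 7B dense model is about 1 − 2/7 ≈ 71% fewer parameters touched per token, which is in line with the ~70% figure, although compute does not map exactly onto parameter counts.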

Results

Edge AI market projection

Value: $143.6B by 2032

Baseline: $15.2B in 2022

AWQ speedup on mobile GPUs

Value: up to 3×

Baseline: FP16 implementation

EdgeShard latency reduction

Value: up to 50% lower latency

Baseline: cloud-only deployment

LLMCad token generation speed

Value: up to 9.3× faster

Baseline: single-model token generation

PIM/PNM performance & energy

Value: up to 4.5× perf; 71% energy cut

Baseline: traditional memory architectures

JetMoE inference compute reduction

Value: ≈70% less compute vs Llama2-7B

Baseline: Llama2-7B inference cost

Who Should Care

What To Try In 7 Days

Run a small on-device proof-of-concept using llama.cpp or MLC-LLM with a 1–7B model and AWQ/PTQ.
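
Before picking a model for the proof-of-concept, a back-of-the-envelope estimate of weight memory helps. The sketch below counts weights only; it ignores quantization metadata (scales, zero-points), the KV cache, and activations. The `fits_in_ram` helper and its `headroom_gb` default are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_ram(n_params, bits_per_weight, ram_gb, headroom_gb=2.0):
    """Crude check: weights plus a fixed headroom for the OS, KV cache and activations."""
    return weight_memory_gb(n_params, bits_per_weight) + headroom_gb <= ram_gb


fp16_7b = weight_memory_gb(7e9, 16)   # 14.0 GB
int4_7b = weight_memory_gb(7e9, 4)    # 3.5 GB
int4_1b = weight_memory_gb(1e9, 4)    # 0.5 GB
```

On a phone with 8 GB of RAM, this estimate suggests a 7B model is only plausible at around 4 bits per weight, while FP16 is out of reach.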

Measure time-to-first-token (TTFT) and energy per token on target phones; compare against a cloud API baseline.
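
One possible way to take these measurements, assuming the runtime exposes generated tokens as a Python iterator. `measure_stream` and `energy_per_token_j` are hypothetical helpers; average power has to come from a platform-specific tool and is just a parameter here.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterator and report TTFT and decode throughput."""
    start = clock()
    first = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first is None:
            first = now          # arrival time of the first token
        count += 1
    end = clock()
    if count == 0:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Throughput excludes the first token, whose cost is dominated by prefill.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else float("nan")
    return {"ttft_s": first - start, "tokens": count, "decode_tokens_per_s": tps}


def energy_per_token_j(avg_power_w, duration_s, tokens):
    """Joules per token from externally measured average power (W) and duration (s)."""
    return avg_power_w * duration_s / tokens
```

Usage would look like `measure_stream(my_runtime.generate(prompt))`, where `my_runtime.generate` stands in for whatever streaming call the chosen framework provides. Running the same harness against a cloud API stream gives the baseline for comparison.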

Prototype a hybrid flow: local fast generator + cloud verifier to balance latency and quality.

Agent Features

Memory

  • KV cache compression and chunk-wise swap
  • Processing-in-Memory (PIM) and Processing-near-Memory (PNM)
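
To see why KV cache compression matters on a phone, the cache size can be estimated from the model shape. The sketch uses the standard formula (keys and values, per layer, per KV head, per position); the example shape is the publicly documented Llama-2-7B configuration and is an illustrative choice, not a figure from the review.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits=16, batch=1):
    """Bytes needed to cache keys and values for seq_len positions."""
    # The leading 2 accounts for storing both K and V.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bits // 8


GIB = 2 ** 30

# Llama-2-7B-like shape: 32 layers, 32 KV heads, head dimension 128.
fp16_cache = kv_cache_bytes(32, 32, 128, seq_len=4096, bits=16)
int4_cache = kv_cache_bytes(32, 32, 128, seq_len=4096, bits=4)

print(f"FP16 KV cache at 4096 tokens: {fp16_cache / GIB:.1f} GiB")
print(f"4-bit KV cache at 4096 tokens: {int4_cache / GIB:.1f} GiB")
```

At long contexts the cache can rival the quantized weights in size, which is the motivation for compressing it or swapping chunks out of RAM.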

Frameworks

  • llama.cpp
  • MLC-LLM
  • vLLM
  • OpenLLM
  • ExecuTorch
  • MNN
  • PowerInfer

Architectures

  • decoder-only transformer
  • MoE
  • modular / adapter-based multimodal modules
  • parameter-sharing (deep-and-thin) architectures

Collaboration

  • edge-cloud model sharding
  • hierarchical generator-then-verifier pipelines
  • distributed expert execution across devices

Optimization Features

Token Efficiency

  • speculative generation (LLMCad)
  • token tree generation and verification

Infra Optimization

  • PIM/PNM near-memory compute
  • NPU / TPU acceleration
  • FPGA for low-power inference

Model Optimization

  • MoE
  • parameter sharing and deep-and-thin designs
  • low-rank compensation (LoRC)

System Optimization

  • edge-cloud sharding and dynamic placement
  • memory-aware expert preloading
  • any-precision serving engines

Training Optimization

  • quantization-aware training (QAT)
  • sparse-update / contribution analysis
  • adapter-based knowledge distillation

Inference Optimization

  • post-training quantization (GPTQ / AWQ)
  • generate-then-verify speculative decoding
  • KV cache compression and swapping

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Many reported gains depend on specific hardware and are not universally reproducible.
  • Quantization and pruning introduce accuracy trade-offs that vary by model and task.
  • Energy and thermal effects limit long interactive sessions on phones.
  • Collaborative sharding adds network complexity and privacy risks.

When Not To Use

  • If device memory or compute is extremely small (e.g., microcontrollers), prefer server inference.
  • When the application requires continually updated model knowledge, which cloud-hosted models can supply more readily.

Failure Modes

  • Accuracy drop after aggressive quantization or pruning on certain tasks.
  • Battery drain and device thermal throttling during sustained inference.
  • Communication bottlenecks and overhead in edge-cloud sharding.
  • Privacy leakage in distributed or collaborative training setups.

Core Entities

Models

  • LLaMA
  • GPT (GPT-3/4)
  • Gemini Nano
  • OpenELM
  • Phi-3-mini
  • Gemma2-9B
  • Octopus (Nexa AI)
  • MiniCPM-Llama3-V 2.5
  • JetMoE
  • EdgeMoE
  • LLMCad
  • MobileLLM
  • Qwen2-0.5B

Metrics

  • TTFT (Time-to-First-Token)
  • tokens/sec
  • latency reduction (%)
  • throughput (×)
  • energy per token (J)
  • Accuracy

Datasets

  • MMLU
  • MT-Bench
  • OpenCompass
  • OCRBench
  • TextVQA
  • DocVQA

Benchmarks

  • MELT (mobile evaluation)
  • MT-Bench
  • OpenCompass
  • OCRBench

Context Entities

Models

  • Llama2
  • Mixtral
  • Gemini Pro
  • GPT-4
  • Claude 3
  • Qwen-VL

Metrics

  • battery life impact (hours)
  • memory footprint (RAM/VRAM)
  • energy reduction (%)

Datasets

  • Dolma / Dolma-scale corpora
  • DataComp-LM (training corpora references)

Benchmarks

  • MT-Bench
  • MMLU