Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
Edge deployment cuts latency, reduces data exposure, and lowers bandwidth costs; cross-layer co-design (model+runtime+hardware) is required to preserve multi-step reasoning while meeting product SLAs.
Summary TLDR
This survey organizes techniques and system patterns for moving reasoning-capable language models and agent systems from the cloud to edge devices. It groups optimizations across the data, model, and runtime layers (quantization, pruning, distillation, Small Language Models, model partitioning, KV-cache tricks, speculative decoding, and mixed precision), maps hardware-aware deployment blueprints, and proposes standardized evaluation metrics (latency, J/token, robustness, privacy). Key practical messages: stack cross-layer optimizations, prefer edge-native small models over naive compression for reasoning tasks, and measure energy and thermal behavior alongside accuracy.
Problem Statement
Large reasoning-capable LLMs and agent systems need far more memory, bandwidth, and power than typical edge devices provide. The field lacks a unified cross-layer playbook to preserve multi-step reasoning while meeting strict latency, energy, and privacy constraints on devices.
Main Contribution
A unified, cross-layer framework for "cognitive edge computing" that links data, model, system, and evaluation layers
A practical taxonomy of edge-ready techniques: quantization, pruning, distillation, low-rank adapters, SLMs, MoE, speculative decoding, and partitioning
Reference deployment blueprints (smartphone, wearable, Jetson-class, MEC) and hardware-aware runtime patterns (paged KV, continuous batching)
A recommended standardized evaluation protocol including J/token, p50/p90 latency, context handling, and energy reporting
A gap analysis pointing to missing standardized reasoning benchmarks, energy reporting norms, and edge agent testbeds
Key Findings
Low-bit quantization yields large memory compression but needs careful validation for reasoning tasks.
Edge-native multimodal encoder designs can substantially cut first-token latency.
Collaborative execution and hybrid offloading reduce latency by splitting the workload across edge devices and the cloud.
Speculative sampling plus frequency prioritization yields modest throughput gains with low quality risk.
Stacking many optimizations exhibits diminishing returns without co-design.
Results
Model compression via quantization
First-token latency (visual encoder)
Speculative decoding speedup
Memory reduction via GPTQ+Marlin format
Collaborative offloading
Who Should Care
What To Try In 7 Days
Measure J/token and p50 latency on a target device using a small quantized model (INT8/INT4) and one real prompt workload
Prototype a hybrid flow: run short, latency-critical steps on-device and offload long-horizon context to a cloud microservice; measure p95 latency against your SLA
Swap a heavy visual encoder for a token-reduction encoder (e.g., FastViTHD) and measure time-to-first-token in your app
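
The measurements in the steps above can be sketched as follows. This is a minimal illustration, assuming you can already sample device power at a fixed interval and log per-request latency; the function names `joules_per_token` and `latency_percentile` are hypothetical helpers, not part of any tool named in the survey.

```python
import math


def joules_per_token(power_samples_w, sample_interval_s, tokens_generated):
    """Approximate energy per generated token from evenly spaced power readings."""
    if tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    # Rectangle-rule integration: each sample stands for one interval of time.
    energy_j = sum(power_samples_w) * sample_interval_s
    return energy_j / tokens_generated


def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile (pct in 1..100) over per-request latencies."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting both numbers for the same run, on the same device and prompt workload, keeps the energy and latency figures comparable across model variants.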
Agent Features
Memory
- KV-cache (paged, evict/compress)
- short-term streaming memory
- federated personalization state
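
As a rough illustration of the paged KV-cache with eviction listed above, the sketch below keeps fixed-size pages in a block table and evicts the least recently used page when the budget is exhausted. It is a toy model under stated assumptions: the class `PagedKVCache`, its methods, and the LRU policy are illustrative choices, not the mechanism of any specific runtime.

```python
from collections import OrderedDict


class PagedKVCache:
    """Toy block table mapping (sequence_id, page_index) to a fixed-size page."""

    def __init__(self, num_pages, page_size):
        self.num_pages = num_pages    # total page budget on the device
        self.page_size = page_size    # token positions stored per page
        self.pages = OrderedDict()    # insertion order doubles as LRU order

    def append(self, seq_id, position, kv_entry):
        key = (seq_id, position // self.page_size)
        if key not in self.pages:
            if len(self.pages) >= self.num_pages:
                # Budget exhausted: evict the least recently used page.
                self.pages.popitem(last=False)
            self.pages[key] = {}
        self.pages[key][position % self.page_size] = kv_entry
        self.pages.move_to_end(key)   # mark page as most recently used

    def read(self, seq_id, position):
        key = (seq_id, position // self.page_size)
        page = self.pages.get(key)
        if page is None:
            return None               # evicted: the caller must recompute
        self.pages.move_to_end(key)
        return page.get(position % self.page_size)
```

A real runtime would store tensors per layer and head and might compress pages instead of dropping them; the block-table indirection is the part this sketch is meant to show.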
Planning
- short-horizon planning with local SLMs
- long-horizon offload to cloud LLMs
Tool Use
- function calling
- retrieval-augmented tool invocation
Frameworks
- RouterEval-style routing
- EdgeShard collaborative partitioning
- vLLM/SGLang runtime patterns
Is Agentic
true
Architectures
- single-agent
- multi-agent
- hierarchical agents
Collaboration
- edge-edge model partitioning
- edge-cloud hybrid routing
- large-small cooperative routing
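
The edge-cloud hybrid routing pattern above can be sketched as a simple policy function. This is only an illustration: the `Request` fields, the `route` function, and the thresholds are hypothetical and would need tuning against a real device and workload.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int                  # size of the context to process
    expected_steps: int                 # rough estimate of reasoning/planning steps
    contains_private_data: bool = False


def route(request, network_ok, max_local_tokens=2048, max_local_steps=4):
    """Return "edge" or "cloud" for a request (thresholds are assumptions)."""
    # Private data and offline operation both force on-device execution.
    if request.contains_private_data or not network_ok:
        return "edge"
    # Long contexts or long-horizon plans go to the larger cloud model.
    if request.prompt_tokens > max_local_tokens:
        return "cloud"
    if request.expected_steps > max_local_steps:
        return "cloud"
    return "edge"
```

Keeping the policy as a pure function makes it easy to replay logged requests and check how often each path would be taken before deploying it.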
Optimization Features
Token Efficiency
- context compression and retrieval
- dynamic KV eviction
- vocabulary prioritization in drafts
Infra Optimization
- NPU/ASIC acceleration
- FPGA spatial acceleration
- compute-in-memory / near-memory approaches
Model Optimization
- quantization (INT8/INT4, GPTQ, AWQ)
- pruning (structured and unstructured)
- knowledge distillation (teacher→student KD)
- LoRA
- MoE
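
To make the quantization entry above concrete, here is a minimal sketch of per-tensor symmetric INT8 quantization in plain Python. It is not GPTQ or AWQ, which choose quantized values more carefully; the function names are hypothetical and the example only shows the basic scale-and-round step.

```python
def quantize_int8(weights):
    """Per-tensor symmetric quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from integer codes."""
    return [scale * v for v in q]
```

The round-trip error per weight is bounded by about half the scale, which is one reason outlier weights (which inflate the scale) hurt accuracy and why reasoning tasks need validation after quantization.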
System Optimization
- model partitioning and offload
- hardware-aware compilation (CoreML, WebGPU)
- runtime scheduling (thermal-aware, duty-cycling)
Training Optimization
- data augmentation and synthetic data with audit
- multi-teacher distillation
- LoRA
Inference Optimization
- speculative decoding (FR-Spec, EAGLE-2)
- paged KV-cache and continuous batching
- mixed-precision runtime (W4A16)
- speculative sampling + quantization (SpecMQuant)
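
The core loop behind speculative decoding can be sketched in its simplest greedy form: a small draft model proposes several tokens and the target model keeps the longest prefix it agrees with. This is a hedged illustration, not FR-Spec, EAGLE-2, or SpecMQuant; `draft_next` and `target_next` are hypothetical callables that return the next token for a given token list, and sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step; returns the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each position. A real system scores all
    #    k positions in one batched forward pass instead of a Python loop.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target pass yields a bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted
```

Because every emitted token is one the target model would have chosen, the output matches plain greedy decoding with the target; the speedup depends on how often the draft agrees.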
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Heterogeneous source reports: numbers aggregate across different hardware and workloads
- No new experimental artifact provided by authors for reproducible baselines
- Standardized, modality-aware reasoning benchmarks and energy-reporting norms are missing
- Hardware toolchain fragmentation still demands many platform-specific adjustments
When Not To Use
- If you can afford always-on cloud compute and data centralization for complex reasoning
- For safety-critical tasks requiring formal verification beyond current XAI capabilities
- When device thermal or memory budgets are orders of magnitude below model needs
Failure Modes
- Reasoning degradation after extreme quantization, especially for chain-of-thought tasks
- Thermal throttling causing unpredictable latency spikes on mobile SoCs
- Partitioning/offload failures due to network jitter or mismatched kernel versions
- Adversarial or bit-flip faults targeting low-bit quantized weights
Core Entities
Models
- MobileLLM
- MobileLLM-R1
- MiniCPM-V 4.0
- FastVLM
- MobileCLIP2
- LLaMA 3.1 8B
- Phi (Microsoft)
- Gemini Nano
- OmniVLM
Metrics
- p50/p90 latency
- J/token (energy per token)
- throughput (tokens/s)
- Accuracy
- memory footprint (GB/MB)
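
For the memory-footprint metric above, a back-of-the-envelope estimate is often enough to tell whether a model fits a device. The sketch below is an approximation under stated assumptions: it ignores quantization scale and zero-point overhead and activation memory, and the function names are hypothetical.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate weight storage, ignoring per-group scale/zero-point overhead."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
```

As a usage example, assuming a nominal 8 billion parameters at 4 bits per weight, `weight_bytes(8_000_000_000, 4)` gives 4,000,000,000 bytes (about 4 GB). Assuming 32 layers, 8 KV heads, and a head dimension of 128 (values to verify against the actual model configuration) with 16-bit cache entries, `kv_cache_bytes(32, 8, 128, 8192)` gives 1,073,741,824 bytes (1 GiB) for an 8,192-token context.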
Datasets
- MATH
- GSM8K
- MMLU
- ImageNet-1k
- EdgeIIoTset
Benchmarks
- MobileAIBench
- NeurIPS Edge-LLMs competition
- MTEB / CoIR
Context Entities
Models
- GPT-3/4
- PaLM
- MiniCPM
- Qwen
- TinyLLaMA
- BabyLLaMA
Metrics
- time-to-first-token (TTFT)
- KV-cache eviction rate
- offload ratio
- thermal stability
Datasets
- MMLU variants
- LiveCodeBench
- Video QA suites
Benchmarks
- MobileLLM evaluations on smartphones
- Energy (Wh per query) reports (heterogeneous)

