Transformer-Lite: run 2–10× faster LLM inference on phone GPUs via symbolic shapes, FP4, and KV-cache tricks

Overview

Decision Snapshot: Needs Validation

Engineering work validated on two real phones and multiple models. Methods are practical and mostly integration/format-level rather than new theory.

Citations: 1

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 6/7

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

Why It Matters For Business

On-device LLM inference can cut cloud cost and latency while improving privacy; Transformer-Lite shows practical engineering steps to boost phone GPU throughput enough to make interactive mobile LLM apps feasible.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO

Summary TLDR

Transformer-Lite is a mobile inference engine that combines four practical optimizations: symbolic dynamic-shape handling, operator fusion with low GPU execution priority, an FP4 storage format called E0M4 that cuts dequantization cost, and sub-tensor KV-cache writes that avoid copying. On two phones it runs 2–10× faster than existing open baselines. For example, Gemma 2B reaches 330 tokens/s prefill and 30 tokens/s decoding, and ChatGLM2 6B reaches 121 and 14 tokens/s. The work is engineering-focused and trades a slight quantization error for large on-device speedups.

Problem Statement

On-device LLM inference is slow for four reasons: models have dynamic input shapes, 4-bit weights require costly dequantization, KV caches are copied at every step, and generic mobile engines are tuned for static CV models. The result is a poor user experience and a tight limit on the model sizes that can run on-device at acceptable latency.

Main Contribution

A symbolic-expression system to derive and reuse memory for dynamic-shape tensors, reducing CPU-GPU sync and reallocations.

Operator-level optimizations: fused operators, separate matmul kernels for prefill vs decoding, and setting low GPU execution priority to reduce UI lag.
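
The symbolic-shape idea can be pictured with a small sketch. Shapes are stored as expressions over symbols such as seq_len, so sizes can be re-derived on the CPU when the input length changes, and buffers sized once for the largest expected values can be reused. The function names, the symbol names and the fp16 item size below are illustrative assumptions, not Transformer-Lite APIs.

```python
from math import prod

def resolve(shape, symbols):
    """Turn a symbolic shape into concrete ints.

    Each dimension is an int, a symbol name (str), or a callable that
    computes the dimension from the symbol table.
    """
    dims = []
    for d in shape:
        if callable(d):
            dims.append(d(symbols))
        elif isinstance(d, str):
            dims.append(symbols[d])
        else:
            dims.append(d)
    return tuple(dims)

def plan_buffers(tensors, max_symbols, itemsize=2):
    """Size every buffer once for the largest symbol values (fp16 assumed)."""
    return {
        name: prod(resolve(shape, max_symbols)) * itemsize
        for name, shape in tensors.items()
    }

# Hypothetical tensors: one scales with seq_len, one with past_len + seq_len.
tensors = {
    "hidden": ("seq_len", 2048),
    "kv": (lambda s: s["past_len"] + s["seq_len"], 256),
}

# Allocate once for the worst case, then only re-resolve shapes per step.
plan = plan_buffers(tensors, {"seq_len": 128, "past_len": 896})
step_shape = resolve(tensors["kv"], {"seq_len": 1, "past_len": 128})
```

Because the plan is computed from the maximum symbol values, a decoding step only needs to re-evaluate the shape expressions; no buffer is reallocated when seq_len changes.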

Key Findings

Transformer-Lite boosts prefill speed over MLC-LLM and FastLLM and improves decoding speed.

Numbers: prefill >10×; decoding 2–3× (reported across Gemma 2B and ChatGLM2 6B)

Practical Use: Switching to Transformer-Lite-like optimizations yields order-of-magnitude prefill speed gains and 2–3× interactive decoding on phones, improving perceived responsiveness.

Evidence Ref: Section 3.3; Figures 5–6

Measured token throughput for representative models on Snapdragon 8 Gen 3.

Numbers: Gemma 2B: 330 prefill / 30 decoding tok/s; ChatGLM2 6B: 121 / 14 tok/s

Practical Use: Expect prefill on the order of hundreds of tok/s for 2–6B models on modern phone GPUs; use these numbers when budgeting latency and UX.

Evidence Ref: Abstract; Section 3.3; Figures 4 and 5
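
For latency budgeting, the reported throughputs can be turned into a rough response-time estimate. The sketch below assumes throughput stays constant across prompt and output lengths, which the source does not claim (the reported numbers are for prompt length 128); the function name and the 64-token output length are illustrative.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill seconds, total seconds) for one request."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, prefill_s + decode_s

# Gemma 2B figures from the findings above: 330 tok/s prefill, 30 tok/s decoding.
prefill_s, total_s = estimate_latency_s(128, 64, prefill_tps=330, decode_tps=30)
```

With these inputs the estimate is roughly 0.39 s of prefill and about 2.5 s end to end, so decoding dominates the budget for anything beyond very short replies.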

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
prefill throughput (Gemma 2B)	330 tokens/s (Snapdragon 8 Gen 3)	MLC-LLM reported 25 tokens/s	>13×	prompt length 128	Section 3.3; Figure 5	Fig.5, Sec.3.3
decoding throughput (Gemma 2B)	30 tokens/s (Snapdragon 8 Gen 3)	MLC-LLM reported 11 tokens/s	~2.7×	prompt length 128	Section 3.3; Figure 5	Fig.5, Sec.3.3

What To Try In 7 Days

Export your model to ONNX and test an ONNX-based mobile engine to measure baseline throughput.

Pad and batch input lengths to multiples of 64/128 to reduce dynamic-shape update overhead during decoding.

Profile matmul on your target phone GPU and try E0M4-style FP4 storage if the GPU is ARM/MTK for dequantization speedups.
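
The padding step in the list above can be sketched as follows. Rounding lengths up to a fixed multiple means the concrete shapes change only once every 64 or 128 tokens instead of at every step. The helper names and pad_id=0 are assumptions; a real pipeline would also need an attention mask so the padded positions are ignored.

```python
def pad_length(n_tokens, multiple=64):
    """Round a token count up to the next multiple."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return -(-n_tokens // multiple) * multiple  # ceiling division

def pad_ids(token_ids, multiple=64, pad_id=0):
    """Right-pad a list of token ids to the padded length (pad_id is hypothetical)."""
    target = pad_length(len(token_ids), multiple)
    return token_ids + [pad_id] * (target - len(token_ids))

padded = pad_ids(list(range(100)), multiple=128)
```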

Optimization Features

Token Efficiency

no KV cache quantization yet (future work)

Infra Optimization

Adreno vs ARM GPU-specific matmul tuning suggested

profiling with ArchProbe to find TFLOPS gaps

Model Optimization

E0M4 FP4 storage (group-wise)

minor ONNX model edits to reduce shape ops
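
The summary only states that E0M4 is an FP4 storage format that cuts dequantization cost. The sketch below is one plausible reading, not the authors' confirmed scheme: it assumes the name means zero exponent bits and four mantissa bits, and that the 4-bit code is placed in the top mantissa bits of an fp16 value with a fixed exponent, so dequantization is a shift and an OR followed by a group-wise scale and bias. The rounding and the min/max scaling are also assumptions.

```python
import struct

FP16_ONE = 0x3C00  # IEEE half-precision bit pattern of 1.0

def quantize_group(values):
    """Map one group of floats to 4-bit codes plus a (scale, bias) pair."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0            # guard against constant groups
    codes = [round((v - lo) / span * 15) for v in values]
    scale = span * 16 / 15             # fp16 values 1 + c/16 cover [1, 1.9375]
    bias = lo - scale                  # so that fp16 value 1.0 maps back to lo
    return codes, scale, bias

def code_to_fp16(code):
    """Dequantize with bit operations only: shift into the mantissa, OR the exponent."""
    bits = FP16_ONE | (code << 6)      # top 4 of the 10 mantissa bits
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def dequantize_group(codes, scale, bias):
    return [code_to_fp16(c) * scale + bias for c in codes]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, scale, bias = quantize_group(weights)
restored = dequantize_group(codes, scale, bias)
```

Under these assumptions the reconstruction error per weight is at most half a quantization step, that is, the group's value range divided by 30.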

System Optimization

ONNX-based deployment for model agnosticism

use of OpenCL image/buffer hybrid to match operator needs

Inference Optimization

symbolic dynamic-shape derivation

memory reuse via symbolic sizes

operator fusion (layer-norm, rms-norm, elementwise)

separate matmul kernels for prefill vs decoding

sub-tensor KV-cache writes (no copy)

OpenCL low-priority execution to reduce UI lag
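
The sub-tensor KV-cache write can be illustrated on the CPU with a preallocated buffer. Instead of concatenating old and new keys into a fresh tensor at every step, new rows are written into the next free slot of a fixed buffer. This is a minimal sketch of the idea only; the class, its flat float32 layout and its method names are hypothetical and do not reflect the engine's OpenCL implementation.

```python
from array import array

class KVCache:
    """Fixed-size cache of max_len rows, each head_dim floats, stored flat."""

    def __init__(self, max_len, head_dim):
        self.head_dim = head_dim
        self.buf = array("f", [0.0]) * (max_len * head_dim)  # allocated once
        self.length = 0                                      # rows in use

    def append(self, new_values):
        """Write new rows in place; existing rows are neither moved nor copied."""
        if len(new_values) % self.head_dim:
            raise ValueError("new_values must hold whole rows")
        start = self.length * self.head_dim
        end = start + len(new_values)
        if end > len(self.buf):
            raise ValueError("KV cache is full")
        self.buf[start:end] = array("f", new_values)
        self.length += len(new_values) // self.head_dim

    def view(self):
        """Zero-copy view of the rows written so far."""
        return memoryview(self.buf)[: self.length * self.head_dim]

cache = KVCache(max_len=4, head_dim=2)
buffer_before = cache.buf
cache.append([1.0, 2.0, 3.0, 4.0])   # prefill: two rows at once
cache.append([5.0, 6.0])             # decoding: one row per step
```

The cost of each append is proportional to the number of new rows rather than to the full cache length, which is the point of avoiding the per-step copy.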

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Prefill is still below theoretical TFLOPS limits; more efficient matmul needed.

Decoding remains constrained by memory bandwidth and attention costs at long contexts.
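
A back-of-envelope calculation shows why decoding is tied to memory bandwidth: each decoded token has to read every weight once, so decoding speed times weight size gives the weight traffic the memory system must sustain. The sketch assumes 4-bit weights (0.5 bytes per parameter), nominal parameter counts of 2e9 and 6e9 taken from the model names, and ignores quantization scales, KV-cache reads and activations, so the results are lower bounds on actual traffic.

```python
def implied_weight_bandwidth_gbs(params, bits_per_weight, tokens_per_s):
    """Weight bytes read per second, in GB/s, if every token reads all weights once."""
    weight_bytes = params * bits_per_weight / 8
    return weight_bytes * tokens_per_s / 1e9

# Decoding speeds from the findings above; parameter counts are nominal assumptions.
gemma_gbs = implied_weight_bandwidth_gbs(2e9, 4, 30)
chatglm_gbs = implied_weight_bandwidth_gbs(6e9, 4, 14)
```

Under these assumptions the reported decoding speeds correspond to about 30 GB/s for Gemma 2B and about 42 GB/s for ChatGLM2 6B of weight reads alone.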

When Not To Use

When you need the best possible model accuracy without any quantization.

When deployment target is an NPU with its own optimized toolchain rather than a GPU.

Failure Modes

Performance gain varies strongly with GPU architecture; E0M4 helped MTK but not Adreno in profiling.

Inserted transposes for KV format can offset speed gains on some models.

Core Entities

Models

Gemma 2B, Qwen1.5 4B, ChatGLM2 6B, Llama2 7B, Qwen1.5 14B, OpenAI CLIP (ViT)

Metrics

tokens/s, latency (ms), MAE, TFLOPS, matrix-multiplication latency

Context Entities

Models

ResNet, MobileNet, RWKV, Mamba
