Transformer-Lite: run 2–10× faster LLM inference on phone GPUs via symbolic shapes, FP4, and KV-cache tricks

March 29, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

Engineering work validated on two real phones and multiple models. Methods are practical and mostly integration/format-level rather than new theory.

Citations: 1

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 6/7

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 50%

Authors

Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

Why It Matters For Business

On-device LLM inference can cut cloud cost and latency while improving privacy; Transformer-Lite shows practical engineering steps to boost phone GPU throughput enough to make interactive mobile LLM apps feasible.

Who Should Care

Summary TLDR

Transformer-Lite is a mobile inference engine that combines four practical optimizations: symbolic dynamic-shape handling; operator fusion plus low GPU execution priority; an FP4 storage format called E0M4 that cuts dequantization cost; and sub-tensor KV-cache writes that avoid copying. On two phones it runs 2–10× faster than existing open baselines. For example, Gemma 2B reaches 330 tokens/s prefill and 30 tokens/s decoding, and ChatGLM2 6B reaches 121 and 14 tokens/s. The work is engineering-focused and trades a slight quantization error for large on-device speedups.

Problem Statement

On-device LLM inference is slow for four reasons: models have dynamic input shapes, 4-bit weights require costly dequantization, KV caches are copied at every step, and generic mobile engines are tuned for static CV models. The result is a poor user experience and a tight limit on how large a model can run on-device at acceptable latency.
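To make the dequantization cost concrete, here is a minimal sketch of one way a 4-bit code could be turned into an fp16 value with only a shift and an OR. This summary does not spell out E0M4's bit layout, so the sketch assumes the name means zero exponent bits and four mantissa bits, and that the code is placed in the top four mantissa bits of an fp16 with a fixed exponent. The function names, the fixed exponent, and the offset of 1.5 are all hypothetical and are not taken from the paper.

```python
import struct

# ASSUMPTION: fixed biased exponent 15, so decoded values fall in [1.0, 2.0).
FP16_EXPONENT_ONE = 15 << 10

def bits_to_fp16(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def decode_code(q: int) -> float:
    """Hypothetical decode: place a 4-bit code in the top 4 mantissa bits.

    fp16 has 10 mantissa bits, so the code is shifted left by 6.
    The result is 1 + q/16, obtained with one shift and one OR.
    """
    if not 0 <= q < 16:
        raise ValueError("code must fit in 4 bits")
    return bits_to_fp16(FP16_EXPONENT_ONE | (q << 6))

def dequant_group(codes, scale, offset=1.5):
    """Group-wise dequantization sketch: one scale shared by the group.

    ASSUMPTION: subtracting 1.5 centres the [1, 2) range around zero.
    """
    return [scale * (decode_code(q) - offset) for q in codes]
```

The point of the sketch is only that the integer-to-float conversion step of a conventional int4 scheme disappears; whether the real format uses this exponent or offset is not confirmed here.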

Main Contribution

A symbolic-expression system to derive and reuse memory for dynamic-shape tensors, reducing CPU-GPU sync and reallocations.

Operator-level optimizations: fused operators, separate matmul kernels for prefill vs decoding, and setting low GPU execution priority to reduce UI lag.
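The symbolic-shape idea can be illustrated with a small sketch. It is not the engine's implementation: the tensor names, the hidden size of 2048, and the `past`/`seq` symbols are assumptions chosen for illustration. Shapes are kept as expressions over symbolic dimensions, buffers are sized once at an upper bound, and per-step shapes are then resolved on the CPU without reallocating.

```python
from math import prod

# A dimension is an int, a symbol name, or a function of the bindings.
# All names and sizes below are hypothetical.
SHAPES = {
    "hidden": ("batch", "seq", 2048),
    "kv_cache": ("batch", lambda b: b["past"] + b["seq"], 2048),
}

def resolve(shape, bindings):
    """Evaluate every dimension of a symbolic shape to a concrete int."""
    dims = []
    for d in shape:
        if callable(d):
            dims.append(d(bindings))
        elif isinstance(d, str):
            dims.append(bindings[d])
        else:
            dims.append(d)
    return tuple(dims)

def nbytes(shape, bindings, itemsize=2):
    """Byte size of a tensor; itemsize=2 assumes fp16 elements."""
    return prod(resolve(shape, bindings)) * itemsize

def plan_buffers(shapes, max_bindings):
    """Size each buffer once, at the largest values the symbols can take."""
    return {name: nbytes(shape, max_bindings) for name, shape in shapes.items()}
```

With the buffers planned at, say, `{"batch": 1, "seq": 128, "past": 896}`, a decoding step with `seq=1` only needs `resolve` to compute the new logical shape; the allocation itself is reused.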

Key Findings

Transformer-Lite boosts prefill speed over MLC-LLM and FastLLM and improves decoding speed.

Numbers: prefill >10×; decoding 2–3× (reported across Gemma 2B and ChatGLM2 6B)

Practical Use: Switching to Transformer-Lite-like optimizations yields order-of-magnitude prefill gains and 2–3× faster interactive decoding on phones, improving perceived responsiveness.

Evidence Ref: Section 3.3; Figures 5–6

Measured token throughput for representative models on Snapdragon 8 Gen 3.

Numbers: Gemma 2B, 330 prefill / 30 decoding tok/s; ChatGLM2 6B, 121 prefill / 14 decoding tok/s

Practical Use: Expect prefill in the low hundreds of tok/s (roughly 120–330) for 2–6B models on modern phone GPUs; use these numbers when budgeting latency and UX.

Evidence Ref: Abstract; Section 3.3; Figures 4 and 5
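As a rough illustration of how these throughput figures translate into a latency budget, the sketch below divides token counts by the reported rates. It assumes throughput stays constant across prompt and reply lengths, which is a simplification: the figures were measured at specific settings, and decoding slows at long contexts. The function name and the 128/64 token counts are illustrative choices.

```python
def latency_budget(prompt_tokens, new_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token_s, total_s) from throughput figures.

    ASSUMPTION: constant throughput regardless of sequence length.
    """
    ttft = prompt_tokens / prefill_tps
    total = ttft + new_tokens / decode_tps
    return ttft, total

# (prefill tok/s, decoding tok/s) as quoted in the finding above.
REPORTED = {
    "Gemma 2B": (330.0, 30.0),
    "ChatGLM2 6B": (121.0, 14.0),
}

if __name__ == "__main__":
    for name, (prefill, decode) in REPORTED.items():
        ttft, total = latency_budget(128, 64, prefill, decode)
        print(f"{name}: first token ~{ttft:.2f} s, 64-token reply ~{total:.2f} s")
```

For a 128-token prompt and a 64-token reply this gives roughly 0.4 s to first token and 2.5 s in total for Gemma 2B, and roughly 1.1 s and 5.6 s for ChatGLM2 6B.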

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
prefill throughput (Gemma 2B) | 330 tokens/s (Snapdragon 8 Gen 3) | MLC-LLM reported 25 tokens/s | >13× | prompt length 128 | Section 3.3; Figure 5 | Fig. 5, Sec. 3.3
decoding throughput (Gemma 2B) | 30 tokens/s (Snapdragon 8 Gen 3) | MLC-LLM reported 11 tokens/s | ~2.7× | prompt length 128 | Section 3.3; Figure 5 | Fig. 5, Sec. 3.3
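The deltas in the results above are plain throughput ratios, which can be checked directly. The helper name is illustrative.

```python
def speedup(value_tps, baseline_tps):
    """Ratio of measured throughput to baseline throughput."""
    return value_tps / baseline_tps

prefill_delta = speedup(330, 25)  # 13.2, reported as >13x
decode_delta = speedup(30, 11)    # about 2.73, reported as ~2.7x
```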

What To Try In 7 Days

Export your model to ONNX and test an ONNX-based mobile engine to measure baseline throughput.

Pad and batch input lengths to multiples of 64/128 to reduce dynamic-shape update overhead during decoding.

Profile matmul on your target phone GPU and try E0M4-style FP4 storage if the GPU is ARM/MTK for dequantization speedups.
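The padding suggestion above can be sketched as follows. Rounding sequence lengths up to a multiple means the engine sees a new shape only once per bucket rather than at every token. The helper names and the `pad_id` value are hypothetical, and a real pipeline would also need an attention mask so padded positions are ignored.

```python
def round_up(length, multiple=64):
    """Round a sequence length up to the next multiple."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-length // multiple) * multiple

def pad_tokens(token_ids, multiple=64, pad_id=0):
    """Pad a token list to a bucketed length; returns (padded, real_length).

    ASSUMPTION: pad_id=0 is a placeholder, use your tokenizer's pad token.
    """
    target = round_up(len(token_ids), multiple)
    return token_ids + [pad_id] * (target - len(token_ids)), len(token_ids)

def distinct_shapes(max_len, multiple):
    """Count the different lengths seen while growing from 1 to max_len."""
    return len({round_up(n, multiple) for n in range(1, max_len + 1)})
```

Growing a sequence from 1 to 130 tokens touches 130 distinct lengths without bucketing and only 3 (64, 128, 192) with a multiple of 64, at the price of some wasted compute on padding.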

Optimization Features

Token Efficiency: no KV cache quantization yet (future work)
Infra Optimization: Adreno vs ARM GPU-specific matmul tuning suggested; profiling with ArchProbe to find TFLOPS gaps
Model Optimization: E0M4 FP4 storage (group-wise); minor ONNX model edits to reduce shape ops
System Optimization: ONNX-based deployment for model agnosticism; use of OpenCL image/buffer hybrid to match operator needs
Inference Optimization: symbolic dynamic-shape derivation; memory reuse via symbolic sizes; operator fusion (layer-norm, rms-norm, elementwise); separate matmul kernels for prefill vs decoding; sub-tensor KV-cache writes (no copy); OpenCL low-priority execution to reduce UI lag
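Of the features listed, the no-copy KV-cache write is the easiest to show in miniature. The sketch below is a CPU-side analogy in pure Python, not the engine's OpenCL code: the class name and layout are assumptions. The cache is allocated once at its maximum length, each step writes its new row into a slice of that buffer, and readers get a view of the filled prefix instead of a freshly concatenated copy.

```python
from array import array

class KVCache:
    """Preallocated [max_len, dim] float buffer with in-place row writes."""

    def __init__(self, max_len, dim):
        self.max_len = max_len
        self.dim = dim
        self.buf = array("f", [0.0]) * (max_len * dim)  # allocated once
        self.length = 0  # number of rows written so far

    def append(self, row):
        """Write one new row into the next free slot; no reallocation."""
        if self.length >= self.max_len:
            raise IndexError("cache is full")
        if len(row) != self.dim:
            raise ValueError("row has wrong width")
        start = self.length * self.dim
        # Same-size slice assignment overwrites in place.
        self.buf[start:start + self.dim] = array("f", row)
        self.length += 1

    def view(self):
        """Zero-copy view of the rows written so far."""
        return memoryview(self.buf)[: self.length * self.dim]
```

The contrast is with the naive pattern of building `new_cache = old_cache + new_row` at every step, which copies the whole history each time a token is generated.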

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Prefill is still below the GPU's theoretical TFLOPS limit; a more efficient matmul implementation is needed.

Decoding remains constrained by memory bandwidth and attention costs at long contexts.

When Not To Use

When you need the best possible model accuracy, with no quantization at all.

When deployment target is an NPU with its own optimized toolchain rather than a GPU.

Failure Modes

Performance gain varies strongly with GPU architecture; E0M4 helped MTK but not Adreno in profiling.

Inserted transposes for KV format can offset speed gains on some models.

Core Entities

Models

Gemma 2B, Qwen1.5 4B, ChatGLM2 6B, Llama2 7B, Qwen1.5 14B, OpenAI CLIP (ViT)

Metrics

tokens/s, latency (ms), MAE, TFLOPS, matrix-multiplication latency

Context Entities

Models

ResNet, MobileNet, RWKV, Mamba