Overview
Engineering work validated on two real phones and multiple models. Methods are practical and mostly integration/format-level rather than new theory.
Citations: 1
Evidence Strength: 0.90
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 6/7
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
On-device LLM inference can cut cloud cost and latency while improving privacy; Transformer-Lite shows practical engineering steps to boost phone GPU throughput enough to make interactive mobile LLM apps feasible.
Who Should Care
Summary TLDR
Transformer-Lite is a mobile inference engine that combines four practical optimizations: symbolic dynamic-shape handling, operator fusion with GPU execution-priority control, an FP4 storage format called E0M4 that cuts dequantization cost, and sub-tensor KV-cache writes that avoid copying. On two phones it runs 2–10× faster than existing open baselines. For example, Gemma 2B reaches 330 tokens/s in prefill and 30 tokens/s in decoding, and ChatGLM2 6B reaches 121 and 14 tokens/s respectively. The work is engineering-focused and trades a slight quantization error for large on-device speedups.
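To make the E0M4 idea concrete, here is a minimal sketch of how a format with zero exponent bits and four mantissa bits could be decoded with only a shift and an OR. The source does not give the bit layout, so everything specific here is an assumption: the fixed exponent pattern `0x4000` (the fp16 encoding of 2.0), the nibble order inside a byte, the group-wise `scale`/`zero` parameters, and all function names are hypothetical.

```python
import struct

# Assumption: the 4 stored bits become the top 4 mantissa bits of an fp16
# value whose sign and exponent bits are fixed. 0x4000 is fp16 for 2.0.
FIXED_EXPONENT_BITS = 0x4000  # hypothetical choice of fixed exponent
MANTISSA_SHIFT = 6            # fp16 has 10 mantissa bits; 10 - 4 = 6


def fp4_to_fp16_value(code: int) -> float:
    """Decode a 4-bit code with one shift and one OR, then reinterpret as fp16."""
    if not 0 <= code <= 15:
        raise ValueError("code must fit in 4 bits")
    bits = FIXED_EXPONENT_BITS | (code << MANTISSA_SHIFT)
    # Reinterpret the 16-bit pattern as an IEEE half-precision float.
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def unpack_byte(packed: int) -> tuple:
    """Split one byte into two 4-bit codes (low nibble first, by assumption)."""
    return packed & 0x0F, packed >> 4


def dequantize(code: int, scale: float, zero: float) -> float:
    """Map the decoded value back to the weight range (assumed group-wise params)."""
    return (fp4_to_fp16_value(code) - 2.0) * scale + zero
```

Under these assumptions the sixteen codes decode to evenly spaced values from 2.0 to 3.875, so the integer-to-float conversion needs no lookup table or arithmetic beyond the two bitwise operations.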
Problem Statement
On-device LLMs suffer slow inference because models have dynamic input shapes, 4-bit weights require costly dequantization, KV caches are copied each step, and generic mobile engines are tuned for static CV models. The result is a poor user experience and a tight limit on the model sizes that can run on device at acceptable latency.
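The KV-cache copying problem can be illustrated with a small sketch that contrasts rebuilding the cache each step with writing only the new row into a preallocated buffer. This is a plain-Python illustration of the general idea, not the engine's GPU implementation; the class `KVCache` and the helper `append_by_copy` are hypothetical names.

```python
class KVCache:
    """Preallocated cache for one attention head; new rows are written in place."""

    def __init__(self, max_len: int, head_dim: int):
        self.head_dim = head_dim
        self.buffer = [[0.0] * head_dim for _ in range(max_len)]
        self.length = 0  # number of valid rows

    def append(self, row):
        if len(row) != self.head_dim:
            raise ValueError("row has the wrong width")
        if self.length >= len(self.buffer):
            raise IndexError("KV cache is full")
        # Write only the sub-tensor for the new token; earlier rows are untouched.
        self.buffer[self.length][:] = row
        self.length += 1

    def valid(self):
        """Rows written so far."""
        return self.buffer[: self.length]


def append_by_copy(cache, row):
    """Baseline for contrast: builds a new, longer cache on every step."""
    return cache + [list(row)]
```

With the preallocated buffer, the per-step cost depends on the size of one row rather than on the current context length.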
Main Contribution
A symbolic-expression system to derive and reuse memory for dynamic-shape tensors, reducing CPU-GPU sync and reallocations.
Operator-level optimizations: fused operators, separate matmul kernels for prefill vs decoding, and setting low GPU execution priority to reduce UI lag.
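As a rough illustration of the symbolic-expression idea, the sketch below represents each tensor dimension as a linear expression in a named symbol, evaluates it once the symbol's value is known, and checks whether a buffer allocated for an upper bound can be reused. The `Dim` class, the helper names, and the hidden size of 2048 are assumptions made for the example; the source does not describe the engine's actual data structures.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, Optional


@dataclass(frozen=True)
class Dim:
    """Linear symbolic dimension: scale * symbol + offset (constant if no symbol)."""
    scale: int = 0
    symbol: Optional[str] = None
    offset: int = 0

    def evaluate(self, bindings: Dict[str, int]) -> int:
        if self.symbol is None:
            return self.offset
        return self.scale * bindings[self.symbol] + self.offset


def tensor_bytes(shape, bindings, itemsize=2):
    """Size in bytes of a tensor with symbolic shape (itemsize=2 assumes fp16)."""
    return prod(d.evaluate(bindings) for d in shape) * itemsize


def fits(shape, bindings, capacity_bytes, itemsize=2):
    """True if the tensor can reuse a buffer of the given capacity."""
    return tensor_bytes(shape, bindings, itemsize) <= capacity_bytes


SEQ = Dim(1, "seq")
HIDDEN = Dim(offset=2048)  # hypothetical hidden size
hidden_state = (SEQ, HIDDEN)

# Allocate once for an assumed upper bound on sequence length, then reuse.
capacity = tensor_bytes(hidden_state, {"seq": 1024})
```

Because every size is a function of `seq`, a new input length only requires re-evaluating expressions on the CPU; a buffer needs to be reallocated only when `fits` returns false.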
Key Findings
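Transformer-Lite boosts prefill speed over MLC-LLM and FastLLM and improves decoding speed: for Gemma 2B at prompt length 128, prefill reaches 330 tokens/s against 25 tokens/s reported for MLC-LLM (>13×), and decoding reaches 30 tokens/s against 11 tokens/s (~2.7×).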
Token throughput was measured for representative models on Snapdragon 8 Gen 3.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| prefill throughput (Gemma 2B) | 330 tokens/s (Snapdragon 8 Gen 3) | MLC-LLM reported 25 tokens/s | >13× | prompt length 128 | Section 3.3; Figure 5 | Fig.5, Sec.3.3 |
| decoding throughput (Gemma 2B) | 30 tokens/s (Snapdragon 8 Gen 3) | MLC-LLM reported 11 tokens/s | ~2.7× | prompt length 128 | Section 3.3; Figure 5 | Fig.5, Sec.3.3 |
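
The Delta column follows directly from the Value and Baseline columns. A short check, using only the numbers in the table (the helper name `speedup` is hypothetical):

```python
def speedup(ours_tokens_per_s: float, baseline_tokens_per_s: float) -> float:
    """Ratio of measured throughput to baseline throughput."""
    return ours_tokens_per_s / baseline_tokens_per_s


prefill_delta = speedup(330, 25)   # Gemma 2B prefill vs. MLC-LLM
decoding_delta = speedup(30, 11)   # Gemma 2B decoding vs. MLC-LLM
```

The prefill ratio is slightly above 13 and the decoding ratio is close to 2.7, matching the rounded values in the table.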
What To Try In 7 Days
Export your model to ONNX and test an ONNX-based mobile engine to measure baseline throughput.
Pad and batch input lengths to multiples of 64/128 to reduce dynamic-shape update overhead during decoding.
Profile matmul on your target phone GPU and try E0M4-style FP4 storage if the GPU is ARM/MTK for dequantization speedups.
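The padding suggestion above can be sketched as a round-up helper plus a count of how many distinct padded lengths a decoding run passes through. The function names are hypothetical, and the multiple of 64 is just one of the values suggested in the list.

```python
def round_up(length: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= length."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-length // multiple) * multiple  # ceiling division, then scale


def distinct_padded_lengths(start: int, stop: int, multiple: int) -> list:
    """Padded lengths seen while the sequence grows from start to stop inclusive."""
    return sorted({round_up(n, multiple) for n in range(start, stop + 1)})
```

Growing a sequence from 129 to 256 tokens touches 128 distinct raw lengths but only two padded lengths at a multiple of 64, so shape-dependent state would need updating far less often. The cost is some wasted computation on padding positions, which is worth measuring on the target device.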
Optimization Features
Token Efficiency
Infra Optimization
Model Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Prefill throughput is still below the theoretical TFLOPS limit; a more efficient matmul implementation is needed.
Decoding remains constrained by memory bandwidth and attention costs at long contexts.
When Not To Use
When you need the best possible model accuracy without any quantization.
When the deployment target is an NPU with its own optimized toolchain rather than a GPU.
Failure Modes
Performance gain varies strongly with GPU architecture; E0M4 helped MTK but not Adreno in profiling.
Transposes inserted to match the KV-cache format can offset the speed gains on some models.

