Overview
Production Readiness: 0.7
Novelty Score: 0.5
Cost Impact Score: 0.6
Citation Count: 1
Why It Matters For Business
On-device LLM inference can cut cloud cost and latency while improving privacy; Transformer-Lite shows practical engineering steps to boost phone GPU throughput enough to make interactive mobile LLM apps feasible.
Summary TLDR
Transformer-Lite is a mobile inference engine that combines four practical optimizations: symbolic dynamic-shape handling; operator fusion with low GPU execution priority; an FP4 storage format called E0M4 that cuts dequantization cost; and sub-tensor KV-cache writes that avoid copying. On two phones it runs 2–10× faster than existing open baselines. For example, Gemma 2B reaches 330 tokens/s prefill and 30 tokens/s decoding, and ChatGLM2 6B reaches 121 and 14 tokens/s. The work is engineering-focused and accepts a small quantization error in exchange for large on-device speedups.
Problem Statement
On-device LLM inference is slow for four reasons: models have dynamic input shapes, 4-bit weights require costly dequantization, KV caches are copied at every step, and generic mobile engines are tuned for static CV models. The result is a poor user experience and tight limits on the model size and response time that are practical on-device.
Main Contribution
A symbolic-expression system to derive and reuse memory for dynamic-shape tensors, reducing CPU-GPU sync and reallocations.
Operator-level optimizations: fused operators, separate matmul kernels for prefill vs decoding, and setting low GPU execution priority to reduce UI lag.
E0M4 FP4 storage format that converts 4-bit values to half precision (FP16) using two bitwise ops, lowering dequantization cost and slightly reducing quantization error compared with INT4.
Sub-tensor KV-cache handling to avoid copying outputs back to inputs and reduce memory overhead; supports ONNX-exported models for easier deployment.
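The sub-tensor KV-cache idea above can be illustrated with a minimal sketch. It assumes a buffer preallocated to a fixed maximum length, where each decoding step writes one row in place instead of building a new, longer tensor. The class name `KVCache`, its methods, and the flat float32 layout are hypothetical and are not taken from Transformer-Lite, which works on GPU memory through OpenCL.

```python
from array import array


class KVCache:
    """Hypothetical preallocated [max_len, head_dim] cache stored as flat float32."""

    def __init__(self, max_len, head_dim):
        self.max_len = max_len
        self.head_dim = head_dim
        # One allocation up front; it is never resized or copied afterwards.
        self.buf = array("f", [0.0]) * (max_len * head_dim)
        self.view = memoryview(self.buf)
        self.length = 0  # number of valid rows written so far

    def append(self, row):
        """Write one new key (or value) row into the next free sub-tensor slot."""
        if len(row) != self.head_dim:
            raise ValueError("row has the wrong width")
        if self.length == self.max_len:
            raise IndexError("cache is full")
        start = self.length * self.head_dim
        self.view[start:start + self.head_dim] = array("f", row)
        self.length += 1

    def valid(self):
        """Return a zero-copy view over the rows written so far."""
        return self.view[: self.length * self.head_dim]


cache = KVCache(max_len=4, head_dim=2)
cache.append([1.0, 2.0])
cache.append([3.0, 4.0])
print(list(cache.valid()))
```

The point of the design is that `valid()` hands back a window onto the same storage, so the attention step can read the cache without the output-to-input copy that a concatenate-per-step approach needs.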
Key Findings
Transformer-Lite delivers higher prefill and decoding throughput than MLC-LLM and FastLLM, with reported speedups in the 2–10× range.
Token throughput was measured for representative models on Snapdragon 8 Gen 3.
E0M4 FP4 reduces matrix-multiplication latency on the MTK GPU and lowers quantization error compared with INT4.
Results
prefill throughput (Gemma 2B): 330 tokens/s
decoding throughput (Gemma 2B): 30 tokens/s
prefill throughput (ChatGLM2 6B): 121 tokens/s
decoding throughput (ChatGLM2 6B): 14 tokens/s
E0M4 vs INT4 matmul latency (MTK Dimensity 9300)
Accuracy
deployable max model on 24GB phone
Who Should Care
What To Try In 7 Days
Export your model to ONNX and test an ONNX-based mobile engine to measure baseline throughput.
Pad and batch input lengths to multiples of 64/128 to reduce dynamic-shape update overhead during decoding.
Profile matmul on your target phone GPU and try E0M4-style FP4 storage if the GPU is ARM/MTK for dequantization speedups.
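The padding step above can be sketched in a few lines. This is a minimal illustration, assuming right-padding with a placeholder token; the function names, the `pad_id` value of 0, and the default multiple of 64 are hypothetical choices, and a real model also needs an attention mask so the padded positions are ignored.

```python
def pad_length(n_tokens, multiple=64):
    """Round a sequence length up to the next multiple (ceiling division)."""
    if n_tokens < 0 or multiple <= 0:
        raise ValueError("n_tokens must be >= 0 and multiple must be > 0")
    return -(-n_tokens // multiple) * multiple


def pad_tokens(token_ids, multiple=64, pad_id=0):
    """Right-pad a token list so its length is a multiple of `multiple`."""
    target = pad_length(len(token_ids), multiple)
    return list(token_ids) + [pad_id] * (target - len(token_ids))


# Lengths 65 through 128 all map to the same padded size, so shape-dependent
# buffers only need to be updated when a boundary is crossed.
print([pad_length(n) for n in (1, 64, 65, 128, 129)])
```

With a multiple of 64, a decoding loop would hit a shape change once every 64 generated tokens instead of on every token.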
Optimization Features
Token Efficiency
- no KV cache quantization yet (future work)
Infra Optimization
- Adreno vs ARM GPU-specific matmul tuning suggested
- profiling with ArchProbe to find TFLOPS gaps
Model Optimization
- E0M4 FP4 storage (group-wise)
- minor ONNX model edits to reduce shape ops
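For the E0M4 storage listed above, the following sketch shows one way a 4-bit code can become an FP16 value with a shift and an OR. It assumes the four bits are placed in the top mantissa bits of a half-precision number whose exponent is fixed so the result lies in [1, 2), and that a per-group scale and bias then map it to the weight range. The bit layout, the nibble packing order, and the function names are assumptions made for illustration, not a confirmed description of the Transformer-Lite kernels.

```python
import struct

FP16_ONE = 0x3C00  # bit pattern of 1.0 in IEEE 754 half precision


def e0m4_to_half(nibble):
    """Map a 4-bit code to 1 + nibble/16 by writing it into FP16 mantissa bits."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("expected a 4-bit value")
    bits = (nibble << 6) | FP16_ONE  # shift, then OR: no int-to-float conversion
    # Reinterpret the 16-bit pattern as a half-precision float.
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def dequantize_group(packed, scale, bias):
    """Unpack two codes per byte (low nibble first, an assumed order) and rescale."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(e0m4_to_half(nibble) * scale + bias)
    return out


print([e0m4_to_half(n) for n in (0, 8, 15)])
print(dequantize_group(bytes([0x80]), scale=2.0, bias=-2.0))
```

By contrast, an INT4 path typically needs an integer-to-float conversion per weight before the scale is applied, which is the cost the FP4 layout is meant to avoid.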
System Optimization
- ONNX-based deployment for model agnosticism
- use of OpenCL image/buffer hybrid to match operator needs
Inference Optimization
- symbolic dynamic-shape derivation
- memory reuse via symbolic sizes
- operator fusion (layer-norm, rms-norm, elementwise)
- separate matmul kernels for prefill vs decoding
- sub-tensor KV-cache writes (no copy)
- OpenCL low-priority execution to reduce UI lag
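The symbolic-size items in the list above can be illustrated with a small sketch. It assumes shapes are written as tuples mixing integers and symbol names, and that two tensors are candidates for sharing a buffer when their element counts are equal for every possible binding of the symbols. All names here are hypothetical, the canonical form only handles products of constants and symbols, and a real planner would also have to check that tensor lifetimes do not overlap.

```python
from math import prod


def symbolic_numel(shape):
    """Canonical element count for a shape such as (1, "seq", 2048)."""
    coef = prod(d for d in shape if isinstance(d, int))
    symbols = tuple(sorted(d for d in shape if isinstance(d, str)))
    return coef, symbols


def plan_reuse(tensors):
    """Group tensor names whose symbolic element counts are identical."""
    groups = {}
    for name, shape in tensors.items():
        groups.setdefault(symbolic_numel(shape), []).append(name)
    return groups


def evaluate(numel, bindings):
    """Resolve a symbolic element count once concrete sizes are known."""
    coef, symbols = numel
    return coef * prod(bindings[s] for s in symbols)


tensors = {
    "q": (1, "seq", 2048),
    "ffn_out": ("seq", 2048),
    "scores": (8, "seq", "seq"),
}
groups = plan_reuse(tensors)
for numel, names in groups.items():
    print(names, evaluate(numel, {"seq": 64}))
```

Because the grouping is decided before `seq` has a value, the plan can be built once and only the final sizes need to be recomputed when the input length changes.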
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Prefill is still below theoretical TFLOPS limits; more efficient matmul needed.
- Decoding remains constrained by memory bandwidth and attention costs at long contexts.
- KV cache is not quantized, so memory for long contexts remains large.
- Some models (e.g., Llama2) require transposes to match KV format, adding overhead.
- Results are limited to tested phone GPUs and selected models; other chips may differ.
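The memory-bandwidth limit on decoding can be made concrete with a back-of-the-envelope bound: each generated token has to read every weight at least once, so tokens per second cannot exceed bandwidth divided by weight bytes. The sketch below is only that arithmetic. The function name is hypothetical, and the 60 GB/s bandwidth in the usage line is a placeholder, not a measured figure for any phone in the study. The bound ignores KV-cache and activation traffic, so real throughput will be lower.

```python
def decode_tokens_per_s_bound(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decoding speed when every weight is read once per token."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / weight_bytes


# Placeholder inputs: a 6B-parameter model at 4 bits per weight, 60 GB/s assumed.
print(decode_tokens_per_s_bound(6e9, 4, 60e9))
```

A bound of this kind helps explain why faster matmul kernels raise prefill speed much more than decoding speed.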
When Not To Use
- When you need the best possible model accuracy without any quantization.
- When the deployment target is an NPU with its own optimized toolchain rather than a GPU.
- If you cannot export a compatible ONNX model or modify KV cache format.
Failure Modes
- Performance gain varies strongly with GPU architecture; E0M4 helped MTK but not Adreno in profiling.
- Inserted transposes for KV format can offset speed gains on some models.
- Symbolic size relationships can be unknown and prevent memory reuse in some dynamic branches.
- If attention or memory bandwidth is the bottleneck, matmul optimizations give limited benefits.
Core Entities
Models
- Gemma 2B
- Qwen1.5 4B
- ChatGLM2 6B
- Llama2 7B
- Qwen1.5 14B
- OpenAI CLIP (ViT)
Metrics
- tokens/s
- latency (ms)
- MAE
- TFLOPS
- matrix-multiplication latency
Context Entities
Models
- ResNet
- MobileNet
- RWKV
- Mamba

