Transformer-Lite: run 2–10× faster LLM inference on phone GPUs via symbolic shapes, FP4, and KV-cache tricks

March 29, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

Why It Matters For Business

On-device LLM inference can cut cloud cost and latency while improving privacy; Transformer-Lite shows practical engineering steps to boost phone GPU throughput enough to make interactive mobile LLM apps feasible.

Summary TLDR

Transformer-Lite is a mobile inference engine that combines four practical optimizations: symbolic dynamic-shape handling; operator fusion plus GPU execution-priority control; an FP4 storage format called E0M4 that cuts dequantization cost; and sub-tensor KV-cache writes that avoid copying. On two phones it runs 2–10× faster than existing open baselines. For example, Gemma 2B reaches 330 tokens/s prefill and 30 tokens/s decoding, and ChatGLM2 6B reaches 121 and 14 tokens/s. The work is engineering-focused and accepts a small quantization error in exchange for large on-device speedups.

Problem Statement

On-device LLM inference is slow for four reasons: models have dynamic input shapes, 4-bit weights require costly dequantization, KV caches are copied at every step, and generic mobile engines are tuned for static CV models. The result is a poor user experience and tight limits on the model sizes and response times achievable on-device.

Main Contribution

A symbolic-expression system that derives the shapes of dynamic-shape tensors and reuses memory across them, reducing CPU-GPU synchronization and reallocations.
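The idea of reusing memory through symbolic sizes can be illustrated with a minimal sketch. It is not the engine's implementation: the `SymSize` class, the monomial-only size representation, and the tensor dimensions are assumptions made for illustration. Each tensor's element count is kept as `coeff * product(symbols)`, so two sizes can be compared before the concrete `seq_len` is known.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class SymSize:
    """Element count of the form coeff * product(symbols), e.g. 2048 * seq_len."""
    coeff: int
    symbols: tuple = ()

    @staticmethod
    def of_shape(shape):
        # Integer dims fold into the coefficient; string dims stay symbolic.
        coeff = prod(d for d in shape if isinstance(d, int))
        symbols = tuple(sorted(d for d in shape if isinstance(d, str)))
        return SymSize(coeff, symbols)

    def fits_in(self, other):
        """True only when self <= other is provable for every positive binding."""
        return self.symbols == other.symbols and self.coeff <= other.coeff

    def evaluate(self, bindings):
        # Resolve the concrete element count once the symbols are known.
        return self.coeff * prod(bindings[s] for s in self.symbols)


# Illustrative (hypothetical) activation shapes.
hidden = SymSize.of_shape((1, "seq_len", 2048))
ffn = SymSize.of_shape((1, "seq_len", 16384))
scores = SymSize.of_shape((8, "seq_len", "seq_len"))

print(hidden.fits_in(ffn))             # True: hidden can reuse ffn's buffer
print(scores.fits_in(ffn))             # False: relationship unknown symbolically
print(ffn.evaluate({"seq_len": 128}))  # 2097152
```

The conservative `fits_in` check mirrors the failure mode listed later in this summary: when the relationship between two symbolic sizes cannot be proven, the buffers are not shared.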

Operator-level optimizations: fused operators, separate matmul kernels for prefill vs decoding, and setting low GPU execution priority to reduce UI lag.

E0M4 FP4 storage format that converts 4-bit values to half (FP16) using two bitwise ops, lowering dequantization cost and slightly improving quantization error versus INT4.
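A two-bitwise-op conversion can be sketched as follows. The bit layout here is an assumption rather than a confirmed detail of the paper: the 4-bit code is shifted into the top four mantissa bits of an FP16 word and OR-ed with the bit pattern of 1.0, so every code maps to a value in [1, 2). The `dequantize_group` helper and its `scale`/`offset` parameters are hypothetical stand-ins for a group-wise affine step.

```python
import numpy as np

FP16_ONE = np.uint16(0x3C00)  # bit pattern of 1.0 in IEEE 754 half precision


def e0m4_to_fp16(codes):
    """Map 4-bit codes (0..15) to FP16 values 1 + code/16 using a shift and an OR."""
    codes = np.asarray(codes, dtype=np.uint16)
    # Assumed layout: the code lands in the top 4 of the 10 mantissa bits.
    bits = (codes << np.uint16(6)) | FP16_ONE
    return bits.view(np.float16)


def dequantize_group(codes, scale, offset):
    """Hypothetical group-wise affine step applied after the bit trick."""
    return (e0m4_to_fp16(codes).astype(np.float32) - 1.0) * scale + offset


print(e0m4_to_fp16([0, 8, 15]).tolist())               # [1.0, 1.5, 1.9375]
print(dequantize_group([0, 8, 15], 2.0, -1.0).tolist())  # [-1.0, 0.0, 0.875]
```

No integer-to-float conversion instruction is involved; the FP16 value is obtained by reinterpreting the assembled bits, which is why this style of dequantization can be cheaper than an INT4 path.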

Sub-tensor KV-cache handling to avoid copying outputs back to inputs and reduce memory overhead; supports ONNX-exported models for easier deployment.
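The sub-tensor write idea can be sketched on the CPU with NumPy views. This is an analogy under stated assumptions, not the engine's OpenCL code: the `KVCache` class, its `(num_heads, max_len, head_dim)` layout, and the fixed `max_len` capacity are all hypothetical. New keys and values are written into a slice of a preallocated buffer, and attention reads a view over the filled prefix, so nothing is concatenated or copied back.

```python
import numpy as np


class KVCache:
    """Preallocated cache; each step writes new rows in place instead of concatenating."""

    def __init__(self, max_len, num_heads, head_dim, dtype=np.float16):
        self.k = np.zeros((num_heads, max_len, head_dim), dtype=dtype)
        self.v = np.zeros((num_heads, max_len, head_dim), dtype=dtype)
        self.length = 0  # number of token positions filled so far

    def append(self, k_new, v_new):
        """Write tensors of shape (num_heads, n, head_dim) into the next n slots."""
        n = k_new.shape[1]
        end = self.length + n
        if end > self.k.shape[1]:
            raise ValueError("KV cache capacity exceeded")
        self.k[:, self.length:end] = k_new  # sub-tensor write, no reallocation
        self.v[:, self.length:end] = v_new
        self.length = end
        # Views over the same storage; no data is copied here.
        return self.k[:, :end], self.v[:, :end]


cache = KVCache(max_len=8, num_heads=2, head_dim=4)
cache.append(np.ones((2, 3, 4), np.float16), np.ones((2, 3, 4), np.float16))  # prefill
k, v = cache.append(np.full((2, 1, 4), 2, np.float16),
                    np.full((2, 1, 4), 2, np.float16))                        # one decode step
print(k.shape, np.shares_memory(k, cache.k))  # (2, 4, 4) True
```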

Key Findings

Transformer-Lite boosts prefill speed over MLC-LLM and FastLLM and improves decoding speed.

Numbers: prefill >10×; decoding 2–3× (reported across Gemma 2B and ChatGLM2 6B)

Measured token throughput for representative models on Snapdragon 8 Gen 3.

Numbers: Gemma 2B: 330 prefill / 30 decoding tok/s; ChatGLM2 6B: 121 / 14 tok/s

E0M4 FP4 reduces matrix-multiplication latency on MTK GPU and lowers quantization error vs INT4.

Numbers: MTK matmul speedups 1.33–1.56× across shapes; MAE ≈4.5% lower (ratio ≈0.955)

Results

prefill throughput (Gemma 2B)

Value: 330 tokens/s (Snapdragon 8 Gen 3)

Baseline: MLC-LLM reported 25 tokens/s

decoding throughput (Gemma 2B)

Value: 30 tokens/s (Snapdragon 8 Gen 3)

Baseline: MLC-LLM reported 11 tokens/s

prefill throughput (ChatGLM2 6B)

Value: 121 tokens/s (Snapdragon 8 Gen 3)

Baseline: FastLLM reported 7 tokens/s (CPU)

decoding throughput (ChatGLM2 6B)

Value: 14 tokens/s (Snapdragon 8 Gen 3)

Baseline: FastLLM reported 1.2 tokens/s (CPU)

E0M4 vs INT4 matmul latency (MTK Dimensity 9300)

Value: 3.2 ms vs 5.0 ms (shape 4096×4096)

Baseline: INT4

Accuracy

Value: FP4 MAE ~4.5% smaller than INT4 (ratio ≈0.955)

Baseline: INT4

deployable max model on 24GB phone

Value: Qwen1.5 14B: 54 prefill / 5 decoding tok/s (24GB phone)

Who Should Care

What To Try In 7 Days

Export your model to ONNX and test an ONNX-based mobile engine to measure baseline throughput.

Pad and batch input lengths to multiples of 64/128 to reduce dynamic-shape update overhead during decoding.

Profile matmul on your target phone GPU and try E0M4-style FP4 storage if the target is an ARM GPU (as found in MTK SoCs), where dequantization speedups were observed.
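The padding step suggested above can be sketched in a few lines. The helper names, the default multiple of 64, and `pad_id=0` are assumptions for illustration; the padded positions also need to be masked out in attention, which is not shown here.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-n // multiple) * multiple


def pad_token_ids(token_ids, multiple=64, pad_id=0):
    """Right-pad a token list so its length is a multiple of `multiple`."""
    target = round_up(len(token_ids), multiple)
    return token_ids + [pad_id] * (target - len(token_ids))


print(round_up(65, 64))                                # 128
print(len(pad_token_ids(list(range(10)))))             # 64
# Sequence lengths 1..256 collapse to only four distinct padded shapes.
print(len({round_up(n, 64) for n in range(1, 257)}))   # 4
```

The last line shows why this helps: shape-dependent state only has to be refreshed when the padded length changes, rather than on every generated token.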

Optimization Features

Token Efficiency

  • no KV cache quantization yet (future work)

Infra Optimization

  • Adreno vs ARM GPU-specific matmul tuning suggested
  • profiling with ArchProbe to find TFLOPS gaps

Model Optimization

  • E0M4 FP4 storage (group-wise)
  • minor ONNX model edits to reduce shape ops

System Optimization

  • ONNX-based deployment for model agnosticism
  • use of OpenCL image/buffer hybrid to match operator needs

Inference Optimization

  • symbolic dynamic-shape derivation
  • memory reuse via symbolic sizes
  • operator fusion (layer-norm, rms-norm, elementwise)
  • separate matmul kernels for prefill vs decoding
  • sub-tensor KV-cache writes (no copy)
  • OpenCL low-priority execution to reduce UI lag

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Prefill is still below theoretical TFLOPS limits; more efficient matmul needed.
  • Decoding remains constrained by memory bandwidth and attention costs at long contexts.
  • KV cache is not quantized, so memory for long contexts remains large.
  • Some models (e.g., Llama2) require transposes to match KV format, adding overhead.
  • Results are limited to tested phone GPUs and selected models; other chips may differ.
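To see why an unquantized KV cache is costly at long contexts, a back-of-the-envelope estimate helps. The function below is a generic sizing formula, not something taken from the paper; the example plugs in the publicly documented Llama2 7B configuration (32 layers, 32 KV heads, head dimension 128) with FP16 values, and it ignores any padding or allocator overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to hold keys and values for `seq_len` tokens (batch size 1)."""
    # Factor 2 accounts for storing both K and V in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_ctx = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token / 2**20, "MiB per token")        # 0.5 MiB per token
print(full_ctx / 2**30, "GiB at 4096 tokens")    # 2.0 GiB at 4096 tokens
```

Under these assumptions, storing the cache in 8-bit or 4-bit form would cut the footprint to one half or one quarter, which is the kind of saving KV-cache quantization would target.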

When Not To Use

  • When you need best possible model accuracy without any quantization.
  • When deployment target is an NPU with its own optimized toolchain rather than a GPU.
  • If you cannot export a compatible ONNX model or modify KV cache format.

Failure Modes

  • Performance gain varies strongly with GPU architecture; E0M4 helped MTK but not Adreno in profiling.
  • Inserted transposes for KV format can offset speed gains on some models.
  • Symbolic size relationships can be unknown and prevent memory reuse in some dynamic branches.
  • If attention or memory bandwidth is the bottleneck, matmul optimizations give limited benefits.

Core Entities

Models

  • Gemma 2B
  • Qwen1.5 4B
  • ChatGLM2 6B
  • Llama2 7B
  • Qwen1.5 14B
  • OpenAI CLIP (ViT)

Metrics

  • tokens/s
  • latency (ms)
  • MAE
  • TFLOPS
  • matrix-multiplication latency

Context Entities

Models

  • ResNet
  • MobileNet
  • RWKV
  • Mamba