Overview
Production Readiness
0.65
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
LOOKAT can cut key-cache memory and the DRAM bandwidth spent on attention scoring by up to 64× (reported on GPT-2) without retraining, enabling longer contexts or lower-cost edge hardware for real-time inference.
Summary TLDR
LOOKAT replaces standard attention scoring with a lookup-table method built from product quantization (PQ) and asymmetric distance computation (ADC). By compressing keys into small codebook indices and precomputing query×codebook dot-products, LOOKAT avoids dequantizing keys and sharply reduces the bandwidth needed to read cached keys. In GPT-2 experiments it reaches 64× compression with ~0.957 cosine output fidelity and Spearman rank correlation >0.95, without training or architecture changes. Main limitations: values stay in FP16, and the lookup kernels need hardware support.
Problem Statement
KV-cache memory grows with sequence length and dominates edge inference memory. Standard INT4/INT8 quantization lowers storage but still needs dequantization, so bandwidth remains the bottleneck. The paper asks: can attention scoring be computed directly on compressed keys to remove the dequantization bandwidth cost?
Main Contribution
Show attention scoring is equivalent to inner-product retrieval and can use product quantization + ADC.
Introduce LOOKAT: compute attention scores from codebook indices via precomputed lookup tables, avoiding key dequantization.
Demonstrate up to 64× KV-cache compression with >95% output fidelity on GPT-2 without retraining or architectural changes.
Provide a theoretical bound linking rank-correlation degradation to PQ parameters, plus empirical scaling results up to 1024 tokens.
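To make the product-quantization step concrete, here is a minimal sketch of fitting per-subspace codebooks on calibration keys and encoding a key as m small indices. It uses plain k-means in pure Python. The function names (`train_codebooks`, `encode`), the toy dimensions, and the random calibration data are illustrative assumptions, not the paper's implementation.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(sub, centroids):
    """Index of the centroid closest to a sub-vector."""
    return min(range(len(centroids)), key=lambda c: sq_dist(sub, centroids[c]))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; initial centroids are sampled from the points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_codebooks(calib_keys, m, k=256):
    """Fit one codebook of k centroids per subspace (hypothetical helper)."""
    subspaces = zip(*(split(key, m) for key in calib_keys))
    return [kmeans(list(points), k) for points in subspaces]

def encode(key, codebooks):
    """Replace each sub-vector of a key with its nearest-centroid index."""
    subs = split(key, len(codebooks))
    return [nearest(sub, cb) for sub, cb in zip(subs, codebooks)]

# Toy calibration set: 64 random 8-dim "keys", m=2 subspaces, k=4 centroids.
rng = random.Random(1)
calib = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(64)]
codebooks = train_codebooks(calib, m=2, k=4)
codes = encode(calib[0], codebooks)
print(codes)  # two indices, each in range(4)
```

With k=256, each index fits in one byte, so a key is stored as m bytes.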
Key Findings
LOOKAT achieves 64× KV-cache compression while keeping model output close to FP16 (cosine similarity ~0.957).
LOOKAT preserves attention ranking structure (Spearman rank correlation >0.95).
LOOKAT removes the dequantization bandwidth bottleneck by using ADC lookup tables.
Quality degrades with longer contexts (cosine similarity drops by about 10% at 1024 tokens) but remains usable up to that length.
Results
LOOKAT-2 cosine similarity (output fidelity)
LOOKAT-4 cosine similarity (output fidelity)
INT4 cosine similarity (scalar quant baseline)
Spearman rank correlation (LOOKAT range)
Long-context cosine similarity (LOOKAT-4)
Who Should Care
What To Try In 7 Days
Profile your model's KV-cache bandwidth; confirm attention scoring is the bottleneck.
Implement a PQ+ADC prototype for keys (m=4) and measure per-query DRAM load reduction.
Run functional checks on a few representative prompts to compare top-k attention overlap with FP16.
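For the functional check in the last step, one simple measure is the fraction of top-k attended positions shared by the FP16 scores and the approximate scores. The sketch below assumes you already have both score vectors for a given query. The function name `top_k_overlap` and the sample numbers are made up for illustration and are not measurements.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def top_k_overlap(reference_scores, approx_scores, k):
    """Fraction of the reference top-k positions also in the approximate top-k."""
    ref = top_k_indices(reference_scores, k)
    approx = top_k_indices(approx_scores, k)
    return len(ref & approx) / k

# Placeholder scores for one query over five cached tokens (not real data).
fp16_scores = [0.9, 0.1, 0.5, 0.7, 0.3]
approx_scores = [0.8, 0.2, 0.4, 0.75, 0.45]
overlap = top_k_overlap(fp16_scores, approx_scores, k=3)
print(overlap)
```

Averaging this value over heads, layers, and prompts gives a quick pass/fail signal before investing in optimized kernels.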
Optimization Features
Token Efficiency
- reduces per-token key storage to 1–16 bytes depending on m (values remain FP16)
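A rough worked example of where the byte range comes from, assuming one byte per subspace index (256 centroids per codebook, matching the m × 256 table size listed under Inference Optimization) and per-head accounting with GPT-2's head dimension of 64. The per-head framing is an assumption made for illustration.

```python
HEAD_DIM = 64      # GPT-2 per-head dimension (768 hidden / 12 heads)
FP16_BYTES = 2     # bytes per FP16 element

def key_bytes_fp16(head_dim=HEAD_DIM):
    """Bytes for one uncompressed FP16 key vector (one head, one token)."""
    return head_dim * FP16_BYTES

def key_bytes_pq(m):
    """Bytes for one PQ-coded key, assuming one uint8 index per subspace."""
    return m

def compression_ratio(m, head_dim=HEAD_DIM):
    return key_bytes_fp16(head_dim) / key_bytes_pq(m)

for m in (1, 2, 4, 8, 16):
    print(f"m={m:2d}: {key_bytes_pq(m):2d} bytes/key, "
          f"{compression_ratio(m):.0f}x vs FP16")
```

Under these assumptions, m=2 gives 2 bytes per key against 128 bytes in FP16, which matches the 64× figure quoted for the key cache.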
Infra Optimization
- suitable for low-bandwidth edge DRAM; reduces DRAM traffic for attention
Model Optimization
- product quantization of key vectors
- asymmetric distance computation for scoring
System Optimization
- shifts the bottleneck from memory bandwidth to a small amount of compute and memory for the lookup tables
- requires optimized lookup kernels on target hardware
Training Optimization
- no retraining required (codebooks are fit post hoc on calibration data)
Inference Optimization
- precompute lookup tables per query (m × 256 dot-products)
- replace per-key FP16 loads with byte indices + lookups
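The two steps above can be sketched as follows: build an m × 256 table of query-to-centroid dot-products once per query, then score each cached key by summing m table entries selected by its byte indices. The codebooks and key codes here are random placeholders, the helper names are hypothetical, and the 1/sqrt(d) scaling and softmax are left out for brevity.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_lookup_table(query, codebooks):
    """table[j][c] = <j-th query sub-vector, centroid c of codebook j>."""
    m = len(codebooks)
    step = len(query) // m
    return [[dot(query[j * step:(j + 1) * step], centroid)
             for centroid in codebooks[j]]
            for j in range(m)]

def adc_scores(table, key_codes):
    """Score each key by m table lookups; no key is ever dequantized."""
    return [sum(table[j][idx] for j, idx in enumerate(code))
            for code in key_codes]

# Toy setup (assumed sizes): d=64, m=4 subspaces, 256 centroids each.
rng = random.Random(0)
M, K, D = 4, 256, 64
SUB = D // M
codebooks = [[[rng.gauss(0, 1) for _ in range(SUB)] for _ in range(K)]
             for _ in range(M)]
key_codes = [[rng.randrange(K) for _ in range(M)] for _ in range(8)]
query = [rng.gauss(0, 1) for _ in range(D)]

table = build_lookup_table(query, codebooks)   # M x K dot-products, once
scores = adc_scores(table, key_codes)          # M lookups + adds per key

# Reference: dequantize each key and take the full dot-product.
def reconstruct(code):
    return [x for j, idx in enumerate(code) for x in codebooks[j][idx]]

reference = [dot(query, reconstruct(code)) for code in key_codes]
print(max(abs(s - r) for s, r in zip(scores, reference)))
```

The lookup scores equal the dot-products against the dequantized keys up to floating-point rounding, so the only approximation error comes from the quantization itself, not from the table-based scoring.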
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Only compresses keys; values remain FP16 so total memory savings are partial.
- Codebook quality depends on calibration data and domain; results reported on three small sample types.
- Needs hardware-optimized lookup kernels on NPUs/DSPs to realize latency gains.
- Experiments use only the first layer of GPT-2 and small sample counts; behavior on larger models remains untested.
When Not To Use
- When you need exact attention magnitudes rather than relative ordering.
- If your hardware cannot support fast table lookups or optimized byte-index kernels.
- When you also need value compression, which LOOKAT does not yet provide.
Failure Modes
- Long contexts degrade fidelity (cosine drops ~10% at 1024 tokens).
- High KL divergence in some samples can alter attention mass distribution.
- Poor codebook fit (mismatch between calibration and runtime data) reduces accuracy.
Core Entities
Models
- GPT-2
Metrics
- Cosine Similarity
- KL Divergence
- Spearman Rank Correlation
- Accuracy
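For reference, three of these metrics can be computed with a few lines of standard-library Python. This is a minimal sketch: the Spearman formula below assumes no tied values, and the direction of the KL divergence (which distribution counts as the reference) is an assumption, since the summary does not specify it.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two output vectors."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two attention distributions; p is the reference."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def ranks(values):
    """0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([1, 2, 3], [10, 20, 30]))   # identical ordering
print(spearman([1, 2, 3], [3, 2, 1]))      # reversed ordering
```

If scores contain ties, a tie-aware implementation such as average ranks followed by Pearson correlation would be needed instead.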
Datasets
- natural language prose (samples)
- Python source code (samples)
- technical mixed text (samples)

