Overview
Production Readiness
0.8
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
6
Why It Matters For Business
If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.
Summary TLDR
LLM batch queries over tables are slow and costly because repeated prompt prefixes are recomputed. The paper introduces algorithms that reorder rows and fields to maximize key-value (KV) cache prefix reuse in LLM serving. The practical, scalable solver (GGR) uses functional dependencies and table stats, runs in <15s on 30K-row tables, and in experiments yields 1.5–3.4× faster end-to-end runtimes and up to 32% measured cost savings on proprietary APIs while preserving accuracy.
Problem Statement
Running an LLM per row on large tables is slow and expensive. Existing prompt caching helps but typical table/field orderings lead to low cache hits. We need algorithms that reorder rows and fields to maximize prefix reuse in the LLM KV cache for relational analytics workloads.
Main Contribution
Show that per-row reordering of fields and rows can dramatically increase LLM prompt prefix reuse in relational workloads.
Introduce OPHR, an optimal recursive algorithm to maximize prefix hit count (exponential, impractical on large tables).
Introduce GGR, a practical greedy recursive solver that uses functional dependencies and table statistics to approximate OPHR quickly.
Implement the approach in PySpark with vLLM and evaluate on 16 LLM queries across 7 datasets, showing sizable latency and cost gains while keeping accuracy similar.
Key Findings
GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.
GGR yields measured cost savings up to 32% on OpenAI GPT-4o-mini and 21% on Anthropic Sonnet for FEVER-like workloads.
GGR raises prefix hit rates (PHR) massively: examples include Movies 35%→86%, Products 27%→83%, Beer 50%→80%.
Ordering changes do not meaningfully hurt accuracy for most models; median accuracy differences are within ±5% across tests.
GGR achieves near-optimal PHR (within ~2% of OPHR) on small samples while being orders of magnitude faster.
Solver runtime is small: GGR runs in under 15 seconds for datasets up to 30K rows and 57 fields.
Results
End-to-end latency (vs Cache Original)
End-to-end latency (vs No Cache)
Measured API cost savings
PHR (Prefix hit rate) examples
GGR solver time
Accuracy
Who Should Care
What To Try In 7 Days
Measure current prefix hit rate (PHR) on a representative batch.
Run GGR on a small sample and compare latency, PHR and accuracy.
If caching providers are used, simulate cost with new PHR to estimate savings (use provider pricing).
Optimization Features
Token Efficiency
- maximize prefix hit count (PHC)
- per-row field reordering to convert misses into reusable prefixes
Infra Optimization
- reduces memory and compute by reusing KV cache, enabling larger batches
System Optimization
- leverage functional dependencies to reduce search
- use table statistics and early stopping to limit solver work
Inference Optimization
- prompt caching / prefix reuse
- request reordering (row- and field-level)
- batch scheduling to maximize KV cache hits
Reproducibility
Data Urls
- Rotten Tomatoes Movies, Amazon Product Reviews, BIRD, PDMX, RateBeer, SQuAD, FEVER (referenced in paper)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires offline/batch knowledge of all requests (not suitable for streaming/live queries).
- Assumes exact cell-value matches for cache hits; substring or semantic matches not counted.
- Provider caching minimums (e.g., OpenAI 1,024 tokens) can limit gains without synthetic duplication.
- OPHR is exponential and impractical for large tables; GGR is heuristic and can be suboptimal in tie cases.
- Benefit depends on data repetitiveness and on whether prefill (not decode) dominates runtime.
When Not To Use
- Streaming workloads where requests arrive online and cannot be globally reordered.
- Workloads where decode time dominates (very long outputs) and prefix reuse gives little net gain.
- Situations where field order must be preserved for semantic or regulatory reasons.
- When provider caching policies do not offer discounts or enforce long minimum prefix lengths.
Failure Modes
- Reordering changes prompt layout and can slightly change model outputs for some models.
- GGR may choose suboptimal groups in tie cases, missing global-optimal PHC.
- Solver runtime or memory could increase on extremely wide tables with high cardinality without adequate early stopping.
- Provider cache thresholds or billing details can reduce or eliminate expected cost savings.
Core Entities
Models
- Llama-3-8B-Instruct
- Llama-3-70B-Instruct
- Llama-3-2-1B
- GPT-4o-mini
- Anthropic Claude 3.5 Sonnet
- AlibabaNLP/gte-base-en-v1.5 (embeddings)
Metrics
- Prefix Hit Rate (PHR)
- Prefix Hit Count (PHC)
- End-to-end query latency
- API cost (OpenAI/Anthropic pricing)
- Accuracy
Datasets
- Movies (Rotten Tomatoes)
- Products (Amazon Reviews)
- BIRD
- PDMX (Public Domain MusicXML)
- Beer (RateBeer)
- SQuAD
- FEVER
Benchmarks
- 16-query LLM relational benchmark (T1-T5)
- Small-sample OPHR vs GGR comparisons (Appendix D.1)

