Overview
The paper provides a practical algorithm, Spark implementation, and multiple real-dataset experiments showing consistent latency and cost gains; evaluation covers model sizes and cost models but relies on specific dataset repetitiveness and provider caching policies.
Citations6
Evidence Strength0.80
Confidence0.90
Risk Signals13
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/6
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 80%
Novelty: 60%
Why It Matters For Business
If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.
Who Should Care
Summary TLDR
LLM batch queries over tables are slow and costly because repeated prompt prefixes are recomputed. The paper introduces algorithms that reorder rows and fields to maximize key-value (KV) cache prefix reuse in LLM serving. The practical, scalable solver (GGR) uses functional dependencies and table stats, runs in <15s on 30K-row tables, and in experiments yields 1.5–3.4× faster end-to-end runtimes and up to 32% measured cost savings on proprietary APIs while preserving accuracy.
Problem Statement
Running an LLM per row on large tables is slow and expensive. Existing prompt caching helps but typical table/field orderings lead to low cache hits. We need algorithms that reorder rows and fields to maximize prefix reuse in the LLM KV cache for relational analytics workloads.
Main Contribution
Show that per-row reordering of fields and rows can dramatically increase LLM prompt prefix reuse in relational workloads.
Introduce OPHR, an optimal recursive algorithm to maximize prefix hit count (exponential, impractical on large tables).
Key Findings
GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.
GGR yields measured cost savings up to 32% on OpenAI GPT-4o-mini and 21% on Anthropic Sonnet for FEVER-like workloads.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| End-to-end latency (vs Cache Original) | 1.5–3.4× faster | Cache (Original) | — | 16 queries across 7 datasets (Sec 6.2) | Cache(GGR) achieves 1.5–3.4× speedup over Cache (Original) | Sec 6.2, Fig 3/4 |
| End-to-end latency (vs No Cache) | 1.8–3.8× faster | No Cache | — | 16 queries across 7 datasets (Sec 6.2) | Cache(GGR) achieves 1.8–3.8× speedup over No Cache | Sec 6.2, Fig 3/4 |
What To Try In 7 Days
Measure current prefix hit rate (PHR) on a representative batch.
Run GGR on a small sample and compare latency, PHR and accuracy.
If caching providers are used, simulate cost with new PHR to estimate savings (use provider pricing).
Optimization Features
Token Efficiency
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Data URLs
Risks & Boundaries
Limitations
Requires offline/batch knowledge of all requests (not suitable for streaming/live queries).
Assumes exact cell-value matches for cache hits; substring or semantic matches not counted.
When Not To Use
Streaming workloads where requests arrive online and cannot be globally reordered.
Workloads where decode time dominates (very long outputs) and prefix reuse gives little net gain.
Failure Modes
Reordering changes prompt layout and can slightly change model outputs for some models.
GGR may choose suboptimal groups in tie cases, missing global-optimal PHC.

