Reorder table rows and fields to boost LLM prompt-cache reuse and cut latency/costs

Overview

Decision SnapshotNeeds Validation

The paper provides a practical algorithm, Spark implementation, and multiple real-dataset experiments showing consistent latency and cost gains; evaluation covers model sizes and cost models but relies on specific dataset repetitiveness and provider caching policies.

Citations6

Evidence Strength0.80

Confidence0.90

Risk Signals13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia

Links

Abstract / PDF / Data

Why It Matters For Business

If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.

Who Should Care

Product Manager CTO ML Engineer Engineering Lead Founder

Summary TLDR

LLM batch queries over tables are slow and costly because repeated prompt prefixes are recomputed. The paper introduces algorithms that reorder rows and fields to maximize key-value (KV) cache prefix reuse in LLM serving. The practical, scalable solver (GGR) uses functional dependencies and table stats, runs in <15s on 30K-row tables, and in experiments yields 1.5–3.4× faster end-to-end runtimes and up to 32% measured cost savings on proprietary APIs while preserving accuracy.

Problem Statement

Running an LLM per row on large tables is slow and expensive. Existing prompt caching helps but typical table/field orderings lead to low cache hits. We need algorithms that reorder rows and fields to maximize prefix reuse in the LLM KV cache for relational analytics workloads.

Main Contribution

Show that per-row reordering of fields and rows can dramatically increase LLM prompt prefix reuse in relational workloads.

Introduce OPHR, an optimal recursive algorithm to maximize prefix hit count (exponential, impractical on large tables).

Key Findings

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

Numbers1.5–3.4× speedup (Sec 6.2; Fig 3/4)

Practical UseReordering rows/fields before batching LLM calls can cut runtime by up to ~3× on similar relational analytics jobs; try it before adding hardware.

Evidence RefSec 6.2, Fig 3/4

GGR yields measured cost savings up to 32% on OpenAI GPT-4o-mini and 21% on Anthropic Sonnet for FEVER-like workloads.

Numbers32% saved (OpenAI), 21% saved (Anthropic) on FEVER (Table 3)

Practical UseIf your provider supports prompt/prefix caching and you batch offline queries, reordering can meaningfully lower API spending.

Evidence RefTable 3, Sec 6.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
End-to-end latency (vs Cache Original)	1.5–3.4× faster	Cache (Original)	—	16 queries across 7 datasets (Sec 6.2)	Cache(GGR) achieves 1.5–3.4× speedup over Cache (Original)	Sec 6.2, Fig 3/4
End-to-end latency (vs No Cache)	1.8–3.8× faster	No Cache	—	16 queries across 7 datasets (Sec 6.2)	Cache(GGR) achieves 1.8–3.8× speedup over No Cache	Sec 6.2, Fig 3/4

What To Try In 7 Days

Measure current prefix hit rate (PHR) on a representative batch.

Run GGR on a small sample and compare latency, PHR and accuracy.

If caching providers are used, simulate cost with new PHR to estimate savings (use provider pricing).

Optimization Features

Token Efficiency

maximize prefix hit count (PHC)per-row field reordering to convert misses into reusable prefixes

Infra Optimization

reduces memory and compute by reusing KV cache, enabling larger batches

System Optimization

leverage functional dependencies to reduce searchuse table statistics and early stopping to limit solver work

Inference Optimization

prompt caching / prefix reuserequest reordering (row- and field-level)batch scheduling to maximize KV cache hits

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Data URLs

Rotten Tomatoes Movies, Amazon Product Reviews, BIRD, PDMX, RateBeer, SQuAD, FEVER (referenced in paper)

Risks & Boundaries

Limitations

Requires offline/batch knowledge of all requests (not suitable for streaming/live queries).

Assumes exact cell-value matches for cache hits; substring or semantic matches not counted.

When Not To Use

Streaming workloads where requests arrive online and cannot be globally reordered.

Workloads where decode time dominates (very long outputs) and prefix reuse gives little net gain.

Failure Modes

Reordering changes prompt layout and can slightly change model outputs for some models.

GGR may choose suboptimal groups in tie cases, missing global-optimal PHC.

Core Entities

Models

Llama-3-8B-InstructLlama-3-70B-InstructLlama-3-2-1BGPT-4o-miniAnthropic Claude 3.5 SonnetAlibabaNLP/gte-base-en-v1.5 (embeddings)

Metrics

Prefix Hit Rate (PHR)Prefix Hit Count (PHC)End-to-end query latencyAPI cost (OpenAI/Anthropic pricing)Accuracy

Datasets

Movies (Rotten Tomatoes)Products (Amazon Reviews)BIRDPDMX (Public Domain MusicXML)Beer (RateBeer)SQuADFEVER

Benchmarks

16-query LLM relational benchmark (T1-T5)Small-sample OPHR vs GGR comparisons (Appendix D.1)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

GGR yields measured cost savings up to 32% on OpenAI GPT-4o-mini and 21% on Anthropic Sonnet for FEVER-like workloads.

Results

What To Try In 7 Days

Optimization Features

Reproducibility

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

You May Also Want to Read

Train agents to skip redundant thoughts and past observations to cut token cost while keeping accuracy

Key finding

D-MEM: dopamine-inspired memory router cuts token costs 80% and improves multi-hop reasoning

Key finding

BAVT: a training-free tree search that spends fewer tokens and tool calls to match or beat brute-force scaling

Key finding

Survey reframing LLM reasoning from fixed efficiency to input-aware adaptivity

Key finding

Pick the layer to prune based on attention stability, so token pruning adapts to task difficulty and keeps accuracy high

Key finding