Reorder table rows and fields to boost LLM prompt-cache reuse and cut latency/costs

March 9, 20249 min

Overview

Decision SnapshotNeeds Validation

The paper provides a practical algorithm, Spark implementation, and multiple real-dataset experiments showing consistent latency and cost gains; evaluation covers model sizes and cost models but relies on specific dataset repetitiveness and provider caching policies.

Citations6

Evidence Strength0.80

Confidence0.90

Risk Signals13

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 60%

Authors

Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia

Links

Abstract / PDF / Data

Why It Matters For Business

If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.

Who Should Care

Summary TLDR

LLM batch queries over tables are slow and costly because repeated prompt prefixes are recomputed. The paper introduces algorithms that reorder rows and fields to maximize key-value (KV) cache prefix reuse in LLM serving. The practical, scalable solver (GGR) uses functional dependencies and table stats, runs in <15s on 30K-row tables, and in experiments yields 1.5–3.4× faster end-to-end runtimes and up to 32% measured cost savings on proprietary APIs while preserving accuracy.

Problem Statement

Running an LLM per row on large tables is slow and expensive. Existing prompt caching helps but typical table/field orderings lead to low cache hits. We need algorithms that reorder rows and fields to maximize prefix reuse in the LLM KV cache for relational analytics workloads.

Main Contribution

Show that per-row reordering of fields and rows can dramatically increase LLM prompt prefix reuse in relational workloads.

Introduce OPHR, an optimal recursive algorithm to maximize prefix hit count (exponential, impractical on large tables).

Key Findings

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

Numbers1.53.4× speedup (Sec 6.2; Fig 3/4)

Practical UseReordering rows/fields before batching LLM calls can cut runtime by up to ~3× on similar relational analytics jobs; try it before adding hardware.

Evidence RefSec 6.2, Fig 3/4

GGR yields measured cost savings up to 32% on OpenAI GPT-4o-mini and 21% on Anthropic Sonnet for FEVER-like workloads.

Numbers32% saved (OpenAI), 21% saved (Anthropic) on FEVER (Table 3)

Practical UseIf your provider supports prompt/prefix caching and you batch offline queries, reordering can meaningfully lower API spending.

Evidence RefTable 3, Sec 6.3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
End-to-end latency (vs Cache Original)1.53.4× fasterCache (Original)16 queries across 7 datasets (Sec 6.2)Cache(GGR) achieves 1.5–3.4× speedup over Cache (Original)Sec 6.2, Fig 3/4
End-to-end latency (vs No Cache)1.83.8× fasterNo Cache16 queries across 7 datasets (Sec 6.2)Cache(GGR) achieves 1.8–3.8× speedup over No CacheSec 6.2, Fig 3/4

What To Try In 7 Days

Measure current prefix hit rate (PHR) on a representative batch.

Run GGR on a small sample and compare latency, PHR and accuracy.

If caching providers are used, simulate cost with new PHR to estimate savings (use provider pricing).

Optimization Features

Token Efficiency
maximize prefix hit count (PHC)per-row field reordering to convert misses into reusable prefixes
Infra Optimization
reduces memory and compute by reusing KV cache, enabling larger batches
System Optimization
leverage functional dependencies to reduce searchuse table statistics and early stopping to limit solver work
Inference Optimization
prompt caching / prefix reuserequest reordering (row- and field-level)batch scheduling to maximize KV cache hits

Reproducibility

Code AvailableNo
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

Rotten Tomatoes Movies, Amazon Product Reviews, BIRD, PDMX, RateBeer, SQuAD, FEVER (referenced in paper)

Risks & Boundaries

Limitations

Requires offline/batch knowledge of all requests (not suitable for streaming/live queries).

Assumes exact cell-value matches for cache hits; substring or semantic matches not counted.

When Not To Use

Streaming workloads where requests arrive online and cannot be globally reordered.

Workloads where decode time dominates (very long outputs) and prefix reuse gives little net gain.

Failure Modes

Reordering changes prompt layout and can slightly change model outputs for some models.

GGR may choose suboptimal groups in tie cases, missing global-optimal PHC.

Core Entities

Models

Llama-3-8B-InstructLlama-3-70B-InstructLlama-3-2-1BGPT-4o-miniAnthropic Claude 3.5 SonnetAlibabaNLP/gte-base-en-v1.5 (embeddings)

Metrics

Prefix Hit Rate (PHR)Prefix Hit Count (PHC)End-to-end query latencyAPI cost (OpenAI/Anthropic pricing)Accuracy

Datasets

Movies (Rotten Tomatoes)Products (Amazon Reviews)BIRDPDMX (Public Domain MusicXML)Beer (RateBeer)SQuADFEVER

Benchmarks

16-query LLM relational benchmark (T1-T5)Small-sample OPHR vs GGR comparisons (Appendix D.1)