Reorder table rows and fields to boost LLM prompt-cache reuse and cut latency/costs

March 9, 20249 min

Overview

Production Readiness

0.8

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

6

Authors

Shu Liu, Asim Biswal, Amog Kamsetty, Audrey Cheng, Luis Gaspar Schroeder, Liana Patel, Shiyi Cao, Xiangxi Mo, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia

Links

Abstract / PDF

Why It Matters For Business

If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.

Summary TLDR

LLM batch queries over tables are slow and costly because repeated prompt prefixes are recomputed. The paper introduces algorithms that reorder rows and fields to maximize key-value (KV) cache prefix reuse in LLM serving. The practical, scalable solver (GGR) uses functional dependencies and table stats, runs in <15s on 30K-row tables, and in experiments yields 1.5–3.4× faster end-to-end runtimes and up to 32% measured cost savings on proprietary APIs while preserving accuracy.

Problem Statement

Running an LLM per row on large tables is slow and expensive. Existing prompt caching helps but typical table/field orderings lead to low cache hits. We need algorithms that reorder rows and fields to maximize prefix reuse in the LLM KV cache for relational analytics workloads.

Main Contribution

Show that per-row reordering of fields and rows can dramatically increase LLM prompt prefix reuse in relational workloads.

Introduce OPHR, an optimal recursive algorithm to maximize prefix hit count (exponential, impractical on large tables).

Introduce GGR, a practical greedy recursive solver that uses functional dependencies and table statistics to approximate OPHR quickly.

Implement the approach in PySpark with vLLM and evaluate on 16 LLM queries across 7 datasets, showing sizable latency and cost gains while keeping accuracy similar.

Key Findings

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

Numbers1.5–3.4× speedup (Sec 6.2; Fig 3/4)

GGR yields measured cost savings up to 32% on OpenAI GPT-4o-mini and 21% on Anthropic Sonnet for FEVER-like workloads.

Numbers32% saved (OpenAI), 21% saved (Anthropic) on FEVER (Table 3)

GGR raises prefix hit rates (PHR) massively: examples include Movies 35%→86%, Products 27%→83%, Beer 50%→80%.

NumbersPHR increases of 30–75 points (Table 2)

Ordering changes do not meaningfully hurt accuracy for most models; median accuracy differences are within ±5% across tests.

Numbers<=5% median accuracy change except FEVER with Llama3-8B (+14.2%) (Fig 6)

GGR achieves near-optimal PHR (within ~2% of OPHR) on small samples while being orders of magnitude faster.

NumbersGGR within 0–1.3% PHR diff; OPHR solver times hours vs GGR <1s (Table 6)

Solver runtime is small: GGR runs in under 15 seconds for datasets up to 30K rows and 57 fields.

NumbersSolver times 1.2–12.6s across datasets; all <15s (Table 5)

Results

End-to-end latency (vs Cache Original)

Value1.5–3.4× faster

BaselineCache (Original)

End-to-end latency (vs No Cache)

Value1.8–3.8× faster

BaselineNo Cache

Measured API cost savings

Value32% (OpenAI GPT-4o-mini) on FEVER

BaselineOriginal ordering

PHR (Prefix hit rate) examples

ValueMovies 35%→86%, Products 27%→83%, Beer 50%→80%

BaselineOriginal ordering

GGR solver time

Value<15 s

Accuracy

ValueWithin ±5% median difference for most setups; one case +14.2%

BaselineOriginal ordering

Who Should Care

What To Try In 7 Days

Measure current prefix hit rate (PHR) on a representative batch.

Run GGR on a small sample and compare latency, PHR and accuracy.

If caching providers are used, simulate cost with new PHR to estimate savings (use provider pricing).

Optimization Features

Token Efficiency

  • maximize prefix hit count (PHC)
  • per-row field reordering to convert misses into reusable prefixes

Infra Optimization

  • reduces memory and compute by reusing KV cache, enabling larger batches

System Optimization

  • leverage functional dependencies to reduce search
  • use table statistics and early stopping to limit solver work

Inference Optimization

  • prompt caching / prefix reuse
  • request reordering (row- and field-level)
  • batch scheduling to maximize KV cache hits

Reproducibility

Data Urls

  • Rotten Tomatoes Movies, Amazon Product Reviews, BIRD, PDMX, RateBeer, SQuAD, FEVER (referenced in paper)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires offline/batch knowledge of all requests (not suitable for streaming/live queries).
  • Assumes exact cell-value matches for cache hits; substring or semantic matches not counted.
  • Provider caching minimums (e.g., OpenAI 1,024 tokens) can limit gains without synthetic duplication.
  • OPHR is exponential and impractical for large tables; GGR is heuristic and can be suboptimal in tie cases.
  • Benefit depends on data repetitiveness and on whether prefill (not decode) dominates runtime.

When Not To Use

  • Streaming workloads where requests arrive online and cannot be globally reordered.
  • Workloads where decode time dominates (very long outputs) and prefix reuse gives little net gain.
  • Situations where field order must be preserved for semantic or regulatory reasons.
  • When provider caching policies do not offer discounts or enforce long minimum prefix lengths.

Failure Modes

  • Reordering changes prompt layout and can slightly change model outputs for some models.
  • GGR may choose suboptimal groups in tie cases, missing global-optimal PHC.
  • Solver runtime or memory could increase on extremely wide tables with high cardinality without adequate early stopping.
  • Provider cache thresholds or billing details can reduce or eliminate expected cost savings.

Core Entities

Models

  • Llama-3-8B-Instruct
  • Llama-3-70B-Instruct
  • Llama-3-2-1B
  • GPT-4o-mini
  • Anthropic Claude 3.5 Sonnet
  • AlibabaNLP/gte-base-en-v1.5 (embeddings)

Metrics

  • Prefix Hit Rate (PHR)
  • Prefix Hit Count (PHC)
  • End-to-end query latency
  • API cost (OpenAI/Anthropic pricing)
  • Accuracy

Datasets

  • Movies (Rotten Tomatoes)
  • Products (Amazon Reviews)
  • BIRD
  • PDMX (Public Domain MusicXML)
  • Beer (RateBeer)
  • SQuAD
  • FEVER

Benchmarks

  • 16-query LLM relational benchmark (T1-T5)
  • Small-sample OPHR vs GGR comparisons (Appendix D.1)