Precompute table-level KV caches (guided by primary–foreign keys) to cut Text-to-SQL prefill latency by up to 3.62× while keeping accuracy nearly unchanged.

Overview

Decision Snapshot: Ready For Pilot

The system integrates with common inference engines and is evaluated on public benchmarks. Results show a strong latency reduction, with ablations that explain the gains, but the evaluation is limited to Text-to-SQL workloads and tuned models.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang

Why It Matters For Business

TableCache cuts Text-to‑SQL response latency by precomputing and reusing table caches, improving user experience and lowering repeated GPU compute costs in applications where users query shared tables.

Who Should Care

CTO, Engineering Lead, ML Engineer, Product Manager, Data Scientist

Summary TLDR

TableCache precomputes key-value (KV) cache entries per database table offline, preserving primary–foreign key (PFK) attention between related tables. It stores these table caches in a CPU-resident Table Trie and loads only the needed caches into GPU at inference. Combined with query reranking (to group similar table accesses) and a CPU→GPU prefetch pipeline, TableCache reduces Time To First Token (TTFT) substantially (reported up to 3.62×) while keeping SQL execution accuracy nearly unchanged. The method integrates with common serving engines (vLLM, SGLang) and targets Text-to‑SQL workloads with repeated table access patterns.

Problem Statement

LLM-based Text-to-SQL systems include large database schemas in their prompts, which makes prefill computation over the schema prefix slow. Existing KV cache reuse requires exact prefix matches, so it fails when the table order varies between queries, causing redundant cache recomputation and high latency. The paper seeks a practical way to reuse table-level caches across queries while retaining inter-table attention.
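
The order sensitivity described above can be illustrated with a toy example. This is only a sketch: hashes stand in for KV caches, the two `CREATE TABLE` strings are made up, and real serving engines match prefixes over token blocks rather than whole strings.

```python
import hashlib

def prefix_key(tables):
    """Key for an exact-prefix cache: depends on the order of the schema text."""
    return hashlib.sha256("\n".join(tables).encode()).hexdigest()

def table_keys(tables):
    """Keys for a table-level cache: one key per table, independent of order."""
    return frozenset(hashlib.sha256(t.encode()).hexdigest() for t in tables)

# Hypothetical schema snippets; the same two tables in different order.
schema_a = ["CREATE TABLE singer(id, name)", "CREATE TABLE concert(id, singer_id)"]
schema_b = list(reversed(schema_a))

prefix_hit = prefix_key(schema_a) == prefix_key(schema_b)  # order changed, so no reuse
table_hit = table_keys(schema_a) == table_keys(schema_b)   # same tables, so full reuse
print(prefix_hit, table_hit)
```

Keying by table rather than by whole prefix is what makes reuse possible when only the order differs; preserving attention between related tables is the part this toy example does not model.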

Main Contribution

TableCache: offline precomputation of per-table KV caches that preserve primary–foreign key attention relationships.

Table Trie: a token-level trie for fast table-name matching to retrieve precomputed table caches at inference.
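
A minimal sketch of a token-level trie for table-name matching is shown below. It is an illustration, not the paper's implementation: `toy_tokenize` is a hypothetical stand-in for the model tokenizer, the table names are made up, and the longest-match rule is an assumption.

```python
import re

_END = object()  # sentinel key marking the end of a table name

def toy_tokenize(text):
    """Hypothetical tokenizer: lowercase alphanumeric runs (not a real LLM tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

class TableTrie:
    def __init__(self):
        self.root = {}

    def insert(self, name_tokens, table_id):
        node = self.root
        for tok in name_tokens:
            node = node.setdefault(tok, {})
        node[_END] = table_id

    def match(self, tokens):
        """Return matched table IDs in order of first appearance (longest match wins)."""
        found, i = [], 0
        while i < len(tokens):
            node, j, best = self.root, i, None
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if _END in node:
                    best = (node[_END], j)
            if best is None:
                i += 1
                continue
            if best[0] not in found:
                found.append(best[0])
            i = best[1]
        return found

trie = TableTrie()
for name in ["singer", "concert", "singer_in_concert"]:
    trie.insert(toy_tokenize(name), name)

query = "List each singer and rows of singer_in_concert for every concert"
matched = trie.match(toy_tokenize(query))
print(matched)  # ['singer', 'singer_in_concert', 'concert']
```

The returned order can then be used to fetch the precomputed caches in the order the tables appear in the prompt.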

Key Findings

TableCache greatly reduces prefill latency (TTFT) on Text-to-SQL benchmarks.

Numbers: up to 3.62× TTFT speedup (reported max)

Practical Use: Precompute and reuse table caches to cut user-visible response time for multi-table Text-to-SQL workloads.

Evidence Ref: Abstract, Sec. 5.3

Accuracy is preserved for tuned Text‑to‑SQL models.

Numbers: accuracy gaps within ≈1% for adapted models

Practical Use: For models already adapted to block-wise caches, TableCache can be deployed with accuracy within about 1% of the baseline.

Evidence Ref: Sec. 5.3 (near-lossless performance)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
TTFT (Time To First Token)	36.23 s (TableCache on Spider dev)	≈98.13 s (baseline transformers on Spider dev)	≈ -61.9 s (~2.7–3.6× reduction vs baselines)	Spider dev	Table 2; Sec.5.3	Table 2, Sec.5.3
Accuracy	76.9% (TableCache on Spider, tuned backbone)	72.0% (w/o PFK-guided representation in ablation)	+4.9 pp (absolute)	Spider (PFTR ablation, Table 3)	Table 3 (PFTR ablation)	Table 3
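
The deltas in the table can be re-derived from the reported values. The short check below uses only the numbers in the two rows above; the variable names are illustrative.

```python
# Values copied from the Results table (Spider dev).
baseline_ttft_s = 98.13
tablecache_ttft_s = 36.23

saved_s = baseline_ttft_s - tablecache_ttft_s   # absolute TTFT reduction in seconds
speedup = baseline_ttft_s / tablecache_ttft_s   # ratio against this one baseline

# Accuracy ablation: with vs. without the PFK-guided representation.
acc_gain_pp = 76.9 - 72.0

print(round(saved_s, 2), round(speedup, 2), round(acc_gain_pp, 1))  # 61.9 2.71 4.9
```

Against this particular baseline the ratio is about 2.71×, the low end of the quoted range. The 3.62× figure is the reported maximum and does not follow from these two numbers.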

What To Try In 7 Days

Profile your Text-to‑SQL traffic to find hot tables and repeated table sets.

Precompute table-level KV caches and store them in CPU memory for quick lookup.

Build a simple Table Trie to map tokenized queries to table IDs and fetch caches by match order.
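
The first step above, profiling traffic, could start from something as small as the sketch below. It assumes a hypothetical query log that already lists the tables each query touched; how that list is obtained (for example, by parsing the generated SQL) is left open.

```python
from collections import Counter

def profile_table_access(query_log):
    """Count how often each table and each table set appears in the log."""
    table_counts = Counter()
    table_set_counts = Counter()
    for tables in query_log:
        unique = frozenset(tables)       # ignore order and duplicates
        table_counts.update(unique)
        table_set_counts[unique] += 1
    return table_counts, table_set_counts

# Hypothetical log: one list of table names per query.
log = [["singer", "concert"], ["concert", "singer"], ["stadium"], ["singer"]]
table_counts, table_set_counts = profile_table_access(log)

print(table_counts.most_common(1))      # [('singer', 3)]
print(table_set_counts.most_common(1))
```

Tables and table sets at the top of these counters are the natural first candidates for precomputation.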

Optimization Features

Token Efficiency

block-wise encoding reduces attention span per chunk

Infra Optimization

store KV caches in CPU memory and swap to GPU on demand

overlap CPU↔GPU transfers with GPU compute
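
The overlap of transfer and compute can be sketched with a one-step-ahead prefetch loop. This is a simplified illustration under stated assumptions: a background thread stands in for an asynchronous CPU→GPU copy, and `load_fn` and `compute_fn` are hypothetical callables, not APIs of any serving engine.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_prefetch(batches, load_fn, compute_fn):
    """While batch i is computed, the caches for batch i+1 are loaded in the background."""
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_fn, batches[0])
        for i, batch in enumerate(batches):
            loaded = pending.result()                 # wait for this batch's caches
            if i + 1 < len(batches):
                pending = pool.submit(load_fn, batches[i + 1])  # start the next load
            results.append(compute_fn(batch, loaded))  # compute overlaps with that load
    return results

# Stub functions standing in for the cache transfer and the model forward pass.
outputs = run_with_prefetch(
    ["q1", "q2", "q3"],
    load_fn=lambda b: f"kv:{b}",
    compute_fn=lambda b, kv: (b, kv),
)
print(outputs)  # [('q1', 'kv:q1'), ('q2', 'kv:q2'), ('q3', 'kv:q3')]
```

On real hardware the same pattern would typically rely on pinned host memory and a separate copy stream rather than a Python thread.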

Model Optimization

adaptive fine-tuning with attention mask (3 epochs)

System Optimization

Table Trie for fast table matching

GPU eviction policies (LRU/FIFO/LFU)
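
Of the eviction policies listed, LRU is the simplest to sketch. The class below is a hypothetical, size-aware LRU for GPU-resident table caches; the class name, the byte accounting, and the string placeholders for KV tensors are all assumptions made for illustration.

```python
from collections import OrderedDict

class LRUTableCache:
    """Keeps table KV caches under a byte budget, evicting the least recently used."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # table_id -> (kv, size_bytes)

    def get(self, table_id):
        if table_id not in self._entries:
            return None
        self._entries.move_to_end(table_id)  # mark as most recently used
        return self._entries[table_id][0]

    def put(self, table_id, kv, size_bytes):
        """Insert a cache entry and return the IDs evicted to make room."""
        if size_bytes > self.capacity_bytes:
            raise ValueError("entry larger than cache capacity")
        if table_id in self._entries:
            self.used_bytes -= self._entries.pop(table_id)[1]
        evicted = []
        while self.used_bytes + size_bytes > self.capacity_bytes:
            old_id, (_, old_size) = self._entries.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_id)
        self._entries[table_id] = (kv, size_bytes)
        self.used_bytes += size_bytes
        return evicted

cache = LRUTableCache(capacity_bytes=100)
cache.put("singer", "kv_singer", 40)
cache.put("concert", "kv_concert", 40)
cache.get("singer")                                 # singer becomes most recent
evicted = cache.put("stadium", "kv_stadium", 40)    # concert is now the oldest
print(evicted)  # ['concert']
```

FIFO would differ only by dropping the `move_to_end` call in `get`; LFU would need a per-entry access counter instead of recency order.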

Training Optimization

attention-mask training to adapt to block-wise KV cache

Inference Optimization

offline precomputed table KV caches

query reranking to group similar table accesses

computation loading pipeline to prefetch caches during compute
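
Query reranking can be illustrated with a greedy ordering by table overlap. This summary does not describe the paper's reranking algorithm, so the sketch below substitutes a simple Jaccard nearest-neighbour heuristic; the query IDs, table names, and the one-query-deep cache model in `count_loads` are assumptions.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def rerank_by_table_overlap(queries):
    """Greedy ordering: each next query shares the most tables with the previous one."""
    remaining = list(queries)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        prev_tables = ordered[-1][1]
        best = max(range(len(remaining)),
                   key=lambda i: jaccard(prev_tables, remaining[i][1]))
        ordered.append(remaining.pop(best))
    return ordered

def count_loads(queries):
    """Tables loaded if only the previous query's tables stay resident."""
    loads, resident = 0, set()
    for _, tables in queries:
        loads += len(tables - resident)
        resident = set(tables)
    return loads

batch = [
    ("q1", {"singer", "concert"}),
    ("q2", {"stadium"}),
    ("q3", {"concert", "singer"}),
    ("q4", {"stadium", "concert"}),
]
reranked = rerank_by_table_overlap(batch)
print([qid for qid, _ in reranked])              # ['q1', 'q3', 'q4', 'q2']
print(count_loads(batch), count_loads(reranked))  # 6 3
```

In this toy batch the reordering halves the number of cache loads, which is the effect grouping similar table accesses is meant to achieve.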

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Unknown

License: Unknown

Data URLs

Spider dataset

BIRD dataset

Risks & Boundaries

Limitations

Designed and evaluated only for Text‑to‑SQL; extension to unstructured QA is non-trivial.

Requires static or slowly changing schemas; frequent schema changes raise precompute cost.

When Not To Use

When schemas change rapidly and precomputation cannot be amortized.

For open-ended language tasks without clear table boundaries.

Failure Modes

If the primary–foreign key graph is incomplete, inter-table attention reconstruction may be incorrect.

Mismatched tokenization or table-naming variants can reduce Table Trie match rates and lower hit ratio.
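
One way to catch the first failure mode early is a sanity check of the key graph before precomputation. The sketch below is a hypothetical helper, not part of TableCache: it flags foreign keys that point at unknown tables or non-key columns, and lists tables with no valid key edge so they can be reviewed by hand. A missing edge between two otherwise linked tables cannot be detected this way.

```python
def check_pfk_graph(tables, foreign_keys):
    """tables: name -> set of primary-key columns.
    foreign_keys: (child_table, child_col, parent_table, parent_col) tuples."""
    dangling = [
        fk for fk in foreign_keys
        if fk[0] not in tables or fk[2] not in tables or fk[3] not in tables[fk[2]]
    ]
    linked = set()
    for fk in foreign_keys:
        if fk not in dangling:
            linked.update((fk[0], fk[2]))
    isolated = sorted(t for t in tables if t not in linked)
    return dangling, isolated

# Hypothetical schema metadata.
tables = {"singer": {"id"}, "concert": {"id"}, "stadium": {"id"}}
foreign_keys = [
    ("concert", "singer_id", "singer", "id"),
    ("concert", "venue_id", "venue", "id"),   # parent table is not in the schema
]
dangling, isolated = check_pfk_graph(tables, foreign_keys)
print(dangling)  # [('concert', 'venue_id', 'venue', 'id')]
print(isolated)  # ['stadium']
```

An isolated table is not necessarily an error, but in a schema where most tables are joined it is a cheap signal that a key declaration may be missing.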

Core Entities

Models

OmniSQL-7B, Qwen2.5-7B-Coder

Metrics

TTFT (Time To First Token), Accuracy

Datasets

Spider, BIRD

Benchmarks

Spider dev, BIRD dev
