Precompute table-level KV caches (guided by primary–foreign keys) to cut Text-to‑SQL prefill latency by up to 3.62× while preserving accuracy.

January 13, 2026 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The system integrates with common inference engines and uses public benchmarks. Results show strong latency reduction and ablations that explain gains, but evaluation is limited to Text-to‑SQL workloads and tuned models.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Partial assets available

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang

Why It Matters For Business

TableCache cuts Text-to‑SQL response latency by precomputing and reusing table caches, improving user experience and lowering repeated GPU compute costs in applications where users query shared tables.

Summary TLDR

TableCache precomputes key-value (KV) cache entries for each database table offline, preserving primary–foreign key (PFK) attention between related tables. It stores these table caches in a CPU-resident Table Trie and loads only the needed caches into GPU memory at inference time. Combined with query reranking (to group similar table accesses) and a CPU→GPU prefetch pipeline, TableCache substantially reduces Time To First Token (TTFT), with a reported speedup of up to 3.62×, while keeping SQL execution accuracy nearly unchanged. The method integrates with common serving engines (vLLM, SGLang) and targets Text-to‑SQL workloads with repeated table access patterns.

Problem Statement

LLM-based Text-to‑SQL systems include large database schemas in prompts. This makes prefill (prefix) computation long and slow. Existing KV cache reuse needs exact prefix matches and fails when table order varies, causing redundant cache recomputation and high latency. The paper seeks a practical way to reuse table-level cache across queries while retaining inter-table attention.
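The order-sensitivity problem can be illustrated with a toy example. This is a minimal sketch, not the paper's code: each schema block is treated as a single "token", and the prompt contents and the helper `shared_prefix_len` are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical prompts: one "token" per schema block to keep the example small.
system = ["<instructions>"]
prompt_a = system + ["table:singer", "table:concert", "table:stadium"]
prompt_b = system + ["table:concert", "table:singer", "table:stadium"]

# Exact-prefix reuse can only cover the part before the first difference.
reusable = shared_prefix_len(prompt_a, prompt_b)
print(f"reusable blocks: {reusable} of {len(prompt_b)}")
```

Both prompts contain the same three tables, yet only the leading instructions block is shared as a prefix, so a prefix-only cache would recompute every table block for the second prompt.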

Main Contribution

TableCache: offline precomputation of per-table KV caches that preserve primary–foreign key attention relationships.

Table Trie: a token-level trie for fast table-name matching to retrieve precomputed table caches at inference.
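A token-level trie of this kind might look like the following. This is a simplified sketch under stated assumptions: the class `TableTrie`, the `toy_tokenize` helper (a regex split standing in for the model's real tokenizer), and the `kv/...` cache IDs are all illustrative names, not taken from the paper.

```python
import re

class TableTrie:
    """Token-level trie mapping tokenized table names to cache IDs (illustrative)."""

    _END = object()  # sentinel key marking the end of a table name

    def __init__(self):
        self.root = {}

    def insert(self, tokens, cache_id):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[self._END] = cache_id

    def find_all(self, tokens):
        """Scan a token stream and return cache IDs in match order (longest match wins)."""
        hits, i = [], 0
        while i < len(tokens):
            node, j, best = self.root, i, None
            while j < len(tokens) and tokens[j] in node:
                node = node[tokens[j]]
                j += 1
                if self._END in node:
                    best = (node[self._END], j)
            if best is None:
                i += 1
            else:
                hits.append(best[0])
                i = best[1]  # resume after the matched name
        return hits

def toy_tokenize(text):
    # Assumption: a regex split stands in for the LLM tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

trie = TableTrie()
for name in ["singer", "concert", "singer_in_concert"]:
    trie.insert(toy_tokenize(name), cache_id=f"kv/{name}")

prompt = "CREATE TABLE singer_in_concert (...); CREATE TABLE stadium (...); CREATE TABLE singer (...)"
hits = trie.find_all(toy_tokenize(prompt))
print(hits)
```

Longest-match scanning keeps `singer_in_concert` from being split into separate `singer` and `concert` hits. Tables without a precomputed cache (`stadium` here) simply produce no hit and would fall back to normal prefill.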

Key Findings

TableCache substantially reduces prefill latency (TTFT) on Text-to‑SQL benchmarks.

Numbers: up to 3.62× TTFT speedup (reported max)

Practical Use: Precompute and reuse table caches to cut user-visible response time for multi-table Text-to‑SQL workloads.

Evidence Ref: Abstract, Sec. 5.3

Accuracy is preserved for tuned Text‑to‑SQL models.

Numbers: accuracy gaps within ≈1% for adapted models

Practical Use: For adapted (tuned) models, TableCache can be deployed with accuracy within about 1% of the baseline.

Evidence Ref: Sec. 5.3 (near-lossless performance)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| TTFT (Time To First Token) | 36.23 s (TableCache on Spider dev) | ≈98.13 s (baseline transformers on Spider dev) | ≈ -61.9 s (~2.7–3.6× reduction vs baselines) | Spider dev | Table 2; Sec. 5.3 | Table 2, Sec. 5.3 |
| Accuracy | 76.9% (TableCache on Spider, tuned backbone) | 72.0% (w/o PFK-guided representation in ablation) | +4.9 pp (absolute) | Spider (PFTR ablation, Table 3) | Table 3 (PFTR ablation) | Table 3 |
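The deltas in the results can be checked with a few lines of arithmetic. The sketch below uses only the reported Spider dev TTFT pair (36.23 s vs 98.13 s) and the ablation accuracies (76.9% vs 72.0%); the variable names are illustrative.

```python
# Reported values from the results above.
baseline_ttft_s = 98.13    # baseline transformers, Spider dev
tablecache_ttft_s = 36.23  # TableCache, Spider dev

delta_s = baseline_ttft_s - tablecache_ttft_s   # absolute time saved
speedup = baseline_ttft_s / tablecache_ttft_s   # ratio of baseline to TableCache
reduction_pct = 100 * delta_s / baseline_ttft_s # share of baseline time removed

acc_with_pfk = 76.9     # % accuracy, full TableCache
acc_without_pfk = 72.0  # % accuracy, ablation without PFK-guided representation
acc_gain_pp = acc_with_pfk - acc_without_pfk

print(f"saved {delta_s:.2f} s, {speedup:.2f}x faster, {reduction_pct:.1f}% lower TTFT")
print(f"PFK-guided representation: +{acc_gain_pp:.1f} pp")
```

This particular pair works out to roughly a 2.7× speedup. The 3.62× headline figure is a reported maximum, so it presumably comes from a different baseline or setting than this pair.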

What To Try In 7 Days

Profile your Text-to‑SQL traffic to find hot tables and repeated table sets.

Precompute table-level KV caches and store them in CPU memory for quick lookup.

Build a simple Table Trie to map tokenized queries to table IDs and fetch caches by match order.
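The first step, profiling traffic, can start from a query log that records which tables each request touched. The following is a minimal sketch; the log format and the `profile_table_access` helper are assumptions made for illustration.

```python
from collections import Counter

def profile_table_access(query_log):
    """Count how often each table and each distinct table set appears."""
    table_counts = Counter()
    table_set_counts = Counter()
    for tables in query_log:
        unique = frozenset(tables)  # ignore order and duplicates within a query
        table_counts.update(unique)
        table_set_counts[unique] += 1
    return table_counts, table_set_counts

# Hypothetical log: one list of referenced tables per query.
query_log = [
    ["singer", "concert"],
    ["concert", "singer"],
    ["stadium"],
    ["singer", "concert", "stadium"],
    ["singer"],
]

table_counts, table_set_counts = profile_table_access(query_log)
hot_tables = [t for t, _ in table_counts.most_common(2)]
print("hot tables:", hot_tables)
print("most repeated set:", table_set_counts.most_common(1))
```

Tables that dominate `table_counts` are natural candidates for precomputation first, and frequently repeated table sets indicate how much reuse to expect.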

Optimization Features

Token Efficiency
block-wise encoding reduces attention span per chunk
Infra Optimization
store KV caches in CPU memory and swap to GPU on demand; overlap CPU↔GPU transfers with GPU compute
Model Optimization
adaptive fine-tuning with attention mask (3 epochs)
System Optimization
Table Trie for fast table matching; GPU eviction policies (LRU/FIFO/LFU)
Training Optimization
attention-mask training to adapt to block-wise KV cache
Inference Optimization
offline precomputed table KV caches; query reranking to group similar table accesses; computation loading pipeline to prefetch caches during compute
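To see why query reranking and GPU eviction interact, the toy simulation below counts CPU→GPU cache loads under an LRU policy with and without grouping. It is a sketch under assumptions: `count_gpu_loads` and `rerank` are hypothetical helpers, and sorting by table set is a simple stand-in for whatever reranking criterion the paper actually uses.

```python
from collections import OrderedDict

def count_gpu_loads(queries, capacity):
    """Simulate an LRU GPU cache; return how many table caches had to be loaded."""
    gpu = OrderedDict()
    loads = 0
    for tables in queries:
        for t in tables:
            if t in gpu:
                gpu.move_to_end(t)           # hit: mark as most recently used
            else:
                loads += 1                   # miss: copy cache from CPU to GPU
                gpu[t] = True
                if len(gpu) > capacity:
                    gpu.popitem(last=False)  # evict least recently used
    return loads

def rerank(queries):
    """Stand-in reranking: place queries with the same table set next to each other."""
    return sorted(queries, key=lambda tables: tuple(sorted(tables)))

# Hypothetical batch that alternates between two table sets.
queries = [
    ["singer", "concert"],
    ["stadium", "visit"],
    ["concert", "singer"],
    ["visit", "stadium"],
]

loads_original = count_gpu_loads(queries, capacity=2)
loads_reranked = count_gpu_loads(rerank(queries), capacity=2)
print(f"loads without reranking: {loads_original}, with reranking: {loads_reranked}")
```

With room for only two table caches on the GPU, the alternating order evicts each cache just before it is needed again, while the grouped order reuses it. Reranking changes the order in which queries are served, so it fits batched or queued workloads better than strictly ordered ones.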

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Unknown
License: Unknown

Data URLs

Spider dataset; BIRD dataset

Risks & Boundaries

Limitations

Designed and evaluated only for Text‑to‑SQL; extension to unstructured QA is non-trivial.

Requires static or slowly changing schemas; frequent schema changes raise precompute cost.
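One way to keep precompute cost bounded under schema changes is to fingerprint each table definition and recompute only the caches whose fingerprint changed. The paper does not describe this mechanism; the sketch below is an assumption, and `schema_fingerprint`, `tables_needing_recompute`, and the schema representation are hypothetical.

```python
import hashlib
import json

def schema_fingerprint(table_name, columns, foreign_keys):
    """Stable hash of a table definition (name, columns, foreign keys)."""
    payload = json.dumps(
        {"table": table_name, "columns": columns, "foreign_keys": sorted(foreign_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def tables_needing_recompute(current_schemas, stored_fingerprints):
    """Return tables whose definition is new or differs from the stored fingerprint."""
    stale = []
    for name, (columns, fks) in current_schemas.items():
        if stored_fingerprints.get(name) != schema_fingerprint(name, columns, fks):
            stale.append(name)
    return sorted(stale)

# Hypothetical schemas: columns as [name, type] pairs, foreign keys as strings.
singer = ([["id", "int"], ["name", "text"]], [])
concert_v1 = ([["id", "int"], ["singer_id", "int"]], ["concert.singer_id->singer.id"])
concert_v2 = ([["id", "int"], ["singer_id", "int"], ["year", "int"]],
              ["concert.singer_id->singer.id"])
stadium = ([["id", "int"], ["capacity", "int"]], [])

stored = {
    "singer": schema_fingerprint("singer", *singer),
    "concert": schema_fingerprint("concert", *concert_v1),
}
current = {"singer": singer, "concert": concert_v2, "stadium": stadium}

stale = tables_needing_recompute(current, stored)
print("recompute:", stale)
```

Because table caches are described as preserving primary–foreign key attention between related tables, a real deployment might also need to invalidate caches of tables linked to a changed one; this sketch does not attempt that.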

When Not To Use

When schemas change rapidly and precomputation cannot be amortized.

For open-ended language tasks without clear table boundaries.

Failure Modes

If primary–foreign key graph is incomplete, inter-table attention reconstruction may be incorrect.

Mismatched tokenization or table-naming variants can reduce Table Trie match rates and lower hit ratio.

Core Entities

Models

OmniSQL-7B; Qwen2.5-7B-Coder

Metrics

TTFT (Time To First Token); Accuracy

Datasets

Spider; BIRD

Benchmarks

Spider dev; BIRD dev