Precompute table-level KV caches (guided by primary–foreign keys) to cut Text-to‑SQL prefill latency by up to 3.62× while keeping accuracy nearly unchanged.

January 13, 2026 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang

Links

Abstract / PDF

Why It Matters For Business

TableCache cuts Text-to‑SQL response latency by precomputing and reusing table caches, improving user experience and lowering repeated GPU compute costs in applications where users query shared tables.

Summary TLDR

TableCache precomputes key-value (KV) cache entries for each database table offline, preserving primary–foreign key (PFK) attention between related tables. It stores these table caches in a CPU-resident Table Trie and loads only the needed caches onto the GPU at inference time. Combined with query reranking (to group queries that access similar tables) and a CPU→GPU prefetch pipeline, TableCache reduces Time To First Token (TTFT) by a reported maximum of 3.62× while keeping SQL execution accuracy nearly unchanged. The method integrates with common serving engines (vLLM, SGLang) and targets Text-to‑SQL workloads with repeated table access patterns.

Problem Statement

LLM-based Text-to‑SQL systems include large database schemas in their prompts, which makes the prefill stage long and slow. Existing KV cache reuse requires exact prefix matches, so it fails when the order of tables in the prompt varies, causing redundant cache recomputation and high latency. The paper seeks a practical way to reuse table-level caches across queries while retaining inter-table attention.

Main Contribution

TableCache: offline precomputation of per-table KV caches that preserve primary–foreign key attention relationships.

Table Trie: a token-level trie for fast table-name matching to retrieve precomputed table caches at inference.

Cache management: CPU↔GPU cache eviction policies, a query reranking strategy to raise cache hits, and a computation-loading pipeline to overlap prefetching with model compute.

Key Findings

TableCache greatly reduces prefill latency (TTFT) on Text-to‑SQL benchmarks.

Numbers: up to 3.62× TTFT speedup (reported max)

Accuracy is preserved for tuned Text‑to‑SQL models.

Numbers: accuracy gaps within ≈1% for adapted models

Reconstructing PFK links recovers substantial accuracy lost by naive block-wise encoding.

Numbers: Spider: 72.0% → 76.9%; BIRD: 50.9% → 58.1% (absolute accuracy)

Cache management and batching strategies materially impact latency.

Numbers: Spider dev TTFT: 99.346 s (w/o cache mgmt) → 36.229 s (TableCache)

Overlapping cache loading with compute adds a further, smaller gain.

Numbers: Spider dev TTFT: 38.549 s (w/o comp. load) → 36.229 s (TableCache)

Results

TTFT (Time To First Token)

Value: 36.23 s (TableCache on Spider dev)

Baseline: ≈98.13 s (baseline transformers on Spider dev)

Accuracy

Value: 76.9% (TableCache on Spider, tuned backbone)

Baseline: 72.0% (w/o PFK-guided representation in ablation)

TTFT (ablation w/o cache mgmt)

Value: 99.35 s (Spider dev, no cache mgmt)

Baseline: 36.23 s (TableCache)
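
To put the Spider dev timings listed above on a common scale, the short sketch below converts them into speedup ratios relative to full TableCache. The dictionary keys are illustrative labels rather than names from the paper, and only the figures quoted in this summary are used.

```python
# TTFT figures in seconds, as quoted in this summary for Spider dev.
# The keys are illustrative labels, not identifiers from the paper.
ttft = {
    "baseline_transformers": 98.13,
    "no_cache_mgmt": 99.346,
    "no_comp_loading": 38.549,
    "tablecache": 36.229,
}


def speedup(slow: float, fast: float) -> float:
    """Return how many times lower `fast` is than `slow`."""
    return slow / fast


ratios = {
    name: round(speedup(seconds, ttft["tablecache"]), 2)
    for name, seconds in ttft.items()
    if name != "tablecache"
}
print(ratios)
```

On these numbers the Spider dev speedup over the baseline is roughly 2.7×, and the compute-loading overlap alone contributes about 6%. The 3.62× figure is the reported maximum; this summary does not state which configuration produces it.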

Who Should Care

What To Try In 7 Days

Profile your Text-to‑SQL traffic to find hot tables and repeated table sets.

Precompute table-level KV caches and store them in CPU memory for quick lookup.

Build a simple Table Trie to map tokenized queries to table IDs and fetch caches by match order.
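
As a starting point for the Table Trie step, here is a minimal sketch of a token-level trie that scans a tokenized prompt and returns table IDs in the order they are matched. It is not the paper's implementation: `TableTrie`, `TrieNode`, and `toy_tokenize` are hypothetical names, and the toy tokenizer stands in for the model's real tokenizer, which a production version would need to use so that matches line up with cached token spans.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.table_id: Optional[str] = None  # set when a table name ends here


class TableTrie:
    """Hypothetical token-level trie mapping table-name tokens to table IDs."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, name_tokens: List[str], table_id: str) -> None:
        node = self.root
        for tok in name_tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.table_id = table_id

    def match(self, tokens: List[str]) -> List[str]:
        """Return table IDs in order of appearance, preferring the longest match."""
        found: List[str] = []
        i = 0
        while i < len(tokens):
            node = self.root
            best: Optional[str] = None
            best_end = i
            j = i
            while j < len(tokens) and tokens[j] in node.children:
                node = node.children[tokens[j]]
                j += 1
                if node.table_id is not None:
                    best, best_end = node.table_id, j
            if best is not None:
                found.append(best)
                i = best_end  # skip past the matched table name
            else:
                i += 1
        return found


def toy_tokenize(text: str) -> List[str]:
    # Assumption: stand-in for the model tokenizer (lowercase, split on "_" and spaces).
    return text.lower().replace("_", " ").split()


trie = TableTrie()
for name in ["orders", "order_items", "customers"]:
    trie.insert(toy_tokenize(name), name)

matched = trie.match(toy_tokenize("Schema: customers order_items orders"))
print(matched)
```

The returned list can then drive cache lookups in match order, so each table's precomputed cache is fetched in the sequence the prompt mentions it.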

Optimization Features

Token Efficiency

  • block-wise encoding reduces attention span per chunk

Infra Optimization

  • store KV caches in CPU memory and swap to GPU on demand
  • overlap CPU↔GPU transfers with GPU compute
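
The overlap of transfers and compute can be illustrated with a simple double-buffering loop: while one batch is being processed, the caches for the next batch are fetched in the background. This is a sketch under stated assumptions, not the paper's pipeline. `run_pipelined`, `load`, and `compute` are hypothetical names, a Python thread stands in for an asynchronous host-to-device copy, and the cache payloads are placeholder strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_pipelined(
    batches: List[List[str]],
    load: Callable[[List[str]], Dict[str, str]],
    compute: Callable[[List[str], Dict[str, str]], str],
) -> List[str]:
    """Process batches while prefetching the next batch's caches in the background."""
    results: List[str] = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, batches[0])
        for i, batch in enumerate(batches):
            caches = pending.result()  # wait for this batch's caches
            if i + 1 < len(batches):
                # Start loading the next batch before computing on this one.
                pending = pool.submit(load, batches[i + 1])
            results.append(compute(batch, caches))
    return results


# Placeholder CPU-side store; real entries would be KV tensors.
cpu_store = {t: f"kv({t})" for t in ["orders", "customers", "products"]}


def load(tables: List[str]) -> Dict[str, str]:
    return {t: cpu_store[t] for t in tables}


def compute(tables: List[str], caches: Dict[str, str]) -> str:
    return f"prefill with {sorted(caches)}"


results = run_pipelined([["orders", "customers"], ["products"]], load, compute)
print(results)
```

The gain from this pattern depends on how transfer time compares with compute time, which is consistent with the modest improvement this summary reports for the overlap step.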

Model Optimization

  • adaptive fine-tuning with attention mask (3 epochs)

System Optimization

  • Table Trie for fast table matching
  • GPU eviction policies (LRU/FIFO/LFU)
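
Of the eviction policies listed, LRU is the simplest to prototype. The sketch below keeps a bounded set of table caches "resident" and falls back to a CPU-side store on a miss. `GpuTableCache` is a hypothetical name, capacity is counted in tables rather than bytes for simplicity, and the payloads are placeholder strings rather than real KV tensors.

```python
from collections import OrderedDict
from typing import Dict


class GpuTableCache:
    """Hypothetical LRU cache of table KV entries with a CPU-side backing store."""

    def __init__(self, cpu_store: Dict[str, str], capacity: int) -> None:
        self.cpu_store = cpu_store
        self.capacity = capacity  # assumption: measured in number of tables
        self.resident: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, table_id: str) -> str:
        if table_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(table_id)  # mark as most recently used
            return self.resident[table_id]
        self.misses += 1
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        # A real system would copy the cache from host to device here.
        self.resident[table_id] = self.cpu_store[table_id]
        return self.resident[table_id]


cpu_store = {t: f"kv({t})" for t in ["a", "b", "c"]}
cache = GpuTableCache(cpu_store, capacity=2)
for table in ["a", "b", "a", "c", "b"]:
    cache.get(table)
print(cache.hits, cache.misses, list(cache.resident))
```

Tracking hits and misses per policy on replayed traffic is a cheap way to compare LRU, FIFO, and LFU before committing to one.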

Training Optimization

  • attention-mask training to adapt to block-wise KV cache
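
One plausible reading of the attention mask is a block-wise causal mask in which tokens attend within their own table and to earlier tables linked by a primary–foreign key. The sketch below builds such a mask; it is an interpretation rather than the paper's exact construction, and `pfk_block_mask`, `table_lens`, and `pfk_edges` are hypothetical names.

```python
from typing import List, Set, Tuple


def pfk_block_mask(
    table_lens: List[int], pfk_edges: Set[Tuple[int, int]]
) -> List[List[bool]]:
    """Build a causal mask: mask[q][k] is True if token q may attend to token k.

    Assumption: a token sees earlier tokens of its own table and earlier tokens
    of any table connected to its table by a primary-foreign key edge.
    """
    # Map each token position to the index of the table it belongs to.
    owner = [t for t, n in enumerate(table_lens) for _ in range(n)]
    # Treat PFK edges as undirected for visibility purposes.
    linked = pfk_edges | {(b, a) for a, b in pfk_edges}
    size = len(owner)
    mask = [[False] * size for _ in range(size)]
    for q in range(size):
        for k in range(q + 1):  # causal: only positions at or before q
            same_table = owner[q] == owner[k]
            related = (owner[q], owner[k]) in linked
            mask[q][k] = same_table or related
    return mask


# Three tables with 2, 2, and 1 tokens; table 2 has a foreign key into table 0.
mask = pfk_block_mask([2, 2, 1], {(0, 2)})
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy case the last token (table 2) attends to table 0 and to itself but not to the unrelated table 1, which is the kind of inter-table link that naive block-wise encoding would drop.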

Inference Optimization

  • offline precomputed table KV caches
  • query reranking to group similar table accesses
  • computation loading pipeline to prefetch caches during compute
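
Query reranking can be prototyped by ordering a batch so that queries touching the same tables run back to back. The sketch below uses a plain stable sort on the table set, which is a simplification and not necessarily the paper's reranking strategy. `rerank` and `loads_needed` are hypothetical helpers, and the load count assumes only the previous query's tables stay resident.

```python
from typing import FrozenSet, List, Tuple

Query = Tuple[str, FrozenSet[str]]  # (query ID, tables the query touches)


def rerank(queries: List[Query]) -> List[Query]:
    """Group queries with identical table sets via a stable sort on the set."""
    return sorted(queries, key=lambda q: tuple(sorted(q[1])))


def loads_needed(queries: List[Query]) -> int:
    """Count table loads, assuming only the previous query's tables stay resident."""
    resident: FrozenSet[str] = frozenset()
    loads = 0
    for _, tables in queries:
        loads += len(tables - resident)
        resident = tables
    return loads


queries: List[Query] = [
    ("q1", frozenset({"orders", "customers"})),
    ("q2", frozenset({"products"})),
    ("q3", frozenset({"orders", "customers"})),
    ("q4", frozenset({"products"})),
]

before = loads_needed(queries)
after = loads_needed(rerank(queries))
print(before, after, [qid for qid, _ in rerank(queries)])
```

On this toy batch, reranking halves the number of table loads; how much it helps in practice depends on how skewed the real table access pattern is.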

Reproducibility

Data Urls

  • Spider dataset
  • BIRD dataset

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Designed and evaluated only for Text‑to‑SQL; extension to unstructured QA is non-trivial.
  • Requires static or slowly changing schemas; frequent schema changes raise precompute cost.
  • Some model adaptation (fine-tuning) improves results; untuned large general models may see small accuracy drops.

When Not To Use

  • When schemas change rapidly and precomputation cannot be amortized.
  • For open-ended language tasks without clear table boundaries.
  • When queries rarely reuse the same tables (no hotspot patterns).

Failure Modes

  • If the primary–foreign key graph is incomplete, inter-table attention reconstruction may be incorrect.
  • Mismatched tokenization or table-naming variants can reduce Table Trie match rates and lower the cache hit ratio.
  • Cache churn can become excessive if reranking or eviction policies are not tuned for the workload.

Core Entities

Models

  • OmniSQL-7B
  • Qwen2.5-7B-Coder

Metrics

  • TTFT (Time To First Token)
  • Accuracy

Datasets

  • Spider
  • BIRD

Benchmarks

  • Spider dev
  • BIRD dev