DQABench: a 200k QA benchmark and modular testbed to measure LLMs on real database questions

September 5, 20248 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li

Links

Abstract / PDF

Why It Matters For Business

If you build DB assistants, measure three things separately: core LLM skill, retrieval quality, and tool invocation. Improving retrieval and tool-format handling yields bigger gains than switching LLMs alone.

Summary TLDR

This paper builds DQABench: a 200,000+ bilingual (EN/ZH) database QA dataset and DQATestbed: a modular pipeline (QCR, PTE, RAG, TIG, pretrain/finetune) to evaluate LLMs on three DBQA types—general, product-specific, and instance-specific. Key findings: model size and DB-specific pretraining/finetuning improve results; routing and RAG/TIG modules help when they return accurate info; retrieval recall is the main bottleneck; tool invocation succeeds only for instruction-tuned, larger models. The benchmark is practical for testing DB Q&A systems end-to-end.

Problem Statement

There is no comprehensive, DB-focused benchmark and testbed that measures LLMs across (1) general DB knowledge, (2) product/manual grounded answers, and (3) instance-specific tool-driven diagnosis. Existing datasets are noisy, narrow, or omit retrieval and tool-invocation requirements.

Main Contribution

DQABench dataset: 200,000+ English/Chinese DB QA pairs covering general, product-specific, and instance-specific questions.

DQATestbed: a plug-and-play, modular pipeline combining pretraining, fine-tuning, question routing (QCR), prompt template engineering (PTE), retrieval (RAG), and tool invocation (TIG).

A standardized evaluation pipeline and modular metrics (WinRate, MCA, TSA, TFA, recall) to compare nine LLM variants and module variants.

Key Findings

Large models and DB-specialized training improve DB QA quality.

NumbersBaichuan2-cpt-sft avg WinRate gain +0.44 (ZH) / +0.35 (EN) vs vanilla Baichuan2

DQABench size and bilingual coverage.

Numbers200,000+ QA pairs in English and Chinese

Retrieval is the main bottleneck: low recall reduces RAG gains.

NumbersAll RAG methods show recall <50% on DB documents

Successful tool invocation almost guarantees a winning answer.

Numbers≈93% of answers with successful tool calls win in pairwise judge comparisons

Automated question routing (QCR) improves answers and is faster than LLM classification.

NumbersHierarchical classifier ACC ≈0.94 (ZH) / 0.92 (EN); latency 0.68s vs GPT-4 2.6s

Results

Dataset size

Value200,000+ QA pairs (EN+ZH)

Pretraining corpus size

Value≈100M tokens (47k entries per language)

WinRate improvement for Baichuan2 after pretrain+finetune

Value+0.44 (ZH average WinRate) / +0.35 (EN) vs vanilla Baichuan2

BaselineBaichuan2-13B (vanilla)

RAG recall

Value<50% (recall of relevant text blocks in top-3)

Tool invocation -> answer win correlation

Value≈93% of cases with successful tool invocations win

Accuracy

ValueACC 0.94 (ZH) / 0.92 (EN); latency 0.68s

BaselineGPT-4 classifier ACC 0.55 (ZH) / 0.63 (EN); latency ~2.6s

Who Should Care

What To Try In 7 Days

Run your assistant on a subset of DQABench (product+instance types) to identify retrieval recall and tool-format failures.

Add a lightweight question router (hierarchical classifier) to route prompts and reduce hallucination risk.

Collect or index product manuals into a vector DB with finer chunking and test recall; tune embedding and chunk-size.

Agent Features

Planning

  • Tool planning for chain-of-tool calls (TIG uses COT/ReAct)

Tool Use

  • Tool Invocation Generation (TIG) for DB tools
  • Tool pool selection with tool name + formatted Action_Input

Frameworks

  • Prompt Template Engineering (PTE)
  • Question Classification Routing (QCR)

Optimization Features

Token Efficiency

  • Document chunking (manual segments ≤8k tokens) to fit LLM context

Training Optimization

  • Continual domain pretraining on ~100M DB tokens
  • Sequential fine-tuning stages for NL2SQL, conversational, and expert answer alignment

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • RAG recall is low (<50%) on technical DB docs, so retrieval improvements are required before RAG helps reliably.
  • Many answers and labels were generated or polished with GPT-4, introducing potential bias and leakage into the dataset.
  • Tool invocation evaluation uses ground-truth tool outputs in end-to-end tests, which hides real-world noise from tools.

When Not To Use

  • If you lack high-quality retrieval for your product manuals—RAG may degrade answers.
  • For safety-critical DB actions without human review; tool invocation failures can produce wrong commands.
  • If you need immediate off-the-shelf code or data release; paper does not confirm open release.

Failure Modes

  • Grounding on irrelevant documents when retrieval recall is low, causing confident but wrong answers.
  • LLMs hallucinating non-existent tools or producing wrong Action_Input formats.
  • Classifier mis-route leading to use of wrong prompt and missed tool invocation for instance queries.

Core Entities

Models

  • GPT-4
  • GPT-3.5-Turbo
  • GLM-3-Turbo
  • Llama3-8B-Instruct
  • Llama2-13B-Chat
  • Yuan2-2B
  • Baichuan2-13B
  • SFT

Metrics

  • WinRate
  • Accuracy
  • Recall Rate (RAG)

Datasets

  • DQABench
  • Spider
  • StackOverflow (DB tags)
  • DBA StackExchange

Benchmarks

  • DQABench (this work)