DQABench: a 200k QA benchmark and modular testbed to measure LLMs on real database questions

September 5, 2024 | 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides a large dataset, modular testbed, and measured metrics; results are reproducible in principle but dataset and code release are not explicitly linked in the text.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li

Links

Abstract / PDF

Why It Matters For Business

If you build DB assistants, measure three things separately: core LLM skill, retrieval quality, and tool invocation. Improving retrieval and tool-format handling yields bigger gains than switching LLMs alone.

Summary TLDR

This paper builds DQABench, a bilingual (EN/ZH) database QA dataset with 200,000+ pairs, and DQATestbed, a modular pipeline (QCR, PTE, RAG, TIG, pretraining/fine-tuning) for evaluating LLMs on three DBQA types: general, product-specific, and instance-specific. Key findings: larger models and DB-specific pretraining/fine-tuning improve results; routing and the RAG/TIG modules help when they return accurate information; retrieval recall is the main bottleneck; tool invocation succeeds only for larger, instruction-tuned models. The benchmark is practical for testing DB Q&A systems end-to-end.

Problem Statement

There is no comprehensive, DB-focused benchmark and testbed that measures LLMs across (1) general DB knowledge, (2) product/manual grounded answers, and (3) instance-specific tool-driven diagnosis. Existing datasets are noisy, narrow, or omit retrieval and tool-invocation requirements.

Main Contribution

DQABench dataset: 200,000+ English/Chinese DB QA pairs covering general, product-specific, and instance-specific questions.

DQATestbed: a plug-and-play, modular pipeline combining pretraining, fine-tuning, question routing (QCR), prompt template engineering (PTE), retrieval (RAG), and tool invocation (TIG).
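A minimal sketch of how a plug-and-play pipeline of this kind could be wired, assuming each module is a swappable callable. The class name `QAPipeline`, the `Route` enum, and the stub modules are illustrative assumptions, not the paper's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Route(Enum):
    GENERAL = "general"
    PRODUCT = "product"
    INSTANCE = "instance"


@dataclass
class QAPipeline:
    """Hypothetical container for swappable modules (not the paper's API)."""
    classify: Callable[[str], Route]                 # QCR: pick a question type
    build_prompt: Callable[[str, Route, str], str]   # PTE: fill a prompt template
    llm: Callable[[str], str]                        # model under test
    retrieve: Optional[Callable[[str], str]] = None      # RAG, used for product questions
    invoke_tools: Optional[Callable[[str], str]] = None  # TIG, used for instance questions

    def answer(self, question: str) -> str:
        route = self.classify(question)
        context = ""
        if route is Route.PRODUCT and self.retrieve is not None:
            context = self.retrieve(question)
        elif route is Route.INSTANCE and self.invoke_tools is not None:
            context = self.invoke_tools(question)
        return self.llm(self.build_prompt(question, route, context))


# Stub modules, only to show the wiring; real modules would replace them.
def toy_classify(question: str) -> Route:
    text = question.lower()
    if "my instance" in text:
        return Route.INSTANCE
    if "postgresql" in text:
        return Route.PRODUCT
    return Route.GENERAL


def toy_prompt(question: str, route: Route, context: str) -> str:
    return f"[{route.value}] context={context!r} question={question}"


pipeline = QAPipeline(
    classify=toy_classify,
    build_prompt=toy_prompt,
    llm=lambda prompt: prompt,             # echo "model" for demonstration
    retrieve=lambda q: "manual excerpt",   # stand-in retriever
)
```

Because every stage is an injected callable, a module can be disabled (left as `None`) or replaced without touching the others, which is what makes per-module ablations straightforward.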

Key Findings

Large models and DB-specialized training improve DB QA quality.

Numbers: Baichuan2-cpt-sft avg WinRate gain +0.44 (ZH) / +0.35 (EN) vs vanilla Baichuan2

Practical Use: Invest in domain pretraining and instruction fine-tuning: mid-size models can match or beat closed-source models for DB tasks.

Evidence Ref: Table 5 (WinRate improvements for Baichuan2-cpt-sft)
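This summary does not define WinRate. As a rough illustration only, the sketch below assumes it is the fraction of decided pairwise comparisons (ties excluded) in which a judge prefers the candidate answer over a reference answer; the function name, the labels, and the tie handling are all assumptions. Under that reading, a "gain" is simply the difference between two such rates.

```python
from collections import Counter


def win_rate(judgements):
    """Fraction of non-tie comparisons won by the candidate.

    `judgements` holds one "win" / "lose" / "tie" label per question.
    Excluding ties is an assumption of this sketch.
    """
    counts = Counter(judgements)
    decided = counts["win"] + counts["lose"]
    return counts["win"] / decided if decided else 0.0


# Toy labels, not data from the paper.
baseline = win_rate(["win", "lose", "lose", "lose", "tie"])
tuned = win_rate(["win", "win", "win", "lose", "tie"])
gain = tuned - baseline
```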

DQABench size and bilingual coverage.

Numbers: 200,000+ QA pairs in English and Chinese

Practical Use: Use this large, bilingual dataset to stress-test DB assistants across common and product-specific scenarios.

Evidence Ref: Abstract and Section 3 (dataset statistics, Table 1)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 200,000+ QA pairs (EN+ZH) | - | - | DQABench | Abstract; Section 3; Table 1 | Section 3
Pretraining corpus size | ≈100M tokens (47k entries per language) | - | - | Pretraining data (paper) | Section 4.1: continued pretraining corpus ~100M tokens | Section 4.1

What To Try In 7 Days

Run your assistant on a subset of DQABench (product+instance types) to identify retrieval recall and tool-format failures.

Add a lightweight question router (hierarchical classifier) to route prompts and reduce hallucination risk.

Collect or index product manuals into a vector DB with finer chunking and test recall; tune embedding and chunk-size.
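For the recall test in the last step, one way to score retrieval is recall@k over a small labeled set of questions. This is a sketch under stated assumptions: the chunk ids are hypothetical, and you would supply the retrieved ids from your own vector DB and the relevant ids from manual labeling.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of relevant chunk ids that appear among the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)


# Toy data: ranked retrieved ids and hand-labeled relevant ids per question.
runs = [
    (["c3", "c9", "c1"], ["c1", "c2"]),  # one of two relevant chunks retrieved
    (["c7", "c4", "c5"], ["c4"]),        # the single relevant chunk retrieved
]
score = mean_recall_at_k(runs, k=3)
```

Re-running the same labeled set after each change to chunk size or embedding model gives a like-for-like comparison.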

Agent Features

Planning
Tool planning for chain-of-tool calls (TIG uses CoT/ReAct)
Tool Use
Tool Invocation Generation (TIG) for DB tools
Tool pool selection with tool name + formatted Action_Input
Frameworks
Prompt Template Engineering (PTE)
Question Classification Routing (QCR)
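To make the routing idea concrete, here is a minimal two-level router sketch. It uses keyword matching as a stand-in for a trained classifier; the label names, keyword lists, and function name are assumptions for illustration and do not describe the paper's QCR implementation.

```python
def route_question(question, product_keywords=(), instance_markers=()):
    """Two-level routing: first DB-related or not, then which DB question type.

    Keyword matching is a stand-in for a trained hierarchical classifier.
    """
    text = question.lower()
    db_terms = ("database", "sql", "index", "query", "table")

    # Level 1: is this a database question at all?
    is_db = (
        any(term in text for term in db_terms)
        or any(p in text for p in product_keywords)
        or any(m in text for m in instance_markers)
    )
    if not is_db:
        return "non_db"

    # Level 2: instance-specific beats product-specific beats general.
    if any(m in text for m in instance_markers):
        return "instance"
    if any(p in text for p in product_keywords):
        return "product"
    return "general"


products = ("postgresql", "mysql")
markers = ("my instance",)
```

The returned label can then select the prompt template and decide whether retrieval or tool invocation runs at all.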

Optimization Features

Token Efficiency
Document chunking (manual segments ≤8k tokens) to fit LLM context
Training Optimization
Continual domain pretraining on ~100M DB tokens
Sequential fine-tuning stages for NL2SQL, conversational, and expert answer alignment
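The chunking constraint above (segments capped at a token budget) can be sketched as greedy paragraph packing. This is an illustration, not the paper's procedure: whitespace splitting stands in for a real tokenizer, and the function and parameter names are assumptions.

```python
def chunk_by_token_budget(paragraphs, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack paragraphs into chunks of at most `max_tokens` tokens.

    `count_tokens` defaults to whitespace counting; swap in your model's
    tokenizer for real budgets. A single paragraph larger than the budget
    is emitted as its own over-budget chunk rather than being split.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and used + n > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


demo_chunks = chunk_by_token_budget(["a b c", "d e", "f g h i"], max_tokens=5)
```

In practice `max_tokens` would be set to the segment limit (for example 8,000) minus whatever room the prompt template and question need.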

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

RAG recall is low (<50%) on technical DB docs, so retrieval improvements are required before RAG helps reliably.

Many answers and labels were generated or polished with GPT-4, introducing potential bias and leakage into the dataset.

When Not To Use

When you lack high-quality retrieval over your product manuals, because RAG may then degrade answers.

For safety-critical DB actions without human review, because tool invocation failures can produce wrong commands.

Failure Modes

Grounding on irrelevant documents when retrieval recall is low, causing confident but wrong answers.

LLMs hallucinating non-existent tools or producing wrong Action_Input formats.
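A guard against the tool-related failure mode is to validate each generated call before executing it. The sketch below assumes a ReAct-style output with an `Action:` line followed by a JSON `Action_Input:`; the tool names, their required arguments, and the exact text layout are hypothetical.

```python
import json
import re

# Hypothetical tool pool: tool name -> required argument names.
TOOL_POOL = {
    "get_slow_queries": {"top_n"},
    "get_table_stats": {"table"},
}

# Assumed layout: "Action: <name>" on one line, "Action_Input: {json}" on the next.
ACTION_RE = re.compile(r"Action:\s*(\S+)\s*\nAction_Input:\s*(\{.*\})", re.DOTALL)


def validate_tool_call(llm_output, tool_pool=TOOL_POOL):
    """Return (ok, reason) for a tool call emitted by a model."""
    match = ACTION_RE.search(llm_output)
    if match is None:
        return False, "no Action/Action_Input pair found"
    name, raw_args = match.groups()
    if name not in tool_pool:
        return False, f"unknown tool: {name}"  # hallucinated tool
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError:
        return False, "Action_Input is not valid JSON"
    missing = tool_pool[name] - set(args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"


good = 'Thought: check slow queries\nAction: get_slow_queries\nAction_Input: {"top_n": 5}'
result = validate_tool_call(good)
```

Rejected calls can be logged by reason, which separates hallucinated tools from format errors when analyzing failures.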

Core Entities

Models

GPT-4, GPT-3.5-Turbo, GLM-3-Turbo, Llama3-8B-Instruct, Llama2-13B-Chat, Yuan2-2B, Baichuan2-13BSFT

Metrics

WinRate, Accuracy, Recall Rate (RAG)

Datasets

DQABench, Spider, StackOverflow (DB tags), DBA StackExchange

Benchmarks

DQABench (this work)