DQABench: a 200k QA benchmark and modular testbed to measure LLMs on real database questions

Overview

Decision Snapshot: Ready For Pilot

The paper provides a large dataset, modular testbed, and measured metrics; results are reproducible in principle but dataset and code release are not explicitly linked in the text.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li

Why It Matters For Business

If you build DB assistants, measure three things separately: core LLM skill, retrieval quality, and tool invocation. Improving retrieval and tool-format handling yields bigger gains than switching LLMs alone.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

This paper builds DQABench, a 200,000+ bilingual (EN/ZH) database QA dataset, and DQATestbed, a modular pipeline (QCR, PTE, RAG, TIG, pretraining/fine-tuning) for evaluating LLMs on three DBQA types: general, product-specific, and instance-specific. Key findings: model size and DB-specific pretraining/fine-tuning improve results; routing and the RAG/TIG modules help when they return accurate information; retrieval recall is the main bottleneck; tool invocation succeeds only for larger, instruction-tuned models. The benchmark is practical for testing DB Q&A systems end-to-end.

Problem Statement

There is no comprehensive, DB-focused benchmark and testbed that measures LLMs across (1) general DB knowledge, (2) product/manual grounded answers, and (3) instance-specific tool-driven diagnosis. Existing datasets are noisy, narrow, or omit retrieval and tool-invocation requirements.

Main Contribution

DQABench dataset: 200,000+ English/Chinese DB QA pairs covering general, product-specific, and instance-specific questions.

DQATestbed: a plug-and-play, modular pipeline combining pretraining, fine-tuning, question routing (QCR), prompt template engineering (PTE), retrieval (RAG), and tool invocation (TIG).

Key Findings

Large models and DB-specialized training improve DB QA quality.

Numbers: Baichuan2-cpt-sft avg WinRate gain +0.44 (ZH) / +0.35 (EN) vs vanilla Baichuan2

Practical Use: Invest in domain pretraining and instruction fine-tuning: mid-size models can match or beat closed-source models for DB tasks.

Evidence Ref: Table 5 (WinRate improvements for Baichuan2-cpt-sft)
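
A WinRate delta like the one above can be reproduced from per-question judge verdicts. The sketch below is illustrative only: it assumes WinRate is wins divided by decided comparisons (ties excluded), which may differ from the paper's exact definition, and the verdict lists are made-up placeholders, not results from the paper.

```python
from collections import Counter


def win_rate(judgments):
    """Fraction of decided comparisons won by the candidate model.

    judgments: iterable of "win", "loss", or "tie" labels, one per question.
    Ties are left out of the denominator (an assumption of this sketch).
    """
    counts = Counter(judgments)
    decided = counts["win"] + counts["loss"]
    if decided == 0:
        return 0.0
    return counts["win"] / decided


# Hypothetical judge verdicts for a vanilla model and a domain-tuned model.
baseline = ["win", "loss", "loss", "tie", "loss"]
tuned = ["win", "win", "loss", "tie", "win"]

delta = win_rate(tuned) - win_rate(baseline)
print(f"baseline={win_rate(baseline):.2f} tuned={win_rate(tuned):.2f} delta={delta:+.2f}")
```

Reporting the baseline alongside the delta makes it easier to compare gains across languages or question types.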

DQABench size and bilingual coverage.

Numbers: 200,000+ QA pairs in English and Chinese

Practical Use: Use this large, bilingual dataset to stress-test DB assistants across common and product-specific scenarios.

Evidence Ref: Abstract and Section 3 (dataset statistics, Table 1)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	200,000+ QA pairs (EN+ZH)	—	—	DQABench	Abstract; Section 3; Table 1	Section 3
Pretraining corpus size	≈100M tokens (47k entries per language)	—	—	Pretraining data (paper)	Section 4.1: continued pretraining corpus ~100M tokens	Section 4.1

What To Try In 7 Days

Run your assistant on a subset of DQABench (product+instance types) to identify retrieval recall and tool-format failures.

Add a lightweight question router (hierarchical classifier) to route prompts and reduce hallucination risk.

Collect or index product manuals into a vector DB with finer chunking and test recall; tune embedding and chunk-size.
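
To make the recall test in the last step concrete, here is a minimal sketch of recall@k over a small labeled set. It assumes you have, for each question, the ranked chunk ids your retriever returned and a hand-labeled list of relevant chunk ids; the field names `retrieved` and `relevant` and the sample ids are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of relevant chunk ids that appear among the top-k retrieved ids."""
    if not relevant_ids:
        return None  # undefined when a question has no labeled relevant chunk
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for chunk_id in relevant_ids if chunk_id in top_k)
    return hits / len(relevant_ids)


def mean_recall_at_k(examples, k):
    """Average recall@k over examples, skipping those without labels."""
    scores = [recall_at_k(ex["retrieved"], ex["relevant"], k) for ex in examples]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical evaluation set: ranked retriever output plus gold chunk ids.
examples = [
    {"retrieved": ["c3", "c7", "c1", "c9"], "relevant": ["c1", "c2"]},
    {"retrieved": ["c5", "c4"], "relevant": ["c4"]},
]

print(mean_recall_at_k(examples, k=3))
```

Re-running the same measurement after each change to chunk size or embedding model shows whether the change actually moved recall.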

Agent Features

Planning

Tool planning for chained tool calls (TIG uses CoT/ReAct-style prompting)

Tool Use

Tool Invocation Generation (TIG) for DB tools

Tool pool selection with tool name + formatted Action_Input
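
Since tool calls are described as a tool name plus a formatted Action_Input, a checker for model output might look like the sketch below. It is not the paper's parser: the `Action:` line label, the JSON encoding of Action_Input, and the tools in `TOOL_POOL` are all assumptions made for illustration.

```python
import json
import re

# Hypothetical tool pool: tool name -> set of required argument names.
TOOL_POOL = {
    "explain_query": {"sql"},
    "get_slow_queries": {"limit"},
}

# Assumed ReAct-style layout: an "Action:" line followed by "Action_Input:" JSON.
ACTION_RE = re.compile(
    r"Action:\s*(?P<name>\S+)\s*\nAction_Input:\s*(?P<args>\{.*\})",
    re.DOTALL,
)


def parse_tool_call(text, tool_pool=TOOL_POOL):
    """Return (call, None) for a valid tool call, or (None, reason) otherwise."""
    match = ACTION_RE.search(text)
    if match is None:
        return None, "no Action/Action_Input block found"
    name = match.group("name")
    if name not in tool_pool:
        return None, f"unknown tool: {name}"
    try:
        args = json.loads(match.group("args"))
    except json.JSONDecodeError as exc:
        return None, f"Action_Input is not valid JSON: {exc.msg}"
    missing = tool_pool[name] - set(args)
    if missing:
        return None, f"missing arguments: {sorted(missing)}"
    return {"tool": name, "args": args}, None


output = 'Thought: inspect the plan.\nAction: explain_query\nAction_Input: {"sql": "SELECT 1"}'
print(parse_tool_call(output))
```

Counting the rejection reasons over a benchmark run separates hallucinated tool names from malformed arguments, which usually call for different fixes.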

Frameworks

Prompt Template Engineering (PTE)

Question Classification Routing (QCR)
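
The interplay of routing and prompt templates can be sketched as follows. This is only an illustration of the idea: a keyword rule stands in for whatever classifier the testbed actually uses, and the hint lists and template texts are hypothetical.

```python
# Hypothetical keyword hints; a trained classifier would replace these rules.
INSTANCE_HINTS = ("my database", "our cluster", "this instance")
PRODUCT_HINTS = ("postgresql", "mysql", "oracle")

# Hypothetical prompt templates, one per question type.
TEMPLATES = {
    "general": "You are a database expert. Answer the question:\n{question}",
    "product": (
        "Answer using only the manual excerpts below.\n{context}\n\n"
        "Question: {question}"
    ),
    "instance": (
        "You may call the listed tools to inspect the database.\n{context}\n\n"
        "Question: {question}"
    ),
}


def route(question):
    """Two-level routing: instance-specific first, then product vs. general."""
    q = question.lower()
    if any(hint in q for hint in INSTANCE_HINTS):
        return "instance"
    if any(hint in q for hint in PRODUCT_HINTS):
        return "product"
    return "general"


def build_prompt(question, context=""):
    """Pick the template for the routed type and fill it in."""
    label = route(question)
    return label, TEMPLATES[label].format(question=question, context=context)


label, prompt = build_prompt(
    "How do I enable WAL archiving in PostgreSQL?",
    context="[manual excerpt goes here]",
)
print(label)
print(prompt)
```

Keeping routing and templating as separate functions lets either one be swapped out and measured on its own.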

Optimization Features

Token Efficiency

Document chunking (manual segments ≤8k tokens) to fit LLM context
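
A token-budgeted chunker along these lines could be sketched as below. It is a simplification under stated assumptions: whitespace-separated words stand in for real tokenizer counts, paragraphs are the packing unit, and the function name is hypothetical.

```python
def chunk_by_token_budget(paragraphs, max_tokens=8000,
                          count_tokens=lambda text: len(text.split())):
    """Greedily pack whole paragraphs into chunks of at most max_tokens.

    count_tokens defaults to a whitespace word count, which only approximates
    a model tokenizer. A single paragraph larger than the budget is emitted
    as its own oversized chunk rather than being split.
    """
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny budget so the packing behaviour is visible.
demo = chunk_by_token_budget(["a b c", "d e", "f g h i"], max_tokens=5)
print(demo)
```

Passing a real tokenizer's counting function as `count_tokens` keeps the packing logic unchanged while making the budget accurate for a specific model.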

Training Optimization

Continual domain pretraining on ~100M DB tokens

Sequential fine-tuning stages for NL2SQL, conversational, and expert answer alignment

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

RAG recall is low (<50%) on technical DB docs, so retrieval improvements are required before RAG helps reliably.

Many answers and labels were generated or polished with GPT-4, introducing potential bias and leakage into the dataset.

When Not To Use

If you lack high-quality retrieval for your product manuals; in that case RAG may degrade answers.

For safety-critical DB actions without human review; tool invocation failures can produce wrong commands.

Failure Modes

Grounding on irrelevant documents when retrieval recall is low, causing confident but wrong answers.

LLMs hallucinating non-existent tools or producing wrong Action_Input formats.

Core Entities

Models

GPT-4, GPT-3.5-Turbo, GLM-3-Turbo, Llama3-8B-Instruct, Llama2-13B-Chat, Yuan2-2B, Baichuan2-13BSFT

Metrics

WinRate, Accuracy, Recall Rate (RAG)

Datasets

DQABench, Spider, StackOverflow (DB tags), DBA StackExchange

Benchmarks

DQABench (this work)
