Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
If you build DB assistants, measure three things separately: core LLM skill, retrieval quality, and tool invocation. Improving retrieval and tool-format handling yields bigger gains than switching LLMs alone.
Summary TLDR
This paper builds DQABench: a 200,000+ bilingual (EN/ZH) database QA dataset and DQATestbed: a modular pipeline (QCR, PTE, RAG, TIG, pretrain/finetune) to evaluate LLMs on three DBQA types—general, product-specific, and instance-specific. Key findings: model size and DB-specific pretraining/finetuning improve results; routing and RAG/TIG modules help when they return accurate info; retrieval recall is the main bottleneck; tool invocation succeeds only for instruction-tuned, larger models. The benchmark is practical for testing DB Q&A systems end-to-end.
Problem Statement
There is no comprehensive, DB-focused benchmark and testbed that measures LLMs across (1) general DB knowledge, (2) product/manual grounded answers, and (3) instance-specific tool-driven diagnosis. Existing datasets are noisy, narrow, or omit retrieval and tool-invocation requirements.
Main Contribution
DQABench dataset: 200,000+ English/Chinese DB QA pairs covering general, product-specific, and instance-specific questions.
DQATestbed: a plug-and-play, modular pipeline combining pretraining, fine-tuning, question routing (QCR), prompt template engineering (PTE), retrieval (RAG), and tool invocation (TIG).
A standardized evaluation pipeline and modular metrics (WinRate, MCA, TSA, TFA, recall) to compare nine LLM variants and module variants.
Key Findings
Large models and DB-specialized training improve DB QA quality.
DQABench size and bilingual coverage.
Retrieval is the main bottleneck: low recall reduces RAG gains.
Successful tool invocation almost guarantees a winning answer.
Automated question routing (QCR) improves answers and is faster than LLM classification.
Results
Dataset size
Pretraining corpus size
WinRate improvement for Baichuan2 after pretrain+finetune
RAG recall
Tool invocation -> answer win correlation
Accuracy
Who Should Care
What To Try In 7 Days
Run your assistant on a subset of DQABench (product+instance types) to identify retrieval recall and tool-format failures.
Add a lightweight question router (hierarchical classifier) to route prompts and reduce hallucination risk.
Collect or index product manuals into a vector DB with finer chunking and test recall; tune embedding and chunk-size.
Agent Features
Planning
- Tool planning for chain-of-tool calls (TIG uses COT/ReAct)
Tool Use
- Tool Invocation Generation (TIG) for DB tools
- Tool pool selection with tool name + formatted Action_Input
Frameworks
- Prompt Template Engineering (PTE)
- Question Classification Routing (QCR)
Optimization Features
Token Efficiency
- Document chunking (manual segments ≤8k tokens) to fit LLM context
Training Optimization
- Continual domain pretraining on ~100M DB tokens
- Sequential fine-tuning stages for NL2SQL, conversational, and expert answer alignment
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- RAG recall is low (<50%) on technical DB docs, so retrieval improvements are required before RAG helps reliably.
- Many answers and labels were generated or polished with GPT-4, introducing potential bias and leakage into the dataset.
- Tool invocation evaluation uses ground-truth tool outputs in end-to-end tests, which hides real-world noise from tools.
When Not To Use
- If you lack high-quality retrieval for your product manuals—RAG may degrade answers.
- For safety-critical DB actions without human review; tool invocation failures can produce wrong commands.
- If you need immediate off-the-shelf code or data release; paper does not confirm open release.
Failure Modes
- Grounding on irrelevant documents when retrieval recall is low, causing confident but wrong answers.
- LLMs hallucinating non-existent tools or producing wrong Action_Input formats.
- Classifier mis-route leading to use of wrong prompt and missed tool invocation for instance queries.
Core Entities
Models
- GPT-4
- GPT-3.5-Turbo
- GLM-3-Turbo
- Llama3-8B-Instruct
- Llama2-13B-Chat
- Yuan2-2B
- Baichuan2-13B
- SFT
Metrics
- WinRate
- Accuracy
- Recall Rate (RAG)
Datasets
- DQABench
- Spider
- StackOverflow (DB tags)
- DBA StackExchange
Benchmarks
- DQABench (this work)

