Overview
The paper provides a large dataset, a modular testbed, and measured metrics. Results are reproducible in principle, but the text does not explicitly link a dataset or code release.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/6
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you build DB assistants, measure three things separately: core LLM skill, retrieval quality, and tool invocation. Improving retrieval and tool-format handling yields bigger gains than switching LLMs alone.
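One way to keep the three measurements separate is to record them as independent fields per test question and aggregate each on its own. The sketch below is illustrative only: the `EvalRecord` fields, the 0 to 1 score scale, and the `summarize` helper are assumptions made for this example, not part of DQABench or DQATestbed.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class EvalRecord:
    """One evaluated question (hypothetical schema)."""
    answer_score: float                       # assumed 0..1 answer-quality score
    retrieval_recall: Optional[float] = None  # None when the question needed no retrieval
    tool_call_ok: Optional[bool] = None       # None when the question needed no tool


def summarize(records: List[EvalRecord]) -> Dict[str, Optional[float]]:
    """Aggregate each signal separately so one cannot mask another."""
    recalls = [r.retrieval_recall for r in records if r.retrieval_recall is not None]
    tools = [1.0 if r.tool_call_ok else 0.0 for r in records if r.tool_call_ok is not None]
    return {
        "answer_score": mean(r.answer_score for r in records) if records else None,
        "retrieval_recall": mean(recalls) if recalls else None,
        "tool_success_rate": mean(tools) if tools else None,
    }


records = [
    EvalRecord(answer_score=0.5, retrieval_recall=0.4),
    EvalRecord(answer_score=1.0, tool_call_ok=True),
    EvalRecord(answer_score=0.75, tool_call_ok=False),
]
report = summarize(records)
```

Reporting the three numbers side by side makes it visible when a weak overall score comes from retrieval or tool formatting rather than from the underlying model.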
Summary TLDR
This paper builds two artifacts. DQABench is a bilingual (EN/ZH) database QA dataset with 200,000+ pairs. DQATestbed is a modular pipeline (QCR, PTE, RAG, TIG, pretraining/fine-tuning) for evaluating LLMs on three DBQA types: general, product-specific, and instance-specific. Key findings: larger models and DB-specific pretraining/fine-tuning improve results; the routing, RAG, and TIG modules help when they return accurate information; retrieval recall is the main bottleneck; and tool invocation succeeds only for larger, instruction-tuned models. The benchmark is practical for testing DB Q&A systems end-to-end.
Problem Statement
There is no comprehensive, DB-focused benchmark and testbed that measures LLMs across (1) general DB knowledge, (2) answers grounded in product manuals, and (3) instance-specific, tool-driven diagnosis. Existing datasets are noisy, narrow, or omit retrieval and tool-invocation requirements.
Main Contribution
DQABench dataset: 200,000+ English/Chinese DB QA pairs covering general, product-specific, and instance-specific questions.
DQATestbed: a plug-and-play, modular pipeline combining pretraining, fine-tuning, question routing (QCR), prompt template engineering (PTE), retrieval (RAG), and tool invocation (TIG).
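The plug-and-play idea can be pictured as a router that labels each question and dispatches it to a per-type handler. The following is a minimal sketch under stated assumptions: the `Pipeline` class, the keyword rules in `keyword_classifier`, and the placeholder handlers are all hypothetical stand-ins, not the paper's QCR, RAG, or TIG implementations.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical handler signature: takes a question, returns an answer string.
Handler = Callable[[str], str]


@dataclass
class Pipeline:
    classify: Callable[[str], str]  # stand-in for question routing (QCR)
    handlers: Dict[str, Handler]    # one swappable handler per question type
    fallback: Handler               # used when the label has no handler

    def answer(self, question: str) -> str:
        label = self.classify(question)
        return self.handlers.get(label, self.fallback)(question)


def keyword_classifier(question: str) -> str:
    """Toy router; the keyword rules are illustrative only."""
    q = question.lower()
    if any(w in q for w in ("my database", "my instance", "slow query", "cpu")):
        return "instance"
    if any(w in q for w in ("manual", "parameter", "version")):
        return "product"
    return "general"


# Placeholder handlers that only tag which path a question would take.
pipeline = Pipeline(
    classify=keyword_classifier,
    handlers={
        "general": lambda q: f"[LLM only] {q}",
        "product": lambda q: f"[RAG + LLM] {q}",
        "instance": lambda q: f"[TIG + LLM] {q}",
    },
    fallback=lambda q: f"[LLM only] {q}",
)
```

Because each handler is just a callable, a retrieval or tool-invocation module can be swapped or disabled without touching the rest of the pipeline, which is what makes per-module ablation straightforward.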
Key Findings
Larger models and DB-specialized training improve DB QA quality.
DQABench contains 200,000+ QA pairs with bilingual (English/Chinese) coverage.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Dataset size | 200,000+ QA pairs (EN+ZH) | — | — | DQABench | Abstract; Section 3; Table 1 | Section 3 |
| Pretraining corpus size | ≈100M tokens (47k entries per language) | — | — | Pretraining data (paper) | Section 4.1: continued pretraining corpus ~100M tokens | Section 4.1 |
What To Try In 7 Days
Run your assistant on a subset of DQABench (product+instance types) to identify retrieval recall and tool-format failures.
Add a lightweight question router (hierarchical classifier) to route prompts and reduce hallucination risk.
Collect and index product manuals in a vector DB with finer chunking, test retrieval recall, and tune the embedding model and chunk size.
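For the recall check in the steps above, a small recall@k helper is usually enough to compare chunk sizes or embedding models. This is a sketch under assumptions: chunk IDs such as `"c1"` and the hand-labeled `relevant` sets are hypothetical, and the retrieval step itself (the vector DB query) is left out.

```python
from typing import List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs found in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: List[Tuple[List[str], Set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per question."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical labeled examples: ranked retriever output and gold chunk IDs.
runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),
    (["c4", "c5", "c6"], {"c9"}),
]
score = mean_recall_at_k(runs, k=3)
```

Running the same labeled questions against each chunking or embedding configuration gives a direct comparison before any LLM is involved.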
Agent Features
Planning
Tool Use
Frameworks
Optimization Features
Token Efficiency
Training Optimization
Reproducibility
Risks & Boundaries
Limitations
RAG recall is low (<50%) on technical DB docs, so retrieval improvements are required before RAG helps reliably.
Many answers and labels were generated or polished with GPT-4, introducing potential bias and leakage into the dataset.
When Not To Use
If you lack high-quality retrieval for your product manuals; RAG may then degrade answers.
For safety-critical DB actions without human review; tool invocation failures can produce wrong commands.
Failure Modes
Grounding on irrelevant documents when retrieval recall is low, causing confident but wrong answers.
LLMs hallucinating non-existent tools or producing wrong Action_Input formats.
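Both tool-related failures can be caught before execution by validating the model output against a registry of known tools. The sketch below assumes a ReAct-style output with an `Action:` line and a JSON `Action_Input:` line; that exact layout, the `TOOLS` registry, and the tool names in it are assumptions for illustration, not the format used by the paper.

```python
import json
import re
from typing import Dict, Set, Tuple

# Hypothetical registry: tool name -> required argument names.
TOOLS: Dict[str, Set[str]] = {
    "get_slow_queries": {"limit"},
    "get_metric": {"name"},
}

ACTION_RE = re.compile(r"^Action:\s*(\S+)\s*$", re.MULTILINE)
INPUT_RE = re.compile(r"^Action_Input:\s*(.+)$", re.MULTILINE)


def validate_tool_call(llm_output: str) -> Tuple[bool, str]:
    """Return (ok, reason) without executing anything."""
    action = ACTION_RE.search(llm_output)
    action_input = INPUT_RE.search(llm_output)
    if not action or not action_input:
        return False, "missing Action or Action_Input line"
    name = action.group(1)
    if name not in TOOLS:
        # Catches hallucinated, non-existent tools.
        return False, f"unknown tool: {name}"
    try:
        args = json.loads(action_input.group(1))
    except json.JSONDecodeError:
        return False, "Action_Input is not valid JSON"
    if not isinstance(args, dict):
        return False, "Action_Input must be a JSON object"
    missing = TOOLS[name] - set(args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"
```

A rejected call can be routed to a retry prompt or to human review instead of being run against a live database.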

