Use a multi-agent LLM pipeline to synthesize 30–90K high-quality math QA pairs that let 3–8B models match or beat models trained on 400K–2.3M samples

October 22, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides thorough experiments and ablations across multiple base models. Results are strong for text-only math tasks but rest on a closed-source teacher (GPT-4o-mini), and reproducibility depends on the release of data and prompts.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut synthetic-data volume by an order of magnitude and keep or improve model math performance. That lowers labeling costs and GPU training time while enabling smaller models to reach stronger production math competence.

Who Should Care

Summary TLDR

AgenticMath is a multi-agent, LLM-driven pipeline that filters seeds, rephrases problems, generates chain-of-thought solutions, and ranks pairs. With curated 30K–90K datasets it raises 3–8B models' math accuracy to match or beat baselines trained on hundreds of thousands to millions of samples.

Problem Statement

Synthetic math data is cheap but often low quality. Poorly phrased problems and wrong solutions limit supervised fine-tuning gains. The paper targets the data-quality bottleneck: create fewer, higher-value problem-solution pairs so small models learn better.

Main Contribution

AgenticMath: a 4‑stage multi-agent pipeline (seed filtering, rephrase + review + revise, solution generation with CoT, joint evaluation) to synthesize math QA.

AgenticMathQA: curated datasets in 30K, 60K, 90K sizes focused on clarity, correctness, and diversity rather than scale.
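The four stages listed above can be wired together roughly as follows. This is a minimal sketch, not the paper's code: `QAPair`, `run_pipeline`, and the four callables are hypothetical names, and in practice each callable would wrap an LLM call rather than the toy lambdas used in the demo.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    problem: str
    solution: str
    score: float = 0.0


def run_pipeline(
    seeds: Iterable[str],
    keep_seed: Callable[[str], bool],       # stage 1: seed filtering
    refine: Callable[[str], str],           # stage 2: rephrase + review + revise
    solve: Callable[[str], str],            # stage 3: CoT solution generation
    evaluate: Callable[[str, str], float],  # stage 4: joint problem-solution score
    top_k: int,
) -> List[QAPair]:
    """Run the four stages in order and keep the top_k highest-scoring pairs."""
    pairs: List[QAPair] = []
    for seed in seeds:
        if not keep_seed(seed):
            continue
        problem = refine(seed)
        solution = solve(problem)
        pairs.append(QAPair(problem, solution, evaluate(problem, solution)))
    pairs.sort(key=lambda p: p.score, reverse=True)
    return pairs[:top_k]


# Toy stand-ins for the LLM agents (assumptions for illustration only).
demo = run_pipeline(
    seeds=["2+2?", "", "Solve x + 1 = 3."],
    keep_seed=lambda s: len(s) > 0,
    refine=lambda s: s.strip(),
    solve=lambda p: "stub solution for: " + p,
    evaluate=lambda p, s: float(len(p)),
    top_k=1,
)
```

Passing the agents in as callables keeps the orchestration independent of any particular teacher model, which matters here because the paper's pipeline depends on GPT-4o-mini.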

Key Findings

AgenticMath produces competitive performance with far less data.

Numbers: 30K–90K AgenticMath vs 400K–2.3M baselines (Table 2)

Practical Use: Curate high-quality synthetic math pairs instead of scaling raw synthetic data; expect similar benchmark gains with an order-of-magnitude less labeling cost.

Evidence Ref: Table 2, Sec. 4.2

Large per-base-model gain from quality-focused synthesis (Qwen2.5-3B, 30K).

Numbers: AgenticMath-Qwen2.5-3B avg 53.7 vs RefAug 34.6 (+19.1) (Table 1)

Practical Use: If you fine-tune Qwen2.5-3B, replacing generic 30K synthetic data with AgenticMath 30K can yield ~+19 points average accuracy on the paper's benchmarks.

Evidence Ref: Table 1, Sec. 4.2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 53.7 (AgenticMath-Qwen2.5-3B, 30K) | RefAug Qwen2.5-3B (30K) 34.6 | +19.1 | Table 1 (30K, Qwen2.5-3B) | Table 1; Sec. 4.2 | Table 1
Accuracy | 49.3 (AgenticMath-DSMath-7B, 60K) | DeepSeekMath-7B-RFT (590K) 48.3 | +1.0 | Table 2 (60K, DeepSeekMath-7B) | Table 2; Sec. 4.2 | Table 2
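As a quick check on the two result rows above, the deltas are plain subtraction, and the sample counts also give the data ratio between baseline and AgenticMath training sets. The sketch below only re-derives numbers already stated in the rows; the field names and `summarize` are illustrative.

```python
# Accuracy figures and sample counts copied from the two result rows above.
rows = [
    {"name": "Qwen2.5-3B, 30K", "ours": 53.7, "baseline": 34.6,
     "ours_n": 30_000, "baseline_n": 30_000},
    {"name": "DeepSeekMath-7B, 60K", "ours": 49.3, "baseline": 48.3,
     "ours_n": 60_000, "baseline_n": 590_000},
]


def summarize(row):
    """Return (accuracy delta, baseline-to-ours data ratio), rounded to 1 decimal."""
    delta = round(row["ours"] - row["baseline"], 1)
    ratio = round(row["baseline_n"] / row["ours_n"], 1)
    return delta, ratio


summaries = [summarize(r) for r in rows]
```

The second row is the one that supports the data-efficiency claim: a +1.0 gain with roughly a tenth of the training samples (590K vs 60K).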

What To Try In 7 Days

Run seed filtering: score your seed problems (complexity, info value, clarity) and drop low-quality ones.

Add a rephrase + review + revise loop on synthetic generation to fix ambiguity before solution synthesis.

Generate chain-of-thought solutions and perform joint problem-solution scoring; keep top-ranked samples for SFT (aim for 15K high-quality pairs).
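The first step above, seed filtering, might look like the sketch below. The three score keys mirror the criteria named in the step (complexity, info value, clarity); the 1–5 scale, the `min_score` threshold, and the function name are assumptions, and the scores themselves would come from an LLM evaluator rather than being hard-coded.

```python
from typing import Dict, List, Tuple

# Each seed carries per-criterion scores, e.g. on an assumed 1-5 scale.
ScoredSeed = Tuple[str, Dict[str, float]]


def filter_seeds(
    scored_seeds: List[ScoredSeed], min_score: float = 3.0
) -> Tuple[List[str], List[str]]:
    """Split seeds into (kept, dropped): a seed is kept only if every
    criterion score is at least min_score."""
    kept: List[str] = []
    dropped: List[str] = []
    for text, scores in scored_seeds:
        if min(scores.values()) >= min_score:
            kept.append(text)
        else:
            dropped.append(text)
    return kept, dropped


scored = [
    ("Prove that sqrt(2) is irrational.",
     {"complexity": 4, "info_value": 5, "clarity": 5}),
    ("Compute stuff.",
     {"complexity": 2, "info_value": 1, "clarity": 1}),
]
kept, dropped = filter_seeds(scored, min_score=3.0)
```

Returning the dropped seeds alongside the kept ones makes it easy to inspect the drop rate and spot an overly strict threshold early.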

Agent Features

Planning
iterative review-revise loop (up to 3 rounds); ranking-based selection with long-tail diversity
Tool Use
LLM-as-evaluator (GPT-4o-mini); Chain-of-Thought prompting for solution agents; rephrase/reviewer/revise/solver agent roles
Frameworks
AgenticMath
Is Agentic

Yes

Architectures
multi-agent pipeline with distinct role agents
Collaboration
review ↔ revise agent interaction; multi-step agent coordination for quality control
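The review ↔ revise interaction with its cap of 3 rounds can be sketched as a bounded loop. This is an illustrative reading of the feature list above, not the paper's implementation: `refine_problem` and the three agent callables are hypothetical, and the convention that the reviewer returns `None` on acceptance is an assumption.

```python
from typing import Callable, Optional, Tuple


def refine_problem(
    seed: str,
    rephrase: Callable[[str], str],
    review: Callable[[str], Optional[str]],  # feedback text, or None if accepted
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Rephrase a seed, then alternate review and revise for up to max_rounds.
    Returns (problem, accepted)."""
    problem = rephrase(seed)
    for _ in range(max_rounds):
        feedback = review(problem)
        if feedback is None:
            return problem, True
        problem = revise(problem, feedback)
    # Round budget exhausted: the last revision has not been reviewed.
    return problem, False


# Toy agents (assumptions): the reviewer only checks for a trailing "?".
result = refine_problem(
    "What is 2+2",
    rephrase=lambda s: s,
    review=lambda p: None if p.endswith("?") else "missing question mark",
    revise=lambda p, fb: p + "?",
)
```

Returning the acceptance flag lets the caller decide whether to discard problems that never passed review or send them to a later scoring stage.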

Optimization Features

System Optimization
ranking selection that trades quality for diversity; score curation using k-NN/Score Transition Matrix to stabilize LLM ratings
Training Optimization
SFT; duplication-based scaling to 60K/90K (repeat solutions) as practiced in experiments
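To give a feel for how neighbor information can stabilize noisy LLM ratings, the sketch below averages each sample's score with the scores of its k nearest neighbors in embedding space. This is plain k-NN smoothing swapped in for illustration; it is not the paper's Score Transition Matrix procedure, and the function name, the embeddings, and the choice of Euclidean distance are all assumptions.

```python
import math
from typing import List, Sequence


def knn_smooth_scores(
    embeddings: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int = 2,
) -> List[float]:
    """Replace each score with the mean of itself and its k nearest
    neighbors' scores (Euclidean distance over the embeddings)."""
    smoothed: List[float] = []
    for i, emb in enumerate(embeddings):
        # Sort the other samples by distance; ties break on index.
        dists = sorted(
            (math.dist(emb, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        group = [scores[i]] + [scores[j] for _, j in dists[:k]]
        smoothed.append(sum(group) / len(group))
    return smoothed


# Three nearby samples and one far away; the third rating looks like an outlier.
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
ratings = [5.0, 5.0, 1.0, 3.0]
smoothed = knn_smooth_scores(embeddings, ratings, k=2)
```

In this toy data the outlier rating of 1.0 is pulled toward its neighbors' ratings of 5.0, which is the stabilizing effect the feature list describes.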

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

AgenticMathQA (paper claims release; see paper appendix and abstract)

Risks & Boundaries

Limitations

Focuses only on text-based math; method not evaluated on diagrams or multimodal problems.

Relies on a closed-source/paid teacher-evaluator (GPT-4o-mini) for much of the pipeline, which can be a practical cost or availability bottleneck.

When Not To Use

When your tasks require diagrams, figures, or other visual reasoning (multimodal problems).

When you already have large, high-quality human-annotated math corpora.

Failure Modes

Teacher/evaluator LLM produces incorrect solutions or biased scores, which propagate into training data.

Over-filtering: strict seed thresholds may drop simpler but useful examples and hurt some use-cases.

Core Entities

Models

GPT-4o-mini, Qwen2.5-3B, DeepSeekMath-7B, Mistral-7B, Llama3-8B

Metrics

Accuracy

Datasets

AgenticMathQA (30K/60K/90K), GSM8K, MATH

Benchmarks

GSM8K, MATH, CollegeMath, DeepMind-Mathematics, OlympiadBench, TheoremQA

Context Entities

Models

WizardMath, MathFusion, DART-Math, MetaMath, MMIQC, DeepSeekMath variants

Datasets

RefAug, MathFusion datasets, large-scale math collections (400K–2.3M)