Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can cut synthetic-data volume by roughly an order of magnitude while matching or improving a model's math performance. That lowers labeling costs and GPU training time, and it lets smaller models reach stronger production-grade math competence.
Summary TLDR
AgenticMath is a multi-agent, LLM-driven pipeline that filters seed problems, rephrases them, generates chain-of-thought solutions, and ranks the resulting problem-solution pairs. With curated datasets of 30K–90K samples, it raises the math accuracy of 3–8B models to match or beat baselines trained on hundreds of thousands to millions of samples.
Problem Statement
Synthetic math data is cheap but often low quality. Poorly phrased problems and wrong solutions limit supervised fine-tuning gains. The paper targets the data-quality bottleneck: create fewer, higher-value problem-solution pairs so small models learn better.
Main Contribution
AgenticMath: a 4‑stage multi-agent pipeline (seed filtering, rephrase + review + revise, solution generation with CoT, joint evaluation) to synthesize math QA.
AgenticMathQA: curated datasets in 30K, 60K, 90K sizes focused on clarity, correctness, and diversity rather than scale.
Empirical validation: 3–8B models fine-tuned on 30K–90K AgenticMath data match or outperform models trained on 400K–2.3M samples across six math benchmarks.
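The four stages listed above can be sketched as a simple composition of functions. This is only a structural illustration, not the paper's implementation: `run_pipeline`, `Pair`, and the four callables (`keep_seed`, `refine`, `solve`, `evaluate`) are hypothetical names, and in practice each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pair:
    """A synthesized problem-solution pair with its joint quality score."""
    problem: str
    solution: str
    score: float


def run_pipeline(
    seeds: List[str],
    keep_seed: Callable[[str], bool],        # stage 1: seed filtering (hypothetical)
    refine: Callable[[str], str],            # stage 2: rephrase + review + revise
    solve: Callable[[str], str],             # stage 3: chain-of-thought solution
    evaluate: Callable[[str, str], float],   # stage 4: joint problem-solution score
    top_k: int,
) -> List[Pair]:
    """Run the four stages in order and keep the top_k highest-scoring pairs."""
    kept = [s for s in seeds if keep_seed(s)]
    pairs = []
    for seed in kept:
        problem = refine(seed)
        solution = solve(problem)
        pairs.append(Pair(problem, solution, evaluate(problem, solution)))
    # Highest joint score first; ties keep their original order.
    pairs.sort(key=lambda p: p.score, reverse=True)
    return pairs[:top_k]
```

Passing the stages in as callables keeps the skeleton independent of any particular model or API, so each stage can be swapped or ablated on its own.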
Key Findings
AgenticMath produces competitive performance with far less data.
Quality-focused synthesis yields a large gain on a single base model (Qwen2.5-3B fine-tuned on the 30K set).
Each pipeline module contributes measurable improvements.
A 60K AgenticMath dataset can match or exceed much larger baselines.
Results
Figures not reproduced here. Panel titles: Accuracy; Effect of pipeline modules (15K synthesized); Quality distribution of refined synthetic problems.
Who Should Care
What To Try In 7 Days
Run seed filtering: score your seed problems (complexity, info value, clarity) and drop low-quality ones.
Add a rephrase + review + revise loop on synthetic generation to fix ambiguity before solution synthesis.
Generate chain-of-thought solutions and perform joint problem-solution scoring; keep top-ranked samples for SFT (aim for 15K high-quality pairs).
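As a starting point for the first step, seed filtering might look like the sketch below. It assumes each seed has already been scored by an evaluator on the three dimensions named above; the 1–5 scale, the thresholds, and the names `filter_seeds`, `min_mean`, and `min_each` are illustrative assumptions, not values from the paper.

```python
from typing import Dict, List, Tuple

# Dimension names follow the text; the 1-5 scale is an assumption.
DIMENSIONS = ("complexity", "info_value", "clarity")


def filter_seeds(
    scored: List[Tuple[str, Dict[str, float]]],
    min_mean: float = 3.5,   # hypothetical threshold on the average score
    min_each: float = 2.0,   # hypothetical floor for every single dimension
) -> List[str]:
    """Keep seeds whose average score is high and whose weakest dimension is acceptable."""
    kept = []
    for problem, scores in scored:
        values = [scores[d] for d in DIMENSIONS]
        mean = sum(values) / len(values)
        if min(values) >= min_each and mean >= min_mean:
            kept.append(problem)
    return kept
```

The per-dimension floor guards against a seed that is complex and informative but too unclear to rephrase reliably. Looser thresholds reduce the over-filtering risk of dropping simpler but useful examples.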
Agent Features
Planning
- iterative review-revise loop (up to 3 rounds)
- ranking-based selection with long-tail diversity
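The bounded review-revise loop can be sketched as follows. The reviewer and reviser are passed in as plain callables so the example runs without any model; in a real setup they would wrap LLM calls. The function name and the return convention are assumptions made for illustration.

```python
from typing import Callable, Tuple


def review_revise(
    problem: str,
    review: Callable[[str], Tuple[bool, str]],   # returns (accepted, feedback)
    revise: Callable[[str, str], str],           # returns a revised problem
    max_rounds: int = 3,                         # the text states up to 3 rounds
) -> Tuple[str, bool]:
    """Alternate review and revision until accepted or the round budget is spent."""
    current = problem
    for _ in range(max_rounds):
        accepted, feedback = review(current)
        if accepted:
            return current, True
        current = revise(current, feedback)
    # Budget exhausted: the last revision was never reviewed, so flag it as not accepted.
    return current, False
```

Returning the acceptance flag lets the caller decide whether to drop problems that never passed review or to send them to a stricter check.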
Tool Use
- LLM-as-evaluator (GPT-4o-mini)
- Chain-of-Thought prompting for solution agents
- rephrase/reviewer/revise/solver agent roles
Frameworks
- AgenticMath
Is Agentic
true
Architectures
- multi-agent pipeline with distinct role agents
Collaboration
- review ↔ revise agent interaction
- multi-step agent coordination for quality control
Optimization Features
System Optimization
- ranking selection that balances quality against diversity
- score curation using k-NN/Score Transition Matrix to stabilize LLM ratings
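To give a feel for how neighbor information can stabilize noisy LLM ratings, here is a deliberately simplified k-NN smoothing sketch. It averages each sample's score with the scores of its nearest neighbors in an embedding space. It does not implement a Score Transition Matrix, and the function name, the Euclidean metric, and the plain averaging rule are all assumptions rather than the paper's procedure.

```python
import math
from typing import List, Sequence


def knn_smooth_scores(
    embeddings: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int = 2,
) -> List[float]:
    """Replace each score with the mean of itself and its k nearest neighbors' scores."""
    n = len(scores)
    smoothed = []
    for i in range(n):
        # Euclidean distance from sample i to every other sample, nearest first.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n)
            if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        votes = [scores[i]] + [scores[j] for j in neighbours]
        smoothed.append(sum(votes) / len(votes))
    return smoothed
```

The pairwise loop is quadratic in the number of samples, which is fine for a quick experiment; at the 30K scale an approximate nearest-neighbor index would be the more practical choice.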
Training Optimization
- SFT
- scaling to 60K/90K by duplication (repeated solutions), as done in the experiments
Reproducibility
Data Urls
- AgenticMathQA (paper claims release; see paper appendix and abstract)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Focuses only on text-based math; method not evaluated on diagrams or multimodal problems.
- Relies on a closed-source/paid teacher-evaluator (GPT-4o-mini) for much of the pipeline, which can be a practical cost or availability bottleneck.
- Experiments stop at 90K; scalability behavior for much larger synthetic budgets is untested.
When Not To Use
- When your tasks require diagrams, figures, or other visual reasoning (multimodal problems).
- When you already have large, high-quality human-annotated math corpora.
- If you cannot run or afford a strong teacher/evaluator LLM (the pipeline depends heavily on it).
Failure Modes
- Teacher/evaluator LLM produces incorrect solutions or biased scores, which propagate into training data.
- Over-filtering: strict seed thresholds may drop simpler but useful examples and hurt some use-cases.
- Synthetic problems may still contain subtle logical flaws that escape automated scoring and mislead the model.
Core Entities
Models
- GPT-4o-mini
- Qwen2.5-3B
- DeepSeekMath-7B
- Mistral-7B
- Llama3-8B
Metrics
- Accuracy
Datasets
- AgenticMathQA (30K/60K/90K)
- GSM8K
- MATH
Benchmarks
- GSM8K
- MATH
- CollegeMath
- DeepMind-Mathematics
- OlympiadBench
- TheoremQA
Context Entities
Models
- WizardMath
- MathFusion
- DART-Math
- MetaMath
- MMIQC
- DeepSeekMath variants
Datasets
- RefAug
- MathFusion datasets
- large-scale math collections (400K–2.3M)

