Use a multi-agent LLM pipeline to synthesize 30–90K high-quality math QA pairs that let 3–8B models match or beat models trained on 400K–2.3M samples

October 22, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei

Why It Matters For Business

You can cut synthetic-data volume by an order of magnitude and keep or improve model math performance. That lowers labeling costs and GPU training time, and it lets smaller models reach stronger math performance in production.

Summary TLDR

AgenticMath is a multi-agent, LLM-driven pipeline that filters seeds, rephrases problems, generates chain-of-thought solutions, and ranks pairs. With curated 30K–90K datasets it raises 3–8B models' math accuracy to match or beat baselines trained on hundreds of thousands to millions of samples.

Problem Statement

Synthetic math data is cheap but often low quality. Poorly phrased problems and wrong solutions limit supervised fine-tuning gains. The paper targets the data-quality bottleneck: create fewer, higher-value problem-solution pairs so small models learn better.

Main Contribution

AgenticMath: a 4‑stage multi-agent pipeline (seed filtering, rephrase + review + revise, solution generation with CoT, joint evaluation) to synthesize math QA.

AgenticMathQA: curated datasets in 30K, 60K, 90K sizes focused on clarity, correctness, and diversity rather than scale.

Empirical validation: 3–8B models fine-tuned on 30K–90K AgenticMath data match or outperform models trained on 400K–2.3M samples across six math benchmarks.
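The four stages listed above can be sketched as a single function that wires the agents together. This is a minimal sketch, not the paper's implementation: every agent is passed in as a plain callable, and the names `run_pipeline`, `QAPair`, `seed_threshold`, and `keep` are hypothetical. In practice each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    problem: str
    solution: str
    score: float


def run_pipeline(
    seeds: List[str],
    seed_score: Callable[[str], float],        # hypothetical seed-quality scorer
    rephrase: Callable[[str], str],            # hypothetical rephrase agent
    review_and_revise: Callable[[str], str],   # hypothetical review + revise step
    solve_with_cot: Callable[[str], str],      # hypothetical CoT solver agent
    joint_score: Callable[[str, str], float],  # hypothetical problem-solution evaluator
    seed_threshold: float,
    keep: int,
) -> List[QAPair]:
    # Stage 1: seed filtering, drop seeds scored below the threshold.
    kept_seeds = [s for s in seeds if seed_score(s) >= seed_threshold]

    # Stage 2: rephrase each kept seed, then review and revise the result.
    problems = [review_and_revise(rephrase(s)) for s in kept_seeds]

    # Stage 3: generate a chain-of-thought solution for each problem,
    # and score the problem and solution together.
    pairs = []
    for problem in problems:
        solution = solve_with_cot(problem)
        pairs.append(QAPair(problem, solution, joint_score(problem, solution)))

    # Stage 4: joint evaluation, keep the highest-scoring pairs.
    pairs.sort(key=lambda pair: pair.score, reverse=True)
    return pairs[:keep]
```

Passing the agents as callables keeps the control flow testable with stubs before any model is attached.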

Key Findings

AgenticMath produces competitive performance with far less data.

Numbers: 30K–90K AgenticMath vs 400K–2.3M baselines (Table 2)

Large per-base-model gain from quality-focused synthesis (Qwen2.5-3B, 30K).

Numbers: AgenticMath-Qwen2.5-3B avg 53.7 vs RefAug 34.6 (+19.1) (Table 1)

Each pipeline module contributes measurable improvements.

Numbers: Ablation (15K): +0.6 (seed filtering), +1.0 (review-revise), +0.2 (synthetic evaluation) (Table 3)

A 60K AgenticMath dataset can match or exceed much larger baselines.

Numbers: AgenticMath-DSMath-7B (60K) avg 49.3 vs DeepSeekMath-7B-RFT (590K) 48.3 (Table 2)

Results

Accuracy

Value: 53.7 (AgenticMath-Qwen2.5-3B, 30K)

Baseline: RefAug Qwen2.5-3B (30K) 34.6

Accuracy

Value: 49.3 (AgenticMath-DSMath-7B, 60K)

Baseline: DeepSeekMath-7B-RFT (590K) 48.3

Effect of pipeline modules (15K synthesized)

Value: Problem Rephrase 31.4 → +0.6 seed filtering → +1.0 review-revise → +0.2 synthetic evaluation → 33.2

Baseline: Problem Rephrase alone 31.4
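The ablation figures for the 15K setting add up step by step. The short check below only re-does that arithmetic from the numbers quoted here. The function name `cumulative` is hypothetical, and rounding to one decimal is an assumption that matches the precision of the reported values.

```python
# Reported ablation numbers (15K synthesized samples).
BASELINE = 31.4
GAINS = [
    ("seed filtering", 0.6),
    ("review-revise", 1.0),
    ("synthetic evaluation", 0.2),
]


def cumulative(baseline, gains):
    """Return (stage, running average accuracy) rows, rounded to 1 decimal."""
    rows = [("problem rephrase only", baseline)]
    total = baseline
    for name, delta in gains:
        total = round(total + delta, 1)  # avoid float noise such as 31.999999
        rows.append((name, total))
    return rows


for stage, accuracy in cumulative(BASELINE, GAINS):
    print(f"{stage:>24}: {accuracy}")
```

The final row comes to 33.2, a total gain of 1.8 points over rephrasing alone, with the review-revise step contributing the largest single increment.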

Quality distribution of refined synthetic problems

Value65% scored 4, 27% scored 3 (n=18,679)

Who Should Care

What To Try In 7 Days

Run seed filtering: score your seed problems (complexity, info value, clarity) and drop low-quality ones.

Add a rephrase + review + revise loop on synthetic generation to fix ambiguity before solution synthesis.

Generate chain-of-thought solutions and perform joint problem-solution scoring; keep top-ranked samples for SFT (aim for 15K high-quality pairs).
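As a starting point for the first step, seed filtering can be a simple threshold over the three criteria. The sketch below assumes each seed has already been rated by an LLM on a 1–5 scale. That scale, the thresholds `min_mean` and `min_each`, and the names `SeedScore` and `filter_seeds` are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeedScore:
    problem: str
    complexity: int   # assumed 1-5 rating from an LLM evaluator
    info_value: int   # assumed 1-5 rating
    clarity: int      # assumed 1-5 rating

    @property
    def mean(self) -> float:
        return (self.complexity + self.info_value + self.clarity) / 3


def filter_seeds(
    scored: List[SeedScore],
    min_mean: float = 3.0,  # hypothetical threshold on the average rating
    min_each: int = 2,      # hypothetical floor on every single criterion
) -> List[SeedScore]:
    """Keep seeds that are good on average and not very weak on any criterion."""
    return [
        s
        for s in scored
        if s.mean >= min_mean
        and min(s.complexity, s.info_value, s.clarity) >= min_each
    ]
```

The per-criterion floor is one way to avoid keeping a seed that is complex but unreadable. Loosening it is a reasonable first adjustment if too many simple but useful problems get dropped.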

Agent Features

Planning

  • iterative review-revise loop (up to 3 rounds)
  • ranking-based selection with long-tail diversity
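One way to read "ranking-based selection with long-tail diversity" is to fill most of the budget with the top-ranked pairs and reserve a small share for samples drawn from the rest of the ranking. The sketch below illustrates that reading only. The paper's actual selection rule may differ, and `select_with_tail`, `tail_fraction`, and `seed` are hypothetical names.

```python
import random
from typing import List, Tuple


def select_with_tail(
    scored: List[Tuple[str, float]],  # (sample id, quality score)
    budget: int,
    tail_fraction: float = 0.2,       # assumed share reserved for lower-ranked samples
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Pick the top-ranked samples plus a random draw from the remaining tail."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    budget = min(budget, len(ranked))

    n_tail = int(budget * tail_fraction)
    n_head = budget - n_tail

    head = ranked[:n_head]   # highest-scoring samples
    rest = ranked[n_head:]   # everything below the head cut-off

    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    tail = rng.sample(rest, min(n_tail, len(rest)))
    return head + tail
```

Setting `tail_fraction=0.0` reduces this to plain top-k selection, which makes it easy to compare both settings on the same scored pool.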

Tool Use

  • LLM-as-evaluator (GPT-4o-mini)
  • Chain-of-Thought prompting for solution agents
  • rephrase/reviewer/revise/solver agent roles

Frameworks

  • AgenticMath

Is Agentic

true

Architectures

  • multi-agent pipeline with distinct role agents

Collaboration

  • review ↔ revise agent interaction
  • multi-step agent coordination for quality control
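The review ↔ revise interaction can be sketched as a bounded loop. This is a minimal sketch assuming the reviewer returns either `None` (accept) or a feedback string. That contract and the names `review_revise`, `review`, and `revise` are assumptions for illustration. The default of three rounds follows the limit listed under Planning.

```python
from typing import Callable, Optional, Tuple


def review_revise(
    problem: str,
    review: Callable[[str], Optional[str]],  # hypothetical reviewer: None means accepted
    revise: Callable[[str, str], str],       # hypothetical reviser: (problem, feedback) -> new problem
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Revise a problem until the reviewer accepts it or the round limit is hit.

    Returns the final problem text and whether the reviewer accepted it.
    """
    for _ in range(max_rounds):
        feedback = review(problem)
        if feedback is None:
            return problem, True
        problem = revise(problem, feedback)

    # The last revision has not been reviewed yet, so check it once more.
    return problem, review(problem) is None
```

Returning the acceptance flag lets the caller decide whether to drop problems that never pass review or to send them on with a lower priority.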

Optimization Features

System Optimization

  • ranking-based selection that balances quality against diversity
  • score curation using k-NN/Score Transition Matrix to stabilize LLM ratings
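To give a feel for the k-NN part of score curation, the sketch below replaces each LLM rating with the majority rating among the sample's nearest neighbours in embedding space. This is a simplified neighbour vote, not the Score Transition Matrix procedure itself, which is not reproduced here. The function name `knn_smooth_scores`, the Euclidean distance, and the tie-breaking rule are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_smooth_scores(
    embeddings: Sequence[Sequence[float]],  # one vector per sample (assumed precomputed)
    scores: Sequence[int],                  # raw LLM ratings, one per sample
    k: int = 3,
) -> List[int]:
    """Replace each rating with the majority rating of its k nearest samples."""
    curated = []
    for i, vector in enumerate(embeddings):
        # Sort all samples by distance to sample i. The sample itself is at
        # distance 0, so it is normally part of its own neighbourhood.
        order = sorted(
            range(len(embeddings)),
            key=lambda j: math.dist(vector, embeddings[j]),
        )
        neighbours = order[:k]

        votes = Counter(scores[j] for j in neighbours)
        top = max(votes.values())
        tied = [score for score, count in votes.items() if count == top]

        # On a tie, keep the original rating if it is among the winners,
        # otherwise fall back to the lowest tied rating (conservative choice).
        curated.append(scores[i] if scores[i] in tied else min(tied))
    return curated
```

The brute-force distance sort is quadratic in the number of samples, so a pool of tens of thousands of pairs would call for an approximate nearest-neighbour index instead.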

Training Optimization

  • SFT
  • duplication-based scaling to 60K/90K (repeat solutions) as practiced in experiments

Reproducibility

Data Urls

  • AgenticMathQA (paper claims release; see paper appendix and abstract)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Focuses only on text-based math; method not evaluated on diagrams or multimodal problems.
  • Relies on a closed-source/paid teacher-evaluator (GPT-4o-mini) for much of the pipeline, which can be a practical cost or availability bottleneck.
  • Experiments stop at 90K; scalability behavior for much larger synthetic budgets is untested.

When Not To Use

  • When your tasks require diagrams, figures, or other visual reasoning (multimodal problems).
  • When you already have large, high-quality human-annotated math corpora.
  • If you cannot run or afford a strong teacher/evaluator LLM (the pipeline depends heavily on it).

Failure Modes

  • Teacher/evaluator LLM produces incorrect solutions or biased scores, which propagate into training data.
  • Over-filtering: strict seed thresholds may drop simpler but useful examples and hurt some use cases.
  • Synthetic problems may still contain subtle logical flaws that escape automated scoring and mislead the model.

Core Entities

Models

  • GPT-4o-mini
  • Qwen2.5-3B
  • DeepSeekMath-7B
  • Mistral-7B
  • Llama3-8B

Metrics

  • Accuracy

Datasets

  • AgenticMathQA (30K/60K/90K)
  • GSM8K
  • MATH

Benchmarks

  • GSM8K
  • MATH
  • CollegeMath
  • DeepMind-Mathematics
  • OlympiadBench
  • TheoremQA

Context Entities

Models

  • WizardMath
  • MathFusion
  • DART-Math
  • MetaMath
  • MMIQC
  • DeepSeekMath variants

Datasets

  • RefAug
  • MathFusion datasets
  • large-scale math collections (400K–2.3M)