Use a multi-agent LLM pipeline to synthesize 30K–90K high-quality math QA pairs that let 3–8B models match or beat models trained on 400K–2.3M samples

Overview

Decision Snapshot: Ready For Pilot

The paper provides thorough experiments across multiple base models and ablations. Results are strong for text-only math tasks, but they rest on a closed-source teacher (GPT-4o-mini), and reproducibility depends on the release of the data and prompts.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei

Links

Abstract / PDF / Data

Why It Matters For Business

You can cut synthetic-data volume by an order of magnitude and keep or improve model math performance. That lowers labeling costs and GPU training time while enabling smaller models to reach stronger production math competence.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist, Engineering Lead, Founder

Summary TLDR

AgenticMath is a multi-agent, LLM-driven pipeline that filters seeds, rephrases problems, generates chain-of-thought solutions, and ranks pairs. With curated 30K–90K datasets it raises 3–8B models' math accuracy to match or beat baselines trained on hundreds of thousands to millions of samples.

Problem Statement

Synthetic math data is cheap but often low quality. Poorly phrased problems and wrong solutions limit supervised fine-tuning gains. The paper targets the data-quality bottleneck: create fewer, higher-value problem-solution pairs so small models learn better.

Main Contribution

AgenticMath: a 4‑stage multi-agent pipeline (seed filtering, rephrase + review + revise, solution generation with CoT, joint evaluation) to synthesize math QA.

AgenticMathQA: curated datasets in 30K, 60K, 90K sizes focused on clarity, correctness, and diversity rather than scale.
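The four stages listed above can be sketched as a simple data flow. This is a minimal illustration, not the authors' implementation: the `QAPair` type, the `run_pipeline` function, and the four callables (`keep_seed`, `refine`, `solve`, `evaluate`) are hypothetical names, and in practice each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    problem: str
    solution: str
    score: float = 0.0


def run_pipeline(
    seeds: List[str],
    keep_seed: Callable[[str], bool],        # stage 1 (hypothetical hook)
    refine: Callable[[str], str],            # stage 2 (hypothetical hook)
    solve: Callable[[str], str],             # stage 3 (hypothetical hook)
    evaluate: Callable[[str, str], float],   # stage 4 (hypothetical hook)
    budget: int,
) -> List[QAPair]:
    # Stage 1: seed filtering drops low-quality seed problems.
    kept = [s for s in seeds if keep_seed(s)]
    # Stage 2: rephrase + review + revise yields a cleaned-up problem.
    problems = [refine(s) for s in kept]
    # Stage 3: a solver produces a chain-of-thought solution per problem.
    pairs = [QAPair(p, solve(p)) for p in problems]
    # Stage 4: joint evaluation scores each problem-solution pair.
    for pair in pairs:
        pair.score = evaluate(pair.problem, pair.solution)
    # Keep only the highest-scoring pairs, up to the target dataset size.
    pairs.sort(key=lambda pair: pair.score, reverse=True)
    return pairs[:budget]


# Toy stand-ins for the LLM agents, only to show the data flow.
demo = run_pipeline(
    seeds=["What is 2 + 2?", "Solve x^2 - 5x + 6 = 0.", "Hi"],
    keep_seed=lambda s: len(s) > 5,
    refine=lambda s: s.strip(),
    solve=lambda p: f"Step-by-step reasoning for: {p}",
    evaluate=lambda p, s: float(len(p)),
    budget=1,
)
print(demo[0].problem)
```

Passing the agents in as callables keeps the orchestration testable without any model access; swapping the toy lambdas for real LLM calls leaves `run_pipeline` unchanged.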

Key Findings

AgenticMath produces competitive performance with far less data.

Numbers: 30K–90K AgenticMath vs 400K–2.3M baselines (Table 2)

Practical Use: Curate high-quality synthetic math pairs instead of scaling raw synthetic data; expect similar benchmark gains with an order of magnitude less labeling cost.

Evidence Ref: Table 2, Sec. 4.2

Large per-base-model gain from quality-focused synthesis (Qwen2.5-3B, 30K).

Numbers: AgenticMath-Qwen2.5-3B avg 53.7 vs RefAug 34.6 (+19.1) (Table 1)

Practical Use: If you fine-tune Qwen2.5-3B, replacing generic 30K synthetic data with AgenticMath 30K can yield roughly +19 points of average accuracy on the paper's benchmarks.

Evidence Ref: Table 1, Sec. 4.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	53.7 (AgenticMath-Qwen2.5-3B, 30K)	RefAug Qwen2.5-3B (30K) 34.6	+19.1	Table 1 (30K, Qwen2.5-3B)	Table 1; Sec. 4.2	Table 1
Accuracy	49.3 (AgenticMath-DSMath-7B, 60K)	DeepSeekMath-7B-RFT (590K) 48.3	+1.0	Table 2 (60K, DeepSeekMath-7B)	Table 2; Sec. 4.2	Table 2
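As a quick sanity check on the table above, the deltas and the data-size ratios can be recomputed from the reported numbers. The snippet below is only a worked-arithmetic sketch; the `summarize` helper is a hypothetical name, and the sample counts are taken from the row labels (30K vs 30K, 60K vs 590K).

```python
# (label, score, baseline score, training samples, baseline training samples)
rows = [
    ("Qwen2.5-3B", 53.7, 34.6, 30_000, 30_000),
    ("DeepSeekMath-7B", 49.3, 48.3, 60_000, 590_000),
]


def summarize(score, baseline, n, n_baseline):
    """Return (accuracy delta, how many times more data the baseline used)."""
    delta = round(score - baseline, 1)
    data_ratio = round(n_baseline / n, 1)
    return delta, data_ratio


for label, *values in rows:
    delta, ratio = summarize(*values)
    print(f"{label}: delta={delta:+.1f}, baseline uses {ratio}x the data")
```

The first row is a same-size comparison (the gain comes from data quality alone), while in the second row the baseline uses close to ten times as many samples for a 1.0-point lower score.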

What To Try In 7 Days

Run seed filtering: score your seed problems (complexity, info value, clarity) and drop low-quality ones.

Add a rephrase + review + revise loop on synthetic generation to fix ambiguity before solution synthesis.

Generate chain-of-thought solutions and perform joint problem-solution scoring; keep top-ranked samples for SFT (aim for 15K high-quality pairs).
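The first step above (seed filtering) might look like the sketch below. It assumes each seed has already been rated by an LLM judge on the three dimensions named in the text; the 1-5 scale, the `SeedScore` type, the `filter_seeds` function, and both thresholds are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeedScore:
    problem: str
    complexity: float   # assumed 1-5 rating from an LLM judge
    info_value: float   # assumed 1-5 rating
    clarity: float      # assumed 1-5 rating


def filter_seeds(
    scored: List[SeedScore],
    min_mean: float = 3.5,   # hypothetical threshold on the average rating
    min_each: float = 2.0,   # hypothetical floor on every single dimension
) -> List[SeedScore]:
    kept = []
    for seed in scored:
        dims = (seed.complexity, seed.info_value, seed.clarity)
        # Require a decent average and no dimension below the floor.
        if min(dims) >= min_each and sum(dims) / len(dims) >= min_mean:
            kept.append(seed)
    return kept


seeds = [
    SeedScore("Compute 1 + 1.", 1, 2, 5),
    SeedScore("Find all real x with x^3 - 6x^2 + 11x - 6 = 0.", 4, 4, 5),
    SeedScore("A train goes somewhere fast; how long?", 3, 3, 2),
]
for seed in filter_seeds(seeds):
    print(seed.problem)
```

Logging how many seeds each threshold removes is a cheap way to notice over-filtering early, since a strict floor can discard simple but still useful problems.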

Agent Features

Planning

iterative review-revise loop (up to 3 rounds)

ranking-based selection with long-tail diversity
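A bounded review-revise loop of the kind described here could be structured as follows. This is a sketch under assumptions: `refine_problem` and its three agent callables are hypothetical names, the reviewer is assumed to return `None` when it accepts a draft, and the final acceptance check after the last revision is a design choice of this example rather than something the source specifies.

```python
from typing import Callable, Optional, Tuple


def refine_problem(
    seed: str,
    rephrase: Callable[[str], str],
    review: Callable[[str], Optional[str]],  # feedback text, or None if accepted
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Return (final draft, whether the reviewer accepted it)."""
    draft = rephrase(seed)
    for _ in range(max_rounds):
        feedback = review(draft)
        if feedback is None:
            return draft, True          # reviewer is satisfied, stop early
        draft = revise(draft, feedback)
    # Rounds exhausted: check once more whether the last revision passes.
    return draft, review(draft) is None


# Toy agents: the reviewer insists that the problem states its units.
final, accepted = refine_problem(
    seed=" How far does the car travel in 2 hours at 60 per hour? ",
    rephrase=lambda s: s.strip(),
    review=lambda d: None if "km" in d else "state the units",
    revise=lambda d, fb: d + " (answer in km)",
)
print(accepted, final)
```

Returning the acceptance flag alongside the draft lets the caller drop problems that never pass review instead of silently keeping them.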

Tool Use

LLM-as-evaluator (GPT-4o-mini)

Chain-of-Thought prompting for solution agents

rephrase/reviewer/revise/solver agent roles

Frameworks

AgenticMath

Is Agentic

Yes

Architectures

multi-agent pipeline with distinct role agents

Collaboration

review ↔ revise agent interaction

multi-step agent coordination for quality control

Optimization Features

System Optimization

ranking selection that trades off quality against diversity

score curation using k-NN/Score Transition Matrix to stabilize LLM ratings
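To illustrate the general idea behind k-NN score curation, the sketch below replaces an outlying LLM rating with the majority rating of its nearest neighbors in embedding space. It is a generic k-NN smoothing example, not the paper's procedure: the `knn_smooth_scores` function, the tie-breaking rule, and the toy 2-D embeddings are assumptions, and the Score Transition Matrix is not modeled here.

```python
from collections import Counter
from math import dist
from typing import List, Sequence


def knn_smooth_scores(
    embeddings: Sequence[Sequence[float]],
    scores: Sequence[int],
    k: int = 3,
) -> List[int]:
    """Majority-vote each integer rating with its k nearest neighbors."""
    smoothed = []
    for i, emb in enumerate(embeddings):
        # Nearest neighbors by Euclidean distance, excluding the sample itself.
        neighbors = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: dist(emb, embeddings[j]),
        )[:k]
        votes = Counter(scores[j] for j in neighbors)
        votes[scores[i]] += 1  # the sample's own rating counts as one vote
        top = max(votes.values())
        winners = sorted(s for s, c in votes.items() if c == top)
        # On ties, keep the original rating if it is among the winners.
        smoothed.append(scores[i] if scores[i] in winners else winners[0])
    return smoothed


# Two tight clusters of similar problems, each with one inconsistent rating.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
]
raw_scores = [5, 5, 1, 5, 2, 2, 2, 4]
print(knn_smooth_scores(embeddings, raw_scores))
```

In this toy case the lone 1 in the first cluster and the lone 4 in the second are pulled to their neighbors' ratings; with real data the embeddings would come from a sentence-embedding model of your choice.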

Training Optimization

SFT

duplication-based scaling to 60K/90K (repeated solutions), as practiced in the experiments

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

AgenticMathQA (paper claims release; see paper appendix and abstract)

Risks & Boundaries

Limitations

Focuses only on text-based math; method not evaluated on diagrams or multimodal problems.

Relies on a closed-source/paid teacher-evaluator (GPT-4o-mini) for much of the pipeline, which can be a practical cost or availability bottleneck.

When Not To Use

When your tasks require diagrams, figures, or other visual reasoning (multimodal problems).

When you already have large, high-quality human-annotated math corpora.

Failure Modes

Teacher/evaluator LLM produces incorrect solutions or biased scores, which propagate into training data.

Over-filtering: strict seed thresholds may drop simpler but useful examples and hurt some use-cases.

Core Entities

Models

GPT-4o-mini, Qwen2.5-3B, DeepSeekMath-7B, Mistral-7B, Llama3-8B

Metrics

Accuracy

Datasets

AgenticMathQA (30K/60K/90K), GSM8K, MATH

Benchmarks

GSM8K, MATH, CollegeMath, DeepMind-Mathematics, OlympiadBench, TheoremQA

Context Entities

Models

WizardMath, MathFusion, DART-Math, MetaMath, MMIQC, DeepSeekMath variants

Datasets

RefAug, MathFusion datasets, large-scale math collections (400K–2.3M)

