Overview
The paper reports consistent benchmark gains with supporting ablations, but it relies on proprietary LLMs for data generation and evaluates only zero-shot, without verifier-based inference pipelines.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 4/4
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you need better math reasoning from open models, combine cleaned web QAs with focused synthetic augmentation (IQC) to get consistent, low-effort accuracy gains without external tools.
Summary TLDR
The authors build MMIQC, a mixed dataset of 1.2M filtered Math Stack Exchange QAs and 200K+ synthetic question-answer pairs, and introduce Iterative Question Composing (IQC), an LLM-driven loop that composes new problems from seed problems and filters them by rejection sampling. Fine-tuning open models on MMIQC improves MATH benchmark accuracy across model sizes. Top result: Qwen-72B-MMIQC reaches 45.0% on MATH, outperforming prior open-source baselines and the GPT-4 number reported in 2023. The gains come from both the large Math Stack Exchange subset and the IQC augmentation method.
Problem Statement
Open-source base LLMs struggle with competition-level math problems. The paper asks: can mixing cleaned web math QAs with targeted synthetic augmentation, especially Iterative Question Composing (IQC), produce consistent, generalizable accuracy gains when used for supervised fine-tuning?
Main Contribution
Introduce IQC (Iterative Question Composing): iteratively ask an LLM to compose new problems from seed problems, then filter the results with a second LLM via rejection sampling.
Release MMIQC: a mixed dataset combining 1.2M Math Stack Exchange QAs and ~203K synthetic or filtered QA pairs from several augmentation methods.
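
The IQC loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `compose` and `solve` are hypothetical callables standing in for the composing LLM and the filtering LLM, and the acceptance rule (keep a problem if any sampled answer matches the composer's answer by string equality) is an assumed simplification of rejection sampling.

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, reference answer)


def iqc(
    seeds: List[QA],
    compose: Callable[[QA], QA],        # hypothetical: LLM that writes a new problem from a seed
    solve: Callable[[str], List[str]],  # hypothetical: second LLM that samples candidate answers
    iterations: int = 2,
) -> List[QA]:
    """Return every accepted (question, answer) pair produced across all iterations."""
    accepted: List[QA] = []
    current = list(seeds)
    for _ in range(iterations):
        next_round: List[QA] = []
        for seed in current:
            question, answer = compose(seed)
            # Rejection sampling (assumed rule): keep the composed problem only
            # if at least one sampled answer agrees with the composer's answer.
            if any(s.strip() == answer.strip() for s in solve(question)):
                next_round.append((question, answer))
        accepted.extend(next_round)
        current = next_round  # accepted problems become seeds for the next iteration
    return accepted
```

Feeding each round's accepted problems back in as seeds is what makes the procedure iterative; a real pipeline would also need answer normalization that is more robust than string equality.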
Key Findings
Fine-tuning on MMIQC raises MATH accuracy for models of multiple sizes.
IQC augmentation yields a measurable boost from a small synthetic set.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 45.0% | Qwen-72B 35.2% | +9.8 pp vs unfine-tuned Qwen-72B; +3.3 pp vs Qwen-72B-MetaMathQA | MATH (5000 test problems, zero-shot) | Qwen-72B-MMIQC scored 45.0% on MATH; table compares baselines | Table 3 |
| Accuracy | 41.0% | DeepSeek-67B 18.7% | +22.3 pp vs unfine-tuned DeepSeek-67B; +4.2 pp vs DeepSeek-67B-MetaMathQA | MATH | DeepSeek-67B-MMIQC 41.0% reported in Table 3 | Table 3 |
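
The deltas in the table are absolute differences in accuracy, i.e. percentage points, which can differ sharply from relative gains. The short check below recomputes them from the table's own numbers. The MetaMathQA baselines it prints are derived by subtraction from the stated deltas, not quoted from the paper, and the function names are illustrative.

```python
def delta_pp(score: float, baseline: float) -> float:
    """Absolute gain in percentage points."""
    return round(score - baseline, 1)


def relative_gain_pct(score: float, baseline: float) -> float:
    """Gain expressed as a percentage of the baseline accuracy."""
    return round(100.0 * (score - baseline) / baseline, 1)


def implied_baseline(score: float, delta: float) -> float:
    """Baseline accuracy implied by a score and its stated absolute delta."""
    return round(score - delta, 1)


# (fine-tuned score, unfine-tuned baseline, stated delta vs the MetaMathQA variant)
rows = {
    "Qwen-72B": (45.0, 35.2, 3.3),
    "DeepSeek-67B": (41.0, 18.7, 4.2),
}

for name, (score, base, meta_delta) in rows.items():
    print(
        name,
        "absolute:", delta_pp(score, base), "pp;",
        "relative:", relative_gain_pct(score, base), "%;",
        "implied MetaMathQA baseline:", implied_baseline(score, meta_delta), "%",
    )
```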
What To Try In 7 Days
Fine-tune a lightweight open model on the released MMIQC subset and measure MATH-like accuracy.
Implement a 1–2 iteration IQC loop: use a strong LLM to compose questions and a second model to reject mismatches.
Add a filtered Math StackExchange dump (or equivalent) to training data and re-evaluate on held-out problems.
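
For the "measure MATH-like accuracy" step, a simple exact-match scorer is usually enough for a first pass. The sketch below assumes model outputs put the final answer in a LaTeX `\boxed{...}` expression, as MATH reference solutions do; the normalization is deliberately crude and the function names are illustrative, so treat the resulting number as a rough estimate rather than a match for the paper's evaluation.

```python
from typing import List, Optional

BOX = "\\boxed{"


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last boxed expression in text, or None."""
    start = text.rfind(BOX)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(BOX):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)  # matching close brace found
        out.append(ch)
    return None  # unbalanced braces


def normalize(answer: str) -> str:
    """Crude normalization: drop spaces and a trailing period."""
    return answer.replace(" ", "").rstrip(".")


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose boxed answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        got = last_boxed(pred)
        if got is not None and normalize(got) == normalize(ref):
            correct += 1
    return correct / len(references)
```

Exact string match undercounts equivalent forms (for example `0.5` versus `\frac{1}{2}`), so a symbolic equivalence check is worth adding before comparing against published numbers.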
Risks & Boundaries
Limitations
Synthetic generation depends on proprietary LLMs (GPT-4/GPT-3.5) for composing and filtering.
Generated answers can be incorrect; authors report ~85% correctness in a 100-sample check.
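
A 100-sample check leaves real uncertainty around the ~85% figure. As a rough illustration, assuming exactly 85 of 100 sampled answers were correct (the source gives only the approximate rate), a normal-approximation 95% interval can be computed as below. The approximation itself is a simplifying choice; a Wilson interval would be slightly different.

```python
import math
from typing import Tuple


def normal_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a binomial proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Assumed counts: 85 correct out of 100 checked samples.
low, high = normal_ci(85, 100)
print(f"95% CI for correctness: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 78% to 92%, so the true share of incorrect synthetic answers could be meaningfully higher or lower than 15%.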
When Not To Use
When you require provable symbolic correctness or tool-assisted execution (e.g., program interpreters).
If you cannot afford the compute to fine-tune medium-to-large models or pay for LLM generation costs.
Failure Modes
IQC can produce invalid or subtly incorrect augmented problems that survive rejection sampling.
Overfitting to the style of web answers, or to training examples whose intermediate steps overlap with test solutions.

