Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
If you need better math reasoning from open models, combine cleaned web QAs with focused synthetic augmentation (IQC) to get consistent, low-effort accuracy gains without external tools.
Summary TLDR
The authors build MMIQC, a mixed dataset of about 1.2M filtered Math Stack Exchange QA pairs and 200K+ synthetic question-answer pairs, and introduce Iterative Question Composing (IQC), an LLM-driven loop that composes new problems from seed problems and filters them by rejection sampling. Fine-tuning open models on MMIQC improves MATH benchmark accuracy across model sizes. Top result: Qwen-72B-MMIQC reaches 45.0% on MATH, outperforming prior open-source baselines and the GPT-4 number reported in 2023. The gains come from both the large Math Stack Exchange subset and the IQC augmentation method.
Problem Statement
Open-source base LLMs struggle with competition-level math problems. The paper asks: can mixing cleaned web math QAs with targeted synthetic augmentation, especially iterative composition (IQC), produce consistent, generalizable accuracy gains when used for supervised fine-tuning?
Main Contribution
Introduce IQC (Iterative Question Composing): repeatedly ask an LLM to compose new problems from seed problems, then filter the composed problems by rejection sampling with a second LLM.
Release MMIQC: a mixed dataset combining 1.2M Math Stack Exchange QAs and ~203K synthetic or filtered QA pairs from several augmentation methods.
Show consistent accuracy gains on the MATH benchmark across model sizes after fine-tuning on MMIQC, with top open-model result Qwen-72B-MMIQC at 45.0%.
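The IQC loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `iqc`, `compose`, `accept`, `max_tries`, and the toy stand-ins are hypothetical names, and a real setup would replace the two callables with LLM calls (a composer model and a second model used for rejection sampling).

```python
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, answer)


def iqc(
    seeds: List[QA],
    compose: Callable[[QA], QA],
    accept: Callable[[QA], bool],
    iterations: int = 2,
    max_tries: int = 3,
) -> List[List[QA]]:
    """Return one list of accepted QA pairs per iteration.

    compose: hypothetical callable that writes a new problem from a seed.
    accept:  hypothetical callable that keeps or rejects a candidate.
    """
    rounds: List[List[QA]] = []
    current = list(seeds)
    for _ in range(iterations):
        accepted: List[QA] = []
        for seed in current:
            # Re-sample up to max_tries times and keep the first
            # candidate that passes the filter.
            for _ in range(max_tries):
                candidate = compose(seed)
                if accept(candidate):
                    accepted.append(candidate)
                    break
        rounds.append(accepted)
        # The next iteration composes from this iteration's output.
        current = accepted
    return rounds


# Toy stand-ins so the sketch runs without any LLM (assumptions only).
def toy_compose(qa: QA) -> QA:
    question, answer = qa
    return (f"{question} Then add 1.", str(int(answer) + 1))


def toy_accept(qa: QA) -> bool:
    return int(qa[1]) < 6


toy_rounds = iqc([("What is 1+1?", "2"), ("What is 2+2?", "4")],
                 toy_compose, toy_accept)
```

Returning the accepted pairs per iteration makes it easy to ablate by iteration count, since each round's output can be added to the training mix separately.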
Key Findings
Fine-tuning on MMIQC raises MATH accuracy for models of multiple sizes.
IQC augmentation yields a measurable boost from a small synthetic set.
Reusing high-quality web math QAs drives the largest single jump.
The MMIQC synthetic data carries a low risk of test-set memorization.
Results
Accuracy
Ablation incremental gain
Who Should Care
What To Try In 7 Days
Fine-tune a lightweight open model on the released MMIQC subset and measure its accuracy on MATH-style problems.
Implement a 1–2 iteration IQC loop: use a strong LLM to compose questions and a second model to reject mismatches.
Add a filtered Math Stack Exchange dump (or equivalent) to training data and re-evaluate on held-out problems.
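For the accuracy measurements in the steps above, a simple scorer might look like the sketch below. It assumes model outputs mark the final answer with `\boxed{...}`, as MATH reference solutions do, and it uses plain string equality. That is stricter than the equivalence checks in full evaluators: `1/2` and `\frac{1}{2}` would count as different. The function names `last_boxed` and `accuracy` are illustrative.

```python
from typing import List, Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose boxed answer equals the reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        last_boxed(pred) == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)
```

Running the same scorer on the base model and on the fine-tuned model, over the same held-out problems, gives a like-for-like comparison even though the absolute numbers will be lower than with an equivalence-aware grader.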
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic generation depends on proprietary LLMs (GPT-4/GPT-3.5) for composing and filtering.
- Generated answers can be incorrect; authors report ~85% correctness in a 100-sample check.
- Evaluation excludes verifier/code-execution methods, so comparisons to tool-using systems are not apples-to-apples.
- Fine-tuning cost and compute requirements are nontrivial for larger models (multi-node setups reported).
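To put the ~85% correctness figure in context, a 100-sample check leaves noticeable sampling uncertainty. The sketch below computes a 95% Wilson score interval, assuming the check found exactly 85 correct answers out of 100 (the source gives only the approximate rate). Under that assumption the interval spans roughly 77% to 91%. The helper name `wilson_interval` is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 85 correct out of 100 sampled answers.
low, high = wilson_interval(85, 100)
```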
When Not To Use
- When you require provable symbolic correctness or tool-assisted execution (e.g., program interpreters).
- If you cannot afford the compute to fine-tune medium-to-large models or pay for LLM generation costs.
- If legal/data-privacy rules prohibit using web-extracted content.
Failure Modes
- IQC can produce invalid or subtly incorrect augmented problems that survive rejection sampling.
- Overfitting to web-answer style or intermediate-step collisions that mimic test solutions.
- Reliance on generation LLM quality: weaker composer or verifier will lower augmentation value.
Core Entities
Models
- Qwen-72B-MMIQC
- Qwen-72B
- DeepSeek-67B-MMIQC
- DeepSeek-67B
- Llemma-34B-MMIQC
- Llemma-34B
- Mistral-7B-MMIQC
- Mistral-7B
Metrics
- Accuracy
- grade (Hungarian finals, max 117)
Datasets
- MMIQC
- MetaMathQA
- MetaMathQA (subset)
- MATH
- GSM8K
- MathStackExchange
- MetaMathQA (filtered subset)
Benchmarks
- MATH
- 2023 Hungarian National High School Finals (math)

