IQC + MMIQC: generate diverse math word problems to raise open LLM math accuracy

January 17, 2024 · 6 min

Overview

Decision Snapshot: Ready For Pilot

The paper gives consistent benchmark gains and ablations, but relies on proprietary LLMs for data generation and reports only zero-shot evals without verifier-based inference pipelines.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need better math reasoning from open models, combine cleaned web QAs with focused synthetic augmentation (IQC) to get consistent, low-effort accuracy gains without external tools.

Who Should Care

Summary TLDR

The authors build MMIQC, a mixed dataset of 1.2M filtered Math Stack Exchange QAs and 200K+ synthetic question-answer pairs, and introduce Iterative Question Composing (IQC), an LLM-driven loop that composes new problems from seed problems and filters them by rejection sampling. Fine-tuning open models on MMIQC improves MATH benchmark accuracy across model sizes. Top result: Qwen-72B-MMIQC reaches 45.0% on MATH, outperforming prior open-source baselines and the GPT-4 number reported in 2023. The gains come from both the large Math Stack Exchange subset and the IQC augmentation method.

Problem Statement

Open-source base LLMs struggle with competition-level math problems. The paper asks: can mixing cleaned web math QAs with targeted synthetic augmentation, especially iterative composition (IQC), produce small but generalizable accuracy gains when used for supervised fine-tuning?

Main Contribution

Introduce IQC (Iterative Question Composing): iteratively ask an LLM to compose new problems from seed problems, then filter the results by rejection sampling with a second LLM.

Release MMIQC: a mixed dataset combining 1.2M Math Stack Exchange QAs and ~203K synthetic or filtered QA pairs from several augmentation methods.
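
The IQC loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `compose` and `sample` callables, the exact-match acceptance rule, and the toy stubs are all assumptions standing in for real LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper's code):
#   compose(question, answer) -> (new_question, new_answer) from a composer LLM
#   sample(question)          -> candidate final answers from a second LLM
Composer = Callable[[str, str], Tuple[str, str]]
Sampler = Callable[[str], List[str]]


def iqc(seeds: List[Tuple[str, str]], compose: Composer, sample: Sampler,
        iterations: int = 2) -> List[Tuple[str, str]]:
    """Return the (question, answer) pairs kept across all composition rounds."""
    kept: List[Tuple[str, str]] = []
    current = list(seeds)
    for _ in range(iterations):
        survivors = []
        for question, answer in current:
            new_q, new_a = compose(question, answer)
            # Rejection sampling (assumed rule): keep the pair only if at
            # least one sampled answer agrees with the composer's answer.
            if any(c.strip() == new_a.strip() for c in sample(new_q)):
                survivors.append((new_q, new_a))
        kept.extend(survivors)
        current = survivors  # survivors seed the next iteration
    return kept


# Toy stubs so the sketch runs without any LLM.
def toy_compose(question: str, answer: str) -> Tuple[str, str]:
    return f"{question} Then add 1.", str(int(answer) + 1)


def toy_sample(question: str) -> List[str]:
    return [str(2 + question.count("Then add 1."))]


pairs = iqc([("What is 1+1?", "2")], toy_compose, toy_sample, iterations=2)
```

Feeding each round's survivors back in as seeds is what makes the procedure iterative; a stricter or looser acceptance rule would trade dataset size against label noise.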

Key Findings

Fine-tuning on MMIQC raises MATH accuracy for models of multiple sizes.

Numbers: Qwen-72B-MMIQC 45.0% (MATH); Qwen-72B baseline 35.2%

Practical Use: Fine-tune your open model on MMIQC to get several-point absolute gains on competition math tasks.

Evidence Ref: Table 3

IQC augmentation yields a measurable boost from a small synthetic set.

Numbers: Adding IQC (55.1K samples) produced +3.1% absolute on MATH in ablation

Practical Use: Use IQC-style iterative composition to cheaply create a small, high-impact augmentation set when data budget is limited.

Evidence Ref: Table 4; ablation text

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 45.0% | Qwen-72B 35.2% | +9.8% absolute vs unfine-tuned Qwen-72B; +3.3% vs Qwen-72B-MetaMathQA | MATH (5000 test problems, zero-shot) | Qwen-72B-MMIQC scored 45.0% on MATH; table compares baselines | Table 3 |
| Accuracy | 41.0% | DeepSeek-67B 18.7% | +22.3% absolute vs unfine-tuned; +4.2% vs DeepSeek-67B-MetaMathQA | MATH | DeepSeek-67B-MMIQC 41.0% reported in Table 3 | Table 3 |
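
The deltas in the results above are absolute differences in percentage points, not relative improvements. A small sketch of the arithmetic, using only the accuracies quoted in the table (the helper names are illustrative):

```python
# Accuracies (%) quoted in the results table; no new data is introduced.
results = {
    "Qwen-72B": {"base": 35.2, "mmiqc": 45.0},
    "DeepSeek-67B": {"base": 18.7, "mmiqc": 41.0},
}


def point_gain(base: float, tuned: float) -> float:
    """Absolute gain in percentage points."""
    return round(tuned - base, 1)


def relative_gain(base: float, tuned: float) -> float:
    """Relative gain as a percentage of the baseline accuracy."""
    return round(100 * (tuned - base) / base, 1)


for name, acc in results.items():
    print(name, point_gain(acc["base"], acc["mmiqc"]),
          relative_gain(acc["base"], acc["mmiqc"]))
```

The relative figure is much larger for DeepSeek-67B because its unfine-tuned baseline is lower, which is worth keeping in mind when comparing gains across models.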

What To Try In 7 Days

Fine-tune a lightweight open model on the released MMIQC subset and measure MATH-like accuracy.

Implement a 1–2 iteration IQC loop: use a strong LLM to compose questions and a second model to reject mismatches.

Add a filtered Math StackExchange dump (or equivalent) to training data and re-evaluate on held-out problems.
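
For the evaluation steps above, a minimal sketch of zero-shot accuracy scoring might look like this. It assumes model outputs end with a `\boxed{...}` final answer, as MATH reference solutions do, and uses whitespace-insensitive exact match; real MATH evaluation scripts usually apply more elaborate answer-equivalence checks, so treat this as a rough lower bound. All function names are illustrative.

```python
from typing import List, Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close brace of \boxed{
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def normalize(answer: str) -> str:
    """Remove all whitespace so '1 / 2' and '1/2' compare equal."""
    return "".join(answer.split())


def accuracy(outputs: List[str], references: List[str]) -> float:
    """Fraction of outputs whose boxed answer matches the reference answer."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for out, ref in zip(outputs, references):
        pred = last_boxed(out)
        if pred is not None and normalize(pred) == normalize(ref):
            correct += 1
    return correct / len(references)
```

Running the same scorer before and after fine-tuning on a held-out set gives a like-for-like delta, even if the absolute numbers differ from published ones.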

Risks & Boundaries

Limitations

Synthetic generation depends on proprietary LLMs (GPT-4/GPT-3.5) for composing and filtering.

Generated answers can be incorrect; authors report ~85% correctness in a 100-sample check.
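
A 100-sample check leaves real uncertainty about the true correctness rate. As a rough illustration, the sketch below computes a 95% Wilson score interval, assuming the "~85%" figure corresponds to 85 correct answers out of 100 (an assumption; the source gives only the approximate rate).

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 85 correct out of 100 sampled answers.
low, high = wilson_interval(85, 100)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval spans roughly 0.77 to 0.91, so the share of incorrect synthetic answers could plausibly be anywhere from about one in ten to nearly one in four.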

When Not To Use

When you require provable symbolic correctness or tool-assisted execution (e.g., program interpreters).

If you cannot afford the compute to fine-tune medium-to-large models or pay for LLM generation costs.

Failure Modes

IQC can produce invalid or subtly incorrect augmented problems that survive rejection sampling.

Overfitting to web-answer style or intermediate-step collisions that mimic test solutions.

Core Entities

Models

Qwen-72B-MMIQC, Qwen-72B, DeepSeek-67B-MMIQC, DeepSeek-67B, Llemma-34B-MMIQC, Llemma-34B, Mistral-7B-MMIQC, Mistral-7B

Metrics

Accuracy, grade (Hungarian finals, max 117)

Datasets

MMIQC, MetaMathQA, MetaMathQA (subset), MATH, GSM8K, MathStackExchange, MetaMathQA (filtered subset)

Benchmarks

MATH, 2023 Hungarian National High School Finals (math)