IQC + MMIQC: generate diverse math word problems to raise open LLM math accuracy

Overview

Decision Snapshot: Ready For Pilot

The paper reports consistent benchmark gains with supporting ablations, but it relies on proprietary LLMs for data generation and reports only zero-shot evaluations, with no verifier-based inference pipeline.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao

Why It Matters For Business

If you need better math reasoning from open models, combine cleaned web QAs with focused synthetic augmentation (IQC) to get consistent, low-effort accuracy gains without external tools.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO, Founder

Summary TLDR

The authors build MMIQC, a mixed dataset of about 1.2M filtered Math Stack Exchange question-answer pairs and 200K+ synthetic question-answer pairs, and introduce Iterative Question Composing (IQC), an LLM-driven loop that composes new problems from seed problems and filters them by rejection sampling. Fine-tuning open models on MMIQC improves MATH benchmark accuracy across model sizes. Top result: Qwen-72B-MMIQC reaches 45.0% on MATH, outperforming prior open-source baselines and the score reported for the 2023 GPT-4 release. The gains come from both the large Math Stack Exchange subset and the IQC augmentation.

Problem Statement

Open-source base LLMs struggle with competition-level math problems. The paper asks: can mixing cleaned web math QAs with targeted synthetic augmentation, especially iterative composition (IQC), produce generalizable accuracy gains when used for supervised fine-tuning?

Main Contribution

Introduce IQC (Iterative Question Composing): repeatedly ask an LLM to compose new problems from seed problems, then filter the composed problems by rejection sampling with a second LLM.

Release MMIQC: a mixed dataset combining 1.2M Math Stack Exchange QAs and ~203K synthetic or filtered QA pairs from several augmentation methods.
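The IQC loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `iqc`, `compose`, `solve`, and `samples_per_question` are hypothetical names, the two toy functions stand in for real LLM calls, and the acceptance rule (keep a composed problem if at least one sampled solution matches its answer) is an assumption about how the rejection step could work.

```python
import re
from typing import Callable, List, Tuple

QA = Tuple[str, str]  # (question, final answer)


def iqc(
    seeds: List[QA],
    compose: Callable[[str, str], QA],
    solve: Callable[[str], str],
    iterations: int = 2,
    samples_per_question: int = 3,
) -> List[QA]:
    """Compose new problems from the previous round and keep only those
    whose answer is reproduced by an independent solver."""
    current = list(seeds)
    collected: List[QA] = []
    for _ in range(iterations):
        accepted: List[QA] = []
        for question, answer in current:
            new_q, new_a = compose(question, answer)
            # Rejection step (assumed rule): sample several solutions and
            # accept the pair if any of them matches the composed answer.
            attempts = [solve(new_q) for _ in range(samples_per_question)]
            if any(a.strip() == new_a.strip() for a in attempts):
                accepted.append((new_q, new_a))
        collected.extend(accepted)
        current = accepted  # accepted problems seed the next iteration
    return collected


# Toy stand-ins for the two LLMs (hypothetical, for illustration only).
def toy_compose(question: str, answer: str) -> QA:
    new_q = f"{question} + 4"
    new_a = int(answer) + 4
    if int(answer) % 2 == 0:
        new_a += 1  # simulate a composing error that the filter should catch
    return new_q, str(new_a)


def toy_solve(question: str) -> str:
    # "Solves" a toy problem by summing every integer in the question.
    return str(sum(int(x) for x in re.findall(r"\d+", question)))


seeds = [("Compute 2 + 3", "5"), ("Compute 1 + 1", "2")]
collected = iqc(seeds, toy_compose, toy_solve, iterations=2)
for q, a in collected:
    print(q, "->", a)
```

In this toy run the second seed yields a composed problem with a wrong answer, so it is rejected and does not seed the next iteration. In a real pipeline the two callables would wrap API calls to the composing and filtering models.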

Key Findings

Fine-tuning on MMIQC raises MATH accuracy for models of multiple sizes.

Numbers: Qwen-72B-MMIQC 45.0% (MATH); Qwen-72B baseline 35.2%

Practical Use: Fine-tune your open model on MMIQC to get several-point absolute gains on competition math tasks.

Evidence Ref: Table 3

IQC augmentation yields a measurable boost from a small synthetic set.

Numbers: Adding IQC (55.1K samples) produced +3.1% absolute on MATH in ablation

Practical Use: Use IQC-style iterative composition to cheaply create a small, high-impact augmentation set when data budget is limited.

Evidence Ref: Table 4; ablation text

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	45.0%	Qwen-72B 35.2%	+9.8% absolute vs unfine-tuned Qwen-72B; +3.3% vs Qwen-72B-MetaMathQA	MATH (5000 test problems, zero-shot)	Qwen-72B-MMIQC scored 45.0% on MATH; table compares baselines	Table 3
Accuracy	41.0%	DeepSeek-67B 18.7%	+22.3% absolute vs unfine-tuned; +4.2% vs DeepSeek-67B-MetaMathQA	MATH	DeepSeek-67B-MMIQC 41.0% reported in Table 3	Table 3
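The deltas in the table are absolute differences in percentage points, and they can be checked with simple arithmetic. The sketch below recomputes the gain over each unfine-tuned base model and derives the MetaMathQA-tuned score implied by the second delta. Those implied scores are back-calculated from the rows above, not quoted from the paper, and the dictionary keys are illustrative names.

```python
# Scores and deltas copied from the results table (MATH accuracy, in %).
results = {
    "Qwen-72B": {"mmiqc": 45.0, "base": 35.2, "gain_vs_metamathqa": 3.3},
    "DeepSeek-67B": {"mmiqc": 41.0, "base": 18.7, "gain_vs_metamathqa": 4.2},
}


def summarize(row: dict) -> tuple:
    """Return (gain over base, implied MetaMathQA-tuned score), in points."""
    gain_vs_base = round(row["mmiqc"] - row["base"], 1)
    implied_metamathqa = round(row["mmiqc"] - row["gain_vs_metamathqa"], 1)
    return gain_vs_base, implied_metamathqa


for model, row in results.items():
    gain, implied = summarize(row)
    print(f"{model}: +{gain} points vs base; implied MetaMathQA score {implied}%")
```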

What To Try In 7 Days

Fine-tune a lightweight open model on the released MMIQC subset and measure MATH-like accuracy.

Implement a 1–2 iteration IQC loop: use a strong LLM to compose questions and a second model to reject mismatches.

Add a filtered Math StackExchange dump (or equivalent) to training data and re-evaluate on held-out problems.
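For the "measure MATH-like accuracy" step, a rough scoring function is sketched below. It assumes model outputs put the final answer in a LaTeX `\boxed{...}` expression, as MATH reference solutions do, and it uses plain string equality. `last_boxed` and `accuracy` are hypothetical helper names, and strict string matching will undercount answers that are mathematically equivalent but written differently, so treat the result as a lower bound.

```python
from typing import List, Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, honoring
    nested braces, or None if there is no complete boxed expression."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def accuracy(outputs: List[str], references: List[str]) -> float:
    """Fraction of outputs whose boxed answer equals the reference string."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    correct = sum(
        last_boxed(out) == ref.strip() for out, ref in zip(outputs, references)
    )
    return correct / len(outputs)


outputs = [
    "Adding gives \\boxed{4}.",
    "So the probability is \\boxed{\\frac{1}{2}}.",
    "I am not sure.",
]
references = ["4", "\\frac{1}{2}", "7"]
print(f"accuracy = {accuracy(outputs, references):.3f}")
```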

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/iiis-ai/IterativeQuestionComposing

https://huggingface.co/datasets/Vivacem/MMIQC

Data URLs

https://huggingface.co/datasets/Vivacem/MMIQC

Risks & Boundaries

Limitations

Synthetic generation depends on proprietary LLMs (GPT-4/GPT-3.5) for composing and filtering.

Generated answers can be incorrect; authors report ~85% correctness in a 100-sample check.
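A 100-sample check leaves noticeable uncertainty around the ~85% figure. The sketch below computes a Wilson score interval for 85 correct out of 100. It treats the approximate 85% as exactly 85 successes and assumes the samples were drawn at random, neither of which the summary confirms. Under those assumptions the 95% interval is roughly 77% to 91%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(85, 100)
print(f"95% interval for answer correctness: {low:.3f} to {high:.3f}")
```

If the true correctness rate sits near the lower end, a meaningful share of the synthetic pairs carry wrong answers, which is worth keeping in mind when weighting this data against verified sources.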

When Not To Use

When you require provable symbolic correctness or tool-assisted execution (e.g., program interpreters).

If you cannot afford the compute to fine-tune medium-to-large models or pay for LLM generation costs.

Failure Modes

IQC can produce invalid or subtly incorrect augmented problems that survive rejection sampling.

Overfitting to web-answer style or intermediate-step collisions that mimic test solutions.

Core Entities

Models

Qwen-72B-MMIQC, Qwen-72B, DeepSeek-67B-MMIQC, DeepSeek-67B, Llemma-34B-MMIQC, Llemma-34B, Mistral-7B-MMIQC, Mistral-7B

Metrics

Accuracy, grade (Hungarian finals, max 117)

Datasets

MMIQC, MetaMathQA, MetaMathQA (subset), MATH, GSM8K, MathStackExchange, MetaMathQA (filtered subset)

Benchmarks

MATH, 2023 Hungarian National High School Finals (math)
