IQC + MMIQC: generate diverse math word problems to raise open LLM math accuracy

January 17, 2024 · 6 min read

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao

Links

Abstract / PDF

Why It Matters For Business

If you need better math reasoning from open models, combine cleaned web QAs with focused synthetic augmentation (IQC) to get consistent, low-effort accuracy gains without external tools.

Summary TLDR

The authors build MMIQC, a mixed dataset of about 1.2M filtered Math Stack Exchange QA pairs and 200K+ synthetic question-answer pairs, and introduce Iterative Question Composing (IQC), an LLM-driven loop that composes new problems from seed problems and filters them by rejection sampling. Fine-tuning open models on MMIQC improves MATH benchmark accuracy across model sizes. Top result: Qwen-72B-MMIQC reaches 45.0% on MATH, outperforming prior open-source baselines and the GPT-4 number reported in 2023. The gains come from both the large Math Stack Exchange subset and the IQC augmentation method.

Problem Statement

Open-source base LLMs struggle with competition-level math problems. The paper asks: can mixing cleaned web math QAs with targeted synthetic augmentation, especially iterative composition (IQC), produce consistent, generalizable accuracy gains when used for supervised fine-tuning?

Main Contribution

Introduce IQC (Iterative Question Composing): iteratively ask an LLM to compose new problems from seed problems and filter by another LLM (rejection sampling).
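The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `compose` and `solve` are hypothetical stand-ins for the composer LLM and the filtering LLM, the acceptance rule (keep a candidate if any sampled answer matches the composer's answer) is an assumed simplification of rejection sampling, and the toy functions at the bottom exist only so the sketch runs without any model.

```python
from typing import Callable, List, NamedTuple


class QA(NamedTuple):
    question: str
    answer: str


def iqc(
    seeds: List[QA],
    compose: Callable[[QA], QA],
    solve: Callable[[str], str],
    iterations: int = 2,
    samples_per_question: int = 3,
) -> List[QA]:
    """Run a simplified Iterative Question Composing loop.

    compose: hypothetical wrapper around the composer LLM; maps a QA pair
             to a new, related QA pair.
    solve:   hypothetical wrapper around the filtering LLM; returns a final
             answer string for a question.
    """
    accepted: List[QA] = []
    current = list(seeds)
    for _ in range(iterations):
        next_round: List[QA] = []
        for qa in current:
            candidate = compose(qa)
            # Rejection sampling (assumed rule): keep the candidate only if
            # at least one sampled answer agrees with the composer's answer.
            sampled = [solve(candidate.question) for _ in range(samples_per_question)]
            if any(s.strip() == candidate.answer.strip() for s in sampled):
                next_round.append(candidate)
        accepted.extend(next_round)
        current = next_round  # accepted problems seed the next iteration
    return accepted


# Toy stand-ins so the sketch runs without an LLM (purely illustrative).
def toy_compose(qa: QA) -> QA:
    return QA(qa.question + "+1", str(int(qa.answer) + 1))


def toy_solve(question: str) -> str:
    return str(sum(int(term) for term in question.split("+")))


print(len(iqc([QA("1", "1"), QA("2", "2")], toy_compose, toy_solve)))  # 4
```

Feeding each round's accepted problems back in as seeds is what makes the procedure iterative; a weaker `solve` model rejects more candidates, so fewer problems reach later rounds.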

Release MMIQC: a mixed dataset combining 1.2M Math Stack Exchange QAs and ~203K synthetic or filtered QA pairs from several augmentation methods.

Show consistent accuracy gains on the MATH benchmark across model sizes after fine-tuning on MMIQC, with top open-model result Qwen-72B-MMIQC at 45.0%.

Key Findings

Fine-tuning on MMIQC raises MATH accuracy for models of multiple sizes.

Numbers: Qwen-72B-MMIQC 45.0% (MATH); Qwen-72B baseline 35.2%

IQC augmentation yields a measurable boost from a small synthetic set.

Numbers: Adding IQC (55.1K samples) produced +3.1% absolute on MATH in ablation

Reusing high-quality web math QAs drives the largest single jump.

Numbers: Adding Math Stack Exchange (+1.2M samples) gave +7.8% absolute on MATH

Low risk of test-set memorization by MMIQC synthetic data.

Numbers: 30-gram match: 44 hits between MMIQC synthetic data and the MATH test set, versus 168 hits between MATH test and MATH train; 43 of the 44 were intermediate-step collisions
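A contamination check of this kind can be approximated with a simple n-gram overlap count. The sketch below is an assumption-laden illustration: the summary does not say how text was tokenized or what exactly counts as a hit, so here tokens are whitespace-separated words and a hit is a test item that shares at least one n-gram with any training item. The function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams over whitespace tokens (assumed tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_hits(train_texts: Iterable[str], test_texts: Iterable[str], n: int = 30) -> int:
    """Count test items sharing at least one n-gram with any training item."""
    train_grams: Set[Tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    # Assumed definition of a hit: any shared n-gram flags the test item.
    return sum(1 for text in test_texts if ngrams(text, n) & train_grams)
```

Each flagged item still needs manual inspection, since a long shared span may be a common intermediate step rather than a copied problem.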

Results

Accuracy: 45.0% (baseline: Qwen-72B, 35.2%)

Accuracy: 41.0% (baseline: DeepSeek-67B, 18.7%)

Accuracy: 38.6% (baseline: Llemma-34B, 34.8%)

Ablation incremental gain: +7.8% absolute (baseline: after adding MetaMathQA subset and augmentations)

What To Try In 7 Days

Fine-tune a lightweight open model on the released MMIQC subset and measure MATH-like accuracy.

Implement a 1–2 iteration IQC loop: use a strong LLM to compose questions and a second model to reject mismatches.

Add a filtered Math Stack Exchange dump (or equivalent) to the training data and re-evaluate on held-out problems.
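For the evaluation steps above, a small scoring helper is enough to get started. This sketch assumes model outputs follow the MATH convention of putting the final answer in `\boxed{...}` and scores by exact string match, which is stricter than graders that normalize equivalent forms. The names `last_boxed` and `accuracy` are hypothetical helpers, not part of any released evaluation code.

```python
from typing import List, Optional


def last_boxed(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces stay intact.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose boxed answer exactly matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if last_boxed(pred) == ref.strip()
    )
    return correct / len(references)
```

Running the same helper on the base model and the fine-tuned model over an identical held-out set gives a like-for-like before and after comparison.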

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic generation depends on proprietary LLMs (GPT-4/GPT-3.5) for composing and filtering.
  • Generated answers can be incorrect; authors report ~85% correctness in a 100-sample check.
  • Evaluation excludes verifier/code-execution methods, so comparisons to tool-using systems are not apples-to-apples.
  • Fine-tuning cost and compute requirements are nontrivial for larger models (multi-node setups reported).

When Not To Use

  • When you require provable symbolic correctness or tool-assisted execution (e.g., program interpreters).
  • If you cannot afford the compute to fine-tune medium-to-large models or pay for LLM generation costs.
  • If legal/data-privacy rules prohibit using web-extracted content.

Failure Modes

  • IQC can produce invalid or subtly incorrect augmented problems that survive rejection sampling.
  • Overfitting to web-answer style or intermediate-step collisions that mimic test solutions.
  • Reliance on generation LLM quality: weaker composer or verifier will lower augmentation value.

Core Entities

Models

  • Qwen-72B-MMIQC
  • Qwen-72B
  • DeepSeek-67B-MMIQC
  • DeepSeek-67B
  • Llemma-34B-MMIQC
  • Llemma-34B
  • Mistral-7B-MMIQC
  • Mistral-7B

Metrics

  • Accuracy
  • grade (Hungarian finals, max 117)

Datasets

  • MMIQC
  • MetaMathQA
  • MetaMathQA (subset)
  • MATH
  • GSM8K
  • MathStackExchange
  • MetaMathQA (filtered subset)

Benchmarks

  • MATH
  • 2023 Hungarian National High School Finals (math)