Systematic benchmark shows small models can reason if trained and compressed carefully

February 17, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.45

Cost Impact Score

0.65

Citation Count

1

Authors

Gaurav Srivastava, Shuxiang Cao, Xuan Wang

Links

Abstract / PDF

Why It Matters For Business

You can deploy cost-efficient models for math and stepwise reasoning by choosing strong training recipes and quantizing large models rather than simply scaling parameters.

Summary TLDR

This paper builds THINKSLM, a large benchmark and leaderboard that evaluates reasoning in 72 small language models (SLMs) across 17 tasks. It validates LLM-as-a-judge (GPT-4-turbo/GPT-4o) against humans, then shows that training recipe and data quality matter more than raw parameter count. Quantization usually preserves reasoning, pruning often breaks it, and some well-trained SLMs (notably Qwen2.5 variants) approach or match much larger models on math and intermediate reasoning.

Problem Statement

Reasoning is often treated as an emergent property of very large models. We lack a systematic, reproducible evaluation that measures whether small models and compressed variants (quantized, pruned, distilled) can reach strong reasoning performance and remain robust under stressors.

Main Contribution

THINKSLM benchmark and public leaderboard evaluating 72 SLM variants across 17 reasoning tasks.

Empirical validation that GPT-4-turbo/GPT-4o align closely with human judgments and can be used as primary evaluators.

Large-scale study of compression: quantization preserves reasoning broadly, pruning often causes severe failures, distillation helps but is mixed.

Actionable insights showing training data, instruction tuning, and distillation can beat simple parameter scaling for reasoning.

Key Findings

GPT-4-turbo agrees with human judgments nearly perfectly on reasoning evaluation

Numbers: agreement ρ = 0.99 (95% CI [0.98, 1.00])

Some SLMs reach high math accuracy when well trained

Numbers: Qwen2.5-14B ≈ 94.29% on GSM8K (Direct I/O)

Quantization preserves reasoning quality while sharply reducing memory

Numbers: 4-bit GPTQ on Qwen2.5-14B reduces GSM8K accuracy by <1 percentage point while cutting memory by ≈80%

Pruning often catastrophically damages reasoning

Numbers: Pruned Llama-8B lost ≈32 percentage points (absolute) on GSM8K; some pruned variants scored 0 on ARC-Challenge

Intermediate/meta-reasoning can be matched by open SLMs

Numbers: Qwen2.5-32B MR-Score = 55.6 vs GPT-4-turbo ≈ 53.0 on MR-GSM8K

Prompt complexity gives small or negative gains for modern SLMs

Numbers: Chain-of-Thought adds ~2% on GSM8K for post-2024 models; sometimes hurts accuracy
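The quantization finding can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates weight storage only, assuming a nominal 14 billion parameters and ignoring activations, KV cache, and quantization metadata. The function names are illustrative and do not come from the paper. Because it counts weights only, its result is not expected to match the measured ≈80% figure exactly.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1e9 bytes).

    Assumption: counts weights only; activations, KV cache, and
    quantization scales/zero-points are ignored.
    """
    return num_params * bits_per_weight / 8 / 1e9


def memory_reduction(bits_from: float, bits_to: float) -> float:
    """Fraction of weight memory saved when moving between bit widths."""
    return 1 - bits_to / bits_from


# Nominal parameter count for a "14B" model (an assumption for illustration).
NOMINAL_PARAMS = 14e9

fp16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)
int4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)
saved = memory_reduction(16, 4)

print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"Weight-only reduction: {saved:.0%}")
```

Going from 16-bit to 4-bit weights saves 75% of weight storage under these assumptions. Measured GPU memory depends on the runtime and on what is included in the measurement.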

Results

LLM-as-judge agreement with humans

Value: GPT-4-turbo ρ = 0.99; GPT-4o ρ = 0.98

Baseline: human eval (ρ = 1.00)

Accuracy

Value: Qwen2.5-14B: 94.29% (GSM8K, Direct I/O)

Baseline: many SLMs range from 30% to 90%

Quantization impact (example)

Value: Qwen2.5-14B GPTQ-4bit: <1 percentage point drop on GSM8K

Baseline: full-precision 14B

Pruning impact (example)

Value: Pruned Llama-8B: ≈32 percentage point (absolute) drop on GSM8K; some pruned variants fail on ARC-Challenge

Baseline: unpruned Llama-8B

Intermediate reasoning (MR-Score)

Value: Qwen2.5-32B MR-Score = 55.6; GPT-4-turbo ≈ 53.0

Baseline: smaller models (e.g., Mistral-7B) MR-Score ≈ 4
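The compression results are reported as absolute drops in accuracy, which are easy to confuse with relative drops. The sketch below shows both calculations. The helper name and the example accuracies (80.0 and 48.0) are hypothetical and chosen only for illustration; they are not figures from the paper.

```python
def accuracy_drop(baseline_pct: float, variant_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Both inputs are accuracies on a 0-100 scale.
    """
    if baseline_pct <= 0:
        raise ValueError("baseline accuracy must be positive")
    absolute = baseline_pct - variant_pct
    relative = absolute / baseline_pct * 100
    return absolute, relative


# Hypothetical accuracies, for illustration only.
abs_drop, rel_drop = accuracy_drop(baseline_pct=80.0, variant_pct=48.0)
print(f"absolute drop: {abs_drop:.1f} points, relative drop: {rel_drop:.1f}%")
```

When comparing a compressed variant with its baseline, stating which of the two numbers is being quoted avoids ambiguity.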

Who Should Care

What To Try In 7 Days

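Run GPT-4o or GPT-4-turbo as a judge on a 1,000-sample audit set and compare its verdicts with human labels to validate internal grading and catch judge biases.

Quantize a top-performing pre-trained model (e.g., with GPTQ at 8-bit or 4-bit) and compare accuracy on GSM8K and your key tasks against the full-precision baseline.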

Avoid aggressive pruning for reasoning workloads; if needed, run a targeted recovery fine-tune on diverse reasoning data (GSM8K + MR-GSM8K).
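For the judge audit, one way to summarize the comparison is with the agreement rate, true positive rate, and true negative rate between human labels and judge verdicts. This is a minimal sketch, assuming each audited answer has a boolean human label (correct or not) and a boolean judge verdict. `AuditResult` and `audit_judge` are hypothetical names, and this is not the paper's evaluation code.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    agreement: float  # fraction of samples where judge and human agree
    tpr: float        # judge says "correct" when the human says "correct"
    tnr: float        # judge says "incorrect" when the human says "incorrect"


def audit_judge(human: list[bool], judge: list[bool]) -> AuditResult:
    """Compare judge verdicts against human labels on the same samples."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    positives = sum(1 for h in human if h)
    negatives = len(human) - positives

    return AuditResult(
        agreement=(tp + tn) / len(human),
        # NaN signals that the rate is undefined for this audit set.
        tpr=tp / positives if positives else float("nan"),
        tnr=tn / negatives if negatives else float("nan"),
    )


# Toy labels for illustration only.
result = audit_judge(
    human=[True, True, True, False, False],
    judge=[True, True, False, False, True],
)
print(result)
```

A low true negative rate would suggest the judge accepts wrong answers too readily, which is worth checking before trusting it as the primary grader.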

Optimization Features

Token Efficiency

  • Accuracy

Infra Optimization

  • Measured GPU memory on A100/H100; total compute ≈24k GPU hours
  • Use the disk and GPU memory trade-offs described in the paper's resource tables

Model Optimization

  • Post-training quantization (GPTQ, INT8/INT4, FP8)
  • Knowledge distillation with teacher outputs
  • Avoid aggressive unstructured/structured pruning for reasoning models

System Optimization

  • vLLM and Hugging Face Accelerate for multi-GPU inference
  • Automatic GPU allocation depending on model size

Training Optimization

  • Instruction tuning improves reasoning (e.g., Qwen2.5 instruction-tuned gains)
  • Teacher-driven distillation and curated synthetic reasoning data
  • Multi-stage RL-based alignment for stronger downstream reasoning

Inference Optimization

  • Use GPT-4o for lower-cost judge tasks; GPT-4-turbo for high-precision math judging
  • Prefer simple Direct I/O prompts for modern SLMs before adding CoT
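The advice to start with Direct I/O and add Chain-of-Thought only if it helps can be sketched as a small prompt builder plus a decision rule. The template wording, the function names, and the 2-point threshold are assumptions for illustration; they are not the paper's exact prompts or criteria.

```python
# Illustrative templates; the paper's exact prompt wording is not reproduced here.
DIRECT_IO_TEMPLATE = "Question: {question}\nAnswer:"
COT_TEMPLATE = "Question: {question}\nLet's think step by step.\nAnswer:"

TEMPLATES = {"direct": DIRECT_IO_TEMPLATE, "cot": COT_TEMPLATE}


def build_prompt(question: str, style: str = "direct") -> str:
    """Render a question with either the Direct I/O or the CoT template."""
    if style not in TEMPLATES:
        raise ValueError(f"unknown prompt style: {style!r}")
    return TEMPLATES[style].format(question=question.strip())


def prefer_cot(direct_acc: float, cot_acc: float, min_gain: float = 2.0) -> bool:
    """Adopt CoT only if it beats Direct I/O by at least min_gain points.

    The default threshold is an arbitrary illustrative choice.
    """
    return (cot_acc - direct_acc) >= min_gain


print(build_prompt("What is 12 * 7?"))
print(build_prompt("What is 12 * 7?", style="cot"))
```

CoT prompts usually produce longer outputs, so requiring a minimum gain before switching keeps the shorter prompt as the default.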

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Primary evaluation relies heavily on GPT-4 variants, which can mislabel long or nonsensical model outputs (noted in D.8).
  • Parsing of sorting-task outputs uses several regex variants with a fallback; parsing errors still occur and can cause outputs to be mis-scored (D.7).
  • Human evaluation was internal (a single graduate annotator plus the authors), which limits diversity and external validation (G).
  • Study focuses on common academic benchmarks; real-world domain reasoning may differ.

When Not To Use

  • When strict correctness on out-of-distribution symbolic tasks is required and pruning has been applied without recovery fine-tuning.
  • When ground truth is ambiguous and requires nuanced human judgment; in such cases LLM-as-judge verdicts should be spot-checked by humans.

Failure Modes

  • Pruned models returning nonsensical or empty outputs, including code or irrelevant tokens (Sections D.2–D.3).
  • LLM-judge misclassifying partially correct long answers as correct (D.8).
  • Parsing scripts failing on unexpected output formats, leading to false negatives or false positives in sequence tasks (D.7).
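To illustrate the parsing failure mode, here is a minimal sketch of a regex parser with a fallback for sorting-task outputs. It is a hypothetical reconstruction, not the paper's parsing script. It assumes integer inputs and a known expected length, and all names are illustrative.

```python
import re
from typing import Optional

_BRACKETED = re.compile(r"\[([^\[\]]*)\]")  # contents of a [ ... ] list
_INTEGER = re.compile(r"-?\d+")


def parse_sorted_list(output: str, expected_len: int) -> Optional[list[int]]:
    """Extract a list of integers from free-form model output.

    Strategy (an assumption for illustration):
    1. Prefer the last bracketed list with exactly expected_len integers.
    2. Fall back to the last expected_len integers found anywhere.
    Returns None when neither strategy yields enough numbers.
    """
    if expected_len <= 0:
        raise ValueError("expected_len must be positive")

    for body in reversed(_BRACKETED.findall(output)):
        numbers = [int(n) for n in _INTEGER.findall(body)]
        if len(numbers) == expected_len:
            return numbers

    # Fallback is permissive and can pick up unrelated numbers,
    # which is one source of false positives and false negatives.
    numbers = [int(n) for n in _INTEGER.findall(output)]
    if len(numbers) >= expected_len:
        return numbers[-expected_len:]
    return None


def is_sorted_correctly(output: str, inputs: list[int]) -> bool:
    """Score an output as correct only if it parses to the sorted inputs."""
    return parse_sorted_list(output, len(inputs)) == sorted(inputs)


print(parse_sorted_list("Sorted: [1, 2, 3]", 3))
print(is_sorted_correctly("The answer is -2, 0, 5.", [5, -2, 0]))
```

The fallback branch is permissive by design, so logging which branch produced each parse makes it easier to audit suspicious scores by hand.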

Core Entities

Models

  • Qwen2.5
  • Qwen2
  • Llama-3.1
  • Llama-3.2
  • Mistral
  • Phi-3.5
  • SmolLM2
  • Hymba
  • Minitron
  • Phi-4
  • DeepSeek-R1 (referenced)

Metrics

  • Accuracy
  • MR-Score (meta-reasoning composite)
  • Agreement ρ with human
  • True Positive Rate (TPR)
  • True Negative Rate (TNR)
  • Mean Absolute Error (MAE)

Datasets

  • GSM8K
  • MATH
  • MathQA
  • ARC-Easy
  • ARC-Challenge
  • CommonsenseQA
  • OpenBookQA
  • HellaSwag
  • GSM-Plus
  • MR-GSM8K
  • MR-Ben
  • Sorting tasks (8/16/32, positive/mixed)

Benchmarks

  • THINKSLM (this paper)
  • GSM-Plus
  • MR-GSM8K
  • MR-Ben