Systematic benchmark shows small models can reason if trained and compressed carefully

February 17, 2025 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The study is practical: the evaluation pipeline is validated against human judgments and covers many models and compression methods. However, reliance on LLM judges and some parser edge cases lowers absolute confidence.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 45%

Authors

Gaurav Srivastava, Shuxiang Cao, Xuan Wang

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can deploy cost-efficient models for math and stepwise reasoning by choosing strong training recipes and quantizing large models rather than simply scaling parameters.

Summary TLDR

This paper builds THINKSLM, a large benchmark and leaderboard that evaluates reasoning in 72 small language models (SLMs) across 17 tasks. It validates LLM-as-a-judge (GPT-4-turbo/GPT-4o) against humans, then shows that training recipe and data quality matter more than raw parameter count. Quantization usually preserves reasoning, pruning often breaks it, and some well-trained SLMs (notably Qwen2.5 variants) approach or match much larger models on math and intermediate reasoning.

Problem Statement

Reasoning is often treated as an emergent property of very large models. We lack a systematic, reproducible evaluation that measures whether small models and compressed variants (quantized, pruned, distilled) can reach strong reasoning performance and remain robust under stressors.

Main Contribution

THINKSLM benchmark and public leaderboard evaluating 72 SLM variants across 17 reasoning tasks.

Empirical validation that GPT-4-turbo/GPT-4o align closely with human judgments and can be used as primary evaluators.

Key Findings

GPT-4-turbo agrees with human judgments nearly perfectly on reasoning evaluation

Numbers: agreement ρ = 0.99 (95% CI [0.98, 1.00])

Practical Use: Use GPT-4-turbo/GPT-4o as a scalable judge; verify a sample of cases with humans when edge cases or nonsensical outputs appear

Evidence Ref: Table 1, Sec. 3.3
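
As a rough illustration of how judge-human agreement of this kind could be checked on an internal audit set, the sketch below computes a Pearson correlation plus the judge's true positive and true negative rates over binary verdicts. The source does not say which coefficient ρ denotes or how verdicts were encoded, so the encoding (1 = correct, 0 = incorrect), the function names, and the toy labels are all assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and the same length")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(vx * vy)

def judge_rates(human, judge):
    """TPR and TNR of the judge, treating human verdicts as ground truth."""
    tp = sum(1 for h, j in zip(human, judge) if h == 1 and j == 1)
    tn = sum(1 for h, j in zip(human, judge) if h == 0 and j == 0)
    pos = sum(human)
    neg = len(human) - pos
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return tpr, tnr

# Toy verdicts (assumed): 1 = answer judged correct, 0 = judged incorrect.
human = [1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 1, 0, 1, 0, 1, 1, 1]

rho = pearson(human, judge)
tpr, tnr = judge_rates(human, judge)
print(f"rho={rho:.3f} TPR={tpr:.2f} TNR={tnr:.2f}")
```

Reporting TPR and TNR alongside the correlation is a deliberate choice here: a judge that is lenient toward wrong answers shows up as a low TNR even when the overall correlation looks acceptable.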

Some SLMs reach high math accuracy when well trained

Numbers: Qwen2.5-14B ≈ 94.29% on GSM8K (Direct I/O)

Practical Use: Prefer investing in better pretraining and instruction tuning over only increasing parameter count for math tasks

Evidence Ref: Table 3, Sec. 4.1

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LLM-as-judge agreement with humans | GPT-4-turbo ρ=0.99; GPT-4o ρ=0.98 | Human eval (ρ=1.00) | GPT-4-turbo ≈ +0 vs human | GSM8K, ARC-E, ARC-C, CommonsenseQA, GSM-Plus (1000 samples) | Table 1; Sec. 3.3 | Table 1
Accuracy | Qwen2.5-14B: 94.29% (GSM8K, Direct I/O) | many SLMs range 30–90% | Substantial vs family peers (e.g., Mistral-7B lower) | GSM8K (test split) | Table 3, Sec. 4.1 | Table 3

What To Try In 7 Days

Run GPT-4o/GPT-4-turbo on a 1000-sample audit to validate internal grading and catch judge biases.

Quantize a top-performing pre-trained model (e.g., GPTQ 8/4-bit) and compare its scores on GSM8K and other key tasks against the full-precision (FP) baseline.

Avoid aggressive pruning for reasoning workloads; if needed, run a targeted recovery fine-tune on diverse reasoning data (GSM8K + MR-GSM8K).
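
The quantization comparison above comes down to scoring both variants on the same items and reporting the accuracy delta. A minimal sketch follows, assuming per-item correctness flags are already available for each run; `accuracy_delta` and the toy flags are hypothetical, and no particular quantization library is implied.

```python
def accuracy(flags):
    """Fraction of items marked correct (flags are booleans)."""
    if not flags:
        raise ValueError("no items to score")
    return sum(flags) / len(flags)

def accuracy_delta(baseline_flags, variant_flags):
    """Accuracy change of the variant vs. the baseline on the same items,
    plus the indices of items that regressed (correct -> incorrect)."""
    if len(baseline_flags) != len(variant_flags):
        raise ValueError("both runs must cover the same items in the same order")
    delta = accuracy(variant_flags) - accuracy(baseline_flags)
    regressions = [
        i
        for i, (b, v) in enumerate(zip(baseline_flags, variant_flags))
        if b and not v
    ]
    return delta, regressions

# Toy flags (assumed): full-precision run vs. a 4-bit quantized run.
fp_flags = [True, True, False, True, True, False, True, True, True, True]
q4_flags = [True, True, False, True, False, False, True, True, True, True]

delta, regressed = accuracy_delta(fp_flags, q4_flags)
print(f"delta = {delta:+.2%}, regressed items = {regressed}")
```

Keeping the list of regressed item indices, not just the aggregate delta, makes it easier to see whether a quantized or pruned variant fails on a particular problem type.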

Optimization Features

Token Efficiency

Accuracy

Infra Optimization
Measured GPU memory on A100/H100; total compute ≈24k GPU hours
Use disk+GPU tradeoffs described in resource tables

Model Optimization
Post-training quantization (GPTQ, INT8/INT4, FP8)
Knowledge distillation with teacher outputs
Avoid aggressive unstructured/structured pruning for reasoning models

System Optimization
vLLM and Hugging Face accelerate multi-GPU inference
Automatic GPU allocation depending on model size

Training Optimization
Instruction tuning improves reasoning (e.g., Qwen2.5 instruction-tuned gains)
Teacher-driven distillation and curated synthetic reasoning data
Multi-stage RL-based alignment for stronger downstream reasoning

Inference Optimization
Use GPT-4o for lower-cost judge tasks; GPT-4-turbo for high-precision math judging
Prefer simple Direct I/O prompts for modern SLMs before adding CoT
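
To see why the post-training quantization options listed under Model Optimization change the cost picture, a weight-only estimate helps: storage is roughly parameter count times bits per weight, divided by 8. The sketch below applies that arithmetic to a 14B-parameter model. These are not figures from the paper; the estimate ignores activations, KV cache, and the per-group scale overhead that GPTQ-style formats add, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9

params_14b = 14e9  # assumed parameter count for a 14B model

for label, bits in [("FP16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{label:>8}: {weight_memory_gb(params_14b, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 28 GB at FP16 to about 14 GB at 8 bits and about 7 GB at 4 bits, which is the difference between needing a large data-center GPU and fitting on a much smaller card.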

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

https://huggingface.co/models (models used)

Public benchmarks: GSM8K, MATH, MathQA, ARC, CommonsenseQA, OpenBookQA, HellaSwag (standard datasets)

Risks & Boundaries

Limitations

The primary evaluator relies heavily on GPT-4 variants, which can mislabel long or nonsensical model outputs (noted in D.8).

Parsing of sorting-task outputs uses several regex variants with a fallback; parsing errors still occur and can cause outputs to be mis-scored (D.7).
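
The parsing limitation is easier to picture with a concrete parser. The sketch below is an assumed illustration, not the paper's code: it tries a bracketed-list pattern first, then falls back to collecting every integer on the last non-empty line of the output. All names and patterns are hypothetical.

```python
import re

# Variant 1: a bracketed, comma-separated integer list such as "[1, 2, 3]".
BRACKETED = re.compile(r"\[\s*(-?\d+(?:\s*,\s*-?\d+)*)\s*\]")
INTEGER = re.compile(r"-?\d+")

def parse_sorted_list(output):
    """Extract a list of ints from model output, or None if nothing parses."""
    matches = BRACKETED.findall(output)
    if matches:
        # Use the last bracketed list, assuming the final answer comes last.
        return [int(tok) for tok in INTEGER.findall(matches[-1])]
    # Fallback: all integers on the last non-empty line.
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if lines:
        nums = INTEGER.findall(lines[-1])
        if nums:
            return [int(tok) for tok in nums]
    return None

def is_correct(output, unsorted):
    """True if the parsed answer equals the ascending sort of the input."""
    return parse_sorted_list(output) == sorted(unsorted)
```

The fallback is where mis-scoring creeps in: `parse_sorted_list("Step 2 gives 1 2 3")` returns `[2, 1, 2, 3]`, because the step number is swept up with the answer, so a correct response would be marked wrong.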

When Not To Use

When strict correctness on out-of-distribution symbolic tasks is required and pruning has been applied without recovery fine-tuning.

When ground truth is ambiguous and requires nuanced human judgment; in such cases LLM-as-judge verdicts should be spot-checked by humans.

Failure Modes

Pruned models returning nonsensical or empty outputs, including code or irrelevant tokens (Sections D.2–D.3).

LLM-judge misclassifying partially correct long answers as correct (D.8).
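
One way to act on these failure modes is to screen outputs before they reach the judge and route suspicious ones to a human reviewer. The heuristics below are assumptions for illustration, not checks described in the paper; the thresholds are untuned, and the `triage` name and flag labels are hypothetical.

```python
def triage(output, max_words=400, min_clean_ratio=0.5, max_repeat=5):
    """Return a list of reasons this output deserves human review (empty if none)."""
    text = output.strip()
    if not text:
        return ["empty"]
    flags = []
    words = text.split()
    # Long answers are where the judge is reported to misclassify partial credit.
    if len(words) > max_words:
        flags.append("very_long")
    # Share of alphanumeric or whitespace characters; low values suggest token noise.
    clean = sum(ch.isalnum() or ch.isspace() for ch in text) / len(text)
    if clean < min_clean_ratio:
        flags.append("mostly_symbols")
    # The same word repeated back-to-back suggests degenerate looping.
    longest, run = 1, 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    if longest >= max_repeat:
        flags.append("repetition")
    return flags
```

An empty list means the output goes to the LLM judge as usual; any flag sends it to the human spot-check queue instead. These checks are cheap but crude, so they complement a sampled human audit and do not replace it.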

Core Entities

Models

Qwen2.5, Qwen2, Llama-3.1, Llama-3.2, Mistral, Phi-3.5, SmolLM2, Hymba, Minitron, Phi-4, DeepSeek-R1 (referenced)

Metrics

Accuracy, MR-Score (meta-reasoning composite), Agreement ρ with human, True Positive Rate (TPR), True Negative Rate (TNR), Mean Absolute Error (MAE)

Datasets

GSM8K, MATH, MathQA, ARC-Easy, ARC-Challenge, CommonsenseQA, OpenBookQA, HellaSwag, GSM-Plus, MR-GSM8K, MR-Ben, Sorting tasks (8/16/32, positive/mixed)

Benchmarks

THINKSLM (this paper), GSM-Plus, MR-GSM8K, MR-Ben