Systematic benchmark shows small models can reason if trained and compressed carefully

Overview

Decision Snapshot: Ready For Pilot

The study is practical: its evaluation pipeline is validated against human judgments and covers many models and compression methods. However, reliance on LLM judges and some parser edge cases lower absolute confidence.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 45%

Authors

Gaurav Srivastava, Shuxiang Cao, Xuan Wang

Why It Matters For Business

You can deploy cost-efficient models for math and stepwise reasoning by choosing strong training recipes and quantizing large models rather than simply scaling parameters.

Who Should Care

ML Engineer, Product Manager, CTO

Summary TLDR

This paper builds THINKSLM, a large benchmark and leaderboard that evaluates reasoning in 72 small language models (SLMs) across 17 tasks. It validates LLM-as-a-judge (GPT-4-turbo/GPT-4o) against humans, then shows that training recipe and data quality matter more than raw parameter count. Quantization usually preserves reasoning, pruning often breaks it, and some well-trained SLMs (notably Qwen2.5 variants) approach or match much larger models on math and intermediate reasoning.

Problem Statement

Reasoning is often treated as an emergent property of very large models. We lack a systematic, reproducible evaluation that measures whether small models and compressed variants (quantized, pruned, distilled) can reach strong reasoning performance and remain robust under stressors.

Main Contribution

THINKSLM benchmark and public leaderboard evaluating 72 SLM variants across 17 reasoning tasks.

Empirical validation that GPT-4-turbo/GPT-4o align closely with human judgments and can be used as primary evaluators.

Key Findings

GPT-4-turbo agrees with human judgments nearly perfectly on reasoning evaluation

Numbers: agreement ρ = 0.99 (95% CI [0.98, 1.00])

Practical Use: Use GPT-4-turbo/GPT-4o as a scalable judge; verify a sample of cases with humans when edge cases or nonsensical outputs appear

Evidence Ref: Table 1, Sec. 3.3
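The agreement figure above can be reproduced in spirit on your own labels. The summary does not say how ρ or its interval is computed, so the following is a minimal sketch under stated assumptions: ρ is taken to be the Pearson correlation over paired per-item 0/1 correctness labels, and the interval is a percentile bootstrap. The function names and the sample labels are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation (assumed method)."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical audit labels: 1 = answer graded correct, 0 = graded incorrect.
human = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

rho = pearson(human, judge)
ci_low, ci_high = bootstrap_ci(human, judge)
print(f"rho={rho:.3f}, 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

With only a handful of items the interval is wide; an audit on the order of the 1000 samples reported in the Results table is what makes a tight interval plausible.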

Some SLMs reach high math accuracy when well trained

Numbers: Qwen2.5-14B ≈ 94.29% on GSM8K (Direct I/O)

Practical Use: Prefer investing in better pretraining and instruction tuning over only increasing parameter count for math tasks

Evidence Ref: Table 3, Sec. 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM-as-judge agreement with humans	GPT-4-turbo ρ=0.99; GPT-4o ρ=0.98	Human eval (ρ=1.00)	GPT-4-turbo ≈ +0 vs human	GSM8K, ARC-E, ARC-C, CommonsenseQA, GSM-Plus (1000 samples)	Table 1; Sec. 3.3	Table 1
Accuracy	Qwen2.5-14B: 94.29% (GSM8K, Direct I/O)	many SLMs range 30–90%	Substantial vs family peers (e.g., Mistral-7B lower)	GSM8K (test split)	Table 3, Sec. 4.1	Table 3

What To Try In 7 Days

Run GPT-4o/GPT-4-turbo on a 1000-sample audit to validate internal grading and catch judge biases.

Quantize a top-performing pre-trained model (e.g., GPTQ 8/4-bit) and compare GSM8K and key tasks to the FP baseline.

Avoid aggressive pruning for reasoning workloads; if needed, run a targeted recovery fine-tune on diverse reasoning data (GSM8K + MR-GSM8K).
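For the judge audit in the first step, a small script can turn paired human and judge labels into the rates listed under Metrics (TPR, TNR) plus raw agreement. This is a sketch only: it treats the human label as ground truth, which is an assumption, and the function name and sample labels are hypothetical.

```python
def judge_audit(human, judge):
    """Compare judge labels to human labels (1 = correct, 0 = incorrect).

    Human labels are treated as ground truth (an assumption of this sketch).
    """
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h == 1 and j == 1)
    tn = sum(1 for h, j in pairs if h == 0 and j == 0)
    fp = sum(1 for h, j in pairs if h == 0 and j == 1)
    fn = sum(1 for h, j in pairs if h == 1 and j == 0)
    return {
        # Share of human-correct answers the judge also accepts.
        "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of human-incorrect answers the judge also rejects.
        "tnr": tn / (tn + fp) if (tn + fp) else float("nan"),
        "agreement": (tp + tn) / len(pairs),
        # Items to hand back to a human reviewer.
        "disagreements": [i for i, (h, j) in enumerate(pairs) if h != j],
    }


# Hypothetical audit sample; a real audit would use many more items.
human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
judge_labels = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]
report = judge_audit(human_labels, judge_labels)
print(report)
```

A low TNR is the pattern to watch for here, since the failure modes reported for the judge involve accepting partially correct long answers.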

Optimization Features

Token Efficiency

Accuracy

Infra Optimization

Measured GPU memory on A100/H100; total compute ≈24k GPU hours

Use disk+GPU tradeoffs described in resource tables

Model Optimization

Post-training quantization (GPTQ, INT8/INT4, FP8)

Knowledge distillation with teacher outputs

Avoid aggressive unstructured/structured pruning for reasoning models
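To make the quantization item concrete, the sketch below shows plain round-to-nearest symmetric quantization of a weight vector and the error it introduces at 8 and 4 bits. This is deliberately not GPTQ, which additionally uses calibration data to compensate for rounding error layer by layer; it only illustrates the basic INT8/INT4 idea. The function names and sample weights are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp after rounding so values stay inside the integer range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]

q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)
err8 = max_abs_error(weights, dequantize(q8, s8))
err4 = max_abs_error(weights, dequantize(q4, s4))
print(f"int8 max error: {err8:.5f}, int4 max error: {err4:.5f}")
```

The rounding error per weight is bounded by half the scale, so halving the bit width from 8 to 4 widens the error bound roughly 18-fold (127/7). That is one reason to compare task accuracy against the full-precision baseline, as suggested in "What To Try In 7 Days".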

System Optimization

vLLM and Hugging Face accelerate multi-GPU inference

Automatic GPU allocation depending on model size
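The summary does not describe how GPUs are allocated by model size. One plausible rule, sketched here purely as an assumption, estimates weight memory as parameter count times bytes per parameter, adds headroom for activations and KV cache, and rounds the GPU count up to a power of two (a common choice for tensor parallelism). The 80 GB capacity, the 1.2 overhead factor, and the function name are all hypothetical.

```python
import math


def gpus_needed(params_billion, bytes_per_param=2.0, gpu_mem_gb=80.0, overhead=1.2):
    """Estimate how many GPUs are needed to hold a model for inference.

    params_billion: parameter count in billions.
    bytes_per_param: 2.0 for FP16/BF16, 1.0 for INT8, 0.5 for INT4.
    gpu_mem_gb: memory per GPU (assumed 80 GB here).
    overhead: multiplier for activations and KV cache (assumed value).
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    required_gb = weights_gb * overhead
    count = max(1, math.ceil(required_gb / gpu_mem_gb))
    # Round up to the next power of two (1, 2, 4, 8, ...).
    return 2 ** math.ceil(math.log2(count))


for size in (1.5, 14, 72):
    print(f"{size}B params in FP16 -> {gpus_needed(size)} GPU(s)")
print(f"72B params in INT4 -> {gpus_needed(72, bytes_per_param=0.5)} GPU(s)")
```

Under these assumptions, quantizing to 4 bits is what moves a 72B-parameter model from a multi-GPU setup to a single device, which is the cost argument made in "Why It Matters For Business".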

Training Optimization

Instruction tuning improves reasoning (e.g., Qwen2.5 instruction-tuned gains)

Teacher-driven distillation and curated synthetic reasoning data

Multi-stage RL-based alignment for stronger downstream reasoning

Inference Optimization

Use GPT-4o for lower-cost judge tasks; GPT-4-turbo for high-precision math judging

Prefer simple Direct I/O prompts for modern SLMs before adding CoT

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://ctrlgaurav.github.io/thinkslm.github.io/

https://github.com/thinkslm (paper references leaderboard and scripts; prompts and parsing scripts in Appendix C)

Data URLs

https://huggingface.co/models (models used)

Public benchmarks: GSM8K, MATH, MathQA, ARC, CommonsenseQA, OpenBookQA, HellaSwag (standard datasets)

Risks & Boundaries

Limitations

The primary evaluation relies heavily on GPT-4 variants, which can mislabel long or nonsensical model outputs (noted in D.8).

Parsing of sorting-task outputs uses several regex variants with a fallback; parsing errors still occur and can cause outputs to be mis-evaluated (D.7).
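The paper's actual parsing scripts are in its Appendix C and are not reproduced here. As an illustration of the "regex variants plus fallback" approach and why it can still mis-evaluate outputs, here is a minimal sketch. The patterns, the function name, and the fallback rule are assumptions, not the authors' implementation.

```python
import re

_INT = re.compile(r"-?\d+")

# Ordered regex variants; earlier patterns are considered more reliable.
_PATTERNS = [
    re.compile(r"\[([^\[\]]*)\]"),                         # "[1, 2, 3]"
    re.compile(r"sorted[^:\n]*:\s*(.+)", re.IGNORECASE),   # "Sorted list: 1, 2, 3"
]


def parse_sorted_output(text):
    """Extract a list of integers from a model's sorting answer.

    Returns None when no integers can be recovered.
    """
    for pattern in _PATTERNS:
        matches = pattern.findall(text)
        if matches:
            # Use the last match, assuming the final answer comes last.
            numbers = _INT.findall(matches[-1])
            if numbers:
                return [int(n) for n in numbers]
    # Fallback: take every integer on the last non-empty line.
    lines = [line for line in text.splitlines() if line.strip()]
    if lines:
        numbers = _INT.findall(lines[-1])
        if numbers:
            return [int(n) for n in numbers]
    return None


print(parse_sorted_output("The sorted list is [-3, 1, 4, 8]."))
print(parse_sorted_output("Sorted: 2, 5, 9"))
print(parse_sorted_output("Here you go\n7 8 10"))
print(parse_sorted_output("I cannot sort this."))
```

The fallback is where mistakes creep in: a last line such as "Step 3 of 4 complete" would be parsed as the list [3, 4], so outputs that reach the fallback are good candidates for manual review.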

When Not To Use

When strict correctness on out-of-distribution symbolic tasks is required and pruning has been applied without recovery fine-tuning.

When ground truth is ambiguous and requires nuanced human judgment; in such cases LLM-as-judge verdicts should be spot-checked by humans.

Failure Modes

Pruned models returning nonsensical or empty outputs, including code or irrelevant tokens (Sections D.2–D.3).

LLM-judge misclassifying partially correct long answers as correct (D.8).

Core Entities

Models

Qwen2.5, Qwen2, Llama-3.1, Llama-3.2, Mistral, Phi-3.5, SmolLM2, Hymba, Minitron, Phi-4, DeepSeek-R1 (referenced)

Metrics

Accuracy, MR-Score (meta-reasoning composite), Agreement ρ with human, True Positive Rate (TPR), True Negative Rate (TNR), Mean Absolute Error (MAE)

Datasets

GSM8K, MATH, MathQA, ARC-Easy, ARC-Challenge, CommonsenseQA, OpenBookQA, HellaSwag, GSM-Plus, MR-GSM8K, MR-Ben, Sorting tasks (8/16/32, positive/mixed)

Benchmarks

THINKSLM (this paper), GSM-Plus, MR-GSM8K, MR-Ben

