Overview
The study is practical: its evaluation pipeline is validated against human judgments and covers many models and compression methods. However, reliance on LLM judges and some parser edge cases lower confidence in the absolute numbers.
Citations: 1
Evidence Strength: 0.85
Confidence: 0.82
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 70%
Novelty: 45%
Why It Matters For Business
You can deploy cost-efficient models for math and stepwise reasoning by choosing strong training recipes and quantizing large models rather than simply scaling parameters.
Summary TLDR
This paper builds THINKSLM, a large benchmark and leaderboard that evaluates reasoning in 72 small language models (SLMs) across 17 tasks. It validates LLM-as-a-judge (GPT-4-turbo/GPT-4o) against humans, then shows that training recipe and data quality matter more than raw parameter count. Quantization usually preserves reasoning, pruning often breaks it, and some well-trained SLMs (notably Qwen2.5 variants) approach or match much larger models on math and intermediate reasoning.
Problem Statement
Reasoning is often treated as an emergent property of very large models. We lack a systematic, reproducible evaluation that measures whether small models and compressed variants (quantized, pruned, distilled) can reach strong reasoning performance and remain robust under stressors.
Main Contribution
THINKSLM benchmark and public leaderboard evaluating 72 SLM variants across 17 reasoning tasks.
Empirical validation that GPT-4-turbo/GPT-4o align closely with human judgments and can be used as primary evaluators.
Key Findings
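GPT-4-turbo agrees with human judgments nearly perfectly on reasoning evaluation (ρ=0.99 on 1000 samples; GPT-4o reaches ρ=0.98).
Some well-trained SLMs reach high math accuracy (e.g., Qwen2.5-14B scores 94.29% on GSM8K with Direct I/O).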
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LLM-as-judge agreement with humans | GPT-4-turbo ρ=0.99; GPT-4o ρ=0.98 | Human eval (ρ=1.00) | -0.01 (GPT-4-turbo); -0.02 (GPT-4o) vs. human | GSM8K, ARC-E, ARC-C, CommonsenseQA, GSM-Plus (1000 samples) | Table 1; Sec. 3.3 | Table 1 |
| GSM8K accuracy | Qwen2.5-14B: 94.29% (Direct I/O) | Many SLMs range 30–90% | Substantially higher than peers (e.g., Mistral-7B is lower) | GSM8K (test split) | Table 3; Sec. 4.1 | Table 3 |
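The judge-versus-human agreement in the first row can be checked on an internal sample with a few lines of Python. This is a minimal sketch, not the paper's procedure: the summary does not say which correlation coefficient ρ denotes, so Pearson correlation is an assumption here, and the function names and the verdict lists are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def agreement_rate(judge_labels, human_labels):
    """Fraction of samples where judge and human give the same verdict."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty, equal-length label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Hypothetical per-sample verdicts (1 = correct, 0 = incorrect).
human = [1, 0, 1, 1, 0, 1, 0, 1]
judge = [1, 0, 1, 1, 0, 1, 1, 1]

print(f"agreement: {agreement_rate(judge, human):.3f}")
print(f"pearson:   {pearson(judge, human):.3f}")
```

Reporting the raw agreement rate next to the correlation helps when verdicts are heavily skewed toward one class, where a correlation alone can be hard to interpret.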
What To Try In 7 Days
Run GPT-4o/GPT-4-turbo on a 1000-sample audit to validate internal grading and catch judge biases.
Quantize a top-performing pre-trained model (e.g., GPTQ 8/4-bit) and compare GSM8K and key tasks to the FP baseline.
Avoid aggressive pruning for reasoning workloads; if needed, run a targeted recovery fine-tune on diverse reasoning data (GSM8K + MR-GSM8K).
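The quantization comparison above can be tracked with a small per-task delta report. The sketch below assumes accuracies are already measured and stored as percentages per task; the function names, the 1.0-point tolerance, and all numbers are hypothetical and are not results from the paper.

```python
def accuracy_deltas(baseline, variant):
    """Per-task accuracy change (variant minus baseline), in percentage points.

    Only tasks present in both result dicts are compared.
    """
    shared = sorted(set(baseline) & set(variant))
    return {task: round(variant[task] - baseline[task], 2) for task in shared}


def regressions(deltas, tolerance=1.0):
    """Tasks whose accuracy dropped by more than `tolerance` points."""
    return [task for task, delta in deltas.items() if delta < -tolerance]


# Hypothetical numbers for illustration only.
fp_baseline = {"gsm8k": 80.0, "arc_c": 70.0}
gptq_4bit = {"gsm8k": 79.5, "arc_c": 67.0}

deltas = accuracy_deltas(fp_baseline, gptq_4bit)
print(deltas)
print("regressed tasks:", regressions(deltas))
```

The same report works for a pruned variant before and after a recovery fine-tune, which makes it easy to see whether the fine-tune actually closed the gap.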
Risks & Boundaries
Limitations
The primary evaluator relies heavily on GPT-4 variants, which can mislabel long or nonsensical model outputs (noted in D.8).
Parsing of sorting-task outputs uses several regex variants plus a fallback; parsing errors still occur and can cause outputs to be mis-evaluated (D.7).
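To make the parsing limitation concrete, here is a minimal sketch of a regex-plus-fallback parser for sorting outputs. It is not the paper's parser: the patterns, the "last bracketed list" fallback, and the name `parse_sorted_list` are all assumptions chosen for illustration.

```python
import re

# Patterns tried in order, from most to least specific (hypothetical).
_PATTERNS = [
    re.compile(r"sorted list\s*[:=]\s*\[([^\]]*)\]", re.IGNORECASE),
    re.compile(r"answer\s*[:=]\s*\[([^\]]*)\]", re.IGNORECASE),
]
_ANY_LIST = re.compile(r"\[([^\]]*)\]")
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def _to_numbers(text):
    """Extract every number in `text` as a float."""
    return [float(token) for token in _NUMBER.findall(text)]


def parse_sorted_list(output):
    """Return the numbers in the model's answer, or None if nothing parses."""
    for pattern in _PATTERNS:
        match = pattern.search(output)
        if match:
            return _to_numbers(match.group(1))
    # Fallback: use the last bracketed list anywhere in the output.
    candidates = _ANY_LIST.findall(output)
    if candidates:
        return _to_numbers(candidates[-1])
    return None


print(parse_sorted_list("Sorted list: [1, 2, 3]"))
print(parse_sorted_list("Start with [3, 1, 2], which gives [1, 2, 3]."))
print(parse_sorted_list("I cannot sort this."))
```

Returning `None` instead of an empty list lets the evaluation count parse failures separately from wrong answers, so parser errors are not silently scored as model errors.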
When Not To Use
When strict correctness on out-of-distribution symbolic tasks is required and pruning has been applied without recovery fine-tuning.
When ground truth is ambiguous and requires nuanced human judgment; in that case, LLM-as-judge verdicts should be spot-checked by humans.
Failure Modes
Pruned models returning nonsensical or empty outputs, including code or irrelevant tokens (Sections D.2–D.3).
LLM-judge misclassifying partially correct long answers as correct (D.8).
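One way to reduce exposure to both failure modes is to route risky outputs to human review before they reach the LLM judge. The heuristic below is a minimal sketch under stated assumptions: the function name, the 400-word limit, and the 0.3 unique-word ratio are hypothetical thresholds, not values from the paper, and would need tuning on real outputs.

```python
def flag_for_review(output, max_words=400, min_unique_ratio=0.3):
    """Return a reason string if the output looks risky to auto-grade, else None.

    Checks, in order: empty output, very long output, highly repetitive output.
    """
    words = output.split()
    if not words:
        return "empty"
    if len(words) > max_words:
        return "too_long"
    # Only judge repetitiveness on outputs long enough for the ratio to mean something.
    if len(words) >= 20 and len(set(words)) / len(words) < min_unique_ratio:
        return "repetitive"
    return None


samples = ["   ", "The answer is 42.", "foo " * 50, "step " * 500]
for sample in samples:
    print(repr(sample[:20]), "->", flag_for_review(sample))
```

Flagged outputs can be sent to human spot-checks rather than dropped, so the audit still covers the cases where the judge is least reliable.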

