Overview
Production Readiness
0.7
Novelty Score
0.45
Cost Impact Score
0.65
Citation Count
1
Why It Matters For Business
You can deploy cost-efficient models for math and stepwise reasoning by choosing strong training recipes and quantizing large models rather than simply scaling parameters.
Summary TLDR
This paper builds THINKSLM, a large benchmark and leaderboard that evaluates reasoning in 72 small language models (SLMs) across 17 tasks. It validates LLM-as-a-judge (GPT-4-turbo/GPT-4o) against humans, then shows that training recipe and data quality matter more than raw parameter count. Quantization usually preserves reasoning, pruning often breaks it, and some well-trained SLMs (notably Qwen2.5 variants) approach or match much larger models on math and intermediate reasoning.
Problem Statement
Reasoning is often treated as an emergent property of very large models. We lack a systematic, reproducible evaluation that measures whether small models and compressed variants (quantized, pruned, distilled) can reach strong reasoning performance and remain robust under stressors.
Main Contribution
THINKSLM benchmark and public leaderboard evaluating 72 SLM variants across 17 reasoning tasks.
Empirical validation that GPT-4-turbo/GPT-4o align closely with human judgments and can be used as primary evaluators.
Large-scale study of compression: quantization preserves reasoning broadly, pruning often causes severe failures, distillation helps but is mixed.
Actionable insights showing training data, instruction tuning, and distillation can beat simple parameter scaling for reasoning.
Key Findings
GPT-4-turbo agrees with human judgments nearly perfectly on reasoning evaluation
Some SLMs reach high math accuracy when well trained
Quantization preserves reasoning quality at a much smaller memory footprint
Pruning often catastrophically damages reasoning
Intermediate/meta-reasoning can be matched by open SLMs
Added prompt complexity yields small or even negative gains for modern SLMs
Results
LLM-as-judge agreement with humans
Accuracy
Quantization impact (example)
Pruning impact (example)
Intermediate reasoning (MR-Score)
Who Should Care
What To Try In 7 Days
Run GPT-4o/GPT-4-turbo on a 1000-sample audit to validate internal grading and catch judge biases.
Quantize a top-performing pre-trained model (e.g., GPTQ 8/4-bit) and compare GSM8K and key tasks to the FP baseline.
Avoid aggressive pruning for reasoning workloads; if needed, run a targeted recovery fine-tune on diverse reasoning data (GSM8K + MR-GSM8K).
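The quantized-versus-FP comparison suggested above can be sketched as a small scoring script. This is a minimal sketch, not the paper's evaluation code. It relies on the fact that GSM8K reference solutions end with `#### <answer>`, and it assumes (as a heuristic introduced here) that the last number in a model output is its final answer. The sample outputs and solutions are invented toy data.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """GSM8K reference solutions end with '#### <answer>'."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(output: str):
    """Assumption: the last number in the model output is the final answer."""
    matches = NUMBER.findall(output)
    return matches[-1].replace(",", "") if matches else None


def same_number(a, b) -> bool:
    if a is None or b is None:
        return False
    try:
        return abs(float(a) - float(b)) < 1e-6
    except ValueError:
        return False


def accuracy(outputs, solutions) -> float:
    correct = sum(
        same_number(predicted_answer(o), gold_answer(s))
        for o, s in zip(outputs, solutions)
    )
    return correct / len(solutions)


# Toy data, for illustration only.
solutions = ["9 * 2 = 18 #### 18", "1000 + 250 = 1250 #### 1,250"]
fp_outputs = ["The answer is 18.", "Total: 1,250"]
quant_outputs = ["The answer is 18", "I think it is 1,200"]

fp_acc = accuracy(fp_outputs, solutions)
quant_acc = accuracy(quant_outputs, solutions)
print(f"FP: {fp_acc:.2f}  quantized: {quant_acc:.2f}  delta: {quant_acc - fp_acc:+.2f}")
```

The last-number heuristic is fragile on long or rambling outputs, so a sample of mismatches is worth reading by hand before trusting the delta.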
Optimization Features
Token Efficiency
- Accuracy
Infra Optimization
- Measured GPU memory on A100/H100; total compute ≈24k GPU hours
- Use disk+GPU tradeoffs described in resource tables
Model Optimization
- Post-training quantization (GPTQ, INT8/INT4, FP8)
- Knowledge distillation with teacher outputs
- Avoid aggressive unstructured/structured pruning for reasoning models
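As a rough illustration of why post-training quantization lowers memory cost, the weight footprint scales linearly with bits per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement from the paper. The 7B parameter count is a hypothetical example, and the estimate ignores quantization scales and zero-points, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7B-parameter model at three precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the same model needs about 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit for weights alone.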
System Optimization
- vLLM and Hugging Face Accelerate for multi-GPU inference
- Automatic GPU allocation depending on model size
Training Optimization
- Instruction tuning improves reasoning (e.g., Qwen2.5 instruction-tuned gains)
- Teacher-driven distillation and curated synthetic reasoning data
- Multi-stage RL-based alignment for stronger downstream reasoning
Inference Optimization
- Use GPT-4o for lower-cost judge tasks; GPT-4-turbo for high-precision math judging
- Prefer simple Direct I/O prompts for modern SLMs before adding CoT
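To make the Direct I/O versus CoT distinction concrete, the following is a minimal sketch of two prompt builders. The template wording is an assumption made here for illustration; it is not the prompt text used in the paper.

```python
def build_prompt(question: str, style: str = "direct") -> str:
    """Return a prompt in one of two illustrative styles (hypothetical templates)."""
    if style == "direct":
        # Direct I/O: ask for the answer with no reasoning scaffold.
        return f"Question: {question}\nAnswer:"
    if style == "cot":
        # Chain-of-thought: invite intermediate steps before the answer.
        return f"Question: {question}\nLet's think step by step."
    raise ValueError(f"unknown prompt style: {style!r}")


question = "A box holds 12 pens. How many pens are in 5 boxes?"
print(build_prompt(question))
print(build_prompt(question, style="cot"))
```

Starting with the direct form gives a baseline, so any gain from the longer CoT prompt can be measured rather than assumed.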
Reproducibility
Code Urls
Data Urls
- https://huggingface.co/models (models used)
- Public benchmarks: GSM8K, MATH, MathQA, ARC, CommonsenseQA, OpenBookQA, HellaSwag (standard datasets)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- The primary evaluation relies heavily on GPT-4 variants, which can mislabel long or nonsensical model outputs (noted in D.8).
- Parsing of sorting-task outputs uses regex variants with a fallback; parsing errors still occur and can mis-evaluate outputs (D.7).
- Human evaluation was internal (a single graduate annotator plus the authors), which limits diversity and external validation (G).
- Study focuses on common academic benchmarks; real-world domain reasoning may differ.
When Not To Use
- When strict correctness on out-of-distribution symbolic tasks is required and pruning has been applied without recovery fine-tuning.
- When the ground truth is ambiguous and requires nuanced human judgment; in such cases LLM-as-judge verdicts should be spot-checked by humans.
Failure Modes
- Pruned models returning nonsensical or empty outputs, including code or irrelevant tokens (Sections D.2–D.3).
- LLM-judge misclassifying partially correct long answers as correct (D.8).
- Parsing scripts failing on unexpected output formats, leading to false negatives or false positives in sequence tasks (D.7).
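The parsing failure mode can be illustrated with a small parser for sorting-task outputs that tries one regex variant and then a fallback. This is a hedged sketch of the general pattern, not the paper's parsing script. Both the bracketed-list variant and the last-line fallback are assumptions chosen here.

```python
import re

BRACKETED = re.compile(r"\[([^\[\]]*)\]")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_sorted_list(output: str):
    """Return the parsed list of numbers, or None if nothing usable is found."""
    # Variant 1: the last bracketed list that contains numbers.
    for body in reversed(BRACKETED.findall(output)):
        numbers = NUMBER.findall(body)
        if numbers:
            return [float(n) for n in numbers]
    # Fallback: any numbers on the last non-empty line.
    lines = [line for line in output.splitlines() if line.strip()]
    if lines:
        numbers = NUMBER.findall(lines[-1])
        if numbers:
            return [float(n) for n in numbers]
    return None


def is_correct(output: str, items) -> bool:
    parsed = parse_sorted_list(output)
    return parsed is not None and parsed == sorted(float(x) for x in items)


print(is_correct("Sorted: [-2, 1, 3]", [3, -2, 1]))
print(is_correct("The result is 1 2 3", [3, 1, 2]))
print(is_correct("", [1]))
```

The fallback is where mis-evaluation creeps in: an output whose last line is "Step 1: compare 3 and 1" parses as `[1.0, 3.0, 1.0]`, so stray step numbers get counted as list items.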
Core Entities
Models
- Qwen2.5
- Qwen2
- Llama-3.1
- Llama-3.2
- Mistral
- Phi-3.5
- SmolLM2
- Hymba
- Minitron
- Phi-4
- DeepSeek-R1 (referenced)
Metrics
- Accuracy
- MR-Score (meta-reasoning composite)
- Agreement ρ with human judgments
- True Positive Rate (TPR)
- True Negative Rate (TNR)
- Mean Absolute Error (MAE)
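The judge-versus-human metrics listed above can be computed from paired correct/incorrect labels. The sketch below uses the standard definitions of TPR and TNR, treating the human label as ground truth. Two choices are assumptions made here: agreement is computed as a raw match rate, which may differ from the paper's ρ, and MAE is taken over per-model accuracy scores. All sample values are invented.

```python
def judge_metrics(human, judge):
    """Compare judge verdicts to human verdicts (True = answer marked correct)."""
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    positives = sum(human)
    negatives = len(human) - positives
    return {
        "agreement": (tp + tn) / len(human),
        "tpr": tp / positives if positives else float("nan"),
        "tnr": tn / negatives if negatives else float("nan"),
    }


def mean_absolute_error(human_scores, judge_scores):
    """Assumption: scores are per-model accuracies from each grader."""
    diffs = [abs(h - j) for h, j in zip(human_scores, judge_scores)]
    return sum(diffs) / len(diffs)


# Toy labels, for illustration only.
human = [True, True, True, False, False]
judge = [True, True, False, False, True]
metrics = judge_metrics(human, judge)
mae = mean_absolute_error([0.80, 0.50], [0.78, 0.55])
print(metrics, round(mae, 3))
```

Reporting TPR and TNR separately matters because a judge that accepts almost everything can still score high raw agreement when most answers are correct.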
Datasets
- GSM8K
- MATH
- MathQA
- ARC-Easy
- ARC-Challenge
- CommonsenseQA
- OpenBookQA
- HellaSwag
- GSM-Plus
- MR-GSM8K
- MR-Ben
- Sorting tasks (8/16/32, positive/mixed)
Benchmarks
- THINKSLM (this paper)
- GSM-Plus
- MR-GSM8K
- MR-Ben

