Overview
The paper shows practical routing gains on multiple public math datasets. The results are promising, but they cover mainly math reasoning tasks and are presented as figures rather than exhaustive numeric tables.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.70
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 1/3
Findings with evidence refs: 3/3
Results with explicit delta: 3/3
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
Routing cuts inference cost by sending easy queries to cheaper models. That lowers cloud bills and lets you scale reasoning services while keeping accuracy close to the largest model's, at least on the math workloads evaluated.
Who Should Care
Summary TLDR
Train small classifiers on intermediate LLM embeddings to predict problem difficulty or a model's chance of success. Use those predictions to route each problem to the smallest model likely to solve it. On mixed math benchmarks the router matches the big model's accuracy while using about two-thirds of its inference compute.
Problem Statement
Large reasoning models are expensive. Many problems need less compute. Can we predict which problems are easy and route them to cheaper models without losing accuracy?
Main Contribution
Train lightweight classifiers on intermediate embeddings of s1.1-32B to predict problem difficulty (1–5) and per-model correctness (binary).
Design threshold-based routers that send each problem to the smallest model predicted to succeed.
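A minimal sketch of the per-model correctness predictor might look like the following. It is not the paper's classifier: plain logistic regression stands in for the lightweight classifier, and the 4-dimensional "embeddings" are synthetic Gaussian vectors rather than hidden states from s1.1-32B. The names `train_probe`, `predict_proba`, and `make_example` are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression with full-batch gradient descent.

    X: list of embedding vectors (lists of floats).
    y: list of 0/1 labels (1 = the target model solved the problem).
    """
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that the target model answers correctly."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# ASSUMPTION: synthetic stand-in data, not real embeddings or labels.
rng = random.Random(0)


def make_example(label):
    center = 1.0 if label == 1 else -1.0
    return [rng.gauss(center, 1.0) for _ in range(4)]


labels = [i % 2 for i in range(200)]
embeddings = [make_example(label) for label in labels]

w, b = train_probe(embeddings[:150], labels[:150])
held_out = list(zip(embeddings[150:], labels[150:]))
accuracy = sum(
    (predict_proba(w, b, x) >= 0.5) == bool(label) for x, label in held_out
) / len(held_out)
```

In practice the inputs would be intermediate-layer embeddings of each problem, with one probe trained per candidate model. The 1-5 difficulty predictor would need a multi-class variant of the same idea.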
Key Findings
Middle layers of a strong reasoning model carry the most signal for difficulty and correctness prediction.
Accuracy-based routing can match or slightly beat the large model's accuracy on evaluated math tasks while using less compute.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| inference compute | ≈ 2/3 of s1.1-32B compute | s1.1-32B | ≈ -33% | MathCombined (evaluation split) | Accuracy-matched or slightly better while using two-thirds compute | Figure 4; Section 4 |
| Accuracy | comparable or slightly higher than s1.1-32B on evaluated benchmarks | s1.1-32B | small improvement reported (no exact % given) | MathCombined (evaluation split) | Router can achieve comparable and even slightly better performance than s1.1-32B | Section 4; Figure 4 |
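To make the compute row concrete, the sketch below shows how a relative-compute figure can be derived from a routing mix. The per-query costs and routing shares are hypothetical placeholders chosen for easy arithmetic; they are not numbers from the paper, and the function name `relative_compute` is an assumption. The sketch also ignores the cost of extracting embeddings and running the classifier.

```python
def relative_compute(routing_share, cost_per_query, baseline):
    """Expected per-query cost under routing, as a fraction of always
    calling the baseline model.

    routing_share: model name -> fraction of queries sent to that model.
    cost_per_query: model name -> cost in arbitrary but consistent units.
    """
    if abs(sum(routing_share.values()) - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    expected = sum(
        share * cost_per_query[name] for name, share in routing_share.items()
    )
    return expected / cost_per_query[baseline]


# ASSUMPTION: illustrative costs and shares, not measured values.
cost = {"small": 1.0, "medium": 4.0, "large": 10.0}
share = {"small": 0.25, "medium": 0.25, "large": 0.5}

ratio = relative_compute(share, cost, baseline="large")
# (0.25 * 1 + 0.25 * 4 + 0.5 * 10) / 10 = 0.625
```

Running the same calculation on your own routing logs and per-model costs gives a figure directly comparable to the "≈ 2/3 of s1.1-32B compute" entry above.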
What To Try In 7 Days
Collect a small labeled subset of your tasks and extract mid-layer embeddings from your strongest model.
Train a simple MLP to predict task difficulty or per-model correctness using those embeddings.
Implement a threshold-based router that forwards inputs to the smallest model with predicted success probability above the threshold; measure cost and accuracy trade-offs.
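The routing step in the list above can be sketched as follows, assuming one success-probability predictor per model and a model list ordered from cheapest to most expensive. The function `route`, the model names, the toy predictors, and the default threshold of 0.7 are all hypothetical; falling back to the largest model when no prediction clears the threshold is one reasonable design choice, not something the source specifies.

```python
def route(features, models_small_to_large, predictors, threshold=0.7):
    """Return the name of the smallest model whose predicted success
    probability meets the threshold, else the largest model.

    predictors: model name -> callable(features) -> probability in [0, 1].
    """
    for name in models_small_to_large:
        if predictors[name](features) >= threshold:
            return name
    # No model is predicted to succeed; use the strongest one available.
    return models_small_to_large[-1]


# ASSUMPTION: toy predictors driven by a made-up scalar difficulty score.
# Real predictors would be classifiers over mid-layer embeddings.
models = ["small", "medium", "large"]
predictors = {
    "small": lambda f: 1.0 - f["difficulty"],
    "medium": lambda f: 1.0 - 0.5 * f["difficulty"],
    "large": lambda f: 1.0 - 0.2 * f["difficulty"],
}

easy_choice = route({"difficulty": 0.1}, models, predictors)  # "small"
mid_choice = route({"difficulty": 0.5}, models, predictors)   # "medium"
hard_choice = route({"difficulty": 0.9}, models, predictors)  # "large"
```

Raising the threshold sends more queries to larger models, trading cost for accuracy. Sweeping it over a held-out set is a simple way to chart that trade-off.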
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Evaluation is focused on math reasoning datasets; generalization to other domains is untested.
Router uses embeddings from a single representative model (s1.1-32B); this may not generalize to heterogeneous model pools.
When Not To Use
If your workload is non-mathematical or differs strongly from the evaluated benchmarks.
If per-query latency is critical and routing overhead could negate savings.
Failure Modes
Predictor misclassification routes hard problems to weak models, causing accuracy drops.
Domain shift: embeddings from the representative model may not reflect other models' failure modes.

