Overview
Production Readiness
0.88
Novelty Score
0.3
Cost Impact Score
0.92
Citation Count
0
Why It Matters For Business
Choosing LLM prompting for routine fixed-label classification often increases operating cost by 10–100× and raises tail latency risk; using fine-tuned encoders saves money, stabilizes SLAs, and eases governance.
Summary TLDR
This paper benchmarks fine-tuned BERT-family encoders against zero- and few-shot LLM prompting (GPT-4o, Claude Sonnet 4.5) on IMDB, SST-2, AG News, and DBPedia. It measures macro-F1, end-to-end latency (p50/p95/p99), TTFT, and per-request cost. Result: encoders match or exceed LLM accuracy on these fixed-label tasks while delivering one to two orders of magnitude lower inference cost and much tighter tail latency. DistilBERT often wins the utility ranking (best trade-off). The authors release code and configs to reproduce cost/latency estimates.
Problem Statement
Model choice for fixed-label text classification is usually driven by accuracy alone. In production, latency, tail behavior, recurring inference cost, and governance (reproducibility, versioning) matter just as much. The paper asks: when does LLM prompting justify its higher operational overhead compared to fine-tuned encoders?
Main Contribution
A reproducible benchmark comparing fine-tuned BERT-family encoders and zero/few-shot LLM prompting on four standard datasets, reporting macro-F1, latency percentiles, TTFT, and per-request cost.
A decision framework: Pareto-front analysis plus a parameterized utility function that ranks models under different latency tolerances.
Practical guidance and released artifacts (code, prompts, deployment configs) so teams can re-run the cost-latency-accuracy trade-offs for their own pricing and SLA assumptions.
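The Pareto-front step of the decision framework can be illustrated with a minimal sketch. It assumes each candidate is summarized by three numbers (macro-F1, a latency figure, and a cost figure); the `Candidate` type, the function names, and all numeric values are hypothetical placeholders, not results or code from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    f1: float          # higher is better
    latency_ms: float  # lower is better
    cost_usd: float    # lower is better (e.g. per 1M requests)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and strictly better on at least one."""
    no_worse = (a.f1 >= b.f1
                and a.latency_ms <= b.latency_ms
                and a.cost_usd <= b.cost_usd)
    strictly_better = (a.f1 > b.f1
                       or a.latency_ms < b.latency_ms
                       or a.cost_usd < b.cost_usd)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Placeholder numbers for illustration only; substitute your own measurements.
candidates = [
    Candidate("model_a", f1=0.93, latency_ms=40.0, cost_usd=5.0),
    Candidate("model_b", f1=0.94, latency_ms=60.0, cost_usd=8.0),
    Candidate("model_c", f1=0.92, latency_ms=900.0, cost_usd=400.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy input, `model_c` is dropped because `model_a` is better on all three axes, while `model_a` and `model_b` both survive because each wins on at least one axis.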
Key Findings
Fine-tuned encoders match or exceed LLM prompting on structured fixed-label tasks while running far cheaper and faster.
Tail latency and TTFT for LLM APIs are much higher and more variable than for encoders.
Few-shot prompting increases token usage, cost, and latency with small or no accuracy gains on these tasks.
A utility ranking that combines F1, p50 latency, and cost consistently favors compact encoders; DistilBERT ranks first across datasets and latency regimes.
Encoders provide stronger operational governance: versioned artifacts, logits for calibration, and on-prem deployment.
Results
IMDB macro-F1
IMDB estimated cost (USD / 1M req)
AG News macro-F1
AG News estimated cost (USD / 1M req)
DBPedia macro-F1
Latency p95
Who Should Care
What To Try In 7 Days
Run the paper's released code to measure your own p50/p95 latency and cost against your current pricing snapshot.
Fine-tune DistilBERT on your label set as a baseline and measure macro-F1 vs latency and cost.
Compute the provided utility function with your τ (latency tolerance) to pick a deployment candidate.
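As a rough illustration of the last step, the sketch below ranks candidates with a utility of the form F1 / cost multiplied by an exponential latency decay exp(-p50 / τ). This summary only describes the utility as "F1/cost with latency decay", so the exponential form, the function names, and the example numbers are assumptions; check the released code for the exact definition before relying on it.

```python
import math
from typing import Dict, List


def utility(f1: float, p50_ms: float, cost_usd: float, tau_ms: float) -> float:
    """Assumed form: accuracy per dollar, discounted as p50 latency grows past tau."""
    if cost_usd <= 0 or tau_ms <= 0:
        raise ValueError("cost_usd and tau_ms must be positive")
    return (f1 / cost_usd) * math.exp(-p50_ms / tau_ms)


def rank_models(models: List[Dict], tau_ms: float) -> List[Dict]:
    """Sort candidates from highest to lowest utility for a given latency tolerance."""
    return sorted(
        models,
        key=lambda m: utility(m["f1"], m["p50_ms"], m["cost_usd"], tau_ms),
        reverse=True,
    )


# Placeholder measurements; replace with your own F1, p50 latency, and cost.
models = [
    {"name": "large_api_model", "f1": 0.93, "p50_ms": 800.0, "cost_usd": 300.0},
    {"name": "compact_encoder", "f1": 0.92, "p50_ms": 30.0, "cost_usd": 5.0},
]
ranked = rank_models(models, tau_ms=200.0)
print([m["name"] for m in ranked])
```

A small τ punishes slow models heavily, while a large τ makes the ranking approach plain F1 per dollar, so it is worth re-running the ranking at several values of τ that bracket your SLA.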
Optimization Features
Token Efficiency
- few-shot prompting substantially increases input tokens and token cost
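To make the token-cost effect concrete, here is a small sketch that converts average input and output tokens into an estimated cost per 1M requests. The `TokenPricing` type, the per-token prices, and the token counts are hypothetical placeholders, not the paper's pricing snapshot or measured token usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    input_usd_per_1m_tokens: float
    output_usd_per_1m_tokens: float


def cost_per_1m_requests(avg_input_tokens: float,
                         avg_output_tokens: float,
                         pricing: TokenPricing) -> float:
    """Estimated USD for one million requests at the given average token counts."""
    per_request_usd = (
        avg_input_tokens * pricing.input_usd_per_1m_tokens
        + avg_output_tokens * pricing.output_usd_per_1m_tokens
    ) / 1_000_000
    return per_request_usd * 1_000_000


# Hypothetical prices and token counts, for illustration only.
pricing = TokenPricing(input_usd_per_1m_tokens=2.0, output_usd_per_1m_tokens=8.0)
zero_shot = cost_per_1m_requests(avg_input_tokens=200, avg_output_tokens=5, pricing=pricing)
few_shot = cost_per_1m_requests(avg_input_tokens=1200, avg_output_tokens=5, pricing=pricing)
print(f"zero-shot: {zero_shot:.2f} USD, few-shot: {few_shot:.2f} USD per 1M requests")
```

Because classification outputs are short, the input side dominates the bill in this sketch, which is why adding in-context examples scales cost almost linearly with prompt length.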
Infra Optimization
- measure end-to-end latency on realistic serverless infra (Cloud Run) rather than raw hardware timing
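A minimal sketch of end-to-end latency measurement follows. It times whole calls from the client side and reports p50/p95/p99 using the standard library; `fake_classifier_call` is a hypothetical stand-in for a real HTTP request to your deployed service, and the helper names are not from the paper's code.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies_ms(call: Callable[[], object], n_requests: int) -> List[float]:
    """Time each call end to end with a monotonic clock and return milliseconds."""
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return p50/p95/p99 from the 99 percentile cut points of the samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def fake_classifier_call() -> str:
    # Hypothetical stand-in; replace with a real request to your endpoint.
    return "positive"


samples = measure_latencies_ms(fake_classifier_call, n_requests=200)
report = latency_percentiles(samples)
print(report)
```

Tail percentiles need many samples to be meaningful: with only a few hundred requests, p99 rests on a handful of observations, so larger runs spread over time give a more honest picture.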
Model Optimization
- fine-tuning encoder weights for task specialization
- pick compact models (DistilBERT) to reduce latency/cost
System Optimization
- use utility + Pareto analysis to choose models under SLA constraints
Training Optimization
- select checkpoint by a gap-penalized generalization score to avoid overfitting
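One way to read "gap-penalized generalization score" is validation F1 minus a penalty proportional to the train/validation gap. The summary does not give the formula, so the form below, the weight `lam`, and the per-epoch numbers are all assumptions made for illustration.

```python
from typing import Dict, List


def gap_penalized_score(train_f1: float, val_f1: float, lam: float = 0.5) -> float:
    """Assumed form: validation F1 minus lam times the positive train/val gap."""
    gap = max(0.0, train_f1 - val_f1)
    return val_f1 - lam * gap


def select_checkpoint(checkpoints: List[Dict], lam: float = 0.5) -> Dict:
    """Pick the checkpoint with the highest gap-penalized score."""
    return max(
        checkpoints,
        key=lambda c: gap_penalized_score(c["train_f1"], c["val_f1"], lam),
    )


# Placeholder training history; replace with your own per-epoch metrics.
history = [
    {"epoch": 1, "train_f1": 0.90, "val_f1": 0.890},
    {"epoch": 2, "train_f1": 0.95, "val_f1": 0.910},
    {"epoch": 3, "train_f1": 0.99, "val_f1": 0.915},
]
best = select_checkpoint(history)
print(best["epoch"])
```

In this toy history, epoch 3 has the highest raw validation F1 but also the widest gap, so the penalized score prefers epoch 2.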
Inference Optimization
- deploy encoders as stateless services to minimize end-to-end latency
- avoid few-shot prompting when throughput and cost matter
Reproducibility
Code Urls
Data Urls
- IMDB
- SST-2
- AG News
- DBPedia
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Benchmarks limited to four English fixed-label datasets; results may differ for open-ended, high-ambiguity tasks.
- Latency and cost estimates are tied to a specific Cloud Run configuration and a January 22, 2026 pricing snapshot.
- LLM experiments use deterministic decoding (T=0) and fixed prompt templates; advanced prompting methods were not evaluated.
- Provider-side variability and network conditions remain uncontrolled for API LLM runs.
When Not To Use
- When the task requires open-ended generation, schema discovery, or evolving taxonomies where LLM reasoning adds unique value.
- Low-volume prototypes where per-request cost is negligible and developer convenience matters more than operating cost.
- Scenarios demanding very long context understanding that encoder fine-tuning cannot capture without architectural changes.
Failure Modes
- Adopting few-shot LLM prompting at scale causes high token bills and unpredictable tail latencies.
- Relying on API LLMs creates vendor lock-in and exposes you to silent behavior changes from provider updates.
- Selecting models by raw F1 without cost/latency can produce choices that violate SLAs under real traffic.
Core Entities
Models
- BERT
- RoBERTa
- DistilBERT
- GPT-4o
- Claude Sonnet 4.5
Metrics
- macro F1
- precision
- recall
- accuracy
- inference latency p50/p95/p99
- time-to-first-token (TTFT)
- avg input/output tokens
- estimated cost per 1M requests
- utility score (F1/cost with latency decay)
- Pareto dominance in (F1, latency, cost)
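For reference, macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. The sketch below computes it in plain Python under the assumption that a class with no true or predicted instances scores 0; the function name and toy labels are illustrative only.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over every label seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred), key=repr)
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here class 0 has F1 = 2/3 and class 1 has F1 = 0.8, so the macro average is their simple mean regardless of class frequency.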
Datasets
- IMDB
- SST-2
- AG News
- DBPedia
Benchmarks
- fixed-label text classification benchmark (this work)

