For fixed-label text classification, fine-tuned encoders give near-equal accuracy with 10–100× lower cost and much lower tail latency than L

February 6, 2026 · 9 min

Overview

Production Readiness

0.88

Novelty Score

0.3

Cost Impact Score

0.92

Citation Count

0

Authors

Alberto Andres Valdes Gonzalez

Why It Matters For Business

Choosing LLM prompting for routine fixed-label classification often increases operating cost by 10–100× and raises tail latency risk; using fine-tuned encoders saves money, stabilizes SLAs, and eases governance.

Summary TLDR

This paper benchmarks fine-tuned BERT-family encoders against zero- and few-shot LLM prompting (GPT-4o, Claude Sonnet 4.5) on IMDB, SST-2, AG News, and DBPedia. It measures macro-F1, end-to-end latency (p50/p95/p99), time-to-first-token (TTFT), and per-request cost. Result: encoders match or exceed LLM accuracy on these fixed-label tasks while delivering one to two orders of magnitude lower inference cost and much tighter tail latency. DistilBERT often wins the utility ranking (best trade-off). The authors release code and configs to reproduce cost/latency estimates.

Problem Statement

Model choice for fixed-label text classification is usually driven by accuracy alone. In production, latency, tail behavior, recurring inference cost, and governance (reproducibility, versioning) matter just as much. The paper asks: when does LLM prompting justify its higher operational overhead compared to fine-tuned encoders?

Main Contribution

A reproducible benchmark comparing fine-tuned BERT-family encoders and zero/few-shot LLM prompting on four standard datasets, reporting macro-F1, latency percentiles, TTFT, and per-request cost.

A decision framework: Pareto-front analysis plus a parameterized utility function that ranks models under different latency tolerances.

Practical guidance and released artifacts (code, prompts, deployment configs) so teams can re-run the cost-latency-accuracy trade-offs for their own pricing and SLA assumptions.

Key Findings

Fine-tuned encoders match or exceed LLM prompting on structured fixed-label tasks while running far cheaper and faster.

Numbers: For example, DistilBERT costs $5.73–$12.44 vs GPT-4o/Claude $276–$2,701 per 1M requests (datasets: AG News, IMDB, DBPedia).

Tail latency and TTFT for LLM APIs are much higher and more variable than for encoders.

Numbers: LLM p95 latencies often ≥1.7–2.0 s; encoder p50 latencies typically 98–622 ms with tighter tails.

Few-shot prompting increases token usage, cost, and latency with small or no accuracy gains on these tasks.

Numbers: Few-shot roughly doubled input tokens (e.g., GPT-4o IMDB 333→611) and increased cost from $842.78 to $1,537.78 per 1M req, F…

Utility ranking that combines F1, p50 latency, and cost consistently favors compact encoders; DistilBERT ranks first across datasets and latency regimes.

Numbers: Utility tables (100×U) show DistilBERT top-ranked for τ = 250/500/1000 ms across all datasets (Tables 6–9).

Encoders provide stronger operational governance: versioned artifacts, logits for calibration, and on-prem deployment.

Numbers: The paper documents artifact-based deployment and notes that LLM APIs lack comparable probabilistic transparency.

Results

IMDB macro-F1

Value: RoBERTa 94.84% ±0.12 vs Claude Sonnet 4.5 few-shot 96.48% ±0.01

Baseline: RoBERTa encoder

IMDB estimated cost (USD / 1M req)

Value: DistilBERT $12.44 vs GPT-4o zero-shot $842.78

Baseline: DistilBERT encoder

AG News macro-F1

Value: RoBERTa 94.63% ±0.14 vs Claude Sonnet 4.5 zero-shot 91.35% ±0.11

Baseline: RoBERTa encoder

AG News estimated cost (USD / 1M req)

Value: DistilBERT $5.73 vs GPT-4o zero-shot $276.00

Baseline: DistilBERT encoder

DBPedia macro-F1

Value: BERT 99.40% ±0.04 vs Claude Sonnet 4.5 zero-shot 98.83% ±0.04

Baseline: BERT encoder

Latency p95

Value: Encoder p95 typically 123–1868 ms; LLM p95 often ≥519–2568 ms

Baseline: Encoder p95

What To Try In 7 Days

Run this paper's repo to measure your p50/p95 latency and cost using your pricing snapshot.

Fine-tune DistilBERT on your label set as a baseline and measure macro-F1 vs latency and cost.

Compute the provided utility function with your τ (latency tolerance) to pick a deployment candidate.
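
The summary describes the utility score only as "F1/cost with latency decay" and does not give the formula. The Python sketch below therefore assumes one plausible form, U = (F1 / cost) · exp(−p50 / τ), purely for illustration; the released code may define it differently. The `Candidate` class, the function names, and the F1 and latency values are hypothetical placeholders. Only the two cost figures ($12.44 and $842.78 per 1M requests) come from the results reported here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    macro_f1: float          # fraction in [0, 1]
    p50_latency_ms: float    # median end-to-end latency
    cost_per_1m_usd: float   # estimated cost per 1M requests


def utility(c: Candidate, tau_ms: float) -> float:
    """ASSUMED form: F1 per dollar, decayed exponentially by p50 latency."""
    return (c.macro_f1 / c.cost_per_1m_usd) * math.exp(-c.p50_latency_ms / tau_ms)


def rank(candidates, tau_ms):
    """Sort candidates from highest to lowest utility for a given tolerance."""
    return sorted(candidates, key=lambda c: utility(c, tau_ms), reverse=True)


# F1 and latency values are placeholders, not measurements from the paper.
candidates = [
    Candidate("encoder-small", 0.93, 120.0, 12.44),
    Candidate("llm-zero-shot", 0.95, 900.0, 842.78),
]

for tau in (250, 500, 1000):
    ranked = rank(candidates, tau)
    print(tau, [(c.name, round(100 * utility(c, tau), 4)) for c in ranked])
```

Replacing the placeholder values with your own measured F1, p50 latency, and cost, and sweeping τ over the latency tolerances you care about, shows how sensitive the ranking is to the SLA assumption.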

Optimization Features

Token Efficiency

  • few-shot prompting substantially increases input tokens and token cost
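
The link between token counts and cost is simple arithmetic, sketched below in Python. The per-token prices ($2.50 per 1M input tokens, $10.00 per 1M output tokens) and the single output token per request are assumptions made for illustration, not values taken from the paper's pricing snapshot. The input-token counts (333 zero-shot, 611 few-shot) are the GPT-4o IMDB figures reported under Key Findings. The function name is hypothetical.

```python
def cost_per_requests(avg_input_tokens, avg_output_tokens,
                      usd_per_1m_input, usd_per_1m_output,
                      requests=1_000_000):
    """Estimated spend in USD for `requests` calls at average token counts."""
    per_request = (avg_input_tokens * usd_per_1m_input
                   + avg_output_tokens * usd_per_1m_output) / 1_000_000
    return per_request * requests


# ASSUMED prices and output length; substitute your provider's current rates.
INPUT_PRICE = 2.50    # USD per 1M input tokens (assumption)
OUTPUT_PRICE = 10.00  # USD per 1M output tokens (assumption)

zero_shot = cost_per_requests(333, 1, INPUT_PRICE, OUTPUT_PRICE)
few_shot = cost_per_requests(611, 1, INPUT_PRICE, OUTPUT_PRICE)

print(f"zero-shot: ${zero_shot:,.2f} per 1M requests")
print(f"few-shot:  ${few_shot:,.2f} per 1M requests")
print(f"ratio:     {few_shot / zero_shot:.2f}x")
```

Under these assumed prices the estimates come out at about $842.50 and $1,537.50 per 1M requests, close to the $842.78 and $1,537.78 reported for GPT-4o on IMDB. Input tokens dominate the bill for classification, because the output is only a short label.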

Infra Optimization

  • measure end-to-end latency on realistic serverless infra (Cloud Run) rather than raw hardware timing
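
A minimal Python sketch of end-to-end latency measurement follows. It times a callable with a wall clock and reports p50/p95/p99 using the nearest-rank percentile method. The summary does not say which percentile method the paper uses, so nearest-rank is an assumption, and `time_calls`, `percentile`, and `latency_summary` are hypothetical helper names. In practice the callable would wrap a real HTTP request to the deployed service.

```python
import math
import time


def time_calls(fn, n):
    """Call fn() n times and return wall-clock durations in milliseconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Return the three percentiles tracked in this benchmark."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


# Stand-in workload; replace the lambda with a function that sends one request.
samples = time_calls(lambda: sum(range(10_000)), 200)
print(latency_summary(samples))
```

Timing the full request from the client side captures network, cold-start, and queueing effects that raw model timing on the hardware would miss.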

Model Optimization

  • fine-tuning encoder weights for task specialization
  • pick compact models (DistilBERT) to reduce latency/cost

System Optimization

  • use utility + Pareto analysis to choose models under SLA constraints
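
Pareto analysis over (F1, latency, cost) can be sketched in a few lines of Python. A model is dominated if another model is at least as good on all three axes and strictly better on at least one. The `Point` type, the model names, and every number below are hypothetical placeholders rather than results from the paper.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    macro_f1: float          # higher is better
    p95_latency_ms: float    # lower is better
    cost_per_1m_usd: float   # lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (a.macro_f1 >= b.macro_f1
                and a.p95_latency_ms <= b.p95_latency_ms
                and a.cost_per_1m_usd <= b.cost_per_1m_usd)
    strictly_better = (a.macro_f1 > b.macro_f1
                       or a.p95_latency_ms < b.p95_latency_ms
                       or a.cost_per_1m_usd < b.cost_per_1m_usd)
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def within_sla(points, max_p95_ms):
    """Filter points to those meeting a p95 latency budget."""
    return [p for p in points if p.p95_latency_ms <= max_p95_ms]


# Placeholder values for illustration only.
models = [
    Point("model-a", 0.93, 150.0, 12.0),
    Point("model-b", 0.95, 300.0, 25.0),
    Point("model-c", 0.96, 2000.0, 850.0),
    Point("model-d", 0.92, 1800.0, 900.0),
]

front = pareto_front(models)
print([p.name for p in front])
print([p.name for p in within_sla(front, 500.0)])
```

Here `model-d` drops out because `model-b` beats it on all three axes, and applying a 500 ms p95 budget to the front then removes `model-c`. A utility score can be used afterwards to pick one model among those that remain.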

Training Optimization

  • select checkpoint by a gap-penalized generalization score to avoid overfitting
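
The summary does not define the gap-penalized generalization score. One way such a score could work, sketched below in Python, is to subtract a multiple of the train/validation gap from the validation F1, so that a checkpoint with a slightly higher validation score but a large gap is not preferred. The formula, the `penalty` weight of 0.5, the `Checkpoint` class, and the training history are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    train_f1: float
    val_f1: float


def gap_penalized_score(ckpt: Checkpoint, penalty: float = 0.5) -> float:
    """ASSUMED form: validation F1 minus penalty times the positive train/val gap."""
    gap = max(0.0, ckpt.train_f1 - ckpt.val_f1)
    return ckpt.val_f1 - penalty * gap


def select_checkpoint(checkpoints, penalty: float = 0.5) -> Checkpoint:
    """Return the checkpoint with the highest gap-penalized score."""
    return max(checkpoints, key=lambda c: gap_penalized_score(c, penalty))


# Hypothetical training history.
history = [
    Checkpoint(epoch=1, train_f1=0.900, val_f1=0.895),
    Checkpoint(epoch=2, train_f1=0.940, val_f1=0.925),
    Checkpoint(epoch=3, train_f1=0.990, val_f1=0.928),
]

best = select_checkpoint(history)
print(best.epoch, round(gap_penalized_score(best), 4))
```

With this history, selecting on validation F1 alone would pick epoch 3, while the penalized score picks epoch 2, where the train/validation gap is much smaller.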

Inference Optimization

  • deploy encoders as stateless services to minimize end-to-end latency
  • avoid few-shot prompting when throughput and cost matter

Reproducibility

Data Urls

  • IMDB
  • SST-2
  • AG News
  • DBPedia

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Benchmarks limited to four English fixed-label datasets; results may differ for open-ended, high-ambiguity tasks.
  • Latency and cost estimates tied to specific Cloud Run config and January 22, 2026 pricing snapshot.
  • LLM experiments use deterministic decoding (T=0) and fixed prompt templates; advanced prompting methods were not evaluated.
  • Provider-side variability and network conditions remain uncontrolled for API LLM runs.

When Not To Use

  • When the task requires open-ended generation, schema discovery, or evolving taxonomies where LLM reasoning adds unique value.
  • Low-volume prototypes where per-request cost is negligible and developer convenience matters more than operating cost.
  • Scenarios demanding very long context understanding that encoder fine-tuning cannot capture without architectural changes.

Failure Modes

  • Adopting few-shot LLM prompting at scale causes high token bills and unpredictable tail latencies.
  • Relying on API LLMs creates vendor lock-in and exposes you to silent behavior changes from provider updates.
  • Selecting models by raw F1 without cost/latency can produce choices that violate SLAs under real traffic.

Core Entities

Models

  • BERT
  • RoBERTa
  • DistilBERT
  • GPT-4o
  • Claude Sonnet 4.5

Metrics

  • macro-F1
  • precision
  • recall
  • accuracy
  • inference latency p50/p95/p99
  • time-to-first-token (TTFT)
  • avg input/output tokens
  • estimated cost per 1M requests
  • utility score (F1/cost with latency decay)
  • Pareto dominance in (F1, latency, cost)
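
For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. The dependency-free Python sketch below uses the identity F1 = 2·TP / (2·TP + FP + FN). The function name and the toy labels are illustrative, and classes with no true or predicted instances are scored as 0 here, which is a convention chosen for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: one positive review is misclassified as negative.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(round(macro_f1(y_true, y_pred), 4))
```

In the toy example the per-class scores are 2/3 for `pos` and 4/5 for `neg`, giving a macro-F1 of about 0.7333.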

Datasets

  • IMDB
  • SST-2
  • AG News
  • DBPedia

Benchmarks

  • fixed-label text classification benchmark (this work)