Overview
The approach is straightforward and practical: reward = α·quality - β·cost. Experiments on 5 models, 3 evaluators, and 5k simulation rounds give moderate evidence. Results depend on measured latency on one GPU and honest participants, so readiness is good for prototype marketplaces but needs more stress testing for unv
Citations: 0
Evidence Strength: 0.70
Confidence: 0.82
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/5
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 80%
Production readiness: 60%
Novelty: 45%
Why It Matters For Business
If you run or buy decentralized LLM inference, paying for raw accuracy alone wastes money. Cost-aware PoQ rewards quality per cost, pushing demand toward nodes that give the best quality per latency/compute. That reduces marketplace waste and lets small evaluators compete with big models.
Summary TLDR
The paper extends Proof of Quality (PoQ), a verification-by-voting approach for trustless LLM inference, to explicitly include computational cost in rewards. It combines token-level F1, lightweight learned evaluators, and GPT judgments, and uses a linear reward R = α·quality - β·cost. Experiments use 5 inference models (1.1B–3.8B parameters), 3 evaluators, 400 sampled prompts from SQuAD and CNN/DailyMail, and 5,000 Monte Carlo PoQ rounds. Results: a semantic STS bi-encoder correlates best with ground truth and GPT judgments; cost-aware rewards favor high-quality, low-latency models and efficient evaluators.
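As an illustration of the token-level F1 signal mentioned above, here is a minimal sketch. It assumes lowercased whitespace tokenization; the paper's exact normalization is not given in this summary, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference answer.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, scoring "the cat sat" against the reference "the cat" gives precision 2/3 and recall 1, so F1 is 0.8.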
Problem Statement
Existing PoQ rewards only raw output quality. In decentralized networks, nodes have very different latency and energy costs. Without cost-awareness, incentives can favor expensive models and waste resources. The paper asks: can PoQ reward quality-to-cost efficiency so that decentralized inference becomes economically sustainable?
Main Contribution
Cost-aware PoQ framework that adds explicit node costs into rewards via a linear trade-off R = α·quality - β·cost for both inference and evaluator nodes.
Empirical comparison of three lightweight evaluator architectures (CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa) and their correlation with token-level F1 and GPT judgments.
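The linear trade-off in the first contribution can be sketched as follows. This is only an illustration: the default values of `alpha` and `beta`, the node names, and the quality and cost numbers are all hypothetical, and cost is assumed to be already normalized to [0, 1].

```python
def cost_aware_reward(quality: float, cost: float,
                      alpha: float = 1.0, beta: float = 0.5) -> float:
    """Linear trade-off R = alpha * quality - beta * cost."""
    return alpha * quality - beta * cost


# Hypothetical nodes: (quality in [0, 1], normalized cost in [0, 1]).
nodes = {
    "node_fast": (0.72, 0.20),
    "node_slow": (0.75, 0.90),
}

# Score every node, then rank from highest to lowest reward.
rewards = {name: cost_aware_reward(q, c) for name, (q, c) in nodes.items()}
ranked = sorted(rewards, key=rewards.get, reverse=True)
```

With these made-up numbers the slightly less accurate but much cheaper node ranks first, which is the behavior the cost term is meant to produce.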
Key Findings
A bi-encoder trained on semantic textual similarity (STS-DistilRoBERTa) aligns best with ground truth and GPT judgments.
The two largest inference models delivered the best quality and some of the lowest latency in this setup.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Evaluator correlation with ground truth F1 | STS-DistilRoBERTa r ≈ 0.66; CE-DeBERTa r ≈ -0.04; CE-MiniLM r ≈ -0.24 | — | — | avg over SQuAD & CNN/DailyMail sampled sets | Figure 4 and Sec. 5.1 | Figure 4 |
| Evaluator correlation with GPT judge | STS-DistilRoBERTa r ≈ 0.29; CE-DeBERTa r ≈ 0.03; CE-MiniLM r ≈ -0.17 | — | — | avg over judged subset (up to 30 per model/task) | Figure 4 and Sec. 5.1 | Figure 4 |
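To reproduce this kind of comparison on your own data, a correlation between evaluator scores and a reference metric can be computed as sketched below. This assumes that r denotes the Pearson correlation coefficient, which the summary does not state explicitly; the function name `pearson_r` is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold one evaluator's scores and `ys` the token-level F1 or judge scores for the same outputs, in the same order.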
What To Try In 7 Days
Profile your candidate models on your target hardware (latency, throughput, memory) and compute normalized cost scores.
Run a small PoQ simulation (few hundred prompts, K ≤ 3 evaluators) using R = α·quality - β·cost to see how rewards shift as you vary β.
Replace or add a semantic STS bi-encoder evaluator and measure correlation with your ground-truth metric or human judge.
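The first two steps above can be sketched as a small β sweep. This is a rough sketch under stated assumptions: cost is taken to be min-max normalized latency (the paper's normalization is not described in this summary), and the model names, quality scores, and latencies are hypothetical placeholders for your own measurements.

```python
from typing import Dict, Iterable


def min_max_normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale raw costs to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {name: 0.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


def best_model_per_beta(quality: Dict[str, float],
                        latency_ms: Dict[str, float],
                        betas: Iterable[float],
                        alpha: float = 1.0) -> Dict[float, str]:
    """For each beta, return the model with the highest R = alpha*q - beta*c."""
    cost = min_max_normalize(latency_ms)
    winners = {}
    for beta in betas:
        rewards = {m: alpha * quality[m] - beta * cost[m] for m in quality}
        winners[beta] = max(rewards, key=rewards.get)
    return winners


# Hypothetical profile: replace with measurements from your own hardware.
quality = {"model_a": 0.70, "model_b": 0.60}
latency_ms = {"model_a": 900.0, "model_b": 300.0}
winners = best_model_per_beta(quality, latency_ms, betas=[0.0, 0.05, 0.2])
```

In this toy profile the slower, more accurate model wins while β is small, and the faster model takes over once β is large enough to offset the quality gap. The β value at which the winner flips is a useful quantity to report for your own candidates.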
Risks & Boundaries
Limitations
Evaluation limited to English QA and summarization with short contexts (400 sampled prompts). Results may change for long documents, multi-turn dialogue, or other languages.
All efficiency profiles run on a single high-end GPU (RTX 4090). Heterogeneous real-world hardware and energy pricing are not modeled.
When Not To Use
High-stakes settings where cryptographic proof of exact computation is required (PoQ judges output quality, not execution correctness).
Multilingual or long-context tasks not covered by the evaluated datasets.
Failure Modes
Evaluator bias or low correlation: poor evaluators can distort rewards and promote wrong models.
Collusion between inference and evaluator nodes to inflate scores (not simulated).

