Make decentralized LLM markets pay for quality per cost, not just raw accuracy

Overview

Decision Snapshot: Needs Validation

The approach is straightforward and practical: reward = α·quality - β·cost. Experiments on 5 models, 3 evaluators, and 5k simulation rounds give moderate evidence. Results depend on measured latency on one GPU and honest participants, so readiness is good for prototype marketplaces but needs more stress testing for unv

Citations: 0

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 45%

Authors

Arther Tian, Alex Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang

Why It Matters For Business

If you run or buy decentralized LLM inference, paying for raw accuracy alone wastes money. Cost-aware PoQ rewards quality per cost, pushing demand toward nodes that give the best quality per latency/compute. That reduces marketplace waste and lets small evaluators compete with big models.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

The paper extends Proof of Quality (PoQ), a verification-by-voting approach for trustless LLM inference, to explicitly include computational cost in rewards. It combines token-level F1, lightweight learned evaluators, and GPT judgments, and uses a linear reward R = α·quality - β·cost. Experiments use 5 inference models (1.1B–3.8B), 3 evaluators, 400 sampled prompts from SQuAD and CNN/DailyMail, and 5,000 Monte Carlo PoQ rounds. Results: a semantic STS bi-encoder correlates best with ground truth and GPT; cost-aware rewards favor high-quality, low-latency models and efficient evaluators.
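The token-level F1 mentioned above can be illustrated with a minimal sketch. It assumes lowercasing and plain whitespace tokenization, which is simpler than the official SQuAD normalization (that script also strips punctuation and articles), and it scales the score to 0-10 to match the metric range listed for this paper. The function name `token_f1` and the `scale` parameter are illustrative, not taken from the paper.

```python
from collections import Counter


def token_f1(prediction: str, reference: str, scale: float = 10.0) -> float:
    """Token-overlap F1 between a prediction and a reference, scaled to [0, scale].

    Assumption: lowercase + whitespace split stands in for a real normalizer.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score full marks only when both are empty.
    if not pred_tokens or not ref_tokens:
        return scale if pred_tokens == ref_tokens else 0.0

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return scale * 2 * precision * recall / (precision + recall)
```

A prediction that covers two of three reference tokens with no extra tokens has precision 1 and recall 2/3, so it scores about 8 on this 0-10 scale.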

Problem Statement

Existing PoQ rewards only raw output quality. In decentralized networks, nodes differ widely in latency and energy cost. Without cost-awareness, incentives can favor expensive models and waste resources. The paper asks: can PoQ reward quality-to-cost efficiency so that decentralized inference becomes economically sustainable?

Main Contribution

Cost-aware PoQ framework that adds explicit node costs into rewards via a linear trade-off R = α·quality - β·cost for both inference and evaluator nodes.

Empirical comparison of three lightweight evaluator architectures (CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa) and their correlation with token-level F1 and GPT judgments.
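The linear trade-off R = α·quality - β·cost from the first contribution can be sketched in a few lines. This is an illustration only: it assumes quality and cost have already been normalized to a comparable range such as [0, 1], and the default α and β values and the two example nodes are hypothetical, not values reported in the paper.

```python
def poq_reward(quality: float, cost: float,
               alpha: float = 1.0, beta: float = 0.5) -> float:
    """Linear cost-aware reward R = alpha * quality - beta * cost.

    Assumption: quality and cost are pre-normalized to comparable scales.
    """
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta are assumed to be non-negative")
    return alpha * quality - beta * cost


# Hypothetical nodes: one slightly less accurate but cheap, one accurate but costly.
fast_node = poq_reward(quality=0.8, cost=0.2)   # 0.8 - 0.5 * 0.2
slow_node = poq_reward(quality=0.9, cost=0.9)   # 0.9 - 0.5 * 0.9

if __name__ == "__main__":
    print(f"fast node reward: {fast_node:.2f}")
    print(f"slow node reward: {slow_node:.2f}")
```

With β = 0 the reward reduces to quality alone, so the costly node would rank first; any sufficiently large β flips the order in favor of the cheaper node.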

Key Findings

A bi-encoder trained on semantic textual similarity (STS-DistilRoBERTa) aligns best with ground truth and GPT judgments.

Numbers: Pearson r ≈ 0.66 vs F1; ≈ 0.29 vs GPT

Practical Use: Prefer STS-style bi-encoders as primary PoQ evaluators: they are both informative and cheap to run.

Evidence Ref: Figure 4, Sec. 5.1

Llama-3.2-3B and Gemma-2-2B, neither of them the smallest model tested, delivered the best quality and some of the lowest latency in this setup.

Numbers: Llama-3.2-3B & Gemma-2-2B: avg F1 ≈ 5.3/10; GPT ≈ 9.0 & 8.7; latency ≈ 1.1s

Practical Use: Do not assume smaller models are always cheaper or more efficient; measure latency and quality on your hardware before deciding what to run.

Evidence Ref: Figure 5, Sec. 5.2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Evaluator correlation with ground truth F1	STS-DistilRoBERTa r ≈ 0.66; CE-DeBERTa r ≈ -0.04; CE-MiniLM r ≈ -0.24	—	—	avg over SQuAD & CNN/DailyMail sampled sets	Figure 4 and Sec. 5.1	Figure 4
Evaluator correlation with GPT judge	STS-DistilRoBERTa r ≈ 0.29; CE-DeBERTa r ≈ 0.03; CE-MiniLM r ≈ -0.17	—	—	avg over judged subset (up to 30 per model/task)	Figure 4 and Sec. 5.1	Figure 4
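
The correlations in the table are Pearson coefficients between evaluator scores and a reference signal (F1 or the GPT judge). A minimal, dependency-free sketch of that computation follows; the function name `pearson_r` is illustrative, and how the paper aggregates scores across datasets before correlating is not shown here.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of centered products and squares (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In practice `xs` would hold one evaluator's scores per sample and `ys` the matching F1 or judge scores for the same samples.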

What To Try In 7 Days

Profile your candidate models on your target hardware (latency, throughput, memory) and compute normalized cost scores.

Run a small PoQ simulation (few hundred prompts, K ≤ 3 evaluators) using R = α·quality - β·cost to see how rewards shift as you vary β.

Replace or add a semantic STS bi-encoder evaluator and measure correlation with your ground-truth metric or human judge.
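
The first two steps above can be prototyped with a small sketch: turn measured latencies into normalized cost scores, then re-rank nodes as β varies. Everything here is an assumption for illustration: min-max normalization is one possible choice rather than the paper's stated method, and the node names, quality scores, and latencies are made up.

```python
from __future__ import annotations


def normalize_costs(latencies_ms: dict[str, float]) -> dict[str, float]:
    """Min-max normalize latencies so the cheapest node gets 0 and the priciest 1."""
    lo, hi = min(latencies_ms.values()), max(latencies_ms.values())
    span = hi - lo
    if span == 0:
        return {name: 0.0 for name in latencies_ms}
    return {name: (value - lo) / span for name, value in latencies_ms.items()}


def rank_by_reward(quality: dict[str, float], cost: dict[str, float],
                   alpha: float, beta: float) -> list[str]:
    """Sort node names by R = alpha * quality - beta * cost, best first."""
    rewards = {name: alpha * quality[name] - beta * cost[name] for name in quality}
    return sorted(rewards, key=rewards.get, reverse=True)


# Hypothetical profile: quality in [0, 1], latency in milliseconds.
QUALITY = {"node_a": 0.9, "node_b": 0.7, "node_c": 0.5}
LATENCY_MS = {"node_a": 3000.0, "node_b": 1200.0, "node_c": 800.0}
COST = normalize_costs(LATENCY_MS)

if __name__ == "__main__":
    for beta in (0.0, 0.25, 0.5, 1.0):
        print(beta, rank_by_reward(QUALITY, COST, alpha=1.0, beta=beta))
```

At β = 0 the ranking follows quality alone; as β grows, the slow high-quality node drops behind the cheaper ones, which is the shift this experiment is meant to surface.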

Optimization Features

Infra Optimization

GPU profiling (RTX 4090), Batch-size tradeoffs for evaluators

System Optimization

Quality-to-cost reward balancing, Evaluator batching for cost reduction

Inference Optimization

Distributed Inference, Latency Optimization, Throughput profiling

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

SQuAD v1.1, CNN/DailyMail

Risks & Boundaries

Limitations

Evaluation limited to English QA and summarization with short contexts (400 sampled prompts). Results may change for long documents, multi-turn dialogue, or other languages.

All efficiency profiles run on a single high-end GPU (RTX 4090). Heterogeneous real-world hardware and energy pricing are not modeled.

When Not To Use

High-stakes settings where cryptographic proof of exact computation is required (PoQ judges output quality, not execution correctness).

Multilingual or long-context tasks not covered by the evaluated datasets.

Failure Modes

Evaluator bias or low correlation: poor evaluators can distort rewards and promote wrong models.

Collusion between inference and evaluator nodes to inflate scores (not simulated).

Core Entities

Models

TinyLlama-1.1B, Qwen2-1.5B, Gemma-2-2B, Phi-3-mini-4k (3.8B), Llama-3.2-3B, CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa, gpt-4o-mini (judge)

Metrics

token-level F1 (scaled 0-10), GPT judge score (0-10), Pearson correlation, latency (ms), throughput (samples/sec), GPU memory (MB), normalized PoQ reward

Datasets

SQuAD v1.1 (dev), CNN/DailyMail (test)

