Make decentralized LLM markets pay for quality per cost, not just raw accuracy

December 18, 2025 · 9 min

Overview

Decision Snapshot: Needs Validation

The approach is straightforward and practical: reward = α·quality - β·cost. Experiments on 5 models, 3 evaluators, and 5k simulation rounds give moderate evidence. Results depend on measured latency on one GPU and honest participants, so readiness is good for prototype marketplaces but needs more stress testing for unv

Citations: 0

Evidence Strength: 0.70

Confidence: 0.82

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 45%

Authors

Arther Tian, Alex Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang

Why It Matters For Business

If you run or buy decentralized LLM inference, paying for raw accuracy alone wastes money. Cost-aware PoQ rewards quality per cost, pushing demand toward nodes that give the best quality per latency/compute. That reduces marketplace waste and lets small evaluators compete with big models.

Summary TLDR

The paper extends Proof of Quality (PoQ), a verification-by-voting approach for trustless LLM inference, to explicitly include computational cost in rewards. It combines token-level F1, lightweight learned evaluators, and GPT judgments, and uses a linear reward R = α·quality - β·cost. Experiments use 5 inference models (1.1B–3.8B), 3 evaluators, 400 sampled prompts from SQuAD and CNN/DailyMail, and 5,000 Monte Carlo PoQ rounds. Results: a semantic STS bi-encoder correlates best with ground truth and GPT judgments; cost-aware rewards favor high-quality, low-latency models and efficient evaluators.
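The token-level F1 mentioned above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: it assumes lowercasing and whitespace tokenization (the official SQuAD script also strips punctuation and articles), and the function names `token_f1` and `scaled_f1` are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference, in [0, 1].

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def scaled_f1(prediction: str, reference: str) -> float:
    """Rescale F1 to the 0-10 range that this summary uses for quality scores."""
    return 10.0 * token_f1(prediction, reference)
```

For example, scoring the prediction "the cat" against the reference "the cat sat" gives precision 1 and recall 2/3, so the scaled score is about 8 out of 10.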

Problem Statement

Existing PoQ rewards only raw output quality. In decentralized networks, nodes have very different latency and energy costs. Without cost-awareness, incentives can favor expensive models and waste resources. The paper asks: can PoQ reward quality-to-cost efficiency so that decentralized inference becomes economically sustainable?

Main Contribution

Cost-aware PoQ framework that adds explicit node costs into rewards via a linear trade-off R = α·quality - β·cost for both inference and evaluator nodes.

Empirical comparison of three lightweight evaluator architectures (CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa) and their correlation with token-level F1 and GPT judgments.
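The linear trade-off R = α·quality - β·cost can be written out in a few lines. The sketch below assumes that quality and cost have already been normalized to [0, 1]; the `NodeProfile` type, the `poq_reward` name, the default weights, and the example numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class NodeProfile:
    """Hypothetical record for an inference or evaluator node."""
    name: str
    quality: float  # assumed normalized to [0, 1]
    cost: float     # assumed normalized to [0, 1], e.g. from latency


def poq_reward(node: NodeProfile, alpha: float = 1.0, beta: float = 0.5) -> float:
    """Linear cost-aware reward: alpha * quality - beta * cost."""
    return alpha * node.quality - beta * node.cost


# Illustrative values only: two nodes with similar quality, different cost.
cheap = NodeProfile("cheap_node", quality=0.80, cost=0.20)
pricey = NodeProfile("pricey_node", quality=0.85, cost=0.90)

for node in (cheap, pricey):
    print(node.name, round(poq_reward(node), 3))
```

With β = 0 the reward reduces to quality alone, so the slightly better but more expensive node earns more; with the default β = 0.5 the cheaper node comes out ahead.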

Key Findings

A bi-encoder trained on semantic textual similarity (STS-DistilRoBERTa) aligns best with ground truth and GPT judgments.

Numbers: Pearson r ≈ 0.66 vs F1; ≈ 0.29 vs GPT

Practical Use: Prefer STS-style bi-encoders as primary PoQ evaluators: they are both informative and cheap to run.

Evidence Ref: Figure 4, Sec. 5.1
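The correlations reported for this finding are Pearson coefficients between evaluator scores and a reference score (F1 or the GPT judge). A minimal pure-Python sketch of that computation follows; the function name `pearson_r` and the sample scores are hypothetical, and the paper's own tooling is not specified here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five outputs, both on a 0-10 scale.
evaluator_scores = [7.1, 3.4, 8.0, 5.2, 6.3]
f1_scores = [6.0, 2.5, 9.0, 4.0, 5.5]
print(round(pearson_r(evaluator_scores, f1_scores), 3))
```

Running the same function once against F1 and once against judge scores gives the two numbers needed to compare candidate evaluators on your own data.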

Two of the larger inference models delivered the best quality and some of the lowest latency in this setup.

Numbers: Llama-3.2-3B & Gemma-2-2B: avg F1 ≈ 5.3/10; GPT ≈ 9.0 & 8.7; latency ≈ 1.1 s

Practical Use: Do not assume smaller models are always cheaper or more efficient. Measure latency and quality on your hardware before deciding what to run.

Evidence Ref: Figure 5, Sec. 5.2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Evaluator correlation with ground truth F1 | STS-DistilRoBERTa r ≈ 0.66; CE-DeBERTa r ≈ -0.04; CE-MiniLM r ≈ -0.24 | | | avg over SQuAD & CNN/DailyMail sampled sets | Figure 4 and Sec. 5.1 | Figure 4 |
| Evaluator correlation with GPT judge | STS-DistilRoBERTa r ≈ 0.29; CE-DeBERTa r ≈ 0.03; CE-MiniLM r ≈ -0.17 | | | avg over judged subset (up to 30 per model/task) | Figure 4 and Sec. 5.1 | Figure 4 |

What To Try In 7 Days

Profile your candidate models on your target hardware (latency, throughput, memory) and compute normalized cost scores.

Run a small PoQ simulation (few hundred prompts, K ≤ 3 evaluators) using R = α·quality - β·cost to see how rewards shift as you vary β.

Replace or add a semantic STS bi-encoder evaluator and measure correlation with your ground-truth metric or human judge.
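The first two steps above can be prototyped without any PoQ infrastructure. The sketch below min-max normalizes measured latencies into cost scores and reports which model earns the highest reward at each β. The helper names `min_max_normalize` and `sweep_beta`, the choice of min-max scaling, and the three model profiles are assumptions made for illustration, not values or methods from the paper.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sweep_beta(names, qualities, latencies_s, betas, alpha=1.0):
    """Return {beta: best model name} under R = alpha*quality - beta*cost."""
    costs = min_max_normalize(latencies_s)  # assumption: cost is latency only
    winners = {}
    for beta in betas:
        rewards = [alpha * q - beta * c for q, c in zip(qualities, costs)]
        best = max(range(len(names)), key=rewards.__getitem__)
        winners[beta] = names[best]
    return winners


# Hypothetical profiles: quality in [0, 1], latency in seconds.
names = ["model_a", "model_b", "model_c"]
qualities = [0.60, 0.55, 0.30]
latencies_s = [2.0, 1.0, 0.5]

print(sweep_beta(names, qualities, latencies_s, betas=[0.0, 0.2, 0.5, 1.0]))
```

In this made-up profile the slowest, highest-quality model wins only at β = 0, the mid-latency model wins at β = 0.2 and 0.5, and the fastest model takes over at β = 1.0. Substituting your own measured latencies and quality scores shows where those crossover points fall on your hardware.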

Optimization Features

Infra Optimization
GPU profiling (RTX 4090), Batch-size tradeoffs for evaluators
System Optimization
Quality-to-cost reward balancing, Evaluator batching for cost reduction
Inference Optimization
Distributed Inference, Latency Optimization, Throughput profiling

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

SQuAD v1.1, CNN/DailyMail

Risks & Boundaries

Limitations

Evaluation is limited to English QA and summarization with short contexts (400 sampled prompts). Results may change for long documents, multi-turn dialogue, or other languages.

All efficiency profiles were run on a single high-end GPU (RTX 4090). Heterogeneous real-world hardware and energy pricing are not modeled.

When Not To Use

High-stakes settings where cryptographic proof of exact computation is required (PoQ judges output quality, not execution correctness).

Multilingual or long-context tasks not covered by the evaluated datasets.

Failure Modes

Evaluator bias or low correlation: poor evaluators can distort rewards and promote wrong models.

Collusion between inference and evaluator nodes to inflate scores (not simulated).

Core Entities

Models

TinyLlama-1.1B, Qwen2-1.5B, Gemma-2-2B, Phi-3-mini-4k (3.8B), Llama-3.2-3B, CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa, gpt-4o-mini (judge)

Metrics

token-level F1 (scaled 0-10), GPT judge score (0-10), Pearson correlation, latency (ms), throughput (samples/sec), GPU memory (MB), normalized PoQ reward

Datasets

SQuAD v1.1 (dev), CNN/DailyMail (test)