Make decentralized LLM markets pay for quality per cost, not just raw accuracy

December 18, 2025 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.8

Citation Count

0

Authors

Arther Tian, Alex Ding, Frank Chen, Alan Wu, Aaron Chan, Bruce Zhang

Why It Matters For Business

If you run or buy decentralized LLM inference, paying for raw accuracy alone wastes money. Cost-aware PoQ rewards quality per cost, pushing demand toward nodes that deliver the best quality per unit of latency and compute. That reduces marketplace waste and lets small, efficient evaluators compete with larger models.

Summary TLDR

The paper extends Proof of Quality (PoQ), a verification-by-voting approach for trustless LLM inference, to explicitly include computational cost in rewards. It combines token-level F1, lightweight learned evaluators, and GPT judgments, and uses a linear reward R = α·quality - β·cost. Experiments use 5 inference models (1.1B–3.8B), 3 evaluators, 400 sampled prompts from SQuAD and CNN/DailyMail, and 5,000 Monte Carlo PoQ rounds. Results: a semantic STS bi-encoder correlates best with ground truth and GPT judgments, and cost-aware rewards favor high-quality, low-latency models and efficient evaluators.
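The token-level F1 mentioned above can be sketched as follows. This is a minimal illustration that assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles) and a 0-10 scale; the paper's exact normalization is not stated here, and the function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str, scale: float = 10.0) -> float:
    """Token-overlap F1 between prediction and reference, scaled to [0, scale]."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Assumption: two empty strings count as a match, one empty as a miss.
        return scale if pred == ref else 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return scale * 2 * precision * recall / (precision + recall)
```

Because the score depends only on token overlap, a fluent paraphrase can score low, which is one reason the summary also reports GPT judgments alongside F1.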

Problem Statement

Existing PoQ rewards only raw output quality. In decentralized networks, nodes differ widely in latency and energy cost. Without cost-awareness, incentives can favor expensive models and waste resources. The paper asks: can PoQ reward quality-to-cost efficiency so that decentralized inference becomes economically sustainable?

Main Contribution

Cost-aware PoQ framework that adds explicit node costs into rewards via a linear trade-off R = α·quality - β·cost for both inference and evaluator nodes.

Empirical comparison of three lightweight evaluator architectures (CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa) and their correlation with token-level F1 and GPT judgments.

Efficiency profiling of five inference models and three evaluators on a single GPU to build realistic cost norms (latency, throughput, memory).

5,000-round Monte Carlo simulation showing that cost-aware PoQ shifts rewards toward high-quality, low-cost inference models and efficient evaluators.

Practical deployment guidance: evaluator selection, consensus sizing (K ≤ 3), cost normalization, and reward tuning.
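The linear trade-off R = α·quality - β·cost and the cost normalization step listed above can be sketched as below. This is an illustration only: min-max normalization of latency is an assumed choice, the default α and β are arbitrary, and the node names and numbers are placeholders rather than results from the paper.

```python
def min_max_normalize(values: dict[str, float]) -> dict[str, float]:
    """Rescale raw costs (e.g. latency in seconds) to [0, 1] across nodes."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All nodes cost the same, so cost cannot differentiate them.
        return {name: 0.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


def poq_reward(quality: float, cost: float,
               alpha: float = 1.0, beta: float = 0.5) -> float:
    """Linear cost-aware reward: R = alpha * quality - beta * cost."""
    return alpha * quality - beta * cost


# Placeholder inputs (hypothetical nodes, illustrative values only).
latency_s = {"node_a": 1.1, "node_b": 2.3}   # raw cost signal
quality = {"node_a": 0.53, "node_b": 0.17}   # quality already scaled to [0, 1]

cost = min_max_normalize(latency_s)
rewards = {name: poq_reward(quality[name], cost[name]) for name in quality}
```

Keeping quality and cost on the same [0, 1] scale makes α and β directly comparable; with unnormalized costs, β would silently absorb the unit of measurement.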

Key Findings

A bi-encoder trained on semantic textual similarity (STS-DistilRoBERTa) aligns best with ground truth and GPT judgments.

Numbers: Pearson r ≈ 0.66 vs F1; ≈ 0.29 vs GPT

Llama-3.2-3B and Gemma-2-2B delivered the best quality and some of the lowest latency in this setup, ahead of the larger Phi-3-mini (3.8B).

Numbers: Llama-3.2-3B & Gemma-2-2B: avg F1 ≈ 5.3/10; GPT ≈ 9.0 & 8.7; latency ≈ 1.1s

Cost-aware rewards preferentially compensate high-quality, low-cost inference nodes and efficient evaluators.

Numbers: Monte Carlo (5k rounds): Llama-3.2-3B avg reward ≈0.623, Gemma ≈0.598, Phi-3-mini ≈0.126; STS eval ≈0.856, CE-MiniLM ≈0.

Some popular cross-encoder evaluators correlated poorly or negatively with ground truth and GPT for these generation tasks.

Numbers: CE-DeBERTa r ≈ -0.04 vs F1; CE-MiniLM r ≈ -0.24 vs F1
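The correlations quoted in these findings are Pearson r values between an evaluator's scores and a reference signal (token-level F1 or the GPT judge). A minimal, dependency-free sketch of that computation follows; the function name and the per-prompt score lists it expects are assumptions for illustration.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance and standard deviations, up to a common factor of n.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / (sd_x * sd_y)
```

A value near zero or below, as reported for the cross-encoders, means the evaluator's ranking of outputs carries little or inverted information about the reference metric, so rewards driven by it would be close to arbitrary.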

Results

Evaluator correlation with ground truth F1

Value: STS-DistilRoBERTa r ≈ 0.66; CE-DeBERTa r ≈ -0.04; CE-MiniLM r ≈ -0.24

Evaluator correlation with GPT judge

Value: STS-DistilRoBERTa r ≈ 0.29; CE-DeBERTa r ≈ 0.03; CE-MiniLM r ≈ -0.17

Inference quality vs latency (examples)

Value: Llama-3.2-3B & Gemma-2-2B: avg F1 ≈5.3/10; GPT ≈9.0 & 8.7; latency ≈1.1s. Phi-3-mini & Qwen2: avg F1 <1.7; latency >2.3s

Average PoQ rewards (normalized)

Value: Inference: Llama-3.2-3B ≈0.623; Gemma ≈0.598; TinyLlama ≈0.426; Phi-3-mini ≈0.126; Qwen2 ≈0.157. Evaluators: STS ≈0.856;

Simulation scale

Value: 5,000 PoQ rounds; evaluation corpus = 400 prompts; K ≤ 3 evaluators per round

Who Should Care

What To Try In 7 Days

Profile your candidate models on your target hardware (latency, throughput, memory) and compute normalized cost scores.

Run a small PoQ simulation (few hundred prompts, K ≤ 3 evaluators) using R = α·quality - β·cost to see how rewards shift as you vary β.

Replace or add a semantic STS bi-encoder evaluator and measure correlation with your ground-truth metric or human judge.
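The small PoQ simulation suggested above could look roughly like this. It is a sketch under stated assumptions, not the paper's protocol: every node answers every round, each of the K evaluators reports a Gaussian-noisy quality score, consensus is a plain average, and the two nodes and their numbers are hypothetical.

```python
import random
from statistics import mean


def simulate_poq(nodes, alpha, beta, rounds=500, k=3, noise=0.05, seed=0):
    """Average reward per node over `rounds` simulated PoQ rounds.

    `nodes` maps a node name to (true_quality, normalized_cost), both in [0, 1].
    """
    rng = random.Random(seed)
    totals = {name: 0.0 for name in nodes}
    for _ in range(rounds):
        for name, (quality, cost) in nodes.items():
            # Each of the K evaluators reports a noisy score clipped to [0, 1].
            votes = [min(1.0, max(0.0, rng.gauss(quality, noise)))
                     for _ in range(k)]
            consensus = mean(votes)  # assumed consensus rule: simple average
            totals[name] += alpha * consensus - beta * cost
    return {name: total / rounds for name, total in totals.items()}


# Hypothetical nodes as (quality, normalized cost); not figures from the paper.
NODES = {"fast_small": (0.50, 0.2), "slow_large": (0.60, 0.9)}

for beta in (0.0, 0.25, 0.5):
    print(beta, simulate_poq(NODES, alpha=1.0, beta=beta))
```

Sweeping β this way shows the point at which the cheaper node overtakes the slightly better but costlier one, which is the main thing to look for when tuning the reward on your own cost profile.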

Optimization Features

Infra Optimization

  • GPU profiling (RTX 4090)
  • Batch-size tradeoffs for evaluators

System Optimization

  • Quality-to-cost reward balancing
  • Evaluator batching for cost reduction

Inference Optimization

  • Distributed Inference
  • Latency Optimization
  • Throughput profiling

Reproducibility

Data Urls

  • SQuAD v1.1
  • CNN/DailyMail

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to English QA and summarization with short contexts (400 sampled prompts). Results may change for long documents, multi-turn dialogue, or other languages.
  • All efficiency profiles run on a single high-end GPU (RTX 4090). Heterogeneous real-world hardware and energy pricing are not modeled.
  • Simulation assumes honest nodes; adversarial collusion, misreporting of costs, or evaluator manipulation are not modeled.
  • Reward function is linear and fixed; other nonlinear or dynamic reward rules might behave differently.

When Not To Use

  • High-stakes settings where cryptographic proof of exact computation is required (PoQ judges output quality, not execution correctness).
  • Multilingual or long-context tasks not covered by the evaluated datasets.
  • Markets with highly heterogeneous and untrusted cost reporting unless combined with audits or commit-reveal cost proofs.
  • Environments with active adversaries or collusion unless robust aggregation and adversary models are added.

Failure Modes

  • Evaluator bias or low correlation: poor evaluators can distort rewards and promote wrong models.
  • Collusion between inference and evaluator nodes to inflate scores (not simulated).
  • Misreported or faked cost numbers if cost reporting is not audited, skewing incentives.
  • Batch-size sensitivity: evaluator cost per sample depends strongly on batching; poor batch management can break the cost model.
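To check the batch-size sensitivity noted above on your own hardware, per-sample evaluator latency can be measured at several batch sizes. The harness below is a minimal sketch: `score_batch` is a hypothetical callable standing in for whatever evaluator you use, and wall-clock timing is an assumed cost proxy.

```python
import time
from statistics import median


def per_sample_latency(score_batch, samples, batch_size, repeats=3):
    """Median wall-clock seconds per sample when scoring in fixed-size batches."""
    if batch_size < 1 or not samples:
        raise ValueError("need a positive batch size and at least one sample")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(samples), batch_size):
            # The final batch may be smaller than batch_size.
            score_batch(samples[i:i + batch_size])
        timings.append((time.perf_counter() - start) / len(samples))
    # The median is less sensitive to a slow first (warm-up) pass than the mean.
    return median(timings)


def dummy_scorer(batch):
    """Stand-in evaluator for illustration; returns one score per input."""
    return [float(len(text)) for text in batch]


for bs in (1, 8, 32):
    print(bs, per_sample_latency(dummy_scorer, ["some answer"] * 64, bs))
```

For a GPU-backed evaluator, the callable should block until the device has finished (for example by synchronizing before returning), otherwise asynchronous kernel launches make the measured latency look smaller than it is.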

Core Entities

Models

  • TinyLlama-1.1B
  • Qwen2-1.5B
  • Gemma-2-2B
  • Phi-3-mini-4k (3.8B)
  • Llama-3.2-3B
  • CE-MiniLM
  • CE-DeBERTa
  • STS-DistilRoBERTa
  • gpt-4o-mini (judge)

Metrics

  • token-level F1 (scaled 0-10)
  • GPT judge score (0-10)
  • Pearson correlation
  • latency (ms)
  • throughput (samples/sec)
  • GPU memory (MB)
  • normalized PoQ reward

Datasets

  • SQuAD v1.1 (dev)
  • CNN/DailyMail (test)