Overview
Production Readiness
0.6
Novelty Score
0.45
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
If you run or buy decentralized LLM inference, paying for raw accuracy alone wastes money. Cost-aware PoQ rewards quality net of cost, steering demand toward nodes that deliver the best quality for their latency and compute. That reduces marketplace waste and lets small evaluators compete with big models.
Summary TLDR
The paper extends Proof of Quality (PoQ), a verification-by-voting approach for trustless LLM inference, to include computational cost explicitly in rewards. It combines token-level F1, lightweight learned evaluators, and GPT judgments, and uses a linear reward R = α·quality - β·cost. Experiments use 5 inference models (1.1B–3.8B), 3 evaluators, 400 sampled prompts from SQuAD and CNN/DailyMail, and 5,000 Monte Carlo PoQ rounds. Results: a semantic STS bi-encoder correlates best with ground truth and GPT; cost-aware rewards favour high-quality, low-latency models and efficient evaluators.
Problem Statement
Existing PoQ rewards only raw output quality. In decentralized networks, nodes have very different latency and energy costs. Without cost-awareness, incentives can favor expensive models and waste resources. The paper asks: can PoQ reward quality-to-cost efficiency so that decentralized inference becomes economically sustainable?
Main Contribution
Cost-aware PoQ framework that adds explicit node costs into rewards via a linear trade-off R = α·quality - β·cost for both inference and evaluator nodes.
Empirical comparison of three lightweight evaluator architectures (CE-MiniLM, CE-DeBERTa, STS-DistilRoBERTa) and their correlation with token-level F1 and GPT judgments.
Efficiency profiling of five inference models and three evaluators on a single GPU to build realistic cost norms (latency, throughput, memory).
5,000-round Monte Carlo simulation showing that cost-aware PoQ shifts rewards toward high-quality, low-cost inference nodes and efficient evaluators.
Practical deployment guidance: evaluator selection, consensus sizing (K ≤ 3), cost normalization, and reward tuning.
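To make the linear trade-off R = α·quality - β·cost concrete, here is a minimal Python sketch. The min-max cost normalization, the default weights, and the names RewardParams, normalize_cost, and poq_reward are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParams:
    alpha: float = 1.0  # weight on quality (hypothetical default)
    beta: float = 0.5   # weight on cost (hypothetical default)


def normalize_cost(raw_cost: float, min_cost: float, max_cost: float) -> float:
    """Min-max normalize a raw cost (e.g. latency in ms) to [0, 1].

    Assumption: the summary mentions cost normalization but not the scheme.
    """
    if max_cost <= min_cost:
        return 0.0
    clipped = min(max(raw_cost, min_cost), max_cost)
    return (clipped - min_cost) / (max_cost - min_cost)


def poq_reward(quality: float, cost: float,
               params: RewardParams = RewardParams()) -> float:
    """Linear cost-aware reward: R = alpha * quality - beta * cost."""
    return params.alpha * quality - params.beta * cost


# Hypothetical node: quality 0.8, latency 150 ms in a 100-300 ms fleet.
cost = normalize_cost(150.0, min_cost=100.0, max_cost=300.0)
print(poq_reward(0.8, cost))
```

Keeping quality and cost on the same [0, 1] scale makes α and β directly comparable, which simplifies reward tuning.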
Key Findings
A bi-encoder trained on semantic textual similarity (STS-DistilRoBERTa) aligns best with ground truth and GPT judgments.
The two largest inference models delivered the best quality and some of the lowest latencies in this setup.
Cost-aware rewards preferentially compensate high-quality, low-cost nodes and efficient evaluators.
Some popular cross-encoder evaluators correlated poorly or negatively with ground truth and GPT for these generation tasks.
Results
Evaluator correlation with ground truth F1
Evaluator correlation with GPT judge
Inference quality vs latency (examples)
Average PoQ rewards (normalized)
Simulation scale
Who Should Care
What To Try In 7 Days
Profile your candidate models on your target hardware (latency, throughput, memory) and compute normalized cost scores.
Run a small PoQ simulation (a few hundred prompts, K ≤ 3 evaluators) using R = α·quality - β·cost to see how rewards shift as you vary β.
Replace or add a semantic STS bi-encoder evaluator and measure correlation with your ground-truth metric or human judge.
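A small simulation along these lines could start from the sketch below. It covers only the inference side, and everything specific in it is an assumption: the three node profiles, the Gaussian evaluator noise, and the median consensus are hypothetical stand-ins for your own profiling data and aggregation rule.

```python
import random
from statistics import mean, median

# Hypothetical node profiles: (mean quality in [0, 1], normalized cost in [0, 1]).
NODES = {
    "fast_good": (0.80, 0.20),
    "slow_good": (0.85, 0.90),
    "fast_weak": (0.50, 0.10),
}


def simulate(beta, alpha=1.0, rounds=2000, k=3, noise=0.05, seed=0):
    """Average reward per inference node over Monte Carlo PoQ rounds."""
    rng = random.Random(seed)  # seeded for reproducible runs
    rewards = {name: [] for name in NODES}
    names = list(NODES)
    for _ in range(rounds):
        name = rng.choice(names)  # one inference node answers each round
        quality, cost = NODES[name]
        # K evaluators each return a noisy score clipped to [0, 1].
        scores = [min(1.0, max(0.0, rng.gauss(quality, noise)))
                  for _ in range(k)]
        consensus = median(scores)  # assumed aggregation rule
        rewards[name].append(alpha * consensus - beta * cost)
    return {name: (mean(vals) if vals else 0.0)
            for name, vals in rewards.items()}


for beta in (0.0, 0.25, 0.5):
    ranking = sorted(simulate(beta).items(), key=lambda kv: -kv[1])
    print(beta, [name for name, _ in ranking])
```

Sweeping β this way shows the point at which an expensive high-quality node drops below a cheaper one in the reward ranking.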
Optimization Features
Infra Optimization
- GPU profiling (RTX 4090)
- Batch-size tradeoffs for evaluators
System Optimization
- Quality-to-cost reward balancing
- Evaluator batching for cost reduction
Inference Optimization
- Distributed Inference
- Latency Optimization
- Throughput profiling
Reproducibility
Data Urls
- SQuAD v1.1
- CNN/DailyMail
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to English QA and summarization with short contexts (400 sampled prompts). Results may change for long documents, multi-turn dialogue, or other languages.
- All efficiency profiles run on a single high-end GPU (RTX 4090). Heterogeneous real-world hardware and energy pricing are not modeled.
- Simulation assumes honest nodes; adversarial collusion, misreporting of costs, or evaluator manipulation are not modeled.
- Reward function is linear and fixed; other nonlinear or dynamic reward rules might behave differently.
When Not To Use
- High-stakes settings where cryptographic proof of exact computation is required (PoQ judges output quality, not execution correctness).
- Multilingual or long-context tasks not covered by the evaluated datasets.
- Markets with highly heterogeneous and untrusted cost reporting unless combined with audits or commit-reveal cost proofs.
- Environments with active adversaries or collusion unless robust aggregation and adversary models are added.
Failure Modes
- Evaluator bias or low correlation: poor evaluators can distort rewards and promote wrong models.
- Collusion between inference and evaluator nodes to inflate scores (not simulated).
- Misreported or faked cost numbers if cost reporting is not audited, skewing incentives.
- Batch-size sensitivity: evaluator cost per sample depends strongly on batching; poor batch management can break the cost model.
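The batch-size sensitivity can be illustrated with a simple amortization model. The sketch assumes each evaluator call pays a fixed overhead plus a per-item cost; the function name and the numbers are hypothetical, and real GPU latency rarely scales this cleanly, so measured profiles should replace the model wherever they are available.

```python
def per_sample_cost_ms(fixed_overhead_ms: float, per_item_ms: float,
                       batch_size: int) -> float:
    """Amortized per-sample cost under an assumed linear batch-latency model.

    Batch latency is modeled as fixed_overhead_ms + per_item_ms * batch_size,
    so the per-sample cost is that total divided by batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_overhead_ms / batch_size + per_item_ms


# Hypothetical evaluator: 20 ms fixed overhead, 2 ms per item.
for batch_size in (1, 8, 32):
    print(batch_size, per_sample_cost_ms(20.0, 2.0, batch_size))
```

If a node reports cost at one batch size but serves requests at another, the cost term in the reward can be off by a large factor, which is why the batch size used for profiling should be fixed or reported alongside the cost.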
Core Entities
Models
- TinyLlama-1.1B
- Qwen2-1.5B
- Gemma-2-2B
- Phi-3-mini-4k (3.8B)
- Llama-3.2-3B
- CE-MiniLM
- CE-DeBERTa
- STS-DistilRoBERTa
- gpt-4o-mini (judge)
Metrics
- token-level F1 (scaled 0-10)
- GPT judge score (0-10)
- Pearson correlation
- latency (ms)
- throughput (samples/sec)
- GPU memory (MB)
- normalized PoQ reward
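As a rough illustration of the first and third metrics, the sketch below computes a token-level F1 on a 0-10 scale and a Pearson correlation using only the standard library. The lowercased whitespace tokenization is a simplifying assumption (the official SQuAD evaluation script also strips punctuation and articles), and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 in [0, 1] with lowercased whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # both empty counts as a match
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def token_f1_0_10(prediction: str, reference: str) -> float:
    """Same score rescaled to 0-10 to match the judge's scale."""
    return 10.0 * token_f1(prediction, reference)


def pearson(xs, ys) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

To compare an evaluator against ground truth, collect its scores and the F1 scores over the same prompts and pass both lists to pearson.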
Datasets
- SQuAD v1.1 (dev)
- CNN/DailyMail (test)

