Overview
The work provides an executable protocol, adapter code, and experiments across 90 configurations; empirical claims are backed by statistical tests but cover a limited set of models and benchmarks.
Citations: 2
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/6
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
A single evaluation protocol reduces integration cost, reveals whether you should invest in a better LLM or in agent engineering, and helps pick cost-performance tradeoffs for production.
Summary TLDR
The paper introduces the Unified Protocol and Exgentic, a mediation layer and an evaluation harness that let any general agent run on many existing agent benchmarks without per-benchmark rewrites. The authors run 90 agent–model–benchmark configurations (5 agent styles × 3 LLMs × 6 benchmarks) and publish an Open General Agent Leaderboard. Key findings: model choice matters far more than agent design for task success, with Claude Opus 4.5 performing best; tool shortlisting fixes tool-limit failures for some models; and higher-performing setups cost much more to run. The code, protocol, and public leaderboard are released at www.exgentic.ai.
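The 90-run count follows directly from the full cross product of the three factors. A minimal sketch of that grid is shown here; the summary gives only the counts (5, 3, 6), so the labels are placeholders and `build_grid` is a hypothetical helper, not part of Exgentic.

```python
from itertools import product

# Placeholder labels: only the counts (5, 3, 6) come from the summary.
AGENT_STYLES = [f"agent_style_{i}" for i in range(1, 6)]
MODELS = [f"model_{i}" for i in range(1, 4)]
BENCHMARKS = [f"benchmark_{i}" for i in range(1, 7)]


def build_grid(agents, models, benchmarks):
    """Return every (agent, model, benchmark) combination as a list of dicts."""
    return [
        {"agent": a, "model": m, "benchmark": b}
        for a, m, b in product(agents, models, benchmarks)
    ]


GRID = build_grid(AGENT_STYLES, MODELS, BENCHMARKS)
print(len(GRID))  # 90
```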
Problem Statement
Current agent benchmarks tie tasks to domain-specific communication protocols and hidden assumptions, so they cannot fairly evaluate agents designed to be general. This prevents apples-to-apples comparison of general-purpose agents and slows progress toward agents that work across many environments.
Main Contribution
Unified Protocol: a canonical task/context/action format that mediates between agents and benchmarks without changing either.
Exgentic: an orchestration framework and adaptor library to run any supported agent on any supported benchmark reproducibly and at scale.
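To make the mediation idea concrete, here is a minimal sketch of a canonical task/context/action format with one adaptor translating a benchmark's native messages. Every name in it (`Task`, `Context`, `Action`, `EchoBenchmark`, `EchoAdaptor`, `run_episode`) is an assumption made for illustration; the actual Unified Protocol schema and Exgentic API may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Task:
    instruction: str


@dataclass
class Context:
    observation: str
    tools: list[str] = field(default_factory=list)


@dataclass
class Action:
    tool: str
    arguments: dict[str, Any] = field(default_factory=dict)


class EchoBenchmark:
    """Toy benchmark that speaks its own native format (plain dicts)."""

    def reset(self) -> dict:
        return {"goal": "say hello", "screen": "empty", "allowed": ["say"]}

    def step(self, native_action: dict) -> dict:
        done = native_action.get("cmd") == "say" and native_action.get("text") == "hello"
        return {"screen": native_action.get("text", ""), "done": done}


class EchoAdaptor:
    """Translates between EchoBenchmark's dicts and the canonical types."""

    def to_canonical(self, native: dict) -> tuple[Task, Context]:
        return Task(native["goal"]), Context(native["screen"], list(native["allowed"]))

    def to_native(self, action: Action) -> dict:
        return {"cmd": action.tool, **action.arguments}


class HelloAgent:
    """Toy agent that only ever sees canonical types, never the native dicts."""

    def act(self, task: Task, context: Context) -> Action:
        return Action(tool=context.tools[0], arguments={"text": "hello"})


def run_episode(agent, benchmark, adaptor) -> bool:
    """Run a single-step episode; neither agent nor benchmark is modified."""
    task, context = adaptor.to_canonical(benchmark.reset())
    action = agent.act(task, context)
    return benchmark.step(adaptor.to_native(action))["done"]
```

In this sketch, supporting a new benchmark means writing one more adaptor; the agent code stays untouched, which is the property the protocol is meant to provide.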
Key Findings
Model choice explains far more performance variance than agent architecture.
Claude Opus 4.5 outperforms Gemini 3 and GPT 5.2 on the evaluated benchmarks.
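One simple way to check a "model versus agent" claim on your own runs is to compare how much of the variance in success rate is explained by each factor's group means. The sketch below assumes a balanced design (every agent paired with every model) and uses synthetic toy scores that are not taken from the paper; `variance_share` is a hypothetical helper, and the authors' statistical tests may differ.

```python
from statistics import mean, pvariance


def variance_share(scores, axis):
    """Fraction of total variance explained by one factor's group means.

    scores: dict mapping (agent, model) -> success rate.
    axis:   0 to group by agent, 1 to group by model.
    Assumes a balanced design, where this ratio equals SS_between / SS_total.
    """
    total = pvariance(list(scores.values()))
    groups = {}
    for key, value in scores.items():
        groups.setdefault(key[axis], []).append(value)
    group_means = [mean(values) for values in groups.values()]
    return pvariance(group_means) / total


# Synthetic toy numbers for illustration only; NOT results from the paper.
TOY_SCORES = {
    ("agent_a", "model_x"): 0.8, ("agent_b", "model_x"): 0.7,
    ("agent_a", "model_y"): 0.5, ("agent_b", "model_y"): 0.4,
    ("agent_a", "model_z"): 0.3, ("agent_b", "model_z"): 0.2,
}

agent_share = variance_share(TOY_SCORES, axis=0)
model_share = variance_share(TOY_SCORES, axis=1)
print(f"agent share: {agent_share:.2f}, model share: {model_share:.2f}")
```

Because the toy scores are purely additive, the two shares sum to one; real data with agent-model interactions would leave a residual.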
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Top configuration mean success | 0.73 | — | — | All benchmarks (aggregate) | OpenAI Solo + Claude Opus 4.5 (Table 6) | Table 6 |
| Model mean success (Claude Opus 4.5) | 0.66 | — | — | Aggregate across benchmarks | Sec 5.1 and Table 1 | Sec 5.1 |
What To Try In 7 Days
Run Exgentic on one representative benchmark with your agent and two backbone models to measure model vs agent impact.
Add tool shortlisting if your model hits tool-count limits; retest for reliability and cost.
Monitor average steps and failure-run length to reduce wasted cost via early-stopping or schema guards.
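The early-stopping suggestion above can be sketched as a small guard around the agent loop. This is an illustrative assumption, not Exgentic functionality: `run_with_budget`, its parameters, and the `(action, done)` return convention of `step_fn` are all hypothetical.

```python
def run_with_budget(step_fn, max_steps=30, max_repeats=3):
    """Call step_fn() until it reports done, the step budget runs out,
    or the same action is repeated max_repeats times in a row.

    step_fn must return a tuple (action, done), where action is hashable.
    """
    history = []
    for step in range(1, max_steps + 1):
        action, done = step_fn()
        history.append(action)
        if done:
            return {"status": "success", "steps": step}
        # Stop early when the agent is stuck repeating one action.
        if len(history) >= max_repeats and len(set(history[-max_repeats:])) == 1:
            return {"status": "stopped_repeat", "steps": step}
    return {"status": "stopped_budget", "steps": max_steps}


def average_steps(results):
    """Mean step count over a list of run results, for cost monitoring."""
    return sum(r["steps"] for r in results) / len(results)
```

Logging `status` alongside `steps` separates the cost of successful runs from the cost of long failing runs, which is where a tighter budget is most likely to save money.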
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic: Yes
Architectures
Optimization Features
Infra Optimization
System Optimization
Inference Optimization
Risks & Boundaries
Limitations
Exgentic currently supports text-based interactions only, not visual or web UIs.
Experiments cover a subset of agents and closed-source models; results may not generalize to all LLMs.
When Not To Use
Tasks that require GUI or visual interaction, until Exgentic adds non-text protocols.
Exhaustive evaluation across many niche models outside the supported set, unless you are prepared to extend the adaptors.
Failure Modes
Adaptor mappings might omit implicit benchmark assumptions and change agent behavior.
Large tool sets can break models with tool-count limits unless shortlisting is used.
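As an illustration of the shortlisting idea, the sketch below keeps only the tools whose names and descriptions share the most words with the task text. It is a simple keyword heuristic assumed for this example, not the shortlisting method used in the paper; `shortlist_tools` and the tool catalogue are hypothetical.

```python
import re


def shortlist_tools(task_text, tools, limit):
    """Return at most `limit` tool names, ranked by word overlap with the task.

    tools: dict mapping tool name -> natural-language description.
    Ties keep the original dict order, because Python's sort is stable.
    """
    task_words = set(re.findall(r"[a-z0-9]+", task_text.lower()))

    def score(item):
        name, description = item
        words = set(re.findall(r"[a-z0-9]+", f"{name} {description}".lower()))
        return len(task_words & words)

    ranked = sorted(tools.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:limit]]


# Hypothetical tool catalogue for illustration.
TOOLS = {
    "send_email": "send an email message",
    "get_weather": "look up the weather forecast for a city",
    "book_flight": "book a flight ticket",
}

print(shortlist_tools("What is the weather in Paris?", TOOLS, limit=1))
```

A production version would more plausibly rank tools with embeddings or a cheap model call, but the interface stays the same: full catalogue in, a list no longer than the model's tool-count limit out.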

