Single-Agent Systems Papers — Parsed & Scored for Practitioners

Practical survey of single- vs. multi-agent designs, planning steps, and tool-calling trade-offs

0.60

0.50

19

Choose single agents for narrow, tool-driven tasks and multi-agent teams for complex, parallel workflows; add clear leadership, role prompts, and message filtering to improve speed and reliability.

Key finding

ReAct reduces factual hallucination versus Chain-of-Thought on HotpotQA.

Numbers: 6% hallucination (ReAct) vs 14% (CoT) on HotpotQA

Agentless: a simple three-step workflow (localize, repair, validate) that matches or beats open-source agents on SWE-bench Lite while slash…

0.70

0.60

0.80

13

A focused, non-agentic pipeline cuts cost and engineering overhead while matching or exceeding many open-source agentic systems on repo-level bug fixes.

Key finding

AGENTLESS resolves 96 of 300 SWE-bench Lite problems.

Numbers: 96/300 = 32.00%
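The three steps named in the title (localize, repair, validate) can be sketched as a fixed pipeline with no agent loop. This is a minimal illustration, not the authors' implementation: `fix_issue` and the three `toy_*` callables are hypothetical stand-ins, and a real system would put LLM calls and test runs behind them.

```python
from typing import Callable, List, Optional, Sequence


def fix_issue(
    issue: str,
    files: Sequence[str],
    localize: Callable[[str, Sequence[str]], List[str]],
    repair: Callable[[str, str], List[str]],
    validate: Callable[[str], bool],
) -> Optional[str]:
    """Run localize -> repair -> validate once and return the first valid patch."""
    # Step 1: narrow the repository down to suspicious locations.
    for location in localize(issue, files):
        # Step 2: propose candidate patches for that location.
        for patch in repair(issue, location):
            # Step 3: keep the first candidate that passes validation.
            if validate(patch):
                return patch
    return None  # nothing passed validation


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_localize(issue: str, files: Sequence[str]) -> List[str]:
    return [f for f in files if "parser" in f]


def toy_repair(issue: str, location: str) -> List[str]:
    return [f"bad patch for {location}", f"good patch for {location}"]


def toy_validate(patch: str) -> bool:
    return patch.startswith("good")


chosen = fix_issue(
    "crash when parsing empty input",
    ["parser.py", "cli.py"],
    toy_localize,
    toy_repair,
    toy_validate,
)
```

Because the control flow is fixed, cost is bounded by the number of locations and candidate patches rather than by how long an agent decides to keep acting.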

KG-Agent: a tool-augmented autonomous 7B LLM that reasons step-by-step over knowledge graphs

0.60

0.65

0.70

12

You can get KG-backed, multi-hop reasoning without expensive closed LLM APIs by fine-tuning a 7B open model on ~10K program-like instructions, cutting cost and improving cross-domain use of external KGs.

Key finding

Instruction-tuned KG-Agent (LLaMA2-7B) improves KGQA F1 over prior baselines on in-domain tests.

Numbers: F1 gains: WebQSP +1.7%, CWQ +7.5%, GrailQA +2.7% (Sec 5.2, Table 2)

Mobile-Agent: operate mobile apps from screenshots using visual perception

0.70

0.60

11

You can automate mobile UI flows without OS hooks or XML access. That lowers integration cost for cross-device automation, testing, and accessibility tools and works on devices where system metadata is unavailable.

Key finding

High task success on simple app instructions.

Numbers: Success (Instruction1) = 0.91

Use LLMs (LightGPT) to control traffic lights with human-like reasoning and lower deployment cost

0.70

0.85

10

LLMLight enables interpretable, generalizable traffic control with much lower deployment cost than closed LLM APIs, making city-scale experiments and phased rollouts affordable.

Key finding

LightGPT (Llama2-13B) yields low travel times on evaluated datasets.

Numbers: average travel time (ATT) ≈ 274.03 s on Jinan/Hangzhou (Table 2/8).

Have LLMs judge and train themselves: iterative self-rewards boost instruction-following and the model's own evaluator.

0.60

0.70

0.60

9

Self-rewarding training can reduce dependence on large human-preference datasets by letting an LLM generate and score its own training data, lowering labeling cost and enabling iterative improvement, but it needs monitoring for safety and domain gaps.

Key finding

Instruction-following win rate against GPT-4 Turbo (AlpacaEval 2.0) rose across iterations.

Numbers: M1 9.94% → M2 15.38% → M3 20.44%

GPT-4 agents autonomously exploit sandboxed website vulnerabilities (11/15) and find at least one real XSS

0.20

0.60

8

High-capability LLM agents can automate complex web attacks at lower estimated cost than manual analysts, increasing the risk surface for companies that expose web interfaces.

Key finding

GPT-4 agent succeeded on most sandboxed vulnerabilities.

Numbers: Pass@5 = 73.3%; overall success = 42.7% (Table 2)

Aviary: train small open LLM agents to solve multi-step biology tasks and match frontier models at far lower inference cost

0.70

0.60

0.80

7

You can train modest open LLMs to match or beat larger closed models and humans on multi-step scientific workflows while cutting inference cost by orders of magnitude, enabling cheaper high-throughput automation.

Key finding

A trained Llama-3.1-8B-Instruct agent reached 0.89 test accuracy on SeqQA using large-sample majority voting.

Numbers: 0.89 accuracy (SeqQA, test; many-sample consensus)
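Majority voting over many samples, as mentioned in the key finding, can be sketched as follows. The function name and the toy answers are illustrative assumptions. The summary does not say how many samples were drawn or how ties were broken, so this sketch simply resolves ties in favor of the answer seen first.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def majority_vote(answers: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common keeps first-seen order among equal counts.
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Five sampled answers to the same multiple-choice question (toy data).
consensus, agreement = majority_vote(["B", "A", "B", "C", "B"])
```

The agreement share doubles as a rough confidence signal: a low value suggests drawing more samples or escalating the question.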

How LLMs are being used to build game-playing agents: memory, reasoning, perception, and multi-agent design

0.40

0.60

0.50

6

Game agents are a practical lab for building interactive AI: solutions for memory, robust reasoning, and hybrid control transfer to real automation, simulations, and multi-agent coordination systems used in product testing and virtual worlds.

Key finding

Carrying the previous step's thought into the next prompt (LastThoughts) raises win rate and cuts short-term inconsistent actions.

Numbers: Win rate 0.4217 → 0.4667; consecutive switch rate 0.2442 → 0.0861
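Carrying the previous step's thought into the next prompt can be sketched as below. The prompt wording, `build_prompt`, `play`, and the `fake_llm` stub are all hypothetical; the summary only states that the last thought is passed forward, not how the prompt is formatted.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def build_prompt(observation: str, last_thought: Optional[str]) -> str:
    """Compose the step prompt, prepending the previous thought when there is one."""
    parts = []
    if last_thought:
        parts.append(f"Previous thought: {last_thought}")
    parts.append(f"Observation: {observation}")
    parts.append("Reply with your thought, then your action.")
    return "\n".join(parts)


def play(
    observations: Sequence[str],
    llm: Callable[[str], Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Step through observations, feeding each thought into the next prompt."""
    last_thought: Optional[str] = None
    actions: List[str] = []
    prompts: List[str] = []
    for obs in observations:
        prompt = build_prompt(obs, last_thought)
        prompts.append(prompt)
        thought, action = llm(prompt)  # llm returns (thought, action)
        actions.append(action)
        last_thought = thought  # carried into the next step
    return actions, prompts


# Stub model (hypothetical) so the loop runs without an API.
def fake_llm(prompt: str) -> Tuple[str, str]:
    return "hold the center", "move"


demo_actions, demo_prompts = play(["turn 1", "turn 2"], fake_llm)
```

Only one thought is carried forward here, which keeps prompt length flat as the game goes on.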

Autonomously collect a single rollout to train a NeRF for rendering, mapping and navigation

0.60

0.50

0.40

5

AutoNeRF can automate 3D scene capture for robot deployment, cutting manual data collection and enabling safe simulation-based finetuning of navigation policies from a single short rollout.

Key finding

Modular exploration trained for obstacle/viewpoint coverage yields better RGB rendering than Frontier or E2E RL.

Numbers: PSNR 25.56 (Ours obs.) vs 19.75 (Frontier) on uniform scene poses

Train LLM-based agents end-to-end with RL and let them ask humans for help

0.60

0.70

0.60

4

AGILE lets production agents learn when to call humans and when to act, improving accuracy while controlling human cost. That makes it practical for customer support, medical QA, and recommendation systems where mistakes are costly.

Key finding

AGILE (agile-vic13b-ppo) achieves a higher average total score on ProductQA than the GPT-4 agent.

Numbers: Total score (short answers) 0.784 vs agile-gpt4-prompt 0.718; +9.2% rel. (Table 4)

A simple LLM-based monitor that stops unsafe AutoGPT actions during live web and file tests

0.40

0.60

3

A lightweight LLM-based gate can block many dangerous agent actions before they run, reducing incident risk for products that let agents access the web or filesystem.

Key finding

AgentMonitor achieves high detection performance on the authors' test set.

Numbers: F1 89.4%, precision 82.1%, recall 98.3%, AUC 0.982
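A gate that checks each action before it runs can be sketched as below. This is not the AgentMonitor implementation: `gated_execute`, the threshold of 0.5, and the keyword-based `toy_score_risk` are assumptions for illustration, and in practice the scorer would be an LLM call.

```python
from typing import Any, Callable, Dict, List


def gated_execute(
    action: str,
    score_risk: Callable[[str], float],
    execute: Callable[[str], Any],
    threshold: float = 0.5,
) -> Dict[str, Any]:
    """Score the action first; run it only if the risk is below the threshold."""
    risk = score_risk(action)
    if risk >= threshold:
        # Blocked: the action is never passed to execute().
        return {"executed": False, "risk": risk, "result": None}
    return {"executed": True, "risk": risk, "result": execute(action)}


# Toy scorer and executor (hypothetical stand-ins).
def toy_score_risk(action: str) -> float:
    return 0.9 if "rm -rf" in action else 0.1


executed_log: List[str] = []


def toy_execute(action: str) -> str:
    executed_log.append(action)
    return "ok"


safe = gated_execute("ls ./workspace", toy_score_risk, toy_execute)
blocked = gated_execute("rm -rf /", toy_score_risk, toy_execute)
```

Lowering the threshold blocks more actions, which generally raises recall on unsafe actions at the cost of precision.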

Fixes RND's 'bonus inconsistency' by distilling many random targets to produce pseudo-counts for better exploration and offline conservatism

0.70

0.60

3

DRND improves exploration bonuses while adding negligible compute, so teams can get better exploration or safer offline policies without changing infra or large engineering effort.

Key finding

DRND produces a much more uniform initial bonus than RND.

Numbers: KL divergence from uniform, D_KL(P || U): RND 0.0377±0.0248 vs DRND 0.0070±0.0063 (before training)
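The KL divergence from a uniform distribution measures how uneven a set of bonuses is: 0 means perfectly uniform, and larger values mean more skew. The sketch below assumes P is obtained by normalizing per-state bonuses to sum to 1. That normalization is an assumption for illustration, and the paper's exact construction of P may differ.

```python
import math
from typing import Sequence


def kl_to_uniform(bonuses: Sequence[float]) -> float:
    """D_KL(P || U), where P is the bonuses normalized to sum to 1 and U is uniform."""
    if any(b < 0 for b in bonuses):
        raise ValueError("bonuses must be non-negative")
    total = sum(bonuses)
    if total <= 0:
        raise ValueError("bonuses must have a positive sum")
    n = len(bonuses)
    kl = 0.0
    for b in bonuses:
        p = b / total
        if p > 0:  # terms with p = 0 contribute nothing
            # p * log(p / (1/n)) = p * log(p * n)
            kl += p * math.log(p * n)
    return kl


uniform_kl = kl_to_uniform([1.0, 1.0, 1.0, 1.0])  # perfectly even bonuses
peaked_kl = kl_to_uniform([1.0, 0.0, 0.0, 0.0])   # all mass on one state
```

With n states the value ranges from 0 (uniform) to log n (all mass on a single state).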

Use learned evaluators (VLM+LM) to judge and improve web and device-control agents without extra labels

0.65

0.60

0.70

3

Automated evaluators let you measure and improve GUI agents at scale without costly human labels or hand-coded test functions, enabling faster development and safer deployment of web and device automation.

Key finding

Modular evaluator (captioner + Mixtral) matched human/oracle judgments with high accuracy on Android.

Numbers: Android agreement 92.9% (Captioner + Mixtral)

TheAgentCompany: benchmark LLM agents on realistic workplace tasks

0.30

0.65

0.60

3

Agents can reliably finish some code-heavy tasks but currently fail most social and office workflows; companies should pilot agents on engineering tasks first and budget for API costs and human oversight.

Key finding

Top model autonomy is partial: Gemini-2.5-Pro completes about 30% of tasks.

Numbers: 30.3% success; 39.3% partial score (Table 1)

Use an LLM-powered agent to auto-generate and iteratively refine realistic tests that expose long-tail value misalignment

0.60

0.70

0.40

3

ALI-Agent automates realistic safety tests and finds subtle failures that static benchmarks miss, helping product teams catch risky model behavior before deployment.

Key finding

ALI-Agent increases attack success on AdvBench with iterative refinement.

Numbers: Avg ASR 14.95% → 29.70% (iteration 0 → 5, Table 18)

Chatbot refusals don't stop browser agents — agents with browser access often carry out harmful requests that the same LLM would refuse in a…

0.20

0.50

0.60

2

Models that safely refuse in chat can still perform harmful actions when given browser control; any product that grants web access to LLMs must test agent behavior, monitor live actions, and apply layered safeguards to avoid compliance, reputational, and legal risks.

Key finding

Agents execute many harms that the same LLM refuses as a chatbot.

Numbers: GPT-4o chatbot ASR 12% vs GPT-4o browser agent ASR 74% (Figure 5)

When big LLMs get better, multi-agent setups lose much of their edge — use targeted upgrades and hybrid routing to save cost.

0.60

0.50

0.70

2

Multi-agent systems (MAS) increase accuracy only on very hard tasks but multiply deployment cost; hybrid routing or cascading between single-agent (SAS) and multi-agent setups saves API and latency costs while keeping or improving accuracy.

Key finding

MAS accuracy advantage shrinks as underlying LLMs improve.

Numbers: MetaGPT-HumanEval: ChatGPT SAS→67% vs MAS→87.7% (10.7% gain); Gemini-2.0: SAS→90.2% vs MAS→93.2% (3.0% gain).
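One way to read the hybrid routing/cascade idea is: try the cheap single-agent path first and escalate to the multi-agent system only when the answer is not accepted. The sketch below is an illustration under that assumption, not the paper's method. `cascade`, the `accept` check, and the toy agents are hypothetical.

```python
from typing import Callable, Tuple


def cascade(
    task: str,
    single_agent: Callable[[str], str],
    multi_agent: Callable[[str], str],
    accept: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Return (answer, route), escalating only when the cheap answer is rejected."""
    answer = single_agent(task)
    if accept(task, answer):
        return answer, "single"
    # Escalate: pay the multi-agent cost only for tasks that need it.
    return multi_agent(task), "multi"


# Toy agents and acceptance check (hypothetical stand-ins).
def toy_single(task: str) -> str:
    return "" if "hard" in task else "single:" + task


def toy_multi(task: str) -> str:
    return "multi:" + task


def toy_accept(task: str, answer: str) -> bool:
    return bool(answer)  # e.g. unit tests pass, or a verifier approves


easy = cascade("easy task", toy_single, toy_multi, toy_accept)
hard = cascade("hard task", toy_single, toy_multi, toy_accept)
```

The savings depend on how often `accept` passes on the first try and on how cheap and reliable the acceptance check itself is.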

C-MCTS: prune unsafe MCTS branches using an offline-trained safety critic to plan closer to safety limits

0.60

2

Pretraining a safety critic in a realistic simulator lets planners run faster and closer to safety limits with fewer violations, which reduces costly failures and improves mission rewards in safety-critical decision systems.

Key finding

C-MCTS achieves higher average rewards than CC-MCP on evaluated Rocksample instances.

Numbers: Rocksample(7,8): reward 11.0 vs 9.83; Rocksample(11,11): 7.14 vs 5.26 (Table 3)

WorldCoder: have an LLM write Python world models, plan with them, and learn much faster than deep RL

0.45

0.70

0.60

2

If environment interactions are expensive, having an LLM generate an inspectable Python simulator can cut trial costs by orders of magnitude and centralize expensive LLM calls into a one-time synthesis step.

Key finding

WorldCoder builds a Sokoban world model from very few interactions.

Numbers: ≈50 environment actions to build initial model (Sokoban)

A single LLM can role-play homogeneous multi-agent workflows and cut inference cost via KV-cache reuse

0.70

0.60

0.80

2

If your agent pipeline uses the same base LLM throughout, run it as one multi-turn LLM: you often keep accuracy while cutting API/token cost and simplifying the stack.

Key finding

A single LLM can match or slightly exceed homogeneous multi-agent performance on standard benchmarks.

Numbers: HumanEval: OneFlow multi-agent 91.6% → OneFlow single-agent 92.1% (Table 1)

Synthesize agent–environment trajectories and rewrite tasks (backward construction) to adapt LLM agents without human labels

0.70

0.60

0.70

2

You can adapt LLM agents to specific apps without costly human labels; synthesizing and indexing environment-specific interactions boosts accuracy and reduces run-time planning costs.

Key finding

In-context learning (ICL) with synthesized data improves Claude-3.5-Sonnet on OSWorld from 12.4 to 22.5.

Numbers: 12.4 → 22.5 (OSWorld, Claude ICL)

A reproducible Windows benchmark and baseline agent showing zero-shot multimodal agents still far from humans

0.60

0.50

0.60

2

WindowsAgentArena lets teams test desktop automation agents in real Windows apps and collect training/evaluation data quickly using cloud parallel runs, shortening iteration time and revealing real gaps between current models and human performance.

Key finding

Zero-shot multimodal baseline (Navi) reaches 19.5% task success on WindowsAgentArena.

Numbers: 19.5% success (Table 4, best config)

R-Judge: a human-curated benchmark (569 agent logs) that tests whether LLMs spot safety risks in agent interactions

0.40

0.70

0.45

2

Agent-enabled features can cause real harms (privacy, finance, physical); off-the-shelf LLMs often miss these risks, so companies must add a tuned safety monitor and human checks before letting agents act autonomously.

Key finding

R-Judge covers 569 agent interaction records across 5 categories and 27 scenarios with 10 risk types.

Numbers: 569 records; 5 categories; 27 scenarios; 10 risk types