Agent Pipelines Papers — Parsed & Scored for Practitioners

OpenHands: an open, sandboxed platform that lets LLM-based agents write, run, and browse code like software developers

0.65

0.60

0.70

7

OpenHands reduces the engineering work needed to run and compare LLM-driven developer agents by providing a sandboxed runtime, shared skills, and a benchmark harness under an MIT license, so teams can prototype agent integrations faster and more safely.

Key finding

A single generalist agent (CodeAct) performs competitively across software, web, and miscellaneous tasks without benchmark-specific prompt tuning.

Numbers: HumanEvalFix: 79.3% (CodeAct v1.5, gpt-4o); SWE-Bench Lite: 22–26% (CodeAct v1.8)

Combine MCTS + AI self-critique + offline DPO to train web agents that learn from search traces

0.60

0.70

5

Train web agents from their own search traces to get large, fast gains in task success without risky online RL; pairing a trained policy with online search gives near‑perfect results for structured web tasks.

Key finding

Agent Q plus test-time MCTS reaches human-level or better performance on WebShop.

Numbers: WebShop success: base 28.6% → Agent Q+MCTS 50.5% (human 50.0%)

Call web search, code execution and a 'Mind‑Map' memory agent to make LLMs do long, research-style reasoning

0.60

0.70

0.60

5

Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.

Key finding

Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, a gain of 14.4 percentage points over the base model.

Numbers: 23.8% (Agentic w/ DeepSeek-R1) vs 9.4% (DeepSeek-R1); +14.4 pp

TDAG: dynamically split complex tasks and auto-generate subagents to improve multi-step agent performance

0.60

0.50

5

TDAG reduces failure cascades and improves partial progress tracking, so agent-driven multi-step workflows are more reliable and auditable.

Key finding

TDAG achieves a higher average score on ItineraryBench than the baselines.

Numbers: TDAG avg 49.08 vs ReAct 43.02 (Table 2)

Multi-agent LLaMA 3 workflow matches expert prompts for detecting cognitive concerns in clinical notes

0.60

0.70

4

Automated agent pipelines can cut human prompt-tuning time and reach near-expert accuracy on clinical-note screening, lowering labor cost and speeding deployment in health systems.

Key finding

Agentic prompt AP2 reached F1-score 0.91 on the prompt-refinement dataset.

Numbers: F1 = 0.91 (Table 3)

Use multi-agent pipelines and OVON JSON handoffs to lower LLM hallucinations

0.50

0.60

0.40

3

Structured multi-agent review with short metadata handoffs reduces misleading AI outputs and makes speculative content overt—useful for customer-facing assistants, regulated domains, and brand safety.

Key finding

Multi-agent review reduced mean THS across 310 prompts from -0.004919 (front-end) to -0.139597 (third reviewer).

Numbers: THS mean: -0.004919 -> -0.139597

TOOLMAKER: agents that turn scientific GitHub repos into executable LLM tools

0.60

0.70

0.60

3

Automates converting published code into reusable tools, cutting expert setup time and enabling rapid integration of domain-specific methods into production agents at a low per-tool cost.

Key finding

TOOLMAKER implemented 80% of benchmark tools (12 of 15)

Numbers: 12/15 tools correct

Sum2Act: a router + state-manager pipeline that makes LLMs call many real APIs reliably

0.60

0.50

3

If your product needs reliable multi-step interactions with many third-party APIs (search, image tools, web services), a small router + summarizing state manager can boost success and reduce repeated failures with little engineering overhead.

Key finding

Sum2Act raises average Pass Rate to 70.0% on ToolBench using ChatGPT

Numbers: Pass Rate avg: Sum2Act 70.0% vs DFSDT 67.0% vs ReAct 41.1%

PRISM: jointly optimize Exploration, Information, and Aggregation for cheaper, more reliable multi-agent LLM reasoning

0.60

0.70

0.80

2

PRISM shows you can get higher accuracy and much better token-cost efficiency by coordinating small models with execution grounding and evidence-based synthesis, rather than only buying bigger models.

Key finding

PRISM outperforms all tested multi-agent baselines on four benchmarks.

Numbers: GSM8K +1.3pp; AIME +6.6pp; MBPP +6.6pp; BFCL-SP +3.5pp (vs runner-up)

Use an agent + MCTS to explore a knowledge base, auto-label data, and cut labeled-data needs for KBQA

0.70

0.75

0.50

1

If you must run KBQA with limited labeled data, KBQA-o1 lets you use open LLMs and automated MCTS exploration to generate high-quality training pairs and sharply improve accuracy on complex queries.

Key finding

KBQA-o1 dramatically improves low-resource GrailQA accuracy versus prior methods.

Numbers: GrailQA F1 78.5% (Llama-3.1-8B) vs 48.5% prior low-resource best (GPT-3.5-turbo)

Teach LLM agents by learning from their failed runs: collect failure trajectories, make failure-vs-success pairs, and fine-tune via DPO.

0.60

0.50

1

If you run LLM agents in interactive tasks, adding a cheap offline loop that learns from the agent's failure cases can boost task reward and robustness to unseen variations without heavy online RL.

Key finding

ETO improves average reward over SFT across agent benchmarks.

Numbers: WebShop: 63.1 → 67.4 avg reward (Table 2)
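
The failure-vs-success pairing step can be sketched roughly as below. This is an illustration under assumptions, not the paper's implementation: the `Trajectory` record, its fields, and the `build_preference_pairs` helper are hypothetical names, and the sketch simply contrasts each task's highest-reward run with every lower-reward run. The prompt/chosen/rejected layout is a commonly used shape for DPO training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one agent run on one task."""
    task_id: str
    text: str      # serialized actions and observations
    reward: float  # final task reward


def build_preference_pairs(trajectories, min_gap=0.0):
    """Pair each task's best run (chosen) with every lower-reward run (rejected)."""
    by_task = {}
    for traj in trajectories:
        by_task.setdefault(traj.task_id, []).append(traj)

    pairs = []
    for task_id, runs in by_task.items():
        best = max(runs, key=lambda t: t.reward)
        for traj in runs:
            # Keep only pairs whose reward gap exceeds min_gap.
            if best.reward - traj.reward > min_gap:
                pairs.append(
                    {"prompt": task_id, "chosen": best.text, "rejected": traj.text}
                )
    return pairs


# Made-up runs for illustration only.
runs = [
    Trajectory("buy-shoes", "search -> click -> buy", 1.0),
    Trajectory("buy-shoes", "search -> buy wrong size", 0.4),
    Trajectory("buy-shoes", "search -> give up", 0.0),
    Trajectory("book-table", "only one run", 0.7),
]
pairs = build_preference_pairs(runs)
for pair in pairs:
    print(pair)
```

Tasks with a single run, or with no reward gap, yield no pairs, so they contribute nothing to the preference dataset in this sketch.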

AI agents boost capabilities but multiply inference cost, latency variance, and datacenter power needs.

0.30

0.60

0.90

1

AI agents can raise per-query compute and energy use by tens to hundreds of times, driving much higher cloud costs and datacenter power needs; without cost-aware designs, agent features can become economically and environmentally unsustainable.

Key finding

Agentic systems issue many more LLM calls per request than single-turn models.

Numbers: Agents average 9.2× more LLM calls; LATS averages 71 LLM calls/request.
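
A back-of-envelope calculation shows how the call counts above translate into cost. The token budget and price below are placeholder assumptions, not figures from the paper, and the sketch treats a single-turn model as one LLM call and assumes every call uses the same number of tokens. Real agents often grow their context across calls, so this likely understates the gap.

```python
def cost_per_request(llm_calls, tokens_per_call, usd_per_1k_tokens):
    """Cost of one user request if every LLM call uses the same token budget."""
    return llm_calls * tokens_per_call * usd_per_1k_tokens / 1000.0


TOKENS_PER_CALL = 1500     # assumption, not from the paper
USD_PER_1K_TOKENS = 0.002  # assumption, not from the paper

single_turn = cost_per_request(1, TOKENS_PER_CALL, USD_PER_1K_TOKENS)
avg_agent = cost_per_request(9.2, TOKENS_PER_CALL, USD_PER_1K_TOKENS)
lats = cost_per_request(71, TOKENS_PER_CALL, USD_PER_1K_TOKENS)

for name, cost in [("single-turn", single_turn), ("average agent", avg_agent), ("LATS", lats)]:
    print(f"{name}: ${cost:.4f} per request ({cost / single_turn:.1f}x single-turn)")
```

Under these assumptions the cost ratio equals the call-count ratio, which is why capping calls per request is the most direct cost lever.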

Flow: make multi-agent LLM workflows modular, run subtasks in parallel, and update the plan while running

0.60

0.70

0.50

1

Flow raises automation reliability by making plans modular and fixable at runtime; that means fewer complete failures and higher deliverable quality, though updates add compute and API cost.

Key finding

Flow achieves much higher overall task success across three coding tasks compared to baselines.

Numbers: Flow avg success rate 93% vs AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7% (Tables 1–3)

Use Shapley values to explain and pick the best component mix for AI agent workflows

0.60

0.65

0.60

1

ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.

Key finding

ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.

Numbers: E-commerce optimal accuracy 43.31%; ATP (theorem proving) optimal 86.79%
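
To make the attribution idea concrete, here is a minimal sketch of exact Shapley values over the four components named above. It is not ShapleyFlow's implementation: `toy_accuracy` is a hypothetical value function with made-up numbers standing in for "workflow accuracy when this subset of components is upgraded", and exact enumeration is only practical for a handful of components (4 components means 24 orderings).

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}


COMPONENTS = ["planning", "reasoning", "action", "reflection"]


def toy_accuracy(upgraded):
    """Hypothetical accuracy when the components in `upgraded` use a stronger model."""
    score = 0.40
    score += 0.10 * ("planning" in upgraded)
    score += 0.20 * ("reasoning" in upgraded)
    score += 0.05 * ("action" in upgraded)
    # Assumed interaction: reflection only helps when planning is also upgraded.
    if "planning" in upgraded and "reflection" in upgraded:
        score += 0.08
    return score


phi = shapley_values(COMPONENTS, toy_accuracy)
for name, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {contribution:+.3f}")
```

The values sum to the gap between the fully upgraded and the baseline workflow, and the interaction bonus is split evenly between planning and reflection, which is the property that makes Shapley values useful for deciding where to spend an upgrade budget.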

LLM agents that iteratively teach themselves to write ML library code for new hardware languages

0.60

0.70

1

This system can speed up early ML library development for new hardware by automating functional implementations from few examples, lowering time-to-ship and expert-hours spent.

Key finding

The adaptive agentic system solved most benchmark tasks and greatly increased pass rates.

Numbers: Pass@n up to 0.96 (96%) on the benchmark

Use small local LLMs to separate true SDG contributions from incidental keyword mentions

0.40

0.50

0.60

1

Universities and research managers can avoid inflated SDG counts from keyword hits and make funding, ranking, and reporting decisions based on substantively relevant work.

Key finding

Small local LLMs can distinguish substantive SDG contributions from superficial mentions in abstracts.

LLM4AD — a Python platform that lets LLMs be used inside search loops to design and evaluate algorithms

0.60

0.50

1

LLM4AD reduces engineering friction for using LLMs to generate and test algorithms by packaging search loops, safe execution, and task templates into a single Python toolkit and GUI.

Key finding

Pairing LLMs with search methods (EoH, FunSearch, (1+1)-EPS) outperforms random sampling on most tasks.

Numbers: Benchmarks on 9 tasks, 3 independent runs; convergence plots in Fig.3

Survey: how machine learning, LLMs, and agents are reshaping operating systems and the OS stack

0.45

0.55

0.60

1

AI techniques can reduce tail latency, improve throughput, lower storage errors, and cut datacenter costs, but require guardrails and staged deployment to avoid regressions and privacy risks.

Key finding

Lightweight ML in the kernel can sharply improve I/O predictability and throughput.

Numbers: LinnOS: up to 40% lower I/O latency; up to 3× throughput under contention

Turn SHAP attributions into a reusable knowledge base and train an LLM to reason with it for more accurate, auditable sarcopenia diagnosis.

0.40

0.65

0.50

0

CANDLE shows a practical path to combine stable, auditable ML explanations with LLM reasoning. That improves accuracy, produces human-readable rationales, and builds a reusable knowledge asset (ACPB + DKB) that can reduce repeated expensive explainability computations and speed downstream inference.

Key finding

CANDLE increased overall accuracy compared to the XGBoost baseline.

Numbers: Accuracy: CANDLE (LLM with DKB) 79.3% vs XGBoost 73.3% (+6.0 percentage points) (Table 1).

Auto-generate simulator-validated PFDs and PIDs to move AI-discovered chemicals to production

0.60

0.70

0.60

0

Auto-generating simulator-validated PFDs/PIDs moves molecule discoveries toward manufacturability earlier, cutting manual engineering time, reducing late-stage rework, and accelerating commercialization decisions.

Key finding

A GraphRAG + feedback setup substantially improves zero-shot helpfulness and correctness.

Numbers: Zero-shot reward model score 'approaching 3.0' (0–4 scale) with GraphRAG+feedback on 1.5K benchmark.

Design the lakehouse for agents first: solve concurrent runs with branching + isolated functions, and governance follows.

0.60

0.50

0

If you let agents mutate your lakehouse without transactional guarantees and runtime isolation, they can corrupt production data or leak secrets. Building a small, enforceable run API and sandboxed functions reduces risk and makes governance feasible.

Key finding

Multi-node pipelines need atomic commits across tables, not per-table transactions.
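
The difference between per-table transactions and an atomic multi-table commit can be illustrated with a toy in-memory catalog. Everything here is a hypothetical sketch (the `Catalog` class, `run_on_branch`, and the step functions are invented for illustration and are not the paper's API): a run mutates an isolated branch, and all tables are published together only if every step succeeds.

```python
import copy


class Catalog:
    """Toy in-memory catalog: table name -> list of rows."""

    def __init__(self, tables):
        self.tables = tables

    def run_on_branch(self, steps):
        """Run all steps on an isolated copy, then publish every table at once."""
        branch = copy.deepcopy(self.tables)  # isolated working copy
        for step in steps:
            step(branch)                     # any exception aborts before publishing
        self.tables = branch                 # single swap: all tables or none


def load_orders(tables):
    tables["orders"].append({"id": 2, "amount": 30})


def update_totals(tables):
    tables["totals"] = [{"sum": sum(row["amount"] for row in tables["orders"])}]


def broken_step(tables):
    raise RuntimeError("simulated failure in a downstream node")


catalog = Catalog({"orders": [{"id": 1, "amount": 10}], "totals": [{"sum": 10}]})

# Successful run: both tables change together.
catalog.run_on_branch([load_orders, update_totals])

# Failed run: the first step's write to `orders` is discarded with the branch.
try:
    catalog.run_on_branch([load_orders, broken_step])
except RuntimeError:
    pass

print(catalog.tables)
```

With per-table transactions, the failed run would have left `orders` updated and `totals` stale; the branch-then-swap pattern keeps the two consistent.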

Agentic AI falls short when key signals hide inside images

0.40

0.50

0.60

0

Automated agentic pipelines that ignore images or other non-tabular signals can miss high-value predictive cues. For insurance, property, and other fields where visuals matter, adding simple image-processing steps can substantially improve forecasts and pricing.

Key finding

Generic agentic AI that uses only tabular features achieves low predictive ranking performance.

Numbers: Normalized Gini = 0.3823 (Agentic AI; Random Forest on tabular only)
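
For readers unfamiliar with the metric, the sketch below computes a normalized Gini coefficient using the definition common in insurance modeling: the Gini of the model's ranking divided by the Gini of a perfect ranking. The summary does not state the paper's exact formula, so treat this as an assumption; the sample data is made up.

```python
def gini(actual, predicted):
    """Gini of `actual` when rows are ranked by `predicted`, highest first."""
    n = len(actual)
    # Sort row indices by predicted score descending; ties keep original order.
    order = sorted(range(n), key=lambda i: (-predicted[i], i))
    total = sum(actual)
    cumulative = 0.0
    area = 0.0
    for i in order:
        cumulative += actual[i]
        area += cumulative / total
    area -= (n + 1) / 2.0
    return area / n


def normalized_gini(actual, predicted):
    """Scale so that a perfect ranking scores 1.0."""
    return gini(actual, predicted) / gini(actual, actual)


# Made-up example: 1 marks a policy with a claim.
claims = [0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.8, 0.9]      # ranks both claims first
reversed_scores = [0.9, 0.8, 0.2, 0.1]  # ranks both claims last

print(normalized_gini(claims, good_scores))
print(normalized_gini(claims, reversed_scores))
```

A value near 0 means the ranking is close to random, so 0.3823 indicates a model that separates risks only weakly.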

Use a permissioned blockchain to audit and gate multi-agent AI decisions in real time

0.60

0.50

0

Adds a tamper-proof policy gate and audit trail to autonomous AI decisions. That reduces risk and supports compliance in healthcare, smart-city, and enterprise automation while keeping response times within seconds.

Key finding

The blockchain-governed pipeline completes a decision cycle in under two seconds on average.

Numbers: Mean = 1.82 s (50 trials); 95% CI [1.78, 1.86] s
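
A mean with a 95% confidence interval like the one reported can be computed from raw trial timings as sketched below. The summary does not say which method the paper used; this sketch assumes a normal approximation (z = 1.96), and the sample timings are made up. With 50 trials, a Student-t critical value (about 2.01 for 49 degrees of freedom) would widen the interval only slightly.

```python
import math
import statistics


def mean_ci95(samples, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem


# Made-up cycle times in seconds, for illustration only.
timings = [1.78, 1.80, 1.82, 1.84, 1.86]
mean, lower, upper = mean_ci95(timings)
print(f"mean = {mean:.2f} s, 95% CI [{lower:.3f}, {upper:.3f}] s")
```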

A two-stage agent pipeline that turns raw tables into vetted charts and a publication-ready narrative report.

0.50

0.40

0

Automates the end-to-end path from raw tables to a polished report, saving analyst time on repetitive chart creation, basic QA, and first-draft narrative. The system produces multiple scored insight candidates so teams can select defensible findings instead of relying on a single model output.

Key finding

The Insight Generator creates multiple alternatives and delivers a small set of vetted insights.

Numbers: Produces 5–7 candidate insights per chart; returns top 3 per chart after scoring.
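
The generate-then-select step can be sketched as a simple top-k selection over scored candidates. The summary does not describe the scoring function, so the scores below are placeholders, and `select_top_insights` is a hypothetical helper, not part of the described system.

```python
import heapq


def select_top_insights(candidates, k=3):
    """Return the k highest-scoring candidates, best first."""
    return heapq.nlargest(k, candidates, key=lambda c: c["score"])


# Placeholder candidates for one chart; scores are made up.
candidates = [
    {"text": "Sales peak in Q4", "score": 0.91},
    {"text": "Region B lags Region A", "score": 0.74},
    {"text": "Returns correlate with discounts", "score": 0.62},
    {"text": "Weekday orders dominate", "score": 0.55},
    {"text": "New customers spend less", "score": 0.80},
    {"text": "Seasonality is weak", "score": 0.31},
]

top = select_top_insights(candidates)
for insight in top:
    print(f'{insight["score"]:.2f}  {insight["text"]}')
```

Keeping the rejected candidates and their scores alongside the top three makes the selection auditable when a reviewer asks why a finding was left out.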