Workflow Agents Papers — Parsed & Scored for Practitioners

Survey of how large language models power modern video understanding (taxonomies, benchmarks, gaps)

0.70

0.60

0.70

6

Vid-LLMs let products summarize, answer questions about, and index video automatically; adopting them can cut manual review costs and unlock search and recommendation features across large video catalogs.

Key finding

LLM-based video models now match or exceed many traditional systems on dense captioning benchmarks.

Numbers: ActivityNet CIDEr: Streaming GIT 41.2 (Table IV)

Use an LLM to write a structured audio script, compile it to code, and run specialist audio models to generate narrated, mixed audio scenes

0.60

0.70

0.50

6

WavJourney turns natural language briefs into finished mixed audio by chaining existing specialist models, reducing the need to build large unified audio models and enabling faster prototyping of audio content.

Key finding

WavJourney beats AudioGen and AudioLDM in human subjective scores on AudioCaps.

Numbers: OVL 3.75 vs AudioGen 3.56; REL 3.74 vs 3.52

Survey: How LLMs are being used across the full scientific research cycle

0.60

0.70

0.75

5

LLMs can speed idea generation, automate parts of experiment workflows, and draft or pre-check manuscripts—reducing time-to-insight but requiring verification steps to avoid costly mistakes.

Key finding

LLMs are being applied at four reproducible stages of research: hypothesis discovery, experiment planning/implementation, writing, and peer review.

AutoFlow: automatically generate readable natural‑language workflows so LLM agents solve complex tasks with less human work

0.50

0.60

4

AutoFlow reduces manual workflow design time by automatically producing readable, executable agent workflows and can raise task performance on image/text benchmarks, lowering operational cost for multi‑step agent tasks.

Key finding

AutoFlow improves average OpenAGI score compared to manual CoRE when Mixtral is the interpreter and GPT‑4 is the generator.

Numbers: avg 0.3597 vs 0.2483 (Δ +0.1114, +44.9%)
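The delta and relative gain quoted above follow directly from the two averages. A minimal sketch of that arithmetic, assuming the relative figure is measured against the manual CoRE baseline (the helper name `gain` is illustrative and not from the paper):

```python
def gain(new: float, baseline: float) -> tuple[float, float]:
    """Return (absolute difference, relative change in percent of baseline)."""
    delta = new - baseline
    return delta, 100.0 * delta / baseline

# Averages quoted in the card: AutoFlow vs manual CoRE
delta, rel = gain(0.3597, 0.2483)
print(f"delta = {delta:+.4f}, relative = {rel:+.1f}%")
```

The same helper applies to any "abs / rel" pair reported elsewhere in this list.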

Orchestrated Distributed Intelligence (ODI): orchestrate multiple AI agents with humans to turn systems of record into systems of action

0.40

0.65

0.70

3

Orchestrated agentic AI turns data repositories into real-time decision systems, boosting productivity, reducing manual work, and enabling strategic agility when paired with governance and change management.

Key finding

Many US economic processes remain heavily manual, creating room for AI-driven automation.

Numbers: Nearly 50% of US GDP involves processes with up to 90% manual labor

MedAgentBench: a realistic FHIR-based EHR playground and 300-task benchmark for medical LLM agents

0.30

0.60

3

MedAgentBench provides a practical testbed to measure LLM agents on real EHR tasks so teams can benchmark readiness before risky EHR integration.

Key finding

The top-performing model achieved substantial but imperfect task success.

Numbers: Claude 3.5 Sonnet v2 overall SR = 69.67%

AIDE: an LLM-driven agent that searches the space of code via tree search to automate ML engineering and beat many human baselines

0.60

0.50

3

AIDE can automate the repetitive trial-and-error of ML engineering, producing competitive models faster and often cheaper than manual work or traditional AutoML when you have LLM API access.

Key finding

On 16 tabular Kaggle tasks (Weco-Kaggle Lite), AIDE outperformed about half of human competitors on average.

Numbers: Exceeds % of humans = 51.38%; Above Median = 50.0%

UrbanKGent: an LLM agent that builds city-scale knowledge graphs cheaper and more accurately using geospatial tools

0.70

0.65

0.80

3

UrbanKGent lets teams build large, practical city knowledge graphs with small open models, cutting inference costs roughly 20× and lowering data needs, so you can deploy KG-driven city apps faster and cheaper.

Key finding

Fine-tuned UrbanKGent-13B outperforms GPT-4 on UrbanKGC accuracy on evaluated datasets.

Numbers: NYC: about +15% (RTE) and about +14% (KGC) accuracy vs GPT-4 on evaluated splits

A 30-task benchmark that tests agents on end-to-end ML development workflows

0.40

0.50

0.60

2

Agents can automate well-scoped ML tasks (data prep, basic debugging) today but fail at open-ended model improvement; firms should use agents to speed routine work and keep human oversight for strategic model changes.

Key finding

OpenHands (Claude Sonnet) achieved the highest overall success rate.

Numbers: 50% (15/30 tasks)

QualityFlow: use an LLM 'imagined execution' checker to keep correct code and reach SOTA on code benchmarks

0.70

0.50

2

Adding a reliable LLM-based checker can raise correct-code yield substantially and reduce wasted debugging cycles; this reduces human review time and increases throughput for code-synthesis features.

Key finding

QualityFlow reaches 94.2% pass@1 on MBPP with the Sonnet LLM, 4.8 percentage points above the prior reported SOTA.

Numbers: MBPP pass@1 = 94.2% (QualityFlow Sonnet); prior SOTA 89.4% (Table 2)

Agentic copilot that converts natural language into P&ID DEXPI XML and Visio drawings

0.40

0.60

2

Automating P&ID creation cuts manual drafting time and improves auditability by producing interoperable DEXPI XML and editable Visio drafts.

Key finding

ACPID achieves much higher soundness than single-pass GPT-4-Turbo.

Numbers: ACPID 96.96% vs Zero-shot 58.33% and Few-shot 65.90%

CoSQA+: a large multi-choice code-search dataset built by test-driven agent annotations (412k pairs, agent accuracy 93.9%)

0.70

0.65

0.75

2

CoSQA+ reduces expensive human labeling by using test-driven agents to create a large, functionally verified multi-choice code-search dataset that improves retrieval performance and cuts annotation cost to roughly $0.00062 per sample.

Key finding

CoSQA+ provides 412,080 labeled query–code pairs.

Numbers: 412,080 pairs; 132,952 unique codes

Agentic flows create 25M synthetic instruction pairs to teach skills and boost a 7B model across many benchmarks

0.60

0.70

0.60

2

Agentic flows automate creation of large, diverse instruction data from raw web/code sources, enabling faster model skill updates without manual prompt engineering or heavy labeling.

Key finding

AgentInstruct produced roughly 25.8 million instruction–response pairs used for post-training.

Numbers: ≈25.8M paired instructions (22M agentic + 3.8M external)

PlotEdit: five LLM agents edit chart images from plain English, improving fidelity and accessibility

0.70

0.60

0.45

1

PlotEdit turns static chart images in PDFs into editable, high-fidelity charts using natural language, speeding up content updates and improving accessibility for visually impaired users.

Key finding

PlotEdit produces more faithful edited charts than prior methods on ChartCraft.

Numbers: Overall SSIM 89.0 vs ChartReformer 82.4 (Table 1)

Practical blueprint for making enterprise APIs 'agent-ready' for autonomous AI agents

0.50

0.60

1

If you plan to let AI agents use your APIs, you must redesign endpoints, headers, and governance now to avoid outages, security gaps, and surprise costs.

Key finding

Traditional REST/GraphQL/gRPC APIs are poorly matched to autonomous, iterative agent behavior.

Harmonia: an LLM-driven agent that interactively builds reproducible data harmonization pipelines

0.40

0.60

0.50

1

Agentic harmonization can speed up combining heterogeneous datasets and produce reusable, publishable transformation scripts that improve reproducibility and reduce manual engineering time.

Key finding

Harmonia produced perfect schema-matching on the evaluated use case.

Numbers: Schema accuracy Harmonia=1.00 vs Baseline=0.88

How LLM-based coding agents must earn developer trust to be useful

0.50

0.60

1

AI coding agents can cut developer time but only if they earn developer trust through verifiable outputs, provenance, and integrated review processes.

Key finding

Developer trust, not raw generation skill, is the main barrier to widespread adoption of AI software engineers.

AI Scientists can generate ideas but routinely fail to implement and verify them

1.00

0.60

0.50

1

AI-generated research outputs are often non-executable or non-reproducible; businesses using "AI Scientists" must invest in execution pipelines, human verification, and testing to avoid wasted spend and flawed products.

Key finding

Top LLMs fail basic experiment execution on real-paper replication tasks.

Numbers: PaperBench execution 1.8% (Claude 3.5 Sonnet)

Flow: make multi-agent LLM workflows modular, run subtasks in parallel, and update the plan while running

0.60

0.70

0.50

1

Flow raises automation reliability by making plans modular and fixable at runtime; that means fewer complete failures and higher deliverable quality, though updates add compute and API cost.

Key finding

Flow achieves much higher overall task success across three coding tasks compared to baselines.

Numbers: Flow avg success rate 93% vs AutoGen 66.7% / MetaGPT 71% / CAMEL 48.7% (Tables 1–3)
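Running independent subtasks in parallel can be pictured as executing a dependency graph batch by batch. The sketch below is not Flow's implementation: the task names, the `run_workflow` helper, and the placeholder `run_task` are hypothetical, and it does not cover updating the plan at runtime. It relies only on the standard-library `graphlib.TopologicalSorter` and a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_workflow(deps, run_task):
    """Run tasks in dependency order; tasks whose prerequisites are done run together.

    deps maps each task to the set of tasks it depends on.
    Returns the batches in the order they were executed.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = list(sorter.get_ready())
            list(pool.map(run_task, ready))  # run the batch concurrently, wait for all
            sorter.done(*ready)
            batches.append(sorted(ready))
    return batches

# Hypothetical coding workflow: two subtasks depend only on the design step
DEPS = {
    "backend": {"design"},
    "frontend": {"design"},
    "tests": {"backend", "frontend"},
}

def run_task(name):
    return f"done:{name}"  # placeholder for an agent call

batches = run_workflow(DEPS, run_task)
print(batches)
```

Here `backend` and `frontend` land in the same batch because neither depends on the other; a scheduler that releases each task as soon as its own prerequisites finish would overlap more work than this batch-at-a-time loop.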

Use Shapley values to explain and pick the best component mix for AI agent workflows

0.60

0.65

0.60

1

ShapleyFlow helps you decide which component (planning, reasoning, action, reflection) to upgrade for a specific workflow, so you spend compute and engineering budget where it yields the largest accuracy or reward gains.

Key finding

ShapleyFlow discovers task-specific optimal workflows that outperform single-LLM baselines.

Numbers: E-commerce optimal accuracy 43.31%; ATP (theorem proving) optimal 86.79%
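A Shapley value credits each component with its average marginal contribution over all subsets of the other components. The sketch below computes exact Shapley values for the four component types named above; the score table, the synergy bonus, and the function names are made-up illustrations, not figures or code from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a score."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical accuracy gain from upgrading each component on its own
UPGRADE_GAIN = {"planning": 0.10, "reasoning": 0.20, "action": 0.05, "reflection": 0.02}

def toy_value(upgraded):
    """Toy workflow accuracy for a given set of upgraded components."""
    score = 0.40 + sum(UPGRADE_GAIN[c] for c in upgraded)
    if {"planning", "reasoning"} <= upgraded:
        score += 0.06  # assumed synergy when both are upgraded
    return score

phi = shapley_values(list(UPGRADE_GAIN), toy_value)
for name, credit in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {credit:.3f}")
```

In this toy setup the synergy bonus is split evenly between planning and reasoning, and the credits sum to the gap between the fully upgraded and the baseline workflow. Exact enumeration grows exponentially with the number of components, so it is only practical for small component sets like this one.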

CRAG‑MoW: a multi-agent, self‑corrective RAG system that benchmarks open LLMs on chemical search

0.60

0.70

0.60

1

You can build a domain‑focused, multi‑agent retrieval+LLM pipeline from open models whose LLM‑judged output quality approaches that of closed SOTA (GPT‑4o) while increasing preference rate; this reduces vendor lock‑in and enables on‑prem, interpretable retrieval and model selection.

Key finding

CRAG‑MoW workflows reach LLM‑Judge scores close to GPT‑4o on evaluated tasks.

Numbers: CRAG‑MoWs 7.12 vs GPT‑4o 7.59 (LLM‑Judge average, 1–10)

Teach agents reusable web workflows from past traces to boost web-navigation success

0.60

0.65

0.45

1

Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.

Key finding

AWM raises overall success rate on WebArena versus a strong autonomous baseline.

Numbers: AWM 35.5 SR vs BrowserGym 23.5 SR; +12.0 abs (+51.1% rel)

DSBench: a realistic benchmark testing data‑science agents on ModelOff and Kaggle tasks

0.30

0.40

0.50

1

Realistic data science tasks expose that current agents often fail or produce weak models; businesses should treat agent outputs as assistive drafts and invest in execution environments and verification.

Key finding

Top agent solves only about one third of data‑analysis questions.

Numbers: Task-level accuracy 34.12% (AutoGen + GPT-4o)

Combine multiple OCR engines + two LLMs and pick the best JSON by majority voting to boost invoice OCR accuracy and speed

0.50

0.60

1

Combining multiple OCR engines with LLM-based JSON conversion and majority voting can cut extraction errors and improve throughput for invoice automation, reducing manual fixes and speeding up batch processing.

Key finding

LMV-RPA achieved higher extraction accuracy than the baseline.

Numbers: 99% vs 94% (reported on 100 invoices)
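Majority voting over candidate JSON outputs can be sketched as follows. This is an illustration under stated assumptions, not the LMV-RPA implementation: it votes on whole documents rather than per field, skips malformed candidates, and breaks ties in favor of the candidate seen first. The invoice fields and the `majority_vote` name are hypothetical.

```python
import json
from collections import Counter

def majority_vote(candidates):
    """Return the parsed JSON document that the most candidates agree on."""
    canonical = []
    for text in candidates:
        try:
            # Sorting keys makes documents that differ only in key order compare equal
            canonical.append(json.dumps(json.loads(text), sort_keys=True))
        except json.JSONDecodeError:
            continue  # skip malformed output
    if not canonical:
        raise ValueError("no valid JSON candidates")
    winner, _ = Counter(canonical).most_common(1)[0]
    return json.loads(winner)

# Hypothetical outputs from different OCR + LLM combinations
candidates = [
    '{"invoice_no": "A-17", "total": 120.5}',
    '{"total": 120.5, "invoice_no": "A-17"}',
    '{"invoice_no": "A-17", "total": 120.5}',
    'not json at all',
]
best = majority_vote(candidates)
print(best)
```

Voting per field instead of per document would tolerate candidates that each get a different field wrong, at the cost of possibly assembling a record that no single engine produced.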