Train LLM-based agents end-to-end with RL and let them ask humans for help

Overview

Decision Snapshot: Ready For Pilot

AGILE is a pragmatic system design: the LLM acts as a token-policy, an executor runs functions, and PPO tunes advice/tool decisions. Experiments on three datasets and ablations support the claims.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AGILE lets production agents learn when to call humans and when to act, improving accuracy while controlling human cost. That makes it practical for customer support, medical QA, and recommendation systems where mistakes are costly.

Who Should Care

Product Manager, ML Engineer, CTO, Data Scientist

Summary TLDR

AGILE is a practical framework that treats a large language model (LLM) as a policy in a token-level reinforcement learning (RL) loop. The system combines an LLM, a retrievable memory store, tool calls (search/SQL), an executor that runs function tokens, and a new proactive "seek advice" action that asks humans when needed. Trained with imitation learning followed by PPO, AGILE improves accuracy and reduces costly human queries across three QA tasks. The authors also release ProductQA, an 88k e-commerce QA dataset designed to test memory, tool use, reflection and human interaction.

Problem Statement

Building useful LLM agents needs many components (memory, tools, reflection, human help). Existing benchmarks and training methods do not jointly optimize all parts end-to-end. The paper asks: can we train an agent that coordinates LLM reasoning, tool calls, memory, and when to ask humans using RL?

Main Contribution

An end-to-end RL framework (AGILE) where the LLM is the policy and function names are actions executed by an external executor.

A new proactive human-advice action that the agent can learn to invoke to trade off accuracy vs. human cost.
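The executor pattern in the two contributions above can be illustrated with a minimal sketch. It assumes hypothetical function-token names (`[SearchProduct]`, `[SeekAdvice]`, `[PredictAnswer]`) and uses a hand-written `scripted_policy` as a stand-in for the LLM. None of these names come from the source, and in AGILE the policy is a trained LLM rather than fixed rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical function-token names (assumption, not taken from the paper).
SEARCH = "[SearchProduct]"
SEEK_ADVICE = "[SeekAdvice]"
ANSWER = "[PredictAnswer]"


@dataclass
class Executor:
    """Runs function tokens emitted by the policy and records the results."""
    tools: Dict[str, Callable[[str], str]]
    ask_human: Callable[[str], str]
    context: List[str] = field(default_factory=list)
    advice_calls: int = 0

    def step(self, action: str, argument: str) -> str:
        if action == SEEK_ADVICE:
            self.advice_calls += 1  # each human query would carry a cost
            result = self.ask_human(argument)
        elif action in self.tools:
            result = self.tools[action](argument)
        else:
            raise ValueError(f"unknown function token: {action}")
        self.context.append(f"{action} -> {result}")
        return result


Policy = Callable[[List[str]], Tuple[str, str]]


def run_episode(policy: Policy, executor: Executor, question: str,
                max_steps: int = 8) -> str:
    """Alternate between policy decisions and executor effects until an answer."""
    executor.context.append(f"Question: {question}")
    for _ in range(max_steps):
        action, argument = policy(executor.context)
        if action == ANSWER:
            return argument
        executor.step(action, argument)
    return ""  # no answer produced within the step budget


def scripted_policy(context: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM: search first, ask a human if the search is empty."""
    last = context[-1]
    if last.startswith("Question:"):
        return SEARCH, last[len("Question: "):]
    if last.startswith(SEARCH) and last.endswith("-> "):
        return SEEK_ADVICE, context[0]
    return ANSWER, last.split("-> ", 1)[1]


if __name__ == "__main__":
    ex = Executor(tools={SEARCH: lambda q: ""}, ask_human=lambda q: "blue")
    print(run_episode(scripted_policy, ex, "What color is it?"), ex.advice_calls)
```

Keeping the executor separate from the policy means the set of actions (tools, advice) can change without touching the model interface, which is one plausible reason for the design.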

Key Findings

AGILE (agile-vic13b-ppo) achieves a higher average total score on ProductQA than the GPT-4 agent.

Numbers: Total score (short answers) 0.784 vs agile-gpt4-prompt 0.718; +9.2% rel. (Table 4)

Practical Use: Using RL to tune the whole agent (policy + module calls) can deliver practical accuracy gains over prompting GPT-4 in agent form; try two-stage SFT→PPO for agent policies.

Evidence Ref: Table 4

On MedMCQA, AGILE raised accuracy from the base model's 53.4% to 85.2% while seeking advice on 31.6% of cases.

Numbers: Meerkat-7b-prompt 0.534 → agile-mek7b-ppo 0.852; advice rate 31.6% (Table 6)

Practical Use: Allowing selective human advice plus RL dramatically boosts accuracy on high-stakes domains; set an advice cost to control human workload.

Evidence Ref: Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ProductQA — agile-vic13b-ppo total score (short answers)	0.784	agile-gpt4-prompt 0.718	+9.2% relative	ProductQA (avg over 6 test groups)	Table 4 shows total score 0.784 for agile-vic13b-ppo vs 0.718 for agile-gpt4-prompt	Table 4
MedMCQA accuracy (agile-mek7b-ppo)	0.852	Meerkat-7b-prompt 0.534; gpt4-Medprompt 0.791	+31.8 points vs base; +6.1 points vs gpt4-Medprompt	MedMCQA dev	Table 6 reports accuracy 0.852 (agile-mek7b-ppo) and base 0.534	Table 6
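The deltas in this table follow two conventions: the ProductQA figure is a relative change, while the MedMCQA figures are absolute differences in accuracy. The short check below recomputes them from the values reported in the table; the helper names `relative_gain` and `point_gain` are illustrative only.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


def point_gain(new: float, base: float) -> float:
    """Absolute difference between two proportions, in percentage points."""
    return (new - base) * 100.0


# ProductQA total score: agile-vic13b-ppo vs agile-gpt4-prompt (Table 4)
print(f"{relative_gain(0.784, 0.718):.1f}")  # 9.2
# MedMCQA accuracy: agile-mek7b-ppo vs Meerkat-7b-prompt (Table 6)
print(f"{point_gain(0.852, 0.534):.1f}")     # 31.8
# MedMCQA accuracy: agile-mek7b-ppo vs gpt4-Medprompt (Table 6)
print(f"{point_gain(0.852, 0.791):.1f}")     # 6.1
```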

What To Try In 7 Days

Prototype an executor that exposes a few tool calls (search/SQL) and a small memory store and wrap your LLM to emit function tokens.

Train quickly with imitation data (SFT) and run a short PPO fine-tune to learn an advice-cost trade-off.

Use a small advice cost and measure advice rate vs. accuracy to set human-in-the-loop budgets.
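For the last step above, a small helper can summarize logged episodes into advice rate, accuracy, and a cost-adjusted score. This is a sketch under a stated assumption: the total score is taken to be accuracy minus advice cost times advice rate, which is a common way to write such a reward but is not spelled out in this summary. The `Episode` type, the `summarize` function, and the cost value 0.3 are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Episode:
    asked_advice: bool  # did the agent query a human in this episode?
    correct: bool       # was the final answer correct?


def summarize(episodes: Iterable[Episode], advice_cost: float) -> Dict[str, float]:
    """Aggregate logged episodes into advice rate, accuracy, and total score.

    Assumption: total_score = accuracy - advice_cost * advice_rate.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    advice_rate = sum(e.asked_advice for e in episodes) / n
    accuracy = sum(e.correct for e in episodes) / n
    return {
        "advice_rate": advice_rate,
        "accuracy": accuracy,
        "total_score": accuracy - advice_cost * advice_rate,
    }


if __name__ == "__main__":
    # Toy log: 10 episodes, 3 with advice, 8 correct overall.
    log = (
        [Episode(True, True)] * 3
        + [Episode(False, True)] * 5
        + [Episode(False, False)] * 2
    )
    print(summarize(log, advice_cost=0.3))
```

Running the same summary for policies trained at several advice costs gives the advice-rate versus accuracy curve needed to pick a human-in-the-loop budget.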

Agent Features

Memory

Embedding-based retrieval (all-MiniLM-L6-v2)

UpdateMemory via reflection

Memory used across sessions (long-term)
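The retrieval side of such a memory can be sketched as follows. To stay dependency-free, the sketch swaps in a bag-of-words vector for the all-MiniLM-L6-v2 embedding, so rankings will differ from a real sentence encoder; the `MemoryStore` class and its method names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (NOT all-MiniLM-L6-v2)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class MemoryStore:
    """Append-only memory with top-k similarity retrieval."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def update(self, text: str) -> None:
        # In the described system, entries would come from reflection
        # or from human advice; here they are plain strings.
        self._items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    mem = MemoryStore()
    mem.update("the kettle holds 1.7 liters")
    mem.update("returns are accepted within 30 days")
    print(mem.retrieve("how many liters does the kettle hold", k=1))
```

Replacing `embed` with a real sentence encoder and `cosine` with a vector index would keep the same interface.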

Planning

LLM generates multi-token plans and function calls

Session-level planning via context and memory

Tool Use

SQL product search

Article/web search

Executor-triggered tool calls

Frameworks

SFT

Is Agentic

Yes

Architectures

LLM-as-policy (token-level MDP)

Executor pattern (function tokens → effects)

Memory + tools + human-in-the-loop

Collaboration

Proactive human advice-seeking

Reflection to distill human answers into memory

Optimization Features

Token Efficiency

Executor can clear or trim context to limit LLM context size
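One way an executor might trim context is sketched below. The policy shown (always keep the first message, then keep the most recent messages that fit a token budget) is an assumption for illustration, as is counting tokens by whitespace; the source does not specify how AGILE decides what to drop.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> int:
    """Crude token count; a real system would use the model's tokenizer."""
    return len(text.split())


def trim_context(
    messages: List[str],
    max_tokens: int,
    count: Callable[[str], int] = whitespace_tokens,
) -> List[str]:
    """Keep the first message plus the newest messages that fit the budget."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    budget = max_tokens - count(head)
    kept: List[str] = []
    for msg in reversed(tail):  # walk from newest to oldest
        cost = count(msg)
        if cost > budget:
            break  # stop at the first message that no longer fits
        kept.append(msg)
        budget -= cost
    return [head] + kept[::-1]  # restore chronological order


if __name__ == "__main__":
    history = [
        "task: answer product questions",
        "old tool output a b c d",
        "recent question x",
        "latest answer y",
    ]
    print(trim_context(history, max_tokens=10))
```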

Infra Optimization

Training reported on NVIDIA H800 (8 GPUs for experiments)

Model Optimization

LoRA

System Optimization

Session partitioning to keep training sequences manageable

Training Optimization

SFT

Session-level RL algorithm to handle long trajectories

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/bytarnish/AGILE

Data URLs

https://github.com/bytarnish/AGILE (ProductQA code and release noted)

Risks & Boundaries

Limitations

Experiments use 7B and 13B models only; results may differ on much larger models (authors note).

ProductQA training uses a subset (20 groups) and tests on 6 held-out groups; generalization beyond these categories is untested.

When Not To Use

If you cannot afford human-in-the-loop costs or lack reliable experts to give advice.

For extremely latency-sensitive services where executor/tool calls or human queries are too slow.

Failure Modes

Over-reliance on human advice if advice cost is set too low.

Hallucinations when retrieval or tool outputs are missing or incorrect.

Core Entities

Models

Vicuna-13b, Meerkat-7b, GPT-4, GPT-3.5

Metrics

Accuracy, Advice Rate, Total Score (reward), Exact Match

Datasets

ProductQA, MedMCQA, HotPotQA

Benchmarks

ProductQA

Context Entities

Models

gpt4-MedPrompt (SOTA comparator)
