Train LLM-based agents end-to-end with RL and let them ask humans for help

May 23, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

AGILE is a pragmatic system design: the LLM acts as a token-level policy, an executor runs the functions it calls, and PPO tunes the decisions about when to seek advice or use tools. Experiments on three datasets, plus ablations, support the claims.

Citations: 4

Evidence Strength: 0.90

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Peiyuan Feng, Yichen He, Guanhua Huang, Yuan Lin, Hanchong Zhang, Yuchen Zhang, Hang Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AGILE lets production agents learn when to call humans and when to act, improving accuracy while controlling human cost. That makes it practical for customer support, medical QA, and recommendation systems where mistakes are costly.

Summary TLDR

AGILE is a practical framework that treats a large language model (LLM) as a policy in a token-level reinforcement learning (RL) loop. The system combines an LLM, a retrievable memory store, tool calls (search/SQL), an executor that runs function tokens, and a new proactive "seek advice" action that asks humans when needed. Trained with imitation learning followed by PPO, AGILE improves accuracy and reduces costly human queries across three QA tasks. The authors also release ProductQA, an 88k e-commerce QA dataset designed to test memory, tool use, reflection and human interaction.

Problem Statement

Building useful LLM agents requires many components (memory, tools, reflection, human help). Existing benchmarks and training methods do not jointly optimize all of these parts end-to-end. The paper asks: can RL train an agent that coordinates LLM reasoning, tool calls, memory, and the decision of when to ask humans?

Main Contribution

An end-to-end RL framework (AGILE) where the LLM is the policy and function names are actions executed by an external executor.

A new proactive human-advice action that the agent can learn to invoke to trade off accuracy vs. human cost.
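
The executor pattern in these contributions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function-token syntax (`[SearchProduct]`, `[SeekAdvice]`), the handler functions, and the `run_step` helper are all hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical handlers; real ones would call a database or a human expert.
def search_product(arg: str) -> str:
    return f"<results for '{arg}'>"

def seek_advice(arg: str) -> str:
    return f"<human answer to '{arg}'>"

# Hypothetical function tokens mapped to the code the executor runs.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "[SearchProduct]": search_product,
    "[SeekAdvice]": seek_advice,
}

def run_step(llm_output: str, context: str) -> str:
    """Append the LLM output to the context; if the output starts with a
    known function token, also append the result of running that function."""
    context += llm_output
    for token, handler in HANDLERS.items():
        if llm_output.startswith(token):
            argument = llm_output[len(token):].strip()
            context += "\n" + handler(argument) + "\n"
            break
    return context

new_context = run_step("[SeekAdvice] Is this kettle dishwasher safe?", "")
```

In this sketch the LLM only emits text; every side effect happens in the executor, which is what lets the function name be treated as an action during RL training.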

Key Findings

AGILE (agile-vic13b-ppo) achieves a higher average total score on ProductQA than the GPT-4 agent.

Numbers: Total score (short answers) 0.784 vs. agile-gpt4-prompt 0.718; +9.2% relative (Table 4)

Practical Use: Using RL to tune the whole agent (policy + module calls) can deliver practical accuracy gains over prompting GPT-4 in agent form; try two-stage SFT→PPO for agent policies.

Evidence Ref: Table 4

On MedMCQA, AGILE raised accuracy from 53.4% (base model) to 85.2% by seeking advice on a subset of cases.

Numbers: Meerkat-7b-prompt 0.534 → agile-mek7b-ppo 0.852; advice rate 31.6% (Table 6)

Practical Use: Allowing selective human advice plus RL dramatically boosts accuracy in high-stakes domains; set an advice cost to control human workload.

Evidence Ref: Table 6

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ProductQA total score, short answers (agile-vic13b-ppo) | 0.784 | agile-gpt4-prompt 0.718 | +9.2% relative | ProductQA (avg over 6 test groups) | Table 4 shows total score 0.784 for agile-vic13b-ppo vs 0.718 for agile-gpt4-prompt | Table 4 |
| Accuracy (agile-mek7b-ppo) | 0.852 | Meerkat-7b-prompt 0.534; gpt4-Medprompt 0.791 | +31.8 points vs base; +6.1 points vs gpt4-Medprompt | MedMCQA dev | Table 6 reports accuracy 0.852 (agile-mek7b-ppo) and base 0.534 | Table 6 |
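
The deltas reported for these results follow two conventions: the ProductQA figure is a relative change, while the MedMCQA figures match absolute differences in percentage points (0.852 − 0.534 = 0.318). A short check of the arithmetic, using only the numbers quoted above (the helper names are illustrative):

```python
def relative_gain_pct(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (value - baseline) / baseline * 100

def point_gain(value: float, baseline: float) -> float:
    """Absolute improvement, in percentage points."""
    return (value - baseline) * 100

productqa_rel = round(relative_gain_pct(0.784, 0.718), 1)   # ProductQA total score
medmcqa_vs_base = round(point_gain(0.852, 0.534), 1)        # vs Meerkat-7b-prompt
medmcqa_vs_gpt4 = round(point_gain(0.852, 0.791), 1)        # vs gpt4-Medprompt

print(productqa_rel, medmcqa_vs_base, medmcqa_vs_gpt4)
```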

What To Try In 7 Days

Prototype an executor that exposes a few tool calls (search/SQL) and a small memory store and wrap your LLM to emit function tokens.

Train quickly with imitation data (SFT) and run a short PPO fine-tune to learn an advice-cost trade-off.

Use a small advice cost and measure advice rate vs. accuracy to set human-in-the-loop budgets.
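
One way to measure the trade-off in the last step is to log, per question, whether the agent sought advice and whether its final answer was correct, then compute a cost-adjusted score. The scoring rule below (accuracy minus advice rate times a per-query cost), the `Episode` record, and the cost value of 0.3 are assumptions for illustration; consult the paper for the exact total-score definition.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Episode:
    sought_advice: bool  # did the agent ask a human on this question?
    correct: bool        # was the final answer correct?

def summarize(episodes: List[Episode], advice_cost: float) -> Dict[str, float]:
    """Return advice rate, accuracy, and an assumed cost-adjusted score."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    advice_rate = sum(e.sought_advice for e in episodes) / n
    accuracy = sum(e.correct for e in episodes) / n
    return {
        "advice_rate": advice_rate,
        "accuracy": accuracy,
        "score": accuracy - advice_cost * advice_rate,  # assumed scoring rule
    }

# Toy log: 4 questions, 1 escalated to a human, 3 answered correctly.
log = [Episode(True, True), Episode(False, True),
       Episode(False, False), Episode(False, True)]
stats = summarize(log, advice_cost=0.3)
```

Sweeping `advice_cost` over a few values and plotting advice rate against accuracy gives a simple curve for choosing a human-in-the-loop budget.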

Agent Features

Memory
Embedding-based retrieval (all-MiniLM-L6-v2); UpdateMemory via reflection; memory used across sessions (long-term)
Planning
LLM generates multi-token plans and function calls; session-level planning via context and memory
Tool Use
SQL product search; article/web search; executor-triggered tool calls
Frameworks
SFT
Is Agentic

Yes

Architectures
LLM-as-policy (token-level MDP); executor pattern (function tokens → effects); memory + tools + human-in-the-loop
Collaboration
Proactive human advice-seeking; reflection to distill human answers into memory
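
The memory feature (embedding-based retrieval, with updates written back after reflection) can be sketched without any model dependency. The source names all-MiniLM-L6-v2 as the embedding model; here a toy bag-of-words `embed` function stands in for it, and the `Memory` class and its method names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence-embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class Memory:
    def __init__(self) -> None:
        self.items: List[Tuple[Counter, str]] = []

    def update(self, text: str) -> None:
        """Store a fact, e.g. one distilled from a human answer."""
        self.items.append((embed(text), text))

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:k]]

memory = Memory()
memory.update("the blue kettle holds 1.7 litres")
memory.update("returns are accepted within 30 days")
top = memory.retrieve("how many litres does the kettle hold")
```

Swapping `embed` for a real sentence-embedding model would leave the rest of the sketch unchanged, provided `cosine` is adapted to dense vectors.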

Optimization Features

Token Efficiency
Executor can clear or trim context to limit LLM context size
Infra Optimization
Training reported on NVIDIA H800 (8 GPUs for experiments)
Model Optimization
LoRA
System Optimization
Session partitioning to keep training sequences manageable
Training Optimization
SFT; session-level RL algorithm to handle long trajectories
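
Context trimming by the executor, listed under Token Efficiency, might look like the sketch below: keep the system prompt plus the most recent turns that fit a token budget. The whitespace-based `count_tokens` and the `trim_context` helper are assumptions for illustration; a real system would count tokens with the model's own tokenizer, and the source does not specify the trimming policy.

```python
from typing import List

def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (an assumption)."""
    return len(text.split())

def trim_context(system_prompt: str, turns: List[str], budget: int) -> List[str]:
    """Keep the system prompt and as many of the newest turns as fit."""
    kept: List[str] = []
    used = count_tokens(system_prompt)
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                         # older turns are dropped
        kept.append(turn)
        used += cost
    return [system_prompt] + list(reversed(kept))

trimmed = trim_context("You are an agent.", ["a b c", "d e", "f g h i"], budget=10)
```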

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments use 7B and 13B models only; results may differ on much larger models (authors note).

ProductQA training uses a subset (20 groups) and tests on 6 held-out groups; generalization beyond these categories is untested.

When Not To Use

If you cannot afford human-in-the-loop costs or lack reliable experts to provide advice.

For extremely latency-sensitive services where executor/tool calls or human queries are too slow.

Failure Modes

Over-reliance on human advice if advice cost is set too low.

Hallucinations when retrieval or tool outputs are missing or incorrect.

Core Entities

Models

Vicuna-13b, Meerkat-7b, GPT-4, GPT-3.5

Metrics

Accuracy, Advice Rate, Total Score (reward), Exact Match

Datasets

ProductQA, MedMCQA, HotPotQA

Benchmarks

ProductQA

Context Entities

Models

gpt4-MedPrompt (SOTA comparator)