Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
You can get safer, more capable agent behavior from open LLMs by changing training data rather than buying closed APIs, reducing cost and dependency while lowering hallucination risks.
Summary TLDR
Agent-FLAN is a practical fine-tuning recipe for turning open LLMs (Llama2) into stronger agents. Key moves: (1) convert formatted agent data into multi-turn chat, (2) split training data by core capabilities (reasoning, retrieval, understanding, instruction following) and rebalance, and (3) add curated negative examples to teach when not to call tools. On evaluated benchmarks Agent-FLAN improves an off-the-shelf Llama2-7B by ~3.5% overall versus prior tuning methods, reduces hallucination on a 1,845-sample Agent-H benchmark, and scales predictably with model size and data fraction.
Problem Statement
Open-source LLMs are strong at language but lag behind API models when used as agents. Existing agent fine-tuning mixes format rules and reasoning, ignores per-capability learning speeds, and under-addresses hallucinations. The result: overfitting to formats, uneven capability gains, and unsafe or meaningless tool calls.
Main Contribution
Three diagnostic observations: (1) agent corpora mix format-following and reasoning which misaligns with pretraining; (2) different agent capabilities learn at different speeds; (3) hallucinations are common and under-evaluated.
Agent-FLAN method: align agent data to chat format, decompose data by capability and rebalance, and add diverse negative samples to reduce hallucination.
Empirical results: Agent-FLAN raises Llama2-series performance on multiple agent benchmarks (overall +3.5% vs prior best) and reduces hallucination measured on a new Agent-H benchmark.
Key Findings
Aligning formatted agent data into multi-turn chat improves task scores.
Agent-FLAN outperforms prior agent-tuning work on evaluated benchmarks.
Negative-sample training reduces agent hallucinations on a focused benchmark.
Results
Overall (mixed agent eval)
T-Eval (tool-use evaluation)
Agent-H hallucination HScore
Training data size effect (HotpotQA)
Who Should Care
What To Try In 7 Days
Convert a small fraction (10–30%) of your agent-format training examples into multi-turn chat and fine-tune a model to validate gains.
Split your agent data by capability (reasoning, retrieval, understanding, instruction) and upweight reasoning/understanding samples.
Create a small set (hundreds) of negative examples where no tool should be called and use them as supervised negatives to reduce hallucinated tool calls.
Agent Features
Planning
- ReAct-style thought-action planning (supported in training and evaluation)
Tool Use
- function calling
- tool invocation control (learn when/when not to call)
Frameworks
- ReAct
- ToolBench
- AgentTuning (comparison)
Is Agentic
true
Architectures
- decoder-only transformer (Llama2 variants)
Optimization Features
Training Optimization
- data balancing by capability
- chat-format alignment to avoid format overfitting
Reproducibility
Data Urls
- AgentInstruct (Zeng et al.)
- ToolBench (Qin et al.)
- glaive-function-calling-v2 (GlaiveAI)
- ALFWorld, WebShop, Mind2Web, Knowledge Graph (cited sources)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training and validation cover a subset of possible agent scenarios; other interactive cases may behave differently (§7).
- ToolBench use is partially filtered: authors used ~10% of ToolBench samples to keep quality, so results might change if full datasets are used (§7, Appendix B).
- Reported gains are on the evaluated benchmarks; real-world integration and safety still require more testing.
When Not To Use
- If you need the absolute best API model results (GPT-4 still outperforms on many agent metrics), prefer APIs for top performance.
- If you cannot curate or vet negative samples safely, negative-sample training may be risky or insufficient.
Failure Modes
- Format overfitting if chat alignment is not applied carefully; model can still learn to prioritize format tokens over content.
- Insufficient negative sample diversity can leave edge-case hallucinations unaddressed.
- Scaling only data quantity without improving diversity yields diminishing returns.
Core Entities
Models
- Llama2-7B
- Llama2-13B
- Llama2-70B
- GPT-3.5
- GPT-4
Metrics
- T-Eval score
- Overall agent evaluation (normalized)
- HReAct (format hallucinations)
- HGeneral (general-format hallucinations)
- HScore (Agent-H composite)
Datasets
- AgentInstruct
- ToolBench
- Glaive-function-calling-v2
- ALFWorld
- WebShop
- Mind2Web
- Knowledge Graph
- OS/Database (from AgentInstruct subsets)
- Agent-H (constructed, 1,845 samples)
Benchmarks
- T-Eval
- HotpotQA
- SciWorld
- WebArena
- Agent-H
- MMLU
- GSM8K
- HumanEval
Context Entities
Models
- Vicuna (cited)
- AgentTuning (Zeng et al., cited)
Datasets
- ShareGPT (mixed during training)

