Tune open LLMs into safer, better tool-using agents by aligning data to chat, decomposing capabilities, and adding negative samples

Overview

Decision SnapshotNeeds Validation

The approach is low-risk and practical: it's mostly data redesign and extra negative examples, with experiments across 7B–70B models and multiple benchmarks backing the claims.

Citations0

Evidence Strength0.70

Confidence0.85

Risk Signals8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get safer, more capable agent behavior from open LLMs by changing training data rather than buying closed APIs, reducing cost and dependency while lowering hallucination risks.

Who Should Care

ML Engineer Product Manager CTO Founder

Summary TLDR

Agent-FLAN is a practical fine-tuning recipe for turning open LLMs (Llama2) into stronger agents. Key moves: (1) convert formatted agent data into multi-turn chat, (2) split training data by core capabilities (reasoning, retrieval, understanding, instruction following) and rebalance, and (3) add curated negative examples to teach when not to call tools. On evaluated benchmarks Agent-FLAN improves an off-the-shelf Llama2-7B by ~3.5% overall versus prior tuning methods, reduces hallucination on a 1,845-sample Agent-H benchmark, and scales predictably with model size and data fraction.

Problem Statement

Open-source LLMs are strong at language but lag behind API models when used as agents. Existing agent fine-tuning mixes format rules and reasoning, ignores per-capability learning speeds, and under-addresses hallucinations. The result: overfitting to formats, uneven capability gains, and unsafe or meaningless tool calls.

Main Contribution

Three diagnostic observations: (1) agent corpora mix format-following and reasoning which misaligns with pretraining; (2) different agent capabilities learn at different speeds; (3) hallucinations are common and under-evaluated.

Agent-FLAN method: align agent data to chat format, decompose data by capability and rebalance, and add diverse negative samples to reduce hallucination.

Key Findings

Aligning formatted agent data into multi-turn chat improves task scores.

NumbersT-Eval +3.1%, HotpotQA +2.5% (Table 2)

Practical UseConvert ReAct/JSON-style training examples into natural dialogue before fine-tuning. Expect a few-percent gain on agent QA and tool tasks without extra model changes.

Evidence RefTable 2

Agent-FLAN outperforms prior agent-tuning work on evaluated benchmarks.

NumbersOverall score 41.7 vs 38.2 (AgentTuning*), +3.5 pts (Table 1)

Practical UseUse Agent-FLAN's data design (chat alignment, capability balancing, negative samples) to boost a Llama2-7B baseline by ~3–4% on mixed agent evaluations.

Evidence RefTable 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall (mixed agent eval)	41.7 (Agent-FLAN, Llama2-7B)	38.2 (AgentTuning*, same data)	+3.5	Held-in + Held-out mixture (Table 1)	Table 1 reports normalized scores vs GPT-4	Table 1
T-Eval (tool-use evaluation)	66.0 (Agent-FLAN, Llama2-7B)	61.8 (AgentTuning*)	+4.2	T-Eval (held-out)	Table 1 and Table 3	Table 1

What To Try In 7 Days

Convert a small fraction (10–30%) of your agent-format training examples into multi-turn chat and fine-tune a model to validate gains.

Split your agent data by capability (reasoning, retrieval, understanding, instruction) and upweight reasoning/understanding samples.

Create a small set (hundreds) of negative examples where no tool should be called and use them as supervised negatives to reduce hallucinated tool calls.

Agent Features

Planning

ReAct-style thought-action planning (supported in training and evaluation)

Tool Use

function callingtool invocation control (learn when/when not to call)

Frameworks

ReActToolBenchAgentTuning (comparison)

Is Agentic

Yes

Architectures

decoder-only transformer (Llama2 variants)

Optimization Features

Training Optimization

data balancing by capabilitychat-format alignment to avoid format overfitting

Reproducibility

Code AvailableYes

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Code URLs

https://github.com/InternLM/Agent-FLAN

Data URLs

AgentInstruct (Zeng et al.)ToolBench (Qin et al.)glaive-function-calling-v2 (GlaiveAI)ALFWorld, WebShop, Mind2Web, Knowledge Graph (cited sources)

Risks & Boundaries

Limitations

Training and validation cover a subset of possible agent scenarios; other interactive cases may behave differently (§7).

ToolBench use is partially filtered: authors used ~10% of ToolBench samples to keep quality, so results might change if full datasets are used (§7, Appendix B).

When Not To Use

If you need the absolute best API model results (GPT-4 still outperforms on many agent metrics), prefer APIs for top performance.

If you cannot curate or vet negative samples safely, negative-sample training may be risky or insufficient.

Failure Modes

Format overfitting if chat alignment is not applied carefully; model can still learn to prioritize format tokens over content.

Insufficient negative sample diversity can leave edge-case hallucinations unaddressed.

Core Entities

Models

Llama2-7BLlama2-13BLlama2-70BGPT-3.5GPT-4

Metrics

T-Eval scoreOverall agent evaluation (normalized)HReAct (format hallucinations)HGeneral (general-format hallucinations)HScore (Agent-H composite)

Datasets

AgentInstructToolBenchGlaive-function-calling-v2ALFWorldWebShopMind2WebKnowledge GraphOS/Database (from AgentInstruct subsets)Agent-H (constructed, 1,845 samples)

Benchmarks

T-EvalHotpotQASciWorldWebArenaAgent-HMMLUGSM8KHumanEval

Context Entities

Models

Vicuna (cited)AgentTuning (Zeng et al., cited)

Datasets

ShareGPT (mixed during training)

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Aligning formatted agent data into multi-turn chat improves task scores.

Agent-FLAN outperforms prior agent-tuning work on evaluated benchmarks.

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Code URLs

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Datasets

Benchmarks

Context Entities

Models

Datasets

You May Also Want to Read

CoALM: one fine-tuned model that combines multi-turn dialogue state tracking with robust API / function calling

Key finding

First holistic Burmese benchmark (BURMESE-SAN) that tests LLMs on understanding, reasoning, and generation.

Key finding

Hamza: Turkish LLMs, adaptation vs from‑scratch, plus new Turkish benchmarks

Key finding

FinTral: a 7B multimodal financial LLM + FinSet dataset that rivals GPT-4 on many finance tasks

Key finding

EmplifAI: 4,125 Japanese medical two‑turn dialogues labeled with 28 fine-grained emotions

Key finding