Tune open LLMs into safer, better tool-using agents by aligning data to chat, decomposing capabilities, and adding negative samples

March 19, 20247 min

Overview

Decision SnapshotNeeds Validation

The approach is low-risk and practical: it's mostly data redesign and extra negative examples, with experiments across 7B–70B models and multiple benchmarks backing the claims.

Citations0

Evidence Strength0.70

Confidence0.85

Risk Signals8

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can get safer, more capable agent behavior from open LLMs by changing training data rather than buying closed APIs, reducing cost and dependency while lowering hallucination risks.

Who Should Care

Summary TLDR

Agent-FLAN is a practical fine-tuning recipe for turning open LLMs (Llama2) into stronger agents. Key moves: (1) convert formatted agent data into multi-turn chat, (2) split training data by core capabilities (reasoning, retrieval, understanding, instruction following) and rebalance, and (3) add curated negative examples to teach when not to call tools. On evaluated benchmarks Agent-FLAN improves an off-the-shelf Llama2-7B by ~3.5% overall versus prior tuning methods, reduces hallucination on a 1,845-sample Agent-H benchmark, and scales predictably with model size and data fraction.

Problem Statement

Open-source LLMs are strong at language but lag behind API models when used as agents. Existing agent fine-tuning mixes format rules and reasoning, ignores per-capability learning speeds, and under-addresses hallucinations. The result: overfitting to formats, uneven capability gains, and unsafe or meaningless tool calls.

Main Contribution

Three diagnostic observations: (1) agent corpora mix format-following and reasoning which misaligns with pretraining; (2) different agent capabilities learn at different speeds; (3) hallucinations are common and under-evaluated.

Agent-FLAN method: align agent data to chat format, decompose data by capability and rebalance, and add diverse negative samples to reduce hallucination.

Key Findings

Aligning formatted agent data into multi-turn chat improves task scores.

NumbersT-Eval +3.1%, HotpotQA +2.5% (Table 2)

Practical UseConvert ReAct/JSON-style training examples into natural dialogue before fine-tuning. Expect a few-percent gain on agent QA and tool tasks without extra model changes.

Evidence RefTable 2

Agent-FLAN outperforms prior agent-tuning work on evaluated benchmarks.

NumbersOverall score 41.7 vs 38.2 (AgentTuning*), +3.5 pts (Table 1)

Practical UseUse Agent-FLAN's data design (chat alignment, capability balancing, negative samples) to boost a Llama2-7B baseline by ~3–4% on mixed agent evaluations.

Evidence RefTable 1

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Overall (mixed agent eval)41.7 (Agent-FLAN, Llama2-7B)38.2 (AgentTuning*, same data)+3.5Held-in + Held-out mixture (Table 1)Table 1 reports normalized scores vs GPT-4Table 1
T-Eval (tool-use evaluation)66.0 (Agent-FLAN, Llama2-7B)61.8 (AgentTuning*)+4.2T-Eval (held-out)Table 1 and Table 3Table 1

What To Try In 7 Days

Convert a small fraction (10–30%) of your agent-format training examples into multi-turn chat and fine-tune a model to validate gains.

Split your agent data by capability (reasoning, retrieval, understanding, instruction) and upweight reasoning/understanding samples.

Create a small set (hundreds) of negative examples where no tool should be called and use them as supervised negatives to reduce hallucinated tool calls.

Agent Features

Planning
ReAct-style thought-action planning (supported in training and evaluation)
Tool Use
function callingtool invocation control (learn when/when not to call)
Frameworks
ReActToolBenchAgentTuning (comparison)
Is Agentic

Yes

Architectures
decoder-only transformer (Llama2 variants)

Optimization Features

Training Optimization
data balancing by capabilitychat-format alignment to avoid format overfitting

Reproducibility

Code AvailableYes
Data AvailableYes
Open Source StatusPartial
LicenseUnknown

Data URLs

AgentInstruct (Zeng et al.)ToolBench (Qin et al.)glaive-function-calling-v2 (GlaiveAI)ALFWorld, WebShop, Mind2Web, Knowledge Graph (cited sources)

Risks & Boundaries

Limitations

Training and validation cover a subset of possible agent scenarios; other interactive cases may behave differently (§7).

ToolBench use is partially filtered: authors used ~10% of ToolBench samples to keep quality, so results might change if full datasets are used (§7, Appendix B).

When Not To Use

If you need the absolute best API model results (GPT-4 still outperforms on many agent metrics), prefer APIs for top performance.

If you cannot curate or vet negative samples safely, negative-sample training may be risky or insufficient.

Failure Modes

Format overfitting if chat alignment is not applied carefully; model can still learn to prioritize format tokens over content.

Insufficient negative sample diversity can leave edge-case hallucinations unaddressed.

Core Entities

Models

Llama2-7BLlama2-13BLlama2-70BGPT-3.5GPT-4

Metrics

T-Eval scoreOverall agent evaluation (normalized)HReAct (format hallucinations)HGeneral (general-format hallucinations)HScore (Agent-H composite)

Datasets

AgentInstructToolBenchGlaive-function-calling-v2ALFWorldWebShopMind2WebKnowledge GraphOS/Database (from AgentInstruct subsets)Agent-H (constructed, 1,845 samples)

Benchmarks

T-EvalHotpotQASciWorldWebArenaAgent-HMMLUGSM8KHumanEval

Context Entities

Models

Vicuna (cited)AgentTuning (Zeng et al., cited)

Datasets

ShareGPT (mixed during training)