Tune open LLMs into safer, better tool-using agents by aligning data to chat, decomposing capabilities, and adding negative samples

March 19, 20247 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao

Links

Abstract / PDF

Why It Matters For Business

You can get safer, more capable agent behavior from open LLMs by changing training data rather than buying closed APIs, reducing cost and dependency while lowering hallucination risks.

Summary TLDR

Agent-FLAN is a practical fine-tuning recipe for turning open LLMs (Llama2) into stronger agents. Key moves: (1) convert formatted agent data into multi-turn chat, (2) split training data by core capabilities (reasoning, retrieval, understanding, instruction following) and rebalance, and (3) add curated negative examples to teach when not to call tools. On evaluated benchmarks Agent-FLAN improves an off-the-shelf Llama2-7B by ~3.5% overall versus prior tuning methods, reduces hallucination on a 1,845-sample Agent-H benchmark, and scales predictably with model size and data fraction.

Problem Statement

Open-source LLMs are strong at language but lag behind API models when used as agents. Existing agent fine-tuning mixes format rules and reasoning, ignores per-capability learning speeds, and under-addresses hallucinations. The result: overfitting to formats, uneven capability gains, and unsafe or meaningless tool calls.

Main Contribution

Three diagnostic observations: (1) agent corpora mix format-following and reasoning which misaligns with pretraining; (2) different agent capabilities learn at different speeds; (3) hallucinations are common and under-evaluated.

Agent-FLAN method: align agent data to chat format, decompose data by capability and rebalance, and add diverse negative samples to reduce hallucination.

Empirical results: Agent-FLAN raises Llama2-series performance on multiple agent benchmarks (overall +3.5% vs prior best) and reduces hallucination measured on a new Agent-H benchmark.

Key Findings

Aligning formatted agent data into multi-turn chat improves task scores.

NumbersT-Eval +3.1%, HotpotQA +2.5% (Table 2)

Agent-FLAN outperforms prior agent-tuning work on evaluated benchmarks.

NumbersOverall score 41.7 vs 38.2 (AgentTuning*), +3.5 pts (Table 1)

Negative-sample training reduces agent hallucinations on a focused benchmark.

NumbersAgent-H benchmark size 1,845; HScore improved from 78.7 to 89.1 (Table 3)

Results

Overall (mixed agent eval)

Value41.7 (Agent-FLAN, Llama2-7B)

Baseline38.2 (AgentTuning*, same data)

T-Eval (tool-use evaluation)

Value66.0 (Agent-FLAN, Llama2-7B)

Baseline61.8 (AgentTuning*)

Agent-H hallucination HScore

Value89.1 (Agent-FLAN, Llama2-7B)

Baseline78.7 (Llama2-7B)

Training data size effect (HotpotQA)

ValueMost gains by 25% of Agent-FLAN data

Baseline100% data

Who Should Care

What To Try In 7 Days

Convert a small fraction (10–30%) of your agent-format training examples into multi-turn chat and fine-tune a model to validate gains.

Split your agent data by capability (reasoning, retrieval, understanding, instruction) and upweight reasoning/understanding samples.

Create a small set (hundreds) of negative examples where no tool should be called and use them as supervised negatives to reduce hallucinated tool calls.

Agent Features

Planning

  • ReAct-style thought-action planning (supported in training and evaluation)

Tool Use

  • function calling
  • tool invocation control (learn when/when not to call)

Frameworks

  • ReAct
  • ToolBench
  • AgentTuning (comparison)

Is Agentic

true

Architectures

  • decoder-only transformer (Llama2 variants)

Optimization Features

Training Optimization

  • data balancing by capability
  • chat-format alignment to avoid format overfitting

Reproducibility

Data Urls

  • AgentInstruct (Zeng et al.)
  • ToolBench (Qin et al.)
  • glaive-function-calling-v2 (GlaiveAI)
  • ALFWorld, WebShop, Mind2Web, Knowledge Graph (cited sources)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training and validation cover a subset of possible agent scenarios; other interactive cases may behave differently (§7).
  • ToolBench use is partially filtered: authors used ~10% of ToolBench samples to keep quality, so results might change if full datasets are used (§7, Appendix B).
  • Reported gains are on the evaluated benchmarks; real-world integration and safety still require more testing.

When Not To Use

  • If you need the absolute best API model results (GPT-4 still outperforms on many agent metrics), prefer APIs for top performance.
  • If you cannot curate or vet negative samples safely, negative-sample training may be risky or insufficient.

Failure Modes

  • Format overfitting if chat alignment is not applied carefully; model can still learn to prioritize format tokens over content.
  • Insufficient negative sample diversity can leave edge-case hallucinations unaddressed.
  • Scaling only data quantity without improving diversity yields diminishing returns.

Core Entities

Models

  • Llama2-7B
  • Llama2-13B
  • Llama2-70B
  • GPT-3.5
  • GPT-4

Metrics

  • T-Eval score
  • Overall agent evaluation (normalized)
  • HReAct (format hallucinations)
  • HGeneral (general-format hallucinations)
  • HScore (Agent-H composite)

Datasets

  • AgentInstruct
  • ToolBench
  • Glaive-function-calling-v2
  • ALFWorld
  • WebShop
  • Mind2Web
  • Knowledge Graph
  • OS/Database (from AgentInstruct subsets)
  • Agent-H (constructed, 1,845 samples)

Benchmarks

  • T-Eval
  • HotpotQA
  • SciWorld
  • WebArena
  • Agent-H
  • MMLU
  • GSM8K
  • HumanEval

Context Entities

Models

  • Vicuna (cited)
  • AgentTuning (Zeng et al., cited)

Datasets

  • ShareGPT (mixed during training)