Train LLMs on a 103B-token agent corpus to boost API function-calling, planning, and feedback adaptation.

February 10, 2025 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang

Links

Abstract / PDF

Why It Matters For Business

Investing in targeted pretraining data for tool use yields measurable gains in API-calling and multi-step planning, letting mid-sized open models approach commercial LLM performance on agent tasks.

Summary TLDR

Hephaestus-Forge is a purpose-built, 103B-token pretraining corpus of API docs, function-calling trajectories, code, and text designed to teach models how to call APIs, plan multi-step tool sequences, and adapt to environment feedback. Continual pre-training (two-stage) on this mix plus standard instruction fine-tuning produces Hephaestus models (8B and 7B variants) that outperform comparable open-source models on three agent benchmarks. The authors also report an empirical optimal pretraining mix of roughly 36% agent data (≈1:1:1 agent:code:text). Experimental checks include data filtering, ablations of retrieval, and contamination string-matching.
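
The contamination check mentioned above can be illustrated with a minimal n-gram string-matching sketch. The whitespace tokenization, the n-gram length of 8, and the function names are assumptions made for illustration; the summary does not state the authors' exact procedure.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(document, benchmark_samples, n=8):
    """Flag a document that shares at least one n-gram with any benchmark sample.

    n=8 is an illustrative choice, not a value taken from the paper.
    """
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(sample, n) for sample in benchmark_samples)


if __name__ == "__main__":
    # Hypothetical benchmark sample and crawl pages, for illustration only.
    benchmark = ["call get_weather with city set to Paris and units set to metric"]
    leaked = "Tutorial: call get_weather with city set to Paris and units set to metric for best results"
    clean = "This page documents how to list orders through the REST endpoint using pagination tokens"
    print(is_contaminated(leaked, benchmark))
    print(is_contaminated(clean, benchmark))
```

Exact string matching of this kind only catches verbatim overlap; paraphrased benchmark items would pass through unflagged.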

Problem Statement

Open-source LLMs lack agent-oriented pretraining data, so agents usually rely on heavy prompting or task-specific fine-tuning. That can fail to teach new tool-use skills, hurt generalization, and leave function-calling, multi-step planning, and feedback adaptation underdeveloped.

Main Contribution

Hephaestus-Forge: a 103B-token, multi-source pretraining corpus focused on API docs, function-calling trajectories, code, and text to teach agent skills.

A two-stage continual pre-training recipe (broad agent+general data then seed-focused agent data) that injects function-calling and intrinsic planning knowledge.

A scaling-law study that finds an empirical optimal data mix (~36% agent data, approximately 1:1:1 agent:code:text), plus ablations showing the value of retrieval and filtering.

Key Findings

Hephaestus-Forge contains about 103 billion tokens and metadata for 76,537 APIs.

Numbers: 103B tokens; 76,537 APIs

Optimal pretraining mix is roughly 36% agent data, yielding an approximately 1:1:1 ratio of agent:code:text.

Numbers: Agent data ≈36% (1:1:1)
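
To make the mix concrete, the sketch below splits a token budget using the reported ~36% agent share. Splitting the remainder evenly between code and text is an assumption that follows from the approximate 1:1:1 ratio; the function name and the example budget are hypothetical.

```python
def split_token_budget(total_tokens, agent_fraction=0.36):
    """Split a token budget into agent, code, and text shares.

    agent_fraction defaults to the ~36% reported in the summary. The even
    split of the remainder between code and text is an assumption.
    """
    if not 0.0 <= agent_fraction <= 1.0:
        raise ValueError("agent_fraction must be between 0 and 1")
    agent = round(total_tokens * agent_fraction)
    remainder = total_tokens - agent
    code = remainder // 2
    text = remainder - code  # any odd leftover token goes to text
    return {"agent": agent, "code": code, "text": text}


if __name__ == "__main__":
    # Hypothetical 1B-token budget for a small continual-pretraining run.
    print(split_token_budget(1_000_000_000))
```

Under these assumptions the three shares come out to about 36%, 32%, and 32%, which is what "approximately 1:1:1" amounts to in practice.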

Hephaestus-8B (instruction fine-tuned) achieves 70.78% overall accuracy on BFCL-v2 vs 62.12% for LLaMA-3-8B-IFT (baseline).

Numbers: BFCL-v2 OA: 70.78 vs 62.12 (Δ +8.66 points)

Ablations show retrieval and filtering matter: removing retrieved data or filtering reduces performance on many agent tasks.

Numbers: BFCL-v2 OA drops to 49.86 or 59.34 in some ablations (see Table 3)

Results

Accuracy (BFCL-v2 overall)

Value: 70.78%

Baseline: LLaMA-3-8B-IFT 62.12%

Accuracy

Value: 51.59%

Baseline: LLaMA-3-8B-IFT 48.52%

AgentBench overall (OA)

Value: 2.29

Baseline: LLaMA-3-8B-IFT 2.07

Pretraining corpus size

Value: 103B tokens
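
The value and baseline pairs above can be turned into absolute and relative gains with a few lines of arithmetic. This is a small sketch; the labels are taken from this summary, and the second row is left generic because its benchmark is not named here.

```python
# (label, reported value, LLaMA-3-8B-IFT baseline), copied from the rows above.
RESULTS = [
    ("BFCL-v2 overall accuracy (%)", 70.78, 62.12),
    ("Accuracy, second row (%)", 51.59, 48.52),
    ("AgentBench overall", 2.29, 2.07),
]


def gains(value, baseline):
    """Return (absolute gain, relative gain in percent), both rounded to 2 decimals."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative_pct, 2)


if __name__ == "__main__":
    for label, value, baseline in RESULTS:
        absolute, relative_pct = gains(value, baseline)
        print(f"{label}: +{absolute} absolute, +{relative_pct}% relative")
```

The absolute gain on the first row reproduces the +8.66 points listed under Key Findings.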

Who Should Care

What To Try In 7 Days

Collect a small seed of API docs and usage examples for your tooling surface.

Use semantic retrieval (embed+nearest neighbors) to expand seeds from web crawls.

Train a lightweight classifier (fastText) to filter agent-relevant pages, then sample a 1:1:1 mix of agent:code:text for short continual pretraining or adapter experiments (small b
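
The seed-expansion step in this list (embed, then take nearest neighbors) can be sketched as follows. The bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model, and all names and sample pages are hypothetical; only the overall embed-and-rank pattern comes from the text.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_seeds(seeds, crawl_pages, k=2):
    """Return up to k crawl pages most similar to any seed document."""
    if not seeds:
        return []
    seed_vecs = [embed(s) for s in seeds]
    scored = []
    for page in crawl_pages:
        vec = embed(page)
        # Score each page by its closest seed (nearest-neighbor similarity).
        scored.append((max(cosine(vec, s) for s in seed_vecs), page))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page for score, page in scored[:k] if score > 0]


if __name__ == "__main__":
    seeds = ["GET /users returns a list of users as JSON"]
    pages = [
        "The API endpoint GET /orders returns a list of orders as JSON",
        "My favorite pasta recipe with tomato and basil",
        "POST /users creates a new user and returns JSON",
    ]
    for page in expand_seeds(seeds, pages):
        print(page)
```

With a real embedding model and an approximate nearest-neighbor index, the same structure should scale to web-crawl sizes; the retrieved pages would then go to the filtering classifier described in the next step.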

Agent Features

Memory

  • short-term interaction state (observations → actions)
  • implicit planning state in model parameters

Planning

  • intrinsic multi-step planning (sequence of API calls)
  • plan refinement from environment feedback

Tool Use

  • API function calling (single and multi-turn)
  • multi-tool sequencing and parameter selection
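
As an illustration of what function calling with parameter selection involves on the consuming side, the sketch below validates a model-emitted call before it is executed. The JSON call format, the `TOOLS` registry, and the `get_weather` function are hypothetical and are not taken from the paper or from any benchmark harness.

```python
import json

# Hypothetical tool registry: parameter names mapped to expected Python types.
TOOLS = {
    "get_weather": {"required": {"city": str}, "optional": {"units": str}},
}


def validate_call(raw):
    """Parse a JSON function call and check it against TOOLS.

    Returns (name, arguments) when the call is well formed; raises ValueError otherwise.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("call must be a JSON object")
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"unknown function: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    spec = TOOLS[name]
    allowed = {**spec["required"], **spec["optional"]}
    missing = [p for p in spec["required"] if p not in args]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    for param, value in args.items():
        if param not in allowed:
            raise ValueError(f"unexpected parameter: {param}")
        if not isinstance(value, allowed[param]):
            raise ValueError(f"wrong type for parameter: {param}")
    return name, args


if __name__ == "__main__":
    print(validate_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

A check like this separates calls that are merely well formed from calls that can actually be executed against a given tool surface.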

Frameworks

  • continual pre-training (Stage I broad, Stage II seed-focused)
  • instruction fine-tuning (Stage III) for downstream alignment

Is Agentic

true

Architectures

  • Transformer (LLaMA-3 backbone)
  • Mistral (7B backbone variant)

Collaboration

  • generalization across diverse APIs and domains

Optimization Features

Infra Optimization

  • used 128 A100 (40G) GPUs for 11.1 days for 8B pretraining
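
For budgeting, the reported hardware and duration convert to GPU-hours with simple arithmetic. The sketch below uses only the 128 GPUs and 11.1 days stated above; the hourly rate is left as a parameter because no price is given in the source.

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for num_gpus running continuously for the given number of days."""
    return num_gpus * days * 24


def estimated_cost(total_gpu_hours, usd_per_gpu_hour):
    """Cost estimate; usd_per_gpu_hour must come from your own provider pricing."""
    return total_gpu_hours * usd_per_gpu_hour


if __name__ == "__main__":
    hours = gpu_hours(128, 11.1)
    print(f"{hours:,.1f} GPU-hours")  # 128 * 11.1 * 24
```

This works out to roughly 34,100 A100 GPU-hours for the 8B pretraining run.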

System Optimization

  • parallel training with tensor and pipeline model parallelism

Training Optimization

  • two-stage continual pretraining to reduce stability gap
  • scaling-law fitting to choose data mix

Reproducibility

License

  • Apache-2.0 / MIT / LGPL-2.1 (seed sources per paper)

Data Urls

  • The Hephaestus paper lists seed data sources and public dataset URLs (see Appendix A and Tables 9-10)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Compute limited: experiments only on small/medium backbones; large-scale behavior is untested.
  • Possible filter errors: fastText misclassifies some pages (they report failure cases).
  • Pretraining corpus not fully released; reproducibility may require effort to reassemble data.
  • Ablations show sensitivity: removing retrieval or filtering changes behavior and can cause overfitting.

When Not To Use

  • If you only need a quick instruction-following chatbot: heavy continual pretraining is costly.
  • If you lack the compute budget (the authors' 8B pretraining used 128 A100s for ~11 days).
  • When the function-calling surface is tiny and task-specific fine-tuning suffices.

Failure Modes

  • Overfitting to seed patterns when retrieval data is sparse or unfiltered.
  • Stability gap: sudden drops in old-task performance if data distribution shifts without staged training.
  • Incomplete instruction-following ability may limit executable function generation despite correct ASTs.

Core Entities

Models

  • Hephaestus-8B
  • Hephaestus-7B (Mistral backbone)
  • LLaMA-3-8B
  • LLaMA-3.1-8B
  • Mixtral-8x22B
  • Mistral-7B-v0.3
  • StarCoder-v2

Metrics

  • Accuracy
  • success rate
  • F1
  • reward score
  • benchmark loss

Datasets

  • Hephaestus-Forge
  • BFCL-v2
  • BFCL-v3
  • AgentBench
  • API-Bank
  • API-Bench
  • MMLU
  • ToolACE
  • ShareGPT
  • AgentFlan

Benchmarks

  • BFCL (Berkeley Function Calling Leaderboard)
  • AgentBench
  • Nexus
  • API-Bank
  • API-Bench
  • MMLU