Train LLMs on a 103B-token agent corpus to boost API function-calling, planning, and feedback adaptation.

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent improvements on multiple agent benchmarks and includes ablations and contamination checks, but experiments are limited to small/medium model scales and rely on many curated components.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

License: Apache-2.0 / MIT / LGPL-2.1 (seed sources per paper)

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang

Links

Abstract / PDF / Data

Why It Matters For Business

Investing in targeted pretraining data for tool use yields measurable gains in API-calling and multi-step planning, letting mid-sized open models approach commercial LLM performance on agent tasks.

Who Should Care

CTO, ML Engineer, Product Manager, Data Scientist

Summary TLDR

Hephaestus-Forge is a purpose-built, 103B-token pretraining corpus of API docs, function-calling trajectories, code, and text designed to teach models how to call APIs, plan multi-step tool sequences, and adapt to environment feedback. Continual pre-training (two-stage) on this mix plus standard instruction fine-tuning produces Hephaestus models (8B and 7B variants) that outperform comparable open-source models on three agent benchmarks. The authors also report an empirical optimal pretraining mix of roughly 36% agent data (≈1:1:1 agent:code:text). Experimental checks include data filtering, ablations of retrieval, and contamination string-matching.
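A contamination check by string matching can be sketched as an n-gram overlap test between training documents and benchmark items. This page does not describe the authors' exact procedure, so the whitespace tokenization, the window size `n=8`, and the function names below are all assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_docs, benchmark_items, n=8):
    """Flag benchmark items that share at least one n-gram with training data.

    n=8 is an assumed window size, not a value taken from the paper.
    Returns (fraction_flagged, flagged_items).
    """
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Toy data, invented for this example only.
train_docs = ["call get_weather with city set to Paris and units set to metric please"]
benchmark_items = [
    "Please call get_weather with city set to Paris and units set to metric",
    "book a flight from Boston to Denver next Monday morning for two people",
]
rate, flagged = contamination_rate(train_docs, benchmark_items)
print(rate, flagged)
```

A smaller `n` catches more paraphrased overlap at the cost of more false positives, so the window is worth tuning against a few known-clean and known-leaked samples.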

Problem Statement

Open-source LLMs lack agent-oriented pretraining data, so agents usually rely on heavy prompting or task-specific fine-tuning. That can fail to teach new tool-use skills, hurt generalization, and leave function-calling, multi-step planning, and feedback adaptation underdeveloped.

Main Contribution

Hephaestus-Forge: a 103B-token, multi-source pretraining corpus focused on API docs, function-calling trajectories, code, and text to teach agent skills.

A two-stage continual pre-training recipe (broad agent+general data then seed-focused agent data) that injects function-calling and intrinsic planning knowledge.

Key Findings

Hephaestus-Forge contains about 103 billion tokens and metadata for 76,537 APIs.

Numbers: 103B tokens; 76,537 APIs

Practical Use: If you need an agent-capable model, pretraining on large, focused agent corpora (100B+ tokens) gives the model direct exposure to API formats and call patterns.

Evidence Ref: Abstract; §4.1

Optimal pretraining mix is roughly 36% agent data, yielding an approximately 1:1:1 ratio of agent:code:text.

Numbers: Agent data ≈36% (1:1:1)

Practical Use: When building a continual pretraining corpus, allocate ~36% to agent-specific samples and split remaining data roughly equally between code and general text.

Evidence Ref: §5 and Figure 3
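The mix arithmetic can be made concrete with a short sketch. It takes the reported ~36% agent share and splits the remainder evenly between code and text; the 10B-token budget and the function name `mix_budget` are hypothetical and chosen only for illustration.

```python
def mix_budget(total_tokens: float, agent_share: float = 0.36) -> dict:
    """Split a token budget into agent, code, and text portions.

    agent_share defaults to the ~36% reported in the summary; the
    remainder is divided equally between code and general text.
    """
    if not 0.0 <= agent_share <= 1.0:
        raise ValueError("agent_share must be between 0 and 1")
    rest = (1.0 - agent_share) / 2
    return {
        "agent": total_tokens * agent_share,
        "code": total_tokens * rest,
        "text": total_tokens * rest,
    }


# Hypothetical 10B-token continual-pretraining budget.
budget = mix_budget(10e9)
for source, tokens in budget.items():
    print(f"{source}: {tokens / 1e9:.1f}B tokens")
```

With these inputs the split is 3.6B agent, 3.2B code, and 3.2B text tokens, that is 36:32:32, which is what the summary rounds to 1:1:1.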

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	70.78%	LLaMA-3-8B-IFT 62.12%	+8.66 pp	BFCL-v2	Hephaestus-8B-IFT BFCL-v2 OA 70.78 vs LLaMA-3-8B-IFT 62.12	Table 7; Table 4
Accuracy	51.59%	LLaMA-3-8B-IFT 48.52%	+3.07 pp	BFCL-v3	Hephaestus-8B-IFT BFCL-v3 OA 51.59 vs LLaMA-3-8B-IFT 48.52	Table 4
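The Delta column is an absolute difference in percentage points. The sketch below recomputes it from the two table rows and also derives the relative improvement over the baseline, which the table does not report; the variable and function names are illustrative.

```python
# (model accuracy, baseline accuracy) in percent, copied from the table above.
results = {
    "BFCL-v2": (70.78, 62.12),
    "BFCL-v3": (51.59, 48.52),
}


def deltas(model: float, baseline: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_pp = model - baseline
    rel_pct = 100.0 * abs_pp / baseline
    return round(abs_pp, 2), round(rel_pct, 1)


summary = {name: deltas(*scores) for name, scores in results.items()}
for name, (abs_pp, rel_pct) in summary.items():
    print(f"{name}: +{abs_pp} pp ({rel_pct}% relative to baseline)")
```

The absolute gains match the table (+8.66 pp and +3.07 pp); the relative gains come out to roughly 13.9% and 6.3%.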

What To Try In 7 Days

Collect a small seed of API docs and usage examples for your tooling surface.

Use semantic retrieval (embed+nearest neighbors) to expand seeds from web crawls.

Train a lightweight classifier (fastText) to filter agent-relevant pages, then sample a 1:1:1 mix of agent:code:text for short continual pretraining or adapter experiments (small b
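The seed-expansion step above can be sketched as a nearest-neighbor search. To stay self-contained, this example swaps in a toy bag-of-words vector for a real embedding model and uses a brute-force scan instead of an approximate-nearest-neighbor index; the function names, the `min_sim` threshold, and the sample pages are assumptions, not details from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def expand_seeds(seeds, crawl, k=2, min_sim=0.2):
    """Return up to k crawled pages most similar to any seed document.

    min_sim is an assumed cutoff that drops clearly unrelated pages.
    """
    seed_vecs = [embed(s) for s in seeds]
    scored = []
    for page in crawl:
        vec = embed(page)
        best = max((cosine(vec, s) for s in seed_vecs), default=0.0)
        if best >= min_sim:
            scored.append((best, page))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for _, page in scored[:k]]


# Invented sample data for illustration.
seeds = ["GET /v1/orders returns a list of orders for the authenticated user"]
crawl = [
    "POST /v1/orders creates a new order for the authenticated user",
    "Ten easy weeknight pasta recipes",
    "GET /v1/users returns a list of users",
]
expanded = expand_seeds(seeds, crawl)
print(expanded)
```

In a real pipeline the retrieved pages would then pass through the classifier filter before being sampled into the training mix.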

Agent Features

Memory

short-term interaction state (observations → actions)

implicit planning state in model parameters

Planning

intrinsic multi-step planning (sequence of API calls)

plan refinement from environment feedback

Tool Use

API function calling (single and multi-turn)

multi-tool sequencing and parameter selection
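For readers unfamiliar with function calling, the sketch below shows one common shape: the model emits a JSON object naming a tool and its arguments, and a dispatcher runs it and returns either a result or an error that the model can react to on the next turn. The JSON layout, the `get_weather` tool, and its canned return value are hypothetical; they are not the format used in the paper or in any particular benchmark.

```python
import json


def get_weather(city: str, units: str = "metric") -> dict:
    """Hypothetical tool that returns canned data instead of calling an API."""
    return {"city": city, "units": units, "temp": 21}


# Registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(call_json: str) -> dict:
    """Execute a model-emitted call of the form {"name": ..., "arguments": {...}}.

    Errors are returned as data so they can be fed back to the model
    as environment feedback rather than crashing the agent loop.
    """
    call = json.loads(call_json)
    name = call["name"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": TOOLS[name](**call.get("arguments", {}))}
    except TypeError as exc:  # missing or unexpected parameters
        return {"error": str(exc)}


ok = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
bad = dispatch('{"name": "get_weather", "arguments": {}}')
print(ok)
print(bad)
```

The second call omits the required `city` parameter, so the dispatcher returns an error message; adapting to that kind of feedback is the skill the summary groups under plan refinement.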

Frameworks

continual pre-training (Stage I broad, Stage II seed-focused)

instruction fine-tuning (Stage III) for downstream alignment
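The three stages can be written down as a small configuration sketch. Only the stage order and the coarse data descriptions come from this summary; the `Stage` class, its field names, and the "instruction data" label for Stage III are illustrative assumptions, and no hyperparameters are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage; field names are illustrative, not from the paper."""
    name: str
    kind: str
    data: tuple


PIPELINE = (
    Stage("Stage I", "continual pre-training", ("broad agent data", "general data")),
    Stage("Stage II", "continual pre-training", ("seed-focused agent data",)),
    # The data label for Stage III is an assumption; the summary only names the step.
    Stage("Stage III", "instruction fine-tuning", ("instruction data",)),
)


def describe(pipeline) -> list:
    """Render each stage as a one-line description, in training order."""
    return [f"{s.name}: {s.kind} on {', '.join(s.data)}" for s in pipeline]


for line in describe(PIPELINE):
    print(line)
```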

Is Agentic

Yes

Architectures

Transformer (LLaMA-3 backbone)

Mistral (7B backbone variant)

Collaboration

generalization across diverse APIs and domains

Optimization Features

Infra Optimization

used 128 A100 (40G) GPUs for 11.1 days for 8B pretraining
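The reported hardware footprint converts to GPU-hours with simple arithmetic. The sketch below does that conversion and leaves the hourly price as a parameter, because no price is given in the source; the function names are illustrative.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps num_gpus busy for the given days."""
    return num_gpus * days * 24


def estimated_cost(hours: float, price_per_gpu_hour: float) -> float:
    """Rough cost estimate; the price is a user-supplied assumption."""
    return hours * price_per_gpu_hour


# Figures from the summary: 128 A100 (40G) GPUs for 11.1 days.
total_hours = gpu_hours(128, 11.1)
print(f"{total_hours:,.1f} GPU-hours")
```

That works out to about 34,100 A100 GPU-hours for the 8B run, a useful baseline when sizing a smaller pilot.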

System Optimization

parallel training with tensor and pipeline model parallelism

Training Optimization

two-stage continual pretraining to reduce stability gap

scaling-law fitting to choose data mix

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Apache-2.0 / MIT / LGPL-2.1 (seed sources per paper)

Data URLs

The Hephaestus paper lists seed data sources and public dataset URLs (see Appendix A and Tables 9-10).

Risks & Boundaries

Limitations

Compute limited: experiments only on small/medium backbones; large-scale behavior is untested.

Possible filter errors: the fastText classifier misclassifies some pages (the authors report failure cases).

When Not To Use

If you only need a quick instruction-following chatbot: heavy continual pretraining is costly.

If you lack the compute budget (the authors' 8B pretraining run used 128 A100s for ~11 days).

Failure Modes

Overfitting to seed patterns when retrieval data is sparse or unfiltered.

Stability gap: sudden drops in old-task performance if data distribution shifts without staged training.

Core Entities

Models

Hephaestus-8B, Hephaestus-7B (Mistral backbone), LLaMA-3-8B, LLaMA-3.1-8B, Mixtral-8x22B, Mistral-7B-v0.3, StarCoder-v2

Metrics

Accuracy, success rate, F1, reward score, benchmark loss

Datasets

Hephaestus-Forge, BFCL-v2, BFCL-v3, AgentBench, API-Bank, API-Bench, MMLU, ToolACE, ShareGPT, AgentFlan

Benchmarks

BFCL (Berkeley Function Calling Leaderboard), AgentBench, Nexus, API-Bank, API-Bench, MMLU

