Train LLMs on a 103B-token agent corpus to boost API function-calling, planning, and feedback adaptation.

February 10, 2025 · 7 min

Overview

Decision Snapshot: Ready For Pilot

The paper shows consistent improvements on multiple agent benchmarks and includes ablations and contamination checks, but experiments are limited to small/medium model scales and rely on many curated components.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

License: Apache-2.0 / MIT / LGPL-2.1 (seed sources per paper)

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang


Why It Matters For Business

Investing in targeted pretraining data for tool use yields measurable gains in API-calling and multi-step planning, letting mid-sized open models approach commercial LLM performance on agent tasks.


Summary TLDR

Hephaestus-Forge is a purpose-built, 103B-token pretraining corpus of API docs, function-calling trajectories, code, and text designed to teach models how to call APIs, plan multi-step tool sequences, and adapt to environment feedback. Continual pre-training (two-stage) on this mix plus standard instruction fine-tuning produces Hephaestus models (8B and 7B variants) that outperform comparable open-source models on three agent benchmarks. The authors also report an empirical optimal pretraining mix of roughly 36% agent data (≈1:1:1 agent:code:text). Experimental checks include data filtering, ablations of retrieval, and contamination string-matching.
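
The contamination check mentioned above can be pictured as an n-gram overlap test between training documents and benchmark items. This is a minimal sketch, not the authors' procedure: the summary only says "string-matching", so the whitespace tokenization, the window size `n=8`, and the sample strings are all assumptions made for illustration.

```python
def ngrams(text: str, n: int) -> set[str]:
    """Return the set of lowercase word n-grams in text (empty if text is shorter than n)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a training document that shares any n-gram with any benchmark item.

    n=8 is an arbitrary illustrative window, not a value taken from the paper.
    """
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)


# Hypothetical benchmark prompt and two hypothetical crawled documents.
benchmark = ["call get_weather with city set to Paris and units set to metric"]
leaked_doc = "Example: call get_weather with city set to Paris and units set to metric for a forecast"
clean_doc = "Use the weather API to fetch a forecast for any city"

print(is_contaminated(leaked_doc, benchmark))
print(is_contaminated(clean_doc, benchmark))
```

A shorter window catches more paraphrased leaks at the price of more false positives, so the window size is worth tuning against a hand-labeled sample.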

Problem Statement

Open-source LLMs lack agent-oriented pretraining data, so agents usually rely on heavy prompting or task-specific fine-tuning. That can fail to teach new tool-use skills, hurt generalization, and leave function-calling, multi-step planning, and feedback adaptation underdeveloped.

Main Contribution

Hephaestus-Forge: a 103B-token, multi-source pretraining corpus focused on API docs, function-calling trajectories, code, and text to teach agent skills.

A two-stage continual pre-training recipe (broad agent+general data then seed-focused agent data) that injects function-calling and intrinsic planning knowledge.

Key Findings

Hephaestus-Forge contains about 103 billion tokens and metadata for 76,537 APIs.

Numbers: 103B tokens; 76,537 APIs

Practical Use: If you need an agent-capable model, pretraining on large, focused agent corpora (100B+ tokens) gives the model direct exposure to API formats and call patterns.

Evidence Ref: Abstract; §4.1

Optimal pretraining mix is roughly 36% agent data, yielding an approximately 1:1:1 ratio of agent:code:text.

Numbers: Agent data ≈36% (1:1:1)

Practical Use: When building a continual pretraining corpus, allocate ~36% to agent-specific samples and split the remaining data roughly equally between code and general text.

Evidence Ref: §5 and Figure 3
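
To make the mix concrete, the sketch below turns a total token budget into per-source budgets using the ~36% agent share reported above, with the remainder split evenly between code and text. The function name `mix_token_budget` and the 10B-token budget are hypothetical, chosen only for illustration.

```python
def mix_token_budget(total_tokens: int, agent_share: float = 0.36) -> dict[str, int]:
    """Split a token budget into agent, code, and text portions.

    agent_share defaults to the ~36% reported in the summary; the remaining
    tokens are divided equally between code and general text.
    """
    if not 0.0 <= agent_share <= 1.0:
        raise ValueError("agent_share must be between 0 and 1")
    agent = round(total_tokens * agent_share)
    remaining = total_tokens - agent
    code = remaining // 2
    text = remaining - code  # text absorbs any odd leftover token
    return {"agent": agent, "code": code, "text": text}


total = 10_000_000_000  # hypothetical 10B-token continual pretraining budget
budget = mix_token_budget(total)
for source, tokens in budget.items():
    print(f"{source}: {tokens:,} tokens ({tokens / total:.0%})")
```

With these inputs the split is 36% agent, 32% code, and 32% text, which is the "approximately 1:1:1" ratio the finding describes.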

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 70.78% | LLaMA-3-8B-IFT 62.12% | +8.66 pp | BFCL-v2 | Hephaestus-8B-IFT BFCL-v2 OA 70.78 vs LLaMA-3-8B-IFT 62.12 | Table 7; Table 4
Accuracy | 51.59% | LLaMA-3-8B-IFT 48.52% | +3.07 pp | BFCL-v3 | Hephaestus-8B-IFT BFCL-v3 OA 51.59 vs LLaMA-3-8B-IFT 48.52 | Table 4
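
The deltas in the table are percentage-point differences (value minus baseline). The short sketch below recomputes them from the reported accuracies and also prints the relative gain, which the source does not report; the helper names are illustrative.

```python
def delta_pp(value: float, baseline: float) -> float:
    """Absolute improvement in percentage points."""
    return round(value - baseline, 2)


def relative_gain_pct(value: float, baseline: float) -> float:
    """Improvement relative to the baseline, as a percentage of the baseline."""
    return round((value - baseline) / baseline * 100, 2)


# (Hephaestus-8B-IFT accuracy, LLaMA-3-8B-IFT accuracy) as listed in the table.
results = {"BFCL-v2": (70.78, 62.12), "BFCL-v3": (51.59, 48.52)}

for split, (value, baseline) in results.items():
    print(f"{split}: +{delta_pp(value, baseline)} pp, "
          f"{relative_gain_pct(value, baseline)}% relative to baseline")
```

Keeping the two measures separate matters when comparing benchmarks with very different baselines: the same percentage-point gain is a larger relative improvement on the harder split.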

What To Try In 7 Days

Collect a small seed of API docs and usage examples for your tooling surface.

Use semantic retrieval (embed+nearest neighbors) to expand seeds from web crawls.

Train a lightweight classifier (fastText) to filter agent-relevant pages, then sample a 1:1:1 mix of agent:code:text for short continual pretraining or adapter experiments (small b
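
The seed-expansion step above can be prototyped in a few lines. This is a rough sketch under stated assumptions: a bag-of-words vector stands in for a learned embedding model, the seed and page strings are invented, and `expand_seeds` is a hypothetical helper rather than anything from the paper. In a real pipeline the retrieved pages would then pass through the relevance classifier before sampling.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_seeds(seeds: list[str], pages: list[str], k: int = 2) -> list[str]:
    """Return the k crawled pages most similar to any seed document."""
    seed_vecs = [embed(s) for s in seeds]
    if not seed_vecs:
        return []
    scored = []
    for page in pages:
        vec = embed(page)
        # Score each page by its closest seed (nearest-neighbor similarity).
        scored.append((max(cosine(vec, sv) for sv in seed_vecs), page))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page for score, page in scored[:k] if score > 0]


# Hypothetical seed API doc and crawled pages.
seeds = ["GET /v1/orders returns a list of orders; parameters: status, limit"]
pages = [
    "POST /v1/orders creates an order; parameters: items, status",
    "Ten recipes for a quick weeknight dinner",
    "The limit parameter caps how many orders the list endpoint returns",
]
neighbors = expand_seeds(seeds, pages, k=2)
for page in neighbors:
    print(page)
```

Swapping `embed` for a sentence-embedding model and the linear scan for an approximate nearest-neighbor index is the natural next step once the corpus no longer fits a brute-force pass.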

Agent Features

Memory
short-term interaction state (observations → actions); implicit planning state in model parameters
Planning
intrinsic multi-step planning (sequence of API calls); plan refinement from environment feedback
Tool Use
API function calling (single and multi-turn); multi-tool sequencing and parameter selection
Frameworks
continual pre-training (Stage I broad, Stage II seed-focused); instruction fine-tuning (Stage III) for downstream alignment
Is Agentic

Yes

Architectures
Transformer (LLaMA-3 backbone); Mistral (7B backbone variant)
Collaboration
generalization across diverse APIs and domains
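
As an illustration of what "API function calling" and "parameter selection" mean in practice, the sketch below validates a model-emitted call against a small tool registry. The JSON shape with `name` and `arguments` is a common convention, not a format confirmed by the paper, and the `get_weather` tool and `TOOLS` registry are hypothetical.

```python
import json

# Hypothetical tool registry: required and optional parameter names per function.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}


def validate_call(raw: str) -> dict:
    """Parse a JSON function call and check its name and parameters against TOOLS."""
    call = json.loads(raw)
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name!r}")
    args = call.get("arguments", {})
    spec = TOOLS[name]
    missing = spec["required"] - set(args)
    unexpected = set(args) - spec["required"] - spec["optional"]
    if missing or unexpected:
        raise ValueError(
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )
    return {"name": name, "arguments": args}


good = validate_call('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(good)

try:
    validate_call('{"name": "get_weather", "arguments": {"town": "Paris"}}')
except ValueError as err:
    print("rejected:", err)
```

A check like this is also a cheap way to score function-calling outputs during small adapter experiments, before running a full benchmark.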

Optimization Features

Infra Optimization
used 128 A100 (40G) GPUs for 11.1 days for 8B pretraining
System Optimization
parallel training with tensor and pipeline model parallelism
Training Optimization
two-stage continual pretraining to reduce stability gap; scaling-law fitting to choose data mix
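
For budgeting, the reported infrastructure figure converts to GPU-hours as 128 GPUs × 11.1 days × 24 hours ≈ 34,099 GPU-hours. The sketch below does that arithmetic; the cost helper takes an hourly rate as input because the source gives no pricing, so any rate you pass in is your own assumption.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run that keeps num_gpus busy for the given number of days."""
    return num_gpus * days * 24


def estimate_cost(hours: float, usd_per_gpu_hour: float) -> float:
    """Rough cost estimate; usd_per_gpu_hour is supplied by the reader, not by the paper."""
    return hours * usd_per_gpu_hour


hours = gpu_hours(num_gpus=128, days=11.1)  # figures reported for the 8B pretraining run
print(f"{hours:,.1f} GPU-hours")
```

Call `estimate_cost(hours, rate)` with your provider's A100 rate to get a ballpark figure for reproducing the 8B run.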

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Apache-2.0 / MIT / LGPL-2.1 (seed sources per paper)

Data URLs

The Hephaestus paper lists seed data sources and public dataset URLs (see Appendix A and Tables 9-10).

Risks & Boundaries

Limitations

Compute limited: experiments only on small/medium backbones; large-scale behavior is untested.

Possible filter errors: fastText misclassifies some pages (they report failure cases).

When Not To Use

If you only need a quick instruction-following chatbot, since heavy continual pretraining is costly.

If you lack the compute budget (their 8B pretraining used 128 A100s for ~11 days).

Failure Modes

Overfitting to seed patterns when retrieval data is sparse or unfiltered.

Stability gap: sudden drops in old-task performance if data distribution shifts without staged training.

Core Entities

Models

Hephaestus-8B, Hephaestus-7B (Mistral backbone), LLaMA-3-8B, LLaMA-3.1-8B, Mixtral-8x22B, Mistral-7B-v0.3, StarCoder-v2

Metrics

Accuracy, success rate, F1, reward score, benchmark loss

Datasets

Hephaestus-Forge, BFCL-v2, BFCL-v3, AgentBench, API-Bank, API-Bench, MMLU, ToolACE, ShareGPT, AgentFlan

Benchmarks

BFCL (Berkeley Function Calling Leaderboard), AgentBench, Nexus, API-Bank, API-Bench, MMLU