Agentic flows create 25M synthetic instruction pairs to teach skills and boost a 7B model across many benchmarks

July 3, 2024 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

2

Authors

Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah

Why It Matters For Business

Agentic flows automate creation of large, diverse instruction data from raw web/code sources, enabling faster model skill updates without manual prompt engineering or heavy labeling.

Summary TLDR

AgentInstruct is an automated pipeline of LLM-powered agents that turns raw text and code into large, diverse sets of synthetic instruction-response pairs (≈22M agentic pairs, ≈25.8M once combined with 3.8M existing pairs). The authors use these pairs to post-train Mistral-7B into Orca-3 and report consistent gains over Mistral-7B-Instruct on many benchmarks (e.g., +40% AGIEval, +19% MMLU, +54% GSM8K, all relative) and a ~31% relative reduction in summarization hallucination. The method is built around three agentic flows (content transformation, seed instruction generation, and refinement) and relies on tools such as search and code interpreters to improve quality and diversity.
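As a rough illustration, the three flows can be pictured as a chain of functions that each call a model. The sketch below is not the authors' implementation: the prompts, the `Pair` type, the function names, and the `fake_llm` stub (a stand-in for a real model call) are all hypothetical and exist only to show how a raw seed could move through the flows.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Pair:
    instruction: str
    response: str


def transform_content(llm: LLM, raw_seed: str) -> str:
    """Flow 1: rewrite a raw seed into an intermediate form that is easier to ask about."""
    return llm(f"Rewrite the following text as a clean passage:\n{raw_seed}")


def generate_seed_instructions(llm: LLM, passage: str, n: int = 2) -> List[str]:
    """Flow 2: draft n candidate instructions grounded in the passage."""
    return [llm(f"Write question #{i + 1} about this passage:\n{passage}") for i in range(n)]


def refine_instruction(llm: LLM, instruction: str) -> str:
    """Flow 3: make a candidate instruction harder or more specific."""
    return llm(f"Make this instruction more challenging:\n{instruction}")


def run_pipeline(llm: LLM, raw_seed: str, n: int = 2) -> List[Pair]:
    """Chain the three flows, then answer each refined instruction."""
    passage = transform_content(llm, raw_seed)
    pairs = []
    for draft in generate_seed_instructions(llm, passage, n):
        instruction = refine_instruction(llm, draft)
        response = llm(f"Context:\n{passage}\n\nAnswer:\n{instruction}")
        pairs.append(Pair(instruction, response))
    return pairs


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model call, used only for this demo."""
    return f"<out:{len(prompt)}>"


for pair in run_pipeline(fake_llm, "Photosynthesis converts light into chemical energy."):
    print(pair)
```

Passing the model in as a plain callable keeps the sketch independent of any particular API; a real setup would swap `fake_llm` for an actual client and add the tool calls mentioned above.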

Problem Statement

Synthetic data can speed model development but varies widely in quality and diversity; creating high-quality, diverse synthetic instruction data at scale usually needs heavy human curation. The paper asks: can agentic multi-step flows turn raw documents and code into large, diverse, and high-quality synthetic datasets for post-training ("Generative Teaching") with minimal human effort?

Main Contribution

AgentInstruct: an agentic framework with three flows (content transformation, seed instruction generation, refinement) that generates both prompts and responses from raw seeds.

A large post-training dataset of ≈25.8M paired instructions produced from raw text and code seeds plus existing instruction corpora.

A finetuned 7B model (Orca-3) showing substantial, benchmarked improvements over Mistral-7B-instruct and other baselines.

Practical implementation details: training setup (152 A100 GPUs, 3 epochs, ~200 hours) and evaluation across many public benchmarks including a custom Orca-Bench.

Key Findings

AgentInstruct produced roughly 25.8 million instruction–response pairs used for post-training.

Numbers: ≈25.8M paired instructions (22M agentic + 3.8M external)

Fine-tuning Mistral-7B on the AgentInstruct data (Orca-3) yields large gains on multiple benchmarks versus Mistral-7B-Instruct.

Numbers: AGIEval +40%; MMLU +19%; GSM8K +54% (relative to Mistral-7B-Instruct)

Orca-3 improves average Orca-Bench score and outperforms prior finetunes.

Numbers: Orca-Bench: Orca-3 = 9.55 vs Mistral-7B-Instruct = 8.31 (scale 0–10)

AgentInstruct decreased hallucination rates in summarization tasks.

Numbers: overall hallucination rate 21.09% vs 30.72% for Mistral-7B-Instruct (a 31.34% relative reduction)

The RAG skill taught by AgentInstruct yields large relative gains over Mistral-7B-Instruct when both models use the same retrieval setup.

Numbers: MIRAGE average with RAG improves by ≈38.3% relative to Mistral-7B-Instruct; PubMedQA +92.7% on one split

Generating the dataset and training required nontrivial compute.

Numbers: training used 152 NVIDIA A100 GPUs for ~200 hours, 3 epochs

Results

AGIEval (score)

Value: 56.80

Baseline: Mistral-7B-Instruct 40.52

MMLU (accuracy)

Value: 69.95

Baseline: Mistral-7B-Instruct 58.61

GSM8K (accuracy)

Value: 83.09

Baseline: Mistral-7B-Instruct 54.06

BBH (score)

Value: 61.83

Baseline: Mistral-7B-Instruct 44.71

AlpacaEval (win-rate score)

Value: 24.80

Baseline: Mistral-7B-Instruct 17.10

Orca-Bench (0–10)

Value: 9.55

Baseline: Mistral-7B-Instruct 8.31

Summarization hallucination rate (micro)

Value: 21.09%

Baseline: Mistral-7B-Instruct 30.72%

RAG (MIRAGE average)

Value: 56.22 (CoT) / 64.27 (RAG)

Baseline: Mistral-7B-Instruct 40.75 (CoT) / 46.47 (RAG)
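The headline percentages in this summary are relative changes against the Mistral-7B-Instruct baseline, not percentage-point differences. A short check using the values listed above makes the arithmetic explicit. The two accuracy rows are assumed here to be MMLU (69.95) and GSM8K (83.09), which is consistent with the +19% and +54% figures quoted earlier.

```python
def relative_change(value: float, baseline: float) -> float:
    """Percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# (Orca-3 value, Mistral-7B-Instruct baseline), copied from the results above.
results = {
    "AGIEval": (56.80, 40.52),
    "MMLU (assumed label)": (69.95, 58.61),
    "GSM8K (assumed label)": (83.09, 54.06),
    "BBH": (61.83, 44.71),
    "AlpacaEval": (24.80, 17.10),
    "Summarization hallucination rate": (21.09, 30.72),
    "MIRAGE average (RAG)": (64.27, 46.47),
}

for name, (value, baseline) in results.items():
    print(f"{name}: {relative_change(value, baseline):+.1f}%")
```

For the hallucination rate a negative number is the desired direction, since lower is better.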

Who Should Care

What To Try In 7 Days

Run a small AgentInstruct flow on a domain corpus (10k seeds) to generate a pilot instruction set.

Fine-tune a small checkpoint on the pilot data and compare core tasks (e.g., format following, domain QA) vs baseline.

Add inexpensive verification steps (tool calls or LLM critic) to reduce hallucinations before scaling.
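A verification step can be as simple as a filter that scores each generated pair and drops the low scorers. The sketch below is a minimal illustration, not something described in the paper: `filter_pairs`, the `Critic` signature, the threshold, and the word-overlap `toy_grounding_critic` are all hypothetical, and the toy critic only stands in for a real LLM critic or tool check.

```python
from typing import Callable, Dict, List

# Hypothetical critic signature: returns a score in [0, 1] for one pair.
Critic = Callable[[str, str], float]


def filter_pairs(
    pairs: List[Dict[str, str]], critic: Critic, threshold: float = 0.5
) -> List[Dict[str, str]]:
    """Keep only the pairs whose critic score meets the threshold."""
    return [p for p in pairs if critic(p["instruction"], p["response"]) >= threshold]


def toy_grounding_critic(instruction: str, response: str) -> float:
    """Toy stand-in for an LLM critic: share of response words found in the instruction."""
    context = set(instruction.lower().split())
    words = response.lower().split()
    if not words:
        return 0.0
    return sum(w in context for w in words) / len(words)


pilot = [
    {"instruction": "Summarize: the cat sat on the mat", "response": "the cat sat"},
    {"instruction": "Summarize: the cat sat on the mat", "response": "dogs fly south"},
]
kept = filter_pairs(pilot, toy_grounding_critic, threshold=0.8)
print(f"kept {len(kept)} of {len(pilot)} pairs")
```

Logging the rejected pairs alongside their scores is usually worthwhile in a pilot, because it shows whether the threshold is removing hallucinations or just unusual phrasing.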

Agent Features

Memory

  • short-term conversation history (per-flow)
  • no claimed long-term retrieval memory

Planning

  • iterative refinement flows
  • suggester-editor cycles
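One way to picture a suggester-editor cycle is a loop in which one agent proposes modifications and another applies them, with each round building on the previous round's output. The sketch below is a hypothetical illustration under that reading; `refine`, the `Suggester` and `Editor` signatures, and the toy agents are assumptions and are not taken from the paper.

```python
from typing import Callable, List

Suggester = Callable[[str], List[str]]  # proposes modifications for an instruction
Editor = Callable[[str, str], str]      # applies one suggestion to an instruction


def refine(instruction: str, suggest: Suggester, edit: Editor, rounds: int = 2) -> List[str]:
    """Run suggester-editor rounds and return every variant, original included."""
    variants = [instruction]
    frontier = [instruction]
    for _ in range(rounds):
        next_frontier = []
        for current in frontier:
            for suggestion in suggest(current):
                next_frontier.append(edit(current, suggestion))
        variants.extend(next_frontier)
        frontier = next_frontier
    return variants


def toy_suggester(instruction: str) -> List[str]:
    """Stand-in for an LLM suggester agent."""
    return ["add a constraint", "require a justification"]


def toy_editor(instruction: str, suggestion: str) -> str:
    """Stand-in for an LLM editor agent."""
    return f"{instruction} [{suggestion}]"


for variant in refine("Explain why the sky is blue.", toy_suggester, toy_editor):
    print(variant)
```

Because every variant is expanded in each round, the number of outputs grows geometrically with `rounds`; a real flow would likely cap or sample the frontier.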

Tool Use

  • search APIs
  • code interpreter
  • calculator
  • external APIs

Frameworks

  • Content Transformation Flow
  • Seed Instruction Generation Flow
  • Instruction Refinement Flow

Is Agentic

true

Architectures

  • LLM-powered agents
  • multi-agent orchestration (flows)

Collaboration

  • multi-agent handoffs (content→instruction→refinement)

Optimization Features

Token Efficiency

  • packing to max sequence length 8192
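Packing means concatenating several short tokenized examples into one training sequence so that fewer positions are wasted on padding. The sketch below shows a simple greedy version under stated assumptions: the summary does not say which packing strategy was used, `pack_sequences` is a hypothetical helper, and truncating over-long examples is a simplification.

```python
from typing import List

MAX_LEN = 8192  # maximum sequence length stated above


def pack_sequences(examples: List[List[int]], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedily concatenate tokenized examples into sequences of at most max_len tokens.

    Examples longer than max_len are truncated (a simplifying assumption).
    """
    packed: List[List[int]] = []
    current: List[int] = []
    for tokens in examples:
        tokens = tokens[:max_len]
        # Start a new sequence when the next example would overflow the current one.
        if current and len(current) + len(tokens) > max_len:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed


examples = [[1] * 5000, [2] * 3000, [3] * 4000, [4] * 9000]
print([len(seq) for seq in pack_sequences(examples)])
```

A production packer would also track example boundaries so that attention and loss masks can keep the packed examples from leaking into each other.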

Infra Optimization

  • training run used 19 nodes (152 A100 GPUs), batch size 10 per GPU

System Optimization

  • distributed training across 152 A100 GPUs
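The reported figures allow a quick back-of-the-envelope estimate of the training footprint. The calculation below uses only numbers from this summary, plus two labeled assumptions: no gradient accumulation, and every sequence fully packed to 8192 tokens (which makes the token count an upper bound).

```python
NODES = 19
GPUS = 152
HOURS = 200            # approximate wall-clock training time
BATCH_PER_GPU = 10
MAX_SEQ_LEN = 8192

gpus_per_node = GPUS // NODES
gpu_hours = GPUS * HOURS
global_batch = GPUS * BATCH_PER_GPU               # assumes no gradient accumulation
max_tokens_per_step = global_batch * MAX_SEQ_LEN  # upper bound: fully packed sequences

print(f"GPUs per node:         {gpus_per_node}")
print(f"GPU-hours:             {gpu_hours:,}")
print(f"Sequences per step:    {global_batch:,}")
print(f"Max tokens per step:   {max_tokens_per_step:,}")
```

Multiplying the GPU-hours by a cloud provider's hourly A100 rate gives a rough training-only cost; it does not include the cost of generating the data.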

Training Optimization

  • token packing to 8192 context length
  • label masking to compute loss only on responses
  • AdamW optimizer with cosine LR schedule and 500-step warmup
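Two of the items above are easy to show in a few lines: label masking, and a cosine schedule with warmup. The sketch below is illustrative rather than the authors' training code. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss; the linear warmup shape, the decay to zero, and the `peak_lr` argument are assumptions, since the summary gives only the 500-step warmup and the cosine schedule.

```python
import math
from typing import List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(input_ids: List[int], prompt_len: int) -> List[int]:
    """Copy input_ids as labels, hiding the first prompt_len tokens from the loss."""
    prompt_len = min(prompt_len, len(input_ids))
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]


def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_steps: int = 500) -> float:
    """Linear warmup to peak_lr (assumed shape), then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Prompt tokens (first two) are masked; response tokens keep their ids as labels.
print(mask_prompt_labels([11, 12, 21, 22, 23], prompt_len=2))

# peak_lr=1.0 is a placeholder, so the printed values are fractions of the peak.
for step in (0, 499, 500, 5000, 10000):
    print(step, lr_at_step(step, total_steps=10000, peak_lr=1.0))
```

With masking in place, the loss is computed only on response tokens, so the model is trained to produce answers rather than to reproduce the instructions.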

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Creating flows requires human engineering; not fully automatic.
  • Quality depends on seed data; biased or low-quality seeds propagate problems.
  • Generation and training are compute- and cost-intensive.
  • Synthetic data may still miss real-world nuance and can introduce new biases.
  • Validation of synthetic examples is challenging at scale.

When Not To Use

  • When you lack compute budget for large-scale generation or training.
  • When legally protected or highly sensitive data requires strict human vetting.
  • When human-labeled, benchmark-specific fidelity is required instead of capability teaching.

Failure Modes

  • Model collapse or style imitation if flows are not diverse enough.
  • Amplified bias from biased seed corpora.
  • Persistent hallucinations in domains without grounding.
  • Overfitting to synthetic styles that do not match target deployment data.

Core Entities

Models

  • Mistral-7B
  • Orca-3 (Mistral-7B finetuned on AgentInstruct)
  • Orca-2.5
  • Mistral-7B-Instruct
  • LLAMA3-8B-Instruct
  • GPT-3.5-turbo
  • GPT-4

Metrics

  • Orca-Bench score (0–10 relative to GPT-4)
  • Accuracy
  • Hallucination rate (%)
  • Quality score (1–10)
  • Relative % improvement vs baseline

Datasets

  • AgentInstruct dataset (≈25.8M pairs)
  • Orca-2.5-dataset (≈3.8M pairs)
  • KnowledgePile (seed)
  • AutoMathText (seed)
  • CodeParrot subset (seed)

Benchmarks

  • Orca-Bench
  • AGIEval
  • MMLU
  • GSM8K
  • BBH
  • AlpacaEval
  • FoFo
  • ACI-Bench
  • MIRAGE
  • DROP
  • MT-Bench
  • InfoBench