Agentic flows create 25M synthetic instruction pairs to teach skills and boost a 7B model across many benchmarks

July 3, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

The idea is practical: agentic flows produce large, diverse synthetic data and raise a 7B model's scores on many benchmarks, but effectiveness depends on seed quality, compute budget, and verification steps.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 8/8

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah

Links

Abstract / PDF

Why It Matters For Business

Agentic flows automate creation of large, diverse instruction data from raw web/code sources, enabling faster model skill updates without manual prompt engineering or heavy labeling.

Who Should Care

Summary TLDR

AgentInstruct is an automated pipeline of LLM-powered agents that turns raw text and code into large, diverse synthetic instruction-response pairs (≈25.8M). The authors use these pairs to post-train Mistral-7B into Orca-3 and report consistent boosts on many benchmarks (e.g., +40% AGIEval, +19% MMLU, +54% GSM8K) and a ~31% reduction in summarization hallucination. The method focuses on three agentic flows—content transformation, seed instruction generation, and refinement—and relies on tools like search and code interpreters to improve quality and diversity.

Problem Statement

Synthetic data can speed model development but varies widely in quality and diversity; creating high-quality, diverse synthetic instruction data at scale usually needs heavy human curation. The paper asks: can agentic multi-step flows turn raw documents and code into large, diverse, and high-quality synthetic datasets for post-training ("Generative Teaching") with minimal human effort?

Main Contribution

AgentInstruct: an agentic framework with three flows (content transformation, seed instruction generation, refinement) that generates both prompts and responses from raw seeds.

A large post-training dataset of ≈25.8M paired instructions produced from raw text and code seeds plus existing instruction corpora.
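The three flows named above can be pictured as a chain of stages. The sketch below is only an illustration of that data flow, not the authors' implementation: `LLM` is a hypothetical callable standing in for any text-generation client, and the prompts, function names, and the offline `fake_llm` stub are all assumptions made so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Pair:
    instruction: str
    response: str


def transform_content(llm: LLM, seed: str) -> str:
    """Flow 1 (content transformation): rewrite a raw seed into an intermediate form."""
    return llm(f"TRANSFORM: rewrite this text as a structured passage.\n{seed}")


def generate_instructions(llm: LLM, passage: str, n: int) -> List[str]:
    """Flow 2 (seed instruction generation): draft n instructions from the passage."""
    return [
        llm(f"INSTRUCT {i}: write one task about this passage.\n{passage}")
        for i in range(n)
    ]


def refine_instruction(llm: LLM, instruction: str) -> str:
    """Flow 3 (refinement): make a drafted instruction harder or more specific."""
    return llm(f"REFINE: increase the difficulty of this task.\n{instruction}")


def run_pipeline(llm: LLM, seeds: List[str], per_seed: int = 2) -> List[Pair]:
    """Chain the three flows, then generate a response for each refined instruction."""
    pairs = []
    for seed in seeds:
        passage = transform_content(llm, seed)
        for draft in generate_instructions(llm, passage, per_seed):
            instruction = refine_instruction(llm, draft)
            pairs.append(Pair(instruction, llm(f"ANSWER: {instruction}")))
    return pairs


def fake_llm(prompt: str) -> str:
    """Offline stub (assumption): echoes the prompt's tag so the data flow is visible."""
    tag = prompt.split(":", 1)[0]
    return f"<{tag.lower()} output>"


if __name__ == "__main__":
    for pair in run_pipeline(fake_llm, ["raw web text", "def add(a, b): return a + b"]):
        print(pair)
```

Swapping `fake_llm` for a real client is the only change needed to try the chain on real seeds; tool use (search, code interpreter) would sit inside the individual stage functions.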

Key Findings

AgentInstruct produced roughly 25.8 million instruction–response pairs used for post-training.

Numbers: ≈25.8M paired instructions (22M agentic + 3.8M external)

Practical Use: You can bootstrap a large instruction dataset from raw text/code without collecting prompt sets manually; expect a multi-million sample corpus.

Evidence Ref: 3.1 Dataset Description

Fine-tuning Mistral-7B on the AgentInstruct data (Orca-3) yields large gains on multiple benchmarks versus Mistral-7B-Instruct.

Numbers: AGIEval +40%; MMLU +19%; GSM8K +54% (relative to Mistral-7B-Instruct)

Practical Use: Post-training on agentic synthetic data can meaningfully raise a 7B model's general reasoning and math skills for broad use cases.

Evidence Ref: Abstract; Table 3

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| AGIEval (score) | 56.80 | Mistral-7B-Instruct 40.52 | +40% | AGIEval | Table 3 (AGIEval row) | Table 3 |
| Accuracy | 69.95 | Mistral-7B-Instruct 58.61 | +19% | MMLU | Table 3 (MMLU row) | Table 3 |
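The Delta column reports relative improvement over the baseline, not a difference in score points. A quick check using the Value and Baseline figures listed above (the function name is illustrative):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Orca-3 value and Mistral-7B-Instruct baseline, as listed in the Results rows.
rows = {
    "AGIEval": (56.80, 40.52),
    "MMLU": (69.95, 58.61),
}

for name, (orca3, mistral_instruct) in rows.items():
    points = orca3 - mistral_instruct
    gain = relative_gain(orca3, mistral_instruct)
    print(f"{name}: +{points:.2f} points, +{gain:.0f}% relative")
```

The absolute gains are about 16 points on AGIEval and 11 points on MMLU, which round to the +40% and +19% relative figures reported.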

What To Try In 7 Days

Run a small AgentInstruct flow on a domain corpus (10k seeds) to generate a pilot instruction set.

Fine-tune a small checkpoint on the pilot data and compare core tasks (e.g., format following, domain QA) vs baseline.

Add inexpensive verification steps (tool calls or LLM critic) to reduce hallucinations before scaling.
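A pilot along these lines needs two small utilities: a reproducible seed sample and a filter that drops weak pairs before fine-tuning. The sketch below is one possible shape, under stated assumptions: `Critic` is a hypothetical callable returning a 1 to 10 quality score (in practice an LLM critic or a tool-based check), the threshold of 7 is arbitrary, and `toy_critic` exists only so the example runs offline.

```python
import random
from typing import Callable, Dict, List, Tuple

Pair = Dict[str, str]
# Hypothetical critic: returns a quality score from 1 to 10 for a pair.
Critic = Callable[[Pair], int]


def sample_seeds(corpus: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw a reproducible pilot sample of at most k seed documents."""
    rng = random.Random(seed)
    return rng.sample(corpus, min(k, len(corpus)))


def filter_pairs(
    pairs: List[Pair], critic: Critic, min_score: int = 7
) -> Tuple[List[Pair], List[Pair]]:
    """Split pairs into (kept, rejected) using the critic's score."""
    kept, rejected = [], []
    for pair in pairs:
        (kept if critic(pair) >= min_score else rejected).append(pair)
    return kept, rejected


def toy_critic(pair: Pair) -> int:
    """Stand-in critic (assumption): penalizes empty or instruction-echoing responses."""
    response = pair["response"].strip()
    if not response or response == pair["instruction"].strip():
        return 1
    return 8


if __name__ == "__main__":
    pilot = [
        {"instruction": "Define recall.", "response": "TP / (TP + FN)."},
        {"instruction": "Define precision.", "response": ""},
    ]
    kept, rejected = filter_pairs(pilot, toy_critic)
    print(len(kept), "kept;", len(rejected), "rejected")
```

Keeping the rejected pairs, rather than discarding them silently, makes it easier to audit what the critic removes before scaling up.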

Agent Features

Memory: short-term conversation history (per-flow); no claimed long-term retrieval memory

Planning: iterative refinement flows; suggester-editor cycles

Tool Use: search APIs; code interpreter; calculator; external APIs

Frameworks: Content Transformation Flow; Seed Instruction Generation Flow; Instruction Refinement Flow

Is Agentic: Yes

Architectures: LLM-powered agents; multi-agent orchestration (flows)

Collaboration: multi-agent handoffs (content→instruction→refinement)
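The suggester-editor cycle listed under Planning can be read as a simple loop: one agent proposes a change to an instruction and a second agent applies it. The sketch below assumes that reading. Both agent types are hypothetical callables, the toy agents are placeholders for LLM calls, and the fixed round count is an illustrative choice rather than a detail from the paper.

```python
from typing import Callable, List

# Hypothetical agent interfaces; in practice each would wrap an LLM call.
Suggester = Callable[[str], str]    # instruction -> suggested change
Editor = Callable[[str, str], str]  # (instruction, suggestion) -> revised instruction


def refine(
    instruction: str, suggester: Suggester, editor: Editor, rounds: int = 2
) -> List[str]:
    """Run suggester-editor cycles and return every version, oldest first."""
    history = [instruction]
    for _ in range(rounds):
        suggestion = suggester(history[-1])
        history.append(editor(history[-1], suggestion))
    return history


def toy_suggester(instruction: str) -> str:
    """Placeholder (assumption): always proposes the same modification."""
    return "add a constraint on answer length"


def toy_editor(instruction: str, suggestion: str) -> str:
    """Placeholder (assumption): appends the suggestion instead of rewriting."""
    return f"{instruction} [{suggestion}]"


if __name__ == "__main__":
    for version in refine("Summarize the passage.", toy_suggester, toy_editor):
        print(version)
```

Returning the full history instead of only the final version keeps the per-flow short-term context available for inspection.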

Optimization Features

Token Efficiency: packing to max sequence length 8192

Infra Optimization: training run used 19 nodes (152 A100 GPUs), batch size 10 per GPU

System Optimization: distributed training across 152 A100 GPUs

Training Optimization: token packing to 8192 context length; label masking to compute loss only on responses; AdamW optimizer with cosine LR schedule and 500-step warmup
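Token packing and label masking are easiest to see on toy token ids. The sketch below is a minimal pure-Python illustration, not the authors' training code: it assumes greedy packing of whole examples, truncation of oversized examples, and the value -100 for masked labels (PyTorch's default `ignore_index` for cross-entropy; the source does not state which value was used).

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross-entropy
Example = Tuple[List[int], List[int]]  # (prompt token ids, response token ids)
Packed = Tuple[List[int], List[int]]   # (input ids, labels)


def mask_labels(prompt: List[int], response: List[int]) -> Packed:
    """Concatenate prompt and response; only response tokens keep real labels."""
    input_ids = prompt + response
    labels = [IGNORE_INDEX] * len(prompt) + response
    return input_ids, labels


def pack(examples: List[Example], max_len: int = 8192) -> List[Packed]:
    """Greedily pack whole examples into sequences of at most max_len tokens."""
    packed: List[Packed] = []
    cur_ids: List[int] = []
    cur_labels: List[int] = []
    for prompt, response in examples:
        ids, labels = mask_labels(prompt, response)
        if len(ids) > max_len:
            # Truncate examples that cannot fit in a single sequence.
            ids, labels = ids[:max_len], labels[:max_len]
        if cur_ids and len(cur_ids) + len(ids) > max_len:
            # Current sequence is full: flush it and start a new one.
            packed.append((cur_ids, cur_labels))
            cur_ids, cur_labels = [], []
        cur_ids, cur_labels = cur_ids + ids, cur_labels + labels
    if cur_ids:
        packed.append((cur_ids, cur_labels))
    return packed


if __name__ == "__main__":
    toy = [([1, 2], [3, 4, 5]), ([6], [7, 8]), ([9, 10, 11], [12, 13])]
    for ids, labels in pack(toy, max_len=8):
        print(ids, labels)
```

A production setup would typically also restrict attention so that packed examples cannot attend to one another; that step is omitted here.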

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Creating flows requires human engineering; not fully automatic.

Quality depends on seed data; biased or low-quality seeds propagate problems.

When Not To Use

When you lack compute budget for large-scale generation or training.

When legally protected or highly sensitive data requires strict human vetting.

Failure Modes

Model collapse or style imitation if flows are not diverse enough.

Amplified bias from biased seed corpora.

Core Entities

Models

Mistral-7B; Orca-3 (Mistral-7B finetuned on AgentInstruct); Orca-2.5; Mistral-7B-Instruct; LLAMA3-8B-Instruct; GPT-3.5-turbo; GPT-4

Metrics

Orca-Bench score (0–10 relative to GPT-4); Accuracy; Hallucination rate (%); Quality score (1–10); Relative % improvement vs baseline

Datasets

AgentInstruct dataset (≈25.8M pairs); Orca-2.5-dataset (≈3.8M pairs); KnowledgePile (seed); AutoMathText (seed); CodeParrot subset (seed)

Benchmarks

Orca-Bench; AGIEval; MMLU; GSM8K; BBH; AlpacaEval; FoFo; ACI-Bench; MIRAGE; DROP; MT-Bench; InfoBench