Generate validated, machine-readable agent interaction records using only LLMs

October 20, 20257 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Abhigya Verma, Seganrasan Subramanian, Nandhakumar Kandasamy, Naman Gupta

Links

Abstract / PDF

Why It Matters For Business

Generates machine-readable agent interaction data at scale without human labeling. This can cut annotation cost, speed agent training cycles, and produce testbeds for function-calling accuracy and multi-turn behavior.

Summary TLDR

The paper presents a modular LLM-only framework (named SYTHIA inside the paper) to synthesize fully structured agentic records—tasks, function/tool schemas, pseudocode policies, turn-by-turn dialogues, and execution traces—plus automated validation and judge-based filtering so generated data can be machine-consumed for training and evaluation.

Problem Statement

Training and evaluating LLM agents needs structured, executable records (user intent + tool specs + call arguments + execution traces). Human collection is expensive and slow. Existing synthetic pipelines often lack strong schema enforcement and execution fidelity required for tooling and function calling.

Main Contribution

A unified, modular framework to synthesize agentic data end-to-end using only LLMs, with no human-in-the-loop.

Four concrete pipelines that target different supervision granularities: RecordSynth (full records), DAGFirstGeneration (atomic tool-call triples), MultiTurnDialogueSynth (validated multi-turn dialogues), and AgenticRecordRollout (serialize for SFT).

Schema-first generation and automated validators: JSON-schema function specs, execution-step traces, typed argument checks, and judge-based scoring to filter low-quality samples.

Design details and prompt templates to enforce syntactic and semantic constraints so outputs are machine-parseable and BFCL-style compatible for function-calling benchmarks.

Key Findings

The framework is implemented as four modular pipelines covering end-to-end records, DAG-based atomic triples, multi-turn dialogues, and rollout to SFT-ready chat examples.

Numbers4 pipelines (RecordSynth, DAGFirstGeneration, MultiTurnDialogueSynth, AgenticRecordRollout)

Outputs are strictly schema-constrained and machine-loadable (JSON lists and pydantic-compatible pseudocode), with per-step alignment between function inputs and outputs.

The pipeline includes automated validation and judge modules that filter samples on schema correctness, argument typing, grounding, and instruction clarity.

The authors explicitly note risks: DAG instantiation errors, prompt sensitivity, model bias in synthetic outputs, and long-term degeneration when training on self-generated data.

Who Should Care

What To Try In 7 Days

Run RecordSynth on one high-value domain (e.g., ticketing or CLM) to create a small set (100–500) of validated agentic records.

Use DAGFirstGeneration to convert a few records into BFCL-style triples and evaluate your model's function-calling accuracy on them.

Integrate JSON-schema validation and judge-filtering into your data pipeline to catch malformed tool calls before training.

Agent Features

Memory

  • short-term execution traces (per-record); no long-term memory module described

Planning

  • Declarative DAG templates anchor tool dependencies
  • execution traces capture sequential and parallel steps

Tool Use

  • Structured function calling
  • Mocked tool responses for validation

Frameworks

  • RecordSynth
  • DAGFirstGeneration
  • MultiTurnDialogueSynth
  • AgenticRecordRollout
  • SyGra (orchestration)

Is Agentic

true

Architectures

  • DAG-based planning templates
  • pydantic-compatible pseudocode policies

Collaboration

  • Supports simulated user proxy and agent exchanges (self-play dialogue generation)

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • DAG fidelity: mis-specified DAGs can propagate incorrect ground truth into atomic supervision (Sec. 5.1).
  • Prompt sensitivity: small phrasing changes can alter generated arguments and reasoning (Sec. 5.1).
  • Bias and safety: generated content can inherit LLM biases and hallucinated tool signatures; prompts and filters are needed (Sec. 5.2).
  • Model collapse risk: training repeatedly on self-generated data risks degeneration and loss of grounding diversity (Sec. 5.3).
  • Reproducibility constraints: exact replication needs function libraries, deterministic LLM sampling, and comparable infra (Sec. 5.4).

When Not To Use

  • As the only training source for safety-critical decision systems without human verification.
  • If you cannot provide stable, deterministic LLM behavior or consistent function schemas.
  • When your application requires real-world execution traces from live systems rather than mocked outputs.

Failure Modes

  • Hallucinated or incorrect function arguments despite schema checks (if prompts trick the model).
  • Cascading DAG errors where an early wrong node causes incorrect downstream supervision.
  • Overfitting to prompt templates leading to unnatural utterances and brittle agents.
  • Synthetic degeneration over multiple train-finetune cycles (self-distillation collapse).

Core Entities

Models

  • SyGra 2 (data orchestration substrate referenced)
  • Mistral-Nemo-Instruct-2407 (tokenizer example)

Metrics

  • schema validity (type and field matching)
  • judge-based clarity/grounding scores

Datasets

  • Synthetic agentic datasets (RecordSynth outputs)
  • BFCL-style atomic tool-call datasets (derived via DAGFirstGeneration)

Benchmarks

  • Berkeley Function-Calling Leaderboard (BFCL)
  • AgentBench (mentioned as related work)