AUGCON: automatic pipeline to generate diverse, multi-granularity SFT pairs from any corpus

May 26, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Shanghaoran Quan

Why It Matters For Business

AUGCON creates high-quality, diverse SFT pairs automatically, lowering annotation costs and improving domain-adapted LLM performance for productized assistants and search/chat features.

Summary TLDR

AUGCON is an automated pipeline that builds supervised fine-tuning (SFT) query–response pairs from a custom corpus. It (1) extracts multi-granularity queries with a Context-Split-Tree (CST) that recursively splits the text and derives a query matching each resulting context, (2) trains a contrastive scorer to rank and filter the queries for quality and diversity, and (3) uses principle-driven self-alignment plus a self-improving in-context example search to produce high-fidelity answers. AUGCON outperforms several prior context-driven generators in human evaluation (DailyM) and on four standard benchmarks. It runs on open-source LLMs, and its code and datasets are released.

Problem Statement

Creating high-quality, diverse query–response pairs from a private corpus by hand is costly. Existing automated methods produce redundant or single-granularity queries and lower-fidelity answers. What is needed is an automatic, scalable way to generate multi-granularity, high-diversity SFT data that yields better fine-tuned models.

Main Contribution

Context-Split-Tree (CST): a recursive LLM-driven splitting method to derive queries that match different context granularities.

Contrastive scorer: train a lightweight scorer with contrastive learning to rank and filter queries for quality and diversity.

Principle-driven self-alignment + self-improving: search for best few-shot exemplars and apply alignment principles to generate higher-fidelity answers.

Extensive evaluations (human + automatic) showing AUGCON yields higher diversity, realism, and downstream model quality; code, dataset, and models to be open-sourced.
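The recursive idea behind CST can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: `derive_and_split` is a hypothetical stand-in for the LLM call that returns a query plus two sub-contexts, `toy_split` is a rule-based placeholder for it, and `lam` plays the role of the length threshold λ below which a context is no longer split.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical LLM stand-in: context -> (query, first_part, second_part).
SplitFn = Callable[[str], Tuple[str, str, str]]


@dataclass
class CSTNode:
    context: str
    query: str
    children: List["CSTNode"] = field(default_factory=list)


def build_cst(context: str, derive_and_split: SplitFn, lam: int) -> CSTNode:
    """Derive one query for `context`, then recurse into its two parts."""
    query, first, second = derive_and_split(context)
    node = CSTNode(context=context, query=query)
    if len(context) < lam:  # context too short to split further
        return node
    for part in (first.strip(), second.strip()):
        # Only recurse into non-empty parts that are strictly shorter,
        # which guarantees termination even if the splitter misbehaves.
        if part and len(part) < len(context):
            node.children.append(build_cst(part, derive_and_split, lam))
    return node


def collect_queries(node: CSTNode) -> List[str]:
    """Return the queries of all nodes in pre-order."""
    queries = [node.query]
    for child in node.children:
        queries.extend(collect_queries(child))
    return queries


def toy_split(context: str) -> Tuple[str, str, str]:
    """Placeholder for the LLM: split at the middle sentence boundary."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    query = f"What does this passage say about: {sentences[0]}?"
    mid = len(sentences) // 2
    first = ". ".join(sentences[:mid]) + ("." if mid else "")
    second = ". ".join(sentences[mid:]) + "."
    return query, first, second


text = "A b. C d. E f. G h."
print(len(collect_queries(build_cst(text, toy_split, lam=1))))   # fine-grained
print(len(collect_queries(build_cst(text, toy_split, lam=12))))  # coarse
```

With this toy splitter, the four-sentence text yields seven queries when `lam` is small and three when `lam=12`, which illustrates how a larger threshold keeps only the coarser, macro-level queries.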

Key Findings

AUGCON improves accuracy on reading QA benchmarks compared to prior context-driven SFT methods.

Numbers: SQuAD1.1 accuracy 0.336 vs 0.314 (best baseline); TriviaQA accuracy 0.849 vs 0.825; DROP accuracy 0.350 vs 0.334; WebGLM-QA BERTScore 0.924 vs 0.903

AUGCON wins GPT-4 pairwise judgements vs other context-driven methods.

Numbers: wins vs ETRC 64.5% to 35.5%; wins vs Context-Instruct 60.3% to 39.7%

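CST produces a predictable, linear number of queries: 2n − 1 for n sentence units when splitting continues down to single sentences.

Numbers: a binary split tree with n single-sentence leaves has n − 1 internal nodes, so a context of n sentences yields 2n − 1 derived queries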
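The count can be checked with a short induction. This sketch assumes every split is binary and that recursion continues until each leaf holds a single sentence; a length threshold that stops splitting earlier would give fewer queries. Here Q(n) is the number of queries for a context of n sentences, and k is the number of sentences sent to the first child; both symbols are introduced only for this argument. The result agrees with the linear growth noted under Token Efficiency.

```latex
\begin{aligned}
Q(1) &= 1, \\
Q(n) &= 1 + Q(k) + Q(n-k), \qquad 1 \le k \le n-1, \\
\text{if } Q(m) &= 2m - 1 \text{ for all } m < n, \text{ then} \\
Q(n) &= 1 + (2k - 1) + \bigl(2(n-k) - 1\bigr) = 2n - 1 .
\end{aligned}
```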

Results

SQuAD1.1 Accuracy

Value: 0.336 ± 0.004 (AUGCON fine-tuned)

Baseline: Context-Instruct 0.314 ± 0.003

TriviaQA Accuracy

Value: 0.849 ± 0.003 (AUGCON fine-tuned)

Baseline: Context-Instruct 0.825 ± 0.003

DROP Accuracy

Value: 0.350 ± 0.003 (AUGCON fine-tuned)

Baseline: Context-Instruct 0.334 ± 0.003

WebGLM-QA BERTScore (long-form)

Value: 0.924 ± 0.002 (AUGCON fine-tuned)

Baseline: ETRC 0.903 ± 0.001
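To put the four results on a common footing, the absolute and relative gains over the best baseline can be computed directly from the reported means. The snippet below is plain arithmetic on the numbers listed above; it ignores the reported standard deviations, and the helper name `relative_gain` is only illustrative.

```python
# (AUGCON fine-tuned, best baseline) means as reported above.
results = {
    "SQuAD1.1 accuracy": (0.336, 0.314),
    "TriviaQA accuracy": (0.849, 0.825),
    "DROP accuracy": (0.350, 0.334),
    "WebGLM-QA BERTScore": (0.924, 0.903),
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, as a fraction."""
    return (new - base) / base


for name, (augcon, baseline) in results.items():
    absolute = augcon - baseline
    relative = 100 * relative_gain(augcon, baseline)
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1f}% relative")
```

The relative gains range from roughly 2% to 7%, with the largest on SQuAD1.1.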

Who Should Care

What To Try In 7 Days

Run AUGCON on a small, high-value corpus (1–2k documents) and fine-tune an open-source chat model to compare against domain-adaptive pre-training (DAPT).

Use CST with a higher λ to prioritize macro-topic queries or a lower λ to harvest detailed Q&A and observe downstream task gains.

Open the released code and run the scorer + filtering pipeline to produce a compact, diverse SFT set and test model quality with a 100-query human sample.
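For intuition about the scorer + filtering step, here is a minimal, dependency-free sketch. It is not the released pipeline: `score_fn` is a hypothetical stand-in for the trained scorer, the pairwise logistic loss is one common contrastive-style objective (the paper's exact loss may differ), and token-level Jaccard similarity with a `max_sim` cutoff is an assumed proxy for diversity.

```python
import math
from typing import Callable, List, Optional


def pairwise_contrastive_loss(pos_score: float, neg_score: float) -> float:
    """-log(sigmoid(pos - neg)): small when the positive outscores the negative."""
    return math.log1p(math.exp(-(pos_score - neg_score)))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two queries, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_and_filter(
    queries: List[str],
    score_fn: Callable[[str], float],
    max_sim: float = 0.6,
    top_k: Optional[int] = None,
) -> List[str]:
    """Keep queries in descending score order, skipping near-duplicates."""
    kept: List[str] = []
    for query in sorted(queries, key=score_fn, reverse=True):
        if all(jaccard(query, other) <= max_sim for other in kept):
            kept.append(query)
        if top_k is not None and len(kept) >= top_k:
            break
    return kept


candidates = ["What is CST?", "what is cst?", "How is the scorer trained?"]
# Query length is used as a placeholder score purely for demonstration.
print(rank_and_filter(candidates, score_fn=len))
```

The greedy pass keeps the highest-scoring query of each near-duplicate group, so the compact set retains quality while dropping redundancy. A real run would replace `len` with the trained scorer's output.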

Optimization Features

Token Efficiency

  • CST yields a number of questions that grows linearly with the number of sentence units, controlling token generation via c

Infra Optimization

  • Single-node 8xA100 setup, DeepSpeed + ZeRO-2

Model Optimization

  • LoRA

System Optimization

  • Use of A100 80G multi-GPU node; generation throughput ~340 pairs/A100-hour

Training Optimization

  • DeepSpeed ZeRO-2 for memory efficiency
  • AdamW optimizer and 4 training epochs
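A training setup along these lines could be expressed as a DeepSpeed configuration. The sketch below encodes only what is stated above (ZeRO stage 2, AdamW, 4 epochs); the batch size, gradient accumulation, learning rate, weight decay, and bf16 setting are placeholder assumptions, not values from the paper.

```python
import json

NUM_EPOCHS = 4  # stated above; the epoch count is passed to the training loop, not DeepSpeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # assumption
    "gradient_accumulation_steps": 4,     # assumption
    "bf16": {"enabled": True},            # assumption; A100 GPUs support bf16
    "zero_optimization": {"stage": 2},    # ZeRO-2: shard optimizer state and gradients
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4, "weight_decay": 0.0},  # assumed hyperparameters
    },
}

# Serialize for use as a DeepSpeed JSON config file.
ds_json = json.dumps(ds_config, indent=2)
print(ds_json)
```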

Inference Optimization

  • vLLM for high-throughput LLM calls
  • concurrent requests (8 threads) to accelerate generation
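The concurrency pattern can be sketched with the standard library alone. Here `call_llm` is a hypothetical function that would send one prompt to an inference server (for example, a vLLM instance exposing its OpenAI-compatible API) and return the completion; a trivial stand-in is used so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def generate_concurrently(
    prompts: List[str],
    call_llm: Callable[[str], str],
    max_workers: int = 8,
) -> List[str]:
    """Issue up to `max_workers` requests at once; results keep prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, prompts))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real request to the inference server."""
    return f"answer to: {prompt}"


print(generate_concurrently(["q1", "q2", "q3"], fake_llm))
```

Threads suit this workload because each request spends most of its time waiting on the server, which batches the in-flight requests on the GPU side.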

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Quality depends on input context depth; shallow or noisy corpora yield weaker queries and answers.
  • Method requires substantial GPU resources for large corpora (reported 184 A100 hours to generate 120K pairs on DailyM).
  • Potential bias from base LLMs and sensitivity to cultural/linguistic nuances not fully resolved.
  • Generalization across low-resource languages and highly specialized domains not fully validated.
  • Answer filtering beyond self-alignment (e.g., post-hoc factual verification) left as future work.

When Not To Use

  • You have a tiny corpus with few coherent sentences—CST will yield limited value.
  • You lack GPU budget for generating and fine-tuning (AUGCON can be compute-heavy at scale).
  • Your application requires formal safety certification or rigorous fact checks; outputs may need extra verification.

Failure Modes

  • LLM hallucination during context split produces invalid subcontexts or nonsensical queries.
  • Scorer misranking retains low-quality queries if negative sample construction is insufficient.
  • Self-improving ICL search selects suboptimal exemplars, lowering answer fidelity.
  • Overfitting to synthetic prompt style if few-shot examples are not diverse.

Core Entities

Models

  • Qwen1.5-32B-Chat
  • Llama3-70B-Instruct
  • Llama3-c-70B
  • Qwen1.5-c-32B

Metrics

  • Accuracy
  • BERTScore
  • ROUGE-L
  • GPT-4 pairwise judgement

Datasets

  • SFT
  • SQuAD1.1
  • TriviaQA
  • DROP
  • WebGLM-QA

Benchmarks

  • SQuAD1.1
  • TriviaQA
  • DROP
  • WebGLM-QA
  • DailyM test set

Context Entities

Models

  • GPT-4 (used as judge)

Datasets

  • Open-source magazine corpus (DailyM), public QA benchmarks