Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
AUGCON creates high-quality, diverse SFT pairs automatically, lowering annotation costs and improving domain-adapted LLM performance for productized assistants and search/chat features.
Summary TLDR
AUGCON is an automated pipeline that builds supervised fine-tuning (SFT) query–response pairs from a custom corpus. It (1) extracts multi-granularity questions via a Context-Split-Tree (CST) that recursively splits text and derives a matching query for each piece, (2) trains a contrastive scorer to rank and filter queries for diversity, and (3) uses principle-driven self-alignment plus a self-improving in-context example search to produce high-fidelity answers. AUGCON outperforms several prior context-driven generators in human evaluation on DailyM and on four standard benchmarks. It works with open-source LLMs, and its code and datasets are released.
Problem Statement
Hand-writing high-quality, diverse query–response pairs from a private corpus is costly. Existing automated methods produce redundant or single-granularity queries and lower-fidelity answers. What is needed is an automatic, scalable way to generate multi-granularity, high-diversity SFT data that yields better fine-tuned models.
Main Contribution
Context-Split-Tree (CST): a recursive LLM-driven splitting method to derive queries that match different context granularities.
Contrastive scorer: train a lightweight scorer with contrastive learning to rank and filter queries for quality and diversity.
Principle-driven self-alignment + self-improving: search for best few-shot exemplars and apply alignment principles to generate higher-fidelity answers.
Extensive evaluations (human + automatic) showing AUGCON yields higher diversity, realism, and downstream model quality; code, dataset, and models to be open-sourced.
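The recursive splitting idea behind CST can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_llm` is a hypothetical stand-in for the LLM call that returns a query plus two sub-contexts, the split is a naive halving by sentence, and `lam` (the granularity threshold λ) is assumed here to be measured in sentences.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CSTNode:
    context: str
    query: str
    children: List["CSTNode"] = field(default_factory=list)


def split_sentences(text: str) -> List[str]:
    # Naive sentence splitter; a real pipeline would use something sturdier.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def stub_llm(context: str) -> Tuple[str, str, str]:
    # Hypothetical stand-in for the LLM: returns (query, left, right).
    sents = split_sentences(context)
    head = sents[0][:40] if sents else ""
    mid = len(sents) // 2
    return f"What is discussed in: {head}?", " ".join(sents[:mid]), " ".join(sents[mid:])


def build_cst(context: str, lam: int,
              llm: Callable[[str], Tuple[str, str, str]] = stub_llm) -> CSTNode:
    # Every node gets one query; recursion stops once the context is
    # no longer than lam sentences or cannot be split into two parts.
    query, left, right = llm(context)
    node = CSTNode(context, query)
    if len(split_sentences(context)) > lam and left and right:
        node.children = [build_cst(left, lam, llm), build_cst(right, lam, llm)]
    return node


def collect_queries(node: CSTNode) -> List[str]:
    # Pre-order traversal: coarse (whole-context) queries come first.
    queries = [node.query]
    for child in node.children:
        queries.extend(collect_queries(child))
    return queries
```

In this sketch, a four-sentence context yields 7 queries with `lam=1` and 3 queries with `lam=2`, so a larger threshold keeps only the coarser, macro-level queries.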
Key Findings
AUGCON improves accuracy on reading QA benchmarks compared to prior context-driven SFT methods.
AUGCON wins GPT-4 pairwise judgements vs other context-driven methods.
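CST produces a predictable number of queries: 2n - 1 for n sentence units (one query per node of a binary tree with n leaves), so the count grows linearly with context length.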
Results
Metrics shown: Accuracy (three panels); WebGLM-QA BERTScore (long-form)
Who Should Care
What To Try In 7 Days
Run AUGCON on a small, high-value corpus (1–2k documents) and fine-tune an open-source chat model to compare against DAPT.
Use CST with a higher λ to prioritize macro-topic queries or a lower λ to harvest detailed Q&A and observe downstream task gains.
Open the released code and run the scorer + filtering pipeline to produce a compact, diverse SFT set and test model quality with a 100-query human sample.
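For the last item, the filtering step could look roughly like the sketch below. It assumes each query already has a quality score from some scorer, and it swaps in plain token-level Jaccard overlap as the redundancy measure. Both `filter_queries` and the `max_sim` threshold are hypothetical names chosen for illustration, not part of the released pipeline.

```python
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    # Token-set overlap; a crude proxy for query similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_queries(scored: List[Tuple[str, float]],
                   max_sim: float = 0.6,
                   keep: Optional[int] = None) -> List[str]:
    # Greedy pass: visit queries from highest to lowest score and keep one
    # only if it is not too similar to anything already kept.
    kept: List[str] = []
    for query, _score in sorted(scored, key=lambda p: p[1], reverse=True):
        if all(jaccard(query, k) < max_sim for k in kept):
            kept.append(query)
        if keep is not None and len(kept) >= keep:
            break
    return kept


scored = [
    ("What is the capital of France?", 0.9),
    ("What is the capital of France", 0.8),   # near-duplicate, lower score
    ("How do rivers form deltas?", 0.5),
]
selected = filter_queries(scored)
```

Here `selected` keeps the first and third queries and drops the near-duplicate. A learned scorer and an embedding-based similarity would replace the two stand-ins in practice.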
Optimization Features
Token Efficiency
- CST yields a number of questions that grows linearly with the number of sentence units, controlling token generation via c
Infra Optimization
- Single-node 8xA100 setup, DeepSpeed + ZeRO-2
Model Optimization
- LoRA
System Optimization
- Use of A100 80G multi-GPU node; generation throughput ~340 pairs/A100-hour
Training Optimization
- DeepSpeed ZeRO-2 for memory efficiency
- AdamW optimizer and 4 training epochs
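A DeepSpeed configuration matching these settings might look like the sketch below, written as a Python dict that is dumped to JSON. Only ZeRO stage 2, AdamW, and the 4 epochs come from the text above; the batch size, gradient accumulation, learning rate, weight decay, and bf16 choice are placeholder assumptions. The epoch count is not a DeepSpeed config key, so it would be set in the training script.

```python
import json

# Values marked "placeholder" are assumptions, not reported settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,      # placeholder
    "gradient_accumulation_steps": 8,         # placeholder
    "bf16": {"enabled": True},                # assumption: bf16 on A100
    "zero_optimization": {"stage": 2},        # ZeRO-2, as stated above
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4, "weight_decay": 0.0},  # placeholder
    },
}

NUM_EPOCHS = 4  # consumed by the training loop, not by DeepSpeed


def write_config(path: str = "ds_config.json") -> str:
    # Write the config where a launcher can pick it up.
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path
```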
Inference Optimization
- vLLM for high-throughput LLM calls
- concurrent requests (8 threads) to accelerate generation
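Issuing requests from 8 threads can be sketched with the standard library alone. `fake_llm_call` below is a hypothetical placeholder for whatever client function sends a prompt to the serving endpoint; nothing here depends on a specific vLLM API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_llm_call(prompt: str) -> str:
    # Placeholder for a blocking request to an LLM server.
    return f"response to: {prompt}"


def generate_concurrently(prompts: List[str],
                          call: Callable[[str], str] = fake_llm_call,
                          max_workers: int = 8) -> List[str]:
    # Threads suit this workload because each call mostly waits on I/O.
    # pool.map returns results in the same order as the input prompts.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call, prompts))
```

Because the results come back in input order, each response can be paired with its query without extra bookkeeping.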
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Quality depends on input context depth; shallow or noisy corpora yield weaker queries and answers.
- Method requires substantial GPU resources for large corpora (reported 184 A100 hours to generate 120K pairs on DailyM).
- Potential bias from base LLMs and sensitivity to cultural/linguistic nuances not fully resolved.
- Generalization across low-resource languages and highly specialized domains not fully validated.
- Answer filtering beyond self-alignment (e.g., post-hoc factual verification) left as future work.
When Not To Use
- You have a tiny corpus with few coherent sentences—CST will yield limited value.
- You lack GPU budget for generating and fine-tuning (AUGCON can be compute-heavy at scale).
- Your application requires formal safety certification or rigorous fact checks; outputs may need extra verification.
Failure Modes
- LLM hallucination during context split produces invalid subcontexts or nonsensical queries.
- Scorer misranking retains low-quality queries if negative sample construction is insufficient.
- Self-improving ICL search selects suboptimal exemplars, lowering answer fidelity.
- Overfitting to synthetic prompt style if few-shot examples are not diverse.
Core Entities
Models
- Qwen1.5-32B-Chat
- Llama3-70B-Instruct
- Llama3-c-70B
- Qwen1.5-c-32B
Metrics
- Accuracy
- BERTScore
- ROUGE-L
- GPT-4 pairwise judgement
Datasets
- SFT
- SQuAD1.1
- TriviaQA
- DROP
- WebGLM-QA
Benchmarks
- SQuAD1.1
- TriviaQA
- DROP
- WebGLM-QA
- DailyM test set
Context Entities
Models
- GPT-4 (used as judge)
Datasets
- Open-source magazine corpus (DailyM), public QA benchmarks

