Overview
The recipe is practical and validated across models and tasks. Experiments show consistent gains, but behavior beyond 13B-parameter models and 64k-token contexts, and coverage of task types beyond QA, summarization, and reasoning, remain only partially explored.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.86
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 3/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 50%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
If you need models to read and act on long documents (reports, codebases, books), adding a few thousand diverse long instruction examples and using packing + loss weighting cuts training time and materially improves task performance without hurting short-context skills.
Summary TLDR
LongAlign is a pragmatic recipe for making LLMs follow instructions on long inputs. The team builds 10k supervised long-instruction examples (8k–64k tokens) from nine sources, proposes packing and sorted batching to speed up supervised fine-tuning, and introduces LongBench-Chat, a human-checked benchmark of queries 10k–100k in length, scored by GPT-4. Packing combined with the proposed loss weighting fixes a training bias and improves long-task accuracy (~10%); larger and more diverse long-instruction data yields up to ~30% gains on the evaluated long tasks. Code, data, and models are open-sourced.
Problem Statement
Current long-context work focuses on extending architecture and positional encodings but lacks practical instruction-following finetuning data, efficient multi-GPU training methods for long varied-length examples, and a reliable benchmark to evaluate instruction-following on very long inputs.
Main Contribution
A diverse long instruction-following dataset: 10k generated SFT instances from 9 long-text sources covering 8k–64k token lengths (10% Chinese).
Training recipes for efficient supervised finetuning: packing and sorted batching plus a loss-weighting fix for packing that balances per-sequence loss contributions.
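To make the packing idea concrete, here is a minimal sketch of how variable-length examples might be grouped into packs that fit a fixed token budget. The greedy first-fit-decreasing strategy, the function names `pack_sequences` and `cumulative_boundaries`, and the example lengths are all illustrative assumptions; this summary does not specify the exact packing algorithm the authors use. The cumulative boundaries are the kind of per-sequence offsets that variable-length attention kernels (such as those in FlashAttention-2) typically need so that packed sequences do not attend to each other.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing (an assumed strategy).

    Returns a list of packs; each pack is a list of sequence indices whose
    total token count does not exceed max_len.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs: List[List[int]] = []
    free: List[int] = []  # remaining token budget of each pack
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, max_len is {max_len}")
        for p, room in enumerate(free):
            if n <= room:  # first pack with enough room
                packs[p].append(i)
                free[p] -= n
                break
        else:  # no existing pack fits: open a new one
            packs.append([i])
            free.append(max_len - n)
    return packs


def cumulative_boundaries(pack: List[int], lengths: List[int]) -> List[int]:
    """Start/end offsets of each sequence inside one pack."""
    bounds = [0]
    for i in pack:
        bounds.append(bounds[-1] + lengths[i])
    return bounds


# Hypothetical token lengths for six training examples.
lengths = [60_000, 9_000, 30_000, 20_000, 12_000, 4_000]
packs = pack_sequences(lengths, max_len=65_536)
for pack in packs:
    print(pack, cumulative_boundaries(pack, lengths))
```

With these lengths, six examples fit into three packs instead of six padded rows, which is where the training-time saving comes from.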
Key Findings
More long instruction data materially improves long-context instruction performance.
Diversity of long data helps instruction-following beyond raw volume.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| LongBench-Chat (ChatGLM3-6B-64k) | LongAlign-0k: 3.73; LongAlign-5k: 5.97; LongAlign-10k: 6.21; LongAlpaca-12k: 4.46 | LongAlign-0k (no long SFT) | 10k vs 0k: +2.48 absolute (≈+66% relative on score scale) | LongBench-Chat | Table 2; Sec 4.2 | Table 2 |
| Effect of packing + loss weighting (ChatGLM3-6B-64k) | Naïve: 5.87 → Packing: 5.76 → Packing+loss weighting: 6.21 | Naïve batching: 5.87 | Packing+loss weighting vs packing: +0.45 (≈+7.8%); vs naïve batching: +0.34 (≈+5.8%) | LongBench-Chat | Table 3; Sec 4.3 | Table 3 |
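
The deltas in the table can be rechecked from the reported scores alone. The short sketch below recomputes absolute and relative differences; the helper name `deltas` is illustrative, and the only inputs are the scores listed in the table.

```python
from typing import Tuple


def deltas(new: float, base: float) -> Tuple[float, float]:
    """Return (absolute delta, relative delta in percent), rounded."""
    absolute = new - base
    relative = absolute / base * 100
    return round(absolute, 2), round(relative, 1)


# LongAlign-10k vs LongAlign-0k on LongBench-Chat
print(deltas(6.21, 3.73))  # (2.48, 66.5)
# Packing + loss weighting vs plain packing
print(deltas(6.21, 5.76))  # (0.45, 7.8)
# Packing + loss weighting vs naive batching
print(deltas(6.21, 5.87))  # (0.34, 5.8)
```

Note that plain packing scores slightly below naïve batching (5.76 vs 5.87), so the gain attributed to loss weighting depends on which of the two is taken as the baseline.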
What To Try In 7 Days
Create 1k–5k long instruction examples from your domain (mix sources).
Implement packing or sorted batching to speed SFT; measure GPU idle time.
Add per-sequence loss scaling when packing to avoid bias toward long examples and test long-task accuracy.
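As a starting point for the batching experiment in the steps above, the sketch below groups examples of similar length into the same batch and compares the fraction of padded (wasted) token slots against batching in arrival order. Padding fraction is used here only as a rough proxy for GPU idle time; the function names, batch size, and example lengths are assumptions for illustration, not the authors' implementation.

```python
import random
from typing import List


def sorted_batches(lengths: List[int], batch_size: int, seed: int = 0) -> List[List[int]]:
    """Group indices of similar-length sequences, then shuffle the batch order."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)  # avoid a strict short-to-long schedule
    return batches


def padding_fraction(batches: List[List[int]], lengths: List[int]) -> float:
    """Share of token slots that are padding when each batch pads to its longest member."""
    padded = sum(len(b) * max(lengths[i] for i in b) for b in batches)
    real = sum(lengths[i] for b in batches for i in b)
    return 1 - real / padded


# Hypothetical token lengths for eight training examples.
lengths = [8_000, 64_000, 9_000, 60_000, 16_000, 15_000, 32_000, 30_000]
naive = [list(range(i, i + 2)) for i in range(0, len(lengths), 2)]
sorted_ = sorted_batches(lengths, batch_size=2)

print(f"naive padding:  {padding_fraction(naive, lengths):.1%}")
print(f"sorted padding: {padding_fraction(sorted_, lengths):.1%}")
```

On this toy input the padded share drops from roughly a third of all token slots to a few percent. Real idle time should still be measured with a profiler, since it also depends on how work is distributed across GPUs.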
Risks & Boundaries
Limitations
The data focuses on QA, summarization, and reasoning; it lacks multi-turn lifelong dialogues and other long-task types (Sec E).
Experiments are restricted to SFT on models up to 13B parameters, with context windows mainly up to 64k tokens, due to resource and framework limits.
When Not To Use
When the required task is multi-turn lifelong dialogue or other long-task types not covered by the dataset.
If you cannot modify your training pipeline to support packing or the FlashAttention-2 APIs.
Failure Modes
Packing without loss weighting over-weights long sequences and sequences with more target tokens, which harms training.
Evaluator bias: GPT-4 scoring, while correlated with humans, may still miss nuanced preferences or favor certain styles.
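The packing failure mode above can be illustrated with a toy example. The sketch below compares three ways of reducing per-token losses when two sequences share one pack: the per-sequence mean (every example counts equally), the unweighted token mean over the pack, and a token mean reweighted by 1 / (N_i * M), where N_i is the number of target tokens in sequence i and M is the number of sequences. This weighting is one assumed form that restores per-sequence balance; the summary does not give the authors' exact formula, and the function names and loss values are illustrative.

```python
from typing import List


def sequence_mean_loss(token_losses: List[List[float]]) -> float:
    """Reference: average of per-sequence mean losses (each sequence counts once)."""
    return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)


def packed_token_mean_loss(token_losses: List[List[float]]) -> float:
    """Unweighted packing: average over every target token in the pack."""
    flat = [x for seq in token_losses for x in seq]
    return sum(flat) / len(flat)


def packed_weighted_loss(token_losses: List[List[float]]) -> float:
    """Packing with each token scaled by 1 / (N_i * M) (assumed weighting)."""
    m = len(token_losses)
    return sum(x / (len(seq) * m) for seq in token_losses for x in seq)


# Hypothetical per-target-token losses for two sequences in one pack.
token_losses = [
    [2.0, 2.0],   # short answer: 2 target tokens, high loss
    [1.0] * 8,    # long answer: 8 target tokens, low loss
]

print(sequence_mean_loss(token_losses))      # 1.5
print(packed_token_mean_loss(token_losses))  # 1.2, dominated by the long answer
print(packed_weighted_loss(token_losses))    # 1.5, matches the reference
```

The unweighted pack average drifts toward the sequence with more target tokens, while the reweighted version recovers the per-sequence mean. With several packs per batch, M would be the total number of sequences in the batch rather than in a single pack.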

