Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
If you need models to read and act on long documents (reports, codebases, books), fine-tuning on a few thousand diverse long instruction examples pays off: packing and sorted batching roughly double training throughput, and packing with loss weighting improves long-task performance without hurting short-context skills.
Summary TLDR
LongAlign is a pragmatic recipe for making LLMs follow instructions on long inputs. The team builds 10k supervised long-instruction examples (8k–64k tokens) from nine sources, proposes packing and sorted batching to speed up supervised fine-tuning, and introduces LongBench-Chat, a human-annotated benchmark of 50 queries over 10k–100k-token contexts, scored by GPT-4. Packing combined with the proposed loss weighting corrects a training bias and improves long-task accuracy by about 10%; more, and more diverse, long data yields gains of up to about 30% on the evaluated long tasks. Code, data, and models are open-sourced.
Problem Statement
Current long-context work focuses on extending architecture and positional encodings but lacks practical instruction-following finetuning data, efficient multi-GPU training methods for long varied-length examples, and a reliable benchmark to evaluate instruction-following on very long inputs.
Main Contribution
A diverse long instruction-following dataset: 10k generated SFT instances from 9 long-text sources covering 8k–64k token lengths (10% Chinese).
Training recipes for efficient supervised finetuning: packing and sorted batching plus a loss-weighting fix for packing that balances per-sequence loss contributions.
LongBench-Chat: a 50-question, human-annotated benchmark (contexts of 10k–100k tokens) scored by GPT-4 with few-shot examples and validated against human judgments.
Empirical study: shows data quantity/diversity and training choices materially affect long-context instruction performance and scale to 13B models and 128k context in experiments.
Key Findings
More long instruction data materially improves long-context instruction performance.
Diversity of long data helps instruction-following beyond raw volume.
Packing and sorted batching roughly double training throughput versus naïve batching.
Packing causes loss-weighting bias; correcting it improves long-task accuracy by ~10%.
GPT-4 with few-shot scoring aligns well with human raters on LongBench-Chat.
Method scales to larger models and longer windows in experiments.
Results
LongBench-Chat (ChatGLM3-6B-64k)
Effect of packing + loss weighting (ChatGLM3-6B-64k)
Scaling to 13B (LongBench-Chat)
Who Should Care
What To Try In 7 Days
Create 1k–5k long instruction examples from your domain (mix sources).
Implement packing or sorted batching to speed SFT; measure GPU idle time.
Add per-sequence loss scaling when packing to avoid bias toward long examples and test long-task accuracy.
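The per-sequence loss scaling in the last step can be sketched as follows. This is a minimal plain-Python illustration, not the LongAlign implementation: the function names `pack_mean_loss` and `balanced_loss` and the toy loss values are assumptions made for the example. It contrasts averaging token losses per pack with averaging per sequence first, so that every sequence contributes equally.

```python
from typing import List

# A pack is a list of sequences; each sequence is a list of
# per-target-token losses (hypothetical toy representation).
Pack = List[List[float]]


def pack_mean_loss(packs: List[Pack]) -> float:
    """Naive packing loss: token-mean within each pack, then mean over packs."""
    per_pack = []
    for pack in packs:
        tokens = [t for seq in pack for t in seq]
        per_pack.append(sum(tokens) / len(tokens))
    return sum(per_pack) / len(per_pack)


def balanced_loss(packs: List[Pack]) -> float:
    """Balanced loss: token-mean within each sequence, then mean over sequences."""
    per_seq = [sum(seq) / len(seq) for pack in packs for seq in pack]
    return sum(per_seq) / len(per_seq)


# Toy batch: pack 0 holds one long sequence, pack 1 holds three shorter ones.
packs = [
    [[2.0] * 8],
    [[1.0] * 2, [1.0] * 2, [4.0] * 4],
]
naive = pack_mean_loss(packs)
balanced = balanced_loss(packs)
```

In the naive version the single long sequence carries half of the total weight and the 4-token sequence outweighs the 2-token ones, which is the bias the recipe aims to remove. In the balanced version each of the four sequences carries a weight of 1/4.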
Agent Features
Memory
- Extended input-context memory (up to 64k/128k tokens)
Frameworks
- Packing training
- Sorted batching
Architectures
- Long-context transformer (RoPE scaling)
Optimization Features
Token Efficiency
- Use ChatGLM tokenizer for denser Chinese compression (dataset measured by that tokenizer)
Infra Optimization
- DeepSpeed + ZeRO3 + CPU offloading (8xA800 80G GPUs tested)
System Optimization
- FlashAttention-2 block-diagonal attention over packed sequences, with sequence boundaries passed via cu_seqlens
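To make the role of cu_seqlens concrete, here is a small sketch of how cumulative sequence boundaries describe a pack, and what the equivalent block-diagonal causal mask looks like. The helper names `cu_seqlens` and `block_diagonal_causal_mask` are hypothetical. FlashAttention-2's variable-length interface (`flash_attn_varlen_func`) takes such boundaries as int32 tensors and does not materialize a mask; the explicit mask below is only for illustration.

```python
from itertools import accumulate
from typing import List


def cu_seqlens(lengths: List[int]) -> List[int]:
    """Cumulative sequence boundaries for one pack: [0, l0, l0 + l1, ...]."""
    return [0] + list(accumulate(lengths))


def block_diagonal_causal_mask(lengths: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    total = sum(lengths)
    # Map each token position to the index of the sequence it belongs to.
    seq_id = [s for s, n in enumerate(lengths) for _ in range(n)]
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


# A pack holding two sequences of 3 and 2 tokens.
bounds = cu_seqlens([3, 2])
mask = block_diagonal_causal_mask([3, 2])
```

Tokens of the second sequence cannot attend to the first, so packing does not leak context across unrelated examples.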
Training Optimization
- Packing (concat sequences into packs)
- Sorted batching (group by length)
- Loss weighting for packing to balance per-sequence loss
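The first two strategies above can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions, not the released training code: the greedy first-fit heuristic, the function names `pack_sequences` and `sorted_batches`, and the toy lengths are all chosen for the example.

```python
import random
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit packing: return packs as lists of sequence indices."""
    packs: List[List[int]] = []
    remaining: List[int] = []  # free token budget left in each pack
    # Place longer sequences first so shorter ones can fill the gaps.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        if lengths[idx] > max_len:
            raise ValueError(f"sequence {idx} exceeds max_len")
        for p, free in enumerate(remaining):
            if lengths[idx] <= free:
                packs[p].append(idx)
                remaining[p] -= lengths[idx]
                break
        else:
            # No existing pack has room: open a new one.
            packs.append([idx])
            remaining.append(max_len - lengths[idx])
    return packs


def sorted_batches(lengths: List[int], batch_size: int, seed: int = 0) -> List[List[int]]:
    """Group similar-length sequences into batches, then shuffle batch order."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


lengths = [60, 10, 30, 25, 5, 50]
packs = pack_sequences(lengths, max_len=64)
batches = sorted_batches(lengths, batch_size=2)
```

Packing keeps every GPU working on a nearly full context window, while sorted batching reduces padding by keeping similar lengths together. Both change how examples are grouped, which is why the packed variant is paired with loss weighting.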
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Data focuses on QA, summarization, reasoning; lacks multi-turn lifelong dialogues and other long-task types (Sec E).
- Experiments restricted to SFT on models up to 13B and context window mainly up to 64k due to resource/framework limits.
- LongBench-Chat is 50 questions; while GPT-4 scoring correlates well with humans, the benchmark size is modest.
When Not To Use
- When the required task is multi-turn lifelong dialogue or other long-task types not covered by the dataset.
- When you cannot modify the training pipeline to support packing or the FlashAttention-2 APIs.
- When you need RLHF-style preference alignment — this recipe focuses on supervised finetuning.
Failure Modes
- Packing without loss weighting over-weights long sequences and sequences with more target tokens, which biases training.
- Evaluator bias: GPT-4 scoring, while correlated with humans, may still miss nuanced preferences or favor certain styles.
- Dataset licensing: some sources (Books3) have unclear distribution rights; check legal use before deployment.
Core Entities
Models
- ChatGLM3-6B-64k
- LongAlign-6B-64k
- LongAlign-7B-64k
- LongAlign-13B-64k
- Llama-2-7B-64k
- Llama-2-13B-64k
Metrics
- GPT-4 rating (1–10)
- Spearman rho
- Kendall tau
- Normalized 0–100 for some tasks
Datasets
- SFT
- LongAlpaca-12k
- ShareGPT
- LongBench-Chat
- LongBench
Benchmarks
- LongBench-Chat
- LongBench
- Needle in a Haystack
Context Entities
Models
- GPT-4-1106-preview
- Claude-2.1
- GLM-4-128k
- Vicuna-7b-16k
- Mixtral-8x7b
Metrics
- ROUGE / F1 (not used directly for aligned models here)
- GPT-4 scoring averages
Datasets
- ArXiv
- Books3
- C4
- CLUECorpus2020
- CommonCrawl
- GitHub
- Stack Exchange
- Wikipedia
- WuDaoCorpora
Benchmarks
- MT-Bench
- MMLU
- ARC
- HellaSwag
- TruthfulQA

