A practical recipe (data + training + benchmark) to finetune LLMs to read and follow instructions on 8k–64k+ contexts

January 31, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The recipe is practical and validated across models and tasks; experiments show consistent gains, but large-scale limits and dataset breadth remain partially explored.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need models to read and act on long documents (reports, codebases, books), adding a few thousand diverse long instruction examples and using packing + loss weighting cuts training time and materially improves task performance without hurting short-context skills.

Who Should Care

Summary TLDR

LongAlign is a pragmatic recipe for making LLMs follow instructions on long inputs. The team builds 10k supervised long-instruction examples (8k–64k tokens) from nine sources, proposes packing and sorted batching to speed up supervised fine-tuning, and introduces LongBench-Chat, a human-checked benchmark of queries 10k–100k in length that is scored by GPT-4. Loss weighting corrects a training bias that packing introduces and improves long-task performance by about 10%; more, and more diverse, long data yields gains of up to about 30% on the evaluated long tasks. Code, data, and models are open-sourced.

Problem Statement

Current long-context work focuses on extending architecture and positional encodings but lacks practical instruction-following finetuning data, efficient multi-GPU training methods for long varied-length examples, and a reliable benchmark to evaluate instruction-following on very long inputs.

Main Contribution

A diverse long instruction-following dataset: 10k generated SFT instances from 9 long-text sources covering 8k–64k token lengths (10% Chinese).

Training recipes for efficient supervised finetuning: packing and sorted batching plus a loss-weighting fix for packing that balances per-sequence loss contributions.
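
The data-arrangement half of this recipe can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pack_sequences` and `sorted_batches`, the greedy first-fit heuristic, the seed parameter, and the toy lengths are all hypothetical choices made for the example.

```python
import random
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit packing (an assumed heuristic).

    Returns packs of sequence indices; the lengths in each pack sum to <= max_len.
    """
    packs: List[List[int]] = []
    free: List[int] = []  # remaining token capacity of each pack
    # Place the longest sequences first so short ones fill the leftover space.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} ({n} tokens) exceeds max_len={max_len}")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(idx)
                free[p] -= n
                break
        else:  # no existing pack has room, so open a new one
            packs.append([idx])
            free.append(max_len - n)
    return packs


def sorted_batches(lengths: List[int], batch_size: int, seed: int = 0) -> List[List[int]]:
    """Sorted batching: group sequences of similar length, then shuffle batch order."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)  # keep training order random across batches
    return batches


if __name__ == "__main__":
    toy_lengths = [60, 10, 30, 50, 20, 40]  # hypothetical token counts
    print(pack_sequences(toy_lengths, max_len=100))
    print(sorted_batches(toy_lengths, batch_size=2))
```

Both strategies aim at the same thing: less padding and less GPU idle time when sequence lengths vary widely. Packing changes how the loss is averaged, which is why it is paired with a loss-weighting fix.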

Key Findings

More long instruction data materially improves long-context instruction performance.

Numbers: LongBench-Chat: 3.73 (0k) → 6.21 (10k) average score

Practical Use: Add several thousand diverse long examples (not just short instruction mix) to SFT to see large improvements on long tasks.

Evidence Ref: Table 2; Sec 4.2

Diversity of long data helps instruction-following beyond raw volume.

Numbers: LongAlign-10k outperforms LongAlpaca-12k on LongBench-Chat and MT-Bench (examples in Table 2)

Practical Use: Prefer varied long-text sources (books, papers, code, Wikipedia) over narrow sources when creating long SFT data.

Evidence Ref: Sec 4.2; Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| LongBench-Chat (ChatGLM3-6B-64k) | LongAlign-0k: 3.73; LongAlign-5k: 5.97; LongAlign-10k: 6.21; LongAlpaca-12k: 4.46 | LongAlign-0k (no long SFT) | 10k vs 0k: +2.48 absolute (≈+66% relative on score scale) | LongBench-Chat | Table 2; Sec 4.2 | Table 2 |
| Effect of packing + loss weighting (ChatGLM3-6B-64k) | Naïve: 5.87 → Packing: 5.76 → Packing+loss weighting: 6.21 | Naïve batching 5.87 | Packing+loss weighting vs packing: +0.45 (≈+7.8%) | LongBench-Chat | Table 3; Sec 4.3 | Table 3 |
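
The deltas in these results can be rechecked from the reported scores. The short sketch below uses only the numbers quoted above; the helper name `delta` is hypothetical. It also computes the gain of packing + loss weighting over naïve batching, which is the listed baseline for the second result.

```python
from typing import Tuple


def delta(new: float, base: float) -> Tuple[float, float]:
    """Return (absolute change, relative change in percent) of `new` over `base`."""
    absolute = new - base
    return round(absolute, 2), round(100 * absolute / base, 1)


# LongBench-Chat scores for ChatGLM3-6B-64k, as quoted in the results above.
print(delta(6.21, 3.73))  # LongAlign-10k vs LongAlign-0k
print(delta(6.21, 5.76))  # packing + loss weighting vs packing
print(delta(6.21, 5.87))  # packing + loss weighting vs naive batching
```

Relative to naïve batching, the gain is +0.34 (about 5.8%), smaller than the +0.45 measured against plain packing, because packing alone scores slightly below naïve batching.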

What To Try In 7 Days

Create 1k–5k long instruction examples from your domain (mix sources).

Implement packing or sorted batching to speed SFT; measure GPU idle time.

Add per-sequence loss scaling when packing to avoid bias toward long examples and test long-task accuracy.
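
The per-sequence loss scaling in the last step can be illustrated with a toy calculation. This is a sketch under assumptions, not the released training code: the function names are hypothetical, and the scaling factor K / (N_i * M) (K packs, M sequences in the batch, N_i target tokens in sequence i) is one choice that makes the packed loss equal the ordinary per-sequence average.

```python
from typing import List


def per_sequence_mean_loss(token_losses: List[List[float]]) -> float:
    """Reference (unpacked) objective: average of each sequence's mean target-token loss."""
    return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)


def packed_loss(packs: List[List[List[float]]], weighted: bool) -> float:
    """Loss averaged over packs; packs[k][i] holds target-token losses of sequence i in pack k.

    weighted=False: plain token mean inside each pack (sequences with more
                    target tokens, or alone in a pack, count for more).
    weighted=True:  every token of sequence i is scaled by K / (N_i * M) and
                    token losses are summed inside the pack (assumed scaling).
    """
    K = len(packs)                        # number of packs in the batch
    M = sum(len(pack) for pack in packs)  # number of sequences in the batch
    total = 0.0
    for pack in packs:
        if weighted:
            total += sum((K / (len(seq) * M)) * sum(seq) for seq in pack)
        else:
            tokens = [t for seq in pack for t in seq]
            total += sum(tokens) / len(tokens)
    return total / K


if __name__ == "__main__":
    # Hypothetical batch: pack 0 holds one sequence with 4 target tokens,
    # pack 1 holds two sequences with 1 and 3 target tokens.
    packs = [[[1.0, 1.0, 1.0, 1.0]], [[3.0], [1.0, 1.0, 1.0]]]
    sequences = [seq for pack in packs for seq in pack]
    print(per_sequence_mean_loss(sequences))
    print(packed_loss(packs, weighted=False))
    print(packed_loss(packs, weighted=True))
```

On this toy batch the unweighted packed loss is 1.25, while the per-sequence average is about 1.67; the weighted version recovers the per-sequence average, so each sequence contributes equally again.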

Agent Features

Memory
Extended input-context memory (up to 64k/128k tokens)
Frameworks
Packing training, Sorted batching
Architectures
Long-context transformer (RoPE scaling)

Optimization Features

Token Efficiency
Use ChatGLM tokenizer for denser Chinese compression (dataset measured by that tokenizer)
Infra Optimization
DeepSpeed + ZeRO3 + CPU offloading (8xA800 80G GPUs tested)
System Optimization
Block-diagonal attention over packed sequences via FlashAttention-2's variable-length interface (cu_seqlens)
Training Optimization
Packing (concat sequences into packs), Sorted batching (group by length), Loss weighting for packing to balance per-sequence loss
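
To make the block-diagonal attention idea concrete, the sketch below computes cumulative sequence boundaries for one pack and the attention mask they imply. It is plain Python for illustration only: the helper names are hypothetical, no FlashAttention call is made, and a fused kernel would consume the boundaries directly rather than materialize a mask.

```python
from itertools import accumulate
from typing import List


def cu_seqlens(lengths: List[int]) -> List[int]:
    """Cumulative boundaries of the sequences in one pack: [0, n0, n0+n1, ...]."""
    return [0] + list(accumulate(lengths))


def block_diagonal_causal_mask(lengths: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when position q may attend to position k.

    Attention is causal (k <= q) and restricted to the same sequence, so
    tokens never see a neighbouring sequence in the same pack.
    """
    bounds = cu_seqlens(lengths)
    total = bounds[-1]
    seq_id = [0] * total
    for s, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        for pos in range(lo, hi):
            seq_id[pos] = s  # which sequence each packed position belongs to
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    print(cu_seqlens([3, 2, 4]))
    for row in block_diagonal_causal_mask([2, 1]):
        print(["x" if allowed else "." for allowed in row])
```

Passing boundaries instead of a dense mask is what keeps packing cheap at 64k tokens: a dense mask grows quadratically with pack length, while the boundary list grows only with the number of sequences.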

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

Data focuses on QA, summarization, reasoning; lacks multi-turn lifelong dialogues and other long-task types (Sec E).

Experiments restricted to SFT on models up to 13B and context window mainly up to 64k due to resource/framework limits.

When Not To Use

When the required task is multi-turn lifelong dialogue or other long-task types not covered by the dataset.

If you cannot modify the training pipeline to support packing or the FlashAttention-2 APIs.

Failure Modes

Packing without loss weighting over-weights long sequences and sequences with many target tokens, which biases training and hurts long-task performance.

Evaluator bias: GPT-4 scoring, while correlated with humans, may still miss nuanced preferences or favor certain styles.

Core Entities

Models

ChatGLM3-6B-64k, LongAlign-6B-64k, LongAlign-7B-64k, LongAlign-13B-64k, Llama-2-7B-64k, Llama-2-13B-64k

Metrics

GPT-4 rating (1–10), Spearman rho, Kendall tau, Normalized 0–100 for some tasks

Datasets

SFT, LongAlpaca-12k, ShareGPT, LongBench-Chat, LongBench

Benchmarks

LongBench-Chat, LongBench, Needle in a Haystack

Context Entities

Models

GPT-4-1106-preview, Claude-2.1, GLM-4-128k, Vicuna-7b-16k, Mixtral-8x7b

Metrics

ROUGE / F1 (not used directly for aligned models here), GPT-4 scoring averages

Datasets

ArXiv, Books3, C4, CLUECorpus2020, CommonCrawl, GitHub, Stack Exchange, Wikipedia, WuDaoCorpora

Benchmarks

MT-Bench, MMLU, ARC, HellaSwag, TruthfulQA