A practical recipe (data + training + benchmark) to finetune LLMs to read and follow instructions on 8k–64k+ contexts

Overview

Decision Snapshot: Needs Validation

The recipe is practical and validated across models and tasks; experiments show consistent gains, but large-scale limits and dataset breadth remain partially explored.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

If you need models to read and act on long documents (reports, codebases, books), adding a few thousand diverse long instruction examples and using packing + loss weighting cuts training time and materially improves task performance without hurting short-context skills.

Who Should Care

ML Engineer, Engineering Lead, CTO, Product Manager

Summary TLDR

LongAlign is a pragmatic recipe for making LLMs follow instructions on long inputs. The team builds 10k supervised long-instruction examples (8k–64k tokens) from nine sources, proposes packing and sorted batching to speed up supervised fine-tuning, and introduces LongBench-Chat, a human-checked benchmark of queries 10k–100k tokens long, scored by GPT-4. Packing combined with the proposed loss weighting corrects a training bias and improves long-task accuracy by about 10%; a larger and more diverse set of long data yields gains of up to about 30% on the evaluated long tasks. Code, data, and models are open-sourced.

Problem Statement

Current long-context work focuses on extending architecture and positional encodings but lacks practical instruction-following finetuning data, efficient multi-GPU training methods for long varied-length examples, and a reliable benchmark to evaluate instruction-following on very long inputs.

Main Contribution

A diverse long instruction-following dataset: 10k generated SFT instances from 9 long-text sources covering 8k–64k token lengths (10% Chinese).

Training recipes for efficient supervised finetuning: packing and sorted batching plus a loss-weighting fix for packing that balances per-sequence loss contributions.
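The bias that the loss-weighting fix targets can be seen on a toy example. The sketch below is a minimal illustration, not the authors' implementation: it assumes the naive packed loss is a plain mean over all target tokens in a pack, and the function names and toy loss values are hypothetical.

```python
def packed_token_mean_loss(packs):
    """Naive packed loss: mean over all target tokens in a pack, then mean over packs.

    `packs` is a list of packs; each pack is a list of sequences; each sequence
    is a list of per-target-token losses (toy values, assumed for illustration).
    """
    pack_means = []
    for pack in packs:
        token_losses = [loss for seq in pack for loss in seq]
        pack_means.append(sum(token_losses) / len(token_losses))
    return sum(pack_means) / len(pack_means)


def per_sequence_weighted_loss(packs):
    """Weighted loss: scale each sequence by 1 / (its target-token count),
    so every sequence contributes equally regardless of its length."""
    sequences = [seq for pack in packs for seq in pack]
    seq_means = [sum(seq) / len(seq) for seq in sequences]
    return sum(seq_means) / len(seq_means)


# One pack holding a short answer (2 target tokens) and a long one (8 target tokens).
toy_packs = [[[1.0, 1.0], [3.0] * 8]]

naive = packed_token_mean_loss(toy_packs)         # dominated by the 8-token sequence
weighted = per_sequence_weighted_loss(toy_packs)  # both sequences count equally
print(f"naive={naive:.2f} weighted={weighted:.2f}")  # naive=2.60 weighted=2.00
```

In the toy pack, the naive loss sits close to the long sequence's loss, while the weighted variant is the plain average of the two per-sequence means.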

Key Findings

More long instruction data materially improves long-context instruction performance.

Numbers: LongBench-Chat: 3.73 (0k) → 6.21 (10k) average score

Practical Use: Add several thousand diverse long examples (not just a short-instruction mix) to SFT to see large improvements on long tasks.

Evidence Ref: Table 2; Sec 4.2

Diversity of long data helps instruction-following beyond raw volume.

Numbers: LongAlign-10k scores 6.21 on LongBench-Chat versus 4.46 for LongAlpaca-12k, and also outperforms it on MT-Bench (Table 2)

Practical Use: Prefer varied long-text sources (books, papers, code, Wikipedia) over narrow sources when creating long SFT data.

Evidence Ref: Sec 4.2; Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LongBench-Chat (ChatGLM3-6B-64k)	LongAlign-0k: 3.73; LongAlign-5k: 5.97; LongAlign-10k: 6.21; LongAlpaca-12k: 4.46	LongAlign-0k (no long SFT)	10k vs 0k: +2.48 absolute (≈+66% relative on score scale)	LongBench-Chat	Table 2; Sec 4.2	Table 2
Effect of packing + loss weighting (ChatGLM3-6B-64k)	Naïve: 5.87 → Packing: 5.76 → Packing+loss weighting: 6.21	Naïve batching 5.87	Packing+loss weighting vs packing: +0.45 (≈+7.8%)	LongBench-Chat	Table 3; Sec 4.3	Table 3
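The deltas in the table can be reproduced with simple arithmetic. The snippet below is a small worked check using only the scores listed above; the helper name `deltas` is hypothetical. The comparison against naive batching is not stated in the table and is derived here from the same scores.

```python
def deltas(new, baseline):
    """Return (absolute change, relative change in percent) of `new` over `baseline`."""
    absolute = new - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# Scores copied from the Results table (LongBench-Chat, ChatGLM3-6B-64k).
abs_data, rel_data = deltas(6.21, 3.73)    # LongAlign-10k vs LongAlign-0k
abs_pack, rel_pack = deltas(6.21, 5.76)    # packing + loss weighting vs packing
abs_naive, rel_naive = deltas(6.21, 5.87)  # packing + loss weighting vs naive batching

for label, a, r in [
    ("10k vs 0k", abs_data, rel_data),
    ("weighted packing vs packing", abs_pack, rel_pack),
    ("weighted packing vs naive", abs_naive, rel_naive),
]:
    print(f"{label}: {a:+.2f} absolute, {r:+.1f}% relative")
```

Note that the relative figures are percentages of a 1–10 rating, so they are best read alongside the absolute differences.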

What To Try In 7 Days

Create 1k–5k long instruction examples from your domain (mix sources).

Implement packing or sorted batching to speed SFT; measure GPU idle time.

Add per-sequence loss scaling when packing to avoid bias toward long examples and test long-task accuracy.
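For the second step, a quick way to estimate what sorted batching could buy is to compare wasted token slots under both batching schemes. The sketch below is a rough proxy under stated assumptions: it treats padding to the longest example in a batch as a stand-in for GPU idle time, uses a hypothetical uniform length distribution over 8k–64k tokens, and shuffles batch order after sorting (a common practice assumed here, not taken from the source).

```python
import random


def make_batches(lengths, batch_size, sort_by_length, rng):
    """Group example lengths into fixed-size batches, optionally sorting by length first."""
    order = sorted(lengths) if sort_by_length else list(lengths)
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    rng.shuffle(batches)  # randomize batch order so training is not length-ordered
    return batches


def padding_fraction(batches):
    """Fraction of token slots wasted if each batch is padded to its longest example."""
    padded = sum(max(batch) * len(batch) for batch in batches)
    real = sum(sum(batch) for batch in batches)
    return 1.0 - real / padded


rng = random.Random(0)
# Hypothetical example lengths in the 8k-64k token range.
lengths = [rng.randint(8_000, 64_000) for _ in range(256)]

waste_naive = padding_fraction(make_batches(lengths, 8, sort_by_length=False, rng=rng))
waste_sorted = padding_fraction(make_batches(lengths, 8, sort_by_length=True, rng=rng))
print(f"padding waste: naive={waste_naive:.1%}, sorted={waste_sorted:.1%}")
```

Running the same comparison on your own length distribution gives a first estimate before touching the training loop; actual GPU idle time should still be measured with a profiler.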

Agent Features

Memory

Extended input-context memory (up to 64k/128k tokens)

Frameworks

Packing training, Sorted batching

Architectures

Long-context transformer (RoPE scaling)

Optimization Features

Token Efficiency

Use ChatGLM tokenizer for denser Chinese compression (dataset measured by that tokenizer)

Infra Optimization

DeepSpeed + ZeRO3 + CPU offloading (8xA800 80G GPUs tested)

System Optimization

FlashAttention-2 block-diagonal attention, with the boundaries of packed sequences passed via cu_seqlens
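To make the role of cu_seqlens concrete: it is the list of cumulative sequence boundaries inside a pack, which tells the attention kernel where one sequence ends and the next begins. The sketch below is a framework-free illustration with hypothetical helper names; converting the list to the integer tensor that FlashAttention-2's variable-length interface expects is left out.

```python
from bisect import bisect_right
from itertools import accumulate


def build_cu_seqlens(seq_lengths):
    """Cumulative boundaries [0, l1, l1+l2, ...] for sequences concatenated into one pack."""
    return [0] + list(accumulate(seq_lengths))


def owner(cu_seqlens, pos):
    """Index of the original sequence that contains token position `pos` in the pack."""
    return bisect_right(cu_seqlens, pos) - 1


def may_attend(cu_seqlens, query_pos, key_pos):
    """Block-diagonal causal rule: attend only within the same sequence, and only backwards."""
    same_block = owner(cu_seqlens, query_pos) == owner(cu_seqlens, key_pos)
    return same_block and key_pos <= query_pos


cu = build_cu_seqlens([3, 5, 2])
print(cu)                     # [0, 3, 8, 10]
print(may_attend(cu, 2, 0))   # True: both tokens are in the first sequence
print(may_attend(cu, 3, 2))   # False: position 3 starts the second sequence
```

The block-diagonal mask prevents tokens from one packed example from attending to a different example that happens to share the same pack.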

Training Optimization

Packing (concatenate sequences into packs)

Sorted batching (group by length)

Loss weighting for packing to balance per-sequence loss
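Packing can be sketched as a bin-packing problem over example lengths. The code below uses first-fit decreasing, which is one simple heuristic chosen here for illustration and is not claimed to be the authors' procedure; the function name, lengths, and the 64k limit are assumptions.

```python
def pack_sequences(lengths, max_pack_len):
    """Greedy first-fit-decreasing packing.

    Returns a list of packs, each a list of example indices whose total
    length does not exceed `max_pack_len`.
    """
    packs = []      # example indices per pack
    free = []       # remaining token budget per pack
    by_length = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in by_length:
        n = lengths[idx]
        if n > max_pack_len:
            raise ValueError(f"example {idx} ({n} tokens) exceeds the pack limit")
        for p, budget in enumerate(free):
            if n <= budget:          # first pack with enough room
                packs[p].append(idx)
                free[p] -= n
                break
        else:                        # no existing pack fits: open a new one
            packs.append([idx])
            free.append(max_pack_len - n)
    return packs


example_lengths = [60_000, 8_000, 30_000, 20_000, 12_000]
packs = pack_sequences(example_lengths, max_pack_len=64_000)
print(packs)  # [[0], [2, 3, 4], [1]]
```

Each pack then becomes one training row, with sequence boundaries tracked separately so attention and loss can still be computed per sequence.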

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/THUDM/LongAlign

Data URLs

https://github.com/THUDM/LongAlign

Risks & Boundaries

Limitations

Data focuses on QA, summarization, reasoning; lacks multi-turn lifelong dialogues and other long-task types (Sec E).

Experiments restricted to SFT on models up to 13B and context window mainly up to 64k due to resource/framework limits.

When Not To Use

When the required task is multi-turn lifelong dialogue or other long-task types not covered by the dataset.

If you cannot modify the training pipeline to support packing or the FlashAttention-2 APIs.

Failure Modes

Packing without loss weighting over-weights long sequences and sequences with more target tokens, which biases training.

Evaluator bias: GPT-4 scoring, while correlated with humans, may still miss nuanced preferences or favor certain styles.
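One way to monitor evaluator bias on your own data is to rate a small sample by hand and check rank agreement with the GPT-4 scores. The sketch below computes Kendall's tau (the tau-a variant, where ties count as neither concordant nor discordant) in plain Python; the scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long score lists with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical 1-10 ratings for five responses (placeholders for illustration).
human_scores = [8, 3, 6, 9, 5]
gpt4_scores = [7, 4, 6, 9, 3]

tau = kendall_tau(human_scores, gpt4_scores)
print(f"Kendall tau = {tau:.2f}")  # Kendall tau = 0.80
```

A low or falling tau on a fresh sample would suggest that the automatic scores are drifting from human preferences for that task type.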

Core Entities

Models

ChatGLM3-6B-64k, LongAlign-6B-64k, LongAlign-7B-64k, LongAlign-13B-64k, Llama-2-7B-64k, Llama-2-13B-64k

Metrics

GPT-4 rating (1–10), Spearman rho, Kendall tau, Normalized 0–100 for some tasks

Datasets

SFT, LongAlpaca-12k, ShareGPT, LongBench-Chat, LongBench

Benchmarks

LongBench-Chat, LongBench, Needle in a Haystack

Context Entities

Models

GPT-4-1106-preview, Claude-2.1, GLM-4-128k, Vicuna-7b-16k, Mixtral-8x7b

Metrics

ROUGE / F1 (not used directly for aligned models here), GPT-4 scoring averages

Datasets

ArXiv, Books3, C4, CLUECorpus2020, CommonCrawl, GitHub, Stack Exchange, Wikipedia, WuDaoCorpora

Benchmarks

MT-Bench, MMLU, ARC, HellaSwag, TruthfulQA

