A practical recipe (data + training + benchmark) to finetune LLMs to read and follow instructions on 8k–64k+ contexts

January 31, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li

Links

Abstract / PDF

Why It Matters For Business

If you need models to read and act on long documents (reports, codebases, books), adding a few thousand diverse long instruction examples and using packing + loss weighting cuts training time and materially improves task performance without hurting short-context skills.

Summary TLDR

LongAlign is a pragmatic recipe for making LLMs follow instructions on long inputs. The team builds 10k supervised long-instruction examples (8k–64k tokens) from nine sources, proposes packing and sorted batching to speed up supervised fine-tuning, and introduces LongBench-Chat, a human-checked benchmark of queries over 10k–100k-token contexts scored by GPT-4. Packing combined with a proposed loss-weighting scheme fixes a training bias and improves long-task accuracy by roughly 10%; a larger and more diverse set of long data yields gains of up to ~30% on the evaluated long tasks. Code, data, and models are open-sourced.

Problem Statement

Current long-context work focuses on extending architecture and positional encodings but lacks practical instruction-following finetuning data, efficient multi-GPU training methods for long varied-length examples, and a reliable benchmark to evaluate instruction-following on very long inputs.

Main Contribution

A diverse long instruction-following dataset: 10k generated SFT instances from 9 long-text sources covering 8k–64k token lengths (10% Chinese).

Training recipes for efficient supervised finetuning: packing and sorted batching plus a loss-weighting fix for packing that balances per-sequence loss contributions.

LongBench-Chat: a 50-question, human-annotated benchmark (10k–100k-token contexts) scored by GPT-4 with few-shot examples and validated against human judgments.

Empirical study: shows that data quantity, data diversity, and training choices materially affect long-context instruction performance, and that the recipe scales to 13B models and 128k context in experiments.

Key Findings

More long instruction data materially improves long-context instruction performance.

Numbers: LongBench-Chat average score 3.73 (0k) → 6.21 (10k)

Diversity of long data helps instruction-following beyond raw volume.

Numbers: LongAlign-10k outperforms LongAlpaca-12k on LongBench-Chat and MT-Bench (examples in Table 2)

Packing and sorted batching roughly double training throughput versus naïve batching.

Numbers: Training speedup of more than 100%, i.e. less than half the training time of naïve batching (Fig. 5)

Packing causes loss-weighting bias; correcting it improves long-task accuracy by ~10%.

Numbers: Packing + loss weighting improves LongBench-Chat from 5.76 → 6.21 on ChatGLM3-6B (~7.8%); comparable 5–10% gains are also noted

GPT-4 with few-shot scoring aligns well with human raters on LongBench-Chat.

Numbers: Spearman ρ (GPT-4 + few-shot) = 0.844 versus human inter-annotator ρ = 0.817; average score difference ≤ 0.1
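For readers who want to run the same kind of agreement check on their own judge, Spearman's ρ can be computed as the Pearson correlation of ranks. The sketch below is a minimal, dependency-free illustration; the two score lists are invented toy values and are not LongBench-Chat data, and the helper names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (made up for illustration): one human rater vs. an LLM judge.
human_scores = [1, 2, 3, 4, 5]
judge_scores = [2, 1, 4, 3, 5]
rho = spearman(human_scores, judge_scores)
```

Comparing the judge-vs-human ρ against the human-vs-human ρ, as the finding above does, gives a rough sense of whether the automatic judge is about as consistent as a second annotator.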

Method scales to larger models and longer windows in experiments.

Numbers: Llama-2-13B-64k scores 6.79–7.02 on LongBench-Chat versus ~6.1 for the 7B model (~10% gain)

Results

LongBench-Chat (ChatGLM3-6B-64k)

Value: LongAlign-0k: 3.73; LongAlign-5k: 5.97; LongAlign-10k: 6.21; LongAlpaca-12k: 4.46

Baseline: LongAlign-0k (no long SFT)

Effect of packing + loss weighting (ChatGLM3-6B-64k)

Value: Naïve: 5.87 → Packing: 5.76 → Packing + loss weighting: 6.21

Baseline: Naïve batching 5.87

Scaling to 13B (LongBench-Chat)

Value: Llama-2-7B-64k (packing + loss weighting): 6.10; Llama-2-13B-64k (packing + loss weighting): 6.79

Baseline: 7B model 6.10
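The percentage gains quoted in this summary are relative changes in the 1–10 LongBench-Chat score. As a quick sanity check, the snippet below recomputes them from the values reported above; `relative_gain` is a hypothetical helper written only for this illustration.

```python
def relative_gain(baseline: float, value: float) -> float:
    """Relative change of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Scores copied from the Results entries above.
data_gain = relative_gain(3.73, 6.21)       # LongAlign-0k -> LongAlign-10k
weighting_gain = relative_gain(5.76, 6.21)  # packing -> packing + loss weighting
scaling_gain = relative_gain(6.10, 6.79)    # Llama-2-7B-64k -> Llama-2-13B-64k

for name, gain in [
    ("long data (0k -> 10k)", data_gain),
    ("loss weighting", weighting_gain),
    ("7B -> 13B", scaling_gain),
]:
    print(f"{name}: {gain:.1f}%")
```

Note that the loss-weighting gain is measured against plain packing (5.76); against naïve batching (5.87) the same 6.21 score is a smaller improvement.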

Who Should Care

What To Try In 7 Days

Create 1k–5k long instruction examples from your domain (mix sources).

Implement packing or sorted batching to speed SFT; measure GPU idle time.

Add per-sequence loss scaling when packing to avoid bias toward long examples and test long-task accuracy.
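The per-sequence loss scaling in the last step can be illustrated with a small, framework-free sketch. It assumes per-target-token losses are already available and that the unweighted packed loss averages over all target tokens in a pack; the function names are hypothetical and not taken from the LongAlign code.

```python
from typing import List

# packs[p][s] holds the per-target-token losses of sequence s inside pack p.
Packs = List[List[List[float]]]


def token_mean_loss(packs: Packs) -> float:
    """Average each pack over all its target tokens, then average over packs.

    Sequences with more target tokens dominate their pack's average.
    """
    pack_means = []
    for pack in packs:
        tokens = [loss for seq in pack for loss in seq]
        pack_means.append(sum(tokens) / len(tokens))
    return sum(pack_means) / len(pack_means)


def sequence_mean_loss(packs: Packs) -> float:
    """Average each sequence over its own target tokens, then over sequences.

    Every sequence contributes equally, regardless of length or pack.
    """
    seq_means = [sum(seq) / len(seq) for pack in packs for seq in pack]
    return sum(seq_means) / len(seq_means)


# Toy batch: a 4-target-token and a 1-target-token sequence share pack 0.
packs = [
    [[2.0, 2.0, 2.0, 2.0], [0.0]],
    [[1.0, 1.0]],
]
unweighted = token_mean_loss(packs)   # pulled toward the long sequence
weighted = sequence_mean_loss(packs)  # each of the 3 sequences counts once
```

In a real training loop the same effect would usually be obtained by multiplying each token's loss by a per-sequence weight before summing, so that the packed forward pass stays unchanged.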

Agent Features

Memory

  • Extended input-context memory (up to 64k/128k tokens)

Frameworks

  • Packing training
  • Sorted batching
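Sorted batching can be sketched in a few lines: sort example indices by length, slice them into batches, and shuffle the order of the batches. This is a minimal illustration under assumed names (`sorted_batches`, `padding_tokens`) and invented lengths, not the paper's implementation.

```python
import random
from typing import List


def sorted_batches(lengths: List[int], batch_size: int, seed: int = 0) -> List[List[int]]:
    """Group example indices so each batch holds examples of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle the order of batches, not their contents.
    random.Random(seed).shuffle(batches)
    return batches


def padding_tokens(lengths: List[int], batches: List[List[int]]) -> int:
    """Pad tokens needed when every batch is padded to its longest example."""
    return sum(
        max(lengths[i] for i in batch) * len(batch) - sum(lengths[i] for i in batch)
        for batch in batches
    )


# Invented token lengths mixing short and long examples.
lengths = [100, 64000, 120, 60000, 8000, 9000]
naive = [[0, 1], [2, 3], [4, 5]]  # batches in dataset order
grouped = sorted_batches(lengths, batch_size=2)

wasted_naive = padding_tokens(lengths, naive)
wasted_sorted = padding_tokens(lengths, grouped)
```

The padding count is only a proxy for idle compute. One trade-off worth keeping in mind is that each batch now contains only similar-length examples, so individual batches are less representative of the full length distribution.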

Architectures

  • Long-context transformer (RoPE scaling)

Optimization Features

Token Efficiency

  • Use ChatGLM tokenizer for denser Chinese compression (dataset measured by that tokenizer)

Infra Optimization

  • DeepSpeed + ZeRO3 + CPU offloading (8xA800 80G GPUs tested)

System Optimization

  • FlashAttention-2 variable-length attention, with the block-diagonal mask for packed sequences expressed through cu_seqlens
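FlashAttention-2's variable-length interface takes cumulative sequence lengths (cu_seqlens) rather than a dense mask. The sketch below shows how those offsets are derived from the lengths in a pack, and builds the equivalent dense mask purely for illustration. The causal (k ≤ q) constraint is an assumption reflecting decoder-only training, and the helper names are hypothetical.

```python
from itertools import accumulate
from typing import List


def cu_seqlens(seq_lens: List[int]) -> List[int]:
    """Cumulative offsets: sequence i occupies tokens [cu[i], cu[i+1]) of the pack."""
    return [0] + list(accumulate(seq_lens))


def block_diagonal_causal_mask(seq_lens: List[int]) -> List[List[bool]]:
    """Dense equivalent of the packed mask (illustration only, O(n^2) memory).

    Query token q may attend to key token k only if both belong to the same
    original sequence and k is not after q.
    """
    cu = cu_seqlens(seq_lens)
    total = cu[-1]
    seq_id = [0] * total
    for s in range(len(seq_lens)):
        for t in range(cu[s], cu[s + 1]):
            seq_id[t] = s
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


offsets = cu_seqlens([3, 2, 4])
mask = block_diagonal_causal_mask([2, 1])
```

Passing offsets instead of a dense mask is what keeps packing practical at 64k tokens, since a dense boolean mask grows quadratically with the pack length.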

Training Optimization

  • Packing (concat sequences into packs)
  • Sorted batching (group by length)
  • Loss weighting for packing to balance per-sequence loss
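The packing step listed above can be sketched as a simple greedy pass that concatenates sequences until the next one would exceed the maximum length. This is a minimal illustration, assuming sequences are already truncated to the context window; `pack_sequences` and the example lengths are hypothetical, and the released code may pack differently.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedily group sequence indices into packs of at most max_len tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    used = 0
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sequence {idx} has {n} tokens, more than max_len={max_len}")
        # Close the current pack if this sequence would not fit.
        if current and used + n > max_len:
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        packs.append(current)
    return packs


# Invented token lengths packed into a 64k window.
lengths = [30000, 20000, 10000, 60000, 4000]
packs = pack_sequences(lengths, max_len=65536)
```

Each pack is then fed as one training row, which is why the attention mask and the loss weighting both need to know where the original sequence boundaries fall.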

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Data focuses on QA, summarization, reasoning; lacks multi-turn lifelong dialogues and other long-task types (Sec E).
  • Experiments restricted to SFT on models up to 13B and context window mainly up to 64k due to resource/framework limits.
  • LongBench-Chat is 50 questions; while GPT-4 scoring correlates well with humans, the benchmark size is modest.

When Not To Use

  • When the required task is multi-turn lifelong dialogue or other long-task types not covered by the dataset.
  • If you cannot modify training pipeline to support packing or FlashAttention-2 APIs.
  • When you need RLHF-style preference alignment — this recipe focuses on supervised finetuning.

Failure Modes

  • Packing without loss weighting over-weights longer sequences and sequences with more target tokens, biasing training.
  • Evaluator bias: GPT-4 scoring, while correlated with humans, may still miss nuanced preferences or favor certain styles.
  • Dataset licensing: some sources (Books3) have unclear distribution rights; check legal use before deployment.

Core Entities

Models

  • ChatGLM3-6B-64k
  • LongAlign-6B-64k
  • LongAlign-7B-64k
  • LongAlign-13B-64k
  • Llama-2-7B-64k
  • Llama-2-13B-64k

Metrics

  • GPT-4 rating (1–10)
  • Spearman rho
  • Kendall tau
  • Normalized 0–100 for some tasks

Datasets

  • SFT
  • LongAlpaca-12k
  • ShareGPT
  • LongBench-Chat
  • LongBench

Benchmarks

  • LongBench-Chat
  • LongBench
  • Needle in a Haystack

Context Entities

Models

  • GPT-4-1106-preview
  • Claude-2.1
  • GLM-4-128k
  • Vicuna-7b-16k
  • Mixtral-8x7b

Metrics

  • ROUGE / F1 (not used directly for aligned models here)
  • GPT-4 scoring averages

Datasets

  • ArXiv
  • Books3
  • C4
  • CLUECorpus2020
  • CommonCrawl
  • GitHub
  • Stack Exchange
  • Wikipedia
  • WuDaoCorpora

Benchmarks

  • MT-Bench
  • MMLU
  • ARC
  • HellaSwag
  • TruthfulQA