XGen-7B: an open 7B LLM trained up to 8K context (1.5T tokens) with instruction-tuned releases

September 7, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

4

Authors

Erik Nijkamp, Tian Xie, Hiroaki Hayashi, Bo Pang, Congying Xia, Chen Xing, Jesse Vig, Semih Yavuz, Philippe Laban, Ben Krause, Senthil Purushwalkam, Tong Niu, Wojciech Kryściński, Lidiya Murakhovs'ka, Prafulla Kumar Choubey, Alex Fabbri, Ye Liu, Rui Meng, Lifu Tu, Meghana Bhat, Chien-Sheng Wu, Silvio Savarese, Yingbo Zhou, Shafiq Joty, Caiming Xiong

Links

Abstract / PDF

Why It Matters For Business

XGen-7B gives teams a practical, open 7B model that handles long documents (up to 8K tokens) and follows instructions competitively. It lowers cost versus much larger closed models while keeping good accuracy.

Summary TLDR

Salesforce trains and releases XGen-7B, a family of 7-billion-parameter models trained stage-wise to handle up to 8K input tokens. The base models saw up to 1.5 trillion training tokens (800B at 2K, 400B at 4K, 300B at 8K context). Two instruction-tuned variants are released: one tuned on WizardLM data and one on a mix of general public datasets. Evaluations show XGen-7B matches or beats other open-source 7B models on standard benchmarks and shows clear gains on long-context tasks. Models and code are open-sourced.

Problem Statement

Most competitive open-source LLMs are trained with a 2K token limit. That prevents reliable handling of long documents (summaries, long code, transcripts). The paper aims to produce a practical, open 7B model that reliably uses context up to 8K tokens and remains efficient to serve.

Main Contribution

Train a 7B dense-attention LLM (XGen-7B) with stage-wise context growth: 2K → 4K → 8K tokens.

Scale the total token budget to 1.5T tokens for the 8K model (800B at 2K, 400B at 4K, 300B at 8K).

Provide two instruction-tuned releases: XGen-7B-Inst (WizardLM) and XGen-7B-Inst (general public datasets).

Open-source model weights and training code (JaxFormer) for community use.

Empirically analyze training stability choices (RMS-Norm, swish-GLU, sequential attention) and report no catastrophic loss spikes with their recipe.

Key Findings

Stage-wise training yields an 8K-capable model that uses long context.

Numbers: 800B@2K + 400B@4K + 300B@8K = 1.5T tokens

XGen-7B matches or slightly improves over LLaMA-7B on standard benchmarks.

Numbers: MMLU five-shot weighted average: XGen 36.3 vs LLaMA 35.1

Instruction-tuned XGen (WizardLM data) is strong on instruction benchmarks judged by GPT-4.

Numbers: AlpacaEval win rate vs text-davinci-003: 68.8%

Long-context models improve long-document tasks compared to 2K models.

Numbers: Long-form QA average (GPT-4 eval): XGen-7B-Inst general scores 2.74 (highest in the comparison)

Code generation performance is competitive for a 7B model.

Numbers: HumanEval pass@1: XGen-7B = 14.20

Training carbon footprint is reported.

Numbers: Estimated carbon: 4.5 tCO2e

Results

MMLU five-shot weighted average

Value: 36.3%

Baseline: LLaMA-7B 35.1%

MMLU zero-shot weighted average

Value: 32.1%

Baseline: LLaMA-7B 32.0%

HumanEval pass@1

Value: 14.20

Baseline: MPT-7B 15.90
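
The source does not say how pass@1 was computed. A common choice for HumanEval-style scoring is the unbiased pass@k estimator, sketched here under that assumption: `n` is the number of samples drawn per problem and `c` the number that pass the unit tests. The function names and the per-problem counts in the example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem results: (samples drawn, samples that passed).
example = [(10, 0), (10, 1), (10, 5)]
score = mean_pass_at_k(example, k=1)
```

For k=1 the estimator reduces to the fraction of passing samples per problem, averaged over problems.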

AlpacaEval win rate vs text-davinci-003

Value: 68.8%

Baseline: text-davinci-003 50%

MT-Bench single answer score (GPT-4 grader)

Value: 5.69

Baseline: GPT-4 8.99

Long-form QA average score (GPT-4 eval)

Value: 2.74

Baseline: Vicuna-7B-v1.3 2.66

ROUGE-1 on long dialogue summarization (AMI)

Value: 31.34

Baseline: Vicuna-7B-v1.3 14.23
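
The source reports ROUGE-1 without describing the scoring setup. As a rough illustration of what the metric measures, the sketch below computes unigram-overlap precision, recall, and F1 between a candidate and a reference summary. It assumes lowercased whitespace tokenization with no stemming, which is simpler than standard ROUGE tooling; the function name and example sentences are hypothetical.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1(
    "the team agreed on the budget",
    "the team approved the budget",
)
```

Scores from this sketch lie in [0, 1]; scores like those listed above are typically shown multiplied by 100.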

What To Try In 7 Days

Run XGen-7B-8K on your long-doc tasks (documents >2K tokens) and compare summaries/QA against your 2K models.

Replace a 13–40B inference endpoint with XGen-7B-Inst for inexpensive dev/testing of assistant flows.

Fine-tune or prompt-engineer XGen-7B-Inst wizardLM on a small in-domain instruction set for rapid assistant prototyping.

Optimization Features

Token Efficiency

  • Concatenate documents with <|endoftext|> as the separator and exclude documents shorter than 100 tokens
  • Shuffle and chunk to control distribution shifts across sequence-length stages
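
The packing steps above can be sketched as follows. This is a minimal illustration, not the released pipeline: it assumes documents arrive as lists of tokens, that shuffling happens at the document level before concatenation, and that a trailing partial chunk is dropped. The function name and parameters are hypothetical.

```python
import random

EOT = "<|endoftext|>"  # separator token named in the source


def pack_documents(docs, seq_len, min_tokens=100, seed=0):
    """Filter short docs, shuffle, join with EOT, cut into fixed-length chunks."""
    # Drop documents shorter than the minimum length.
    kept = [d for d in docs if len(d) >= min_tokens]
    # Assumption: shuffle whole documents with a fixed seed for repeatability.
    rng = random.Random(seed)
    rng.shuffle(kept)
    # Build one long token stream, closing each document with EOT.
    stream = []
    for doc in kept:
        stream.extend(doc)
        stream.append(EOT)
    # Assumption: discard the final partial chunk instead of padding it.
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy token lists standing in for tokenizer output.
docs = [["a"] * 150, ["b"] * 50, ["c"] * 249]
chunks = pack_documents(docs, seq_len=100)
```

Running the same function with `seq_len` set to 2048, 4096, and 8192 over one shuffled corpus is one way to keep the data distribution comparable across stages.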

Model Optimization

  • Increase vocab to 51,200 tokens
  • Use RMS-Norm and swish-GLU for numerical stability
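
For readers unfamiliar with the two components named above, here is a small pure-Python sketch of RMS normalization and a swish-gated linear unit. The source only names the techniques; the epsilon value, the bias-free projections, and the weight layout (one row per output unit) are assumptions made for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def swish(v):
    """Swish (SiLU) activation: v * sigmoid(v)."""
    return v * sigmoid(v)


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swish_glu(x, w_gate, w_up, w_down):
    """Gated feed-forward: down(swish(gate(x)) * up(x)), without biases."""
    gate = [swish(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swish_glu([0.0, 1.0], identity, identity, identity)
```

Unlike LayerNorm, RMS normalization does not subtract the mean, so it needs only one reduction over the feature dimension.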

System Optimization

  • Keep numerics in FP32, except matrix multiplications, which run in BF16
  • Data and model parallelism optimized for TPU-v4
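
To give a feel for what BF16 matrix multiplication costs in precision, the sketch below emulates bfloat16 rounding with the standard library and applies it to the operands of a dot product. It is an emulation only: accumulating the products in higher precision is an assumption (Python floats stand in for the FP32 accumulator), and the helper names are hypothetical.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), as a Python float.

    NaN inputs are not handled specially in this sketch.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the top 16 bits; add a bias so truncation rounds to nearest even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def dot_bf16_inputs(a, b):
    """Dot product with operands rounded to bfloat16, accumulated in higher precision."""
    return sum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))


rounded_pi = to_bf16(3.14159)
result = dot_bf16_inputs([1.0 + 2 ** -9, 2.0], [1.0, 0.5])
```

bfloat16 keeps the float32 exponent range but only 8 bits of significand precision, which is why small perturbations such as `2 ** -9` next to 1.0 vanish from the operands.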

Training Optimization

  • Stage-wise sequence length schedule (2K→4K→8K)
  • 1.5T total token budget for 8K model
  • JAX implementation (JaxFormer) optimized for TPU-v4
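
The stage-wise schedule and token budget listed above can be checked with a few lines of arithmetic. The sketch assumes "2K", "4K", and "8K" mean 2048, 4096, and 8192 tokens, and the helper names are hypothetical.

```python
# (sequence length in tokens, token budget for that stage)
STAGES = [
    (2048, 800 * 10**9),
    (4096, 400 * 10**9),
    (8192, 300 * 10**9),
]


def schedule(stages):
    """Per-stage sequence counts and cumulative tokens seen."""
    rows, seen = [], 0
    for seq_len, tokens in stages:
        seen += tokens
        rows.append({
            "seq_len": seq_len,
            "tokens": tokens,
            "sequences": tokens // seq_len,
            "cumulative_tokens": seen,
        })
    return rows


def seq_len_at(tokens_seen, stages):
    """Sequence length in effect after `tokens_seen` training tokens."""
    boundary = 0
    for seq_len, tokens in stages:
        boundary += tokens
        if tokens_seen < boundary:
            return seq_len
    return stages[-1][0]


rows = schedule(STAGES)
total_tokens = rows[-1]["cumulative_tokens"]
```

Under these assumptions the 8K stage accounts for a fifth of the total tokens but far fewer sequences than the 2K stage, since each sequence is four times longer.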

Inference Optimization

  • Model kept at 7B for inference efficiency and potential mobile/16GB-GPU deployment

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Models still exhibit bias, toxicity and hallucinations like other LLMs (Section 7).
  • Quadratic attention cost makes long-sequence training and inference expensive.
  • Some evaluation relies on LLM-as-judge (GPT-4), which can bias results.
  • Tokenization: consecutive whitespace handling can break Python syntax in some generations.
  • Open-source release is partial; exact licensing and some data details are not fully specified.
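
The quadratic-cost limitation above can be made concrete with a small calculation. The sketch counts only the attention score matrix (one entry per query-key pair) and ignores projections and feed-forward layers, so it illustrates the scaling trend and is not a full cost model. The function names and the 2048-token baseline are assumptions.

```python
def attention_score_entries(seq_len: int, n_heads: int = 1) -> int:
    """Number of query-key score entries for one sequence under dense attention."""
    return n_heads * seq_len * seq_len


def relative_attention_cost(seq_len: int, base_len: int = 2048) -> float:
    """Score-matrix cost of one sequence relative to a base sequence length."""
    return (seq_len / base_len) ** 2


cost_4k = relative_attention_cost(4096)
cost_8k = relative_attention_cost(8192)
```

Per sequence, going from 2K to 8K multiplies this term by 16; per token it grows by 4, because the longer sequence also contains four times as many tokens.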

When Not To Use

  • When you need the highest possible single-shot assistant quality (GPT-4 still scores higher on MT-Bench).
  • For extreme context lengths beyond 8K tokens without further engineering.
  • When strict certified safety or full provenance of training data is required.

Failure Modes

  • Hallucinations on factual queries despite long context.
  • Training instability or degraded performance if the training recipe deviates from the RMS-Norm, swish-GLU, and sequential attention choices.
  • Code generation errors due to tokenization of consecutive whitespace.
  • Benchmark scores influenced by LLM-based judges and dataset overlap.

Core Entities

Models

  • XGen-7B-4K
  • XGen-7B-8K
  • XGen-7B-Inst wizardLM
  • XGen-7B-Inst general

Metrics

  • Accuracy
  • pass@1
  • Win rate (AlpacaEval)
  • MT-Bench score
  • ROUGE-1/2/L
  • Perplexity

Datasets

  • Natural language mix (public corpora)
  • RedPajama GitHub subset
  • Apex code
  • Starcoder (BigCode) data
  • WizardLM-196K
  • ShareGPT
  • Baize
  • Dolly2
  • OpenAssistant (OAsst)
  • SCROLLS (long-sequence subsets)
  • HumanEval
  • MMLU
  • AlpacaEval
  • MT-Bench

Benchmarks

  • MMLU
  • HumanEval
  • AlpacaEval
  • MT-Bench
  • SCROLLS long-sequence
  • ROUGE (dialogue summarization)