Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
XGen-7B gives teams a practical, open 7B model that handles long documents (up to 8K tokens) and follows instructions competitively, at lower cost than much larger closed models while keeping good accuracy.
Summary TLDR
Salesforce trains and releases XGen-7B, a family of 7-billion-parameter models trained stage-wise to handle up to 8K input tokens. Key facts: base models saw up to 1.5 trillion training tokens (800B@2K, 400B@4K, 300B@8K). They release two instruction-tuned variants (WizardLM-based and a general public-data mix). Evaluations show XGen-7B matches or beats other open-source 7B models on standard benchmarks and shows clear gains on long-context tasks. Models and code are open-sourced.
Problem Statement
Most competitive open-source LLMs are trained with a 2K token limit. That prevents reliable handling of long documents (summaries, long code, transcripts). The paper aims to produce a practical, open 7B model that reliably uses context up to 8K tokens and remains efficient to serve.
Main Contribution
Train a 7B dense-attention LLM (XGen-7B) with stage-wise context growth: 2K → 4K → 8K tokens.
Scale total token budget to 1.5T tokens (800B, 400B, 300B splits) for the 8K model.
Provide two instruction-tuned releases: XGen-7B-Inst (WizardLM) and XGen-7B-Inst (general public datasets).
Open-source model weights and training code (JaxFormer) for community use.
Empirically analyze training stability choices (RMS-Norm, swish-GLU, sequential attention) and report no catastrophic loss spikes with their recipe.
Key Findings
Stage-wise training yields an 8K-capable model that uses long context.
XGen-7B matches or slightly improves over LLaMA-7B on standard benchmarks.
Instruction-tuned XGen (WizardLM data) is strong on instruction benchmarks judged by GPT-4.
Long-context models improve long-document tasks compared to 2K models.
Code generation performance is competitive for a 7B model.
The paper reports the training energy footprint.
Results
MMLU five-shot weighted average
MMLU zero-shot weighted average
HumanEval pass@1
AlpacaEval win rate vs text-davinci-003
MT-Bench single answer score (GPT-4 grader)
Accuracy
ROUGE-1 on long dialogue summarization (AMI)
Who Should Care
What To Try In 7 Days
Run XGen-7B-8K on your long-doc tasks (documents >2K tokens) and compare summaries/QA against your 2K models.
Replace a 13–40B inference endpoint with XGen-7B-Inst for inexpensive dev/testing of assistant flows.
Fine-tune or prompt-engineer XGen-7B-Inst (WizardLM) on a small in-domain instruction set for rapid assistant prototyping.
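To set up the long-document comparison in the first item, it can help to split an evaluation set by token length before running anything. The sketch below is an illustration only: `bucket_by_length`, the bucket names, and the caller-supplied `count_tokens` function are hypothetical, and the limits assume "2K" and "8K" mean 2048 and 8192 tokens.

```python
from typing import Callable, Dict, List

SHORT_LIMIT = 2048  # assumed meaning of "2K" tokens
LONG_LIMIT = 8192   # assumed meaning of "8K" tokens


def bucket_by_length(
    docs: List[str], count_tokens: Callable[[str], int]
) -> Dict[str, List[str]]:
    """Group documents by whether they fit a 2K window, an 8K window, or neither."""
    buckets: Dict[str, List[str]] = {"fits_2k": [], "needs_8k": [], "too_long": []}
    for doc in docs:
        n = count_tokens(doc)
        if n <= SHORT_LIMIT:
            buckets["fits_2k"].append(doc)
        elif n <= LONG_LIMIT:
            buckets["needs_8k"].append(doc)  # where an 8K model should help most
        else:
            buckets["too_long"].append(doc)  # needs truncation or chunking
    return buckets


def whitespace_count(text: str) -> int:
    """Crude stand-in for a real tokenizer; swap in the model's own tokenizer."""
    return len(text.split())
```

In practice the limits should be lowered by the prompt template length and the number of tokens you expect the model to generate, since both share the same context window.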
Optimization Features
Token Efficiency
- Concatenate documents with <|endoftext|> and exclude docs <100 tokens
- Shuffle and chunk to control distribution shifts across sequence-length stages
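The concatenate-and-chunk step above can be sketched as follows. This is a minimal illustration, not the JaxFormer pipeline: `pack_documents` is a hypothetical helper, `EOT_ID` is a placeholder id for the `<|endoftext|>` token, and shuffling is left out for brevity.

```python
from typing import Iterable, Iterator, List

EOT_ID = 0            # hypothetical id for the <|endoftext|> separator token
MIN_DOC_TOKENS = 100  # documents shorter than this are dropped


def pack_documents(docs: Iterable[List[int]], seq_len: int) -> Iterator[List[int]]:
    """Concatenate tokenized documents and yield fixed-length training sequences."""
    buffer: List[int] = []
    for doc in docs:
        if len(doc) < MIN_DOC_TOKENS:
            continue  # skip very short documents
        buffer.extend(doc)
        buffer.append(EOT_ID)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Leftover tokens shorter than seq_len are discarded in this sketch.
```

Running the same function with `seq_len` set to 2048, 4096, and 8192 would produce the inputs for the three training stages, under the assumption that each stage repacks the corpus at its own length.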
Model Optimization
- Increase vocab to 51,200 tokens
- Use RMS-Norm and swish-GLU for numerical stability
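For readers unfamiliar with the two components named above, here is a small pure-Python sketch of RMS normalization and a swish-gated linear unit feed-forward block. It follows the commonly published definitions; the exact layer layout, epsilon, and weight shapes used in XGen-7B are not given in this summary, so every parameter name here is an assumption.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by the reciprocal of its root-mean-square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]


def swish(z: float) -> float:
    """Swish (SiLU): z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """Feed-forward block: down(swish(gate(x)) * up(x)), without bias terms."""
    gate = [swish(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

Unlike layer normalization, RMS normalization does not subtract the mean, so it has one fewer reduction per call.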
System Optimization
- FP32 numerics except matmul in BF16
- Data and model parallelism optimized for TPU-v4
Training Optimization
- Stage-wise sequence length schedule (2K→4K→8K)
- 1.5T total token budget for 8K model
- JAX implementation (JaxFormer) optimized for TPU-v4
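The stage-wise schedule and token budget above can be checked with a few lines of arithmetic. The sketch assumes "2K", "4K", and "8K" mean 2048, 4096, and 8192 tokens; the `Stage` class and helper names are illustrative and do not come from the released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    seq_len: int  # tokens per training sequence in this stage
    tokens: int   # tokens consumed during this stage


# Per-stage budgets as stated in the summary: 800B, 400B, 300B.
SCHEDULE = [
    Stage(seq_len=2048, tokens=800 * 10**9),
    Stage(seq_len=4096, tokens=400 * 10**9),
    Stage(seq_len=8192, tokens=300 * 10**9),
]


def total_tokens(schedule: List[Stage]) -> int:
    """Sum the token budget across all stages."""
    return sum(stage.tokens for stage in schedule)


def sequences_per_stage(schedule: List[Stage]) -> List[int]:
    """Number of full-length sequences each stage sees."""
    return [stage.tokens // stage.seq_len for stage in schedule]


if __name__ == "__main__":
    print(total_tokens(SCHEDULE))         # 1.5 trillion tokens in total
    print(sequences_per_stage(SCHEDULE))  # fewer, longer sequences in later stages
```

Most of the budget is spent at the shortest length, where attention is cheapest, and only the final 300B tokens pay the full 8K cost.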
Inference Optimization
- Model kept at 7B for inference efficiency and potential mobile/16GB-GPU deployment
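A rough memory estimate shows why a 7B model is plausible on a 16GB GPU. The calculation below assumes exactly 7 billion parameters and 16-bit weights, and it counts weights only; activations and the key-value cache, which grow with context length, are not included.

```python
def weight_memory_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 7 * 10**9  # assumed round figure; the true count may differ slightly

if __name__ == "__main__":
    print(f"16-bit weights: {weight_memory_gib(N_PARAMS, 2):.1f} GiB")
    print(f"32-bit weights: {weight_memory_gib(N_PARAMS, 4):.1f} GiB")
```

With 16-bit weights the model takes about 13 GiB, which leaves limited headroom on a 16GB card for long-context inference; 32-bit weights would not fit at all.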
Reproducibility
Data URLs
- https://github.com/togethercomputer/OpenWebText
- https://github.com/togethercomputer/RedPajama (GitHub subset)
- StarCoder/BigCode (referenced)
- WizardLM dataset (WizardLM-196K)
- ShareGPT, Baize, Dolly2, OpenAssistant, SCROLLS
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Models still exhibit bias, toxicity and hallucinations like other LLMs (Section 7).
- Quadratic attention cost makes long-sequence training and inference expensive.
- Some evaluation relies on LLM-as-judge (GPT-4), which can bias results.
- Tokenization: consecutive whitespace handling can break Python syntax in some generations.
- Open-source release is partial; exact licensing and some data details are not fully specified.
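The quadratic attention cost listed among the limitations can be made concrete. The sketch below counts only the entries of the attention score matrix for a single head in a single layer, and assumes 2048, 4096, and 8192 as the sequence lengths; memory-efficient attention kernels change the memory picture but not the amount of compute.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one dense attention score matrix (one head, one layer)."""
    return seq_len * seq_len


def relative_attention_cost(seq_len: int, base_len: int = 2048) -> float:
    """Cost of the score matrix relative to a base sequence length."""
    return attention_score_entries(seq_len) / attention_score_entries(base_len)


if __name__ == "__main__":
    for n in (2048, 4096, 8192):
        print(n, attention_score_entries(n), relative_attention_cost(n))
```

Going from 2K to 8K multiplies the per-sequence attention cost by 16, while the number of tokens per sequence grows only by 4.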
When Not To Use
- When you need the highest possible single-shot assistant quality (GPT-4 still scores higher on MT-Bench).
- For extreme context lengths beyond 8K tokens without further engineering.
- When strict certified safety or full provenance of training data is required.
Failure Modes
- Hallucinations on factual queries despite long context.
- Performance degradation if training recipe deviates from RMS-Norm and sequential attention choices.
- Code generation errors due to tokenization of consecutive whitespace.
- Benchmark scores influenced by LLM-based judges and dataset overlap.
Core Entities
Models
- XGen-7B-4K
- XGen-7B-8K
- XGen-7B-Inst (WizardLM)
- XGen-7B-Inst (general)
Metrics
- Accuracy
- pass@1
- Win rate (AlpacaEval)
- MT-Bench score
- ROUGE-1/2/L
- Perplexity
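Of the metrics above, pass@1 is the least self-explanatory. A widely used estimator for pass@k, introduced alongside HumanEval, is 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The sketch below implements that formula; this summary does not state which estimator or sample count the XGen evaluation used, so treat it as background rather than the authors' exact procedure.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass; the benchmark score is the mean of this value over all problems.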
Datasets
- Natural language mix (public corpora)
- RedPajama GitHub subset
- Apex code
- StarCoder (BigCode) data
- WizardLM-196K
- ShareGPT
- Baize
- Dolly2
- OpenAssistant (OAsst)
- SCROLLS (long-sequence subsets)
- HumanEval
- MMLU
- AlpacaEval
- MT-Bench
Benchmarks
- MMLU
- HumanEval
- AlpacaEval
- MT-Bench
- SCROLLS long-sequence
- ROUGE (dialogue summarization)

