An open‑source Llama3-based model with a 128K context window that matches or beats many proprietary models on ultra-long and RAG tasks

July 19, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

0

Authors

Peng Xu, Wei Ping, Xianchao Wu, Chejian Xu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro

Links

Abstract / PDF

Why It Matters For Business

You can run an open‑source 70B model that reads 100K+ tokens and often matches or beats commercial models on retrieval and long‑document QA, reducing dependence on closed APIs and giving control over data and cost.

Summary TLDR

ChatQA 2 is a Llama‑3.0 based model extended from 8K to 128K context and instruction‑tuned in three stages. The 70B model outperforms many public and commercial baselines on ultra‑long benchmarks (>100K tokens) and on RAG short‑context tasks, while being competitive on 32K tasks. The recipe (pretraining data, tuning data, retriever setup) and weights are released so practitioners can reproduce or adapt the approach.

Problem Statement

Open-source LLMs lag behind proprietary models on very long‑context and retrieval‑augmented tasks. The paper asks: can we extend an 8K Llama3 model to 128K context, tune it for instruction following and RAG, and match proprietary performance on realistic long documents?

Main Contribution

A reproducible recipe to extend Llama3‑70B from 8K to 128K context by continued pretraining and RoPE frequency scaling.

A three‑stage instruction tuning pipeline that mixes short and synthetic long SFT data to improve instruction following, RAG, and ultra‑long understanding.

Empirical comparisons showing the open model (ChatQA‑2‑70B) outperforms many SOTA models on ultra‑long tasks and improves when using large RAG retrieval budgets.

Key Findings

ChatQA‑2‑70B achieves top average on four ultra‑long InfiniteBench tasks.

Numbers: Avg 41.04 vs GPT‑4‑Turbo 33.16 (InfiniteBench)

On short RAG conversational benchmarks (inputs within 4K tokens), ChatQA‑2‑70B outperforms several 128K context models.

Numbers: Avg 56.30 vs GPT‑4‑Turbo 54.72 (ChatRAG Bench)

On long (32K) benchmarks, ChatQA‑2‑70B is competitive but not top.

Numbers: 48.15 vs GPT‑4‑Turbo 51.93 (32K long tasks)

RAG with many retrieved chunks can beat direct long‑context prompting.

Numbers: RAG 64.55 (k=30) vs Long 64.29 (En.QA+En.MC avg)

Needle‑in‑a‑Haystack retrieval test: ChatQA‑2 models hit 100% retrieval accuracy up to 128K.

Numbers: 100% accuracy on NIAH needle test
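A needle test of this kind can be sketched as follows. This is a hypothetical harness, not the paper's evaluation code: the function names, the filler sentence, and the use of whitespace-separated words in place of model tokens are all assumptions made for illustration. It only builds the prompt and checks an answer string; calling a model is left out.

```python
def build_haystack(filler_sentence: str, needle: str, total_words: int, depth: float) -> str:
    """Return about `total_words` words of filler with `needle` inserted at a relative depth.

    depth = 0.0 places the needle at the start, 1.0 at the end.
    Words stand in for tokens here (an assumption for the sketch).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_words = filler_sentence.split()
    if not filler_words:
        raise ValueError("filler_sentence must contain at least one word")
    repeats = total_words // len(filler_words) + 1
    words = (filler_words * repeats)[:total_words]
    position = int(len(words) * depth)
    return " ".join(words[:position] + needle.split() + words[position:])


def needle_retrieved(model_answer: str, expected: str) -> bool:
    """Case-insensitive check that the expected fact appears in the answer."""
    return expected.lower() in model_answer.lower()


if __name__ == "__main__":
    needle = "The secret number is 7421."
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack("The grass is green.", needle, 1000, depth)
        print(depth, len(prompt.split()))
```

Sweeping both the haystack length and the depth gives the usual grid of pass/fail cells; a 100% result means every cell passes.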

Accuracy improves monotonically with the total number of retrieved tokens in RAG.

Numbers: Monotonic gain when moving from 3K to 24K retrieved tokens (Figure 2)

Results

Ultra-long average (InfiniteBench)

Value: 41.04

Baseline: GPT‑4‑Turbo 33.16

Long (32K) average

Value: 48.15

Baseline: GPT‑4‑Turbo 51.93

Short (4K) ChatRAG average

Value: 56.30

Baseline: GPT‑4‑Turbo 54.72

Needle In A Haystack accuracy (up to 128K)

Value: 100%

RAG vs Long (En.QA+En.MC avg)

Value: RAG 64.55 (k=30)

Baseline: Long 64.29

Who Should Care

What To Try In 7 Days

Reproduce their 128K recipe on a smaller scale (8B) using provided weights and data to validate retrieval on your docs.

Swap in a long‑context retriever (E5‑mistral or NV‑emb‑v2) and test RAG with increasing top‑k until accuracy plateaus.

For QA over very long docs, try RAG with k × chunk_size ≥ 12K tokens (for example, k = 10 with 1200‑token chunks) before using full long prompts.
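The retrieved-token budget in the last suggestion is simple arithmetic, and a small helper makes it easy to pick k for a given chunk size. This is a minimal sketch: the function names are hypothetical, and "12K" is read as 12,000 tokens, which is an assumption.

```python
import math


def min_top_k(target_tokens: int, chunk_size: int) -> int:
    """Smallest k such that k * chunk_size >= target_tokens."""
    if target_tokens <= 0 or chunk_size <= 0:
        raise ValueError("target_tokens and chunk_size must be positive")
    return math.ceil(target_tokens / chunk_size)


def retrieved_budget(k: int, chunk_size: int) -> int:
    """Total number of retrieved tokens placed in the prompt."""
    return k * chunk_size


if __name__ == "__main__":
    target = 12_000  # assumption: "12K" read as 12,000 tokens
    for chunk_size in (300, 600, 1200):
        k = min_top_k(target, chunk_size)
        print(f"chunk_size={chunk_size} k={k} budget={retrieved_budget(k, chunk_size)}")
```

With 1200-token chunks the default top‑5 setting yields 6,000 retrieved tokens, so reaching the 12K budget means roughly doubling k.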

Agent Features

Memory

  • long context (128K)

Optimization Features

Token Efficiency

  • Chunk size 1200 with top‑5 used as default
  • Accuracy

Infra Optimization

  • About 4M tokens per batch in the continued‑pretraining setup (batch size 32 × 128K‑token sequences)

Model Optimization

  • Increase RoPE base frequency (to 150M)
  • Use document upsampling for long sequences
  • Use '<s>' as document separator instead of <BOS>/<EOS>
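To see why a larger RoPE base helps with long sequences, it can be useful to compute the rotary frequencies directly. The sketch below uses the standard RoPE formula, theta_i = base^(-2i/d). The head dimension of 128 and the starting base of 500,000 are assumptions taken from the public Llama 3 configuration, not from this summary, and the function names are hypothetical.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim), i = 0 .. head_dim/2 - 1."""
    if head_dim <= 0 or head_dim % 2 != 0:
        raise ValueError("head_dim must be a positive even number")
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating dimension pair to complete one full turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


if __name__ == "__main__":
    head_dim = 128            # assumption: Llama 3 attention head dimension
    original_base = 500_000   # assumption: Llama 3 default rope_theta
    extended_base = 150_000_000  # the 150M value listed above
    for base in (original_base, extended_base):
        print(f"base={base:,} longest wavelength ~ {longest_wavelength(head_dim, base):,.0f} positions")
```

Raising the base stretches the low-frequency wavelengths, so positions far apart in a 128K sequence still map to distinguishable rotation angles.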

System Optimization

  • Use long‑context retrievers (E5‑mistral or NV‑emb‑v2) that embed thousands of tokens

Training Optimization

  • Continued pretraining on 10B tokens of 128K sequences (upsampled)
  • Learning rate 3e‑5, batch size 32, 2000 steps (8B tokens)
  • SFT
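The continued-pretraining numbers above are consistent with each other, which a quick calculation confirms. This is a worked check rather than training code; it assumes "128K" means 131,072 tokens per sequence, and the constant and function names are hypothetical.

```python
SEQ_LEN = 128 * 1024            # assumption: "128K" means 131,072 tokens
BATCH_SIZE = 32                 # sequences per batch, from the list above
STEPS = 2000                    # optimizer steps, from the list above
CORPUS_TOKENS = 10_000_000_000  # the 10B-token upsampled corpus, from the list above


def tokens_per_batch(seq_len: int = SEQ_LEN, batch_size: int = BATCH_SIZE) -> int:
    """Tokens consumed by one optimizer step."""
    return seq_len * batch_size


def total_training_tokens(steps: int = STEPS) -> int:
    """Tokens consumed over the whole run."""
    return tokens_per_batch() * steps


if __name__ == "__main__":
    print(f"tokens per batch: {tokens_per_batch():,}")       # about 4M
    print(f"total tokens:     {total_training_tokens():,}")  # about 8B
    print(f"share of corpus:  {total_training_tokens() / CORPUS_TOKENS:.2f}")
```

The run therefore covers roughly 8.4B of the 10B prepared tokens, matching the "~4M tokens per batch" and "8B tokens" figures.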

Inference Optimization

  • Prefer RAG with tuned top‑k to reduce inference cost
  • Use larger chunks (e.g., 1200 tokens) and a larger k for better recall
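The chunk-then-retrieve step behind these settings can be sketched in a few lines. This is a toy illustration under stated assumptions: whitespace-separated words stand in for model tokens, and a simple word-overlap score stands in for a dense retriever such as E5‑mistral. None of the function names come from the paper.

```python
from collections import Counter


def chunk_tokens(tokens: list, chunk_size: int) -> list:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(query_tokens: list, chunk: list) -> int:
    """Toy relevance score: number of query tokens that also occur in the chunk."""
    q, c = Counter(query_tokens), Counter(chunk)
    return sum(min(q[t], c[t]) for t in q)


def top_k_chunks(query: str, document: str, chunk_size: int = 1200, k: int = 5) -> list:
    """Return the k highest-scoring chunks as strings, best first."""
    query_tokens = query.lower().split()
    chunks = chunk_tokens(document.lower().split(), chunk_size)
    ranked = sorted(chunks, key=lambda ch: overlap_score(query_tokens, ch), reverse=True)
    return [" ".join(ch) for ch in ranked[:k]]


if __name__ == "__main__":
    doc = "alpha beta gamma delta epsilon zeta eta theta"
    print(top_k_chunks("epsilon zeta", doc, chunk_size=2, k=2))
```

In a real pipeline the scoring function would be replaced by embedding similarity, and k would be swept upward until accuracy stops improving.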

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Lower summarization scores due to limited summarization SFT data.
  • Not as strong on some knowledge/coding benchmarks (MMLU, HumanEval) without RLHF/DPO.
  • Continued pretraining corpus is smaller than some competitors (e.g., Qwen2), which may limit 32K performance.

When Not To Use

  • If you need top‑tier summarization quality without extra SFT data.
  • When coding or knowledge‑intensive benchmarks are the main goal without RLHF.
  • If you cannot host or afford inference for a 70B model; test smaller variants first.

Failure Modes

  • Too many irrelevant retrieved chunks can degrade generation if top‑k is not tuned.
  • Fragmentation from small chunk sizes hurts context continuity for some tasks.
  • Performance depends on retriever quality and the total retrieved token budget.

Core Entities

Models

  • Llama3-ChatQA-2-70B
  • Llama3-ChatQA-2-8B
  • Llama3-ChatQA-1.5-70B
  • Llama3.1-70B-Instruct
  • Llama3.1-8B-Instruct
  • Qwen2-72B-Instruct
  • GPT-4-Turbo-2024-04-09
  • Yi-34B
  • Claude 2

Metrics

  • Accuracy
  • ROUGE-L-Sum
  • F1
  • Exact Match (EM)

Datasets

  • SlimPajama
  • NarrativeQA
  • OpenOrca
  • Long-Data-Collections
  • InfiniteBench
  • ChatRAG Bench
  • SCROLLS
  • LongBench
  • NeedleInAHaystack

Benchmarks

  • InfiniteBench (ultra-long)
  • LongBench/SCROLLS (32K)
  • ChatRAG Bench (4K)
  • Needle In A Haystack

Context Entities

Models

  • Llama‑3‑70B‑Instruct‑Gradient‑262k
  • Llama3-Instruct-70B
  • GPT-3.5-Turbo

Datasets

  • LongAlpaca12k
  • QMSum
  • Qasper
  • QuALITY
  • HotpotQA
  • MuSiQue
  • MultiFieldQA-en