An open‑source Llama3-based model with a 128K context window that matches or beats many proprietary models on ultra-long and RAG tasks

Overview

Production Readiness

0.7

Novelty Score

0.65

Cost Impact Score

0.7

Citation Count

Authors

Peng Xu, Wei Ping, Xianchao Wu, Chejian Xu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro

Links

Abstract / PDF

Why It Matters For Business

You can run an open‑source 70B model that reads 100K+ tokens and often matches or beats commercial models on retrieval and long‑document QA, reducing dependence on closed APIs and giving control over data and cost.

Summary TLDR

ChatQA 2 is a Llama‑3.0 based model extended from 8K to 128K context and instruction‑tuned in three stages. The 70B model outperforms many public and commercial baselines on ultra‑long benchmarks (>100K tokens) and on RAG short‑context tasks, while being competitive on 32K tasks. The recipe (pretraining data, tuning data, retriever setup) and weights are released so practitioners can reproduce or adapt the approach.

Problem Statement

Open-source LLMs lag behind proprietary models on very long‑context and retrieval‑augmented tasks. The paper asks: can an 8K Llama3 model be extended to a 128K context window, tuned for instruction following and RAG, and made to match proprietary performance on realistic long documents?

Main Contribution

A reproducible recipe to extend Llama3‑70B from 8K to 128K context by continued pretraining and RoPE frequency scaling.

A three‑stage instruction tuning pipeline that mixes short and synthetic long SFT data to improve instruction following, RAG, and ultra‑long understanding.

Empirical comparisons showing the open model (ChatQA‑2‑70B) outperforms many SOTA models on ultra‑long tasks and improves when using large RAG retrieval budgets.

Key Findings

ChatQA‑2‑70B achieves the top average score across four ultra‑long InfiniteBench tasks.

Numbers: Avg 41.04 vs GPT‑4‑Turbo 33.16 (InfiniteBench)

On short RAG conversational benchmarks (within 4K), ChatQA‑2‑70B outperforms several 128K context models.

Numbers: Avg 56.30 vs GPT‑4‑Turbo 54.72 (ChatRAG Bench)

On long (32K) benchmarks, ChatQA‑2‑70B is competitive but not top.

Numbers: 48.15 vs GPT‑4‑Turbo 51.93 (32K long tasks)

RAG with many retrieved chunks can beat direct long‑context prompting.

Numbers: RAG 64.55 (k=30) vs Long 64.29 (En.QA+En.MC avg)

Needle‑in‑a‑Haystack retrieval test: ChatQA‑2 models hit 100% retrieval accuracy up to 128K.

Numbers: 100% accuracy on the Needle‑in‑a‑Haystack (NIAH) test

Accuracy improves monotonically with the total number of retrieved tokens in RAG.

Numbers: Monotonic gain when moving from 3K to 24K retrieved tokens (Figure 2)

Results

Ultra-long average (InfiniteBench)

Value: 41.04

Baseline: GPT‑4‑Turbo 33.16

Long (32K) average

Value: 48.15

Baseline: GPT‑4‑Turbo 51.93

Short (4K) ChatRAG average

Value: 56.30

Baseline: GPT‑4‑Turbo 54.72

Needle‑in‑a‑Haystack retrieval accuracy (up to 128K)

Value: 100%

RAG vs Long (En.QA+En.MC avg)

Value: RAG 64.55 (k=30)

Baseline: Long 64.29

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

What To Try In 7 Days

Reproduce their 128K recipe on a smaller scale (8B) using provided weights and data to validate retrieval on your docs.

Swap in a long‑context retriever (E5‑mistral or NV‑emb‑v2) and test RAG with increasing top‑k until accuracy plateaus.

For QA over very long docs, try RAG with a total retrieved budget of k × chunk_size ≥ 12K tokens before falling back to full long‑context prompts.
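The retrieval budget above is simple arithmetic, and a small helper makes the top‑k sweep concrete. The sketch below is only an illustration: the function names are hypothetical, "12K" is treated as 12,000 tokens, and the intermediate sweep points (6K, 12K) are assumptions placed between the 3K and 24K endpoints mentioned in the findings.

```python
import math


def min_top_k(token_budget: int, chunk_size: int) -> int:
    """Smallest k such that k * chunk_size >= token_budget."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(token_budget / chunk_size)


def sweep_plan(chunk_size: int, budgets=(3_000, 6_000, 12_000, 24_000)):
    """Pair each retrieved-token budget with the top-k needed to reach it.

    The budget values are illustrative assumptions, not settings from the paper.
    """
    return [(budget, min_top_k(budget, chunk_size)) for budget in budgets]


if __name__ == "__main__":
    # With 1200-token chunks, list the k values to evaluate on your own documents.
    for budget, k in sweep_plan(chunk_size=1200):
        print(f"budget={budget:>6} tokens -> top-k={k}")
```

Running the sweep on your own documents and stopping where accuracy flattens is one way to pick k; smaller chunks need a proportionally larger k to reach the same budget.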

Agent Features

Memory

long context (128K)

Optimization Features

Token Efficiency

Chunk size 1200 with top‑5 used as default
Accuracy

Infra Optimization

About 4M tokens per batch in the continued‑pretraining setup

Model Optimization

Increase RoPE base frequency (to 150M)
Use document upsampling for long sequences
Use '<s>' as document separator instead of <BOS>/<EOS>
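To see why a larger RoPE base helps at 128K, it can be useful to look at the wavelength of each rotary frequency pair. The sketch below is a rough illustration, not the authors' code: it assumes a head dimension of 128 and a starting base of 500K (the Llama 3 default as commonly configured; this summary does not state it), and the function names are hypothetical.

```python
import math


def rope_wavelengths(base: float, head_dim: int = 128) -> list[float]:
    """Wavelength, in token positions, of each rotary frequency pair.

    Pair i rotates with angular frequency base ** (-2i / head_dim),
    so its wavelength is 2 * pi * base ** (2i / head_dim).
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def pairs_longer_than_context(base: float, context_len: int, head_dim: int = 128) -> int:
    """Count frequency pairs that do not complete a full rotation within the context."""
    return sum(w > context_len for w in rope_wavelengths(base, head_dim))


if __name__ == "__main__":
    context_len = 128 * 1024  # assumption: "128K" means 131,072 positions
    for base in (500_000.0, 150_000_000.0):  # assumed original base vs. the 150M in the text
        n = pairs_longer_than_context(base, context_len)
        print(f"base={base:.0f}: {n} pairs with wavelength > {context_len}")
```

Raising the base stretches the low‑frequency pairs, so more of them stay within a single rotation across the whole window; the continued pretraining then lets the model adapt to the new position encoding.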

System Optimization

Use long‑context retrievers (E5‑mistral or NV‑emb‑v2) that embed thousands of tokens

Training Optimization

Continued pretraining on a 10B‑token corpus of 128K‑token sequences (long documents upsampled)
Learning rate 3e‑5, batch size 32, 2000 steps (about 8B tokens seen)
SFT
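The batch and step figures above can be cross‑checked with a few lines of arithmetic. This is a minimal sketch that assumes "128K" means 131,072 tokens per sequence; the helper name is hypothetical.

```python
def training_token_budget(seq_len: int, batch_size: int, steps: int) -> tuple[int, int]:
    """Return (tokens per batch, total tokens seen) for fixed-length sequences."""
    tokens_per_batch = seq_len * batch_size
    return tokens_per_batch, tokens_per_batch * steps


if __name__ == "__main__":
    seq_len = 128 * 1024  # assumption: 128K = 131,072 tokens
    per_batch, total = training_token_budget(seq_len, batch_size=32, steps=2000)
    print(f"tokens per batch: {per_batch:,}")  # about 4M, matching the batching note
    print(f"total tokens:     {total:,}")      # about 8B, matching the step count
    print(f"share of a 10B-token corpus: {total / 10_000_000_000:.0%}")
```

Under this reading, 2000 steps cover roughly 8B of the 10B available tokens, which is consistent with the figures listed.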

Inference Optimization

Prefer RAG with tuned top‑k to reduce inference cost
Use chunk sizes (e.g., 1200) and larger k for better recall
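The chunk‑then‑retrieve flow behind these settings can be sketched as follows. This is a toy illustration under loud assumptions: whitespace‑separated words stand in for tokens, and a simple word‑overlap score stands in for a dense retriever such as E5‑mistral; all function names and the prompt layout are hypothetical.

```python
from collections import Counter


def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query words that also occur in the chunk."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    return sum(min(q[w], c[w]) for w in q)


def retrieve_top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Return the k highest-scoring chunks, best first."""
    ranked = sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)
    return ranked[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Concatenate retrieved chunks ahead of the question (layout is an assumption)."""
    context = "\n\n".join(retrieved)
    return f"{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    doc = "alpha beta gamma delta needle is blue epsilon zeta eta theta"
    chunks = chunk_words(doc, chunk_size=4)
    top = retrieve_top_k("what color is the needle", chunks, k=1)
    print(build_prompt("what color is the needle", top))
```

In a real setup, the scorer would be replaced by embedding similarity from a long‑context retriever, and chunk_size and k would be tuned together so that the retrieved token budget stays within what the task needs.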

Reproducibility

Code Urls

https://chatqa2-project.github.io/

Data Urls

https://chatqa2-project.github.io/

Code Available

Data Available

Open Source Status

Risks & Boundaries

Limitations

Lower summarization scores due to limited summarization SFT data.
Not as strong on some knowledge/coding benchmarks (MMLU, HumanEval) without RLHF/DPO.
Continued pretraining corpus is smaller than some competitors (e.g., Qwen2), which may limit 32K performance.

When Not To Use

If you need top‑tier summarization quality without extra SFT data.
When coding or knowledge‑intensive benchmarks are the main goal without RLHF.
If you cannot host or afford inference for a 70B model; test smaller variants first.

Failure Modes

Too many irrelevant retrieved chunks can degrade generation if not tuned.
Fragmentation from small chunk sizes hurts context continuity for some tasks.
Performance depends on retriever quality and the total retrieved token budget.

Core Entities

Models

Llama3-ChatQA-2-70B
Llama3-ChatQA-2-8B
Llama3-ChatQA-1.5-70B
Llama3.1-70B-Instruct
Llama3.1-8B-Instruct
Qwen2-72B-Instruct
GPT-4-Turbo-2024-04-09
Yi-34B
Claude 2

Metrics

Accuracy
ROUGE-L-Sum
F1
Exact Match (EM)

Datasets

SlimPajama
NarrativeQA
OpenOrca
Long-Data-Collections
InfiniteBench
ChatRAG Bench
SCROLLS
LongBench
NeedleInAHaystack

Benchmarks

InfiniteBench (ultra-long)
LongBench/SCROLLS (32K)
ChatRAG Bench (4K)
Needle In A Haystack

Context Entities

Models

Llama‑3‑70B‑Instruct‑Gradient‑262k
Llama3-Instruct-70B
GPT-3.5-Turbo

Datasets

LongAlpaca12k
QMSum
Qasper
QuALITY
HotpotQA
MuSiQue
MultiFieldQA-en

Related Papers