Overview
Production Readiness
0.7
Novelty Score
0.65
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can run an open‑source 70B model that reads 100K+ tokens and often matches or beats commercial models on retrieval and long‑document QA, reducing dependence on closed APIs and giving control over data and cost.
Summary TLDR
ChatQA 2 is a Llama‑3.0‑based model whose context window is extended from 8K to 128K tokens and which is then instruction‑tuned in three stages. The 70B model outperforms many public and commercial baselines on ultra‑long benchmarks (>100K tokens) and on short‑context RAG tasks, and is competitive on 32K tasks. The recipe (pretraining data, tuning data, retriever setup) and weights are released so practitioners can reproduce or adapt the approach.
Problem Statement
Open-source LLMs lag behind proprietary models on very long contexts and retrieval‑augmented tasks. The paper asks: can we extend an 8K Llama3 model to a 128K context, tune it for instruction following and RAG, and match proprietary performance on realistic long documents?
Main Contribution
A reproducible recipe to extend Llama3‑70B from 8K to 128K context by continued pretraining and RoPE frequency scaling.
A three‑stage instruction tuning pipeline that mixes short and synthetic long SFT data to improve instruction following, RAG, and ultra‑long understanding.
Empirical comparisons showing the open model (ChatQA‑2‑70B) outperforms many SOTA models on ultra‑long tasks and improves when using large RAG retrieval budgets.
Key Findings
ChatQA‑2‑70B achieves the top average score across four ultra‑long InfiniteBench tasks.
On short RAG conversational benchmarks (within 4K), ChatQA‑2‑70B outperforms several 128K context models.
On long (32K) benchmarks, ChatQA‑2‑70B is competitive but not top.
RAG with many retrieved chunks can beat direct long‑context prompting.
On the Needle‑in‑a‑Haystack retrieval test, ChatQA‑2 models reach 100% retrieval accuracy at context lengths up to 128K.
In the reported RAG experiments, accuracy improves monotonically with the total number of retrieved tokens.
Results
Ultra-long average (InfiniteBench)
Long (32K) average
Short (4K) ChatRAG average
Accuracy
RAG vs Long (En.QA+En.MC avg)
Who Should Care
What To Try In 7 Days
Reproduce their 128K recipe on a smaller scale (8B) using provided weights and data to validate retrieval on your docs.
Swap in a long‑context retriever (E5‑mistral or NV‑emb‑v2) and test RAG with increasing top‑k until accuracy plateaus.
For QA over very long docs, try RAG with k × chunk_size ≥ 12K tokens before using full long prompts.
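The last two steps can be sketched as a small sweep. This is a minimal sketch under stated assumptions, not the authors' code: `min_top_k`, `sweep_top_k`, the `evaluate` callable, and the plateau tolerance are hypothetical names and values chosen for illustration, and you would plug in your own RAG evaluation.

```python
import math
from typing import Callable, Iterable, List, Tuple


def min_top_k(token_budget: int, chunk_size: int) -> int:
    """Smallest k such that k * chunk_size >= token_budget."""
    return math.ceil(token_budget / chunk_size)


def sweep_top_k(
    evaluate: Callable[[int], float],
    ks: Iterable[int],
    tolerance: float = 0.005,  # hypothetical plateau threshold
) -> Tuple[int, List[Tuple[int, float]]]:
    """Try increasing k and stop once the accuracy gain falls below `tolerance`.

    `evaluate(k)` is a hypothetical callable that runs your RAG pipeline with
    top-k retrieval and returns accuracy in [0, 1].
    Returns the chosen k and the (k, accuracy) history.
    """
    history: List[Tuple[int, float]] = []
    best_k = None
    best_acc = float("-inf")
    for k in ks:
        acc = evaluate(k)
        history.append((k, acc))
        if best_k is not None and acc - best_acc < tolerance:
            break  # gain too small (or negative): treat as a plateau
        best_k, best_acc = k, acc
    if best_k is None:
        raise ValueError("ks must contain at least one value")
    return best_k, history
```

With a chunk size of 1200 tokens, `min_top_k(12_000, 1200)` gives 10, so a sweep such as `ks = [5, 10, 20, 40]` starts below the suggested budget and extends past it.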
Agent Features
Memory
- long context (128K)
Optimization Features
Token Efficiency
- Default retrieval setting: chunk size 1200 with top‑5 chunks (about 6K retrieved tokens)
- Accuracy
Infra Optimization
- Pretraining uses batches of roughly 4M tokens (batch size 32 at 128K sequence length)
Model Optimization
- Increase the RoPE base frequency to 150M
- Use document upsampling for long sequences
- Use '<s>' as the document separator instead of the reserved <BOS>/<EOS> tokens
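To see why a larger RoPE base helps with long sequences, the rotation frequencies can be computed directly. The following is a minimal sketch of the standard RoPE frequency formula, not the authors' training code. The head dimension of 128 and the starting base of 500K are assumptions taken from the public Llama 3 configuration, and the function names are hypothetical.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Rotation frequency of each dimension pair: base^(-2i/d), i = 0..d/2-1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength, in token positions, of the slowest-rotating pair."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 128                  # assumption: Llama 3 attention head dimension
ORIGINAL_BASE = 500_000.0       # assumption: Llama 3 default rope_theta
EXTENDED_BASE = 150_000_000.0   # the 150M value listed above

if __name__ == "__main__":
    for name, base in [("original", ORIGINAL_BASE), ("extended", EXTENDED_BASE)]:
        print(f"{name}: longest wavelength = {longest_wavelength(HEAD_DIM, base):,.0f} positions")
```

A larger base slows the rotation of every pair except the first, so positions that are far apart remain distinguishable at 128K. The model still needs continued pretraining to adapt to the new frequencies.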
System Optimization
- Use long‑context retrievers (E5‑mistral or NV‑emb‑v2) that embed thousands of tokens
Training Optimization
- Continued pretraining corpus: 10B tokens packed into 128K‑token sequences, with long documents upsampled
- Learning rate 3e‑5, batch size 32, 2000 steps (about 8B tokens processed)
- SFT
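The token counts above follow from the batch size, sequence length, and step count. A quick arithmetic check, assuming "128K" means 131,072 tokens per sequence (the summary does not state the exact figure):

```python
SEQ_LEN = 128 * 1024   # assumption: "128K" = 131,072 tokens
BATCH_SIZE = 32        # sequences per batch, as listed above
STEPS = 2000           # optimizer steps, as listed above


def tokens_per_batch(batch_size: int, seq_len: int) -> int:
    """Tokens processed in one optimizer step."""
    return batch_size * seq_len


def total_tokens(batch_size: int, seq_len: int, steps: int) -> int:
    """Tokens processed over the whole continued-pretraining run."""
    return tokens_per_batch(batch_size, seq_len) * steps


if __name__ == "__main__":
    print(f"tokens per batch: {tokens_per_batch(BATCH_SIZE, SEQ_LEN):,}")
    print(f"total tokens:     {total_tokens(BATCH_SIZE, SEQ_LEN, STEPS):,}")
```

Under that assumption each step covers about 4.2M tokens and the full run about 8.4B, which matches the "~4M tokens per batch" and "8B tokens" figures and shows that the run consumes less than the full 10B‑token corpus.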
Inference Optimization
- Prefer RAG with tuned top‑k to reduce inference cost
- Use moderately large chunks (e.g., 1200) and a larger k for better recall
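The chunk‑then‑retrieve step can be sketched as follows. This is an illustrative toy, not the paper's pipeline: whitespace‑split words stand in for model tokens, a word‑overlap score stands in for a dense retriever such as E5‑mistral, and all function names are hypothetical. Returning the selected chunks in document order is a design choice of this sketch.

```python
from typing import List, Tuple


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def overlap_score(query: List[str], chunk: List[str]) -> int:
    """Toy relevance score: number of distinct query tokens found in the chunk."""
    return len(set(query) & set(chunk))


def retrieve_top_k(
    query: List[str], chunks: List[List[str]], k: int
) -> List[Tuple[int, List[str]]]:
    """Return the k best-scoring chunks as (index, chunk), in document order."""
    ranked = sorted(
        enumerate(chunks),
        key=lambda pair: overlap_score(query, pair[1]),
        reverse=True,
    )
    return sorted(ranked[:k], key=lambda pair: pair[0])


def retrieved_budget(k: int, chunk_size: int) -> int:
    """Upper bound on tokens placed in the prompt by retrieval."""
    return k * chunk_size
```

For example, `retrieved_budget(5, 1200)` is 6000 tokens, far below a 128K prompt, which is where the inference cost saving comes from.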
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Lower summarization scores due to limited summarization SFT data.
- Not as strong on some knowledge/coding benchmarks (MMLU, HumanEval) without RLHF/DPO.
- Continued pretraining corpus is smaller than some competitors (e.g., Qwen2), which may limit 32K performance.
When Not To Use
- If you need top‑tier summarization quality and cannot add extra summarization SFT data.
- When coding or knowledge‑intensive benchmarks are the main goal and no RLHF/DPO stage is planned.
- If you cannot host or afford inference for a 70B model; test smaller variants first.
Failure Modes
- Too many irrelevant retrieved chunks can degrade generation if not tuned.
- Fragmentation from small chunk sizes hurts context continuity for some tasks.
- Performance depends on retriever quality and the total retrieved token budget.
Core Entities
Models
- Llama3-ChatQA-2-70B
- Llama3-ChatQA-2-8B
- Llama3-ChatQA-1.5-70B
- Llama3.1-70B-Instruct
- Llama3.1-8B-Instruct
- Qwen2-72B-Instruct
- GPT-4-Turbo-2024-04-09
- Yi-34B
- Claude 2
Metrics
- Accuracy
- ROUGE-L-Sum
- F1
- Exact Match (EM)
Datasets
- SlimPajama
- NarrativeQA
- OpenOrca
- Long-Data-Collections
- InfiniteBench
- ChatRAG Bench
- SCROLLS
- LongBench
- NeedleInAHaystack
Benchmarks
- InfiniteBench (ultra-long)
- LongBench/SCROLLS (32K)
- ChatRAG Bench (4K)
- Needle In A Haystack
Context Entities
Models
- Llama‑3‑70B‑Instruct‑Gradient‑262k
- Llama3-Instruct-70B
- GPT-3.5-Turbo
Datasets
- LongAlpaca12k
- QMSum
- Qasper
- QuALITY
- HotpotQA
- MuSiQue
- MultiFieldQA-en

