Learn offline 'cheat-sheets' so a 4k LLaMA2 handles 128k tokens, cutting tokens and latency

April 11, 2024 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Sijun Tan, Xiuyu Li, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca Ada Popa

Why It Matters For Business

LLoCO cuts token processing and GPU costs for long-document QA while improving accuracy and latency, letting teams serve very long documents without buying larger models or more GPUs.

Summary TLDR

LLoCO compresses long documents offline into short token-embedding "cheat sheets" and then finetunes small LoRA adapters on those embeddings. At inference it retrieves the compressed embeddings and the matching adapter, prepends the embeddings to the LLM's input, and generates answers. On LLaMA2-7B this extends the effective context from 4k to 128k tokens, uses ~30× fewer tokens, gives up to 7.62× faster inference and up to 11.52× higher finetuning throughput on A100, and improves long-document QA on several benchmarks. Main limits: the compressor is tied to a specific LLM, and adapters are trained per document group.

Problem Statement

Transformer LLMs slow down and run out of GPU memory on very long documents because self-attention and KV caches scale poorly with sequence length. This raises latency, GPU costs, and token billing for long-document QA and summarization.
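
To make the memory pressure concrete, the sketch below estimates the KV-cache size for a single sequence. It assumes the published LLaMA2-7B shape (32 layers, hidden size 4096, full multi-head attention) and fp16 storage; the function name is illustrative, and real serving stacks add overhead on top of this figure.

```python
def kv_cache_bytes(seq_len, n_layers=32, hidden_size=4096, bytes_per_value=2):
    """Approximate KV-cache size in bytes for one sequence.

    Every layer stores one key vector and one value vector of width
    hidden_size per token. Defaults assume LLaMA2-7B in fp16.
    """
    per_token = 2 * n_layers * hidden_size * bytes_per_value
    return seq_len * per_token


GIB = 1024 ** 3
for n_tokens in (4_096, 32_768, 131_072):
    print(f"{n_tokens:>7} tokens -> {kv_cache_bytes(n_tokens) / GIB:.1f} GiB")
```

Under these assumptions the cache grows linearly with sequence length, from about 2 GiB at 4k tokens to about 64 GiB at 128k tokens, before counting model weights or the attention computation itself.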

Main Contribution

Introduce LLoCO: offline context compression + in-domain LoRA finetuning + retrieval of compressed embeddings at inference.

Extend LLaMA2-7B (4k) to effectively handle up to 128k tokens using compressed embeddings.

Cut inference token usage by ~30× and achieve up to 7.62× faster per-token generation and up to 11.52× higher finetuning throughput (both on A100).

Show consistent QA improvements across multiple long-document benchmarks and robust retrieval (Needle-in-haystack) at ~80% success.

Key Findings

LLoCO raises average QA performance vs base LLaMA2-7B on evaluated long-doc tasks.

Numbers: Avg score 23.44 → 30.67 (Table 1; +7.23 pts)

Inference token footprint is reduced by about 30× via compression.

Numbers: 30× compression ratio (AutoCompressor: 1536 → 50 tokens per chunk)
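
The 1536-to-50 figure can be turned into a quick sizing calculation. This is a rough sketch: the helper names are hypothetical, it assumes every chunk (including a partial last one) yields exactly 50 summary tokens, and it ignores the room the question itself needs in the context window.

```python
import math


def compressed_length(n_tokens, chunk_size=1536, summary_tokens=50):
    """Summary tokens produced when each chunk becomes a fixed-size summary."""
    n_chunks = math.ceil(n_tokens / chunk_size)
    return n_chunks * summary_tokens


def max_original_tokens(window, chunk_size=1536, summary_tokens=50):
    """Original tokens whose summaries fit in a window of `window` tokens."""
    return (window // summary_tokens) * chunk_size


print(1536 / 50)                   # per-chunk compression ratio
print(compressed_length(32_768))   # summary tokens for a 32k-token document
print(max_original_tokens(4096))   # original tokens coverable by a 4k window
```

By this count a 4,096-token window holds summaries for roughly 124k original tokens, the same order as the 128k figure quoted for LLaMA2-7B; the exact chunking used in the paper may differ.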

Per-token generation gets large speedups on GPU when using compressed embeddings.

Numbers: Up to 7.62× speed-up on A100 and 7.19× on A6000

Finetuning throughput improves substantially when training on compressed contexts.

Numbers: Up to 11.52× higher throughput on A100s

LLoCO retrieves short, hard-to-find snippets reliably in retrieval stress tests.

Numbers: Needle-in-haystack retrieval ≈ 80% success rate across lengths
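
A needle-in-haystack test of this kind is commonly built by hiding one fact at varying depths in filler text and checking whether the answer recovers it. The sketch below shows that general construction only; `ask_model`, the filler text, the needle, and substring scoring are all assumptions for illustration and not the paper's exact protocol.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at relative position `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def success_rate(ask_model, filler_sentences, needle, question, expected, depths):
    """Fraction of depths at which the model's answer contains `expected`."""
    hits = 0
    for depth in depths:
        context = build_haystack(filler_sentences, needle, depth)
        answer = ask_model(context, question)
        hits += expected.lower() in answer.lower()
    return hits / len(depths)


# Toy stand-in for a real model: it only "reads" the first 60 words,
# mimicking a system that loses information deep in the context.
def toy_model(context, question):
    return " ".join(context.split()[:60])


filler = [f"Filler sentence number {i}." for i in range(50)]
needle = "The secret code is 7421."
rate = success_rate(toy_model, filler, needle, "What is the secret code?",
                    "7421", depths=[0.0, 0.25, 0.5, 0.75, 1.0])
print(rate)
```

Swapping `toy_model` for a call into a real pipeline and sweeping both depth and haystack length gives the usual success-rate grid for this test.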

Results

Average QA score (selected long-doc tasks)

Value: 30.67 (LLoCO)

Baseline: 23.44 (LLaMA2-7B-4k)

Compression ratio (tokens)

Value: 30×

Inference speed-up (per-token latency)

Value: 7.62× (A100), 7.19× (A6000)

Baseline: LLaMA2-7B without compression

Finetuning throughput

Value: up to 11.52× (A100)

Baseline: finetuning LLaMA2-7B on original context

Needle-in-haystack retrieval success

Value: ≈80% success

Baseline: LLaMA2-7B-32k (lower)

Who Should Care

What To Try In 7 Days

Compress a sample collection with an available compressor (AutoCompressor or ICAE) and store embeddings in your vector DB.

Finetune one small LoRA adapter on compressed embeddings for a representative document group and validate QA accuracy.

Serve queries by retrieving compressed embeddings + the matching LoRA adapter, and measure token cost and latency vs your current RAG baseline.
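
The storage and retrieval steps above can be prototyped without any infrastructure. The following is a minimal in-memory sketch, not the authors' implementation: `Entry`, `CompressedStore`, and the tiny vectors are hypothetical, the summary embeddings are placeholder payloads standing in for real compressor output, and a production setup would use an actual vector DB and retriever.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    passage_id: str
    group: str                # document group; decides which LoRA adapter to load
    retrieval_vec: list       # dense vector used only for similarity search
    summary_embeddings: list  # compressed "cheat sheet" for this passage


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CompressedStore:
    """Toy stand-in for a vector DB holding compressed passages."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def search(self, query_vec, k=2):
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query_vec, e.retrieval_vec),
                        reverse=True)
        return ranked[:k]


store = CompressedStore()
store.add(Entry("p1", "contracts", [1.0, 0.0], [[0.1, 0.2]]))
store.add(Entry("p2", "contracts", [0.9, 0.1], [[0.3, 0.4]]))
store.add(Entry("p3", "manuals", [0.0, 1.0], [[0.5, 0.6]]))

hits = store.search([1.0, 0.05], k=2)
# Concatenate the retrieved summaries; these would be prepended to the
# question's token embeddings before generation.
prefix = [vec for hit in hits for vec in hit.summary_embeddings]
print([hit.passage_id for hit in hits], len(prefix))
```

Storing the group label next to each compressed passage keeps adapter lookup a by-product of retrieval, which makes it easy to log token counts and latency per query when comparing against a RAG baseline.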

Optimization Features

Token Efficiency

  • 30× fewer tokens during inference

Infra Optimization

  • Enables long-sequence decoding (up to 128k) without exhausting GPU memory
  • Lower per-token latency on common GPUs (A100, A6000)

Model Optimization

  • LoRA
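
For a sense of how small a LoRA adapter is, the sketch below counts its trainable parameters. The rank (8) and the choice of two adapted projection matrices per layer are assumptions for illustration, since this summary does not state the paper's LoRA configuration; the layer count and hidden size are those of LLaMA2-7B.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns a low-rank update B @ A, where B is d_out x rank and
    A is rank x d_in.
    """
    return rank * (d_out + d_in)


def adapter_size(n_layers=32, hidden=4096, rank=8, matrices_per_layer=2):
    """Total adapter parameters when square hidden x hidden projections are adapted."""
    return n_layers * matrices_per_layer * lora_params(hidden, hidden, rank)


n_params = adapter_size()
print(n_params, f"{n_params * 2 / 1024 ** 2:.0f} MiB in fp16")
```

With these assumed settings an adapter has about 4.2M parameters (roughly 8 MiB in fp16), which is why keeping one adapter per document group is cheap to store and swap.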

System Optimization

  • Index compressed embeddings in vector DB and retrieve at runtime
  • LoRA

Training Optimization

  • LoRA

Inference Optimization

  • Prepend short summary embeddings instead of full context to reduce KV cache
  • Retrieve only compressed embeddings for relevant passages

Reproducibility

Data URLs

  • QuALITY
  • Qasper
  • NarrativeQA
  • HotpotQA
  • QMSum
  • LongBench

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Context encoder (AutoCompressor) is tied to a specific base LLM; a new encoder is needed per model.
  • Training a high-quality compressor can be costly (authors note ~15B tokens used).
  • Current pipeline uses one LoRA adapter per document group and cannot easily apply multiple adapters at once.
  • Falls short on tasks requiring long, free-form generated outputs (e.g., one-page summaries like GovReport).

When Not To Use

  • You need a model-agnostic compressor that works unchanged across many LLMs.
  • Tasks require long, detailed generated outputs rather than concise Q&A answers.
  • Your retrieval returns passages from many unrelated groups and you cannot pick or compose one LoRA adapter.

Failure Modes

  • Hallucination if LoRA adapter is not trained for the document distribution or compression loses critical facts.
  • Performance drops on out-of-distribution documents and datasets not represented in finetuning.
  • Serving errors when retrieved passages belong to multiple groups but only one adapter is applied.
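
The mixed-group failure mode can at least be detected before generation. The guard below is a hypothetical sketch: `select_adapter`, `MixedGroupError`, and the `min_share` threshold are illustrative names, and the policy (fail loudly unless one group dominates) is one possible choice rather than anything prescribed by the paper.

```python
from collections import Counter


class MixedGroupError(Exception):
    """Raised when retrieved passages do not agree on one document group."""


def select_adapter(groups, min_share=1.0):
    """Return the document group whose LoRA adapter should be loaded.

    `groups` holds the group label of each retrieved passage. With
    min_share=1.0 all passages must share one group; lower values let a
    dominant group win.
    """
    if not groups:
        raise ValueError("no passages retrieved")
    counts = Counter(groups)
    group, count = counts.most_common(1)[0]
    if count / len(groups) < min_share:
        raise MixedGroupError(f"retrieved groups disagree: {dict(counts)}")
    return group


print(select_adapter(["contracts", "contracts"]))
print(select_adapter(["contracts", "manuals", "contracts"], min_share=0.6))
```

Raising an explicit error lets the caller decide whether to re-retrieve within one group, fall back to a baseline pipeline, or drop the minority passages.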

Core Entities

Models

  • LLaMA2-7B-4k
  • LLaMA2-7B-32k
  • LLaMA2-7B-128k (comparative)
  • LongChat-7b-v1.5-32k
  • AutoCompressor
  • ICAE

Metrics

  • Exact Match (EM)
  • F1
  • ROUGE (geometric mean)
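
For reference, Exact Match and token-level F1 are usually computed as in the sketch below, which follows the widely used SQuAD-style normalization (lowercase, strip punctuation and English articles). The benchmarks listed here may apply their own variants, and `geometric_mean` is shown only as the generic way several ROUGE scores could be combined; the example strings and scores are made up.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def geometric_mean(scores):
    product = 1.0
    for score in scores:
        product *= score
    return product ** (1 / len(scores))


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(token_f1("in Paris France", "Paris"))
print(geometric_mean([0.25, 1.0]))
```

EM rewards only fully matching answers, while F1 gives partial credit for overlapping tokens, so reporting both shows whether errors are near misses or outright failures.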

Datasets

  • QuALITY
  • Qasper
  • NarrativeQA
  • HotpotQA
  • QMSum
  • LongBench

Benchmarks

  • LongBench

Context Entities

Models

  • GPT-4 (used for distilled training data)
  • CEPE
  • SnapKV