Overview
Production Readiness
0.75
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
1
Why It Matters For Business
LLoCO cuts token processing and GPU costs for long-document QA while improving accuracy and latency, letting teams serve very long documents without buying larger models or more GPUs.
Summary TLDR
LLoCO compresses long documents offline into short token-embedding "cheat sheets" and then finetunes small LoRA adapters on those embeddings. At inference it retrieves the compressed embeddings and the matching adapter, prepends the embeddings to the LLM input, and generates answers. On LLaMA2-7B this extends the effective context from 4k to 128k tokens, uses ~30× fewer tokens, gives up to 7.62× inference speedup and up to 11.52× higher finetuning throughput on A100, and improves long-document QA on several benchmarks. Main limits: the compressor is tied to a specific LLM, and adapters are trained per document group.
Problem Statement
Transformer LLMs slow down and run out of GPU memory on very long documents because self-attention and KV caches scale poorly with sequence length. This raises latency, GPU costs, and token billing for long-document QA and summarization.
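To make the memory scaling concrete, here is a rough sketch of the KV-cache arithmetic. It assumes LLaMA2-7B's published shape (32 layers, hidden size 4096, full multi-head attention), fp16 storage, and batch size 1. The function name `kv_cache_bytes` is invented for this example, and the ~30× factor is the compression ratio reported in this summary, not something the code derives.

```python
# Back-of-the-envelope KV-cache size for a decoder-only transformer.
# Defaults follow the published LLaMA2-7B configuration; fp16 storage
# and batch size 1 are assumptions of this sketch.

def kv_cache_bytes(seq_len, n_layers=32, hidden_size=4096, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers one key vector and one value vector per layer.
    return 2 * n_layers * hidden_size * bytes_per_value * seq_len

GIB = 1024 ** 3

full = kv_cache_bytes(128 * 1024)               # uncompressed 128k-token context
compressed = kv_cache_bytes(128 * 1024 // 30)   # ~30x fewer positions after compression

print(f"full: {full / GIB:.1f} GiB")
print(f"compressed: {compressed / GIB:.2f} GiB")
```

Under these assumptions the cache grows linearly at 0.5 MiB per token, so a 128k-token context needs roughly 64 GiB for the cache alone, while a ~30× shorter prefix needs about 2.1 GiB.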
Main Contribution
Introduce LLoCO: offline context compression + in-domain LoRA finetuning + retrieval of compressed embeddings at inference.
Extend LLaMA2-7B (4k) to effectively handle up to 128k tokens using compressed embeddings.
Cut inference token usage by ~30× and achieve up to 7.62× faster per-token generation (A100) and up to 11.52× higher finetuning throughput.
Show consistent QA improvements across multiple long-document benchmarks and robust needle-in-a-haystack retrieval at ~80% success.
Key Findings
LLoCO raises average QA performance vs base LLaMA2-7B on evaluated long-doc tasks.
Inference token footprint is reduced by about 30× via compression.
Per-token generation gets large speedups on GPU when using compressed embeddings.
Finetuning throughput improves substantially when training on compressed contexts.
LLoCO retrieves short, hard-to-find snippets reliably in needle-in-a-haystack stress tests (~80% success).
Results
Average QA score (selected long-doc tasks)
Compression ratio (tokens): ~30×
Inference speed-up (per-token latency): up to 7.62× (A100)
Finetuning throughput: up to 11.52×
Needle-in-haystack retrieval success: ~80%
Who Should Care
What To Try In 7 Days
Compress a sample collection with an available compressor (AutoCompressor or ICAE) and store embeddings in your vector DB.
Finetune one small LoRA adapter on compressed embeddings for a representative document group and validate QA accuracy.
Serve queries by retrieving compressed embeddings + the matching LoRA adapter, and measure token cost and latency vs your current RAG baseline.
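The three steps above can be prototyped without any serving stack. The sketch below is a minimal, stdlib-only illustration of the retrieval and adapter-selection logic. Everything in it is hypothetical: the `STORE` layout, the `emb://` handles standing in for stored compressed embeddings, the toy 3-dimensional key vectors, and the function names. A real setup would use a vector DB and load actual LoRA weights.

```python
import math

# Hypothetical in-memory store. Each entry has a retrieval key vector,
# an opaque handle to the compressed summary embeddings, and the document
# group whose LoRA adapter was finetuned on those embeddings.
STORE = [
    {"key": [0.9, 0.1, 0.0], "summary": "emb://contracts/001", "group": "contracts"},
    {"key": [0.8, 0.2, 0.1], "summary": "emb://contracts/002", "group": "contracts"},
    {"key": [0.0, 0.1, 0.9], "summary": "emb://papers/001", "group": "papers"},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, k=2):
    """Return the k store entries most similar to the query."""
    ranked = sorted(STORE, key=lambda e: cosine(query_vec, e["key"]), reverse=True)
    return ranked[:k]

def plan_request(query_vec, k=2):
    """Pick the adapter of the top hit and keep only prefixes from that group."""
    hits = retrieve(query_vec, k)
    group = hits[0]["group"]
    return {
        "adapter": group,
        "prefix": [h["summary"] for h in hits if h["group"] == group],
    }

plan = plan_request([1.0, 0.0, 0.0])
print(plan)
```

Keeping only prefixes from the top hit's group is one simple design choice; it avoids pairing an adapter with embeddings it was not finetuned on, at the cost of dropping some retrieved passages.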
Optimization Features
Token Efficiency
- ~30× fewer tokens during inference
Infra Optimization
- Enables long-sequence decoding (up to 128k tokens) without exhausting GPU memory
- Lower per-token latency on common GPUs (A100, A6000)
Model Optimization
- LoRA
System Optimization
- Index compressed embeddings in vector DB and retrieve at runtime
- LoRA
Training Optimization
- LoRA
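As a rough illustration of why LoRA adapters are small enough to keep one per document group, the sketch below counts the parameters LoRA adds to a single weight matrix. The 4096×4096 shape matches a LLaMA2-7B attention projection; the rank of 8 is an assumption chosen for this example and is not taken from the source.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns two low-rank factors: B (d_out x rank) and A (rank x d_in).
    return rank * (d_out + d_in)

full = 4096 * 4096                       # frozen weights in one projection
added = lora_params(4096, 4096, rank=8)  # rank is an assumed value

print(added, f"{added / full:.2%}")
```

With these assumed numbers the adapter adds 65,536 parameters per matrix, about 0.39% of the frozen weights it modifies.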
Inference Optimization
- Prepend short summary embeddings instead of the full context to shrink the KV cache
- Retrieve only compressed embeddings for relevant passages
Reproducibility
Data Urls
- QuALITY
- Qasper
- NarrativeQA
- HotpotQA
- QMSum
- LongBench
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Context encoder (AutoCompressor) is tied to a specific base LLM; a new encoder is needed per model.
- Training a high-quality compressor can be costly (authors note ~15B tokens used).
- Current pipeline uses one LoRA adapter per document group and cannot easily apply multiple adapters at once.
- Falls short on tasks requiring long, free-form generated outputs (e.g., one-page summaries like GovReport).
When Not To Use
- You need a model-agnostic compressor that works unchanged across many LLMs.
- Tasks require long, detailed generated outputs rather than concise Q&A answers.
- Your retrieval returns passages from many unrelated groups and you cannot pick or compose one LoRA adapter.
Failure Modes
- Hallucination if LoRA adapter is not trained for the document distribution or compression loses critical facts.
- Performance drops on out-of-distribution documents and datasets not represented in finetuning.
- Serving errors when retrieved passages belong to multiple groups but only one adapter is applied.
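One way to guard against the mixed-group failure mode is to check whether a single group dominates the retrieved passages before committing to an adapter. The sketch below is an assumption-laden illustration, not part of LLoCO: the function name `choose_adapter`, the `min_share` threshold of 0.6, and the idea of returning `None` to signal a fallback path are all choices made for this example.

```python
from collections import Counter

def choose_adapter(hit_groups, min_share=0.6):
    """Return the dominant document group, or None when no group clearly dominates.

    hit_groups: group label of each retrieved passage, in rank order.
    min_share: assumed minimum fraction of hits the winning group must hold.
    """
    if not hit_groups:
        return None
    group, count = Counter(hit_groups).most_common(1)[0]
    return group if count / len(hit_groups) >= min_share else None

print(choose_adapter(["legal", "legal", "papers"]))
print(choose_adapter(["legal", "papers"]))
```

A `None` result can be routed to whatever fallback the deployment prefers, for example answering with the base model over plain retrieved text, rather than silently applying an adapter that does not match part of the context.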
Core Entities
Models
- LLaMA2-7B-4k
- LLaMA2-7B-32k
- LLaMA2-7B-128k (comparative)
- LongChat-7b-v1.5-32k
- AutoCompressor
- ICAE
Metrics
- Exact Match (EM)
- F1
- ROUGE (geometric mean)
Datasets
- QuALITY
- Qasper
- NarrativeQA
- HotpotQA
- QMSum
- LongBench
Benchmarks
- LongBench
Context Entities
Models
- GPT-4 (used for distilled training data)
- CEPE
- SnapKV

