Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
2
Why It Matters For Business
LeanContext lowers pay-per-use LLM input tokens so small teams can run domain QA faster and cheaper while keeping similar answer quality.
Summary TLDR
LeanContext reduces the tokens sent to pay-per-use LLMs by keeping a small set of query-relevant sentences intact and compressing the rest. A lightweight Q-learning agent chooses how many sentences (top-k) to keep per query. In tests on ArXiv and BBCNews, LeanContext cut prompt cost by 37%–68% with a small ROUGE-1 drop (~0.014–0.026 absolute). Adding top-k sentences to the output of cheap open-source summarizers recovers or improves QA quality while still saving cost.
Problem Statement
Feeding long domain-specific documents into pay-per-use LLMs is expensive because API cost scales with input tokens. Standard summarizers aim at human-readable summaries and can remove details that LLMs need to answer domain questions. The paper asks: how can we reduce prompt tokens for domain QA while keeping answer quality?
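To make the cost argument concrete, here is a minimal sketch of the arithmetic, assuming input cost is linear in prompt tokens. The function names and the price constant are hypothetical and chosen for illustration; they are not from the paper, and real provider pricing should be checked separately.

```python
def estimate_prompt_cost(prompt_tokens: int, price_per_1k_tokens: float) -> float:
    """Input-side cost of one request, in the same currency as the price."""
    return prompt_tokens / 1000 * price_per_1k_tokens


def cost_savings_percent(original_tokens: int, reduced_tokens: int) -> float:
    """Percentage of prompt cost saved, assuming cost is linear in prompt tokens."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100.0 * (original_tokens - reduced_tokens) / original_tokens


if __name__ == "__main__":
    # Hypothetical numbers for illustration only (not results from the paper).
    PRICE_PER_1K = 0.002  # assumed price per 1,000 input tokens
    print(estimate_prompt_cost(2000, PRICE_PER_1K))
    print(cost_savings_percent(2000, 1000))
```

Because the relationship is linear, any percentage reduction in prompt tokens translates into the same percentage reduction in input cost.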
Main Contribution
LeanContext: a pipeline that ranks sentences by query relevance, keeps top-k sentences intact, and compresses the remaining text with an open-source summarizer.
Adaptive top-k selection via a small Q-learning agent that picks a reduction threshold per query/context state.
Empirical evaluation on ArXiv and BBCNews showing major token/cost savings with small ROUGE-1 drops, and benefits when combined with existing open-source summarizers.
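The rank, keep, and compress steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: a bag-of-words cosine similarity stands in for the sentence embeddings, `truncate_compressor` is a hypothetical placeholder for an open-source summarizer, and preserving the original sentence order is a choice made for this sketch.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def bow_vector(text: str) -> Counter:
    """Bag-of-words counts; a stand-in for a real sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def truncate_compressor(sentence: str, max_words: int = 5) -> str:
    """Placeholder compressor: keeps the first few words (NOT a real summarizer)."""
    return " ".join(sentence.split()[:max_words])


def lean_context(
    context: str,
    query: str,
    top_ratio: float,
    compress: Callable[[str], str] = truncate_compressor,
) -> str:
    """Keep the top-k query-relevant sentences intact and compress the rest."""
    sentences = split_sentences(context)
    if not sentences:
        return ""
    query_vec = bow_vector(query)
    scores = [cosine(bow_vector(s), query_vec) for s in sentences]
    k = max(1, math.ceil(top_ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    # Emit sentences in their original order; only non-top-k ones are compressed.
    return " ".join(s if i in keep else compress(s) for i, s in enumerate(sentences))
```

In practice, `bow_vector` and `cosine` would be swapped for an embedding model, and `compress` for a summarizer call; the control flow stays the same.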
Key Findings
Adaptive LeanContext reduces prompt tokens and saves cost with little accuracy loss
Maximum observed cost savings are large on news-like data
Cascading top-k sentences with weak summarizers substantially improves their QA quality
Results
ArXiv (N=4) prompt tokens
ArXiv ROUGE-1
BBCNews (N=8) prompt tokens
BBCNews ROUGE-1
T5-base + LeanContext (ArXiv)
Who Should Care
What To Try In 7 Days
Measure current prompt token cost on a small sample of your domain documents.
Implement a simple pipeline: retrieve N chunks, embed sentences, keep the top 10% of sentences intact, and compress the rest with an open-source summarizer.
If cost/quality tradeoff is promising, train a small Q-learning agent on ~20–100 example queries to choose adaptive top-k thresholds.
Agent Features
Planning
- Adaptive threshold selection per query (choose top-k ratio)
Tool Use
- vector database (ChromaDB)
- embedding model (all-MiniLM-L6-v2)
- pay-per-use LLM API (gpt-3.5-turbo)
Frameworks
- LangChain for query/LLM orchestration
Architectures
- Q-learning table-based agent (discrete actions)
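A table-based Q-learning agent with discrete actions might look like the sketch below. Several things here are assumptions for illustration: the state is an arbitrary hashable label (for example a context-length bucket), each episode is a single step so the update has no next-state term, and `toy_reward` is a hypothetical stand-in for the real reward, which in the paper's setting requires LLM calls.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence


class TopKQAgent:
    """Tabular Q-learning over a discrete set of top-k ratios."""

    def __init__(self, actions: Sequence[float], alpha: float = 0.5,
                 epsilon: float = 0.2, seed: int = 0) -> None:
        self.actions: List[float] = list(actions)
        self.alpha = alpha      # learning rate
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)
        self.q: Dict[Hashable, List[float]] = {}

    def _row(self, state: Hashable) -> List[float]:
        return self.q.setdefault(state, [0.0] * len(self.actions))

    def select(self, state: Hashable, explore: bool = True) -> int:
        """Epsilon-greedy choice; returns an index into self.actions."""
        row = self._row(state)
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.actions))
        return max(range(len(row)), key=row.__getitem__)

    def update(self, state: Hashable, action_idx: int, reward: float) -> None:
        """One-step update: the episode ends after a single action."""
        row = self._row(state)
        row[action_idx] += self.alpha * (reward - row[action_idx])


def train(agent: TopKQAgent, states: Sequence[Hashable],
          reward_fn: Callable[[Hashable, float], float], steps: int) -> None:
    """Sample a state, act, observe the reward, and update the table."""
    for _ in range(steps):
        state = agent.rng.choice(list(states))
        idx = agent.select(state)
        agent.update(state, idx, reward_fn(state, agent.actions[idx]))


def toy_reward(state: Hashable, ratio: float) -> float:
    """Hypothetical reward that peaks at ratio 0.2; a real one would score
    answer quality against token savings."""
    return 1.0 - abs(ratio - 0.2)


if __name__ == "__main__":
    agent = TopKQAgent(actions=[0.1, 0.2, 0.3, 0.4])
    train(agent, states=["short", "long"], reward_fn=toy_reward, steps=2000)
    for s in ("short", "long"):
        print(s, agent.actions[agent.select(s, explore=False)])
```

Since each real reward evaluation costs an LLM call, the number of training steps would be far smaller in practice than in this toy loop.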
Optimization Features
Token Efficiency
- Adaptive top-k selection
- 80% reduction of less-important sentences via Selective Context
Training Optimization
- Train RL on a small curated set to reduce training cost
Inference Optimization
- Reduce prompt tokens by selecting top-k sentences and compressing others
Reproducibility
Data Urls
- ArXiv dataset (Li 2023) and BBC News dataset (Li 2023), as used by the paper
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluation limited to two datasets (ArXiv and BBCNews) generated in March 2023.
- Reward computation requires LLM calls during RL training, which can be costly; authors train on a small sample.
- ROUGE-based QA quality may miss factual or reasoning errors that humans would catch.
- Method relies on embedding quality and chosen chunk size; performance may vary with other embedding models or domains.
When Not To Use
- When you can afford to fine-tune a domain model or keep everything on an internal model.
- When users need human-readable full-text summaries rather than machine-oriented compressed context.
- When critical facts are scattered and require long cross-sentence context that compression might remove.
Failure Modes
- RL picks too small k and removes sentences containing the answer, causing 'No answer' or wrong answers.
- Open-source compressor drops factual details needed for correctness even if top-k kept important sentences.
- Different chunking or poor embeddings lead to mis-ranked sentences and lower QA quality.
Core Entities
Models
- gpt-3.5-turbo (gpt-turbo-3.5)
- all-MiniLM-L6-v2
- Flan-T5-Base
- SBERT (sentence-transformers)
- GPT-2 (used for self-information in Selective Context)
Metrics
- Avg. total tokens
- Avg. prompt tokens
- Cost Savings (%)
- ROUGE scores
Datasets
- ArXiv dataset (March 2023 subset, Li 2023)
- BBC News dataset (March 2023 subset, Li 2023)
Benchmarks
- ROUGE-1
- ROUGE-2
- ROUGE-L
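For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a candidate answer and a reference. The sketch below computes a simplified F1 variant with plain lowercase tokenization; whether the paper reports F1, recall, or precision, and which tokenizer or stemming it uses, is not stated here, so treat those choices as assumptions.

```python
import re
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Comparing this score for answers produced from the full context and from the reduced context gives the kind of absolute ROUGE-1 drop quoted in the summary.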
Context Entities
Models
- CQSumDP (ChatGPT-based query-aware summarizer)
- Semantic Compression (GPT-based compression)
- T5-base summarizer
- Selective Context (self-information filtering)

