Cut prompt cost by up to ~68% by keeping only query-relevant sentences and lightly compressing the rest

Overview

Decision Snapshot: Needs Validation

The method is practical and plug-and-play: embed, rank sentences, compress the rest, and call LLMs; evidence comes from two datasets and concrete token/ROUGE measurements.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

Why It Matters For Business

LeanContext lowers pay-per-use LLM input tokens so small teams can run domain QA faster and cheaper while keeping similar answer quality.

Who Should Care

Product Manager, ML Engineer, Founder, CTO

Summary TLDR

LeanContext reduces the tokens sent to pay-per-use LLMs by keeping a small set of query-relevant sentences intact and compressing the rest. A lightweight Q-learning agent chooses how many sentences (top-k) to keep per query. In tests on ArXiv and BBCNews, LeanContext cut prompt cost by 37%–68% with a small ROUGE-1 drop (~0.014–0.026 absolute). Adding top-k sentences to the output of cheap open-source summarizers recovers or improves QA quality while still saving cost.

Problem Statement

Feeding long domain-specific documents into pay-per-use LLMs is expensive because API cost scales with input tokens. Standard summarizers aim at human-readable summaries and can remove details LLMs need to answer domain questions. The paper asks: how can we reduce prompt tokens for domain QA while keeping answer quality?

Main Contribution

LeanContext: a pipeline that ranks sentences by query relevance, keeps top-k sentences intact, and compresses the remaining text with an open-source summarizer.

Adaptive top-k selection via a small Q-learning agent that picks a reduction threshold per query/context state.
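The two contributions above can be illustrated with a minimal sketch of the rank, keep, and compress steps. This is not the authors' code. The bag-of-words `embed` function is a toy stand-in for a real sentence embedding model, `truncate_compressor` is a hypothetical placeholder for an open-source summarizer, and keeping the sentences in their original order is an assumption.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def split_sentences(text):
    """Naive splitter on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def truncate_compressor(sentence, max_words=6):
    """Hypothetical placeholder for a summarizer: keeps only the first few words."""
    return " ".join(sentence.split()[:max_words])


def lean_context(context, query, k, compress=truncate_compressor):
    """Keep the k most query-relevant sentences intact and compress the others."""
    sentences = split_sentences(context)
    q = embed(query)
    scores = [cosine(embed(s), q) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    # Assumption: sentences stay in their original order in the reduced context.
    return " ".join(s if i in keep else compress(s) for i, s in enumerate(sentences))


context = (
    "The model was trained on 10 GPUs for three days in total. "
    "The weather in the city was mild that week and people enjoyed it. "
    "Results improved by five points over the baseline system overall."
)
reduced = lean_context(context, "How many GPUs was the model trained on?", k=1)
print(reduced)
```

In a real setup, `embed` would call an embedding model and `compress` would call a summarizer; both are passed in as plain functions here so they can be swapped without touching the ranking logic.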

Key Findings

Adaptive LeanContext reduces prompt tokens and saves cost with little accuracy loss

Numbers: ArXiv N=4: prompt tokens 521->321, cost savings 37.29%, ROUGE-1 drop 0.3985->0.3844 (-0.0141)

Practical Use: You can cut prompt cost by ~37% while keeping near-original answer quality on similar document-QA setups.

Evidence Ref: Table 1

Maximum observed cost savings are large on news-like data

Numbers: BBCNews N=8: prompt tokens 724->218, cost savings 67.81%, ROUGE-1 drop 0.5498->0.5233 (-0.0265)

Practical Use: For news-type documents, expect up to ~68% prompt-cost reduction with a small drop in ROUGE-1 on the evaluated sets.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ArXiv (N=4) prompt tokens	Adaptive LeanContext: 321 vs Original: 521	Context (Original)	-200 tokens (~37.29% cost saved)	ArXiv (Table 1)	Table 1, row LeanContext (Adaptive k [RL])	Table 1
ArXiv ROUGE-1	Adaptive LeanContext 0.3844 vs Original 0.3985	Context (Original)	-0.0141 absolute (~1.41% points)	ArXiv (Table 1)	Table 1 ROUGE-1 values	Table 1
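
As a quick check of the ArXiv rows above, the short script below recomputes the relative reductions from the listed values. The helper name `relative_reduction` is illustrative. Note that the raw prompt-token reduction comes out slightly higher than the reported 37.29% cost saving; one possible explanation (an assumption, not stated in this summary) is that the cost figure also covers tokens that are not reduced.

```python
def relative_reduction(original, reduced):
    """Fractional reduction going from `original` to `reduced`."""
    return (original - reduced) / original


# Values taken from the ArXiv (N=4) rows of the results table.
prompt_token_reduction = relative_reduction(521, 321)
rouge1_relative_drop = relative_reduction(0.3985, 0.3844)

print(f"Prompt-token reduction: {prompt_token_reduction:.2%}")  # 38.39%
print(f"Relative ROUGE-1 drop:  {rouge1_relative_drop:.2%}")    # 3.54%
```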

What To Try In 7 Days

Measure current prompt token cost on a small sample of your domain documents.

Implement a simple pipeline: retrieve N chunks, embed sentences, keep top-10% sentences intact and compress the rest with an open-source summarizer.

If cost/quality tradeoff is promising, train a small Q-learning agent on ~20–100 example queries to choose adaptive top-k thresholds.
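
For the first step in the list above, a rough cost baseline can be estimated as sketched below. The four-characters-per-token rule is only a common heuristic for English text, and the price is a hypothetical placeholder; for real numbers, use your provider's tokenizer and current price list.

```python
def approx_tokens(text):
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def prompt_cost(prompts, price_per_1k_tokens):
    """Return (total approximate tokens, estimated cost) for a list of prompts."""
    total = sum(approx_tokens(p) for p in prompts)
    return total, total / 1000 * price_per_1k_tokens


def savings_percent(original_cost, reduced_cost):
    """Percentage saved when moving from original_cost to reduced_cost."""
    return 100 * (original_cost - reduced_cost) / original_cost


# Hypothetical price and placeholder prompts, for illustration only.
PRICE_PER_1K = 0.002
original_prompts = ["a" * 400, "b" * 800]
reduced_prompts = ["a" * 200, "b" * 400]

orig_tokens, orig_cost = prompt_cost(original_prompts, PRICE_PER_1K)
red_tokens, red_cost = prompt_cost(reduced_prompts, PRICE_PER_1K)
print(orig_tokens, red_tokens, round(savings_percent(orig_cost, red_cost), 1))
```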

Agent Features

Planning

Adaptive threshold selection per query (choose top-k ratio)

Tool Use

vector database (ChromaDB)

embedding model (all-MiniLM-L6-v2)

pay-per-use LLM API (gpt-3.5-turbo)

Frameworks

LangChain for query/LLM orchestration

Architectures

Q-learning table-based agent (discrete actions)
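
A table-based agent with discrete actions can be sketched as follows. This is an illustrative toy, not the paper's agent: the action set, the state encoding, and `simulated_reward` are all hypothetical. In practice the reward would come from LLM calls that score answer quality against token savings, which is why training is kept to a small sample. Each query is treated as a one-step episode, so the Q-learning target reduces to the immediate reward.

```python
import random

ACTIONS = [0.1, 0.2, 0.3, 0.4]  # hypothetical top-k ratios to choose from


class TopKAgent:
    """Tabular agent that picks a top-k ratio per discrete state."""

    def __init__(self, n_states, actions=ACTIONS, alpha=0.1, epsilon=0.1, seed=0):
        self.actions = actions
        self.alpha = alpha
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = [[0.0] * len(actions) for _ in range(n_states)]

    def act(self, state, explore=True):
        """Epsilon-greedy action index for the given state."""
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.actions))
        row = self.q[state]
        return max(range(len(row)), key=row.__getitem__)

    def update(self, state, action, reward):
        """One-step episode: there is no next state, so the target is the reward."""
        self.q[state][action] += self.alpha * (reward - self.q[state][action])


def simulated_reward(state, action):
    """Hypothetical stand-in for an LLM-based reward; each state has a best ratio."""
    best_ratio = {0: 0.1, 1: 0.3}[state]
    return 1.0 - 2.0 * abs(ACTIONS[action] - best_ratio)


def train(agent, reward_fn, steps):
    """Sample a state, act, observe the reward, and update the table."""
    for _ in range(steps):
        state = agent.rng.randrange(len(agent.q))
        action = agent.act(state)
        agent.update(state, action, reward_fn(state, action))


agent = TopKAgent(n_states=2, epsilon=0.2)
train(agent, simulated_reward, steps=5000)
print([ACTIONS[agent.act(s, explore=False)] for s in range(2)])
```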

Optimization Features

Token Efficiency

Adaptive top-k selection

80% reduction of less-important sentences via Selective Context

Training Optimization

Train RL on a small curated set to reduce training cost

Inference Optimization

Reduce prompt tokens by selecting top-k sentences and compressing others

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

ArXiv dataset (Li 2023) and BBC News dataset (Li 2023) as used by paper

Risks & Boundaries

Limitations

Evaluation limited to two datasets (ArXiv and BBCNews) generated in March 2023.

Reward computation requires LLM calls during RL training, which can be costly; authors train on a small sample.

When Not To Use

When you can afford to fine-tune a domain model or keep everything on an internal model.

When users need human-readable full-text summaries rather than machine-oriented compressed context.

Failure Modes

The RL agent picks too small a k and removes sentences containing the answer, causing 'No answer' or wrong answers.

Open-source compressor drops factual details needed for correctness even if top-k kept important sentences.

Core Entities

Models

gpt-3.5-turbo (gpt-turbo-3.5)

all-MiniLM-L6-v2

Flan-T5-Base

SBERT (sentence-transformers)

GPT-2 (used for self-information in Selective Context)

Metrics

Avg. total tokens

Avg. prompt tokens

Cost Savings (%)

ROUGE scores

Datasets

ArXiv dataset (March 2023 subset, Li 2023)

BBC News dataset (March 2023 subset, Li 2023)

Benchmarks

ROUGE-1

ROUGE-2

ROUGE-L
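
For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated answer and a reference. The sketch below is a simplified version (lowercased word tokens, no stemming) and will not exactly match an official scorer. This summary does not say whether the reported ROUGE-1 values are precision, recall, or F1, so all three are returned.

```python
import re
from collections import Counter


def rouge1(candidate, reference):
    """Simplified ROUGE-1: returns (precision, recall, f1) on unigram overlap."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1("the cat sat", "the cat sat on the mat")
print(round(p, 3), round(r, 3), round(f, 3))
```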

Context Entities

Models

CQSumDP (ChatGPT-based query-aware summarizer)

Semantic Compression (GPT-based compression)

T5-base summarizer

Selective Context (self-information filtering)
