Cut prompt cost by up to ~68% by keeping only query-relevant sentences and lightly compressing the rest

September 2, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practical and plug-and-play: embed, rank sentences, compress the rest, and call LLMs; evidence comes from two datasets and concrete token/ROUGE measurements.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 60%

Novelty: 60%

Authors

Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

Links

Abstract / PDF / Data

Why It Matters For Business

LeanContext lowers pay-per-use LLM input tokens so small teams can run domain QA faster and cheaper while keeping similar answer quality.

Who Should Care

Summary TLDR

LeanContext reduces the tokens sent to pay-per-use LLMs by keeping a small set of query-relevant sentences intact and compressing the rest. A lightweight Q-learning agent chooses how many sentences (top-k) to keep per query. On the ArXiv and BBCNews test sets, LeanContext cut prompt cost by 37%–68% with a small ROUGE-1 drop (~0.014–0.026 absolute). Adding the top-k sentences to the output of cheap open-source summarizers recovers or improves QA quality while still saving cost.

Problem Statement

Feeding long domain documents into pay-per-use LLMs is expensive because API cost scales with input tokens. Standard summarizers aim at human-readable summaries and can remove details LLMs need to answer domain questions. The paper asks: how can we reduce prompt tokens for domain QA while keeping answer quality?

Main Contribution

LeanContext: a pipeline that ranks sentences by query relevance, keeps top-k sentences intact, and compresses the remaining text with an open-source summarizer.

Adaptive top-k selection via a small Q-learning agent that picks a reduction threshold per query/context state.
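The rank, keep, and compress steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names (`split_sentences`, `embed`, `cosine`, `compress`, `lean_context`) and the `top_k_ratio` parameter are hypothetical. The bag-of-words `embed` is a toy stand-in for a real sentence-embedding model, the truncating `compress` is a placeholder for an open-source summarizer, and keeping sentences in their original order is an assumption.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress(sentence, keep_words=5):
    """Placeholder compressor (assumption): keep only the first few words."""
    return " ".join(sentence.split()[:keep_words])


def lean_context(context, query, top_k_ratio=0.1):
    """Keep the most query-relevant sentences intact and compress the rest."""
    sentences = split_sentences(context)
    if not sentences:
        return ""
    query_vec = embed(query)
    scores = [cosine(embed(s), query_vec) for s in sentences]
    # Keep at least one sentence, otherwise a fraction of the context.
    k = max(1, math.ceil(top_k_ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    # Original sentence order is preserved (assumption).
    return " ".join(s if i in keep else compress(s) for i, s in enumerate(sentences))
```

In a real setup the toy `embed` and `compress` would be swapped for an embedding model and a summarizer, and the reduced context would then be placed in the LLM prompt alongside the query.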

Key Findings

Adaptive LeanContext reduces prompt tokens and saves cost with little accuracy loss

Numbers: ArXiv N=4: prompt tokens 521->321, cost savings 37.29%, ROUGE-1 drop 0.3985->0.3844 (-0.0141)

Practical Use: You can cut prompt cost by ~37% while keeping near-original answer quality on similar document-QA setups.

Evidence Ref: Table 1

Maximum observed cost savings are large on news-like data

Numbers: BBCNews N=8: prompt tokens 724->218, cost savings 67.81%, ROUGE-1 drop 0.5498->0.5233 (-0.0265)

Practical Use: For news-type documents, expect up to ~68% prompt-cost reduction with a small drop in ROUGE-1 on the evaluated sets.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| ArXiv (N=4) prompt tokens | Adaptive LeanContext: 321 vs Original: 521 | Context (Original) | -200 tokens (~37.29% cost saved) | ArXiv (Table 1) | Table 1, row LeanContext (Adaptive k [RL]) | Table 1 |
| ArXiv ROUGE-1 | Adaptive LeanContext 0.3844 vs Original 0.3985 | Context (Original) | -0.0141 absolute (~1.41 percentage points) | ArXiv (Table 1) | Table 1 ROUGE-1 values | Table 1 |
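The ArXiv deltas can be rechecked with a few lines of arithmetic. The helper name `percent_reduction` is hypothetical. Note that the raw prompt-token reduction (about 38.4%) is slightly higher than the reported 37.29% cost saving; one plausible reason, which this summary does not confirm, is that cost also counts completion tokens.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# ArXiv (N=4) values as reported above.
prompt_token_reduction = percent_reduction(521, 321)  # about 38.39
rouge1_delta = 0.3844 - 0.3985                         # about -0.0141
rouge1_relative_drop = percent_reduction(0.3985, 0.3844)  # about 3.54

print(f"prompt tokens reduced by {prompt_token_reduction:.2f}%")
print(f"ROUGE-1 delta {rouge1_delta:+.4f} ({rouge1_relative_drop:.2f}% relative)")
```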

What To Try In 7 Days

Measure current prompt token cost on a small sample of your domain documents.

Implement a simple pipeline: retrieve N chunks, embed sentences, keep top-10% sentences intact and compress the rest with an open-source summarizer.

If cost/quality tradeoff is promising, train a small Q-learning agent on ~20–100 example queries to choose adaptive top-k thresholds.
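The last step can be prototyped with a small table-based Q-learning loop. This is a sketch under stated assumptions, not the paper's agent: the action set `ACTIONS`, the integer state encoding, the hyperparameters, and the names `q_update`, `train`, and `best_ratio` are all hypothetical. Each query is treated as a one-step episode, and `reward_fn` is a caller-supplied function; in practice it would need LLM calls to score answer quality against token savings, which is why a small training sample keeps cost down.

```python
import random

# Hypothetical candidate top-k ratios the agent can choose from.
ACTIONS = [0.1, 0.2, 0.3, 0.4]


def q_update(q_table, state, action, reward, next_state=None, alpha=0.5, gamma=0.9):
    """Tabular Q-learning update; next_state=None marks a terminal step."""
    future = 0.0 if next_state is None else max(q_table[next_state])
    td_target = reward + gamma * future
    q_table[state][action] += alpha * (td_target - q_table[state][action])


def train(examples, reward_fn, n_states, episodes=200, epsilon=0.2, seed=0):
    """Train on (state, example) pairs with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q_table = [[0.0] * len(ACTIONS) for _ in range(n_states)]
    for _ in range(episodes):
        state, example = rng.choice(examples)
        if rng.random() < epsilon:
            action = rng.randrange(len(ACTIONS))  # explore
        else:
            action = max(range(len(ACTIONS)), key=lambda a: q_table[state][a])  # exploit
        reward = reward_fn(example, ACTIONS[action])
        # One-step episode (assumption): no next state to bootstrap from.
        q_update(q_table, state, action, reward)
    return q_table


def best_ratio(q_table, state):
    """Greedy top-k ratio for a given state."""
    return ACTIONS[max(range(len(ACTIONS)), key=lambda a: q_table[state][a])]
```

At inference time, `best_ratio(q_table, state)` would supply the top-k ratio for each incoming query once its state has been computed the same way as during training.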

Agent Features

Planning
Adaptive threshold selection per query (choose top-k ratio)
Tool Use
vector database (ChromaDB), embedding model (all-MiniLM-L6-v2), pay-per-use LLM API (gpt-3.5-turbo)
Frameworks
LangChain for query/LLM orchestration
Architectures
Q-learning table-based agent (discrete actions)

Optimization Features

Token Efficiency
Adaptive top-k selection, 80% reduction of less-important sentences via Selective Context
Training Optimization
Train RL on a small curated set to reduce training cost
Inference Optimization
Reduce prompt tokens by selecting top-k sentences and compressing others

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ArXiv dataset (Li 2023) and BBC News dataset (Li 2023) as used by paper

Risks & Boundaries

Limitations

Evaluation limited to two datasets (ArXiv and BBCNews) generated in March 2023.

Reward computation requires LLM calls during RL training, which can be costly; authors train on a small sample.

When Not To Use

When you can afford to fine-tune a domain model or keep everything on an internal model.

When users need human-readable full-text summaries rather than machine-oriented compressed context.

Failure Modes

The RL agent picks too small a k and removes sentences containing the answer, causing 'No answer' or wrong answers.

The open-source compressor drops factual details needed for correctness even when the top-k sentences are kept intact.

Core Entities

Models

gpt-3.5-turbo (gpt-turbo-3.5), all-MiniLM-L6-v2, Flan-T5-Base, SBERT (sentence-transformers), GPT-2 (used for self-information in Selective Context)

Metrics

Avg. total tokens, Avg. prompt tokens, Cost Savings (%), ROUGE scores

Datasets

ArXiv dataset (March 2023 subset, Li 2023), BBC News dataset (March 2023 subset, Li 2023)

Benchmarks

ROUGE-1, ROUGE-2, ROUGE-L

Context Entities

Models

CQSumDP (ChatGPT-based query-aware summarizer), Semantic Compression (GPT-based compression), T5-base summarizer, Selective Context (self-information filtering)