Cut prompt cost by up to ~68% by keeping only query-relevant sentences and lightly compressing the rest

September 2, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

2

Authors

Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

Links

Abstract / PDF

Why It Matters For Business

LeanContext lowers pay-per-use LLM input tokens so small teams can run domain QA faster and cheaper while keeping similar answer quality.

Summary TLDR

LeanContext reduces the tokens sent to pay-per-use LLMs by keeping a small set of query-relevant sentences intact and compressing the rest. A lightweight Q-learning agent chooses how many sentences (top-k) to keep per query. On the ArXiv and BBCNews test sets, LeanContext cut prompt cost by 37%–68% with a small ROUGE-1 drop (about 0.014–0.027 absolute). Adding top-k sentences to the output of cheap open-source summarizers recovers or improves QA quality while still saving cost.

Problem Statement

Feeding long domain documents into pay-per-use LLMs is expensive because API cost scales with input tokens. Standard summarizers aim at human-readable summaries and can remove details that LLMs need to answer domain questions. The paper asks: how can we reduce prompt tokens for domain QA while keeping answer quality?

Main Contribution

LeanContext: a pipeline that ranks sentences by query relevance, keeps top-k sentences intact, and compresses the remaining text with an open-source summarizer.

Adaptive top-k selection via a small Q-learning agent that picks a reduction threshold per query/context state.

Empirical evaluation on ArXiv and BBCNews showing major token/cost savings with small ROUGE-1 drops, and benefits when combined with existing open-source summarizers.
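The pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `bow_cosine` (a bag-of-words cosine score) and `truncate_compressor` (keeps the first half of each sentence) are hypothetical stand-ins for the embedding model and the open-source summarizer, and compressing sentence by sentence while preserving the original order is an assumption of this sketch.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def _counts(text: str) -> Counter:
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))


def bow_cosine(query: str, sentence: str) -> float:
    """Toy relevance score (assumption): cosine similarity of word counts."""
    q, s = _counts(query), _counts(sentence)
    dot = sum(q[w] * s[w] for w in q)
    norm = math.sqrt(sum(c * c for c in q.values())) * math.sqrt(sum(c * c for c in s.values()))
    return dot / norm if norm else 0.0


def truncate_compressor(sentence: str, keep_ratio: float = 0.5) -> str:
    """Placeholder compressor (assumption): keep the first fraction of words."""
    words = sentence.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])


def lean_context(
    query: str,
    context: str,
    top_k_ratio: float = 0.1,
    score: Callable[[str, str], float] = bow_cosine,
    compress: Callable[[str], str] = truncate_compressor,
) -> str:
    """Keep the top-k query-relevant sentences intact and compress the rest."""
    sentences = split_sentences(context)
    if not sentences:
        return ""
    k = max(1, math.ceil(top_k_ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: score(query, sentences[i]), reverse=True)
    keep = set(ranked[:k])
    # Emit sentences in their original order so the reduced context stays readable.
    return " ".join(s if i in keep else compress(s) for i, s in enumerate(sentences))
```

In a real setup, `score` would likely be replaced by cosine similarity over sentence embeddings (the summary lists all-MiniLM-L6-v2) and `compress` by an open-source summarizer or Selective Context.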

Key Findings

Adaptive LeanContext reduces prompt tokens and saves cost with little accuracy loss

Numbers: ArXiv N=4: prompt tokens 521->321, cost savings 37.29%, ROUGE-1 drop 0.3985->0.3844 (-0.0141)

Maximum observed cost savings are large on news-like data

Numbers: BBCNews N=8: prompt tokens 724->218, cost savings 67.81%, ROUGE-1 drop 0.5498->0.5233 (-0.0265)

Top-k sentences boost weak summarizers a lot when cascaded

Numbers: ArXiv: T5-base ROUGE-1 0.1614 -> T5+LeanContext 0.4073 (absolute +0.2459)

Results

ArXiv (N=4) prompt tokens

Value: Adaptive LeanContext: 321 vs Original: 521

Baseline: Context (Original)

ArXiv ROUGE-1

Value: Adaptive LeanContext 0.3844 vs Original 0.3985

Baseline: Context (Original)

BBCNews (N=8) prompt tokens

Value: Adaptive LeanContext: 218 vs Original: 724

Baseline: Context (Original)

BBCNews ROUGE-1

Value: Adaptive LeanContext 0.5233 vs Original 0.5498

Baseline: Context (Original)

T5-base + LeanContext (ArXiv)

Value: ROUGE-1 0.4073 vs T5 baseline 0.1614

Baseline: T5-base
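As a quick sanity check on the figures above, the prompt-token reduction can be recomputed from the reported averages. The helper name `percent_reduction` is introduced here for illustration. The results come out at roughly 38.4% (ArXiv) and 69.9% (BBCNews), slightly above the reported cost savings of 37.29% and 67.81%. One plausible explanation, which this summary does not confirm, is that cost is computed over total tokens (prompt plus completion) rather than prompt tokens alone.

```python
def percent_reduction(original: float, reduced: float) -> float:
    """Relative reduction from `original` to `reduced`, in percent."""
    return 100.0 * (original - reduced) / original


# Average prompt tokens reported in the results above.
arxiv_token_cut = percent_reduction(521, 321)
bbc_token_cut = percent_reduction(724, 218)

# Relative ROUGE-1 drop, from the reported absolute scores.
arxiv_rouge_drop = percent_reduction(0.3985, 0.3844)
bbc_rouge_drop = percent_reduction(0.5498, 0.5233)

print(f"ArXiv:   tokens -{arxiv_token_cut:.2f}%, ROUGE-1 -{arxiv_rouge_drop:.2f}%")
print(f"BBCNews: tokens -{bbc_token_cut:.2f}%, ROUGE-1 -{bbc_rouge_drop:.2f}%")
```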

Who Should Care

What To Try In 7 Days

Measure current prompt token cost on a small sample of your domain documents.

Implement a simple pipeline: retrieve N chunks, embed sentences, keep top-10% sentences intact and compress the rest with an open-source summarizer.

If the cost/quality tradeoff is promising, train a small Q-learning agent on ~20–100 example queries to choose adaptive top-k thresholds.
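For the first step above, a baseline cost estimate might look like the sketch below. Both `rough_token_count` (a crude whitespace word count) and the `price_per_1k_tokens` parameter are assumptions for illustration; for real numbers, pass in your provider's tokenizer and its current input-token price.

```python
from typing import Callable, Iterable, Tuple


def rough_token_count(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def prompt_cost(
    prompts: Iterable[str],
    price_per_1k_tokens: float,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Tuple[int, float]:
    """Return (total prompt tokens, estimated cost) for a sample of prompts."""
    total = sum(count_tokens(p) for p in prompts)
    return total, total / 1000.0 * price_per_1k_tokens


def savings_percent(cost_before: float, cost_after: float) -> float:
    """Cost saved by the reduced prompts, in percent of the original cost."""
    return 100.0 * (cost_before - cost_after) / cost_before if cost_before else 0.0
```

Running `prompt_cost` once on the original prompts and once on the reduced prompts, then passing both costs to `savings_percent`, gives a number comparable to the Cost Savings (%) metric.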

Agent Features

Planning

  • Adaptive threshold selection per query (choose top-k ratio)

Tool Use

  • vector database (ChromaDB)
  • embedding model (all-MiniLM-L6-v2)
  • pay-per-use LLM API (gpt-3.5-turbo)

Frameworks

  • LangChain for query/LLM orchestration

Architectures

  • Q-learning table-based agent (discrete actions)
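A table-based Q-learning agent with discrete actions can be sketched as follows. This is a generic tabular Q-learning update with epsilon-greedy exploration, not the authors' code. The class name `TopKAgent`, the state encoding (any hashable label), the hyperparameter defaults, and the choice of top-k ratios as actions are all assumptions for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence


class TopKAgent:
    """Tabular Q-learning over a discrete set of top-k ratios (hypothetical)."""

    def __init__(self, actions: Sequence[float], alpha: float = 0.1,
                 gamma: float = 0.9, epsilon: float = 0.1, seed: int = 0):
        self.actions = list(actions)      # e.g. candidate top-k ratios
        self.alpha = alpha                # learning rate
        self.gamma = gamma                # discount factor
        self.epsilon = epsilon            # exploration probability
        self.rng = random.Random(seed)
        self.q: Dict[Hashable, List[float]] = defaultdict(
            lambda: [0.0] * len(self.actions)
        )

    def choose(self, state: Hashable, explore: bool = True) -> int:
        """Return an action index: random with prob. epsilon, else greedy."""
        if explore and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.actions))
        values = self.q[state]
        return max(range(len(values)), key=values.__getitem__)

    def update(self, state: Hashable, action: int, reward: float,
               next_state: Optional[Hashable] = None) -> None:
        """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        future = max(self.q[next_state]) if next_state is not None else 0.0
        target = reward + self.gamma * future
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

A training loop would call `choose`, build the reduced context with the selected ratio, query the LLM, and pass a reward that trades answer quality against tokens used to `update`. The exact reward definition is not given in this summary, so any concrete formula would be an assumption.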

Optimization Features

Token Efficiency

  • Adaptive top-k selection
  • 80% reduction of less-important sentences via Selective Context

Training Optimization

  • Train RL on a small curated set to reduce training cost

Inference Optimization

  • Reduce prompt tokens by selecting top-k sentences and compressing others

Reproducibility

Data Urls

  • ArXiv dataset (Li 2023) and BBC News dataset (Li 2023), as used by the paper

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluation limited to two datasets (ArXiv and BBCNews) generated in March 2023.
  • Reward computation requires LLM calls during RL training, which can be costly; authors train on a small sample.
  • ROUGE-based QA quality may miss factual or reasoning errors that humans would catch.
  • Method relies on embedding quality and chosen chunk size; performance may vary with other embedding models or domains.
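To make the ROUGE limitation above concrete, here is a minimal ROUGE-1 F1 sketch based on clipped unigram overlap. It is a simplified illustration: standard ROUGE packages may add stemming or other preprocessing, and this summary does not state which ROUGE variant (precision, recall, or F1) the paper reports.

```python
import re
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercased word tokens; no stemming (a simplification)."""
    return re.findall(r"\w+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref, cand = Counter(_tokens(reference)), Counter(_tokens(candidate))
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Two answers that state opposite facts still share two of three unigrams.
score = rouge1_f1("revenue rose 5%", "revenue fell 5%")
```

Here `score` is about 0.67 even though the candidate contradicts the reference, which is the kind of factual error a unigram-overlap metric cannot see.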

When Not To Use

  • When you can afford to fine-tune a domain model or keep everything on an internal model.
  • When users need human-readable full-text summaries rather than machine-oriented compressed context.
  • When critical facts are scattered and require long cross-sentence context that compression might remove.

Failure Modes

  • The RL agent picks a k that is too small and removes sentences containing the answer, causing 'No answer' or wrong answers.
  • Open-source compressor drops factual details needed for correctness even if top-k kept important sentences.
  • Different chunking or poor embeddings lead to mis-ranked sentences and lower QA quality.

Core Entities

Models

  • gpt-3.5-turbo (gpt-turbo-3.5)
  • all-MiniLM-L6-v2
  • Flan-T5-Base
  • SBERT (sentence-transformers)
  • GPT-2 (used for self-information in Selective Context)

Metrics

  • Avg. total tokens
  • Avg. prompt tokens
  • Cost Savings (%)
  • ROUGE scores

Datasets

  • ArXiv dataset (March 2023 subset, Li 2023)
  • BBC News dataset (March 2023 subset, Li 2023)

Benchmarks

  • ROUGE-1
  • ROUGE-2
  • ROUGE-L

Context Entities

Models

  • CQSumDP (ChatGPT-based query-aware summarizer)
  • Semantic Compression (GPT-based compression)
  • T5-base summarizer
  • Selective Context (self-information filtering)