Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Compress prompts into readable information units to cut LLM API cost and latency while often improving downstream accuracy on evaluated tasks.
Summary TLDR
Prompt-SAW converts a prompt's text into a small knowledge graph (entities + relations), selects key graph triplets, and reassembles them into a compressed prompt. Compared to token-level compressors, it preserves grammar and human readability while often improving downstream task accuracy. In experiments on GSM8K-aug (math reasoning), NaturalQuestions (QA) and ShareGPT (dialog), Prompt-SAW reduces prompt length by 33–94% and improves task metrics versus prior compressors on evaluated benchmarks.
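To make the "graph triplets reassembled into a compressed prompt" idea concrete, here is a minimal sketch of the data shape and the reassembly step. The `Triplet` class, the `reassemble` helper, and the hand-written example triplets are illustrative assumptions, not the paper's code; a real pipeline would get its triplets from a graph extractor and its subset from the selection step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One graph edge: (subject, relation, object). Hypothetical helper type."""
    subject: str
    relation: str
    obj: str

    def to_text(self) -> str:
        # Linearize the edge as a short, grammatical clause.
        return f"{self.subject} {self.relation} {self.obj}."


def reassemble(triplets):
    # Join the selected triplets, in their given order, into a readable prompt.
    return " ".join(t.to_text() for t in triplets)


# Hand-written triplets standing in for the output of a graph extractor.
graph = [
    Triplet("Natalia", "sold clips to", "48 friends in April"),
    Triplet("Natalia", "sold half as many clips in", "May"),
    Triplet("Natalia", "likes", "sunny weather"),
]

# Assume the selection step kept only the first two edges.
selected = graph[:2]
compressed_prompt = reassemble(selected)
print(compressed_prompt)
```

Because each kept unit is a whole clause rather than a scattering of surviving tokens, the output stays grammatical, which is the property the summary contrasts with token-level compressors.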
Problem Statement
Long prompts hurt latency, cost, and clarity. Existing compressors remove tokens without preserving syntax or meaning, producing unreadable prompts and harming downstream answers. The paper asks: can we compress prompts by extracting and selecting relation-aware information units so compressed prompts stay readable and keep utility?
Main Contribution
Prompt-SAW: a graph-based prompt compressor that extracts (subject, relation, object) triplets and selects a subgraph to build a compressed, readable prompt.
GSM8K-aug: an i-shot extension of GSM8K (i∈{1,2,4,8}) to test compression across shot counts.
Comprehensive evaluation showing Prompt-SAW often beats token-level baselines in both task-agnostic and task-aware settings while keeping higher fluency.
Key Findings
On GSM8K-aug (task-agnostic, 2-shot), Prompt-SAW improved Exact Match (EM) over the best baseline by 10.1% while cutting prompt tokens by 34.9%.
On NaturalQuestions (task-aware) with GPT3.5-turbo, Prompt-SAW raised Span Accuracy by 39.0% at a target compression rate η*=0.5 while cutting prompt tokens from 524 to 227 (~56.7% reduction).
Compressed prompts from Prompt-SAW score higher on fluency than token-level compressors (FL 6.3 vs 5.74 for LLMLingua on GSM8K-aug).
Assuming OpenIE-based graph extraction, estimated compute cost drops by roughly 5× at a compression target of η*=0.2.
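The token and cost figures above can be checked with simple arithmetic. The sketch below assumes η* is the target fraction of tokens retained and that cost scales linearly with prompt tokens, with extraction overhead ignored; both function names are illustrative.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    # Fraction of prompt tokens removed by compression.
    return 1.0 - compressed_tokens / original_tokens


def ideal_cost_ratio(target_rate: float) -> float:
    # Assumption: cost is linear in prompt tokens and extraction overhead is
    # ignored, so keeping a fraction `target_rate` of tokens cuts cost by 1/rate.
    return 1.0 / target_rate


# NaturalQuestions example from the findings: 524 tokens compressed to 227.
reduction = token_reduction(524, 227)
print(f"{reduction:.1%}")      # 56.7%

# Compression target of 0.2 gives about a 5x reduction under the linear assumption.
print(ideal_cost_ratio(0.2))
```

Note that the achieved rate in the NaturalQuestions example (227/524 ≈ 0.43) comes in below the stated target of 0.5, so the realized reduction is slightly larger than the target implies.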
Results
GSM8K-aug EM (task-agnostic)
NaturalQuestions SpanAcc (task-aware)
ShareGPT ROUGE-1 (dialog)
Compression (tokens)
Compression (tokens, task-aware)
Who Should Care
What To Try In 7 Days
Run Prompt-SAW on a few long prompt templates and compare accuracy and token billing versus your current compressor.
Use OpenIE plus an embedding encoder to build triplets from your documentation or demos, then compress to 30–60% of original tokens.
Measure fluency and a downstream metric; prefer subgraph selection for task-aware prompts and similarity-threshold pruning for demo-style prompts.
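The two selection styles mentioned above can be prototyped before wiring in real components. This is a rough sketch under stated assumptions: `embed` is a toy bag-of-words stand-in for an embedding encoder, `n_tokens` counts whitespace-separated words rather than model tokens, and the greedy budget fill and redundancy pruning are simplified illustrations, not the paper's exact selection procedure.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding encoder: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def n_tokens(text):
    # Whitespace word count as a rough proxy for model tokens.
    return len(text.split())


def select_task_aware(triplets, question, budget):
    # Rank triplets by similarity to the question, greedily keep those that
    # fit the token budget, then restore the original order.
    q = embed(question)
    ranked = sorted(triplets, key=lambda t: cosine(embed(t), q), reverse=True)
    kept, used = [], 0
    for t in ranked:
        if used + n_tokens(t) <= budget:
            kept.append(t)
            used += n_tokens(t)
    return [t for t in triplets if t in kept]


def prune_by_similarity(triplets, threshold):
    # Task-agnostic: drop a triplet that is too similar to one already kept.
    kept = []
    for t in triplets:
        if all(cosine(embed(t), embed(k)) < threshold for k in kept):
            kept.append(t)
    return kept


# Linearized triplets written by hand for illustration.
triplets = [
    "Paris is the capital of France.",
    "France is a country in Europe.",
    "Paris is the capital city of France.",
    "The Eiffel Tower is located in Paris.",
]

aware = select_task_aware(triplets, "What is the capital of France?", budget=12)
agnostic = prune_by_similarity(triplets, threshold=0.8)
print(aware)
print(agnostic)
```

In a real trial, swap `embed` for your encoder and `n_tokens` for your model's tokenizer, then tune `budget` or `threshold` until the compressed prompt lands in the token range you are targeting.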
Optimization Features
Token Efficiency
- Prompt Compression
Inference Optimization
- Context Compression
- Token Budgeting
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Works best when prompt content can be expressed as subject–relation–object triplets.
- Quality depends on OpenIE/graph extraction; extraction errors propagate to compressed prompt.
- Extreme compression (very small η*) can drop accuracy because long but relevant structures may be removed.
When Not To Use
- Prompts that are narrative or resist decomposition into clean triplets.
- Cases where OpenIE fails due to noisy or informal text.
- When you need lossless reproduction of original wording rather than concise facts.
Failure Modes
- OpenIE fails to extract key facts → missing information in compressed prompt.
- Similarity threshold removes subtly important but long triplets → lower task accuracy.
- Target LLMs sensitive to context order may degrade on heavily restructured prompts.
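One cheap guard against the missing-information failure mode is to check that key facts survive compression. The sketch below is a hypothetical check, not something the paper describes: it flags numeric literals that appear in the original prompt but not in the compressed one, which matters most for math-style prompts.

```python
import re

# Matches integers and simple decimals such as "12" or "4.50".
NUMBER = r"\d+(?:\.\d+)?"


def missing_numbers(original: str, compressed: str) -> set:
    # Numbers present in the original prompt but absent after compression.
    return set(re.findall(NUMBER, original)) - set(re.findall(NUMBER, compressed))


original = "Tom buys 3 boxes with 12 pens each and pays 4.50 dollars per box."
compressed = "Tom buys 3 boxes. Each box costs 4.50 dollars."

lost = missing_numbers(original, compressed)
if lost:
    print("Compression dropped:", sorted(lost))
```

The same pattern extends to named entities or domain keywords; a non-empty result is a signal to relax the compression target or inspect the extracted triplets.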
Core Entities
Models
- GPT3.5-turbo
- LLaMA2-7B-chat
- GPT-4
- Phi-3-mini
Metrics
- Exact Match (EM)
- Accuracy
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Fluency (FL)
Datasets
- GSM8K-aug
- GSM8K
- NaturalQuestions
- ShareGPT
Context Entities
Models
- LLMLingua
- LongLLMLingua
- Selective-Context

