Compress prompts by turning text into relation graphs, keeping readability and model utility

March 30, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.7

Citation Count

0

Authors

Muhammad Asif Ali, Zhengping Li, Shu Yang, Keyuan Cheng, Yang Cao, Tianhao Huang, Guimin Hu, Weimin Lyu, Lijie Hu, Lu Yu, Di Wang

Links

Abstract / PDF

Why It Matters For Business

Compress prompts into readable information units to cut LLM API cost and latency while often improving downstream accuracy on evaluated tasks.

Summary TLDR

Prompt-SAW converts a prompt's text into a small knowledge graph (entities + relations), selects key graph triplets, and reassembles them into a compressed prompt. Compared to token-level compressors, it preserves grammar and human readability while often improving downstream task accuracy. In experiments on GSM8K-aug (math reasoning), NaturalQuestions (QA) and ShareGPT (dialog), Prompt-SAW reduces prompt length by 33–94% and improves task metrics versus prior compressors on evaluated benchmarks.
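The three steps above (extract triplets, select the key ones, reassemble) can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the `Triplet` class, the word-overlap `overlap_score`, and the `max_words` budget are all hypothetical stand-ins. A real system would obtain triplets from an OpenIE extractor and score them with an embedding encoder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One (subject, relation, object) information unit. Hypothetical helper type."""
    subject: str
    relation: str
    obj: str

    def to_text(self) -> str:
        # Reassemble the unit as a short, grammatical sentence.
        return f"{self.subject} {self.relation} {self.obj}."


def overlap_score(triplet: Triplet, question: str) -> float:
    """Toy relevance score (assumption): share of question words found in the triplet."""
    q_words = set(question.lower().replace("?", "").split())
    t_words = set(triplet.to_text().lower().rstrip(".").split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


def compress(triplets: list[Triplet], question: str, max_words: int) -> str:
    """Keep the most relevant triplets that fit a word budget, then restore source order."""
    ranked = sorted(triplets, key=lambda t: overlap_score(t, question), reverse=True)
    kept, used = [], 0
    for t in ranked:
        n_words = len(t.to_text().split())
        if used + n_words > max_words:
            continue  # skip units that would exceed the budget
        kept.append(t)
        used += n_words
    kept.sort(key=triplets.index)  # original order keeps the prompt readable
    return " ".join(t.to_text() for t in kept)


# Hand-written triplets standing in for extractor output.
triplets = [
    Triplet("Marie Curie", "was born in", "Warsaw"),
    Triplet("Marie Curie", "won", "the Nobel Prize in Physics"),
    Triplet("Warsaw", "is the capital of", "Poland"),
]
question = "Where was Marie Curie born?"
print(compress(triplets, question, max_words=10))
```

Because whole triplets are kept or dropped, every sentence in the output stays grammatical, which is the readability property the summary contrasts with token-level deletion.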

Problem Statement

Long prompts hurt latency, cost, and clarity. Existing token-level compressors remove tokens without regard for syntax or meaning, which can produce unreadable prompts and harm downstream answers. The paper asks: can we compress prompts by extracting and selecting relation-aware information units so that compressed prompts stay readable and keep their utility?

Main Contribution

Prompt-SAW: a graph-based prompt compressor that extracts (subject, relation, object) triplets and selects a subgraph to build a compressed, readable prompt.

GSM8K-aug: an i-shot extension of GSM8K (i∈{1,2,4,8}) to test compression across shot counts.
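To make the i-shot setup concrete, here is a rough sketch of how prompts with i ∈ {1, 2, 4, 8} demonstrations might be assembled. The `build_i_shot_prompt` helper, the `Q:`/`A:` layout, and the toy arithmetic demos are assumptions for illustration; the actual GSM8K-aug format is not specified in this summary.

```python
def build_i_shot_prompt(demos: list[tuple[str, str]], question: str, i: int) -> str:
    """Prepend the first i (question, answer) demos to the target question."""
    if not 0 < i <= len(demos):
        raise ValueError("i must be between 1 and the number of available demos")
    shots = [f"Q: {q}\nA: {a}" for q, a in demos[:i]]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])


# Toy demonstrations (hypothetical); GSM8K items are multi-step word problems.
demos = [(f"What is {n} + {n}?", str(2 * n)) for n in range(1, 9)]

prompts = {i: build_i_shot_prompt(demos, "What is 9 + 9?", i) for i in (1, 2, 4, 8)}
for i, prompt in prompts.items():
    print(i, "shots ->", len(prompt.split()), "words")
```

Prompt length grows with the shot count, so the same compressor can be compared at several input sizes.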

Comprehensive evaluation showing Prompt-SAW often beats token-level baselines in both task-agnostic and task-aware settings while keeping higher fluency.

Key Findings

On GSM8K-aug (task-agnostic, 2-shot), Prompt-SAW improved Exact Match (EM) by a relative 10.1% over the best baseline (73.17 vs ~66.44 EM) while cutting prompt tokens by 34.9%.

Numbers: EM ↑ 10.1%; tokens 612 → 399 (−34.9%)

On NaturalQuestions (task-aware) with GPT3.5-turbo, Prompt-SAW raised Span Accuracy by a relative 39.0% (82.93 vs 59.65) at a target compression rate η*=0.5 while cutting prompt tokens from 524 to 227 (~56.7% reduction).

Numbers: SpanAcc ↑ 39.0%; tokens 524 → 227 (−56.7%)

Compressed prompts from Prompt-SAW score higher on fluency than token-level compressors (FL 6.3 vs 5.74 for LLMLingua on GSM8K-aug).

Numbers: Fluency 6.3 vs 5.74

Under the paper's OpenIE-based graph extraction assumptions, the estimated compute cost drops roughly 5× at a target compression rate η*=0.2.

Numbers: Estimated c ≈ 0.2017·L·c_LLMs → ~5× savings
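The "~5×" figure follows from the coefficient in the cost estimate. The sketch below assumes, as the notation suggests, that an uncompressed prompt of length L costs L·c_LLMs, so the savings factor is the reciprocal of the coefficient. The function name is illustrative.

```python
def savings_factor(cost_coefficient: float) -> float:
    """Ratio of uncompressed cost (L * c_LLMs) to compressed cost (coefficient * L * c_LLMs)."""
    if cost_coefficient <= 0:
        raise ValueError("cost coefficient must be positive")
    return 1.0 / cost_coefficient


factor = savings_factor(0.2017)
print(f"estimated savings: {factor:.2f}x")
```

L and c_LLMs cancel in the ratio, so the estimate does not depend on prompt length or per-token price.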

Results

GSM8K-aug EM (task-agnostic)

Value: 73.17 EM (Prompt-SAW, 2-shot)

Baseline: LLMLingua / GPT4-generation (best baseline ~66.44 EM)

NaturalQuestions SpanAcc (task-aware)

Value: 82.93 Acc (Prompt-SAW, GPT3.5-turbo, η*=0.5)

Baseline: LongLLMLingua 59.65 Acc

ShareGPT ROUGE-1 (dialog)

Value: 49.31 (Prompt-SAW, GPT3.5-turbo, η*=0.5)

Baseline: LongLLMLingua 38.13

Compression (tokens)

Value: 399 tokens (Prompt-SAW, GSM8K-aug 2-shot)

Baseline: 612 tokens original

Compression (tokens, task-aware)

Value: 227 tokens (Prompt-SAW, NaturalQuestions, η*=0.5)

Baseline: 524 tokens original
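The percentage gains quoted under Key Findings are relative improvements, and they can be recomputed from the values above. The helper names here are illustrative. The GSM8K-aug baseline is given only approximately (~66.44), and the rounded token counts yield about 34.8% for the GSM8K-aug reduction, close to the reported 34.9%.

```python
def relative_gain(value: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (value - baseline) / baseline * 100.0


def token_reduction(original: int, compressed: int) -> float:
    """Share of prompt tokens removed, in percent."""
    return (original - compressed) / original * 100.0


results = {
    "GSM8K-aug EM": relative_gain(73.17, 66.44),
    "NaturalQuestions SpanAcc": relative_gain(82.93, 59.65),
    "ShareGPT ROUGE-1": relative_gain(49.31, 38.13),
    "GSM8K-aug tokens removed": token_reduction(612, 399),
    "NaturalQuestions tokens removed": token_reduction(524, 227),
}
for name, pct in results.items():
    print(f"{name}: {pct:.1f}%")
```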

Who Should Care

What To Try In 7 Days

Run Prompt-SAW on a few long prompt templates and compare accuracy and token billing versus your current compressor.

Use OpenIE plus an embedding encoder to build triplets from your documentation or demos, then compress to 30–60% of the original tokens.

Measure fluency and a downstream metric; prefer subgraph selection for task-aware prompts and similarity-threshold pruning for demo-style prompts.
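For demo-style prompts, similarity-threshold pruning can be tried with a few lines of code. The sketch below is one plausible reading of the idea, not the paper's algorithm: it greedily keeps an information unit only if it is not too similar to any unit already kept. A bag-of-words cosine stands in for an embedding encoder, and both the `prune_redundant` helper and the 0.8 threshold are assumptions.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a toy stand-in for an embedding encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant(units: list[str], threshold: float) -> list[str]:
    """Keep a unit only if its similarity to every kept unit is below the threshold."""
    kept: list[str] = []
    for unit in units:
        if all(cosine(bow(unit), bow(k)) < threshold for k in kept):
            kept.append(unit)
    return kept


units = [
    "Tom has 3 apples",
    "Tom has 3 red apples",  # near-duplicate of the first unit
    "Anna buys 2 pears",
]
print(prune_redundant(units, threshold=0.8))
```

Lowering the threshold prunes more aggressively, so it is worth sweeping it while tracking both token count and the downstream metric.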

Optimization Features

Token Efficiency

  • Prompt Compression

Inference Optimization

  • Context Compression
  • Token Budgeting

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Works best when prompt content can be expressed as subject–relation–object triplets.
  • Quality depends on OpenIE/graph extraction; extraction errors propagate to compressed prompt.
  • Extreme compression (very small η*) can drop accuracy because long but relevant structures may be removed.

When Not To Use

  • Prompts that are narrative or resist decomposition into clean triplets.
  • Cases where OpenIE fails due to noisy or informal text.
  • When you need lossless reproduction of original wording rather than concise facts.

Failure Modes

  • OpenIE fails to extract key facts → missing information in compressed prompt.
  • Similarity threshold removes subtly important but long triplets → lower task accuracy.
  • Target LLMs sensitive to context order may degrade on heavily restructured prompts.

Core Entities

Models

  • GPT3.5-turbo
  • LLaMA2-7B-chat
  • GPT-4
  • Phi-3-mini

Metrics

  • Exact Match (EM)
  • Accuracy
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • Fluency (FL)

Datasets

  • GSM8K-aug
  • GSM8K
  • NaturalQuestions
  • ShareGPT

Context Entities

Models

  • LLMLingua
  • LongLLMLingua
  • Selective-Context