52 papers found

A better visual tokenizer lets language models match or beat diffusion models on ImageNet and video tasks

0.60
0.70
0.60
21

A better visual tokenizer can make language-model pipelines produce higher-quality images/videos with fewer inference steps and offer a new compressed token format that speeds downstream generation and saves bandwidth.

Key finding

On ImageNet 512×512 class-conditional generation, MLM + MAGVIT-v2 achieved FID 1.91 with guidance versus diffusion baseline VDM++ FID 2.65.

Numbers: FID 1.91 vs 2.65 (512×512); 28% relative improvement
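The 28% figure follows directly from the two FID values above. A minimal check, assuming the usual definition of relative improvement for a lower-is-better metric (the function name is illustrative, not from the paper):

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FID."""
    return (baseline - new) / baseline


# FID values quoted in the entry above (ImageNet 512x512).
fid_gain = relative_improvement(baseline=2.65, new=1.91)
print(f"{fid_gain:.1%}")  # about 28%
```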

ReWOO separates planning from fetching evidence to cut repeating prompt tokens and run smaller models

0.70
0.60
0.80
15

ReWOO cuts API token usage and hosting cost by separating planning from tool calls, so multi-step tool-using pipelines can run cheaper and scale with smaller models.

Key finding

ReWOO reduces token use on HotpotQA by about 5× compared to an observation-dependent ALM (ReAct).

Numbers: ReAct 9795.1 tokens vs ReWOO 1986.2 tokens (HotpotQA)

Reorder table rows and fields to boost LLM prompt-cache reuse and cut latency/costs

0.80
0.60
0.80
6

If you run LLMs over tables in batches, reordering rows and fields can cut inference time and API bills materially by increasing prompt-cache reuse; it is a low-cost software change that often outperforms adding hardware.

Key finding

GGR reduces end-to-end LLM query latency by 1.5–3.4× vs. caching without reordering (Cache Original) on evaluated queries.

Numbers: 1.5–3.4× speedup (Sec 6.2; Fig 3/4)
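To illustrate why ordering matters for prompt-cache reuse, here is a rough sketch. It is not the paper's GGR algorithm: it uses a simple assumed heuristic (fields with the fewest distinct values first, then a lexicographic row sort) and counts shared prefix characters between consecutive prompts as a crude proxy for cache hits. All names and the sample rows are hypothetical.

```python
from os.path import commonprefix


def serialize(row: dict, fields: list) -> str:
    """Render one table row as the variable part of a prompt."""
    return " | ".join(f"{f}={row[f]}" for f in fields)


def reorder(rows: list) -> tuple:
    """Assumed heuristic: low-cardinality fields first, then sort rows."""
    fields = list(rows[0])
    fields.sort(key=lambda f: len({r[f] for r in rows}))
    ordered = sorted(rows, key=lambda r: tuple(str(r[f]) for f in fields))
    return fields, ordered


def shared_prefix_chars(prompts: list) -> int:
    """Total prefix length shared by each prompt with its predecessor."""
    return sum(len(commonprefix([a, b])) for a, b in zip(prompts, prompts[1:]))


rows = [
    {"review": "great battery", "brand": "Acme", "category": "phone"},
    {"review": "too heavy", "brand": "Zeta", "category": "laptop"},
    {"review": "cracked screen", "brand": "Acme", "category": "phone"},
    {"review": "fast shipping", "brand": "Zeta", "category": "laptop"},
]

original = [serialize(r, list(r)) for r in rows]
fields, ordered = reorder(rows)
reordered = [serialize(r, fields) for r in ordered]

print(shared_prefix_chars(original), shared_prefix_chars(reordered))
```

Real prefix caches match on tokens rather than characters, so the character count only indicates the direction of the effect.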

Cheap prompts often match expensive ones: GPT‑3.5 can do unsupervised product entity resolution cost‑efficiently

0.60
0.40
0.70
2

LLMs let you do pairwise ER without labeled training data and with lower engineering effort; using short prompts can cut API costs substantially but you must combine LLMs with blocking to control scale.

Key finding

GPT‑3.5 is viable as an unsupervised ER similarity function on product data.

Numbers: Many prompt patterns achieved F1 ≥ 0.80; examples: WDC single-attr F1=0.93, AG multi-sim F1=0.95
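The need for blocking can be seen in a small sketch: without it, every pair of records would go to the LLM. The blocking key below (first lowercase word of the title) and the sample titles are assumptions for illustration, not the paper's setup.

```python
from collections import defaultdict
from itertools import combinations


def block_key(title: str) -> str:
    """Hypothetical blocking key: first lowercase word of the title."""
    return title.lower().split()[0]


def candidate_pairs(titles: list) -> list:
    """Return index pairs that share a block; only these reach the LLM."""
    blocks = defaultdict(list)
    for i, title in enumerate(titles):
        blocks[block_key(title)].append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


titles = [
    "Sony WH-1000XM4 headphones",
    "sony wh1000xm4 wireless",
    "Apple iPhone 13 128GB",
    "apple iphone 13",
    "Dell XPS 13 laptop",
]

pairs = candidate_pairs(titles)
all_pairs = len(titles) * (len(titles) - 1) // 2
print(f"{len(pairs)} LLM calls instead of {all_pairs}")
```

Each surviving pair would then be passed to the LLM as a match/no-match prompt; a coarser key trades more calls for fewer missed matches.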

MorphPiece: a morpheme-aware tokenizer that improves LM and embedding quality

0.50
0.60
0.40
2

MorphPiece yields better language modeling and embedding quality without changing model architecture, which can improve search, classification, and prediction pipelines but increases token counts and compute.

Key finding

MorphGPT lowers token-level perplexity vs GPT-2 on standard LM benchmarks.

Numbers: PennTreeBank ppl 61.86 -> 38.25 (Morph200)

LazyLLM: compute KV only for important tokens to speed up long-context LLMs

0.70
0.60
0.70
1

LazyLLM cuts the time-to-first-token on long prompts and lowers total token computation without model retraining, reducing latency and inference costs for long-context applications.

Key finding

LazyLLM achieves 2.34× TTFT speedup on multi-document QA with negligible accuracy loss on Llama 2 7B.

Numbers: 2.34× TTFT; score 22.31 vs baseline 22.43 (-0.12)

Teach a single reasoning model to switch between fast (answer-only) and slow (chain-of-thought) modes to save tokens without losing accuracy

0.60
0.62
0.70
1

OThink-R1 cuts costly reasoning tokens at inference while keeping accuracy, lowering latency and per-request compute cost for products that use step-by-step reasoning.

Key finding

LRMs produce many more tokens than non-reasoning LLMs on common QA/math tasks.

Numbers: LRMs generate on average 7.32× more tokens than non-reasoning LLMs (Table 1).

Harvest millisecond GPU idle cycles by slicing work into tokens, layers, and tiny KV checkpoints.

0.70
0.60
0.80
1

You can run batch jobs (benchmarks, analytics) on the same expensive GPUs used for live LLM inference without degrading customer-facing latency. That turns idle capacity into usable throughput and reduces waste from overprovisioning.

Key finding

ConServe reduces online tail latency while co-serving.

Numbers: P99 online latency reduced by up to 2.9× (avg reported in paper)

Speed up vision-language inference by keeping only the attention-heavy tokens per layer.

0.70
0.60
0.80
1

ZipVL cuts compute and memory for large vision-language generation. That lowers cloud GPU costs for long images/videos, reduces time-to-first-token for interactive apps, and increases decoding throughput so more requests fit a GPU.

Key finding

Prefill (attention) latency reduced up to 2.3× on long inputs.

Numbers: 2.3× prefill latency reduction (128K tokens)
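As a rough illustration of keeping only attention-heavy tokens, the sketch below retains the smallest set of tokens whose attention scores cover a target share of the total. The selection rule, the `mass` threshold, and the sample scores are assumptions for illustration and may differ from ZipVL's actual criterion.

```python
def select_important(scores: list, mass: float = 0.9) -> list:
    """Keep the highest-scoring tokens until they cover `mass` of the total."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, covered = [], 0.0
    for i in order:
        kept.append(i)
        covered += scores[i]
        if covered >= mass * total:
            break
    return sorted(kept)  # restore original token order


# Hypothetical per-token attention scores for one layer.
layer_scores = [0.50, 0.02, 0.30, 0.03, 0.15]
kept_tokens = select_important(layer_scores, mass=0.9)
print(kept_tokens)
```

Because the cut-off depends on how concentrated the scores are, the number of kept tokens can vary from layer to layer.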

Match reasoning strategies by compute: token-budgeted evaluation shows simple self-consistency often beats complex methods

0.60
0.50
0.70
1

Compare reasoning methods by token cost, not just accuracy; cheaper self-consistency often gives better accuracy-per-dollar and reduces deployment cost.

Key finding

When token/query budgets are matched, chain-of-thought with self-consistency (CoT SC) often matches or outperforms more complex methods like Multi-Agent Debate and Reflexion.

Numbers: Experiments run up to 20 queries or 10k tokens; SC outperforms MAD/Reflexion across 5 datasets except HotpotQA
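For reference, self-consistency amounts to sampling several chain-of-thought answers and taking a majority vote. A minimal sketch, assuming the final answers have already been extracted as strings; the 500-tokens-per-sample figure and the sample answers are hypothetical, and only the 10k-token budget comes from the entry above.

```python
from collections import Counter


def samples_within_budget(token_budget: int, tokens_per_sample: int) -> int:
    """How many sampled answers fit in a fixed token budget."""
    return token_budget // tokens_per_sample


def self_consistency(answers: list) -> tuple:
    """Majority vote over normalized final answers; returns (answer, vote share)."""
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


max_samples = samples_within_budget(token_budget=10_000, tokens_per_sample=500)
sampled = ["42", "41", "42 ", "42", "17"]  # stand-ins for model outputs
winner, share = self_consistency(sampled)
print(max_samples, winner, share)
```

Matching methods on budget then means giving a debate or reflection pipeline the same token allowance that these samples consume.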

Practical token pruning cuts inference time 20–34% with minimal effect on few-shot intent accuracy

0.80
0.45
0.70
0

You can cut embedding costs and latency by ~20–34% using post-training token pruning without retraining per task, keeping few-shot accuracy competitive, which is useful when serving many small intents in production.

Key finding

Their production system was best in the majority of few-shot settings tested.

Numbers: 24 out of 36 few-shot settings

Store more tokens at lower bit precision to shrink KV cache and often improve long-context accuracy

0.60
0.60
0.70
0

You can cut KV-cache memory and often improve long-context accuracy by storing more tokens at lower precision. This reduces GPU memory cost for long inputs and enables longer effective context without model changes.

Key finding

Keeping more tokens at lower precision often beats keeping fewer tokens at full precision.

Numbers: Example: Llama-3 RULER-8k: 512 tokens@16-bit = 67.5 vs 2048 tokens@4-bit = 82.2 (+14.7)
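The two configurations in this example use the same raw storage per cached value, which is what makes the comparison fair. A quick check, assuming storage scales as tokens × bits and ignoring quantization metadata such as scales and zero-points (an assumption; real 4-bit schemes add some overhead):

```python
def kv_bits(tokens: int, bits_per_value: int) -> int:
    """Raw bits needed per cached channel, ignoring quantization metadata."""
    return tokens * bits_per_value


full_precision = kv_bits(tokens=512, bits_per_value=16)
low_precision = kv_bits(tokens=2048, bits_per_value=4)
score_gain = round(82.2 - 67.5, 1)  # RULER-8k scores quoted above

print(full_precision, low_precision, score_gain)
```

So the 4-bit setting keeps four times as many tokens in the same raw budget.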

Use mined "shortcuts" from past multi-agent runs to cut tokens and speed up code generation

0.40
0.60
0.70
0

Co-Saving can cut token bills and developer compute costs by reusing prior multi-agent transitions, while keeping or improving code quality on similar tasks, so teams can scale automated software generation under a fixed budget.

Key finding

Co-Saving reduces token usage versus ChatDev.

Numbers: 50.85% average reduction in tokens (paper abstract).

Train a cheap controller LLM to route queries to expert LLMs via RL so the system meets different cost budgets while keeping high accuracy.

0.60
0.40
0.75
0

CORL enables predictable cost-vs-accuracy trade-offs from one deployed system. You can run a single LLM controller that adapts to customer budget tiers, saving inference spend at scale while keeping acceptable accuracy.

Key finding

CORL lets one trained controller exceed the best single expert at high budget on evaluated math sets.

Numbers: MATH500: CORL High Pass@1 0.958 vs o3 0.938

Survey reframing LLM reasoning from fixed efficiency to input-aware adaptivity

0.50
0.50
0.60
0

Adaptive reasoning reduces wasted compute on easy cases and directs budget to hard cases, lowering inference cost and improving reliability where it matters. Training-free solutions deliver quick wins; training-based solutions scale control into the model for repeated production use.

Key finding

Many LLMs currently overthink easy problems and fail to extend reasoning on hard problems.

Plan compute at inference time: reusable multi-agent modules + short/long-horizon planning to spend a fixed budget smarter.

0.40
0.70
0.70
0

FutureWeaver helps you spend inference budget where it matters across cooperating agents, raising task success per dollar. It automates reusable multi-agent patterns and avoids leaving budget unused—useful for cost-sensitive production agents that combine search, browsing, and reasoning.

Key finding

FutureWeaver improves accuracy on GAIA with Claude models at low budget.

Numbers: Acc@0.2: FutureWeaver 38.89% vs ReAct 35.80% (+3.09 pp)

GVote: per-request KV-cache compression that auto-selects how much to keep, cutting memory ~2× while keeping accuracy

0.60
0.60
0.70
0

GVote can cut the GPU memory used by KV caches roughly in half without manual tuning. That frees headroom for larger batch sizes, longer contexts, or lower-cost GPUs and reduces engineering time spent tuning budgets per workload.

Key finding

GVote reduces KV-cache usage roughly twofold on evaluated benchmarks while keeping accuracy similar or better.

Numbers: memory reduction reported across eight datasets (avg)

Prune redundant reasoning tokens at inference to boost accuracy and shrink KV cache

0.70
0.50
0.60
0

You can boost reasoning accuracy and cut inference memory by changing only the decoder strategy, making long-form reasoning cheaper and more reliable without retraining.

Key finding

Plug-in pruning raises average accuracy for Qwen2.5-7B from 57.9% to 63.4% on six math benchmarks.

Numbers: 57.9% -> 63.4% average (Table 1)

Prune module-level operations to reallocate tokens and cut MLLM compute by up to 86% with small accuracy loss

0.70
0.70
0.80
0

DOP cuts prefilling compute and real GPU latency substantially while keeping task accuracy near-original, enabling cheaper inference and higher throughput for multimodal deployments.

Key finding

DOP can cut theoretical FLOPs by 86% while incurring ~1% average performance loss on LLaVA-NeXT-7B.

Numbers: 86% theoretical FLOPs reduction; ~1% perf loss

RadioLLM: use LLMs for radio tasks via hybrid prompts and token reprogramming

0.60
0.60
0.50
0

RadioLLM lets you reuse LLM priors for multiple radio tasks, improving classification and denoising while cutting prompt overhead and latency in many benchmark scenarios.

Key finding

RadioLLM beats many baselines on modulation classification.

Numbers: OA: 58.10% (RML16A), 58.35% (RML16B), 68.19% (RML16C)

ExLLM: evolving compact experience + k-offspring LLM optimizer that sets new PMO SOTA

0.70
0.60
0.70
0

ExLLM turns LLMs into a sample-efficient, no-training optimizer that cuts API cost and runtime and generalizes across chemistry, engineering and code tasks, lowering the barrier to rapid design under limited evaluation budgets.

Key finding

ExLLM achieves the top aggregate PMO score reported in the paper.

Numbers: PMO aggregate 19.165 (max 23) vs prior SOTA 17.862

Training-free token pruning that fixes attention bias to cut Video-LLM FLOPs while keeping accuracy

0.60
0.60
0.70
0

AdaTP cuts inference FLOPs by up to ~73% on evaluated Video LLMs without losing task accuracy, lowering compute cost for production video understanding.

Key finding

Attention scores in early layers concentrate at sequence ends (global bias).

Numbers: 86.8% of top-10% attention tokens lie in last 4 of 32 frames (Layer 1)

Swap LLaMA's tokenizer for a Russian Unigram vocab to improve Russian quality and cut training/inference cost

0.70
0.40
0.80
0

Replacing an English-focused tokenizer with a language-specific Unigram vocab can improve non-English accuracy and cut fine-tuning and inference costs, lowering time-to-market and cloud bills for localized LLM products.

Key finding

Unigram tokenization preserves word roots better than BPE.