29 papers found

Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

0.30
0.80
0.70
40

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Key finding

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K tokens, 16-bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB
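These cache sizes can be reproduced with the usual KV-cache size formula. The sketch below is a minimal illustration, not taken from the paper. It assumes the publicly documented configurations, which this listing does not state: Llama-2 7B with 32 attention layers and 32 KV heads, Mixtral 8x7B with 32 attention layers and 8 KV heads, Jamba with 4 attention layers and 8 KV heads, and a head dimension of 128 throughout. The function name kv_cache_gib is hypothetical.

```python
def kv_cache_gib(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV-cache size in GiB: one key and one value vector per token, head, and attention layer."""
    total_bytes = 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30


SEQ_LEN = 256 * 1024  # 256K tokens

# Assumed configurations (not stated in this listing).
configs = {
    "Llama-2 7B": dict(attn_layers=32, kv_heads=32, head_dim=128),
    "Mixtral 8x7B": dict(attn_layers=32, kv_heads=8, head_dim=128),
    "Jamba": dict(attn_layers=4, kv_heads=8, head_dim=128),
}

sizes = {name: kv_cache_gib(seq_len=SEQ_LEN, **cfg) for name, cfg in configs.items()}
for name, gib in sizes.items():
    print(f"{name}: {gib:.0f} GiB")
```

Under these assumptions the saving comes from two factors: fewer KV heads (grouped-query attention, a 4x reduction from Llama-2 to Mixtral) and fewer attention layers (a further 8x reduction for Jamba, since most of its layers are Mamba layers that keep no KV cache).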

WebAgent: combine an HTML-specialist LLM and a code LLM to plan, summarize long pages, and act by generating Python for real websites

0.60
0.70
0.60
16

WebAgent shows a practical path to robust web automation: use a small specialist model to understand long HTML and a capable code-generating LLM to act. That reduces brittle failures on real sites and drastically raises task success in human-supervised runs.

Key finding

Modular WebAgent dramatically improves real-site success rates.

Numbers: Success: real-estate 65% vs 10%; social-media 70% vs 20%; map 80% vs 10%

Monarch Mixer: replace attention and MLPs with sub-quadratic GEMM-friendly layers to speed long-context models

0.50
0.70
0.70
14

If you run models with long contexts or want lower parameter cost, M2 can cut compute or model size and improve throughput on many GPUs while keeping accuracy; expect implementation and kernel work before production parity on all hardware.

Key finding

M2-BERT matches BERT-base GLUE while cutting parameters.

Numbers: GLUE 79.9 vs 79.6; −27% params (M2 80M vs BERT 110M)

Hyper Attention for efficient, long multi-image and long-video understanding

0.60
0.62
0.60
6

mPLUG-Owl3 shows you can run an 8B multimodal model that is both accurate on many image/video tasks and more efficient on long visual inputs, which is useful for product features that need long-video or multi-image understanding.

Key finding

mPLUG-Owl3 achieves state-of-the-art among 8B models on a wide benchmark suite.

Numbers: SOTA on 14 of 20 evaluated benchmarks

XGen-7B: an open 7B LLM trained up to 8K context (1.5T tokens) with instruction-tuned releases

0.70
0.50
0.70
4

XGen-7B gives teams a practical, open 7B model that handles long documents (up to 8K tokens) and competitive instruction-following, lowering cost versus much larger closed models while keeping good accuracy.

Key finding

Stage-wise training yields an 8K-capable model that uses long context.

Numbers: 800B@2K + 400B@4K + 300B@8K = 1.5T tokens
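The stage-wise budget above can be tallied with a short sketch. It assumes that 2K, 4K, and 8K mean 2,048, 4,096, and 8,192 tokens; the variable names are illustrative.

```python
# (tokens in billions, sequence length) per training stage, as listed above
stages = [(800, 2048), (400, 4096), (300, 8192)]

total_b = sum(tokens for tokens, _ in stages)
for tokens, seq_len in stages:
    share = tokens / total_b
    print(f"{seq_len:>5}-token stage: {tokens}B tokens ({share:.0%} of budget)")
print(f"total: {total_b / 1000:.1f}T tokens")
```

Just over half of the budget is spent at the shortest length, with the longer and more expensive sequences reserved for the later stages.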

EYEGLAXS: fine-tune LLMs with LoRA and FlashAttention to extract summaries from long scientific papers

0.60
0.55
0.35
2

You can get reliable, extractive summaries from large LLMs with modest adapter tuning (LoRA) and modern attention tricks, but expect much higher compute costs for long contexts.

Key finding

LoRA fine-tuning substantially improves extractive performance vs frozen LLMs.

Numbers: PubMed 4K ChatGLM2: R1 42.79 -> 49.96 (+7.17)

LaRA: when to use retrieval vs feeding the full long context

0.70
0.60
0.60
1

Choose between retrieval-augmented generation (RAG) and feeding the full long context (LC) based on model size, document length, and task. Choosing well reduces cost and error: RAG protects smaller models from long-context failure and reduces hallucinations, while LC is better for synthesis and reasoning on strong long-context models.

Key finding

No universal winner: the best choice depends on model size, context length, task type, and the characteristics of the retrieved chunks.

A practical recipe (data + training + benchmark) to finetune LLMs to read and follow instructions on 8k–64k+ contexts

0.60
0.60
0.50
1

If you need models to read and act on long documents (reports, codebases, books), adding a few thousand diverse long instruction examples and using packing + loss weighting cuts training time and materially improves task performance without hurting short-context skills.

Key finding

More long instruction data materially improves long-context instruction performance.

Numbers: LongBench-Chat: 3.73 (0k) → 6.21 (10k) average score
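The summary above mentions packing plus loss weighting. The sketch below illustrates the general idea only and is not the paper's implementation. It uses greedy first-fit-decreasing packing into fixed-length packs, and a per-token weight of 1 / (number of sequences x sequence length) so that each sequence contributes equally to the batch loss regardless of length. The function names, the example lengths, and the 65,536-token pack size are hypothetical.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing of sequence indices into packs of at most max_len tokens."""
    packs = []  # each pack: [remaining_capacity, [sequence indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        if lengths[idx] > max_len:
            raise ValueError(f"sequence {idx} exceeds max_len")
        for pack in packs:
            if pack[0] >= lengths[idx]:
                pack[0] -= lengths[idx]
                pack[1].append(idx)
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([max_len - lengths[idx], [idx]])
    return [members for _, members in packs]


def token_loss_weights(lengths):
    """Per-token weights so that every sequence contributes equally to the batch loss."""
    n = len(lengths)
    return [1.0 / (n * length) for length in lengths]


lengths = [60_000, 30_000, 20_000, 8_000, 4_000]
packs = pack_sequences(lengths, max_len=65_536)
weights = token_loss_weights(lengths)
print(packs)
```

Without the weighting, a plain token-level mean over a pack would let the longest sequences dominate the gradient; the weights above remove that imbalance. A real training loop would typically count only the target (response) tokens rather than the full sequence length.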

Use tiny fixed KV caches and learned 1‑D convolutions to compress thousands of tokens with low memory and near-full performance

0.70
0.60
0.70
0

LoCoCo lets you handle much longer documents without buying more GPU memory or changing the model core. That reduces infrastructure cost for long‑context applications and speeds up inference prefill.

Key finding

LoCoCo can compress very long prefill contexts into a tiny KV cache during inference.

Numbers: compressed 3,482 tokens into a 128-size KV cache; accuracy gain vs baseline 0.2791 (reported)

Train one hybrid reasoning model, get many deployable sizes for free

0.70
0.60
0.80
0

You can train one large reasoning model and ship multiple quality/latency variants without per‑size retraining. That cuts token costs and storage needs, simplifies model ops, and makes offering multiple service tiers cheaper.

Key finding

Derive 6B and 9B models from a single 12B run using 110B training tokens.

Numbers: 110B tokens total (Table 2)

LOOKAT: 64× KV-cache compression via lookup-table attention, no retraining

0.65
0.60
0.80
0

LOOKAT can shrink the KV cache by up to 64× and reduce DRAM bandwidth on edge devices without retraining, enabling larger context or lower-cost hardware for real-time inference.

Key finding

LOOKAT achieves 64× KV-cache compression while keeping model output close to FP16.

Numbers: 64× compression → cosine sim 0.957

Build a DAG of chunk synopses and use MCTS to find relevant facts for long‑context QA

0.60
0.60
0.40
0

JERR improves answer accuracy and long‑range recall on long documents while producing an interpretable graph of facts; build once and reuse graphs to amortize cost.

Key finding

JERR yields the best accuracy on QuALITY multi-choice QA.

Numbers: 86.39% (JERR) vs 85.02% (GraphRAG) (Table 2)

Use 4-bit QK estimates plus block-sparse masks to speed up long-context LLM prefilling with minimal quality loss

0.70
0.60
0.80
0

SALE cuts attention compute for very long inputs with no model retraining, lowering inference cost and enabling cheaper long-document apps while fitting into existing inference stacks.

Key finding

SALE cuts attention prefilling time by at least 3.36× on Llama-3.1-8B for inputs ≥64K tokens.

Numbers: ≥3.36× speedup (64K, Table 1)

HiCo compresses hours of video to ~1/50 tokens so MLLMs can efficiently reason over 10k+ frames

0.70
0.60
0.80
0

Reduce inference cost for hour-scale video by roughly two orders of magnitude, enabling long-video features on single GPUs and lowering hosting and latency costs.

Key finding

HiCo compresses each frame to about 16 tokens (≈2% of dense tokenization) with almost no performance loss.

Numbers: 16 tokens/frame; compression ratio ≈2% (1/50)
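The token budget implied by these figures can be worked out as follows. This is a rough sketch: the dense per-frame token count is derived from the reported 1/50 ratio rather than reported directly, and the 10,000-frame video length is taken from the "10k+ frames" in the title.

```python
TOKENS_PER_FRAME = 16      # reported above
COMPRESSION_FACTOR = 50    # reported above as a ratio of about 1/50
FRAMES = 10_000            # "10k+ frames" from the title

# Implied by the ratio, not reported in this listing.
dense_tokens_per_frame = TOKENS_PER_FRAME * COMPRESSION_FACTOR

compressed_total = TOKENS_PER_FRAME * FRAMES
dense_total = dense_tokens_per_frame * FRAMES

print(f"dense:      {dense_total:,} tokens")
print(f"compressed: {compressed_total:,} tokens")
```

At 16 tokens per frame, a 10,000-frame video comes to 160,000 visual tokens, which is within reach of long-context models, whereas the implied dense encoding would need millions.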

A simple, efficient multimodal LLM that boosts high‑res image and long‑video handling with token merging and visual experts

0.60
0.60
0.60
0

MammothModa gives competitive multimodal accuracy while cutting visual token compute and inference time, making it practical for products needing high‑res image, OCR, document VQA, or long‑video understanding.

Key finding

Dynamic splitting at high equivalent resolution (DS-12) substantially improves fine‑grained and document tasks.

Numbers: Avg +45.0; OCRBench +105; DocVQA +28.83 (vs Resize)

An open‑source Llama3-based model with a 128K context window that matches or beats many proprietary models on ultra-long and RAG tasks

0.70
0.65
0.70
0

You can run an open‑source 70B model that reads 100K+ tokens and often matches or beats commercial models on retrieval and long‑document QA, reducing dependence on closed APIs and giving control over data and cost.

Key finding

ChatQA‑2‑70B achieves top average on four ultra‑long InfiniteBench tasks.

Numbers: Avg 41.04 vs GPT‑4‑Turbo 33.16 (InfiniteBench)

Fast Multipole Attention: a physics-inspired multilevel attention that cuts attention cost to O(n log n) or O(n)

0.70
0.70
0.80
0

FMA lowers GPU memory and inference latency for long text and high-resolution images, letting teams train bigger models or use longer contexts without buying more hardware.

Key finding

FMA changes attention complexity from quadratic to log-linear or linear.

Numbers: Complexity reduced from O(n^2) to O(n log n); O(n) with query downsampling
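To give a sense of scale for these complexity classes, the sketch below compares operation counts at a few sequence lengths. It ignores constant factors, which can be substantial in practice, so the ratios are upper-level intuition rather than predicted speedups; the function name is illustrative.

```python
import math


def attention_cost(n):
    """Interaction counts for a sequence of n tokens, up to constant factors."""
    return {"full": n * n, "log_linear": n * math.log2(n), "linear": n}


for n in (4_096, 65_536, 1_048_576):
    c = attention_cost(n)
    print(
        f"n={n:>9,}: full/log-linear = {c['full'] / c['log_linear']:,.0f}x, "
        f"full/linear = {c['full'] / c['linear']:,.0f}x"
    )
```

The gap widens with sequence length, which is why sub-quadratic attention matters most for long text and high-resolution images.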

Equivariant Transformer raises SLMC acceptance and reproduces observables on a 2D spin-fermion lattice

0.30
0.60
0.40
0

If you run Monte Carlo samplers or physics-informed simulators, embedding symmetry-aware attention improves proposal acceptance and preserves observables, which can cut compute per independent sample and reduce simulation cost.

Key finding

Attention layers raise SLMC acceptance compared to a linear effective model.

Numbers: Linear model acceptance = 21% on 6×6 lattice at T=0.05 t; acceptance increases with number of attention layers (Fig.4, L

Fine-tune existing MHA LLMs to DeepSeek MLA for up to ~97% KV-cache savings with 0.6–1% data

0.70
0.60
0.80
0

MHA2MLA lets teams cut KV cache memory by more than 90% (up to ~97%) and keep near-original quality, lowering GPU RAM needs and cost for long-context inference while requiring only tiny fine-tuning budgets.

Key finding

MHA2MLA adapts pretrained MHA/GQA models using a tiny fraction of data.

Numbers: 0.6%–1% of pretraining tokens used for fine-tuning

A theoretical blueprint to predict user intent from gaze, EEG, heart rate and context with <100 ms edge inference

0.20
0.70
0.60
0

Proactive, low-latency intent prediction can reduce user friction, improve accessibility, and cut cloud costs by doing inference on-device, but these gains are theoretical and need empirical validation.

Key finding

Projected intent accuracy with EEG integration is high relative to single modalities.

Numbers: 85–90% accuracy (projected, with EEG)

A time-series view explains why transformer attention heads show stable or random patterns and uses that signal to compress KV caches and to

0.60
0.60
0.70
0

TAPPA gives a cheap, model-side signal (q-similarity) to decide which parts of a model and which cached tokens are compressible. That can cut memory and latency for long-context inference and allow more aggressive structured pruning with less accuracy loss.

Key finding

High q-similarity (smooth queries) predicts predictable attention heads; low q-similarity predicts retrieval-like, unpredictable heads.

Numbers: avg q-similarity ≈ 0.80 (Llama-3.1) and ≈ 0.86 (Qwen2.5) on evaluated datasets

Add a small gated latent memory to frozen LLMs to improve multi-hop reasoning and relation extraction

0.40
0.70
0.60
0

G-MemLLM boosts evidence-grounded QA and relation extraction with a tiny, trainable memory add-on, offering notable accuracy gains without full-model finetuning or large parameter increases.

Key finding

G-MemLLM raises ZsRE accuracy by 7.4 percentage points (13.3% relative) on Llama 3.1-8B.

Numbers: ZsRE: 55.63 -> 63.03 (+7.40 points, +13.3% relative)
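The two ZsRE scores can be stated as either an absolute or a relative gain, and the two are easy to confuse. A small sketch of the arithmetic, using only the scores reported above:

```python
before, after = 55.63, 63.03  # ZsRE scores reported above

absolute_gain = after - before          # in percentage points
relative_gain = absolute_gain / before  # as a fraction of the baseline

print(f"absolute: +{absolute_gain:.2f} points")
print(f"relative: +{relative_gain:.1%}")
```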

Use a CPU suffix-automaton to recall distant tokens and let windowed attention match global-attention quality

0.60
0.70
0.60
0

ROSA lets teams support very long inputs (documents, multi-turn chat history) with near-global-attention accuracy while avoiding large GPU memory and compute increases, lowering inference cost for long-context applications.

Key finding

ROSA largely recovers long-context accuracy lost by windowed attention.

Numbers: LongBench AVG: Global 59.21; Window 29.41; Window+ROSA 57.14
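One way to read "largely recovers" is as the fraction of the global-versus-window gap that ROSA closes. A minimal sketch using only the three averages reported above; the variable names are illustrative:

```python
global_avg, window_avg, rosa_avg = 59.21, 29.41, 57.14  # LongBench AVG, as reported above

gap = global_avg - window_avg      # accuracy lost by switching to windowed attention
recovered = rosa_avg - window_avg  # accuracy regained by adding ROSA
recovered_fraction = recovered / gap

print(f"gap: {gap:.2f} points, recovered: {recovered:.2f} points ({recovered_fraction:.0%})")
```

By this measure ROSA closes roughly 93% of the gap, leaving about 2 points between Window+ROSA and global attention.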