Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

Overview

Decision Snapshot: Needs Validation

The architecture shows clear memory and throughput gains on long contexts with comparable benchmark quality. However, the released model is a raw base checkpoint (no instruction tuning or safety tooling), so it needs further adaptation before end-user deployment.

Citations: 40

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: Apache-2.0

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 80%

Authors

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

Links

Abstract / PDF / Code

Why It Matters For Business

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Who Should Care

CTO, ML Engineer, Product Manager, Engineering Lead, Data Scientist

Summary TLDR

Jamba is a publicly released base LLM that mixes Transformer layers, Mamba state-space model (SSM) layers, and Mixture-of-Experts (MoE). The hybrid design shrinks the KV cache to 4GB at 256K tokens (16-bit), runs at about 3× Mixtral's throughput on long contexts, and supports very long contexts (the released model handles up to 256K tokens; training used contexts up to 1M tokens). On quality, it matches or comes close to top open models on many benchmarks. The release is a pretrained base model, with no alignment or instruction tuning, under Apache-2.0.

Problem Statement

Transformers struggle with long contexts because attention needs large key-value (KV) caches and per-token computation on the whole context. Pure SSMs (like Mamba) train efficiently and handle long-range patterns but lag on some tasks and in-context learning. The field needs an architecture that balances memory, throughput, and quality for very long inputs.

Main Contribution

A hybrid block that interleaves Transformer (attention) and Mamba (state-space) layers plus optional MoE MLPs.

A working 12B-active / 52B-available parameter model that fits an 80GB GPU (with int8 weights) and supports 256K token contexts.
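The interleaving described above can be sketched as a simple per-layer schedule. This is a minimal illustration, not the released implementation: the layer count, the one-attention-layer-per-eight ratio, the MoE-every-other-layer spacing, and the offsets are all assumptions chosen for the example (this summary only states that the Mamba:attention ratio is high), and `layer_schedule` is a hypothetical helper.

```python
from collections import Counter


def layer_schedule(n_layers=32, attn_period=8, attn_offset=4,
                   moe_period=2, moe_offset=1):
    """Return a list of (mixer, mlp) pairs, one per layer.

    All defaults are illustrative assumptions, not values stated in this summary.
    """
    schedule = []
    for i in range(n_layers):
        # One attention layer per `attn_period` layers; the rest are Mamba.
        mixer = "attention" if i % attn_period == attn_offset else "mamba"
        # One layer per `moe_period` layers swaps its dense MLP for an MoE MLP.
        mlp = "moe" if i % moe_period == moe_offset else "dense"
        schedule.append((mixer, mlp))
    return schedule


schedule = layer_schedule()
mixer_counts = Counter(mixer for mixer, _ in schedule)
mlp_counts = Counter(mlp for _, mlp in schedule)
print(mixer_counts)
print(mlp_counts)
```

Under these assumed defaults the schedule contains 4 attention layers and 28 Mamba layers, which is why only a small fraction of the layers contribute to the KV cache.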

Key Findings

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K, 16-bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB

Practical Use: You can run much longer documents on a single GPU and avoid the KV cache memory bottleneck.

Evidence Ref: Table 1; Sec. 2
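The KV cache figures above follow from a standard size formula: two tensors (keys and values) per attention layer, each holding one vector per KV head per token. The sketch below reproduces the 4GB / 32GB / 128GB numbers under assumed configurations. The layer counts, KV head counts, and head dimension are not given in this summary; they are assumptions picked because they are consistent with the reported sizes, and sizes are computed in GiB.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key-value cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
SEQ_LEN = 256 * 1024  # 256K tokens

# Assumed configurations: (attention layers, KV heads, head dimension).
configs = {
    "Jamba-style (4 attention layers, 8 KV heads)": (4, 8, 128),
    "Mixtral-style (32 attention layers, 8 KV heads)": (32, 8, 128),
    "Llama-2-style (32 attention layers, 32 KV heads)": (32, 32, 128),
}

sizes_gib = {
    name: kv_cache_bytes(layers, heads, dim, SEQ_LEN) / GIB
    for name, (layers, heads, dim) in configs.items()
}
for name, size in sizes_gib.items():
    print(f"{name}: {size:.0f} GiB")
```

With these assumptions, the 8× gap to the Mixtral-style config comes entirely from having one eighth as many attention layers, since Mamba layers keep no per-token cache.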

Throughput at long contexts is about 3× higher than Mixtral.

Numbers: about 3× Mixtral's tokens/sec, both at 128K context and in batched tests at 8K context

Practical Use: Expect far faster processing of long-document workloads (faster inference and higher batch throughput).

Evidence Ref: Sec. 3.2; Figure 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
KV cache (256K context, 16bit)	4GB (Jamba)	32GB (Mixtral); 128GB (Llama-2)	≈8× smaller vs Mixtral; 32× vs Llama-2	Long context memory	Table 1 comparing KV cache sizes	Sec.2; Table 1
Throughput (tokens/sec)	≈3× Mixtral	Mixtral-8x7B	3× at 128K context and in batch tests	Long-context throughput	Figures 3a,b; Sec.3.2	Sec.3.2; Figure 3

What To Try In 7 Days

Run the Hugging Face Jamba-v0.1 checkpoint on your long-doc QA pipeline and measure KV cache and throughput vs your current model.

Benchmark Jamba on a few 10–100K token real inputs to see latency and answer quality tradeoffs.

Try the released base model for offline batch processing (not user-facing) to assess suitability before alignment/tuning.
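For the measurements suggested above, a small timing harness is often enough to compare models on equal terms. The sketch below is a minimal, model-agnostic example: `measure_throughput` and `fake_generate` are hypothetical names, and `generate_fn` stands in for whatever call your pipeline uses to produce tokens (the real model call is deliberately left out).

```python
import time


def measure_throughput(generate_fn, prompt, max_new_tokens, repeats=3):
    """Return the best observed tokens/second over `repeats` runs.

    `generate_fn(prompt, max_new_tokens)` is a hypothetical callable that
    must return the number of tokens it actually generated.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        n_generated = generate_fn(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_generated / elapsed)
    return best


def fake_generate(prompt, max_new_tokens):
    """Stand-in for a real model call; sleeps briefly per token."""
    time.sleep(0.001 * max_new_tokens)
    return max_new_tokens


tokens_per_second = measure_throughput(fake_generate, "long document ...", 20)
print(f"{tokens_per_second:.0f} tokens/sec")
```

To compare against your current model, wrap each model's generation call in a function with the same signature and run both on the same set of long inputs. Taking the best of several runs reduces the effect of warm-up and scheduling noise.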

Agent Features

Memory

reduced KV cache (attention memory)

supports long-context SSM summary state

Architectures

hybrid Transformer-Mamba

MoE

Optimization Features

Token Efficiency

design trades attention vs Mamba layers to reduce per-token attention cost

Infra Optimization

smaller KV cache enables larger context fits per GPU

Model Optimization

MoE to expand available params without increasing active compute

RMSNorm in Mamba layers to stabilize large-scale training
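How MoE adds parameters without adding active compute can be illustrated with a toy router. This is a minimal sketch assuming top-k softmax gating, which this summary does not specify. `route_top_k` and `moe_forward` are hypothetical helpers, and the experts are trivial scalar functions rather than real MLPs.

```python
import math


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy experts: each scales its input by a different constant.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
selected = route_top_k(logits, k=2)
output = moe_forward(1.0, experts, logits, k=2)
print(selected, output)
```

Only `k` experts are evaluated per input, so adding more experts grows the parameter count while the per-token compute stays tied to `k`.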

System Optimization

fits longer contexts on one GPU, cutting inter-GPU communication

Training Optimization

expert parallelism and sequence parallelism

trained with mixed parallel strategies (FSDP, tensor parallelism)

Inference Optimization

high Mamba:attention ratio to lower attention compute on long contexts

int8 weights to fit model on 80GB GPU
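The int8 point comes down to simple arithmetic on the 52B available parameters reported in this summary. The sketch below is a rough estimate that counts weights only: it ignores activations, the KV cache, Mamba state, and any tensors kept in higher precision, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for model weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


TOTAL_PARAMS = 52e9   # "52B-available" parameters from this summary
GPU_MEMORY_GB = 80

int8_gb = weight_memory_gb(TOTAL_PARAMS, 1)   # 1 byte per weight
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)   # 2 bytes per weight

print(f"int8:   {int8_gb:.0f} GB, headroom {GPU_MEMORY_GB - int8_gb:.0f} GB")
print(f"16-bit: {bf16_gb:.0f} GB, fits on one GPU: {bf16_gb <= GPU_MEMORY_GB}")
```

At one byte per weight the model leaves roughly 28 GB of an 80GB GPU for the cache and activations, while 16-bit weights alone would exceed the card.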

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Apache-2.0

Code URLs

https://huggingface.co/ai21labs/Jamba-v0.1

Risks & Boundaries

Limitations

Released model is a base pretrained checkpoint without alignment, instruction tuning, or moderation.

On some tasks the hybrid matches state-of-the-art models, but results across aggregate benchmarks are mixed, with both wins and losses.

When Not To Use

Directly in user-facing systems without alignment or moderation.

When you need a chat/instruction-tuned model out of the box.

Failure Modes

Format non-adherence and poor in-context learning with pure Mamba configurations.

Training instability (loss spikes) without RMSNorm in Mamba internals.
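For readers unfamiliar with RMSNorm, the sketch below shows the standard operation in plain Python. It illustrates the normalization only. Where it is applied inside the Mamba layers is not detailed in this summary, and the `eps` value is an assumption.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Scale `x` so its root-mean-square is about 1, then apply a per-element gain."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


small = rms_norm([0.1, -0.2, 0.3])
large = rms_norm([100.0, -200.0, 300.0])  # same direction, 1000x the scale
print(small)
print(large)
```

Both inputs map to nearly the same output, which is the property that helps: activations that grow very large in magnitude are rescaled before they can destabilize later computations.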

Core Entities

Models

Jamba, Mixtral-8x7B, Llama-2 70B, Llama-2 13B, Mistral 7B, Gemma, Mamba (SSM), Hyena / StripedHyena

Metrics

tokens/second (throughput), KV cache memory (GB), F1 (QA long-context), Accuracy, Log-prob per byte (perplexity proxy), OLLM leaderboard score

Datasets

L-Eval, Needle-in-a-haystack, LongFQA, NarrativeQA, CUAD, SFiction, Trec-Fine, NLU Intent, Banking77, CLINC150, Natural Questions, HellaSwag, WinoGrande, ARC-E / ARC-Challenge, PIQA, BoolQ, GSM8K, HumanEval, TruthfulQA, MMLU, BBH, C4, Books, code

Benchmarks

HellaSwag, WinoGrande, ARC, PIQA, BoolQ, QuAC, GSM8K, HumanEval, Natural Questions, TruthfulQA, MMLU, BBH, Long-context QA (L-Eval subsets)

