Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

March 28, 2024 · 9 min

Overview

Decision Snapshot: Needs Validation

The architecture shows clear memory and throughput gains for long contexts, with comparable benchmark quality. However, the released model is a raw base checkpoint with no instruction tuning or safety tooling, so it needs further adaptation before end-user deployment.

Citations: 40

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/5

Reproducibility

Status: Partial assets available

Open source: Partial

License: Apache-2.0

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 80%

Authors

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

Links

Abstract / PDF / Code

Why It Matters For Business

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Who Should Care

Summary TLDR

Jamba is a publicly released base LLM that mixes Transformer layers, Mamba state-space model (SSM) layers, and Mixture-of-Experts (MoE). The hybrid design sharply cuts KV cache memory (4GB at 256K tokens), delivers about 3× Mixtral's throughput on long contexts, and supports very long contexts (released with up to 256K tokens, trained at up to 1M). On quality, it matches or comes close to top open models on many benchmarks. The release is a base pretrained model (no alignment or instruction tuning) under Apache-2.0.

Problem Statement

Transformers struggle with long contexts because attention needs a large key-value (KV) cache and every new token attends over the whole context. Pure SSMs (like Mamba) train efficiently and handle long-range patterns, but lag on some tasks and on in-context learning. The field needs an architecture that balances memory, throughput, and quality for very long inputs.

Main Contribution

A hybrid block that interleaves Transformer (attention) and Mamba (state-space) layers plus optional MoE MLPs.

A working model with 12B active / 52B total available parameters that fits on a single 80GB GPU (with int8 weights) and supports 256K-token contexts.
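To make the interleaving concrete, here is a minimal sketch of how one hybrid block could be laid out. The block size, the attention position, and the MoE spacing used as defaults are illustrative assumptions (this summary does not state them), and `jamba_block_schedule` is a hypothetical helper, not part of any released code.

```python
from collections import Counter


def jamba_block_schedule(layers_per_block=8, attn_period=8, attn_offset=4,
                         moe_period=2, moe_offset=1):
    """Return a list of (mixer, mlp) layer types for one hybrid block.

    All default values are assumptions chosen for illustration.
    """
    schedule = []
    for i in range(layers_per_block):
        # One layer per `attn_period` uses attention; the rest use Mamba.
        mixer = "attention" if i % attn_period == attn_offset else "mamba"
        # One layer per `moe_period` swaps its dense MLP for an MoE MLP.
        mlp = "moe" if i % moe_period == moe_offset else "dense"
        schedule.append((mixer, mlp))
    return schedule


block = jamba_block_schedule()
mixer_counts = Counter(mixer for mixer, _ in block)
mlp_counts = Counter(mlp for _, mlp in block)

for i, (mixer, mlp) in enumerate(block):
    print(f"layer {i}: {mixer:9s} + {mlp} MLP")
print(dict(mixer_counts), dict(mlp_counts))
```

Only the attention layers contribute to the KV cache, so raising the Mamba share of the schedule is what shrinks attention memory at long context.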

Key Findings

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K, 16-bit): Jamba 4GB vs Mixtral 32GB vs Llama-2 128GB

Practical Use: You can run much longer documents on a single GPU and avoid the KV cache memory bottleneck.

Evidence Ref: Table 1; Sec. 2
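The three KV cache figures above can be reproduced with simple arithmetic. The sketch below assumes the standard formula (keys plus values, per attention layer, per KV head) and assumes layer and head counts taken from the publicly described model configurations. Those counts are not stated in this summary, and the Llama-2 row assumes the 7B-class variant with full multi-head attention.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor of 2: one key tensor and one value tensor per attention layer.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 256 * 1024   # 256K tokens
GIB = 1024 ** 3

# Assumed configurations (not given in this summary).
configs = {
    "Jamba":   dict(attn_layers=4,  kv_heads=8,  head_dim=128),
    "Mixtral": dict(attn_layers=32, kv_heads=8,  head_dim=128),
    "Llama-2": dict(attn_layers=32, kv_heads=32, head_dim=128),
}

kv_gib = {
    name: kv_cache_bytes(seq_len=SEQ_LEN, **cfg) / GIB
    for name, cfg in configs.items()
}

for name, size in kv_gib.items():
    print(f"{name}: {size:.0f} GiB")
```

Under these assumptions the 8× gap to Mixtral comes entirely from having one eighth as many attention layers; the head layout is otherwise the same.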

Throughput at long contexts is about 3× higher than Mixtral.

Numbers: ≈3× tokens/sec vs Mixtral at 128K context and in 8K-context batch tests

Practical Use: Expect far faster processing of long-document workloads (faster inference and higher batch throughput).

Evidence Ref: Sec. 3.2; Figure 3
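To check a throughput claim like this on your own workload, a small timing harness is enough. The sketch below is model-agnostic: `generate_fn` is a hypothetical callable you would wrap around your own inference call, and the two `fake_*` functions are stand-ins that only sleep, so the numbers they produce say nothing about real models.

```python
import time


def tokens_per_second(generate_fn, prompt, new_tokens, repeats=3):
    """Best-of-N decode throughput for a callable generate_fn(prompt, new_tokens)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(prompt, new_tokens)
        best = min(best, time.perf_counter() - start)
    return new_tokens / best


def speedup(candidate_tps, baseline_tps):
    """Throughput ratio of a candidate model over a baseline."""
    return candidate_tps / baseline_tps


# Stand-ins for real models: they only sleep for a fixed time.
def fake_fast(prompt, new_tokens):
    time.sleep(0.01)


def fake_slow(prompt, new_tokens):
    time.sleep(0.03)


fast_tps = tokens_per_second(fake_fast, "long document ...", new_tokens=64)
slow_tps = tokens_per_second(fake_slow, "long document ...", new_tokens=64)
print(f"speedup: {speedup(fast_tps, slow_tps):.1f}x")
```

For a fair comparison, keep the prompt length, batch size, number of generated tokens, and hardware identical across the two models, and run a warm-up call before timing.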

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| KV cache (256K context, 16-bit) | 4GB (Jamba) | 32GB (Mixtral); 128GB (Llama-2) | ≈8× smaller vs Mixtral; 32× vs Llama-2 | Long context memory | Table 1 comparing KV cache sizes | Sec. 2; Table 1 |
| Throughput (tokens/sec) | ≈3× Mixtral | Mixtral-8x7B at 128K context and in batch tests | ≈3× higher | Long-context throughput | Figures 3a,b; Sec. 3.2 | Sec. 3.2; Figure 3 |

What To Try In 7 Days

Run the Hugging Face Jamba-v0.1 checkpoint on your long-doc QA pipeline and measure KV cache and throughput vs your current model.

Benchmark Jamba on a few 10–100K token real inputs to see latency and answer quality tradeoffs.

Try the released base model for offline batch processing (not user-facing) to assess suitability before alignment/tuning.

Agent Features

Memory: reduced KV cache (attention memory); supports long-context SSM summary state
Architectures: hybrid Transformer-Mamba; MoE
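As a rough illustration of the MoE entry above, the sketch below shows generic top-k expert routing: a router scores every expert, but only the k best are evaluated for a given token. The expert count (16), k = 2, and the toy experts are assumptions for illustration. This is a generic routing scheme, not necessarily the exact router used in Jamba.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted for stability).
    exps = [math.exp(router_logits[i] - router_logits[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# 16 toy experts (assumed count): expert i multiplies its input by i + 1.
experts = [lambda x, s=i + 1: s * x for i in range(16)]

logits = [0.0] * 16
logits[3] = 2.0
logits[10] = 2.0

routes = top_k_route(logits)
y = moe_forward(1.0, experts, logits)
print(routes, y)
```

Adding more experts grows the parameter count, but each token still runs through only k of them, which is how available parameters can exceed active parameters.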

Optimization Features

Token Efficiency: design trades attention vs Mamba layers to reduce per-token attention cost
Infra Optimization: smaller KV cache enables larger context fits per GPU
Model Optimization: MoE to expand available params without increasing active compute; RMSNorm in Mamba layers to stabilize large-scale training
System Optimization: fits longer contexts on one GPU, cutting inter-GPU communication
Training Optimization: expert parallelism and sequence parallelism; trained with mixed parallel strategies (FSDP, tensor parallelism)
Inference Optimization: high Mamba:attention ratio to lower attention compute on long contexts; int8 weights to fit model on 80GB GPU
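The int8 point can be checked with back-of-the-envelope arithmetic. The sketch below uses the 52B total parameter count and the 80GB GPU from this summary, and assumes decimal gigabytes and weights-only storage (no KV cache, activations, or framework overhead).

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gb(n_params, dtype):
    """Weights-only memory footprint in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


TOTAL_PARAMS = 52e9   # all experts must be resident, not just the active 12B
GPU_GB = 80

footprint = {dtype: weight_gb(TOTAL_PARAMS, dtype) for dtype in BYTES_PER_PARAM}

for dtype, gb in footprint.items():
    verdict = "fits" if gb < GPU_GB else "does not fit"
    print(f"{dtype}: {gb:.0f} GB of weights, {verdict} in {GPU_GB} GB")
```

Whatever the weights leave free on the GPU has to cover the KV cache, the Mamba state, and activations, so real usable context is lower than a weights-only calculation suggests.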

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Apache-2.0

Risks & Boundaries

Limitations

Released model is a base pretrained checkpoint without alignment, instruction tuning, or moderation.

On some tasks the hybrid matches state-of-the-art open models, but aggregate benchmarks show a mix of wins and losses.

When Not To Use

Directly in user-facing systems without alignment or moderation.

When you need a chat/instruction-tuned model out of the box.

Failure Modes

Format non-adherence and poor in-context learning with pure Mamba configurations.

Training instability (loss spikes) without RMSNorm in Mamba internals.
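For readers unfamiliar with RMSNorm, here is a minimal pure-Python sketch of the operation. It rescales a vector so its root-mean-square is about 1, which bounds activation magnitude regardless of how large the inputs grow. Where exactly the normalization sits inside the Mamba layer is not specified in this summary, and the sample activations are made up for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x to unit root-mean-square, then apply a learned per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for v, g in zip(x, gain)]


# Made-up activations with large magnitude, as might precede a loss spike.
activations = [300.0, -400.0, 0.0, 500.0]
normed = rms_norm(activations, gain=[1.0] * len(activations))

out_rms = math.sqrt(sum(v * v for v in normed) / len(normed))
print([round(v, 3) for v in normed], round(out_rms, 3))
```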

Core Entities

Models

Jamba, Mixtral-8x7B, Llama-2 70B, Llama-2 13B, Mistral 7B, Gemma, Mamba (SSM), Hyena / StripedHyena

Metrics

tokens/second (throughput), KV cache memory (GB), F1 (QA long-context), Accuracy, Log-prob per byte (perplexity proxy), OLLM leaderboard score

Datasets

L-Eval, Needle-in-a-haystack, LongFQA, NarrativeQA, CUAD, SFiction, Trec-Fine, NLU Intent, Banking77, CLINC150, Natural Questions, HellaSwag, WinoGrande, ARC-E / ARC-Challenge, PIQA, BoolQ, GSM8K, HumanEval, TruthfulQA, MMLU, BBH, C4, Books, code

Benchmarks

HellaSwag, WinoGrande, ARC, PIQA, BoolQ, QuAC, GSM8K, HumanEval, Natural Questions, TruthfulQA, MMLU, BBH, Long-context QA (L-Eval subsets)