Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

March 28, 2024 · 9 min

Overview

Production Readiness

0.3

Novelty Score

0.8

Cost Impact Score

0.7

Citation Count

40

Authors

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

Why It Matters For Business

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Summary TLDR

Jamba is a publicly released base LLM that mixes Transformer layers, Mamba state-space model (SSM) layers, and Mixture-of-Experts (MoE) layers. The hybrid design shrinks the KV cache to 4GB at 256K tokens, delivers about 3× Mixtral's throughput on long contexts, and supports very long contexts (the released model handles up to 256K tokens; training runs went up to 1M). On quality, it matches or approaches leading open models on many benchmarks. The release is a pretrained base model, with no alignment or instruction tuning, under Apache-2.0.

Problem Statement

Transformers struggle with long contexts because attention needs large key-value (KV) caches and per-token computation on the whole context. Pure SSMs (like Mamba) train efficiently and handle long-range patterns but lag on some tasks and in-context learning. The field needs an architecture that balances memory, throughput, and quality for very long inputs.

Main Contribution

A hybrid block that interleaves Transformer (attention) and Mamba (state-space) layers plus optional MoE MLPs.

A working model with 12B active parameters out of 52B available that fits on a single 80GB GPU (with int8 weights) and supports contexts of up to 256K tokens.

A set of ablations showing hybrid benefits over pure Transformer or pure Mamba, and that MoE improves hybrid performance.

Public release of the model weights under Apache-2.0 to enable follow-up exploration.
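
To make the interleaving in the first contribution concrete, here is a minimal Python sketch of a layer schedule for a hybrid stack. The specific values (4 blocks of 8 layers, one attention layer per block, an MoE MLP in every second layer) are assumptions and are not stated in this summary. The function name `jamba_like_layout` and the position of the attention layer inside each block are purely illustrative.

```python
def jamba_like_layout(num_blocks=4, layers_per_block=8,
                      attn_index=4, moe_every=2):
    """Return a list of (mixer, mlp) tuples describing a hybrid stack.

    All defaults are assumptions for illustration, not values stated in
    this summary: one attention layer per 8-layer block (a 1:7
    attention-to-Mamba ratio) and an MoE MLP in every second layer.
    """
    layout = []
    for layer in range(num_blocks * layers_per_block):
        pos = layer % layers_per_block
        mixer = "attention" if pos == attn_index else "mamba"
        mlp = "moe" if layer % moe_every == 1 else "dense"
        layout.append((mixer, mlp))
    return layout


layout = jamba_like_layout()
n_attn = sum(1 for mixer, _ in layout if mixer == "attention")
n_moe = sum(1 for _, mlp in layout if mlp == "moe")
print(f"{len(layout)} layers: {n_attn} attention, "
      f"{len(layout) - n_attn} mamba, {n_moe} MoE MLPs")
```

Under these assumed settings only a small fraction of layers keep a KV cache, which is where the memory savings reported in the key findings come from.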

Key Findings

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K tokens, 16-bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB
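
The KV cache figures can be checked with back-of-the-envelope arithmetic: keys plus values, for every attention layer, KV head, head dimension, and token, at 2 bytes each. The sketch below is a rough check. The layer counts, KV head counts, and head dimension are assumptions that do not come from this summary. In particular, the 128GB Llama-2 figure is reproduced here with a 7B-style configuration, and that mapping is also an assumption.

```python
def kv_cache_gib(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size in GiB: keys and values for every attention layer."""
    total_bytes = (2 * attn_layers * kv_heads * head_dim
                   * seq_len * bytes_per_value)
    return total_bytes / 2**30


SEQ_LEN = 256 * 1024  # 256K tokens

# Assumed configurations (attention layers, KV heads, head dim);
# none of these values are stated in this summary.
configs = {
    "Jamba-like (4 attention layers, 8 KV heads)": (4, 8, 128),
    "Mixtral-like (32 attention layers, 8 KV heads)": (32, 8, 128),
    "Llama-2-7B-like (32 attention layers, 32 KV heads)": (32, 32, 128),
}

for name, (layers, heads, dim) in configs.items():
    print(f"{name}: {kv_cache_gib(layers, heads, dim, SEQ_LEN):.0f} GiB")
```

With these assumed shapes the three results come out at 4, 32, and 128 GiB. The whole gap between the Jamba-like and Mixtral-like rows is the number of layers that use attention.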

Throughput at long contexts is about 3× higher than Mixtral.

Numbers: about 3× Mixtral's tokens/sec at 128K context, and likewise in batched tests at 8K context

Released model supports 256K tokens and training went up to 1M tokens context.

Numbers: released max context 256K; trained up to 1M

Hybrid architecture matches or exceeds similar-size SOTA on many benchmarks.

Numbers: HellaSwag: Jamba 87.1 vs Mixtral 86.7; MMLU: Jamba ~67.4 vs Mixtral 70.6 (results vary by benchmark)

Adding MoE improves hybrid performance.

Numbers: OLLM: 36.6 → 38.1; HellaSwag: 62.5 → 66.0; log-prob improves by ≈0.013

Results

KV cache (256K context, 16-bit)

Value: 4GB (Jamba)

Baseline: 32GB (Mixtral); 128GB (Llama-2)

Throughput (tokens/sec)

Value: ≈3× Mixtral

Baseline: Mixtral-8x7B

Max context fit on one A100 80GB

Value: 2× Mixtral; 7× Llama-2 (relative)

Baseline: Mixtral, Llama-2 70B

Accuracy (HellaSwag)

Value: 87.1 (Jamba)

Baseline: 86.7 (Mixtral), 85.3 (Llama‑2 70B)

Long-context QA (avg F1, 3-shot)

Value: 0.44 (Jamba avg)

Baseline: 0.43 (Mixtral avg)

Who Should Care

What To Try In 7 Days

Run the Hugging Face Jamba-v0.1 checkpoint on your long-doc QA pipeline and measure KV cache and throughput vs your current model.

Benchmark Jamba on a few 10–100K token real inputs to see latency and answer quality tradeoffs.

Try the released base model for offline batch processing (not user-facing) to assess suitability before alignment/tuning.
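
For the throughput comparison suggested above, a small model-agnostic timing harness is enough to get started. The sketch below is one possible shape for it. `measure_throughput` and `dummy_generate` are hypothetical names, and `generate_fn` is a placeholder you would replace with a wrapper around your own model's generation call.

```python
import time


def measure_throughput(generate_fn, prompt, new_tokens, repeats=3):
    """Return the best tokens/second over `repeats` timed calls.

    `generate_fn(prompt, new_tokens)` is a hypothetical callable that is
    expected to produce `new_tokens` tokens; wrap your model call in it.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(prompt, new_tokens)
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, new_tokens / elapsed)
    return best


def dummy_generate(prompt, new_tokens):
    # Stand-in for a real model so the sketch runs anywhere.
    time.sleep(0.001 * new_tokens)
    return ["tok"] * new_tokens


tps = measure_throughput(dummy_generate, "a long document ...", new_tokens=20)
print(f"{tps:.0f} tokens/sec")
```

Run the same harness with identical prompts and output lengths for Jamba and for your current model. For GPU models, make sure the wrapped call blocks until generation has actually finished, otherwise the timer stops early.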

Agent Features

Memory

  • reduced KV cache (attention memory)
  • SSM layers carry long-context information in a compact recurrent state

Architectures

  • hybrid Transformer-Mamba
  • MoE

Optimization Features

Token Efficiency

  • design swaps most attention layers for Mamba layers to reduce per-token attention cost

Infra Optimization

  • smaller KV cache enables larger context fits per GPU

Model Optimization

  • MoE to expand available params without increasing active compute
  • RMSNorm in Mamba layers to stabilize large-scale training
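
As an illustration of how MoE expands available parameters without increasing active compute, here is a minimal top-k routing sketch in plain Python. It shows one common routing variant (softmax over the selected experts only). The choice of 16 experts with 2 active per token is an assumption and is not stated in this summary, and the scalar "experts" are toys.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Toy setup: 16 scalar "experts", each multiplying by a different constant.
experts = [lambda x, s=s: s * x for s in range(16)]
logits = [0.0] * 16
logits[3], logits[7] = 2.0, 2.0  # the router prefers experts 3 and 7

weights = top_k_route(logits)
y = moe_forward(1.0, experts, logits)
print(weights, y)
```

Only two of the sixteen toy experts are evaluated for this input, so per-token compute tracks the active experts while all expert weights still have to be stored.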

System Optimization

  • fits longer contexts on one GPU, cutting inter-GPU communication

Training Optimization

  • expert parallelism and sequence parallelism
  • trained with mixed parallel strategies (FSDP, tensor parallelism)

Inference Optimization

  • high Mamba:attention ratio to lower attention compute on long contexts
  • int8 weights to fit model on 80GB GPU
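
The int8 point is simple arithmetic on weight storage. The sketch below counts weight bytes only and ignores the KV cache, activations, and framework overhead, so it is a lower bound on real memory use, not a deployment estimate. The function name is illustrative.

```python
def weight_memory_gb(num_params_billion, bytes_per_param):
    """Weight-only memory in GB (10^9 bytes); excludes cache and activations."""
    return num_params_billion * bytes_per_param


TOTAL_PARAMS_B = 52  # "52B-available" parameters, from the summary
GPU_GB = 80

for name, nbytes in [("16-bit", 2), ("int8", 1)]:
    gb = weight_memory_gb(TOTAL_PARAMS_B, nbytes)
    verdict = "fits" if gb < GPU_GB else "does not fit"
    print(f"{name}: {gb} GB of weights, {verdict} in {GPU_GB} GB")
```

All 52B parameters must be resident even though only about 12B are active per token, which is why the weight precision decides whether a single 80GB card is enough.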

Reproducibility

License

  • Apache-2.0

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Released model is a base pretrained checkpoint without alignment, instruction tuning, or moderation.
  • The hybrid matches SOTA on some tasks, but results on other aggregate benchmarks are a mix of wins and losses.
  • Pure Mamba struggles with in-context learning and with following input/output formats; the hybrid needs at least some attention layers.
  • Large-scale Mamba requires RMSNorm to avoid loss spikes during training.
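
For readers unfamiliar with RMSNorm, the operation divides a vector by its root-mean-square and applies a learned per-element gain. The plain-Python sketch below shows the generic operation only. Where exactly it is applied inside Jamba's Mamba layers is not specified in this summary, and the `eps` default is an assumption.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Scale a vector to unit root-mean-square, then apply a per-element gain."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


small = rms_norm([0.1, -0.2, 0.3])
large = rms_norm([1000.0, -2000.0, 3000.0])
print(small)
print(large)
```

Both inputs normalize to nearly the same output despite a 10,000× difference in scale, which hints at why bounding activation magnitudes this way can help against loss spikes.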

When Not To Use

  • Directly in user-facing systems without alignment or moderation.
  • When you need a chat/instruction-tuned model out of the box.
  • If your infra is optimized only for pure attention models and cannot support SSM or MoE execution.

Failure Modes

  • Format non-adherence and poor in-context learning with pure Mamba configurations.
  • Training instability (loss spikes) without RMSNorm in Mamba internals.
  • Quality may lag on some benchmarks despite gains on long-context tasks.

Core Entities

Models

  • Jamba
  • Mixtral-8x7B
  • Llama-2 70B
  • Llama-2 13B
  • Mistral 7B
  • Gemma
  • Mamba (SSM)
  • Hyena / StripedHyena

Metrics

  • tokens/second (throughput)
  • KV cache memory (GB)
  • F1 (QA long-context)
  • Accuracy
  • Log-prob per byte (perplexity proxy)
  • OLLM leaderboard score

Datasets

  • L-Eval
  • Needle-in-a-haystack
  • LongFQA
  • NarrativeQA
  • CUAD
  • SFiction
  • Trec-Fine
  • NLU Intent
  • Banking77
  • CLINC150
  • Natural Questions
  • HellaSwag
  • WinoGrande
  • ARC-E / ARC-Challenge
  • PIQA
  • BoolQ
  • GSM8K
  • HumanEval
  • TruthfulQA
  • MMLU
  • BBH
  • C4
  • Books
  • code

Benchmarks

  • HellaSwag
  • WinoGrande
  • ARC
  • PIQA
  • BoolQ
  • QuAC
  • GSM8K
  • HumanEval
  • Natural Questions
  • TruthfulQA
  • MMLU
  • BBH
  • Long-context QA (L-Eval subsets)