Overview
Production Readiness
0.3
Novelty Score
0.8
Cost Impact Score
0.7
Citation Count
40
Why It Matters For Business
Jamba lets teams process much longer documents on a single 80GB GPU while cutting memory needs and speeding up inference. That lowers infrastructure cost for long-document products and enables new features such as very-long-context summarization.
Summary TLDR
Jamba is a publicly released base LLM that combines Transformer layers, Mamba state-space model (SSM) layers, and Mixture-of-Experts (MoE) layers. The hybrid design shrinks the KV cache to about 4GB at a 256K-token context, delivers about 3x the throughput of Mixtral on long contexts, and supports very long contexts (256K tokens in the released model; training runs reached 1M tokens). On quality, it matches or comes close to top open models on many benchmarks. The release is a base pretrained model (no alignment or instruction tuning) under the Apache-2.0 license.
Problem Statement
Transformers struggle with long contexts because attention requires a large key-value (KV) cache that grows with context length, and each new token must attend over the whole context. Pure SSMs (such as Mamba) train efficiently and handle long-range patterns but lag on some tasks and on in-context learning. The field needs an architecture that balances memory, throughput, and quality for very long inputs.
Main Contribution
A hybrid block that interleaves Transformer (attention) and Mamba (state-space) layers plus optional MoE MLPs.
A working model with 12B active and 52B available parameters that fits on a single 80GB GPU (with int8 weights) and supports 256K-token contexts.
A set of ablations showing hybrid benefits over pure Transformer or pure Mamba, and that MoE improves hybrid performance.
Public release of the model weights under Apache-2.0 to enable follow-up exploration.
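To make the interleaving concrete, here is a minimal sketch of how one hybrid block could be laid out. The block length (8 layers), the 1:7 attention-to-Mamba ratio, the MoE-every-second-layer rule, and the position of the attention layer inside the block are all assumptions chosen for illustration; none of them is stated in this summary, and the function name `jamba_block_layout` is hypothetical.

```python
def jamba_block_layout(layers_per_block=8, attn_every=8, attn_offset=4,
                       moe_every=2, moe_offset=1):
    """Return a list of (mixer, mlp) pairs describing one hybrid block.

    All defaults are illustrative assumptions, not values confirmed by
    this summary: one attention layer per 8 layers (1:7 attention:Mamba)
    and an MoE MLP on every second layer.
    """
    layout = []
    for i in range(layers_per_block):
        # Exactly one layer per period uses attention; the rest use Mamba.
        mixer = "attention" if i % attn_every == attn_offset else "mamba"
        # Every `moe_every`-th layer swaps its dense MLP for an MoE MLP.
        mlp = "moe" if i % moe_every == moe_offset else "dense"
        layout.append((mixer, mlp))
    return layout


for idx, (mixer, mlp) in enumerate(jamba_block_layout()):
    print(f"layer {idx}: {mixer} mixer, {mlp} MLP")
```

Changing `attn_every` is the knob that trades attention quality against KV cache size: a larger period means fewer attention layers and therefore a smaller cache.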
Key Findings
The hybrid design reduces the KV cache at a 256K-token context to 4GB (16-bit).
Throughput at long contexts is about 3× higher than Mixtral.
The released model supports contexts of up to 256K tokens; training runs reached contexts of up to 1M tokens.
The hybrid architecture matches or exceeds state-of-the-art models of similar size on many benchmarks.
Adding MoE improves hybrid performance.
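The 4GB KV cache figure can be sanity-checked with simple arithmetic. The sketch below assumes a configuration that is not given in this summary: 4 attention layers in the hybrid, 8 KV heads of dimension 128, and 16-bit cache values. It compares that against a hypothetical pure-attention model with 32 layers and the same head setup.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 counts one key vector and one value vector
    per token, per attention layer, per KV head.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
CONTEXT = 256 * 1024  # 256K tokens

# Assumed configurations (not stated in this summary).
hybrid_gib = kv_cache_bytes(4, 8, 128, CONTEXT) / GIB
dense_gib = kv_cache_bytes(32, 8, 128, CONTEXT) / GIB

print(f"hybrid, 4 attention layers:  {hybrid_gib:.0f} GiB")   # 4 GiB
print(f"pure attention, 32 layers:   {dense_gib:.0f} GiB")    # 32 GiB
```

Under these assumptions the cache scales linearly with the number of attention layers, so replacing most of them with Mamba layers is what produces the saving.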
Results
KV cache (256K context, 16-bit)
Throughput (tokens/sec)
Max context fit on one A100 80GB
Accuracy
Long-context QA (avg F1, 3-shot)
Who Should Care
What To Try In 7 Days
Run the Hugging Face Jamba-v0.1 checkpoint on your long-doc QA pipeline and measure KV cache and throughput vs your current model.
Benchmark Jamba on a few 10–100K token real inputs to see latency and answer quality tradeoffs.
Try the released base model for offline batch processing (not user-facing) to assess suitability before alignment/tuning.
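For the throughput comparison suggested above, a small model-agnostic timing harness is usually enough. The sketch below is one possible shape: `generate_fn` is a hypothetical adapter you would write around your own model or the Jamba checkpoint, and `fake_generate` is only a stand-in so the harness can run without a GPU.

```python
import time


def measure_throughput(generate_fn, prompt, max_new_tokens, runs=3):
    """Return the best tokens/second observed over `runs` calls.

    `generate_fn(prompt, max_new_tokens)` is a hypothetical adapter around
    your model; it must return the number of tokens actually generated.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate_fn(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, n_tokens / elapsed)
    return best


def fake_generate(prompt, max_new_tokens):
    """Stand-in model: sleeps briefly, then reports the requested count."""
    time.sleep(0.01)
    return max_new_tokens


print(f"{measure_throughput(fake_generate, 'long document...', 50):.0f} tokens/sec")
```

Run the same harness with identical prompts and `max_new_tokens` against both models, and record peak GPU memory separately, since this sketch measures time only.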
Agent Features
Memory
- reduced KV cache (attention memory)
- SSM layers carry a fixed-size summary state across long contexts
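The memory contrast between attention and SSM layers can be illustrated with a toy scalar linear recurrence. This is a deliberately simplified sketch, not Mamba's selective scan: the coefficients `a`, `b`, and `c` are fixed, hypothetical values, whereas Mamba makes its parameters input-dependent.

```python
def run_linear_ssm(inputs, a=0.9, b=0.1, c=1.0):
    """Scan the recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Only the single float `h` is carried between steps, so the state
    memory does not grow with the number of inputs. An attention layer,
    by contrast, must keep keys and values for every past token.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs, h


outputs, final_state = run_linear_ssm([1.0] * 1000)
print(len(outputs), final_state)  # 1000 outputs, still one float of state
```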
Architectures
- hybrid Transformer-Mamba
- MoE
Optimization Features
Token Efficiency
- design replaces most attention layers with Mamba layers to reduce per-token attention cost
Infra Optimization
- smaller KV cache enables larger context fits per GPU
Model Optimization
- MoE to expand available params without increasing active compute
- RMSNorm in Mamba layers to stabilize large-scale training
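The gap between available and active parameters in an MoE model comes down to counting which experts run for each token. The sketch below uses hypothetical toy sizes chosen for easy arithmetic (16 experts, top-2 routing, 4 MoE layers); they are assumptions for illustration and not Jamba's real dimensions.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k, n_moe_layers):
    """Return (available, active) parameter counts for a simple MoE model.

    available: every expert in every MoE layer is counted.
    active:    only the top_k experts routed per token are counted.
    """
    available = shared_params + n_moe_layers * n_experts * params_per_expert
    active = shared_params + n_moe_layers * top_k * params_per_expert
    return available, active


# Hypothetical toy sizes, not Jamba's real ones.
available, active = moe_param_counts(
    shared_params=1_000, params_per_expert=100,
    n_experts=16, top_k=2, n_moe_layers=4,
)
print(available, active)  # 7400 1800
```

Per-token compute tracks the active count, while weight memory tracks the available count, which is why MoE raises capacity without raising per-token compute but still needs the full weights in memory.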
System Optimization
- fits longer contexts on one GPU, cutting inter-GPU communication
Training Optimization
- expert parallelism and sequence parallelism
- trained with mixed parallel strategies (FSDP, tensor parallelism)
Inference Optimization
- high Mamba:attention ratio to lower attention compute on long contexts
- int8 weights to fit model on 80GB GPU
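A quick back-of-the-envelope check shows why int8 weights matter for the single-GPU claim. The sketch below counts weight memory only; it ignores activations, the KV cache, and framework overhead, so real headroom is smaller than the numbers suggest.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for model weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


TOTAL_PARAMS = 52e9   # 52B available parameters, as stated in this summary
GPU_MEMORY_GB = 80    # one 80GB GPU

for name, nbytes in [("16-bit", 2), ("int8", 1)]:
    gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    fits = gb < GPU_MEMORY_GB
    print(f"{name}: {gb:.0f} GB of weights, fits on one GPU: {fits}")
# 16-bit: 104 GB of weights, fits on one GPU: False
# int8: 52 GB of weights, fits on one GPU: True
```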
Reproducibility
License
- Apache-2.0
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Released model is a base pretrained checkpoint without alignment, instruction tuning, or moderation.
- On some tasks the hybrid matches SOTA, but aggregate benchmarks show mixed wins and losses.
- Pure Mamba struggles with input/output format adherence and in-context learning; the hybrid needs at least some attention layers.
- Large-scale Mamba requires RMSNorm to avoid loss spikes during training.
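For readers unfamiliar with RMSNorm, the sketch below shows the standard operation on a single activation vector in plain Python. Where exactly it is applied inside the Mamba layer is not specified in this summary, so this illustrates only the normalization itself; the `eps` default is an assumed, commonly used value.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1.

    x:      list of floats (one activation vector)
    weight: optional per-element learned gain; defaults to all ones
    eps:    small constant for numerical stability (assumed value)
    """
    if weight is None:
        weight = [1.0] * len(x)
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


# Very large activations are pulled back to unit scale.
print(rms_norm([3000.0, 4000.0]))
```

Because the output scale no longer depends on the input magnitude, unusually large activations cannot propagate unchecked, which is the usual intuition for why such normalization helps against loss spikes.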
When Not To Use
- Directly in user-facing systems without alignment or moderation.
- When you need a chat/instruction-tuned model out of the box.
- If your infra is optimized only for pure attention models and cannot support SSM or MoE execution.
Failure Modes
- Format non-adherence and poor in-context learning with pure Mamba configurations.
- Training instability (loss spikes) without RMSNorm in Mamba internals.
- Quality may lag on some benchmarks despite gains on long-context tasks.
Core Entities
Models
- Jamba
- Mixtral-8x7B
- Llama-2 70B
- Llama-2 13B
- Mistral 7B
- Gemma
- Mamba (SSM)
- Hyena / StripedHyena
Metrics
- tokens/second (throughput)
- KV cache memory (GB)
- F1 (QA long-context)
- Accuracy
- Log-prob per byte (perplexity proxy)
- Open LLM Leaderboard (OLLM) score
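As an illustration of the F1 metric listed above, here is a minimal token-overlap F1 in the style commonly used for extractive QA. The exact normalization used in the long-context evaluation is not specified in this summary, so lowercasing and whitespace tokenization are assumptions.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat"))
```

Averaging this score over all question-answer pairs gives the kind of "avg F1" figure reported for QA benchmarks.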
Datasets
- L-Eval
- Needle-in-a-haystack
- LongFQA
- NarrativeQA
- CUAD
- SFiction
- Trec-Fine
- NLU Intent
- Banking77
- CLINC150
- Natural Questions
- HellaSwag
- WinoGrande
- ARC-E / ARC-Challenge
- PIQA
- BoolQ
- GSM8K
- HumanEval
- TruthfulQA
- MMLU
- BBH
- C4
- Books
- code
Benchmarks
- HellaSwag
- WinoGrande
- ARC
- PIQA
- BoolQ
- QuAC
- GSM8K
- HumanEval
- Natural Questions
- TruthfulQA
- MMLU
- BBH
- Long-context QA (L-Eval subsets)

