Two open 1.8B LLMs (base + chat) trained with FP8 and staged data; Danube2 tops open leaderboard under 2B

January 30, 2024 · 7 min read

Overview

Production Readiness

0.8

Novelty Score

0.4

Cost Impact Score

0.8

Citation Count

3

Authors

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

Links

Abstract / PDF

Why It Matters For Business

You get permissively licensed, high-performing small LLMs (1.8B) ready for commercial use; smaller models cut inference cost and enable community fine-tuning under Apache 2.0.

Summary TLDR

H2O.ai releases H2O-Danube-1.8B and an improved continuation, H2O-Danube2-1.8B, which is initialized from Danube and trained on an additional 2T tokens. Both are 1.8B-parameter decoder models with public weights under Apache 2.0. The team used FP8 for speed, progressive context-length training, grouped-query attention, and staged data filtering. By its reported average (48.72), Danube2 ranks highest on the Hugging Face Open LLM Leaderboard among open models below 2B parameters. The team also publishes chat variants tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).
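The DPO stage optimizes a preference objective over pairs of chosen and rejected responses. The following is a minimal sketch of the standard per-pair DPO loss, assuming the summed log-probability of each response under the policy and under a frozen reference model is already available. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not values taken from the paper.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin))).

    beta is an assumed hyperparameter value, not one reported in the summary.
    """
    # How much more (in log-prob) the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow in exp().
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference agree exactly, the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one.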

Problem Statement

Small, permissively licensed LLMs are useful for low-cost inference and community fine-tuning. The paper asks: how far can careful training, precision tricks (FP8), and staged data push a 1.8B model's performance so it competes with larger or more heavily trained alternatives?

Main Contribution

Release of H2O-Danube-1.8B (trained on 1T tokens) and H2O-Danube2-1.8B (continued training +2T tokens) under Apache 2.0.

Practical training recipe: FP8 for many linear ops, grouped-query attention, progressive sequence-length schedule up to 16k, and data-stage filtering.

Published SFT and DPO chat models and evaluation on standard benchmarks and MT-Bench (GPT-4 judged).

Changes for Danube2: remove the sliding window, reduce the maximum context to 8k, switch to the Mistral tokenizer, and use staged data mixes to raise data quality.

Key Findings

Danube2 is top-ranked among open models below 2B on Hugging Face Open LLM Leaderboard

Numbers: Average score 48.72 (Table 6)

Continued training (+2T tokens) and dataset staging improved the original Danube average from 39.12 to 48.72

Numbers: Avg 39.12 -> 48.72 (+9.6)

FP8 training and setup achieved 292.7k tokens/sec throughput on an 8×H100 node

Numbers: Throughput 292.7k tokens/s

Chat fine-tune (SFT + DPO) yields strong MT-Bench chat results; H2O-Danube-1.8B-Chat turn1 avg = 6.41

Numbers: MT-Bench turn1 avg 6.41 (Table 3)

The model does not target code tasks and performs poorly on the GSM8k math benchmark

Numbers: GSM8k 5-shot score 1.44 (Danube base, Table 2)

Results

Open LLM Leaderboard average (Danube base)

Value: 39.12

Open LLM Leaderboard average (Danube2 base)

Value: 48.72

Baseline: H2O-Danube 39.12

MT-Bench turn1 average (H2O-Danube-1.8B-Chat)

Value: 6.41

Baseline: Stablelm-2-Zephyr 6.41

Pretraining tokens

Value: 1.0T (Danube) + 2.0T (Danube2 continued)

Training throughput

Value: 292.7k tokens/s
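For a sense of scale, the reported throughput can be turned into a rough wall-clock estimate. The sketch below assumes the 292.7k tokens/s figure is sustained for the whole run on a single node, which this summary does not state; real throughput changes with sequence length and hardware count. The helper name `training_days` is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def training_days(total_tokens, tokens_per_second):
    """Days needed to process total_tokens at a constant tokens_per_second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return total_tokens / tokens_per_second / SECONDS_PER_DAY

# Assumption: constant single-node throughput for the full 1T-token Danube run.
days_for_1t = training_days(1.0e12, 292_700)
# Same assumption applied to the additional 2T tokens of Danube2.
days_for_2t = training_days(2.0e12, 292_700)
```

Under that assumption, 1T tokens works out to roughly 39.5 days on one node, which suggests why throughput gains from FP8 matter at this scale.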

Who Should Care

What To Try In 7 Days

Run H2O-Danube2-1.8B on representative tasks to compare latency and baseline quality.

Fine-tune the released SFT weights on your domain-specific dialogs or instructions.

Evaluate chat models with MT-Bench or human checks, focusing on non-math categories where they perform well.
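The latency comparison in the first item can start from a small timing harness. This is a minimal sketch: `measure_latency` and `fake_generate` are hypothetical names, and the stand-in generator should be replaced by a real call into whichever inference stack serves the model.

```python
import statistics
import time

def measure_latency(generate, prompts, warmup=1):
    """Time generate(prompt) for every prompt and summarize the results in seconds."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are executed but not timed
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return {
        "n": len(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }

def fake_generate(prompt):
    # Placeholder standing in for a real model call (assumption for illustration).
    return prompt.upper()

report = measure_latency(fake_generate, ["summarize this ticket", "draft a reply"])
```

Using prompts drawn from real traffic, and fixing the maximum number of generated tokens, keeps the comparison between models fair.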

Optimization Features

Token Efficiency

  • Long-context behavior adjusted during training (sliding-window attention removed in Danube2)
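To make the sliding-window change concrete, the sketch below builds both kinds of attention mask for a short sequence. It is illustrative only: the function names are hypothetical, and no window size from the paper is assumed.

```python
def causal_mask(seq_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

def sliding_window_mask(seq_len, window):
    """Causal mask that additionally limits each query to the last `window` keys."""
    return [
        [(k <= q) and (q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With full causal attention every earlier token stays visible; with a sliding window, tokens older than `window` positions are masked out, which is the restriction Danube2 drops.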

Infra Optimization

  • Use of FlashAttention-2 implementation

Model Optimization

  • Grouped-query attention (32 heads, 8 KV heads)
  • RMSNorm instead of LayerNorm
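Both items can be illustrated in a few lines of plain Python. The sketch assumes the common convention in which consecutive query heads share one KV head; the function names and the `eps` default are hypothetical, and only the 32/8 head counts come from the summary.

```python
import math

def kv_head_for(query_head, n_heads=32, n_kv_heads=8):
    """Index of the KV head used by a query head (contiguous grouping assumed)."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads sharing one KV head
    return query_head // group_size

def kv_cache_reduction(n_heads=32, n_kv_heads=8):
    """Factor by which the KV cache shrinks relative to one KV head per query head."""
    return n_heads / n_kv_heads

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square; no mean subtraction, no bias."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]
```

With 32 query heads and 8 KV heads, each KV head serves 4 query heads, so the KV cache is a quarter of the size it would be with full multi-head attention.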

System Optimization

  • Single-node 8×H100 distributed data parallel (DDP) training with a 1.18M-token batch

Training Optimization

  • FP8 for many linear layers (Hopper) while keeping lm_head bfloat16
  • Progressive sequence-length schedule up to 16,384
  • Staged data mixes to increase data quality over time
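A progressive sequence-length schedule can be expressed as a lookup from tokens seen to the current context length. In the sketch below the stage boundaries are hypothetical placeholders chosen for illustration; only the 16,384 cap and the 1.18M-token batch come from this summary, and all names are assumptions.

```python
# HYPOTHETICAL boundaries (cumulative tokens seen, sequence length).
# Only the final 16,384 length is taken from the summary.
SCHEDULE = [
    (0.50e12, 2_048),
    (0.70e12, 4_096),
    (0.85e12, 8_192),
    (1.00e12, 16_384),
]

def seq_len_at(tokens_seen, schedule=SCHEDULE):
    """Sequence length in effect after tokens_seen training tokens."""
    for boundary, seq_len in schedule:
        if tokens_seen < boundary:
            return seq_len
    return schedule[-1][1]  # stay at the longest length past the last boundary

def sequences_per_batch(token_batch, seq_len):
    """Whole sequences that fit in a fixed token budget per step."""
    return token_batch // seq_len
```

Holding the token batch fixed means the number of sequences per step shrinks as the context grows: a budget of 1,180,000 tokens holds 576 sequences at length 2,048 but only 72 at length 16,384.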

Reproducibility

License

  • Apache 2.0

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Not trained on code; authors excluded coding data.
  • Weak on GSM8k (math) and MMLU compared with some peers.
  • Automated chat evaluation relies on GPT-4 judgments rather than large human studies.
  • Full training data composition and exact preprocessing heuristics are summarized but not fully public.

When Not To Use

  • Tasks requiring reliable math reasoning or code generation without further fine-tuning.
  • High-stakes settings that require extensive human evaluation or certification.
  • When you need strict provenance of training data for regulatory reasons.

Failure Modes

  • Hallucinations typical of decoder LLMs on factual queries.
  • Poor performance on math and some multi-step reasoning benchmarks.
  • Overfitting to patterns in staged data if misapplied for niche domains.

Core Entities

Models

  • H2O-Danube-1.8B
  • H2O-Danube2-1.8B
  • H2O-Danube-1.8B-Chat
  • H2O-Danube2-1.8B-Chat

Metrics

  • Accuracy
  • exact_match
  • MT-Bench score
  • OpenLLM average

Datasets

  • OpenOrca
  • MetaMathQA
  • UltraChat200k
  • Oasst2
  • UltraFeedback Binarized
  • Orca DPO Pairs
  • Distilabel Math Preference DPO

Benchmarks

  • ARC
  • HellaSwag
  • OpenBookQA
  • PIQA
  • Winogrande
  • TriviaQA
  • BoolQ
  • MMLU
  • TruthfulQA
  • GSM8k
  • MT-Bench
  • Open LLM Leaderboard

Context Entities

Models

  • TinyLlama
  • Falcon
  • Qwen
  • Stable LM 2
  • Phi-1.5
  • Gemma-2B