Two open 1.8B LLMs (base + chat) trained with FP8 and staged data; Danube2 tops open leaderboard under 2B

January 30, 2024, 7 min read

Overview

Decision Snapshot: Needs Validation

The models are production-ready for many general NLP and chat tasks due to permissive licensing, moderate size, and competitive leaderboard scores; evidence is primarily benchmark and automated chat evaluation, with limited human evaluation.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Yes

License: Apache 2.0

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 40%

Authors

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

Links

Abstract / PDF / Code

Why It Matters For Business

You get permissively licensed, high-performing small LLMs (1.8B) ready for commercial use; smaller models cut inference cost and enable community fine-tuning under Apache 2.0.

Summary TLDR

H2O.ai releases H2O-Danube-1.8B and an improved continuation, H2O-Danube2-1.8B, which is initialized from Danube and trained on an additional 2T tokens. Both are 1.8B-parameter decoder models with public weights under Apache 2.0. The team used FP8 for speed, progressive context-length training, grouped-query attention, and staged data filtering. By the authors' reported average (48.72), Danube2 ranks highest among open models below 2B parameters on the Hugging Face Open LLM Leaderboard. They also publish chat variants tuned with supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO).
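For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper's training code. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities from the policy being trained and from a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    beta=0.1 is an illustrative default, not a value reported in the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# If the policy equals the reference, the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# If the policy favors the chosen response more than the reference does,
# the loss drops below log(2).
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss decreases only when the policy widens the gap between the chosen and rejected responses relative to the reference model, which is what keeps the tuned model anchored to its SFT starting point.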

Problem Statement

Small, permissively licensed LLMs are useful for low-cost inference and community fine-tuning. The paper asks: how far can careful training, precision tricks (FP8), and staged data push a 1.8B model's performance so it competes with larger or more heavily trained alternatives?

Main Contribution

Release of H2O-Danube-1.8B (trained on 1T tokens) and H2O-Danube2-1.8B (continued training +2T tokens) under Apache 2.0.

Practical training recipe: FP8 for many linear ops, grouped-query attention, progressive sequence-length schedule up to 16k, and data-stage filtering.
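To make the grouped-query attention part of the recipe concrete, here is a minimal sketch of how query heads share key/value heads. It uses the head counts listed under Optimization Features (32 heads, 8 KV heads). The contiguous grouping shown is a common convention in open implementations and is an assumption here, not something the summary confirms. The function names are hypothetical.

```python
NUM_QUERY_HEADS = 32  # from the summary
NUM_KV_HEADS = 8      # from the summary

def kv_head_for(query_head, num_query_heads=NUM_QUERY_HEADS,
                num_kv_heads=NUM_KV_HEADS):
    """Return the index of the KV head a given query head attends with.

    Assumes contiguous groups: query heads 0..3 share KV head 0, and so on.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query head index out of range")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def kv_cache_ratio(num_query_heads=NUM_QUERY_HEADS,
                   num_kv_heads=NUM_KV_HEADS):
    """KV-cache size relative to full multi-head attention."""
    return num_kv_heads / num_query_heads

mapping = [kv_head_for(h) for h in range(NUM_QUERY_HEADS)]
ratio = kv_cache_ratio()  # 8 / 32 = 0.25
```

With these head counts, four query heads share each KV head, so the key/value cache is one quarter the size it would be with one KV head per query head. This is the main inference-time benefit of the design.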

Key Findings

Danube2 is top-ranked among open models below 2B on Hugging Face Open LLM Leaderboard

Numbers: Average score 48.72 (Table 6)

Practical Use: For off-the-shelf open models under 2B, try H2O-Danube2-1.8B first when you need a high general benchmark score.

Evidence Ref: Table 6

Continued training (+2T tokens) and dataset staging improved the original Danube average from 39.12 to 48.72

Numbers: Avg 39.12 -> 48.72 (+9.6)

Practical Use: Investing compute in careful continued pretraining and raising data quality can yield large benchmark gains for small models.

Evidence Ref: Table 6 vs Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Open LLM Leaderboard average (Danube base) | 39.12 | - | - | Open LLM Leaderboard (Table 6) | H2O-Danube 1.8B average | Table 6
Open LLM Leaderboard average (Danube2 base) | 48.72 | H2O-Danube 39.12 | +9.6 | Open LLM Leaderboard (Table 6) | H2O-Danube2 average | Table 6
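The reported delta can be checked with a few lines of arithmetic. The sketch below recomputes the absolute gain from the two averages above and adds the relative gain, which the summary does not state. The helper name `score_delta` is illustrative.

```python
def score_delta(baseline, new):
    """Return (absolute gain in points, relative gain as a fraction)."""
    absolute = new - baseline
    relative = absolute / baseline
    return absolute, relative

# Averages as reported in the results above.
abs_gain, rel_gain = score_delta(39.12, 48.72)
print(f"{abs_gain:+.2f} points ({rel_gain:.1%} relative)")
# prints "+9.60 points (24.5% relative)"
```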

What To Try In 7 Days

Run H2O-Danube2-1.8B on representative tasks to compare latency and baseline quality.

Fine-tune the released SFT weights on your domain-specific dialogs or instructions.

Evaluate chat models with MT-Bench or human checks, focusing on non-math categories where they perform well.
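For the latency comparison in the first step, a small timing harness is usually enough to start. The sketch below is model-agnostic. `placeholder_generate` is a hypothetical stand-in that you would replace with a call into whichever inference stack serves the model, and the prompts are made-up examples.

```python
import statistics
import time

def measure_latency(generate, prompts, warmup=1):
    """Time generate(prompt) for each prompt and summarize in seconds."""
    # Warm-up calls are excluded so one-time setup cost does not skew results.
    for prompt in prompts[:warmup]:
        generate(prompt)
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }

def placeholder_generate(prompt):
    # Hypothetical stand-in for a real model call; replace before measuring.
    return prompt.upper()

stats = measure_latency(
    placeholder_generate,
    ["Summarize this support ticket.", "Rewrite this sentence politely."],
)
```

Run the same prompt set against each candidate model on the same hardware so that the numbers are comparable. Pair the timings with a quality check on the outputs, since latency alone says nothing about whether the answers are usable.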

Optimization Features

Token Efficiency
Long-context behavior adjusted during training (sliding-window attention removed in Danube2)
Infra Optimization
Use of FlashAttention-2 implementation
Model Optimization
Grouped-query attention (32 heads, 8 KV heads)
RMSNorm instead of LayerNorm
System Optimization
Single-node 8×H100 DDP with 1.18M token batch for throughput
Training Optimization
FP8 for many linear layers (Hopper) while keeping lm_head in bfloat16
Progressive sequence-length schedule up to 16,384
Staged data mixes to increase data quality over time
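The batch size and the sequence-length schedule interact: if the token budget per step stays fixed, longer sequences mean fewer sequences per step. The sketch below works through that arithmetic under stated assumptions. The summary gives only the 1.18M token batch, the 8 GPUs, and the 16,384 endpoint. The intermediate stage lengths and the idea that the token batch stays constant across stages are assumptions made for illustration.

```python
TOKENS_PER_BATCH = 1_180_000  # "1.18M token batch", a rounded figure
NUM_GPUS = 8                  # single-node 8xH100 from the summary

# Hypothetical doubling schedule; only the final 16,384 is in the summary.
STAGE_SEQ_LENS = [2048, 4096, 8192, 16384]

def sequences_per_batch(seq_len, tokens_per_batch=TOKENS_PER_BATCH):
    """Whole sequences that fit in one global batch at this length."""
    return tokens_per_batch // seq_len

schedule = {}
for seq_len in STAGE_SEQ_LENS:
    total = sequences_per_batch(seq_len)
    per_gpu = total // NUM_GPUS
    schedule[seq_len] = (total, per_gpu)
    print(f"seq_len={seq_len:>6}  sequences/batch={total:>4}  per GPU={per_gpu}")
```

Under these assumptions the final stage fits 72 sequences of 16,384 tokens per step, or 9 per GPU. That comes to 1,179,648 tokens, consistent with the rounded 1.18M figure.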

Risks & Boundaries

Limitations

Not trained on code; authors excluded coding data.

Weak on the GSM8k math benchmark and on MMLU compared to some peers.

When Not To Use

Tasks requiring reliable math reasoning or code generation without further fine-tuning.

High-stakes settings that require extensive human evaluation or certification.

Failure Modes

Hallucinations typical of decoder LLMs on factual queries.

Poor performance on math and some multi-step reasoning benchmarks.

Core Entities

Models

H2O-Danube-1.8B, H2O-Danube2-1.8B, H2O-Danube-1.8B-Chat, H2O-Danube2-1.8B-Chat

Metrics

Accuracy, exact_match, MT-Bench score, OpenLLM average

Datasets

OpenOrca, MetaMathQA, UltraChat200k, Oasst2, UltraFeedback Binarized, Orca DPO Pairs, Distilabel Math Preference DPO

Benchmarks

ARC, HellaSwag, OpenBookQA, PIQA, Winogrande, TriviaQA, BoolQ, MMLU, TruthfulQA, GSM8k, MT-Bench, Open LLM Leaderboard

Context Entities

Models

TinyLlama, Falcon, Qwen, Stable LM 2, Phi-1.5, Gemma-2B