Two open 1.8B LLMs (base + chat) trained with FP8 and staged data; Danube2 tops open leaderboard under 2B

Overview

Decision Snapshot: Needs Validation

The models are production-ready for many general NLP and chat tasks due to permissive licensing, moderate size, and competitive leaderboard scores; evidence is primarily benchmark and automated chat evaluation, with limited human evaluation.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/5

Reproducibility

Status: Partial assets available

Open source: Yes

License: Apache 2.0

At A Glance

Cost impact: 80%

Production readiness: 80%

Novelty: 40%

Authors

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati

Why It Matters For Business

You get permissively licensed, high-performing small LLMs (1.8B) ready for commercial use; smaller models cut inference cost and enable community fine-tuning under Apache 2.0.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

H2O.ai releases H2O-Danube-1.8B and an improved continuation, H2O-Danube2-1.8B (initialized from Danube and trained on a further 2T tokens). Both are 1.8B-parameter decoder models with public weights under Apache 2.0. The team used FP8 for speed, progressive context-length training, grouped-query attention, and staged data filtering. By the authors' reported average (48.72), Danube2 ranks highest on the Hugging Face Open LLM Leaderboard among open models below 2B parameters. They also publish chat variants tuned with supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).

Problem Statement

Small, permissively licensed LLMs are useful for low-cost inference and community fine-tuning. The paper asks: how far can careful training, precision tricks (FP8), and staged data push a 1.8B model's performance so it competes with larger or more heavily trained alternatives?

Main Contribution

Release of H2O-Danube-1.8B (trained on 1T tokens) and H2O-Danube2-1.8B (continued training +2T tokens) under Apache 2.0.

Practical training recipe: FP8 for many linear ops, grouped-query attention, progressive sequence-length schedule up to 16k, and data-stage filtering.

Key Findings

Danube2 is top-ranked among open models below 2B on Hugging Face Open LLM Leaderboard

Numbers: Average score 48.72 (Table 6)

Practical Use: For off-the-shelf open models under 2B, try H2O-Danube2-1.8B first when you need a high general benchmark score.

Evidence Ref: Table 6

Continued training (+2T tokens) and dataset staging improved the original Danube average from 39.12 to 48.72

Numbers: Avg 39.12 -> 48.72 (+9.6)

Practical Use: Investing compute in careful continued pretraining and raising data quality can yield large benchmark gains for small models.

Evidence Ref: Table 6 vs Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Open LLM Leaderboard average (Danube base)	39.12	—	—	Open LLM Leaderboard (Table 6)	H2O-Danube 1.8B average	Table 6
Open LLM Leaderboard average (Danube2 base)	48.72	H2O-Danube 39.12	+9.6	Open LLM Leaderboard (Table 6)	H2O-Danube2 average	Table 6
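The delta in the table can be checked with a few lines of arithmetic. The sketch below uses only the two averages reported above; the relative gain is derived here and is not a figure quoted from the paper.

```python
# Averages taken from the Results table above.
danube_avg = 39.12
danube2_avg = 48.72

# Absolute gain in leaderboard points, rounded to match the table.
abs_gain = round(danube2_avg - danube_avg, 2)

# Relative gain over the original Danube average (derived, not from the paper).
rel_gain_pct = round(100 * abs_gain / danube_avg, 1)

print(f"absolute gain: +{abs_gain} points")
print(f"relative gain: {rel_gain_pct}%")
```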

What To Try In 7 Days

Run H2O-Danube2-1.8B on representative tasks to compare latency and baseline quality.

Fine-tune the released SFT weights on your domain-specific dialogs or instructions.

Evaluate chat models with MT-Bench or human checks, focusing on non-math categories where they perform well.
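For the latency comparison in the first step, a small timing harness is usually enough. The sketch below is a minimal example that assumes you wrap your model call in a function taking a prompt string; `measure_latency` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in so the snippet runs without downloading any weights.

```python
import statistics
import time


def measure_latency(generate, prompts, warmup=1):
    """Time generate(prompt) for each prompt and summarise in seconds."""
    # Warm-up calls are executed but not timed.
    for prompt in prompts[:warmup]:
        generate(prompt)

    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)

    return {
        "n": len(timings),
        "mean_s": statistics.mean(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


def fake_generate(prompt):
    """Placeholder for a real model call; swap in your own wrapper."""
    time.sleep(0.001)
    return prompt.upper()


stats = measure_latency(fake_generate, ["first task", "second task", "third task"])
print(stats)
```

Running the same prompts through each candidate model with this harness gives comparable numbers, provided the hardware and generation settings are held constant.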

Optimization Features

Token Efficiency

Long-context behavior adjusted during training (sliding-window attention removed in Danube2)

Infra Optimization

Use of FlashAttention-2 implementation

Model Optimization

Grouped-query attention (32 heads, 8 KV heads)

RMSNorm instead of LayerNorm
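One practical effect of grouped-query attention is a smaller key/value cache at inference time. The sketch below estimates the cache size for one sequence with 32 versus 8 KV heads. The head counts come from the summary above; the layer count, head dimension, and 2-byte values are assumptions chosen for illustration and should be replaced with the values from the model config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# From the summary: 32 query heads share 8 key/value heads.
N_QUERY_HEADS = 32
N_KV_HEADS = 8

# Assumptions for illustration only (not stated in this summary).
N_LAYERS = 24
HEAD_DIM = 80
SEQ_LEN = 16_384  # the longest training context mentioned in the summary

# Baseline: standard multi-head attention keeps one KV head per query head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"multi-head cache:    {mha / 2**20:.0f} MiB")
print(f"grouped-query cache: {gqa / 2**20:.0f} MiB")
print(f"reduction:           {mha / gqa:.0f}x")
```

The reduction factor depends only on the ratio of query heads to KV heads (32 / 8), so it holds regardless of the assumed layer count and head dimension.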

System Optimization

Single-node 8×H100 DDP with 1.18M token batch for throughput

Training Optimization

FP8 for many linear layers (Hopper) while keeping lm_head bfloat16

Progressive sequence-length schedule up to 16,384

Staged data mixes to increase data quality over time
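To see how a fixed token budget interacts with a progressive sequence-length schedule, the sketch below converts the 1.18M-token batch on 8 GPUs into sequences per batch and per GPU. Only the 16,384 endpoint, the batch size, and the GPU count come from the summary; the intermediate stage lengths are hypothetical, and holding the token budget constant across stages is an assumption.

```python
TOKENS_PER_BATCH = 1_180_000  # "1.18M token batch" from the summary (rounded)
N_GPUS = 8                    # single node of 8 H100s

# Hypothetical stage lengths: only the 16,384 endpoint is stated in the summary.
STAGE_SEQ_LENS = [2_048, 4_096, 8_192, 16_384]


def batch_schedule(tokens_per_batch, n_gpus, seq_lens):
    """Return (seq_len, sequences per batch, sequences per GPU) per stage."""
    rows = []
    for seq_len in seq_lens:
        # Round because the quoted batch size is itself a rounded figure.
        n_seq = round(tokens_per_batch / seq_len)
        rows.append((seq_len, n_seq, n_seq // n_gpus))
    return rows


schedule = batch_schedule(TOKENS_PER_BATCH, N_GPUS, STAGE_SEQ_LENS)
for seq_len, n_seq, per_gpu in schedule:
    print(f"seq_len={seq_len:>6}  sequences/batch={n_seq:>4}  per GPU={per_gpu:>3}")

# Rough number of optimizer steps for the 1T tokens used for the first model.
steps_1t = round(1_000_000_000_000 / TOKENS_PER_BATCH)
print(f"steps for 1T tokens: about {steps_1t:,}")
```

Under these assumptions the per-GPU micro-batch shrinks as the context grows, which is the usual trade-off a progressive schedule has to manage.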

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: Apache 2.0

Code URLs

https://huggingface.co/h2oai/h2o-danube2-1.8b-base
https://huggingface.co/h2oai/h2o-danube2-1.8b-chat
https://huggingface.co/h2oai/h2o-danube-1.8b-sft
https://github.com/h2oai/h2o-llmstudio

Risks & Boundaries

Limitations

Not trained on code; authors excluded coding data.

Weak on the GSM8k math benchmark and on MMLU compared to some peers.

When Not To Use

Tasks requiring reliable math reasoning or code generation without further fine-tuning.

High-stakes settings that require extensive human evaluation or certification.

Failure Modes

Hallucinations typical of decoder LLMs on factual queries.

Poor performance on math and some multi-step reasoning benchmarks.

Core Entities

Models

H2O-Danube-1.8B, H2O-Danube2-1.8B, H2O-Danube-1.8B-Chat, H2O-Danube2-1.8B-Chat

Metrics

Accuracy, exact_match, MT-Bench score, OpenLLM average

Datasets

OpenOrca, MetaMathQA, UltraChat200k, Oasst2, UltraFeedback Binarized, Orca DPO Pairs, Distilabel Math Preference DPO

Benchmarks

ARC, HellaSwag, OpenBookQA, PIQA, Winogrande, TriviaQA, BoolQ, MMLU, TruthfulQA, GSM8k, MT-Bench, Open LLM Leaderboard

Context Entities

Models

TinyLlama, Falcon, Qwen, Stable LM 2, Phi-1.5, Gemma-2B
