Dicta-LM 3.0 — open-weight Hebrew LLMs (24B/12B/1.7B) with 65k context and a new Hebrew chat benchmark

Overview

Decision Snapshot: Needs Validation

The paper provides concrete training recipes, dataset composition, and leaderboard evaluations. The models are released, but the full training code and raw training data are not public, which limits reproducibility.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Partial

License: permissive (models on HuggingFace; training code/data not fully published)

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

Links

Abstract / PDF

Why It Matters For Business

Open-weight, Hebrew-specialist LLMs cut integration time for Hebrew products, enable local legal/regulatory control, and let teams prototype long-document features without building retrieval layers.

Who Should Care

CTO, Product Manager, ML Engineer, Founder, Engineering Lead, Data Scientist

Summary TLDR

Dicta-LM 3.0 is a released suite of open-weight Hebrew-focused LLMs (24B, 12B, 1.7B) trained by continuing pretraining on ~100B Hebrew tokens mixed with ~30B English tokens. Models support 65k native context, come in base and chat variants (instruct and "thinking" reasoning style), and are post-trained with supervised fine-tuning, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). The 24B thinking model achieves top scores on several Hebrew tasks (notably diacritization and trivia) and the collection is published on HuggingFace with a permissive license.

Problem Statement

Frontier open-weight models lack strong coverage for low-resource languages. Hebrew has limited large corpora, complex morphology, and poor evaluation tools. Practitioners need sovereign Hebrew models and a way to evaluate chat-style Hebrew capabilities.

Main Contribution

Released Dicta-LM 3.0 models: 24B, 12B, 1.7B (base and chat variants) with native 65k context.

Continued pretraining on ~100B Hebrew tokens (75% of pretraining) mixed with ~30B English tokens.

Key Findings

Continued Hebrew-focused pretraining improved Hebrew leaderboard averages.

Numbers: 24B avg +6.5 points; 12B +12.8; 1.7B +8.2 (Table 3)

Practical Use: If you adapt a strong English base and continue pretraining on dedicated Hebrew data, expect multi-point gains across Hebrew tasks versus the original base models.

Evidence Ref: Table 3

Models support very long context windows trained end-to-end.

Numbers: Native context length = 65,280 tokens; phase 2 trained on 18B tokens with long-context sampling

Practical Use: Use these models for tasks needing long documents (books, transcripts) without external chunking or complex retrieval.

Evidence Ref: Section 3.2, Table 2
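Before sending a long document in one pass, it helps to check that the prompt plus the generation budget fits the native window. The sketch below is a minimal illustration, not the authors' tooling: the function names are hypothetical, token counts are assumed to come from the model's own tokenizer, and the only figure taken from the summary is the 65,280-token context length.

```python
NATIVE_CONTEXT = 65_280  # native context length reported in the summary


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    context: int = NATIVE_CONTEXT) -> bool:
    """Return True if the prompt plus the generation budget fits the window."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context


def split_token_ids(token_ids, max_new_tokens: int,
                    context: int = NATIVE_CONTEXT):
    """Fallback for over-long inputs: cut the token list into windows that
    each leave room for max_new_tokens of generation."""
    budget = context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return [token_ids[i:i + budget] for i in range(0, len(token_ids), budget)]
```

A document that passes `fits_in_context` can be sent whole; anything longer still needs chunking or retrieval, so the "no external chunking" advice applies only up to the window size.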

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Hebrew leaderboard average (24B)	66.0 -> 72.5 (DictaLM-3.0)	Mistral-Small-3.1-24B	+6.5	Hebrew LLM Leaderboard (various tasks)	Table 3 shows average improvement +6.5 for 24B after CPT	Table 3
Hebrew leaderboard average (12B)	53.7 -> 66.5	Nemotron-Nano-12B-v2	+12.8	Hebrew LLM Leaderboard (various tasks)	Table 3 reports +12.8 average improvement for 12B	Table 3
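The deltas in the table can be re-derived from the baseline and post-CPT averages. The snippet below is a small sanity check using only the four averages and two deltas quoted in the table; the helper name and the tolerance are illustrative choices, and the relative gain is an extra derived figure that the source does not report.

```python
# (model size, baseline average, DictaLM-3.0 average, reported delta)
ROWS = [
    ("24B", 66.0, 72.5, 6.5),
    ("12B", 53.7, 66.5, 12.8),
]


def check_row(baseline: float, value: float, reported_delta: float,
              tol: float = 0.05):
    """Recompute the absolute delta and compare it with the reported one."""
    delta = value - baseline
    relative = 100.0 * delta / baseline  # gain as a percentage of the baseline
    matches = abs(delta - reported_delta) <= tol
    return matches, round(delta, 1), round(relative, 1)


for size, baseline, value, reported in ROWS:
    ok, delta, rel = check_row(baseline, value, reported)
    print(f"{size}: delta={delta:+.1f} ({rel:+.1f}% relative), "
          f"matches reported: {ok}")
```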

What To Try In 7 Days

Download DictaLM-3.0-24B base from HuggingFace and test Hebrew QA and trivia workloads.

Pilot nikud (diacritization) with the 24B-Thinking chat to automate Hebrew text normalization.

Run the new Hebrew chat benchmark on your internal prompts to compare in-house models quickly.
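For the nikud pilot, a word-level score is an easy first measurement. The sketch below is one plausible reading of the "percent words correct" metric listed later under Core Entities; the paper's exact scoring rules are not given in this summary, so the whitespace tokenization, the NFC normalization step, and the function name are all assumptions.

```python
import unicodedata


def nikud_word_accuracy(predicted: str, reference: str) -> float:
    """Percentage of whitespace-separated words whose diacritized form
    matches the reference exactly (after Unicode NFC normalization)."""
    # Normalize so that equivalent orderings of combining marks compare equal.
    pred = unicodedata.normalize("NFC", predicted).split()
    ref = unicodedata.normalize("NFC", reference).split()
    if not ref:
        raise ValueError("reference is empty")
    if len(pred) != len(ref):
        # Assumption: the model must not add or drop words; treat that as an error.
        raise ValueError("predicted and reference word counts differ")
    correct = sum(p == r for p, r in zip(pred, ref))
    return 100.0 * correct / len(ref)
```

Raising on a word-count mismatch is a deliberate choice here: a diacritizer that adds or drops words is failing in a different way than one that places marks incorrectly, and the two cases are worth tracking separately.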

Agent Features

Memory

long-context (65k token native context)

Tool Use

tool-calling support (Hermes-style JSON schema)

tool_response tokens (<tool_response> ... </tool_response>)
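A minimal sketch of the client side of such a tool loop follows. The summary only confirms the `<tool_response>` tags; the `<tool_call>` wrapper and the `name`/`arguments` JSON keys are assumptions based on the Hermes convention as commonly documented, and both helper names are hypothetical. Check the model card's chat template before relying on any of this.

```python
import json
import re

# Assumption: the model emits calls as <tool_call>{...JSON...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text: str):
    """Return (name, arguments) pairs for every tool call found in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1).strip())
        calls.append((payload["name"], payload.get("arguments", {})))
    return calls


def format_tool_response(result) -> str:
    """Wrap a tool result in the <tool_response> tags named in the summary."""
    body = json.dumps(result, ensure_ascii=False)  # keep Hebrew text readable
    return "<tool_response>\n" + body + "\n</tool_response>"
```

The string from `format_tool_response` would then be appended to the conversation as the tool turn before asking the model to continue.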

Frameworks

Hermes tool-calling convention

Qwen3 message delimiter tokens

Is Agentic

Yes

Architectures

transformer (base models adapted from Mistral, Nemotron, Qwen)

Optimization Features

Token Efficiency

packed sequences into 65k tokens (first-fit-decreasing packing)

Accuracy
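First-fit-decreasing packing can be sketched in a few lines. This is a generic textbook version of the heuristic, not the authors' implementation: the function name is hypothetical, the default capacity reuses the 65,280-token context figure, and documents longer than the capacity are rejected here, whereas a real pipeline would presumably split or truncate them.

```python
def pack_first_fit_decreasing(lengths, capacity: int = 65_280):
    """Group documents into packed sequences of at most `capacity` tokens.

    lengths: token count of each document.
    Returns a list of bins, each a list of document indices.
    """
    # Visit documents from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []  # each entry is [remaining_capacity, [doc indices]]
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"document {i} ({n} tokens) exceeds capacity")
        for b in bins:
            if b[0] >= n:  # first bin with enough room wins
                b[0] -= n
                b[1].append(i)
                break
        else:
            bins.append([capacity - n, [i]])  # no bin fits: open a new one
    return [b[1] for b in bins]
```

Sorting longest-first means large documents claim bins early and short ones fill the leftover gaps, which reduces padding compared with packing in arrival order.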

Model Optimization

continued pretraining from SOTA bases to save compute

GRPO

System Optimization

NVIDIA NeMo, NeMo-RL, Megatron-LM, vLLM used for scaling and training

Training Optimization

two-phase CPT: 4,096 then 65k context phase

sampling 75% long docs and 25% short docs in long-context phase

used 80 H200 GPUs on NVIDIA DGX Cloud Lepton
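The 75/25 long-versus-short mix in the long-context phase can be illustrated with a simple sampler. The summary does not say whether the ratio applies per document or per token, nor how "long" is defined, so the sketch below assumes per-document sampling from two pre-split pools; the function name and the seed are hypothetical.

```python
import random


def sample_long_context_mix(long_docs, short_docs, n: int,
                            p_long: float = 0.75, seed: int = 0):
    """Draw n documents, choosing from the long pool with probability p_long
    and from the short pool otherwise (sampling with replacement)."""
    if not long_docs or not short_docs:
        raise ValueError("both pools must be non-empty")
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    batch = []
    for _ in range(n):
        pool = long_docs if rng.random() < p_long else short_docs
        batch.append(rng.choice(pool))
    return batch
```

Keeping some short documents in the mix is a common way to avoid degrading short-prompt behaviour while the model adapts to the longer window.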

Inference Optimization

context-parallel training settings (Context Parallelism up to 16 for 12B)

Nemotron hybrid SSM architecture advantage noted for throughput

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: permissive (models on HuggingFace; training code/data not fully published)

Risks & Boundaries

Limitations

Training data mixes internal, scraped, and partnered proprietary sources that are not fully released.

No public release of the full pretraining corpus or exact data-cleaning scripts.

When Not To Use

When you require fully auditable, public training-data provenance for compliance.

When your main use case is code generation, since the authors did not prioritize code tasks.

Failure Modes

Possible leftover biases or private-data leakage from proprietary or scraped sources.

Language drift if used in mixed-language pipelines without prompt constraints (model may switch languages after tool outputs).

Core Entities

Models

DictaLM-3.0-24B-Base

DictaLM-3.0-24B-Thinking

DictaLM-3.0-Nemotron-12B-Base

DictaLM-3.0-Nemotron-12B-Instruct

DictaLM-3.0-1.7B-Base

DictaLM-3.0-1.7B-Instruct

DictaLM-3.0-1.7B-Thinking

Mistral-Small-3.1-24B

NVIDIA-Nemotron-Nano-12B-v2

Qwen3-1.7B

Metrics

Accuracy

Leaderboard average score

Nikud percent words correct

Win rate vs Gemini-2.5-Pro (chat)

English capability retention (>98%)

Datasets

Hebrew pretraining corpus (~100B tokens)

English mix (~30B tokens)

Nemotron-CC

FineWeb-Edu

SlimPajama

SFT

Nemotron Post Training Dataset

CCMatrix English-Hebrew pairs

In-house diacritized nikud corpus

Benchmarks

Hebrew LLM Leaderboard (base few-shot)

New Hebrew chat benchmark (summarization, translation, Winograd, Israeli trivia, nikud)

CommonsenseQA

WinoGrande

ARC-Challenge

Olmes evaluation suite

Context Entities

Models

Gemma-3-27B

aya-expanse-32B

Llama-3.3-70B-Instruct

gemma-3-12b-it

Qwen3-14B

Metrics

Accuracy

Token usage (instruct vs thinking efficiency)

GRPO

Datasets

Ben-Yehuda project

Sefaria

Social media corpora (Hebrew twitter, blogs)

News & legal transcripts

Hebrew Treebank / UD Hebrew

Benchmarks

MATH OMEGA

BigBenchHard

MMLU

AlpacaEval 2
