Dicta-LM 3.0 — open-weight Hebrew LLMs (24B/12B/1.7B) with 65k context and a new Hebrew chat benchmark

February 2, 2026 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides concrete training recipes, dataset composition, and leaderboard evaluations. The models are released, but the full training code and raw training data are not public, so reproducibility is limited.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: No open assets linked

Open source: Partial

License: permissive (models on HuggingFace; training code/data not fully published)

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

Why It Matters For Business

Open-weight, Hebrew-specialist LLMs cut integration time for Hebrew products, enable local legal/regulatory control, and let teams prototype long-document features without building retrieval layers.

Who Should Care

Summary TLDR

Dicta-LM 3.0 is a released suite of open-weight Hebrew-focused LLMs (24B, 12B, 1.7B) trained by continuing pretraining on ~100B Hebrew tokens mixed with ~30B English tokens. Models support 65k native context, come in base and chat variants (instruct and "thinking" reasoning style), and are post-trained with supervised fine-tuning, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). The 24B thinking model achieves top scores on several Hebrew tasks (notably diacritization and trivia) and the collection is published on HuggingFace with a permissive license.

Problem Statement

Frontier open-weight models lack strong coverage for low-resource languages. Hebrew has limited large corpora, complex morphology, and poor evaluation tools. Practitioners need sovereign Hebrew models and a way to evaluate chat-style Hebrew capabilities.

Main Contribution

Released Dicta-LM 3.0 models: 24B, 12B, 1.7B (base and chat variants) with native 65k context.

Continued pretraining on ~100B Hebrew tokens (75% of pretraining) mixed with ~30B English tokens.

Key Findings

Continued Hebrew-focused pretraining improved Hebrew leaderboard averages.

Numbers: 24B avg +6.5 points; 12B +12.8; 1.7B +8.2 (Table 3)

Practical Use: If you adapt a strong English base model and continue pretraining on dedicated Hebrew data, expect multi-point gains across Hebrew tasks versus the original base models.

Evidence Ref: Table 3

Models support very long context windows trained end-to-end.

Numbers: Native context length = 65,280 tokens; phase 2 trained on 18B tokens with long-context sampling

Practical Use: Use these models for tasks needing long documents (books, transcripts) without external chunking or complex retrieval.

Evidence Ref: Section 3.2, Table 2
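
Before relying on the long window, it helps to check that a document plus the expected output actually fits. The sketch below assumes the 65,280-token figure quoted above. `whitespace_count` is a hypothetical stand-in for a tokenizer and the 1,024-token output budget is an arbitrary illustrative value; neither comes from the paper.

```python
from typing import Callable

NATIVE_CONTEXT = 65_280  # native context length reported in the summary


def whitespace_count(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words.

    Real subword tokenizers usually produce more tokens than this,
    so swap in the model's own tokenizer for real measurements.
    """
    return len(text.split())


def fits_in_context(
    text: str,
    max_new_tokens: int = 1024,
    count_tokens: Callable[[str], int] = whitespace_count,
    context: int = NATIVE_CONTEXT,
) -> bool:
    """Return True if the prompt plus the reserved output budget fits."""
    if max_new_tokens < 0:
        raise ValueError("max_new_tokens must be non-negative")
    return count_tokens(text) + max_new_tokens <= context
```

Passing the real tokenizer's counting function as `count_tokens` keeps the check independent of any particular library.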

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Hebrew leaderboard average (24B) | 66.0 -> 72.5 (DictaLM-3.0) | Mistral-Small-3.1-24B | +6.5 | Hebrew LLM Leaderboard (various tasks) | Table 3 shows average improvement +6.5 for 24B after CPT | Table 3
Hebrew leaderboard average (12B) | 53.7 -> 66.5 | Nemotron-Nano-12B-v2 | +12.8 | Hebrew LLM Leaderboard (various tasks) | Table 3 reports +12.8 average improvement for 12B | Table 3
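
The reported deltas can be recomputed from the before and after averages as a sanity check. The sketch below uses only the numbers listed in the results above. The relative gain is a derived convenience figure and is not reported in the summary.

```python
# (before, after, reported delta) copied from the results above
ROWS = {
    "24B": (66.0, 72.5, 6.5),
    "12B": (53.7, 66.5, 12.8),
}


def delta(before: float, after: float) -> float:
    """Absolute gain in leaderboard points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain as a percentage of the baseline score (derived, not reported)."""
    return round(100.0 * (after - before) / before, 1)


for size, (before, after, reported) in ROWS.items():
    # Fails loudly if a reported delta disagrees with the before/after pair.
    assert delta(before, after) == reported
    print(size, delta(before, after), relative_gain(before, after))
```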

What To Try In 7 Days

Download DictaLM-3.0-24B base from HuggingFace and test Hebrew QA and trivia workloads.

Pilot nikud (diacritization) with the 24B-Thinking chat to automate Hebrew text normalization.

Run the new Hebrew chat benchmark on your internal prompts to compare in-house models quickly.
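
For the nikud pilot, a simple word-level score is enough to compare model output against a hand-diacritized reference. The sketch below is one plausible reading of "percent words correct": an exact match per whitespace-separated word. That definition is an assumption, not the paper's published scoring script.

```python
import unicodedata


def word_accuracy(gold: str, predicted: str) -> float:
    """Percent of whitespace-separated words that match exactly.

    Both strings are NFC-normalized first so that the same vowel marks
    stored in a different order still compare equal. Assumes the model
    kept the word count unchanged; a mismatch raises ValueError.
    """
    gold_words = unicodedata.normalize("NFC", gold).split()
    pred_words = unicodedata.normalize("NFC", predicted).split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted text differ in word count")
    if not gold_words:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_words, pred_words))
    return 100.0 * correct / len(gold_words)
```

Raising on a word-count mismatch is a deliberate choice: a model that drops or merges words would otherwise be scored against misaligned references.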

Agent Features

Memory
long-context (65k token native context)
Tool Use
tool-calling support (Hermes-style JSON schema)
tool_response tokens (<tool_response> ... </tool_response>)
Frameworks
Hermes tool-calling convention
Qwen3 message delimiter tokens
Is Agentic

Yes
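
The tool-use loop implied by the features above can be sketched as follows. The `<tool_response>` wrapper is taken from the summary. The `<tool_call>` wrapper and the `{"name": ..., "arguments": ...}` payload follow the general Hermes convention and are assumed here, not confirmed for these models. The `registry` of Python callables is hypothetical.

```python
import json
import re
from typing import Any, Callable, Dict, List

# Assumed Hermes-style wrapper around a JSON object describing the call.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(model_output: str) -> List[Dict[str, Any]]:
    """Extract every tool call found in the model's text output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(model_output):
        call = json.loads(raw)
        calls.append({"name": call["name"], "arguments": call.get("arguments", {})})
    return calls


def format_tool_response(result: Any) -> str:
    """Wrap a tool result in the tool_response tags named in the summary."""
    body = json.dumps(result, ensure_ascii=False)  # keep Hebrew text readable
    return "<tool_response>\n" + body + "\n</tool_response>"


def run_tools(model_output: str, registry: Dict[str, Callable[..., Any]]) -> List[str]:
    """Execute each requested tool and return the formatted responses."""
    responses = []
    for call in parse_tool_calls(model_output):
        fn = registry[call["name"]]  # KeyError if the model names an unknown tool
        responses.append(format_tool_response(fn(**call["arguments"])))
    return responses
```

In practice the chat template shipped with the model should decide the exact delimiters, so treat the regular expression as a placeholder until it is checked against real output.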

Architectures
transformer (base models adapted from Mistral, Nemotron, Qwen)

Optimization Features

Token Efficiency
packed sequences into 65k tokens (first-fit-decreasing packing)
Accuracy
Model Optimization
continued pretraining from SOTA bases to save compute
GRPO
System Optimization
NVIDIA NeMo, NeMo-RL, Megatron-LM, vLLM used for scaling and training
Training Optimization
two-phase CPT: 4,096 then 65k context phase
sampling 75% long docs and 25% short docs in long-context phase
used 80 H200 GPUs on NVIDIA DGX Cloud Lepton
Inference Optimization
context-parallel training settings (Context Parallelism up to 16 for 12B)
Nemotron hybrid SSM architecture advantage noted for throughput
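
First-fit-decreasing packing, mentioned under Token Efficiency, sorts documents from longest to shortest and places each one in the first sequence that still has room. The sketch below is a minimal textbook version and not the authors' pipeline. The default capacity reuses the 65,280-token context length as an assumption, and documents longer than the capacity are rejected instead of split.

```python
from typing import List


def pack_first_fit_decreasing(lengths: List[int], capacity: int = 65_280) -> List[List[int]]:
    """Group document indices into bins whose total length fits `capacity`.

    `lengths[i]` is the token count of document i. Returns a list of bins,
    each a list of document indices in the order they were placed.
    """
    # Visit documents from longest to shortest (stable for equal lengths).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for i in order:
        n = lengths[i]
        if n > capacity:
            raise ValueError(f"document {i} ({n} tokens) exceeds capacity {capacity}")
        for b, room in enumerate(free):
            if n <= room:  # first bin with enough room wins
                bins[b].append(i)
                free[b] -= n
                break
        else:  # no existing bin fits, so open a new one
            bins.append([i])
            free.append(capacity - n)
    return bins
```

The inner scan makes this quadratic in the worst case, which is fine for a sketch; large corpora would need an indexed structure over the free space.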

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: permissive (models on HuggingFace; training code/data not fully published)

Risks & Boundaries

Limitations

Training data mixes internal, scraped, and partnered proprietary sources that are not fully released.

No public release of the full pretraining corpus or exact data-cleaning scripts.

When Not To Use

When you require fully auditable, public training-data provenance for compliance.

When your main use case is code generation; the authors did not prioritize code tasks.

Failure Modes

Possible leftover biases or private-data leakage from proprietary or scraped sources.

Language drift if used in mixed-language pipelines without prompt constraints (model may switch languages after tool outputs).

Core Entities

Models

DictaLM-3.0-24B-Base
DictaLM-3.0-24B-Thinking
DictaLM-3.0-Nemotron-12B-Base
DictaLM-3.0-Nemotron-12B-Instruct
DictaLM-3.0-1.7B-Base
DictaLM-3.0-1.7B-Instruct
DictaLM-3.0-1.7B-Thinking
Mistral-Small-3.1-24B
NVIDIA-Nemotron-Nano-12B-v2
Qwen3-1.7B

Metrics

Accuracy
Leaderboard average score
Nikud percent words correct
Win rate vs Gemini-2.5-Pro (chat)
English capability retention (>98%)

Datasets

Hebrew pretraining corpus (~100B tokens)
English mix (~30B tokens)
Nemotron-CC
FineWeb-Edu
SlimPajama
SFT
Nemotron Post Training Dataset
CCMatrix English-Hebrew pairs
In-house diacritized nikud corpus

Benchmarks

Hebrew LLM Leaderboard (base few-shot)
New Hebrew chat benchmark (summarization, translation, Winograd, Israeli trivia, nikud)
CommonsenseQA
WinoGrande
ARC-Challenge
OLMES evaluation suite

Context Entities

Models

Gemma-3-27B
aya-expanse-32B
Llama-3.3-70B-Instruct
gemma-3-12b-it
Qwen3-14B

Metrics

Accuracy
Token usage (instruct vs thinking efficiency)
GRPO

Datasets

Ben-Yehuda project
Sefaria
Social media corpora (Hebrew Twitter, blogs)
News & legal transcripts
Hebrew Treebank / UD Hebrew

Benchmarks

MATH OMEGA
BigBenchHard
MMLU
AlpacaEval 2