Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Open-weight, Hebrew-specialist LLMs cut integration time for Hebrew products, enable local legal/regulatory control, and let teams prototype long-document features without building retrieval layers.
Summary TLDR
Dicta-LM 3.0 is a released suite of open-weight Hebrew-focused LLMs (24B, 12B, 1.7B) trained by continuing pretraining on ~100B Hebrew tokens mixed with ~30B English tokens. Models support 65k native context, come in base and chat variants (instruct and "thinking" reasoning style), and are post-trained with supervised fine-tuning, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). The 24B thinking model achieves top scores on several Hebrew tasks (notably diacritization and trivia) and the collection is published on HuggingFace with a permissive license.
Problem Statement
Frontier open-weight models lack strong coverage for low-resource languages. Hebrew has limited large corpora, complex morphology, and poor evaluation tools. Practitioners need sovereign Hebrew models and a way to evaluate chat-style Hebrew capabilities.
Main Contribution
Released Dicta-LM 3.0 models: 24B, 12B, 1.7B (base and chat variants) with native 65k context.
Continued pretraining on ~100B Hebrew tokens (75% of pretraining) mixed with ~30B English tokens.
Built and published a Hebrew chat benchmark suite covering summarization, translation, Winograd, Israeli trivia, and diacritization (nikud).
Post-training pipeline: supervised fine-tuning (instruct/thinking), DPO, and GRPO for reasoning/alignment.
Open-weight release on HuggingFace under a permissive license.
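The GRPO step in the post-training pipeline above can be illustrated with its core idea: each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative advantage normalization. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration, not the authors' configuration.

```python
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are illustrative assumptions.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = variance ** 0.5
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt, two judged correct (reward 1.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to form the baseline.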
Key Findings
Continued Hebrew-focused pretraining improved Hebrew leaderboard averages.
Models support a native 65k-token context window, trained in a dedicated long-context phase of continued pretraining.
Chat variants show strong Hebrew-specialized task performance.
English capability largely retained after heavy Hebrew focus.
The models and the chat benchmark are publicly released.
Results
Hebrew leaderboard average (24B)
Hebrew leaderboard average (12B)
Chat task - Nikud (24B-Thinking)
English capability retention
Phase-2 long-context training volume
Who Should Care
What To Try In 7 Days
Download DictaLM-3.0-24B-Base from HuggingFace and test Hebrew QA and trivia workloads.
Pilot nikud (diacritization) with the DictaLM-3.0-24B-Thinking chat model to automate Hebrew text normalization.
Run the new Hebrew chat benchmark on your internal prompts to compare in-house models quickly.
Agent Features
Memory
- long-context (65k token native context)
Tool Use
- tool-calling support (Hermes-style JSON schema)
- tool_response tokens (<tool_response> ... </tool_response>)
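A minimal sketch of how a client might handle this tool-calling loop. It assumes the Hermes convention in which the model emits a JSON object with `name` and `arguments` keys inside `<tool_call>` tags; only the `<tool_response>` tags are confirmed by the list above. The helper names and the `add` tool are hypothetical.

```python
import json
import re
from typing import Any, Callable, Dict, List

# Assumption: tool calls arrive as JSON wrapped in <tool_call> tags (Hermes style).
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(model_output: str) -> List[Dict[str, Any]]:
    """Return every JSON tool call found in the model output."""
    return [json.loads(body.strip()) for body in TOOL_CALL_RE.findall(model_output)]


def format_tool_response(result: Any) -> str:
    """Wrap a tool result in <tool_response> tags; ensure_ascii=False keeps Hebrew readable."""
    return "<tool_response>\n" + json.dumps(result, ensure_ascii=False) + "\n</tool_response>"


def run_tools(model_output: str, registry: Dict[str, Callable[..., Any]]) -> List[str]:
    """Execute each requested tool and return the formatted responses."""
    responses = []
    for call in extract_tool_calls(model_output):
        tool = registry[call["name"]]
        responses.append(format_tool_response(tool(**call.get("arguments", {}))))
    return responses


if __name__ == "__main__":
    output = '<tool_call>\n{"name": "add", "arguments": {"a": 2, "b": 3}}\n</tool_call>'
    print(run_tools(output, {"add": lambda a, b: a + b}))
```

The formatted responses would be appended to the conversation as the tool turn before the next generation call. The exact chat template is model-specific and is not shown here.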
Frameworks
- Hermes tool-calling convention
- Qwen3 message delimiter tokens
Is Agentic
true
Architectures
- transformer (base models adapted from Mistral, Nemotron, Qwen)
Optimization Features
Token Efficiency
- packed sequences into 65k tokens (first-fit-decreasing packing)
- Accuracy
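First-fit-decreasing packing, mentioned above, sorts documents from longest to shortest and places each one in the first sequence that still has room. The sketch below illustrates the general algorithm, not the authors' data pipeline. The capacity of 65,536 tokens is an assumed reading of "65k", and the function name is hypothetical.

```python
from typing import List


def pack_first_fit_decreasing(doc_lengths: List[int], capacity: int = 65_536) -> List[List[int]]:
    """Pack documents (given as token counts) into sequences of at most `capacity` tokens.

    Returns one list of document indices per packed sequence.
    """
    # Visit documents from longest to shortest.
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    bins: List[List[int]] = []  # document indices in each packed sequence
    free: List[int] = []        # remaining token budget of each packed sequence
    for i in order:
        length = doc_lengths[i]
        if length > capacity:
            raise ValueError(f"document {i} has {length} tokens, more than {capacity}")
        for b, room in enumerate(free):
            if length <= room:  # first sequence with enough room wins
                bins[b].append(i)
                free[b] -= length
                break
        else:
            # No existing sequence fits, so open a new one.
            bins.append([i])
            free.append(capacity - length)
    return bins


if __name__ == "__main__":
    print(pack_first_fit_decreasing([40_000, 30_000, 25_000, 20_000, 10_000, 5_000]))
```

Documents longer than the capacity raise an error here; a real pipeline would more likely split or truncate them, which this sketch leaves out.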
Model Optimization
- continued pretraining from SOTA bases to save compute
- GRPO
System Optimization
- NVIDIA NeMo, NeMo-RL, Megatron-LM, vLLM used for scaling and training
Training Optimization
- two-phase continued pretraining (CPT): a 4,096-token context phase, then a 65k-token context phase
- sampling 75% long docs and 25% short docs in long-context phase
- used 80 H200 GPUs on NVIDIA DGX Cloud Lepton
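The 75/25 long/short mix in the long-context phase can be pictured as a weighted draw between two document pools. This is a sketch under stated assumptions: the source does not say whether the ratio applies per document or per token, nor where the long/short threshold lies, so the per-document draw and all names below are hypothetical.

```python
import random
from typing import List, Sequence


def sample_long_context_mix(
    long_docs: Sequence[str],
    short_docs: Sequence[str],
    n: int,
    long_fraction: float = 0.75,
    seed: int = 0,
) -> List[str]:
    """Draw n documents, choosing the long pool with probability `long_fraction`."""
    rng = random.Random(seed)  # seeded so the mix is reproducible
    batch = []
    for _ in range(n):
        pool = long_docs if rng.random() < long_fraction else short_docs
        batch.append(rng.choice(pool))
    return batch


if __name__ == "__main__":
    mix = sample_long_context_mix(["long_a", "long_b"], ["short_a"], n=8)
    print(mix)
```

Keeping some short documents in the mix is a common way to avoid degrading short-context behavior while extending the window; the source lists the ratio but not the rationale.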
Inference Optimization
- context-parallel training settings (Context Parallelism up to 16 for 12B)
- Nemotron hybrid SSM architecture advantage noted for throughput
Reproducibility
License
- permissive (models on HuggingFace; training code/data not fully published)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Training data mixes internal, scraped, and partnered proprietary sources that are not fully released.
- No public release of the full pretraining corpus or exact data-cleaning scripts.
- Authors did not focus on code-generation capabilities; coding tasks were not evaluated.
- Evaluation relies partly on LLM judges (GPT-4o), which can bias comparative scoring.
When Not To Use
- When you require fully auditable, public training-data provenance for compliance.
- When your main use case is code generation; the authors did not prioritize code tasks.
- When you need models smaller than 1.7B for extreme edge deployment.
Failure Modes
- Possible leftover biases or private-data leakage from proprietary or scraped sources.
- Language drift if used in mixed-language pipelines without prompt constraints (model may switch languages after tool outputs).
- Evaluation blind spots: LLM-as-judge scoring may favor certain styles and miss factual subtlety.
Core Entities
Models
- DictaLM-3.0-24B-Base
- DictaLM-3.0-24B-Thinking
- DictaLM-3.0-Nemotron-12B-Base
- DictaLM-3.0-Nemotron-12B-Instruct
- DictaLM-3.0-1.7B-Base
- DictaLM-3.0-1.7B-Instruct
- DictaLM-3.0-1.7B-Thinking
- Mistral-Small-3.1-24B
- NVIDIA-Nemotron-Nano-12B-v2
- Qwen3-1.7B
Metrics
- Accuracy
- Leaderboard average score
- Nikud word-level accuracy (percent of words correct)
- Win rate vs Gemini-2.5-Pro (chat)
- English capability retention (>98%)
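The nikud metric above counts a word as correct only if every diacritic in it matches. A minimal sketch of such a word-level score follows. It assumes whitespace tokenization, equal word counts in prediction and reference, and NFC normalization so that the same marks in a different order still compare equal; the benchmark's exact scoring rules may differ.

```python
import unicodedata


def nikud_word_accuracy(predicted: str, reference: str) -> float:
    """Percent of whitespace-separated words whose diacritization matches exactly."""
    # NFC puts combining marks into canonical order before comparison.
    pred_words = unicodedata.normalize("NFC", predicted).split()
    ref_words = unicodedata.normalize("NFC", reference).split()
    if len(pred_words) != len(ref_words):
        raise ValueError("prediction and reference must contain the same number of words")
    if not ref_words:
        return 0.0
    correct = sum(p == r for p, r in zip(pred_words, ref_words))
    return 100.0 * correct / len(ref_words)


if __name__ == "__main__":
    # Placeholder words stand in for diacritized Hebrew; one of four differs.
    print(nikud_word_accuracy("a b x d", "a b c d"))
```

Requiring equal word counts keeps the comparison simple; a model that inserts or drops words would need an alignment step first, which this sketch does not attempt.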
Datasets
- Hebrew pretraining corpus (~100B tokens)
- English mix (~30B tokens)
- Nemotron-CC
- FineWeb-Edu
- SlimPajama
- SFT
- Nemotron Post Training Dataset
- CCMatrix English-Hebrew pairs
- In-house diacritized nikud corpus
Benchmarks
- Hebrew LLM Leaderboard (base few-shot)
- New Hebrew chat benchmark (summarization, translation, Winograd, Israeli trivia, nikud)
- CommonsenseQA
- WinoGrande
- ARC-Challenge
- OLMES evaluation suite
Context Entities
Models
- Gemma-3-27B
- aya-expanse-32B
- Llama-3.3-70B-Instruct
- gemma-3-12b-it
- Qwen3-14B
Metrics
- Accuracy
- Token usage (instruct vs thinking efficiency)
- GRPO
Datasets
- Ben-Yehuda project
- Sefaria
- Social media corpora (Hebrew Twitter, blogs)
- News & legal transcripts
- Hebrew Treebank / UD Hebrew
Benchmarks
- MATH OMEGA
- BigBenchHard
- MMLU
- AlpacaEval 2

