Dicta-LM 3.0 — open-weight Hebrew LLMs (24B/12B/1.7B) with 65k context and a new Hebrew chat benchmark

February 2, 20268 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

Links

Abstract / PDF

Why It Matters For Business

Open-weight, Hebrew-specialist LLMs cut integration time for Hebrew products, enable local legal/regulatory control, and let teams prototype long-document features without building retrieval layers.

Summary TLDR

Dicta-LM 3.0 is a released suite of open-weight Hebrew-focused LLMs (24B, 12B, 1.7B) trained by continuing pretraining on ~100B Hebrew tokens mixed with ~30B English tokens. Models support 65k native context, come in base and chat variants (instruct and "thinking" reasoning style), and are post-trained with supervised fine-tuning, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). The 24B thinking model achieves top scores on several Hebrew tasks (notably diacritization and trivia) and the collection is published on HuggingFace with a permissive license.

Problem Statement

Frontier open-weight models lack strong coverage for low-resource languages. Hebrew has limited large corpora, complex morphology, and poor evaluation tools. Practitioners need sovereign Hebrew models and a way to evaluate chat-style Hebrew capabilities.

Main Contribution

Released Dicta-LM 3.0 models: 24B, 12B, 1.7B (base and chat variants) with native 65k context.

Continued pretraining on ~100B Hebrew tokens (75% of pretraining) mixed with ~30B English tokens.

Built and published a Hebrew chat benchmark suite covering summarization, translation, Winograd, Israeli trivia, and diacritization (nikud).

Post-training pipeline: supervised fine-tuning (instruct/thinking), DPO, and GRPO for reasoning/alignment.

Open-weight release on HuggingFace under a permissive license.

Key Findings

Continuous Hebrew-focused pretraining improved Hebrew leaderboard averages.

Numbers24B avg +6.5 points; 12B +12.8; 1.7B +8.2 (Table 3)

Models support very long context windows trained end-to-end.

NumbersNative context length = 65,280 tokens; phase 2 trained on 18B tokens with long-context sampling

Chat variants show strong Hebrew-specialized task performance.

Numbers24B-Thinking: Nikud 86.86, Winograd 78.06, Summarization 56.86 (Table 8)

English capability largely retained after heavy Hebrew focus.

NumbersAuthors report retaining >98% of English capabilities of original models

Models and chat-benchmark are publicly released.

NumbersMultiple model variants listed for public release on HuggingFace

Results

Hebrew leaderboard average (24B)

Value66.0 -> 72.5 (DictaLM-3.0)

BaselineMistral-Small-3.1-24B

Hebrew leaderboard average (12B)

Value53.7 -> 66.5

BaselineNemotron-Nano-12B-v2

Chat task - Nikud (24B-Thinking)

Value86.86

Baselinegemma3-27B-it 60.21

English capability retention

Value>98% retained (authors' claim)

BaselineOriginal base models

Phase-2 long-context training volume

Value18B tokens

Who Should Care

What To Try In 7 Days

Download DictaLM-3.0-24B base from HuggingFace and test Hebrew QA and trivia workloads.

Pilot nikud (diacritization) with the 24B-Thinking chat to automate Hebrew text normalization.

Run the new Hebrew chat benchmark on your internal prompts to compare in-house models quickly.

Agent Features

Memory

  • long-context (65k token native context)

Tool Use

  • tool-calling support (Hermes-style JSON schema)
  • tool_response tokens (<tool_response> ... </tool_response>)

Frameworks

  • Hermes tool-calling convention
  • Qwen3 message delimiter tokens

Is Agentic

true

Architectures

  • transformer (base models adapted from Mistral, Nemotron, Qwen)

Optimization Features

Token Efficiency

  • packed sequences into 65k tokens (first-fit-decreasing packing)
  • Accuracy

Model Optimization

  • continued pretraining from SOTA bases to save compute
  • GRPO

System Optimization

  • NVIDIA NeMo, NeMo-RL, Megatron-LM, vLLM used for scaling and training

Training Optimization

  • two-phase CPT: 4,096 then 65k context phase
  • sampling 75% long docs and 25% short docs in long-context phase
  • used 80 H200 GPUs on NVIDIA DGX Cloud Lepton

Inference Optimization

  • context-parallel training settings (Context Parallelism up to 16 for 12B)
  • Nemotron hybrid SSM architecture advantage noted for throughput

Reproducibility

License

  • permissive (models on HuggingFace; training code/data not fully published)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Training data mixes internal, scraped, and partnered proprietary sources that are not fully released.
  • No public release of the full pretraining corpus or exact data-cleaning scripts.
  • Authors did not focus on code-generation capabilities; coding tasks were not evaluated.
  • Evaluation relies partly on LLM judges (GPT-4o) which can bias comparative scoring.

When Not To Use

  • When you require a fully auditable, public training data provenance for compliance.
  • If your main use case is code generation—authors did not prioritize code tasks.
  • If you need models smaller than 1.7B for extreme edge deployment.

Failure Modes

  • Possible leftover biases or private-data leakage from proprietary or scraped sources.
  • Language drift if used in mixed-language pipelines without prompt constraints (model may switch languages after tool outputs).
  • Evaluation blind spots: LLM-as-judge scoring may favor certain styles and miss factual subtlety.

Core Entities

Models

  • DictaLM-3.0-24B-Base
  • DictaLM-3.0-24B-Thinking
  • DictaLM-3.0-Nemotron-12B-Base
  • DictaLM-3.0-Nemotron-12B-Instruct
  • DictaLM-3.0-1.7B-Base
  • DictaLM-3.0-1.7B-Instruct
  • DictaLM-3.0-1.7B-Thinking
  • Mistral-Small-3.1-24B
  • NVIDIA-Nemotron-Nano-12B-v2
  • Qwen3-1.7B

Metrics

  • Accuracy
  • Leaderboard average score
  • Nikud percent words correct
  • Win rate vs Gemini-2.5-Pro (chat)
  • English capability retention (>98%)

Datasets

  • Hebrew pretraining corpus (~100B tokens)
  • English mix (~30B tokens)
  • Nemotron-CC
  • FineWeb-Edu
  • SlimPajama
  • SFT
  • Nemotron Post Training Dataset
  • CCMatrix English-Hebrew pairs
  • In-house diacritized nikud corpus

Benchmarks

  • Hebrew LLM Leaderboard (base few-shot)
  • New Hebrew chat benchmark (summarization, translation, Winograd, Israeli trivia, nikud)
  • CommonsenseQA
  • WinoGrande
  • ARC-Challenge
  • Olmes evaluation suite

Context Entities

Models

  • Gemma-3-27B
  • aya-expanse-32B
  • Llama-3.3-70B-Instruct
  • gemma-3-12b-it
  • Qwen3-14B

Metrics

  • Accuracy
  • Token usage (instruct vs thinking efficiency)
  • GRPO

Datasets

  • Ben-Yehuda project
  • Sefaria
  • Social media corpora (Hebrew twitter, blogs)
  • News & legal transcripts
  • Hebrew Treebank / UD Hebrew

Benchmarks

  • MATH OMEGA
  • BigBenchHard
  • MMLU
  • AlpacaEval 2