Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
Motif-2.6B offers strong code and math performance with a modest parameter count, making it cost-effective for teams that need high-quality reasoning or coding without 7–70B model compute costs.
Summary TLDR
Motif-2.6B is a 2.6B-parameter decoder-only foundation model that combines two main innovations, Differential Attention (subtracting two attention maps to reduce noise) and PolyNorm (polynomial-based activation/normalization), with a dynamic data-mixing pretraining schedule and RoPE-based context extension. Across a broad benchmark suite, Motif outperforms many similarly sized open models on math and coding tasks by large relative margins and averages above several baselines, but it underperforms on some open-domain QA and commonsense benchmarks. Training used ~2.5T tokens and custom ROCm kernels, followed by post-training alignment with DPO.
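The "subtract two attention maps" idea can be illustrated with a minimal pure-Python sketch. It assumes the Differential Transformer style formulation, in which the second softmax map is scaled by a scalar `lam` before subtraction; the summary does not state Motif's exact variant, so `lam`, the causal mask, and the function names are illustrative assumptions rather than the model's implementation.

```python
import math

def softmax(row):
    # Numerically stable softmax; masked entries (-inf) become 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(q, k, causal=True):
    # Row-wise softmax(q k^T / sqrt(d)), optionally with a causal mask.
    scale = 1.0 / math.sqrt(len(q[0]))
    rows = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))
            else:
                scores.append(scale * sum(a * b for a, b in zip(qi, kj)))
        rows.append(softmax(scores))
    return rows

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    # Assumed form: (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) @ v.
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    out = [[sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
           for row in diff]
    return diff, out
```

Attention mass that both maps assign to the same positions cancels in the difference, which is the intuition behind the noise-reduction claim.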
Problem Statement
Building a high-quality foundation LLM that is computationally affordable yet strong across reasoning, code, and long-context tasks remains hard for smaller research groups. This work designs an architecture and training recipe for a 2.6B model that aims to improve long-context comprehension, reduce hallucination, and boost in-context learning while keeping compute and token budgets moderate.
Main Contribution
Design and release of Motif-2.6B, a 2.6B decoder-only model with Differential Attention and PolyNorm activations.
A two-stage dynamic data-mixing pretraining schedule that linearly shifts domain ratios (general → Korean/code/math) across training.
Long-context extension (Motif-2.6B-LC) by raising the RoPE base frequency to 500k to support a 16k-token context.
Post-training pipeline combining curated human and synthetic datasets, dataset fusion, rejection-sampling synthesis, and DPO alignment.
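The linear shift in domain ratios mentioned above can be sketched as a simple interpolation between a starting and an ending mixture. The domain names follow the summary, but the ratio values and the `mixing_ratios` helper are hypothetical; the paper's actual ratios and stage boundaries are not given here.

```python
def mixing_ratios(progress, start, end):
    """Linearly interpolate per-domain sampling ratios.

    progress: fraction of the stage completed, in [0, 1].
    start, end: dicts mapping domain name -> ratio (each summing to 1).
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if start.keys() != end.keys():
        raise ValueError("start and end must cover the same domains")
    return {d: (1.0 - progress) * start[d] + progress * end[d] for d in start}

# Hypothetical mixtures, for illustration only.
START = {"general": 0.80, "korean": 0.10, "code": 0.05, "math": 0.05}
END = {"general": 0.40, "korean": 0.20, "code": 0.20, "math": 0.20}

halfway = mixing_ratios(0.5, START, END)
```

A two-stage schedule could chain two such interpolations, using the end mixture of the first stage as the start of the second.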
Key Findings
Motif-2.6B achieves a positive average improvement vs Mistral 7B across evaluated benchmarks.
Large relative gains on math and reasoning benchmarks compared with some baselines.
Large improvements on coding benchmarks.
Clear weaknesses on open-domain retrieval-style QA and some commonsense tasks.
Results
Average improvement vs Mistral 7B
HumanEval P@1
MATH (maj@4)
GSM8K
NQ
Who Should Care
What To Try In 7 Days
Run Motif-2.6B on your own math, reasoning, and code test sets to validate the reported gains.
Evaluate Motif-2.6B-LC on any long-document workflows (up to 16k tokens) and compare latency vs your current models.
Use the provided Hugging Face kernels to test integration on ROCm hardware before porting production pipelines.
Optimization Features
Token Efficiency
- Tokenizer expanded to 219,520 tokens; improved bytes-per-token for Korean by 12.6%
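Bytes per token is the number of UTF-8 bytes of a text divided by the number of tokens the tokenizer produces for it; a higher value means fewer tokens for the same text. The sketch below shows how such a figure and its relative change could be computed. The token counts are made-up placeholders, not measurements from either tokenizer.

```python
def bytes_per_token(text, token_count):
    # UTF-8 byte length divided by the number of tokens produced for the text.
    return len(text.encode("utf-8")) / token_count

def relative_gain(old, new):
    # Relative change of `new` over `old`, e.g. 0.126 for a 12.6% improvement.
    return (new - old) / old

sample = "안녕하세요"  # 5 Hangul syllables, 3 bytes each in UTF-8
old_bpt = bytes_per_token(sample, token_count=5)  # hypothetical old tokenizer
new_bpt = bytes_per_token(sample, token_count=3)  # hypothetical new tokenizer
gain = relative_gain(old_bpt, new_bpt)
```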
Infra Optimization
- Training under controlled compute budget (3×10^20 FLOPs across experiments)
Model Optimization
- PolyNorm polynomial activation (degree ≤3)
- Differential Attention (subtract two attention maps)
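A polynomial activation of degree at most 3 can be sketched as a weighted sum of normalized elementwise powers of the input. The version below assumes the PolyNorm formulation from the polynomial composition activation literature, with RMS normalization applied to each power; the weights, bias, and `eps` are hypothetical defaults, and Motif's kernels may differ in detail.

```python
import math

def rms_normalize(xs, eps=1e-6):
    # Divide by the root-mean-square of the vector.
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x / rms for x in xs]

def polynorm(xs, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0, eps=1e-6):
    """Sum over powers p = 1..len(weights) of weights[p-1] * rms_normalize(x**p)."""
    if len(weights) > 3:
        raise ValueError("this sketch covers degree <= 3 only")
    out = [bias] * len(xs)
    for power, w in enumerate(weights, start=1):
        normed = rms_normalize([x ** power for x in xs], eps)
        out = [o + w * n for o, n in zip(out, normed)]
    return out
```

Normalizing each power separately keeps the cubic term from dominating for large inputs, which is one plausible reason to pair the polynomial with normalization.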
System Optimization
- Custom HIP kernels optimized for ROCm on AMD GPUs
Training Optimization
- Dynamic data-mixing scheduler (linear annealing of domain ratios)
- Simple Moving Average over recent checkpoints every 8B tokens
- Warmup-Stable-Decay learning schedule
- AdamW optimizer with specified hyperparameters
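Two of the items above, the Warmup-Stable-Decay schedule and checkpoint averaging, are easy to sketch. The example assumes linear warmup and linear decay and represents checkpoints as plain dicts of float lists; the step counts, peak learning rate, and helper names are hypothetical, and the summary does not specify the decay shape or the averaging window.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    # Warmup: ramp linearly up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Stable: hold peak_lr until the decay phase begins.
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    # Decay: move linearly from peak_lr to min_lr (assumed shape).
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + frac * (min_lr - peak_lr)

def average_checkpoints(checkpoints):
    # Simple moving average: elementwise mean over the given recent checkpoints.
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }

averaged = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```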
Inference Optimization
- Long-context variant uses RoPE base frequency adjustment for 16k context
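In RoPE, dimension pair i rotates with inverse frequency base^(-2i/d), so raising the base lengthens the wavelengths of the slowest-rotating pairs and spreads positions over a longer range. The sketch below compares wavelengths for two bases. The head dimension of 128 and the comparison base of 10,000 (the value used in the original RoPE formulation) are assumptions, not figures from the summary.

```python
import math

def rope_inv_freq(head_dim, base):
    # One inverse frequency per rotated dimension pair: base^(-2i/d).
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim, base):
    # Wavelength (in token positions) of the slowest-rotating pair.
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]

HEAD_DIM = 128  # hypothetical head dimension
short_base = longest_wavelength(HEAD_DIM, 10_000)   # assumed comparison base
long_base = longest_wavelength(HEAD_DIM, 500_000)   # base reported for the LC variant
```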
Reproducibility
Code Urls
Data Urls
- Datasets referenced: DCLM, TxT360, Fineweb2, FineMath, Tulu, Granite (see the paper's references)
Open Source Status
- partial
Risks & Boundaries
Limitations
- Underperforms on some open-domain QA and retrieval benchmarks (e.g., NQ, TriviaQA).
- Custom ROCm kernels and internal framework are not fully open, complicating reproduction.
- Some improvements may depend on specific dataset mixtures and synthetic data choices that are not fully public.
When Not To Use
- Do not use as a drop-in replacement for retrieval-heavy or open-domain QA systems without adding retrieval.
- Avoid relying solely on it for commonsense zero-shot tasks where it trails larger models.
- Not ideal if you require fully open training code and all datasets for auditing.
Failure Modes
- Weak factual recall in open-domain QA (very low NQ/TriviaQA scores).
- Potential bias or artefacts from synthetic dataset fusion and proprietary Korean corpus.
- Performance variability across tasks: strong on math/code, weaker on some commonsense and knowledge benchmarks.
Core Entities
Models
- Motif-2.6B
- Motif-2.6B-LC
- Mistral 7B
- Gemma 1 (2B/7B)
- Gemma 2 (2B/9B)
- Gemma 3 (1B/4B)
- Llama 3 (8B)
- Llama 3.2 (1B/3B)
- Phi-2 (2.7B)
- Phi-3 (3.8B/7B)
- Qwen3-8B
- Qwen3 (excluded from comparisons)
Metrics
- Accuracy
- P@1
- F1
- Average improvement % (relative delta)
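The "average improvement %" metric can be read as the mean of per-benchmark relative deltas against a baseline. The sketch below assumes that reading; the benchmark names and scores are placeholders, not reported results.

```python
def relative_delta_pct(model_score, baseline_score):
    # Percentage change of the model score relative to the baseline score.
    return 100.0 * (model_score - baseline_score) / baseline_score

def average_improvement_pct(model_scores, baseline_scores):
    # Mean relative delta over the benchmarks both dicts have in common.
    shared = sorted(model_scores.keys() & baseline_scores.keys())
    if not shared:
        raise ValueError("no shared benchmarks")
    deltas = [relative_delta_pct(model_scores[b], baseline_scores[b]) for b in shared]
    return sum(deltas) / len(deltas)

# Placeholder scores, for illustration only.
model = {"bench_a": 60.0, "bench_b": 30.0}
baseline = {"bench_a": 50.0, "bench_b": 40.0}
avg = average_improvement_pct(model, baseline)
```

Because each delta is divided by the baseline score, a benchmark where the baseline scores very low can dominate the average, which is worth keeping in mind when reading a single headline number.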
Datasets
- DCLM
- TxT360
- Fineweb2
- FineMath
- Tulu (Tulu 3 mixtures)
- Granite
- MagpieLM
- LM-Sys
- Exam-CoT (synthetic)
- EvolKit (Auto Evol-Instruct outputs)
- In-house Korean corpus
Benchmarks
- MMLU
- HellaSwag
- WinoGrande
- PIQA
- ARC-E
- ARC-C
- NQ
- TriviaQA
- HumanEval
- MBPP
- MATH
- GSM8K
- BBH
- AGIEval
- DROP
- SIQA
- BoolQ
- GPQA
- IFEval

