A 2.6B foundation LLM that blends new attention and polynomial activations to boost math and code performance while keeping costs moderate

August 2, 2025 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper provides replicated benchmark comparisons and ablations supporting the architecture and training choices, but undisclosed components (proprietary synthetic data and an internal framework) limit full external reproduction.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.65

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junghwan Lim, Sungmin Lee, Dongseok Kim, Eunhwan Park, Hyunbyung Park, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Jihwan Kim, Minjae Kim, Taehwan Kim, Youngrok Kim, Haesol Lee, Jeesoo Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Daewon Suh, Dongjoo Weon


Why It Matters For Business

Motif-2.6B offers strong code and math performance with a modest parameter count, making it cost-effective for teams that need high-quality reasoning or coding without 7–70B model compute costs.

Summary TLDR

Motif-2.6B is a 2.6B-parameter decoder-only foundation model built on two main innovations: Differential Attention (subtracting two attention maps to reduce noise) and PolyNorm (a polynomial-based activation/normalization). It also uses a dynamic data-mixing pretraining schedule and RoPE-based context extension. Across a broad benchmark suite, Motif beats many similarly sized open models on math and coding tasks by large relative margins and posts a positive average improvement over several baselines, but it underperforms on some open-domain QA and commonsense benchmarks. Training used ~2.5T tokens and custom ROCm kernels, followed by post-training alignment with DPO.
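The idea of subtracting two attention maps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: it assumes two separate query/key projections that share one value matrix, scales the second map by a fixed weight `lam` (an assumed parameter; the summary only says the maps are subtracted), and omits the causal mask and multi-head layout a real decoder would need. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot-product scores (no causal mask, for brevity)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Subtract a second attention map (scaled by lam) from the first, then mix values.

    lam is an assumed weighting; the source only states that two maps are subtracted.
    """
    map_a = attention_map(q1, k1)
    map_b = attention_map(q2, k2)
    diff = [
        [a - lam * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]
    width = len(values[0])
    out = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in diff
    ]
    return diff, out
```

Because each softmax row sums to 1, every row of the differential map sums to `1 - lam`. Weight that both maps place on the same keys cancels, which is the intuition behind the noise reduction the summary describes.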

Problem Statement

Building a high-quality foundational LLM that is computationally affordable yet strong across reasoning, code, and long-context tasks remains hard for smaller research groups. This work designs an architecture and training recipe for a 2.6B model that aims to improve long-context comprehension, reduce hallucination, and boost in-context learning while keeping compute and token budgets moderate.

Main Contribution

Design and release of Motif-2.6B, a 2.6B decoder-only model with Differential Attention and PolyNorm activations.

A two-stage dynamic data-mixing pretraining schedule that linearly shifts domain ratios (general → Korean/code/math) across training.
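A linear shift of domain ratios can be expressed as a simple interpolation between a starting and an ending mixture. The sketch below is a minimal illustration: the domain names follow the summary, but the ratio values are hypothetical placeholders, `mixing_ratios` is an invented helper, and the two-stage structure of the actual schedule is not modeled.

```python
def mixing_ratios(progress, start, end):
    """Linearly interpolate per-domain sampling ratios.

    progress runs from 0.0 (start of training) to 1.0 (end of training).
    The result is renormalized so the ratios sum to 1.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    mixed = {
        domain: (1.0 - progress) * start[domain] + progress * end[domain]
        for domain in start
    }
    total = sum(mixed.values())
    return {domain: ratio / total for domain, ratio in mixed.items()}


# Hypothetical ratios for illustration only; the paper's actual values are not given here.
START = {"general": 0.80, "korean": 0.05, "code": 0.10, "math": 0.05}
END = {"general": 0.40, "korean": 0.20, "code": 0.25, "math": 0.15}

halfway = mixing_ratios(0.5, START, END)
```

A data loader would call a function like this with `progress = tokens_seen / total_tokens` and sample each batch's domains according to the returned ratios.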

Key Findings

Motif-2.6B achieves a positive average improvement vs Mistral 7B across evaluated benchmarks.

Numbers: +25.47% average improvement vs Mistral 7B

Practical Use: Expect overall stronger task performance than Mistral 7B on the included benchmark mix; good baseline if you need a compact model that often outperforms a 7B model on these tasks.

Evidence Ref: Table 4 / Appendix A.2 (average over listed benchmarks)

Large relative gains on math and reasoning benchmarks (MATH, GSM8K) over some baselines.

Numbers: MATH +206.87% vs Mistral 7B; GSM8K +53.66%

Practical Use: Use Motif for math and multi-step reasoning tasks where it substantially outperforms several open baselines.

Evidence Ref: Appendix A.2 tables (MATH, GSM8K rows)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Average improvement vs Mistral 7B | +25.47% | Mistral 7B | +25.47% | Average across listed benchmarks | Table 4 / Appendix A.2 average | Table 4
HumanEval P@1 | 68.3 | Mistral 7B (30.5) | +123.93% | HumanEval (0-shot) | Appendix A.2 HumanEval row | Appendix A.2
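The delta values in this summary are relative improvements over the baseline score. A short check, using the HumanEval figures reported above and a hypothetical helper name, shows how the percentage is derived:

```python
def relative_delta(value, baseline):
    """Percentage change of value relative to baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# HumanEval P@1: Motif-2.6B 68.3 vs Mistral 7B 30.5 (figures from the Results section).
humaneval_delta = relative_delta(68.3, 30.5)
print(f"{humaneval_delta:+.2f}%")  # prints +123.93%
```

Note that a relative delta above 100% means the score more than doubled; it is not a difference in percentage points (which here would be 37.8).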

What To Try In 7 Days

Run Motif-2.6B on your own math/reasoning and code tasks to validate the reported gains.

Evaluate Motif-2.6B-LC on any long-document workflows (up to 16k tokens) and compare latency vs your current models.

Use the provided Hugging Face kernels to test integration on ROCm hardware before porting production pipelines.

Optimization Features

Token Efficiency
Tokenizer expanded to 219,520 tokens; improved bytes-per-token for Korean by 12.6%
Infra Optimization
Training under controlled compute budget (3×10^20 FLOPs across experiments)
Model Optimization
PolyNorm polynomial activation (degree ≤3); Differential Attention (subtract two attention maps)
System Optimization
Custom HIP kernels optimized for ROCm on AMD GPUs
Training Optimization
Dynamic data-mixing scheduler (linear annealing of domain ratios); Simple Moving Average over recent checkpoints every 8B tokens; Warmup-Stable-Decay learning schedule; AdamW optimizer with specified hyperparameters
Inference Optimization
Long-context variant uses RoPE base frequency adjustment for 16k context
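To see why adjusting the RoPE base frequency helps with longer contexts, it is useful to look at the wavelength of each rotary frequency pair. The sketch below uses the standard RoPE definition (pair i rotates at `base ** (-2i / head_dim)` radians per token). The head dimension and both base values are assumptions chosen for illustration; the summary does not state the values Motif-2.6B-LC uses.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength in tokens of each rotary pair: 2*pi * base ** (2i / head_dim)."""
    return [
        2.0 * math.pi * base ** (2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]


def pairs_longer_than(context_len, head_dim, base):
    """Count rotary pairs that do not complete a full rotation within context_len tokens."""
    return sum(1 for w in rope_wavelengths(head_dim, base) if w > context_len)


# Assumed values for illustration only.
HEAD_DIM = 128
DEFAULT_BASE = 10_000.0   # common RoPE default
RAISED_BASE = 500_000.0   # hypothetical larger base for long context
CONTEXT_LEN = 16_384      # "16k" context from the summary

default_count = pairs_longer_than(CONTEXT_LEN, HEAD_DIM, DEFAULT_BASE)
raised_count = pairs_longer_than(CONTEXT_LEN, HEAD_DIM, RAISED_BASE)
```

Raising the base leaves the fastest-rotating pair unchanged and stretches the slower ones, so more pairs encode positions across the full window without wrapping around.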

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Data URLs

References to datasets: DCLM, TxT360, Fineweb2, FineMath, Tulu, Granite (see paper refs)

Risks & Boundaries

Limitations

Underperforms on some open-domain QA and retrieval benchmarks (e.g., NQ, TriviaQA).

Custom ROCm kernels and internal framework are not fully open, complicating reproduction.

When Not To Use

Do not use as a drop-in replacement for retrieval-heavy or open-domain QA systems without adding retrieval.

Avoid relying solely on it for commonsense zero-shot tasks where it trails larger models.

Failure Modes

Weak factual recall in open-domain QA (very low NQ/TriviaQA scores).

Potential bias or artefacts from synthetic dataset fusion and proprietary Korean corpus.

Core Entities

Models

Motif-2.6B, Motif-2.6B-LC, Mistral 7B, Gemma 1 (2B/7B), Gemma 2 (2B/9B), Gemma 3 (1B/4B), Llama 3 (8B), Llama 3.2 (1B/3B), Phi-2 (2.7B), Phi-3 (3.8B/7B), Qwen3-8B, Qwen3 (excluded from comparisons)

Metrics

Accuracy, P@1, F1, Average improvement % (relative delta)

Datasets

DCLM, TxT360, Fineweb2, FineMath, Tulu (Tulu 3 mixtures), Granite, MagpieLM, LM-Sys, Exam-CoT (synthetic), EvolKit (Auto Evol-Instruct outputs), In-house Korean corpus

Benchmarks

MMLU, HellaSwag, WinoGrande, PIQA, ARC-E, ARC-C, NQ, TriviaQA, HumanEval, MBPP, MATH, GSM8K, BBH, AGIEval, DROP, SIQA, BoolQ, GPQA, IFEval