A 2.6B foundation LLM that blends new attention and polynomial activations to boost math and code performance while keeping costs moderate

Overview

Decision Snapshot: Needs Validation

The paper provides replicated benchmark comparisons and ablations supporting the architecture and training choices, but many details (proprietary synthetic data and internal framework) limit full external reproduction.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.65

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Junghwan Lim, Sungmin Lee, Dongseok Kim, Eunhwan Park, Hyunbyung Park, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Jihwan Kim, Minjae Kim, Taehwan Kim, Youngrok Kim, Haesol Lee, Jeesoo Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Daewon Suh, Dongjoo Weon

Why It Matters For Business

Motif-2.6B offers strong code and math performance with a modest parameter count, making it cost-effective for teams that need high-quality reasoning or coding without 7–70B model compute costs.

Who Should Care

CTO, ML Engineer, Engineering Lead, Product Manager, Data Scientist, Founder

Summary TLDR

Motif-2.6B is a 2.6B-parameter decoder-only foundation model that combines two main innovations: Differential Attention (subtracting two attention maps to reduce noise) and PolyNorm (a polynomial-based activation/normalization). It also uses a dynamic data-mixing pretraining schedule and RoPE-based context extension. Across a broad benchmark suite, Motif beats many similarly sized open models on math and coding tasks by large relative margins (for example, +206.87% on MATH vs Mistral 7B) and posts a positive average against several baselines, but it underperforms on some open-domain QA and commonsense benchmarks. Training used ~2.5T tokens and custom ROCm kernels, followed by post-training alignment with DPO.
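The "subtract two attention maps" idea behind Differential Attention can be illustrated with a small pure-Python sketch. This is a toy under stated assumptions, not the paper's implementation: the fixed scalar `lam`, the single-head layout, and the omission of causal masking are all simplifications introduced here for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(q, k):
    """Row-wise softmax(q . k / sqrt(d)) for lists of vectors."""
    d = len(q[0])
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        for qi in q
    ]

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Subtract a second attention map (scaled by lam) from the first, then mix values.

    lam is a hypothetical fixed scalar; the summary does not say how it is set.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    out = [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(len(v[0]))]
        for row in diff
    ]
    return diff, out

# Toy inputs: two tokens, two-dimensional heads.
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
diff_map, output = differential_attention(q1, k1, q2, k2, v, lam=0.5)
```

Because each softmax row sums to 1, every row of `diff_map` sums to `1 - lam`; attention mass that both maps assign to the same positions is what gets cancelled.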

Problem Statement

Building a high-quality foundational LLM that is computationally affordable yet strong across reasoning, code, and long-context tasks remains hard for smaller research groups. This work designs an architecture and training recipe for a 2.6B model that aims to improve long-context comprehension, reduce hallucination, and boost in-context learning while keeping compute and token budgets moderate.

Main Contribution

Design and release of Motif-2.6B, a 2.6B decoder-only model with Differential Attention and PolyNorm activations.

A two-stage dynamic data-mixing pretraining schedule that linearly shifts domain ratios (general → Korean/code/math) across training.
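The linear shift of domain ratios described above can be sketched as a simple interpolation between a starting and an ending mixture. The domain names follow the summary, but the ratio values and step counts below are hypothetical placeholders, and the sketch shows a single linear ramp rather than the full two-stage schedule.

```python
def mixing_ratios(step, total_steps, start, end):
    """Linearly interpolate per-domain sampling ratios from start to end."""
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress clamped to [0, 1]
    return {d: (1.0 - t) * start[d] + t * end[d] for d in start}

# Hypothetical mixtures; the paper's actual ratios are not given in this summary.
start = {"general": 0.8, "korean": 0.1, "code": 0.05, "math": 0.05}
end = {"general": 0.4, "korean": 0.2, "code": 0.2, "math": 0.2}

halfway = mixing_ratios(500, 1000, start, end)
```

As long as both endpoint mixtures sum to 1, every interpolated mixture also sums to 1, so the result can be used directly as sampling probabilities.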

Key Findings

Motif-2.6B achieves a positive average improvement vs Mistral 7B across evaluated benchmarks.

Numbers: +25.47% average improvement vs Mistral 7B

Practical Use: Expect overall stronger task performance than Mistral 7B on the included benchmark mix; good baseline if you need a compact model that often outperforms a 7B model on these tasks.

Evidence Ref: Table 4 / Appendix A.2 (average over listed benchmarks)

Large relative gains on math and reasoning benchmarks compared to some baselines.

Numbers: MATH +206.87% vs Mistral 7B; GSM8K +53.66%

Practical Use: Use Motif for math and multi-step reasoning tasks where it substantially outperforms several open baselines.

Evidence Ref: Appendix A.2 tables (MATH, GSM8K rows)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Average improvement vs Mistral 7B	+25.47%	Mistral 7B	+25.47%	Average across listed benchmarks	Table 4 / Appendix A.2 average	Table 4
HumanEval P@1	68.3	Mistral 7B 30.5	+123.93%	HumanEval (0-shot)	Appendix A.2 HumanEval row	Appendix A.2
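The Delta column appears to be a relative improvement over the baseline score. A minimal check of the HumanEval row, assuming delta = (value - baseline) / baseline:

```python
def relative_delta(value, baseline):
    """Relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0

humaneval_delta = relative_delta(68.3, 30.5)
print(f"{humaneval_delta:+.2f}%")  # +123.93%
```

Relative deltas inflate when the baseline is small, which is worth keeping in mind when reading the very large MATH figure.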

What To Try In 7 Days

Run Motif-2.6B on your own math/reasoning and code tasks to validate the reported gains.

Evaluate Motif-2.6B-LC on any long-document workflows (up to 16k tokens) and compare latency vs your current models.

Use the provided HuggingFace kernels to test integration on ROCm hardware before porting production pipelines.

Optimization Features

Token Efficiency

Tokenizer expanded to 219,520 tokens; improved bytes-per-token for Korean by 12.6%
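Bytes-per-token can be measured on your own Korean text to sanity-check this kind of claim. The sketch below is illustrative only: `str.split` is a hypothetical stand-in for a real tokenizer, and the sample sentence is made up.

```python
def bytes_per_token(text, tokenize):
    """Average number of UTF-8 bytes covered by each token."""
    tokens = tokenize(text)
    return len(text.encode("utf-8")) / len(tokens)

# Whitespace splitting stands in for a real tokenizer here (assumption).
sample = "안녕하세요 세계"
bpt = bytes_per_token(sample, str.split)
```

A higher value means fewer tokens are needed for the same text, which lowers both training and inference cost for that language.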

Infra Optimization

Training under controlled compute budget (3×10^20 FLOPs across experiments)

Model Optimization

PolyNorm polynomial activation (degree ≤3)

Differential Attention (subtract two attention maps)
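One plausible reading of a degree-3 polynomial activation with normalization is a weighted sum of normalized element-wise powers of the input. The sketch below assumes that form; the RMS normalization, the equal default weights, and the scalar bias are assumptions introduced for illustration and are not confirmed by this summary.

```python
import math

def rms_normalize(vec, eps=1e-6):
    """Divide a vector by its root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def poly_norm(x, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0):
    """Weighted sum of normalized element-wise powers x^1 .. x^k, with k = len(weights) <= 3.

    In a trained model the weights and bias would be learned; here they are fixed placeholders.
    """
    out = [bias] * len(x)
    for power, w in enumerate(weights, start=1):
        normed = rms_normalize([v ** power for v in x])
        out = [o + w * n for o, n in zip(out, normed)]
    return out

y = poly_norm([0.5, -1.0, 2.0])
```

Normalizing each power separately keeps the cubic term from dominating the output when activations are large.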

System Optimization

Custom HIP kernels optimized for ROCm on AMD GPUs

Training Optimization

Dynamic data-mixing scheduler (linear annealing of domain ratios)

Simple Moving Average over recent checkpoints every 8B tokens

Warmup-Stable-Decay learning schedule

AdamW optimizer with specified hyperparameters
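Two of the items above, the Warmup-Stable-Decay schedule and checkpoint averaging, can be sketched in a few lines. The linear warmup and linear decay shapes, the step counts, and the scalar "parameters" standing in for tensors are all assumptions made for this illustration.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    progress = min((step - warmup_steps - stable_steps) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress

def average_checkpoints(checkpoints, window):
    """Simple moving average over the last `window` checkpoints.

    Each checkpoint is a dict of parameter name -> float, a stand-in for real tensors.
    """
    recent = checkpoints[-window:]
    return {name: sum(c[name] for c in recent) / len(recent) for name in recent[0]}

# Hypothetical usage: three saved checkpoints, average the latest two.
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
averaged = average_checkpoints(ckpts, window=2)
```

Averaging recent checkpoints smooths out step-to-step noise in the weights, which is why it pairs naturally with a long stable learning-rate phase.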

Inference Optimization

Long-context variant uses RoPE base frequency adjustment for 16k context
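Raising the RoPE base lowers the rotation frequencies, so positions further apart remain distinguishable. The sketch below uses the standard RoPE frequency formula; the head dimension and both base values are hypothetical, since this summary does not state the ones used for Motif-2.6B-LC.

```python
import math

def rope_inv_freq(head_dim, base):
    """Standard RoPE inverse frequencies: base^(-2i / head_dim) for each dimension pair i."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim, base):
    """Wavelength, in token positions, of the slowest-rotating dimension pair."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]

# Hypothetical values chosen only to show the direction of the effect.
short_ctx = longest_wavelength(128, 10_000.0)
long_ctx = longest_wavelength(128, 500_000.0)
```

With the larger base, `long_ctx` exceeds `short_ctx`, which is the property a base-frequency adjustment relies on when extending the usable context window.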

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/Motif-Technologies/activation

https://huggingface.co/MotifTechnologies/optimizer

Data URLs

References to datasets: DCLM, TxT360, Fineweb2, FineMath, Tulu, Granite (see paper refs)

Risks & Boundaries

Limitations

Underperforms on some open-domain QA and retrieval benchmarks (e.g., NQ, TriviaQA).

Custom ROCm kernels and internal framework are not fully open, complicating reproduction.

When Not To Use

Do not use as a drop-in replacement for retrieval-heavy or open-domain QA systems without adding retrieval.

Avoid relying solely on it for commonsense zero-shot tasks where it trails larger models.

Failure Modes

Weak factual recall in open-domain QA (very low NQ/TriviaQA scores).

Potential bias or artefacts from synthetic dataset fusion and proprietary Korean corpus.

Core Entities

Models

Motif-2.6B, Motif-2.6B-LC, Mistral 7B, Gemma 1 (2B/7B), Gemma 2 (2B/9B), Gemma 3 (1B/4B), Llama 3 (8B), Llama 3.2 (1B/3B), Phi-2 (2.7B), Phi-3 (3.8B/7B), Qwen3-8B, Qwen3 (excluded from comparisons)

Metrics

Accuracy, P@1, F1, Average improvement % (relative delta)

Datasets

DCLM, TxT360, Fineweb2, FineMath, Tulu (Tulu 3 mixtures), Granite, MagpieLM, LM-Sys, Exam-CoT (synthetic), EvolKit (Auto Evol-Instruct outputs), In-house Korean corpus

Benchmarks

MMLU, HellaSwag, WinoGrande, PIQA, ARC-E, ARC-C, NQ, TriviaQA, HumanEval, MBPP, MATH, GSM8K, BBH, AGIEval, DROP, SIQA, BoolQ, GPQA, IFEval
