A 2.6B foundation LLM that blends new attention and polynomial activations to boost math and code performance while keeping costs moderate

August 2, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Junghwan Lim, Sungmin Lee, Dongseok Kim, Eunhwan Park, Hyunbyung Park, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Jihwan Kim, Minjae Kim, Taehwan Kim, Youngrok Kim, Haesol Lee, Jeesoo Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Daewon Suh, Dongjoo Weon

Links

Abstract / PDF

Why It Matters For Business

Motif-2.6B delivers strong code and math performance at a modest parameter count, which makes it cost-effective for teams that need high-quality reasoning or coding without the compute costs of 7B to 70B models.

Summary TLDR

Motif-2.6B is a 2.6B-parameter decoder-only foundation model built on two main innovations: Differential Attention (subtracting two attention maps to reduce noise) and PolyNorm (a polynomial-based activation and normalization). It also uses a dynamic data-mixing pretraining schedule and RoPE-based context extension. Across a broad benchmark suite, Motif beats many similarly sized open models on math and coding tasks by large relative margins and averages higher than several baselines, but it underperforms on some open-domain QA and commonsense benchmarks. Training used ~2.5T tokens and custom ROCm kernels, followed by post-training alignment with DPO.
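To make "subtracting two attention maps" concrete, here is a minimal single-head NumPy sketch. It assumes the formulation from the Differential Transformer line of work, where a second softmax map is scaled by a factor and subtracted from the first. The fixed scalar `lam`, the shapes, and the function names are illustrative assumptions, not details taken from the Motif paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; rows always contain at least one finite entry.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head sketch: (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V.

    `lam` is a fixed scalar here (an assumption); in practice it would
    typically be a learned quantity.
    """
    seq_len, d = q1.shape
    scale = 1.0 / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    a1 = softmax(q1 @ k1.T * scale + mask)
    a2 = softmax(q2 @ k2.T * scale + mask)
    # Subtracting the second map cancels attention mass the two maps share.
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
seq_len, d_head, d_value = 4, 8, 16  # hypothetical toy sizes
q1, k1, q2, k2 = (rng.standard_normal((seq_len, d_head)) for _ in range(4))
v = rng.standard_normal((seq_len, d_value))
out = differential_attention(q1, k1, q2, k2, v)
print(out.shape)
```

Setting `lam=0.0` recovers ordinary causal softmax attention, which is a convenient sanity check when experimenting with the idea.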

Problem Statement

Building a high-quality foundation LLM that is computationally affordable yet strong across reasoning, code, and long-context tasks remains hard for smaller research groups. This work designs an architecture and training recipe for a 2.6B model that aims to improve long-context comprehension, reduce hallucination, and boost in-context learning while keeping compute and token budgets moderate.

Main Contribution

Design and release of Motif-2.6B, a 2.6B decoder-only model with Differential Attention and PolyNorm activations.

A two-stage dynamic data-mixing pretraining schedule that linearly shifts domain ratios (general → Korean/code/math) across training.

Long-context extension (Motif-2.6B-LC) by raising the RoPE base frequency to 500k to support a 16k-token context.

Post-training pipeline combining curated human and synthetic datasets, dataset fusion, rejection-sampling synthesis, and DPO alignment.
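The linear shift in domain ratios described in the data-mixing contribution above can be sketched as a simple interpolation between a starting and an ending mixture. The domain names follow the summary, but the ratios below are hypothetical placeholders, and the sketch covers a single linear ramp rather than the full two-stage schedule.

```python
def mixture_at(step, total_steps, start, end):
    """Linearly interpolate sampling ratios between two domain mixtures."""
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    raw = {name: (1.0 - t) * start[name] + t * end[name] for name in start}
    total = sum(raw.values())
    # Renormalize so the ratios always sum to 1.
    return {name: weight / total for name, weight in raw.items()}

# Hypothetical ratios, chosen only to illustrate the general -> Korean/code/math shift.
start = {"general": 0.80, "korean": 0.10, "code": 0.05, "math": 0.05}
end = {"general": 0.40, "korean": 0.20, "code": 0.20, "math": 0.20}

for step in (0, 50, 100):
    print(step, mixture_at(step, 100, start, end))
```

A data loader could call `mixture_at` at each step (or every few thousand steps) and sample the next batch's domains according to the returned ratios.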

Key Findings

Motif-2.6B achieves a positive average improvement vs Mistral 7B across evaluated benchmarks.

Numbers: +25.47% average improvement vs Mistral 7B

Large relative gains on math and reasoning benchmarks versus Mistral 7B.

Numbers: MATH +206.87% vs Mistral 7B; GSM8K +53.66%

Large improvements on coding benchmarks.

Numbers: HumanEval P@1 68.3 vs Mistral 30.5 (+123.93%)

Clear weaknesses on open-domain retrieval-style QA and some commonsense tasks.

Numbers: NQ 11.1 vs Mistral 28.8 (-61.32%); HellaSwag -24.54%
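The percentages in these findings are relative deltas against the baseline score. A small sketch of that arithmetic, using the HumanEval and MATH scores quoted in this summary (the function name is an illustrative choice):

```python
def relative_improvement(value, baseline):
    """Relative delta in percent: (value - baseline) / baseline * 100."""
    return (value - baseline) / baseline * 100.0

humaneval_gain = relative_improvement(68.3, 30.5)
math_gain = relative_improvement(40.2, 13.1)

print(f"HumanEval: {humaneval_gain:+.2f}%")  # HumanEval: +123.93%
print(f"MATH: {math_gain:+.2f}%")            # MATH: +206.87%
```

Deltas recomputed from rounded scores can differ slightly from figures derived from unrounded scores.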

Results

Average improvement vs Mistral 7B

Value: +25.47%

Baseline: Mistral 7B

HumanEval P@1

Value: 68.3

Baseline: 30.5 (Mistral 7B)

MATH (maj@4)

Value: 40.2

Baseline: 13.1 (Mistral 7B)

GSM8K

Value: 75.7

Baseline: 52.2 (Mistral 7B)

NQ

Value: 11.1

Baseline: 28.8 (Mistral 7B)

Who Should Care

What To Try In 7 Days

Run Motif-2.6B on your own math, reasoning, and code test sets to validate the reported gains.

Evaluate Motif-2.6B-LC on any long-document workflows (up to 16k tokens) and compare latency vs your current models.

Use the provided Hugging Face kernels to test integration on ROCm hardware before porting production pipelines.

Optimization Features

Token Efficiency

  • Tokenizer expanded to 219,520 tokens; improved bytes-per-token for Korean by 12.6%

Infra Optimization

  • Training under controlled compute budget (3×10^20 FLOPs across experiments)

Model Optimization

  • PolyNorm polynomial activation (degree ≤3)
  • Differential Attention (subtract two attention maps)
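As a rough illustration of the PolyNorm item above, the sketch below sums normalized powers of the input up to degree 3. It assumes an RMS-style normalization of each power and fixed weights. The actual normalization, the learned parameters, and the initialization used in Motif-2.6B are not specified in this summary, so every name and default here is hypothetical.

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    # Scale each row to roughly unit root-mean-square.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def polynorm(x, weights=(1 / 3, 1 / 3, 1 / 3), bias=0.0):
    """Degree-3 polynomial activation with per-term normalization (sketch).

    `weights` and `bias` would be learned in a real model; they are fixed
    constants here for illustration.
    """
    w1, w2, w3 = weights
    return (
        w1 * rms_normalize(x)
        + w2 * rms_normalize(x ** 2)
        + w3 * rms_normalize(x ** 3)
        + bias
    )

rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 8))  # hypothetical (batch, features) activations
print(polynorm(hidden).shape)
```

Normalizing each power separately keeps the higher-degree terms from dominating when activations are large, which is one plausible motivation for pairing the polynomial with a norm.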

System Optimization

  • Custom HIP kernels optimized for ROCm on AMD GPUs

Training Optimization

  • Dynamic data-mixing scheduler (linear annealing of domain ratios)
  • Simple Moving Average over recent checkpoints every 8B tokens
  • Warmup-Stable-Decay learning schedule
  • AdamW optimizer with specified hyperparameters
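Two of the items above, the Warmup-Stable-Decay schedule and the moving average over checkpoints, are simple enough to sketch directly. The version below assumes linear warmup and linear decay, and an unweighted mean over a window of checkpoints. The step counts, learning rate, and window size are hypothetical and not taken from the paper.

```python
import numpy as np

def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress

def average_checkpoints(checkpoints):
    """Simple moving average: element-wise mean over a window of checkpoints."""
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in checkpoints[0]
    }

# Hypothetical schedule: 1000 steps, 10 warmup steps, 100 decay steps.
for step in (0, 9, 500, 950, 1000):
    print(step, wsd_lr(step, 1000, 1e-3, warmup_steps=10, decay_steps=100))

# Hypothetical window of two checkpoints, each a dict of parameter arrays.
window = [{"w": np.array([1.0, 3.0])}, {"w": np.array([3.0, 5.0])}]
print(average_checkpoints(window))
```

In a setup like the one described, the window would hold the most recent checkpoints saved at the 8B-token interval, and the averaged weights would be used for evaluation.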

Inference Optimization

  • Long-context variant uses RoPE base frequency adjustment for 16k context
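To see what a larger RoPE base changes, the sketch below computes the standard RoPE inverse frequencies, `base ** (-2i / head_dim)`, and the wavelength of the slowest-rotating dimension pair. The 500k base comes from this summary. The 10k comparison base and the head dimension of 128 are assumptions used only for contrast.

```python
import numpy as np

def rope_inv_freq(head_dim, base):
    """Per-pair rotation frequencies used by rotary position embeddings."""
    exponents = np.arange(0, head_dim, 2) / head_dim
    return float(base) ** (-exponents)

def longest_wavelength(head_dim, base):
    """Number of positions for the slowest-rotating pair to complete one turn."""
    return 2 * np.pi / rope_inv_freq(head_dim, base)[-1]

head_dim = 128  # hypothetical head dimension
for base in (10_000, 500_000):
    print(base, round(longest_wavelength(head_dim, base)))
```

A larger base slows the rotation of every pair except the first, so position encodings stay distinguishable over longer distances, which is the usual motivation for raising it when extending context length.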

Reproducibility

Data Urls

  • References to datasets: DCLM, TxT360, Fineweb2, FineMath, Tulu, Granite (see paper refs)

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Underperforms on some open-domain QA and retrieval benchmarks (e.g., NQ, TriviaQA).
  • Custom ROCm kernels and internal framework are not fully open, complicating reproduction.
  • Some improvements may depend on specific dataset mixtures and synthetic data choices that are not fully public.

When Not To Use

  • Do not use as a drop-in replacement for retrieval-heavy or open-domain QA systems without adding retrieval.
  • Avoid relying solely on it for commonsense zero-shot tasks where it trails larger models.
  • Not ideal if you require fully open training code and all datasets for auditing.

Failure Modes

  • Weak factual recall in open-domain QA (very low NQ/TriviaQA scores).
  • Potential bias or artefacts from synthetic dataset fusion and proprietary Korean corpus.
  • Performance variability across tasks: strong on math/code, weaker on some commonsense and knowledge benchmarks.

Core Entities

Models

  • Motif-2.6B
  • Motif-2.6B-LC
  • Mistral 7B
  • Gemma 1 (2B/7B)
  • Gemma 2 (2B/9B)
  • Gemma 3 (1B/4B)
  • Llama 3 (8B)
  • Llama 3.2 (1B/3B)
  • Phi-2 (2.7B)
  • Phi-3 (3.8B/7B)
  • Qwen3-8B
  • Qwen3 (excluded from comparisons)

Metrics

  • Accuracy
  • P@1
  • F1
  • Average improvement % (relative delta)

Datasets

  • DCLM
  • TxT360
  • Fineweb2
  • FineMath
  • Tulu (Tulu 3 mixtures)
  • Granite
  • MagpieLM
  • LM-Sys
  • Exam-CoT (synthetic)
  • EvolKit (Auto Evol-Instruct outputs)
  • In-house Korean corpus

Benchmarks

  • MMLU
  • HellaSwag
  • WinoGrande
  • PIQA
  • ARC-E
  • ARC-C
  • NQ
  • TriviaQA
  • HumanEval
  • MBPP
  • MATH
  • GSM8K
  • BBH
  • AGIEval
  • DROP
  • SIQA
  • BoolQ
  • GPQA
  • IFEval