An 11B Transformer tuned for Polish that rivals much larger models across European benchmarks

December 30, 2025 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.55

Cost Impact Score

0.7

Citation Count

0

Authors

Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej

Why It Matters For Business

Bielik 11B v3 delivers strong Polish and European-language performance in an 11B model that runs on consumer GPUs and supports quantized deployment. That can cut infrastructure costs relative to 70B+ models while keeping accuracy high for local-language applications.

Summary TLDR

Bielik 11B v3 is an 11-billion-parameter Transformer derived from Mistral 7B and depth up-scaled to 50 layers. It was pretrained on ~1.1T tokens with a heavy Polish focus (54.25% of documents), then instruction-tuned (SFT), aligned with DPO-Positive (DPO-P), a variant of Direct Preference Optimization, and refined with reinforcement learning (GRPO/Dr. GRPO). The instruction-tuned model achieves top results among open models on Polish and multilingual benchmarks (PLCC 71.83%, Open PL LLM Leaderboard 65.93, Belebele 82.98) while remaining runnable on consumer GPUs and offering quantized deployment options.

Problem Statement

High-quality LLMs for less-represented European languages need better parameter efficiency and language-specific tuning. The paper builds a model optimized for Polish that still performs well in other European languages, while remaining deployable on mainstream GPUs.

Main Contribution

An 11B-parameter model built by depth up-scaling Mistral 7B to 50 layers, keeping consumer-GPU deployability.

A large multilingual pretraining mix (1.1T tokens, 32 languages) with Polish as 54.25% of documents.

A four-stage training pipeline: continued pretraining, supervised fine-tuning (20M instructions), DPO-Positive on 114k preference pairs, and RL (GRPO/Dr. GRPO) on 143k verifiable problems.

Comprehensive evaluation across Polish and multilingual benchmarks showing strong Polish-specific and competitive English performance.

Practical deployment features: long contexts (up to 65k–131k with YaRN), sample packing, FlexAttention, and quantization options.

Key Findings

Instruction-tuned Bielik-11B-v3 ranks among top open models on Polish benchmarks.

Numbers: Open PL LLM Leaderboard (Instruct): 65.93 average

Strong cultural and language knowledge in Polish.

Numbers: PLCC: 71.83% (top among open-source models)

Excellent reading comprehension across European languages.

Numbers: Belebele reading comprehension: 82.98 (Instruct)

Competitive English and math/reasoning skills.

Numbers: Open LLM Leaderboard (English, Instruct): 72.45 average; GSM8K: 85.60

Large, Polish-heavy pretraining mix and long-context training.

Numbers: 1.1 trillion pretraining tokens; Polish share: 54.25% of documents; context up to 65,536 tokens

Instruction tuning, DPO-P, and RL materially improve specialized tasks.

Numbers: Polish Medical: base 45.86% → Instruct 50.21%; CPTUB: 3.73 (Instruct)

Parameter efficiency: outperforms many models with 2–6× as many parameters on the evaluated benchmarks.

Numbers: Multiple benchmark rank comparisons versus 70B+ and 14–32B models (see Tables 4–13)

Results

Parameters
Value: ~11.2B
Baseline: Mistral 7B

Pretraining tokens
Value: 1.1 trillion tokens

Polish share of corpus
Value: 54.25% of documents

Open PL LLM Leaderboard (instruction-tuned)
Value: 65.93 (average)
Baseline: Bielik-11B-v2 (65.71–65.45 variants)

PLCC (Polish cultural competency)
Value: 71.83%
Baseline: Bielik-11B-v2.6-Instruct (65.50%)

Belebele reading comprehension (Instruct)
Value: 82.98 average
Baseline: Bielik-11B-v2.6-Instruct (68.67)

FLORES translation (BLEU)
Value: 19.22 average (Instruct)
Baseline: Bielik-11B-v2 (11.25)

Open LLM Leaderboard (English, Instruct)
Value: 72.45 average
Baseline: Bielik-11B-v2.6-Instruct (65.50)

Polish Medical Leaderboard (Instruct)
Value: 50.21%
Baseline: Base Bielik-11B-v3 (45.86%)

Who Should Care

What To Try In 7 Days

Benchmark Bielik-11B-v3 on your Polish test set to compare cost vs. larger models.

Deploy a quantized v3 model on a 24GB GPU to validate latency and memory footprint.

Run the instruction-tuned variant on dialogue or QA tasks and compare user-facing quality against the base model.
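
Before the 24GB deployment test, a back-of-the-envelope estimate of weight memory can help pick a precision. The sketch below is only arithmetic on the ~11.2B parameter count reported above; it counts weights alone and assumes nothing about the specific quantization formats the authors ship. KV cache, activations, and quantization metadata are ignored, so real usage will be higher.

```python
# Rough weight-memory estimate for an ~11.2B-parameter model at several precisions.
# Weights only: KV cache, activations, and quantization metadata are NOT included.

PARAMS = 11.2e9  # parameter count reported in the summary (~11.2B)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Bit widths chosen for illustration; the report does not enumerate its formats.
estimates = {bits: weight_gb(PARAMS, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gb:.1f} GB")
```

Under these assumptions, 16-bit weights alone come close to filling a 24GB card, which is why a quantized variant is the practical choice for that test.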

Agent Features

Memory

  • native context: 32,768 tokens
  • extended context via YaRN: up to 65,536–131,072 tokens
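
YaRN-style RoPE scaling is usually described by a scaling factor equal to the target context length divided by the native one. As a minimal sketch, assuming the model follows that convention, the factors implied by the figures above can be computed as follows. The function name `yarn_factor` is hypothetical, and how the factor is passed to a given inference stack is not covered here.

```python
# Scaling factors implied by the context lengths listed above.
# Assumption: factor = target context / native context, the usual YaRN convention.

NATIVE_CONTEXT = 32_768  # native window from the summary


def yarn_factor(target_context: int, native_context: int = NATIVE_CONTEXT) -> float:
    """Return the context-extension factor for a target window."""
    if target_context < native_context:
        raise ValueError("target context must be at least the native context")
    return target_context / native_context


factors = {target: yarn_factor(target) for target in (65_536, 131_072)}

for target, factor in factors.items():
    print(f"{target:>7} tokens -> factor {factor:g}")
```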

Tool Use

  • function calling support (tool use)

Frameworks

  • DPO / DPO-Positive
  • GRPO
  • VERL (RL framework)
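
To make the difference between DPO and DPO-Positive concrete, here is a minimal per-pair sketch of the loss as formulated in the DPO-Positive literature: the standard DPO margin plus a penalty that activates when the policy assigns the chosen answer a lower log-probability than the reference model does. The values of `beta` and `lam` are placeholders, not the report's hyperparameters, and the function is an illustration rather than the authors' training code.

```python
import math


def dpop_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
              beta: float = 0.1, lam: float = 5.0) -> float:
    """Per-pair DPO-Positive loss from sequence log-probabilities.

    pol_w / pol_l: policy log-prob of the chosen / rejected response.
    ref_w / ref_l: reference-model log-prob of the chosen / rejected response.
    beta, lam: illustrative values only (assumptions).
    """
    # Standard DPO margin: how much more the policy prefers the chosen
    # response than the reference model does.
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    # DPO-Positive penalty: nonzero only if the policy's likelihood of the
    # chosen response has dropped below the reference model's.
    penalty = max(0.0, ref_w - pol_w)
    z = beta * (margin - lam * penalty)
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With lam=0 the expression reduces to the plain DPO loss.
plain = dpop_loss(-11.0, -14.0, -10.0, -12.0, lam=0.0)
penalized = dpop_loss(-11.0, -14.0, -10.0, -12.0, lam=5.0)
print(f"DPO: {plain:.3f}  DPO-P: {penalized:.3f}")
```

In the example pair, the policy widened the margin by lowering both log-probabilities; plain DPO rewards that, while the penalty term makes DPO-Positive discourage it.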

Architectures

  • Transformer (Mistral-derived)
  • Depth up-scaling to 50 layers
  • Grouped-Query Attention (GQA)
  • SwiGLU activation
  • RoPE positional embeddings
  • RMSNorm pre-normalization

Optimization Features

Token Efficiency

  • retained Mistral 32k tokenizer with small vocab tweaks (32,128 tokens)
  • tokenization trade-off chosen for Polish and multilingual balance

Infra Optimization

  • designed to run on consumer GPUs up to 24GB VRAM
  • training used HPC clusters with isolated evaluation nodes

Model Optimization

  • depth up-scaling (duplicate-and-trim layers)
  • GQA to reduce the number of key/value attention heads
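
The duplicate-and-trim idea can be sketched as an index computation: copy the base stack twice, trim the end of the first copy and the start of the second, and concatenate. Mistral 7B has 32 Transformer layers; the symmetric 25 + 25 split below is an assumption made for illustration, and the report's exact choice of layers may differ. The function name is hypothetical.

```python
def upscaled_layer_indices(n_base: int = 32, n_target: int = 50) -> list[int]:
    """Return base-layer indices, in order, for the deeper stack.

    Assumes a symmetric split: each of the two copies contributes
    n_target // 2 layers.
    """
    if not (n_base < n_target <= 2 * n_base) or n_target % 2 != 0:
        raise ValueError("need n_base < n_target <= 2 * n_base and an even n_target")
    keep = n_target // 2                       # layers kept from each copy
    bottom = list(range(keep))                 # first copy, last layers trimmed
    top = list(range(n_base - keep, n_base))   # second copy, first layers trimmed
    return bottom + top


layers = upscaled_layer_indices()
print(len(layers), layers[:3], layers[24:27], layers[-3:])
```

Layers in the overlapping middle range appear twice, initialized from the same weights, and diverge during continued pretraining.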

System Optimization

  • selective gradient checkpointing
  • tensor parallelism for long-context training
  • FlexAttention to speed packed sequence processing

Training Optimization

  • AdamW optimizer (β1=0.9, β2=0.95)
  • cosine LR decay with linear warmup
  • bfloat16 mixed precision
  • gradient clipping norm 1.0
  • checkpoint merging / weight averaging
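
As an illustration of two items in the list above, the sketch below implements a cosine learning-rate decay with linear warmup and the scaling coefficient used in norm-based gradient clipping. The peak rate, floor, and step counts are placeholder values, not the report's settings; only the clipping threshold of 1.0 comes from the list.

```python
import math


def lr_at(step: int, peak_lr: float = 2e-5, min_lr: float = 2e-6,
          warmup_steps: int = 100, total_steps: int = 1000) -> float:
    """Learning rate at a given step (all defaults are illustrative)."""
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_coefficient(grad_norm: float, max_norm: float = 1.0, eps: float = 1e-6) -> float:
    """Factor to multiply all gradients by so their global norm is at most max_norm."""
    return min(1.0, max_norm / (grad_norm + eps))


for s in (0, 99, 100, 550, 1000):
    print(f"step {s:>4}: lr = {lr_at(s):.2e}")
print(f"clip factor for norm 4.0: {clip_coefficient(4.0):.3f}")
```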

Inference Optimization

  • extensive quantization options (details not enumerated)
  • sample packing to reduce padding
  • FlexAttention masks for packed sequences
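
Sample packing and the accompanying attention mask can be sketched in pure Python. The packing below is a simple first-fit heuristic, chosen for illustration and not necessarily the strategy used in the report. The `can_attend` predicate expresses the rule a packed-sequence mask has to enforce: a token attends only to earlier tokens from the same document. In PyTorch's FlexAttention, a rule like this would be supplied as a mask function over query and key positions; here it is kept framework-free so it runs anywhere.

```python
def pack_first_fit(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit packing; returns bins as lists of sequence indices."""
    bins: list[list[int]] = []
    free: list[int] = []  # remaining capacity of each bin
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing bin has room: open a new one.
            bins.append([idx])
            free.append(max_len - n)
    return bins


def document_ids(lengths_in_bin: list[int]) -> list[int]:
    """Per-token document id for one packed row."""
    ids: list[int] = []
    for doc, n in enumerate(lengths_in_bin):
        ids.extend([doc] * n)
    return ids


def can_attend(doc_ids: list[int], q_idx: int, kv_idx: int) -> bool:
    """Causal attention restricted to tokens of the same document."""
    return doc_ids[q_idx] == doc_ids[kv_idx] and kv_idx <= q_idx


lengths = [5, 3, 4, 2]            # toy sequence lengths
bins = pack_first_fit(lengths, max_len=8)
row_ids = document_ids([lengths[i] for i in bins[0]])
print(bins, row_ids)
```

With these toy lengths, four sequences fit into two rows of eight tokens with no padding, and the mask keeps the second document in a row from seeing the first.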

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • May produce factual errors and hallucinations; not safe for unverified high-stakes use (Sec. 6).
  • Training corpus includes copyrighted and sensitive documents; mitigations are described, but residual risk remains (Sec. 3.1.2).
  • Some benchmark wins depend on Polish-heavy pretraining mix; performance may drop for non-European languages.
  • Quantization options mentioned but detailed trade-offs (accuracy vs size) are not published in this report.

When Not To Use

  • Do not use as the sole source for high-stakes medical or legal decisions without human verification.
  • Avoid relying on it for languages far outside the 32-language mix without thorough testing.
  • Not suitable where full reproducibility or open-source code/data is required.

Failure Modes

  • Hallucination: plausible but incorrect statements, especially on obscure facts.
  • Adversarial brittleness: weaker results on CPTUB's tricky questions (scores in the 3.19–3.73 range).
  • Potential length inflation in responses (mitigated by Dr. GRPO but may still occur).
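
The GRPO and Dr. GRPO variants differ in how rewards within a group of sampled responses become advantages. A minimal sketch, based on the published descriptions of both methods and not on the report's code: GRPO subtracts the group mean and divides by the group standard deviation, while Dr. GRPO keeps only the mean subtraction. The toy rewards are invented for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: centered and scaled by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Dr. GRPO advantages: centered only, no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]   # toy verifiable rewards for one prompt
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
```

Dr. GRPO also drops the per-response length normalization in the loss, which is the part usually credited with reducing the tendency toward overly long outputs; that term is not shown here.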

Core Entities

Models

  • Bielik-11B-v3
  • Mistral-7B-v0.2
  • Bielik-11B-v2
  • Bielik-11B-v2.6
  • DeepSeek-V3-0324
  • Qwen2.5-14B
  • Meta-Llama-3.1-70B
  • phi-4
  • Qwen2.5-72B

Metrics

  • Accuracy
  • average score
  • BLEU
  • binary F1
  • macro F1
  • Levenshtein distance
  • GSM8K score

Datasets

  • CulturaX
  • HPLT v2.0
  • FineWeb
  • FineWeb-Edu
  • SlimPajama-627B
  • Common Crawl
  • Parliamentary Discourse Corpus
  • Science Library
  • Polish Wikipedia (incl. Silesian, Kashubian)

Benchmarks

  • Open PL LLM Leaderboard
  • Open LLM Leaderboard
  • Polish EQ-Bench
  • CPTUB
  • Polish Medical Leaderboard (PES)
  • PLCC
  • INCLUDE-base-44
  • Belebele
  • FLORES