An 11B Transformer tuned for Polish that rivals much larger models across European benchmarks

December 30, 2025 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper reports numeric results on many standard benchmarks, but it is a preliminary preprint without an explicit full code or data release, so expect moderate effort to replicate it.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 8/9

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 55%

Authors

Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej

Why It Matters For Business

Get strong Polish and European-language performance from an 11B model that runs on consumer GPUs and supports quantized deployment. This cuts infrastructure costs versus 70B+ models while keeping accuracy high for local applications.

Summary TLDR

Bielik 11B v3 is an 11-billion-parameter Transformer derived from Mistral 7B and depth up-scaled to 50 layers. It was pretrained on ~1.1T tokens with a heavy Polish focus (about 54% of documents), then instruction-tuned (SFT), aligned with DPO-Positive (DPO-P, a variant of Direct Preference Optimization), and refined with reinforcement learning (GRPO / Dr. GRPO). The instruction-tuned model achieves top results among open models on Polish benchmarks (PLCC 71.83%, Open PL LLM Leaderboard 65.93, Belebele 82.98) while remaining runnable on consumer GPUs and offering quantized deployment options.
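The DPO-P alignment step can be illustrated with a minimal sketch of the per-pair loss, assuming the commonly published DPO-Positive formulation: the standard DPO margin, minus a penalty that applies only when the policy gives the preferred answer a lower log-probability than the reference model does. The function name and the `beta` and `lam` values are illustrative assumptions, not settings reported for Bielik.

```python
import math

def dpop_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1, lam=5.0):
    """Per-pair DPO-Positive style loss from summed response log-probabilities.

    beta and lam are hypothetical values chosen for illustration only.
    """
    # Standard DPO margin: how much more the policy prefers the chosen answer
    # over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # DPO-P penalty: non-zero only when the policy's log-probability of the
    # chosen answer has dropped below the reference model's.
    penalty = max(0.0, ref_chosen - pi_chosen)
    z = beta * (margin - lam * penalty)
    # -log(sigmoid(z)), computed stably as softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# The margin is positive here, but the chosen answer became less likely than
# under the reference, so the penalty raises the loss.
print(dpop_loss(-12.0, -15.0, -10.0, -10.0))
print(dpop_loss(-12.0, -15.0, -10.0, -10.0, lam=0.0))  # plain DPO for comparison
```

With `lam=0.0` the function reduces to the ordinary DPO loss, which makes the effect of the penalty term easy to compare.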

Problem Statement

High-quality LLMs for less-represented European languages need better parameter efficiency and language-specific tuning. The paper builds a model optimized for Polish that still performs well in other European languages, while remaining deployable on mainstream GPUs.

Main Contribution

An 11B-parameter model built by depth up-scaling Mistral 7B to 50 layers, keeping consumer-GPU deployability.

A large multilingual pretraining mix (1.1T tokens, 32 languages) with Polish as 54.25% of documents.
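The depth up-scaling in the first contribution can be illustrated with a duplicate-and-trim layer plan. The sketch below assumes a symmetric trim in the style of earlier depth up-scaling work: drop the last m layers of one copy of the 32-layer base and the first m layers of a second copy, so that 2 × (32 − m) = 50 gives m = 7. The paper's exact layer selection may differ, and the function name is hypothetical.

```python
def depth_upscale_plan(base_layers: int, target_layers: int) -> list[int]:
    """Return the base-model layer index used at each position of the deeper stack."""
    overlap = 2 * base_layers - target_layers
    if overlap < 0 or overlap % 2 != 0:
        raise ValueError("target depth not reachable with a symmetric trim")
    trim = overlap // 2  # layers dropped from each copy
    first_copy = list(range(0, base_layers - trim))  # copy A without its last `trim` layers
    second_copy = list(range(trim, base_layers))     # copy B without its first `trim` layers
    return first_copy + second_copy

plan = depth_upscale_plan(base_layers=32, target_layers=50)
print(len(plan))           # 50
print(plan[24], plan[25])  # 24 7 (the seam between the two copies)
```

Under this assumption, base layers 7 through 24 appear twice in the up-scaled model, once in each copy.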

Key Findings

Instruction-tuned Bielik-11B-v3 ranks among top open models on Polish benchmarks.

Numbers: Open PL LLM Leaderboard (Instruct): 65.93 average

Practical Use: You can get near state-of-the-art Polish performance with an 11B model instead of much larger models, which is useful when GPU capacity or budget is limited.

Evidence Ref: Table 5; Sec.5.1

Strong cultural and language knowledge in Polish.

Numbers: PLCC: 71.83% (top among open-source models)

Practical Use: Well suited to apps needing Polish cultural accuracy, such as education or local chatbots.

Evidence Ref: Table 9; Sec.5.5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Parameters | ~11.2B | Mistral 7B | ≈×1.6 | - | Model scaled to ~11B via depth up-scaling (Sec.2; Table 1) | Table 1; Sec.2.1
Pretraining tokens | 1.1 trillion tokens | - | larger than Bielik v2 | - | Total pretraining tokens = 1.1T; fivefold increase versus v2 (Sec.3.1) | Sec.3.1
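The ~11.2B parameter figure can be sanity-checked with a back-of-the-envelope count. This sketch assumes Bielik keeps the Mistral 7B layer dimensions (hidden size 4096, MLP size 14336, 32 query heads, 8 key/value heads) and untied input and output embeddings; only the 50 layers and the 32,128-token vocabulary come from this summary.

```python
def decoder_params(layers, vocab, hidden=4096, mlp=14336, q_heads=32, kv_heads=8):
    """Approximate parameter count of a Mistral-style decoder (no bias terms)."""
    head_dim = hidden // q_heads
    kv_dim = kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V (GQA)
    feed_forward = 3 * hidden * mlp                         # SwiGLU: gate, up, down projections
    norms = 2 * hidden                                      # two RMSNorm weights per layer
    per_layer = attention + feed_forward + norms
    embeddings = 2 * vocab * hidden                         # input embedding + output head
    return layers * per_layer + embeddings + hidden         # + final RMSNorm

bielik = decoder_params(layers=50, vocab=32_128)
mistral = decoder_params(layers=32, vocab=32_000)
print(f"{bielik / 1e9:.2f}B vs {mistral / 1e9:.2f}B, ratio {bielik / mistral:.2f}")
# prints "11.17B vs 7.24B, ratio 1.54"
```

Under these assumptions the ratio comes out near 1.54, slightly below the table's ≈×1.6, which appears to divide 11.2B by a nominal 7B.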

What To Try In 7 Days

Benchmark Bielik-11B-v3 on your Polish test set to compare cost vs. larger models.

Deploy a quantized v3 model on a 24GB GPU to validate latency and memory footprint.

Run the instruction-tuned variant on dialogue or QA tasks and compare user-facing quality against the base model.
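Before attempting the 24GB deployment above, weight and KV-cache memory can be estimated from the parameter count. This is a rough sketch: the key/value head count (8) and head dimension (128) are assumed to match Mistral 7B, and quantization metadata, activations, and framework overhead are ignored.

```python
def weight_gib(params: float, bits: int) -> float:
    """Memory for model weights alone at a given precision, in GiB."""
    return params * bits / 8 / 2**30

def kv_cache_gib(tokens, layers=50, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV-cache size for one sequence: keys and values stored for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 2**30

PARAMS = 11.2e9  # approximate parameter count from the summary

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_gib(PARAMS, bits):.1f} GiB")
print(f"KV cache at 32,768 tokens (16-bit): {kv_cache_gib(32_768):.2f} GiB")
```

Under these assumptions, 16-bit weights alone come to roughly 20.9 GiB, which leaves little room for a long-context cache on a 24GB card, while 4-bit weights (about 5.2 GiB) plus a full 32,768-token cache (6.25 GiB) fit with headroom.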

Agent Features

Memory
native context: 32,768 tokens; extended context via YaRN: up to 65,536–131,072 tokens
Tool Use
function calling support (tool use)
Frameworks
DPO / DPO-Positive; GRPO; VERL (RL framework)
Architectures
Transformer (Mistral-derived); Depth up-scaling to 50 layers; Grouped-Query Attention (GQA); SwiGLU activation; RoPE positional embeddings; RMSNorm pre-normalization
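The YaRN context figures listed under Memory correspond to scaling factors of 2 and 4 over the native 32,768-token window. A minimal sketch of that arithmetic follows. The dictionary keys mirror the `rope_scaling` convention used in Hugging Face `transformers` model configs, but whether Bielik's released configs use exactly these keys and values is an assumption, and the helper name is hypothetical.

```python
NATIVE_CONTEXT = 32_768  # native context length from the summary

def yarn_rope_scaling(target_context: int) -> dict:
    """Build a YaRN-style rope-scaling entry for a desired context length."""
    if target_context % NATIVE_CONTEXT != 0:
        raise ValueError("sketch only handles whole multiples of the native context")
    return {
        "rope_type": "yarn",
        "factor": float(target_context // NATIVE_CONTEXT),
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }

for target in (65_536, 131_072):
    print(target, yarn_rope_scaling(target))
```

Quality at extended lengths should be validated on real long documents, since a larger scaling factor can affect accuracy at short contexts as well.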

Optimization Features

Token Efficiency
retained Mistral 32k tokenizer with small vocab tweaks (32,128 tokens); tokenization trade-off chosen for Polish and multilingual balance
Infra Optimization
designed to run on consumer GPUs up to 24GB VRAM; training used HPC clusters with isolated evaluation nodes
Model Optimization
depth up-scaling (duplicate-and-trim layers); GQA to reduce attention KV heads
System Optimization
selective gradient checkpointing; tensor parallelism for long-context training; FlexAttention to speed packed sequence processing
Training Optimization
AdamW optimizer (β1=0.9, β2=0.95); cosine LR decay with linear warmup; bfloat16 mixed precision; gradient clipping norm 1.0; checkpoint merging / weight averaging
Inference Optimization
extensive quantization options (details not enumerated); sample packing to reduce padding; FlexAttention masks for packed sequences
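The cosine learning-rate decay with linear warmup listed under Training Optimization can be sketched as a plain function of the step index. The peak rate, warmup length, and total step count below are hypothetical placeholders, since this summary does not report the values used for Bielik.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (peak_lr - min_lr) * cosine

# Hypothetical schedule: 1,000 steps, 100 warmup steps, peak rate 1e-4.
for step in (0, 99, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, peak_lr=1e-4, warmup_steps=100))
```

In a training loop this value would be written to the optimizer's learning rate each step, alongside the AdamW betas and gradient-clipping norm listed above.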

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

May produce factual errors and hallucinations; not safe for unverified high-stakes use (Sec.6).

Training corpus includes copyrighted and sensitive documents; mitigations are described, but residual risk remains (Sec.3.1.2).

When Not To Use

Do not use it as the sole source for high-stakes medical or legal decisions without human verification.

Avoid relying on it for languages far outside the 32-language mix without thorough testing.

Failure Modes

Hallucination: plausible but incorrect statements, especially on obscure facts.

Adversarial brittleness: lower accuracy on CPTUB tricky questions (3.19–3.73 range).

Core Entities

Models

Bielik-11B-v3, Mistral-7B-v0.2, Bielik-11B-v2, Bielik-11B-v2.6, DeepSeek-V3-0324, Qwen2.5-14B, Meta-Llama-3.1-70B, phi-4, Qwen2.5-72B

Metrics

Accuracy, average score, BLEU, binary F1, macro F1, Levenshtein, GSM8K score

Datasets

CulturaX, HPLT v2.0, FineWeb, FineWeb-Edu, SlimPajama-627B, Common Crawl, Parliamentary Discourse Corpus, Science Library, Polish Wikipedia (incl. Silesian, Kashubian)

Benchmarks

Open PL LLM Leaderboard, Open LLM Leaderboard, Polish EQ-Bench, CPTUB, Polish Medical Leaderboard (PES), PLCC, INCLUDE-base-44, Belebele, FLORES