An 11B Transformer tuned for Polish that rivals much larger models across European benchmarks

Overview

Decision Snapshot: Ready For Pilot

The paper includes many standard benchmarks and numeric results, but is a preliminary preprint and lacks an explicit full code/data release, so treat replication as moderate effort.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 8/9

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 55%

Authors

Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej

Why It Matters For Business

An 11B model that runs on consumer GPUs and supports quantized deployment delivers strong Polish and European-language performance. That cuts infrastructure costs versus 70B+ models while keeping high accuracy for local applications.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

Bielik 11B v3 is an 11-billion-parameter Transformer derived from Mistral 7B and depth up-scaled to 50 layers. It was pretrained on ~1.1T tokens with a heavy Polish focus (54% of documents), then instruction-tuned with supervised fine-tuning (SFT), aligned with DPO-Positive (DPO-P, a variant of Direct Preference Optimization), and refined with reinforcement learning (GRPO / Dr. GRPO). The instruction-tuned model achieves top results among open models on Polish benchmarks (PLCC 71.83%, Open PL LLM Leaderboard 65.93, Belebele 82.98) while remaining runnable on consumer GPUs and offering quantized deployment options.

Problem Statement

High-quality LLMs for less-represented European languages need better parameter efficiency and language-specific tuning. The paper builds a model optimized for Polish that still performs well in other European languages, while remaining deployable on mainstream GPUs.

Main Contribution

An 11B-parameter model built by depth up-scaling Mistral 7B to 50 layers, keeping consumer-GPU deployability.

A large multilingual pretraining mix (1.1T tokens, 32 languages) with Polish as 54.25% of documents.

Key Findings

Instruction-tuned Bielik-11B-v3 ranks among top open models on Polish benchmarks.

Numbers: Open PL LLM Leaderboard (Instruct): 65.93 average

Practical Use: You can get near state-of-the-art Polish performance with an 11B model instead of much larger models, which is useful when GPU capacity or budget is limited.

Evidence Ref: Table 5; Sec. 5.1

Strong cultural and language knowledge in Polish.

Numbers: PLCC: 71.83% (top among open-source models)

Practical Use: Best choice for apps needing Polish cultural accuracy, like education or local chatbots.

Evidence Ref: Table 9; Sec. 5.5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Parameters	~11.2B	Mistral 7B	≈×1.6	—	Model scaled to ~11B via depth up-scaling (Sec.2; Table 1)	Table 1; Sec.2.1
Pretraining tokens	1.1 trillion tokens	—	5× larger than Bielik v2	—	Total pretraining tokens = 1.1T; fivefold increase versus v2 (Sec.3.1)	Sec.3.1
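
The ~11.2B figure in the table can be sanity-checked with a rough parameter count. The sketch below assumes that depth up-scaling leaves the layer widths of Mistral 7B untouched (hidden size 4096, MLP size 14336, 32 query heads, 8 key/value heads, head dimension 128) and that input and output embeddings are untied. These widths come from the public Mistral 7B configuration, not from this summary, so treat the result as an estimate.

```python
# Rough parameter count for a Mistral-style decoder stack.
# Assumption: Bielik keeps Mistral 7B's layer widths and only changes depth and vocab.
HIDDEN = 4096
INTERMEDIATE = 14336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128


def count_params(n_layers: int, vocab_size: int) -> int:
    q_and_o = 2 * HIDDEN * (N_HEADS * HEAD_DIM)      # query and output projections
    k_and_v = 2 * HIDDEN * (N_KV_HEADS * HEAD_DIM)   # smaller because of GQA
    mlp = 3 * HIDDEN * INTERMEDIATE                  # gate, up, and down projections
    norms = 2 * HIDDEN                               # two RMSNorm weights per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = 2 * vocab_size * HIDDEN             # input table plus output head
    final_norm = HIDDEN
    return n_layers * per_layer + embeddings + final_norm


mistral_7b = count_params(n_layers=32, vocab_size=32_000)
bielik_11b = count_params(n_layers=50, vocab_size=32_128)
print(f"{mistral_7b / 1e9:.2f}B -> {bielik_11b / 1e9:.2f}B")  # prints "7.24B -> 11.17B"
```

Under these assumptions the 50-layer stack lands at about 11.17B parameters, consistent with the reported ~11.2B.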

What To Try In 7 Days

Benchmark Bielik-11B-v3 on your Polish test set to compare cost vs. larger models.

Deploy a quantized v3 model on a 24GB GPU to validate latency and memory footprint.

Run the instruction-tuned variant on dialogue or QA tasks and compare user-facing quality against the base model.
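
Before the 24GB deployment test, a back-of-the-envelope estimate of weight memory helps pick a quantization level. This is a minimal sketch that uses the ~11.2B parameter count from the Results table and treats the card as 24 GiB. The bit widths are illustrative, and real quantized formats add some overhead for scales and metadata that is not modeled here.

```python
N_PARAMS = 11.2e9   # approximate parameter count from the Results table
GPU_GIB = 24.0      # assumed usable VRAM; in practice the driver reserves some of it


def weight_gib(bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return N_PARAMS * bits_per_weight / 8 / 1024**3


# Illustrative precisions, not a list of formats the authors ship.
for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_gib(bits)
    print(f"{label:>5}: {gib:5.1f} GiB weights, {GPU_GIB - gib:5.1f} GiB left for KV cache and activations")
```

Unquantized bf16 weights alone take most of a 24GB card, which is why the quantized variants matter for long prompts or batching.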

Agent Features

Memory

native context: 32,768 tokens

extended context via YaRN: up to 65,536–131,072 tokens
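
Context length translates directly into key/value cache memory at inference time. The sketch below estimates that cost for the native and extended windows, assuming 50 layers, a 16-bit cache, and the GQA layout of Mistral 7B (8 key/value heads of dimension 128). The head counts are an assumption carried over from the base model rather than a figure stated in this summary.

```python
N_LAYERS = 50
N_KV_HEADS = 8        # assumed GQA setting inherited from Mistral 7B
N_QUERY_HEADS = 32    # what the cache would need without GQA
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # bf16 / fp16 cache


def kv_cache_gib(context_tokens: int, kv_heads: int = N_KV_HEADS) -> float:
    """KV cache size for one sequence, in GiB."""
    # Factor 2 covers keys and values.
    per_token = 2 * N_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * per_token / 1024**3


for tokens in (32_768, 65_536, 131_072):
    with_gqa = kv_cache_gib(tokens)
    without = kv_cache_gib(tokens, kv_heads=N_QUERY_HEADS)
    print(f"{tokens:>7} tokens: {with_gqa:5.2f} GiB with GQA, {without:6.2f} GiB without")
```

Under these assumptions a full 32,768-token sequence needs 6.25 GiB of cache and a 131,072-token sequence needs 25 GiB, so the longest YaRN window would not fit on a single 24GB card without cache quantization or offloading.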

Tool Use

function calling support (tool use)

Frameworks

DPO / DPO-Positive

GRPO

VERL (RL framework)
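
For orientation, the two alignment objectives named above can be written in a few lines. This is a sketch based on the published descriptions of DPO-Positive and GRPO / Dr. GRPO, not on the authors' training code. The `beta` and `lam` defaults are hypothetical, the inputs are summed sequence log-probabilities, and the population standard deviation is used for normalization.

```python
import math
from statistics import mean, pstdev


def dpop_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPO-Positive loss for one preference pair.

    pol_* / ref_* are log-probs of the chosen (w) and rejected (l) answers
    under the policy and the frozen reference model. beta and lam are
    hypothetical values chosen only for illustration.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)   # standard DPO term
    penalty = max(0.0, ref_w - pol_w)            # active if the chosen answer lost probability
    z = beta * (margin - lam * penalty)
    # Numerically stable -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def group_advantages(rewards, normalize_std=True, eps=1e-6):
    """Group-relative advantages for several sampled answers to one prompt."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered                          # Dr. GRPO-style: mean-centering only
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]  # GRPO-style: also divide by the std


print(dpop_loss(-12.0, -15.0, -10.0, -14.0))
print(group_advantages([1.0, 0.0, 0.0, 1.0], normalize_std=False))
```

Setting `lam=0` recovers plain DPO. Dr. GRPO also drops the per-response length normalization in the policy loss, which is not shown here.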

Architectures

Transformer (Mistral-derived)

Depth up-scaling to 50 layers

Grouped-Query Attention (GQA)

SwiGLU activation

RoPE positional embeddings

RMSNorm pre-normalization

Optimization Features

Token Efficiency

retained Mistral 32k tokenizer with small vocab tweaks (32,128 tokens)

tokenization trade-off chosen for Polish and multilingual balance

Infra Optimization

designed to run on consumer GPUs up to 24GB VRAM

training used HPC clusters with isolated evaluation nodes

Model Optimization

depth up-scaling (duplicate-and-trim layers)

GQA to reduce attention KV heads
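
The duplicate-and-trim idea can be illustrated with layer indices alone. The summary does not say which layers were trimmed, so the sketch below assumes a symmetric scheme: take two copies of the 32-layer base, drop the top layers of the first copy and the bottom layers of the second, and stack the remainder. The function name and the symmetric split are illustrative assumptions.

```python
def depth_upscale_indices(n_base_layers: int, n_target_layers: int) -> list[int]:
    """Return, for each layer of the up-scaled model, the base layer it is copied from.

    Assumed scheme: first half = bottom layers of copy 1, second half = top layers of copy 2.
    """
    if n_target_layers % 2 or not n_base_layers < n_target_layers <= 2 * n_base_layers:
        raise ValueError("target depth must be even and between base and 2x base")
    keep = n_target_layers // 2
    bottom = list(range(keep))                               # copy 1, top layers trimmed
    top = list(range(n_base_layers - keep, n_base_layers))   # copy 2, bottom layers trimmed
    return bottom + top


layer_map = depth_upscale_indices(n_base_layers=32, n_target_layers=50)
duplicated = sorted(set(i for i in layer_map if layer_map.count(i) == 2))
print(len(layer_map), "layers;", len(duplicated), "base layers appear twice")
```

With these inputs each copy keeps 25 layers, so base layers 7 through 24 appear twice in the 50-layer stack. The up-scaled model is then trained further so the duplicated layers can diverge.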

System Optimization

selective gradient checkpointing

tensor parallelism for long-context training

FlexAttention to speed packed sequence processing

Training Optimization

AdamW optimizer (β1=0.9, β2=0.95)

cosine LR decay with linear warmup

bfloat16 mixed precision

gradient clipping norm 1.0

checkpoint merging / weight averaging
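
Two of the items above, the warmup-plus-cosine schedule and norm-based gradient clipping, are easy to show in plain Python. This is a generic sketch of those techniques. The peak learning rate, warmup length, and step counts are hypothetical, since the summary does not list them; only the clipping norm of 1.0 comes from the source.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a flat list of gradients so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads]


# Hypothetical schedule: 1,000 steps, 100 warmup steps, peak LR 3e-4.
for step in (0, 99, 550, 1000):
    print(step, f"{lr_at(step, 1000, 100, 3e-4):.2e}")
print(clip_by_global_norm([3.0, 4.0]))  # original norm is 5, so the result is rescaled to about norm 1
```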

Inference Optimization

extensive quantization options (details not enumerated)

sample packing to reduce padding

FlexAttention masks for packed sequences
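
Sample packing and the attention mask it requires can be sketched without any framework. The packing heuristic below (first-fit decreasing) is an assumption chosen for illustration, as the summary does not say how sequences were packed. The mask rule, causal attention restricted to tokens of the same document, is the kind of predicate a FlexAttention mask function would encode.

```python
def pack_sequences(lengths, max_len):
    """First-fit-decreasing packing. Returns one list of sequence indices per pack."""
    packs = []  # each entry: [remaining_capacity, [sequence indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[idx]
        if n > max_len:
            raise ValueError(f"sequence {idx} is longer than max_len")
        for pack in packs:
            if pack[0] >= n:
                pack[0] -= n
                pack[1].append(idx)
                break
        else:
            packs.append([max_len - n, [idx]])
    return [indices for _, indices in packs]


def document_causal_mask(doc_ids):
    """mask[q][k] is True when query token q may attend to key token k."""
    n = len(doc_ids)
    return [[doc_ids[q] == doc_ids[k] and k <= q for k in range(n)] for q in range(n)]


lengths = [5, 3, 8, 2, 6]
packs = pack_sequences(lengths, max_len=8)
print(packs)  # [[2], [4, 3], [0, 1]]
print(f"{len(packs) * 8} token slots instead of {len(lengths) * 8} with per-sequence padding")

# Two packed documents of lengths 2 and 1: the last token must not see the first document.
for row in document_causal_mask([0, 0, 1]):
    print(row)
```

In this toy case the five sequences fit into three full rows with no padding, and the mask keeps packed documents from attending to each other.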

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

May produce factual errors and hallucinations; not safe for unverified high-stakes use (Sec.6).

Training corpus includes copyrighted and sensitive documents—mitigations described but residual risk remains (Sec.3.1.2).

When Not To Use

Do not use as sole source for high-stakes medical or legal decisions without human verification.

Avoid relying on it for languages far outside the 32-language mix without thorough testing.

Failure Modes

Hallucination: plausible but incorrect statements, especially on obscure facts.

Adversarial brittleness: lower accuracy on CPTUB tricky questions (3.19–3.73 range).

Core Entities

Models

Bielik-11B-v3, Mistral-7B-v0.2, Bielik-11B-v2, Bielik-11B-v2.6, DeepSeek-V3-0324, Qwen2.5-14B, Meta-Llama-3.1-70B, phi-4, Qwen2.5-72B

Metrics

Accuracy, average score, BLEU, binary F1, macro F1, Levenshtein, GSM8K score

Datasets

CulturaX, HPLT v2.0, FineWeb, FineWeb-Edu, SlimPajama-627B, Common Crawl, Parliamentary Discourse Corpus, Science Library, Polish Wikipedia (incl. Silesian, Kashubian)

Benchmarks

Open PL LLM Leaderboard, Open LLM Leaderboard, Polish EQ-Bench, CPTUB, Polish Medical Leaderboard (PES), PLCC, INCLUDE-base-44, Belebele, FLORES
