Overview
Production Readiness
0.7
Novelty Score
0.55
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
Get strong Polish and European-language performance from an 11B model that runs on consumer GPUs and supports quantized deployment. This cuts infrastructure costs versus 70B+ models while keeping accuracy high for local applications.
Summary TLDR
Bielik 11B v3 is an 11-billion-parameter Transformer derived from Mistral 7B and depth up-scaled to 50 layers. It was pretrained on ~1.1T tokens with a heavy Polish focus (about 54% of documents), then instruction-tuned (SFT), aligned with DPO-Positive (DPO-P, a variant of Direct Preference Optimization), and refined with reinforcement learning (GRPO/Dr. GRPO). The instruction-tuned model achieves top results among open models on Polish benchmarks (PLCC 71.83%, Open PL 65.93, Belebele 82.98) while remaining runnable on consumer GPUs and offering quantized deployment options.
Problem Statement
High-quality LLMs for less-represented European languages need better parameter efficiency and language-specific tuning. The paper builds a model optimized for Polish that still performs well in other European languages, while remaining deployable on mainstream GPUs.
Main Contribution
An 11B-parameter model built by depth up-scaling Mistral 7B to 50 layers, keeping consumer-GPU deployability.
A large multilingual pretraining mix (1.1T tokens, 32 languages) with Polish as 54.25% of documents.
A four-stage training and alignment pipeline: continued pretraining, supervised fine-tuning (20M instructions), DPO-Positive on 114k preference pairs, and RL (GRPO/Dr. GRPO) on 143k verifiable problems.
Comprehensive evaluation across Polish and multilingual benchmarks showing strong Polish-specific and competitive English performance.
Practical deployment features: long contexts (up to 65k–131k with YaRN), sample packing, FlexAttention, and quantization options.
Key Findings
Instruction-tuned Bielik-11B-v3 ranks among top open models on Polish benchmarks.
Strong cultural and language knowledge in Polish.
Excellent reading comprehension across European languages.
Competitive English and math/reasoning skills.
Large, Polish-heavy pretraining mix and long-context training.
Instruction tuning, DPO-P, and RL materially improve specialized tasks.
Parameter efficiency: outperforms many models with 2–6× parameters on evaluated benchmarks.
Results
Parameters: 11B
Pretraining tokens: ~1.1T
Polish share of corpus: 54.25% of documents
Open PL LLM Leaderboard (instruction-tuned): 65.93
PLCC (Polish cultural competency): 71.83%
Belebele reading comprehension (Instruct): 82.98
FLORES translation (BLEU)
Open LLM Leaderboard (English, Instruct)
Polish Medical Leaderboard (Instruct)
Who Should Care
What To Try In 7 Days
Benchmark Bielik-11B-v3 on your Polish test set to compare cost vs. larger models.
Deploy a quantized v3 model on a 24GB GPU to validate latency and memory footprint.
Run the instruction-tuned variant on dialogue or QA tasks and compare user-facing quality against the base model.
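Before trying the 24GB deployment, a back-of-the-envelope estimate of weight memory can help pick a quantization level. The sketch below is only arithmetic: it assumes a rounded 11e9 parameters and counts weights alone, so KV cache, activations, and per-format overhead (scales, zero points) are not included. The function name `weight_gb` is illustrative.

```python
# Rough weight-memory estimate for an ~11B-parameter model.
# Assumption: 11e9 parameters (rounded); overhead of real quantization
# formats and runtime buffers is ignored.
PARAMS = 11e9


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Return weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


estimates = {
    name: weight_gb(bits)
    for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]
}

for name, gb in estimates.items():
    print(f"{name}: {gb:.1f} GB of weights")
```

Under these assumptions, bf16 weights alone take about 22 GB, which leaves little headroom on a 24GB card once the KV cache is added. That is one reason to validate a quantized variant first.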
Agent Features
Memory
- native context: 32,768 tokens
- extended context via YaRN: up to 65,536–131,072 tokens
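The two context figures above imply a simple scaling ratio. The sketch below computes the extension factor that a YaRN-style RoPE scaling setup would need to go from the native window to each extended window. How that factor is passed to a given inference stack is not described here, so the helper name `yarn_factor` is purely illustrative.

```python
# Context-extension factor implied by the native and extended windows.
NATIVE_CONTEXT = 32_768


def yarn_factor(target_context: int, native_context: int = NATIVE_CONTEXT) -> float:
    """Ratio between the target window and the native window."""
    if target_context < native_context:
        raise ValueError("target context is already within the native window")
    return target_context / native_context


factors = {target: yarn_factor(target) for target in (65_536, 131_072)}

for target, factor in factors.items():
    print(f"{target} tokens -> scaling factor {factor:g}")
```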
Tool Use
- function calling support (tool use)
Frameworks
- DPO / DPO-Positive
- GRPO
- VERL (RL framework)
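As a rough illustration of how DPO-Positive differs from plain DPO, the sketch below computes a per-pair loss from summed sequence log-probabilities. It follows the commonly published DPO-Positive form, which adds a penalty when the policy assigns the preferred answer a lower log-probability than the reference model does. The `beta` and `lam` values are placeholders and not the settings used for Bielik, and the function name is hypothetical.

```python
import math


def dpop_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
              beta: float = 0.1, lam: float = 5.0) -> float:
    """Per-pair DPO-Positive loss.

    pol_w / pol_l: policy log-probs of the chosen / rejected response.
    ref_w / ref_l: reference-model log-probs of the same responses.
    beta, lam: illustrative hyperparameters (assumed values).
    """
    # Standard DPO margin between chosen and rejected log-ratios.
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    # DPO-Positive penalty: active only when the policy has lowered the
    # chosen response's log-prob below the reference model's.
    penalty = max(0.0, ref_w - pol_w)
    z = beta * (margin - lam * penalty)
    # -log(sigmoid(z)) computed as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# With lam=0 the expression reduces to the plain DPO loss.
without_penalty = dpop_loss(-11.0, -14.0, -10.0, -12.0, lam=0.0)
with_penalty = dpop_loss(-11.0, -14.0, -10.0, -12.0, lam=5.0)
print(without_penalty, with_penalty)
```

In this example the policy widens the margin over the rejected answer but also lowers the chosen answer's likelihood, so the penalized loss is higher than the plain DPO loss.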
Architectures
- Transformer (Mistral-derived)
- Depth up-scaling to 50 layers
- Grouped-Query Attention (GQA)
- SwiGLU activation
- RoPE positional embeddings
- RMSNorm pre-normalization
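To see why GQA matters for long contexts, the sketch below estimates KV-cache size. It assumes the attention shape of Mistral 7B (32 query heads, 8 key/value heads, head dimension 128) carried over unchanged to the 50-layer model, with bf16 cache values. These shape values are assumptions, not figures taken from this summary.

```python
# KV-cache size estimate under an assumed Mistral-style attention shape.
def kv_cache_bytes(tokens: int, layers: int = 50, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2 accounts for storing both a key and a value per head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 2 ** 30
gqa_gib = kv_cache_bytes(32_768) / GIB               # 8 KV heads (GQA)
mha_gib = kv_cache_bytes(32_768, kv_heads=32) / GIB  # hypothetical full MHA

print(f"GQA: {gqa_gib:.2f} GiB, full multi-head: {mha_gib:.2f} GiB")
```

Under these assumptions, a single 32,768-token sequence needs about 6.25 GiB of cache with GQA, versus 25 GiB if every query head had its own key/value head.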
Optimization Features
Token Efficiency
- retains the Mistral 32k tokenizer with minor vocabulary changes (32,128 tokens)
- tokenization trade-off chosen for Polish and multilingual balance
Infra Optimization
- designed to run on consumer GPUs up to 24GB VRAM
- training used HPC clusters with isolated evaluation nodes
Model Optimization
- depth up-scaling (duplicate-and-trim layers)
- GQA to reduce the number of key/value attention heads
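The duplicate-and-trim idea can be sketched as a layer-index plan. Mistral 7B has 32 Transformer layers. The sketch assumes a symmetric trim, dropping the same number of layers from the end of the first copy and the start of the second. The summary does not say which layers were actually removed, so the split and the helper name below are illustrative only.

```python
# Layer-index plan for depth up-scaling by duplicating a layer stack and
# trimming the overlap. The symmetric trim is an assumption.
def depth_upscale_indices(base_layers: int = 32, target_layers: int = 50) -> list[int]:
    """Return source-layer indices for each layer of the up-scaled model."""
    overlap = 2 * base_layers - target_layers
    if overlap < 0 or overlap % 2:
        raise ValueError("target depth not reachable with a symmetric trim")
    trim = overlap // 2
    first_copy = list(range(0, base_layers - trim))   # drop the last `trim` layers
    second_copy = list(range(trim, base_layers))      # drop the first `trim` layers
    return first_copy + second_copy


plan = depth_upscale_indices()
print(len(plan), plan[:3], plan[24:27], plan[-3:])
```

With these defaults each copy contributes 25 layers: source layers 0 to 24 followed by source layers 7 to 31.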
System Optimization
- selective gradient checkpointing
- tensor parallelism for long-context training
- FlexAttention to speed packed sequence processing
Training Optimization
- AdamW optimizer (β1=0.9, β2=0.95)
- cosine LR decay with linear warmup
- bfloat16 mixed precision
- gradient clipping norm 1.0
- checkpoint merging / weight averaging
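The learning-rate schedule named above (linear warmup followed by cosine decay) can be written as a small pure function. The peak rate, warmup length, and step counts in the usage lines are placeholders; the summary does not give the values used in training.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at `step` for linear warmup plus cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from peak_lr (progress=0) down to min_lr (progress=1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder settings for illustration only.
for s in (0, 99, 100, 550, 1000):
    print(s, lr_at(s, total_steps=1000, warmup_steps=100, peak_lr=1e-4))
```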
Inference Optimization
- extensive quantization options (details not enumerated)
- sample packing to reduce padding
- FlexAttention masks for packed sequences
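Sample packing and the accompanying attention mask can be illustrated without any deep-learning library. The sketch below packs sequences with a greedy first-fit-decreasing heuristic, then builds per-token document IDs and a predicate that allows attention only within the same document and only to earlier positions. This is the kind of rule a packed-sequence mask encodes. The packing heuristic and all function names are assumptions; the summary does not describe the authors' exact procedure.

```python
# Greedy packing of variable-length samples into fixed-size rows, plus a
# document-aware causal mask predicate. Illustrative sketch only.
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """First-fit-decreasing packing; returns the lengths placed in each row."""
    rows: list[list[int]] = []
    loads: list[int] = []
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len {max_len}")
        for i, load in enumerate(loads):
            if load + n <= max_len:
                rows[i].append(n)
                loads[i] += n
                break
        else:
            rows.append([n])
            loads.append(n)
    return rows


def document_ids(row: list[int]) -> list[int]:
    """Per-token document index for one packed row."""
    return [doc for doc, n in enumerate(row) for _ in range(n)]


def may_attend(ids: list[int], q_idx: int, kv_idx: int) -> bool:
    """Causal attention restricted to tokens of the same document."""
    return ids[q_idx] == ids[kv_idx] and kv_idx <= q_idx


rows = pack_sequences([5, 3, 2, 7, 1], max_len=8)
ids = document_ids(rows[1])
print(rows, ids)
```

In this toy case five samples fit into three rows of length 8, so padding drops from 22 slots (one sample per row) to 6.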
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- May produce factual errors and hallucinations; not safe for unverified high-stakes use (Sec. 6).
- Training corpus includes copyrighted and sensitive documents; mitigations are described but residual risk remains (Sec. 3.1.2).
- Some benchmark wins depend on the Polish-heavy pretraining mix; performance may drop for non-European languages.
- Quantization options are mentioned, but detailed trade-offs (accuracy vs. size) are not published in this report.
When Not To Use
- Do not use as the sole source for high-stakes medical or legal decisions without human verification.
- Avoid relying on it for languages far outside the 32-language mix without thorough testing.
- Not suitable where full reproducibility or open-source code/data is required.
Failure Modes
- Hallucination: plausible but incorrect statements, especially on obscure facts.
- Adversarial brittleness: lower accuracy on CPTUB tricky questions (3.19–3.73 range).
- Potential length inflation in responses (mitigated by Dr. GRPO but may still occur).
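The length-inflation issue can be made concrete with a toy comparison of the two advantage formulations as they are commonly described. GRPO standardizes rewards within a sampled group and averages the loss over each response's tokens. Dr. GRPO only centers the rewards and drops the per-response length division. The sketch is a simplification under those assumptions, with hypothetical function names, and is not the training code used for Bielik.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-standardized advantages (mean-centered, divided by std)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered advantages without std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def per_token_weight(advantage: float, length: int, normalize_by_length: bool) -> float:
    """Weight each token of a response receives in the policy-gradient sum."""
    return advantage / length if normalize_by_length else advantage


rewards = [1.0, 0.0, 0.0, 0.0]  # one correct answer in a group of four
adv = dr_grpo_advantages(rewards)

# With length normalization, a long wrong answer is penalized less per token
# than a short wrong answer, which can reward verbosity.
short_wrong = per_token_weight(adv[1], length=10, normalize_by_length=True)
long_wrong = per_token_weight(adv[1], length=100, normalize_by_length=True)
print(adv, short_wrong, long_wrong)
```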
Core Entities
Models
- Bielik-11B-v3
- Mistral-7B-v0.2
- Bielik-11B-v2
- Bielik-11B-v2.6
- DeepSeek-V3-0324
- Qwen2.5-14B
- Meta-Llama-3.1-70B
- phi-4
- Qwen2.5-72B
Metrics
- Accuracy
- average score
- BLEU
- binary F1
- macro F1
- Levenshtein
- GSM8K score
Datasets
- CulturaX
- HPLT v2.0
- FineWeb
- FineWeb-Edu
- SlimPajama-627B
- Common Crawl
- Parliamentary Discourse Corpus
- Science Library
- Polish Wikipedia (incl. Silesian, Kashubian)
Benchmarks
- Open PL LLM Leaderboard
- Open LLM Leaderboard
- Polish EQ-Bench
- CPTUB
- Polish Medical Leaderboard (PES)
- PLCC
- INCLUDE-base-44
- Belebele
- FLORES

