Overview
Production Readiness: 0.4
Novelty Score: 0.5
Cost Impact Score: 0.6
Citation Count: 1
Why It Matters For Business
Open Galician LLMs let local apps add Galician text generation or fine-tune models without huge compute budgets; expect modest gains for targeted tasks but plan extra cleaning and instruction tuning for production.
Summary TLDR
The authors release two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B). Both were created by continual pretraining of existing 1.3B decoder models on CorpusNÓS (2.1B words). Human evaluation shows mostly minor form and content errors; one model (Carballo-cerebras) produces cleaner punctuation than the other. On few-shot benchmarks translated to Galician, the models beat some multilingual baselines on a science QA set but overall few-shot scores are near random for many tasks, indicating the need for instruction tuning, larger models, or targeted fine-tuning before production use.
Problem Statement
Large generative LLMs are English-dominant and under‑serve minority languages. Galician lacked an open generative LLM. The paper asks whether continual pretraining of existing decoder models on a large Galician corpus can produce usable Galician LLMs.
Main Contribution
Two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B) released.
A 2.1B-word cleaned Galician corpus (CorpusNÓS) assembled and made public.
Practical continual-pretraining recipe: new BPE tokenizer, embedding re-initialization for new tokens, and mixed-precision training with DeepSpeed.
Human qualitative evaluation and translated task-based benchmark suite for Galician (five tasks, few-shot).
Key Findings
Two 1.3B-parameter Galician decoder models were produced via continual pretraining on CorpusNÓS.
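Human evaluation measured form-error rates of 41% for Carballo-bloom continuations and 22% for Carballo-cerebras continuations, against 27% for authentic texts; Carballo-cerebras therefore scored below the authentic-text rate, while Carballo-bloom scored well above it.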
On OpenBookQA (five-shot), Carballo-bloom outperformed some multilingual baselines.
Across several few-shot tasks, model scores sit near the random baselines (50% for 2-choice formats, 25% for 4-choice formats).
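To make "near random" concrete, the following minimal sketch computes the chance accuracy for an n-choice task and a normal-approximation z-score for an observed accuracy against that chance level. The accuracy and item count in the example are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def random_baseline(num_choices: int) -> float:
    """Expected accuracy when guessing uniformly among num_choices options."""
    return 1.0 / num_choices


def z_score_vs_chance(accuracy: float, n_items: int, num_choices: int) -> float:
    """Normal-approximation z-score of an observed accuracy against chance."""
    p0 = random_baseline(num_choices)
    std_err = math.sqrt(p0 * (1.0 - p0) / n_items)
    return (accuracy - p0) / std_err


print(random_baseline(2))  # 0.5
print(random_baseline(4))  # 0.25

# Hypothetical example: 27% accuracy on 500 four-choice items.
# The z-score is about 1.03, below the usual 1.96 threshold for p < 0.05.
print(round(z_score_vs_chance(0.27, 500, 4), 2))
```

A score only a point or two above chance on a few hundred items usually cannot be told apart from guessing, which is one reason to treat small few-shot differences between models with caution.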
Results
- Accuracy
- Form-error rate (human eval)
Who Should Care
What To Try In 7 Days
Download Carballo-bloom and Carballo-cerebras from HuggingFace and run basic generation tests.
Run the provided human-eval example texts to inspect punctuation and register issues.
Evaluate models on a small in‑house Galician QA sample and compare to multilingual baselines using the translated benchmarks provided in the paper.
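The in-house QA check in the last step could be organised roughly as below. This is a sketch under stated assumptions: the item schema (`question`, `choices`, `answer`), the "Pregunta/Resposta" prompt template, and the `score_fn` interface are all hypothetical and not taken from the paper. In practice `score_fn` would wrap a model and return something like the log-likelihood of a choice given the prompt; here a toy scorer stands in so the harness runs without downloading anything.

```python
from typing import Any, Callable, Dict, Sequence

Item = Dict[str, Any]  # hypothetical schema: question (str), choices (list), answer (index)
ScoreFn = Callable[[str, str], float]  # (prompt, candidate answer) -> score


def build_prompt(shots: Sequence[Item], question: str) -> str:
    """Concatenate solved examples, then the open question."""
    blocks = []
    for shot in shots:
        gold = shot["choices"][shot["answer"]]
        blocks.append(f"Pregunta: {shot['question']}\nResposta: {gold}")
    blocks.append(f"Pregunta: {question}\nResposta:")
    return "\n\n".join(blocks)


def predict(score_fn: ScoreFn, shots: Sequence[Item], item: Item) -> int:
    """Return the index of the highest-scoring choice."""
    prompt = build_prompt(shots, item["question"])
    scores = [score_fn(prompt, choice) for choice in item["choices"]]
    return scores.index(max(scores))


def accuracy(score_fn: ScoreFn, shots: Sequence[Item], items: Sequence[Item]) -> float:
    """Fraction of items where the predicted index matches the gold index."""
    correct = sum(predict(score_fn, shots, it) == it["answer"] for it in items)
    return correct / len(items)


# Toy data and a toy scorer (prefers longer answers), for illustration only.
shots = [{"question": "De que cor é o ceo despexado?",
          "choices": ["azul", "verde"], "answer": 0}]
items = [
    {"question": "Cal é a capital de Galicia?",
     "choices": ["Santiago de Compostela", "Vigo"], "answer": 0},
    {"question": "Cantos días ten unha semana?",
     "choices": ["sete", "catorce"], "answer": 0},
]


def toy_scorer(prompt: str, choice: str) -> float:
    return float(len(choice))


print(accuracy(toy_scorer, shots, items))  # 0.5
```

Keeping the scorer pluggable means the same items and prompts can be reused for each Carballo model and each multilingual baseline, so differences in accuracy come from the models rather than from the harness.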
Agent Features
Memory
- continual pretraining (reuse base model weights)
Frameworks
- HuggingFace Transformers
- DeepSpeed
Architectures
- decoder-only (GPT-style)
- 1.3B parameters
- 16 attention heads, 24 layers
Optimization Features
Infra Optimization
- NVIDIA A100 40GB GPUs at CESGA
Model Optimization
- vocabulary adaptation via new BPE tokenizer and embedding re-init
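One common way to carry embeddings over to a new tokenizer is sketched below. The summary only states that embeddings were re-initialized for new tokens; the specific strategy here (copy vectors for tokens shared by both vocabularies, start unseen tokens near the mean of the old embedding matrix) is an assumption for illustration, and the function name and parameters are hypothetical.

```python
import random
from typing import Dict, List


def adapt_embeddings(
    old_vocab: Dict[str, int],
    old_emb: List[List[float]],
    new_vocab: Dict[str, int],
    noise_std: float = 0.02,
    seed: int = 0,
) -> List[List[float]]:
    """Build an embedding table for new_vocab from an existing table.

    Assumes new_vocab ids are contiguous integers 0..len(new_vocab)-1.
    """
    dim = len(old_emb[0])
    # Mean of the old embedding matrix, one value per dimension.
    mean = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]
    rng = random.Random(seed)
    new_emb: List[List[float]] = [[] for _ in new_vocab]
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # Shared token: reuse the pretrained vector.
            new_emb[new_id] = list(old_emb[old_vocab[token]])
        else:
            # New token: mean vector plus small Gaussian noise (assumed strategy).
            new_emb[new_id] = [m + rng.gauss(0.0, noise_std) for m in mean]
    return new_emb


old_vocab = {"a": 0, "b": 1}
old_emb = [[1.0, 0.0], [3.0, 2.0]]
new_vocab = {"b": 0, "ñ": 1}
print(adapt_embeddings(old_vocab, old_emb, new_vocab, noise_std=0.0))
# [[3.0, 2.0], [2.0, 1.0]]
```

Reusing vectors for shared tokens preserves what the base model already learned, while the new rows are left for continual pretraining to adjust.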
System Optimization
- DeepSpeed ZeRO stage 2
- BF16 mixed-precision
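As a rough illustration, the two settings above map onto a DeepSpeed configuration like the one below. Only `zero_optimization.stage` and `bf16.enabled` reflect what the summary reports; the batch-size and accumulation values are placeholder assumptions, since the summary does not give them.

```python
import json

ds_config = {
    # Reported in the summary: ZeRO stage 2 and BF16 mixed precision.
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    # Placeholder values (assumptions, not from the paper).
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
}

serialized = json.dumps(ds_config, indent=2)
print(serialized)
```

Saved to a file such as `ds_config.json` (a hypothetical filename), a configuration like this can be handed to the HuggingFace Trainer through the `deepspeed` argument of `TrainingArguments`.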
Training Optimization
- continual pretraining (warm-start)
- Adam optimizer (β1=0.9, β2=0.999, ε=1e-8, weight decay=0.1)
- linear learning rate decay
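The linear decay schedule can be written out in a few lines. In this sketch the Adam hyperparameters are the ones listed above, while the peak learning rate and the total number of steps are hypothetical, because the summary does not report them.

```python
# Adam hyperparameters as listed in the summary.
ADAM_HPARAMS = {"betas": (0.9, 0.999), "eps": 1e-8, "weight_decay": 0.1}


def linear_decay_lr(step: int, total_steps: int, peak_lr: float,
                    final_lr: float = 0.0) -> float:
    """Linearly interpolate from peak_lr at step 0 to final_lr at total_steps."""
    # Clamp so steps outside [0, total_steps] stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return peak_lr + (final_lr - peak_lr) * progress


# Hypothetical schedule: peak 1e-4 decayed to 0 over 1000 steps.
for step in (0, 500, 1000):
    print(step, linear_decay_lr(step, total_steps=1000, peak_lr=1e-4))
```

For continual pretraining, a decaying schedule of this kind lets the warm-started weights move quickly at first and then settle as training on the new corpus ends.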
Reproducibility
Code Urls
Data Urls
- https://github.com/proxectonos/corpora (CorpusNÓS)
- HuggingFace model pages for Carballo-bloom-1.3B and Carballo-cerebras-1.3B (checkpoints and tokenizer)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Model size is limited (~1.3B parameters), which restricts few-shot reasoning and instruction-following.
- No instruction tuning applied; few-shot benchmark scores are near random for many tasks.
- Human eval used only 60 held-out texts and 6 evaluators, limiting statistical power.
- CorpusNÓS still contains punctuation/boilerplate noise that affects output quality.
When Not To Use
- High-stakes factual QA or decision-making without further fine-tuning.
- Out-of-the-box instruction-following tasks that require reliable few-shot performance.
- Applications demanding near-perfect formatting and punctuation unless post-processed.
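Where light post-processing is acceptable, a small rule-based pass can remove the most mechanical punctuation slips. The rules below are illustrative assumptions, not a pipeline from the paper, and they will not fix register or content problems.

```python
import re


def tidy_punctuation(text: str) -> str:
    """Apply a few conservative punctuation fixes to generated text."""
    # Drop whitespace that precedes a punctuation mark: "Ola ," -> "Ola,".
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Collapse repeated marks, leaving periods alone so "..." survives.
    text = re.sub(r"([,;:!?])\1+", r"\1", text)
    # Add a space after , ; ! ? when a letter follows directly.
    text = re.sub(r"([,;!?])(?=[^\W\d_])", r"\1 ", text)
    # Collapse runs of spaces or tabs.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


print(tidy_punctuation("Ola ,mundo !!  Que tal ?"))  # Ola, mundo! Que tal?
```

The third rule deliberately skips periods and colons so that strings such as domain names and times are not split.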
Failure Modes
- Punctuation and formatting errors inherited from training corpus.
- Abrupt topic shifts and minor content incoherences in continuations.
- Low few-shot accuracy on reasoning and acceptability tasks without instruction tuning.
Core Entities
Models
- Carballo-bloom-1.3B
- Carballo-cerebras-1.3B
- FLOR-1.3B
- Cerebras-GPT-1.3B
- Bloom-1b1
- Bloom-1b7
- mGPT
Metrics
- Accuracy
- human error rates (form/content/register/etc.)
Datasets
- CorpusNÓS
- Belebele (Galician translation)
- OpenBookQA (Galician translation)
- CoLA (Galician translation)
- Parafrases-gl
- PAWS-X (Galician translation)
Benchmarks
- Belebele
- OpenBookQA
- CoLA
- Parafrases-gl
- PAWS-X
Context Entities
Models
- LLaMA 2 (indirectly referenced)
- MarIA family (Spanish context)
Datasets
- CorpusNÓS cleaning pipeline (perplexity-based boilerplate removal)
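The idea behind perplexity-based boilerplate removal can be shown with a toy example: score each line with a language model fitted on trusted text and drop lines the model finds too surprising. The add-one-smoothed unigram model, the whitespace tokenization, and the threshold below are all assumptions made for illustration; the summary does not describe the model or cutoff used for CorpusNÓS.

```python
import math
from collections import Counter
from typing import List, Tuple


def train_unigram(reference_lines: List[str]) -> Tuple[Counter, int]:
    """Count lowercase whitespace tokens in trusted reference text."""
    counts = Counter(tok for line in reference_lines for tok in line.lower().split())
    return counts, sum(counts.values())


def perplexity(line: str, counts: Counter, total: int) -> float:
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = line.lower().split()
    if not tokens:
        return float("inf")
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def filter_lines(lines: List[str], counts: Counter, total: int,
                 max_ppl: float) -> List[str]:
    """Keep only lines whose perplexity does not exceed max_ppl."""
    return [ln for ln in lines if perplexity(ln, counts, total) <= max_ppl]


counts, total = train_unigram(["o neno le un libro", "a nena le un conto"])
candidates = ["a nena le un libro", "click here cookies login"]
print(filter_lines(candidates, counts, total, max_ppl=12.0))
# ['a nena le un libro']
```

A real pipeline would use a much stronger language model and a tuned threshold, but the filtering logic keeps the same shape.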

