Overview
The models are useful open starting points, but they are small (1.3B parameters) and not instruction-tuned; human and automatic evaluations show modest task gains and clear weaknesses in form and few-shot reasoning.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 5/5
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 60%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
Open Galician LLMs let local apps add Galician text generation or fine-tune models without huge compute budgets; expect modest gains for targeted tasks but plan extra cleaning and instruction tuning for production.
Summary TLDR
The authors release two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B). Both were created by continual pretraining of existing 1.3B decoder models on CorpusNÓS (2.1B words). Human evaluation shows mostly minor form and content errors; one model (Carballo-cerebras) produces cleaner punctuation than the other. On few-shot benchmarks translated to Galician, the models beat some multilingual baselines on a science QA set but overall few-shot scores are near random for many tasks, indicating the need for instruction tuning, larger models, or targeted fine-tuning before production use.
Problem Statement
Generative LLMs are English-dominant and underserve minority languages. Galician lacked an open generative LLM. The paper asks whether continual pretraining of existing decoder models on a large Galician corpus can produce usable Galician LLMs.
Main Contribution
Two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B) released.
A 2.1B-word cleaned Galician corpus (CorpusNÓS) assembled and made public.
Key Findings
Two 1.3B-parameter Galician decoder models were produced via continual pretraining on CorpusNÓS.
Human evaluation found frequent form errors in generated continuations: Carballo-bloom had a 41% form-error rate and Carballo-cerebras 22%, against 27% for authentic texts.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 0.364±0.022 (Carballo-bloom-1.3B) | 0.342±0.021 (FLOR-1.3B) | +0.022 | OpenBookQA (translated) | Table 2 OpenBookQA row | Table 2 |
| Accuracy | 0.271±0.015 (Carballo-cerebras-1.3B) | 0.234±0.014 (Bloom-1b1) | +0.037 | Belebele (translated) | Table 2 Belebele column | Table 2 |
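To put the deltas above in context, here is a minimal sketch that compares each gain with the reported uncertainty. It rests on two assumptions that the summary does not confirm: that the ± values are independent standard errors, and that both benchmarks are four-way multiple choice (chance accuracy 0.25). The function name `compare` is illustrative only.

```python
import math

# Rows copied from the results table:
# (benchmark, model score, model ±, baseline score, baseline ±)
ROWS = [
    ("OpenBookQA", 0.364, 0.022, 0.342, 0.021),
    ("Belebele", 0.271, 0.015, 0.234, 0.014),
]

# Assumption: both benchmarks are four-way multiple choice.
CHANCE = 0.25


def compare(score, err, base, base_err, chance=CHANCE):
    """Express the gain over the baseline and over chance in standard-error units."""
    delta = score - base
    # Assumption: the ± values are independent standard errors,
    # so the error of the difference adds in quadrature.
    combined = math.sqrt(err ** 2 + base_err ** 2)
    return {
        "delta": round(delta, 3),
        "delta_in_se": round(delta / combined, 2),
        "above_chance_in_se": round((score - chance) / err, 2),
    }


for name, *values in ROWS:
    print(name, compare(*values))
```

Under these assumptions the OpenBookQA gain is below one combined standard error and the Belebele gain is below two, which fits the summary's description of the gains as modest.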
What To Try In 7 Days
Download Carballo-bloom and Carballo-cerebras from HuggingFace and run basic generation tests.
Run the provided human-eval example texts to inspect punctuation and register issues.
Evaluate models on a small in‑house Galician QA sample and compare to multilingual baselines using the translated benchmarks provided in the paper.
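The steps above could be started with something like the following sketch. It assumes the Hugging Face `transformers` library is installed and that the hub IDs in `MODEL_IDS` are correct; those IDs are an assumption and should be checked on the hub before running. The helper names `generate_continuation` and `accuracy_with_se` are hypothetical, and the standard error uses a simple binomial approximation, which is rough for very small samples.

```python
import math

# Assumption: hub IDs for the two released models; verify before use.
MODEL_IDS = [
    "proxectonos/Carballo-bloom-1.3B",
    "proxectonos/Carballo-cerebras-1.3B",
]


def generate_continuation(model_id, prompt, max_new_tokens=60):
    """Return the prompt plus a greedy continuation (downloads weights on first call)."""
    # Imported lazily so the scoring helper below works without transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=False)
    return out[0]["generated_text"]


def accuracy_with_se(predictions, gold):
    """Accuracy and binomial standard error for a small labelled QA sample."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    n = len(gold)
    acc = sum(p == g for p, g in zip(predictions, gold)) / n
    se = math.sqrt(acc * (1 - acc) / n)
    return acc, se


# Example usage (downloads the model weights):
# for model_id in MODEL_IDS:
#     print(model_id, "->", generate_continuation(model_id, "A lingua galega é"))
```

Reporting the standard error alongside accuracy makes it easier to judge whether a difference from a multilingual baseline on a small in-house sample is larger than sampling noise.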
Risks & Boundaries
Limitations
Model size is limited (~1.3B parameters), which restricts few-shot reasoning and instruction following.
No instruction tuning applied; few-shot benchmark scores are near random for many tasks.
When Not To Use
High-stakes factual QA or decision-making without further fine-tuning.
Out-of-the-box instruction-following tasks that require reliable few-shot performance.
Failure Modes
Punctuation and formatting errors inherited from the training corpus.
Abrupt topic shifts and minor content incoherences in continuations.
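A rough automatic pre-screen for the punctuation failure mode could look like the sketch below. These heuristics are hypothetical and much cruder than a human evaluation; they are not the criteria used in the paper, and the check names are invented for illustration.

```python
import re

# Hypothetical heuristics; not the paper's human-evaluation criteria.
CHECKS = {
    # Whitespace directly before a punctuation mark, e.g. "Ola , mundo"
    "space_before_punct": re.compile(r"[ \t]+[,.;:!?]"),
    # Punctuation immediately followed by a letter, e.g. "Ola,mundo"
    "missing_space_after_punct": re.compile(r"[,;:!?](?=[^\W\d_])"),
    # The same comma, semicolon, or colon repeated, e.g. ",,"
    "repeated_punct": re.compile(r"([,;:])\1+"),
}

# Paired delimiters whose counts should match.
PAIRS = [("(", ")"), ("«", "»")]


def punctuation_flags(text):
    """Return the names of the heuristic checks that `text` trips."""
    flags = [name for name, pattern in CHECKS.items() if pattern.search(text)]
    for open_ch, close_ch in PAIRS:
        if text.count(open_ch) != text.count(close_ch):
            flags.append(f"unbalanced_{open_ch}{close_ch}")
    # An odd number of straight double quotes suggests an unclosed quotation.
    if text.count('"') % 2:
        flags.append("unbalanced_quotes")
    return flags


print(punctuation_flags("Isto (é un exemplo ,sen pechar."))
```

Flags like these only help prioritise which continuations a human reviewer reads first; they cannot detect topic shifts or content incoherence.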

