Open-source Galician LLMs (1.3B) trained by continual pretraining on a 2.1B-word Galician corpus

June 19, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Models are useful as open starting points but are small (1.3B) and not instruction-tuned; human and automatic evaluations show modest task gains and clear weaknesses in form and few-shot reasoning.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Open Galician LLMs let local apps add Galician text generation or fine-tune models without huge compute budgets; expect modest gains for targeted tasks but plan extra cleaning and instruction tuning for production.

Summary TLDR

The authors release two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B). Both were created by continual pretraining of existing 1.3B decoder models on CorpusNÓS (2.1B words). Human evaluation shows mostly minor form and content errors; one model (Carballo-cerebras) produces cleaner punctuation than the other. On few-shot benchmarks translated to Galician, the models beat some multilingual baselines on a science QA set but overall few-shot scores are near random for many tasks, indicating the need for instruction tuning, larger models, or targeted fine-tuning before production use.

Problem Statement

Large generative LLMs are English-dominant and under‑serve minority languages. Galician lacked an open generative LLM. The paper asks whether continual pretraining of existing decoder models on a large Galician corpus can produce usable Galician LLMs.

Main Contribution

Two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B) released.

A 2.1B-word cleaned Galician corpus (CorpusNÓS) assembled and made public.

Key Findings

Two 1.3B-parameter Galician decoder models were produced via continual pretraining on CorpusNÓS.

Numbers: 1.3B params; corpus = 2.13B tokens (2.1B words)

Practical Use: Teams can download and fine-tune ready Galician LLMs instead of training from scratch, saving huge compute and data collection costs.

Evidence Ref: Abstract; Section 3.2; Table 1

Human evaluation found form errors in a substantial share of generated continuations: Carballo-bloom had a 41% form-error rate, compared with 22% for Carballo-cerebras and 27% for authentic texts.

Numbers: Carballo-bloom 41% vs Carballo-cerebras 22% vs authentic 27%

Practical Use: Expect punctuation/formatting issues; improve corpus cleaning or post-process outputs before user-facing deployment.

Evidence Ref: Section 5.1.2 (Figure 1 and text)
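The post-processing suggestion above can be sketched as a small cleanup pass. This is a minimal illustration, not something described in the paper: the function name `tidy_punctuation` and the specific rules are assumptions, chosen to be conservative (periods are left alone so decimals and URLs are not altered).

```python
import re


def tidy_punctuation(text: str) -> str:
    """Apply a few conservative whitespace/punctuation fixes to generated text.

    Hypothetical helper: the rules below are illustrative, not the authors'.
    """
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that appear before closing punctuation.
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    # Collapse repeated commas or semicolons (",," becomes ",").
    text = re.sub(r"([,;])\1+", r"\1", text)
    # Add a space after , ; : ! ? when a letter follows directly.
    text = re.sub(r"([,;:!?])(?=[^\W\d_])", r"\1 ", text)
    return text.strip()


print(tidy_punctuation("Ola ,, mundo  !Que tal"))
```

A pass like this only hides surface noise; it does not address the register or coherence errors that the human evaluation also tracked.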

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.364±0.022 (Carballo-bloom-1.3B) | 0.342±0.021 (FLOR-1.3B) | +0.022 | OpenBookQA (translated) | Table 2 OpenBookQA row | Table 2
Accuracy | 0.271±0.015 (Carballo-cerebras-1.3B) | 0.234±0.014 (Bloom-1b1) | +0.037 | Belebele (translated) | Table 2 Belebele column | Table 2
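To read these deltas against their uncertainty, one rough check is to divide each delta by the combined error of the two scores. The sketch below assumes the ± values are standard errors and that the two runs are independent; neither assumption is confirmed by this summary, and the helper name `delta_z_score` is illustrative.

```python
import math


def delta_z_score(acc_a, se_a, acc_b, se_b):
    """Return (delta, z) for two accuracies with standard errors.

    Assumes the +/- values are standard errors of independent estimates.
    """
    delta = acc_a - acc_b
    combined_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return delta, delta / combined_se


# (model accuracy, model SE, baseline accuracy, baseline SE) from the rows above.
rows = {
    "OpenBookQA": (0.364, 0.022, 0.342, 0.021),
    "Belebele": (0.271, 0.015, 0.234, 0.014),
}

for name, values in rows.items():
    delta, z = delta_z_score(*values)
    print(f"{name}: delta={delta:+.3f}, z={z:.2f}")
```

Under those assumptions both ratios fall below the usual 1.96 threshold for a 95% two-sided test (about 0.72 and 1.80), which is consistent with describing the gains as modest.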

What To Try In 7 Days

Download Carballo-bloom and Carballo-cerebras from HuggingFace and run basic generation tests.

Run the provided human-eval example texts to inspect punctuation and register issues.

Evaluate models on a small in‑house Galician QA sample and compare to multilingual baselines using the translated benchmarks provided in the paper.
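The first step above can be sketched with the HuggingFace Transformers causal-LM API. The Hub repository IDs and the prompts are assumptions not given in this summary; confirm the IDs on the model cards before running. The `transformers` import is placed inside the function so the file can be loaded without the library installed.

```python
# Assumed Hub repository IDs; verify against the published model cards.
MODEL_IDS = [
    "proxectonos/Carballo-bloom-1.3B",
    "proxectonos/Carballo-cerebras-1.3B",
]

# Illustrative Galician prompts for continuation (not from the paper).
PROMPTS = [
    "A lingua galega é",
    "Santiago de Compostela é unha cidade",
]


def generate_samples(model_id, prompts, max_new_tokens=40):
    """Greedy-decode a continuation for each prompt with one model."""
    # Lazy import: requires `transformers` and `torch` at call time only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    outputs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        generated = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
        outputs.append(tokenizer.decode(generated[0], skip_special_tokens=True))
    return outputs
```

Calling `generate_samples(MODEL_IDS[0], PROMPTS)` downloads the weights on first use. Since the models are not instruction-tuned, prompts should be text to continue rather than questions or commands.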

Agent Features

Memory
continual pretraining (reuse base model weights)
Frameworks
HuggingFace Transformers, DeepSpeed
Architectures
decoder-only (GPT-style); 1.3B parameters; 16 attention heads, 24 layers

Optimization Features

Infra Optimization
NVIDIA A100 40GB GPUs at CESGA
Model Optimization
vocabulary adaptation via new BPE tokenizer and embedding re-init
System Optimization
DeepSpeed ZeRO stage 2; BF16 mixed-precision
Training Optimization
continual pretraining (warm-start); Adam optimizer (β1=0.9, β2=0.999, ε=1e-8, weight decay=0.1); linear learning rate decay
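The settings listed above map roughly onto a DeepSpeed configuration like the one below. This is a sketch, not the authors' actual config: the learning rate, step counts, warmup, and batch size are not stated in this summary and are placeholder values marked as hypothetical.

```python
import json

# Hypothetical values: not stated in the summary, pick your own.
PEAK_LR = 5e-5
TOTAL_STEPS = 10_000
WARMUP_STEPS = 100
MICRO_BATCH_PER_GPU = 4

ds_config = {
    # BF16 mixed precision, as listed under System Optimization.
    "bf16": {"enabled": True},
    # ZeRO stage 2 shards optimizer states and gradients across GPUs.
    "zero_optimization": {"stage": 2},
    # Adam hyperparameters as listed under Training Optimization.
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": PEAK_LR,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.1,
        },
    },
    # Linear decay after a short (hypothetical) warmup.
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": PEAK_LR,
            "warmup_num_steps": WARMUP_STEPS,
            "total_num_steps": TOTAL_STEPS,
        },
    },
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
}

print(json.dumps(ds_config, indent=2))
```

The printed JSON can be saved to a file and passed to a DeepSpeed-enabled training script; the exact launch command depends on the training code used.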

Risks & Boundaries

Limitations

Model size is limited (~1.3B), which restricts few-shot reasoning and instruction-following.

No instruction tuning applied; few-shot benchmark scores are near random for many tasks.

When Not To Use

High-stakes factual QA or decision-making without further fine-tuning.

Out-of-the-box instruction-following tasks that require reliable few-shot performance.

Failure Modes

Punctuation and formatting errors inherited from training corpus.

Abrupt topic shifts and minor content incoherences in continuations.

Core Entities

Models

Carballo-bloom-1.3B, Carballo-cerebras-1.3B, FLOR-1.3B, Cerebras-GPT-1.3B, Bloom-1b1, Bloom-1b7, mGPT, Bloom-1.7B

Metrics

Accuracy, human error rates (form/content/register/etc.)

Datasets

CorpusNÓS, Belebele (Galician translation), OpenBookQA (Galician translation), CoLA (Galician translation), Parafrases-gl, PAWS-X (Galician translation)

Benchmarks

Belebele, OpenBookQA, CoLA, Parafrases-gl, PAWS-X

Context Entities

Models

LLaMA 2 (indirectly referenced), MarIA family (Spanish context)

Datasets

CorpusNÓS cleaning pipeline (perplexity-based boilerplate removal)
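The general idea behind perplexity-based boilerplate removal can be illustrated with a toy example: score each line with a language model trained on clean text and drop lines whose perplexity is too high. The sketch below uses a smoothed unigram model as a stand-in; the actual language model, tokenization, and threshold used for CorpusNÓS are not described in this summary, and all names here are hypothetical.

```python
import math
from collections import Counter


def train_unigram(reference_lines):
    """Build an add-one-smoothed unigram probability function (toy stand-in)."""
    counts = Counter(
        tok for line in reference_lines for tok in line.lower().split()
    )
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def perplexity(line, prob):
    """Per-token perplexity of a line under the unigram model."""
    tokens = line.lower().split()
    if not tokens:
        return float("inf")
    log_sum = sum(math.log(prob(tok)) for tok in tokens)
    return math.exp(-log_sum / len(tokens))


def filter_lines(lines, prob, max_ppl):
    """Keep only lines whose perplexity does not exceed the threshold."""
    return [line for line in lines if perplexity(line, prob) <= max_ppl]


reference = ["o galego é unha lingua", "a lingua galega é fermosa"]
prob = train_unigram(reference)
candidates = ["a lingua é", "cookies login menu"]
print(filter_lines(candidates, prob, max_ppl=10.0))
```

In this toy run the menu-like line consists only of unseen tokens, so it scores a higher perplexity than the in-domain line and is dropped. A real pipeline would use a much stronger language model and a threshold tuned on held-out data.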