Open-source Galician LLMs (1.3B) trained by continual pretraining on a 2.1B-word Galician corpus

Overview

Decision Snapshot: Needs Validation

Models are useful as open starting points but are small (1.3B) and not instruction-tuned; human and automatic evaluations show modest task gains and clear weaknesses in form and few-shot reasoning.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 5/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 50%

Authors

Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Open Galician LLMs let local apps add Galician text generation or fine-tune models without huge compute budgets; expect modest gains for targeted tasks but plan extra cleaning and instruction tuning for production.

Who Should Care

ML Engineer, Data Scientist, CTO, Product Manager

Summary TLDR

The authors release two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B). Both were created by continual pretraining of existing 1.3B decoder models on CorpusNÓS (2.1B words). Human evaluation shows mostly minor form and content errors; one model (Carballo-cerebras) produces cleaner punctuation than the other. On few-shot benchmarks translated to Galician, the models beat some multilingual baselines on a science QA set but overall few-shot scores are near random for many tasks, indicating the need for instruction tuning, larger models, or targeted fine-tuning before production use.

Problem Statement

Large generative LLMs are English-dominant and under‑serve minority languages. Galician lacked an open generative LLM. The paper asks whether continual pretraining of existing decoder models on a large Galician corpus can produce usable Galician LLMs.

Main Contribution

Two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B) released.

A 2.1B-word cleaned Galician corpus (CorpusNÓS) assembled and made public.

Key Findings

Two 1.3B-parameter Galician decoder models were produced via continual pretraining on CorpusNÓS.

Numbers: 1.3B params; corpus = 2.13B tokens (2.1B words)

Practical Use: Teams can download and fine-tune ready-made Galician LLMs instead of training from scratch, which saves substantial compute and data-collection cost.

Evidence Ref: Abstract; Section 3.2; Table 1

Human evaluation found frequent form errors in generated continuations: Carballo-bloom had a 41% form-error rate and Carballo-cerebras 22%, compared with 27% for authentic texts.

Numbers: Carballo-bloom 41% vs Carballo-cerebras 22% vs authentic 27%

Practical Use: Expect punctuation and formatting issues; improve corpus cleaning or post-process outputs before user-facing deployment.

Evidence Ref: Section 5.1.2 (Figure 1 and text)
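The post-processing suggested in the finding above could start from a small cleanup pass. This is a minimal sketch, assuming spacing and repeated punctuation are the main targets; the function name `tidy_punctuation` and its rules are illustrative and are not taken from the paper.

```python
import re

def tidy_punctuation(text: str) -> str:
    """Apply a few conservative spacing fixes to generated text (illustrative rules only)."""
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces that precede closing punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Collapse repeated commas, semicolons, and colons.
    text = re.sub(r"([,;:])\1+", r"\1", text)
    # Insert a space after punctuation that is glued to a following letter.
    # The period is left out on purpose so URLs and abbreviations are untouched.
    text = re.sub(r"([,;:!?])(?=[^\W\d_])", r"\1 ", text)
    return text

print(tidy_punctuation("Ola ,,mundo  !Que tal"))  # prints "Ola, mundo! Que tal"
```

Rules like these only address surface form; content and register errors would still need human review.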

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	0.364±0.022 (Carballo-bloom-1.3B)	0.342±0.021 (FLOR-1.3B)	+0.022	OpenBookQA (translated)	Table 2 OpenBookQA row	Table 2
Accuracy	0.271±0.015 (Carballo-cerebras-1.3B)	0.234±0.014 (Bloom-1b1)	+0.037	Belebele (translated)	Table 2 Belebele column	Table 2
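One rough way to judge how much weight these deltas can bear is to compare each one with the combined uncertainty of the two scores. The sketch below assumes the ± values in the table are standard errors and that the two measurements are independent. Neither assumption is stated in this summary, so the ratio is indicative only.

```python
import math

def delta_over_uncertainty(score, score_se, base, base_se):
    """Return (delta, combined_se, ratio), assuming independent standard errors."""
    delta = score - base
    combined_se = math.sqrt(score_se**2 + base_se**2)
    return delta, combined_se, delta / combined_se

# Values copied from the Results table: (score, score ±, baseline, baseline ±).
rows = {
    "OpenBookQA (Carballo-bloom vs FLOR-1.3B)": (0.364, 0.022, 0.342, 0.021),
    "Belebele (Carballo-cerebras vs Bloom-1b1)": (0.271, 0.015, 0.234, 0.014),
}

for name, values in rows.items():
    delta, se, ratio = delta_over_uncertainty(*values)
    print(f"{name}: delta={delta:+.3f}, combined SE={se:.3f}, ratio={ratio:.2f}")
```

Under these assumptions the ratios come out at about 0.72 and 1.80, both below 2, which is consistent with reading the gains as modest.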

What To Try In 7 Days

Download Carballo-bloom and Carballo-cerebras from HuggingFace and run basic generation tests.

Run the provided human-eval example texts to inspect punctuation and register issues.

Evaluate models on a small in‑house Galician QA sample and compare to multilingual baselines using the translated benchmarks provided in the paper.
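The first step above can be sketched with the HuggingFace Transformers Auto classes. The model IDs are taken from the URLs listed under Code URLs; the sampling settings and the prompt are illustrative assumptions, and the sketch assumes both checkpoints load through the standard `AutoModelForCausalLM` path.

```python
MODEL_IDS = [
    "proxectonos/Carballo-bloom-1.3B",
    "proxectonos/Carballo-cerebras-1.3B",
]

def generation_kwargs(max_new_tokens: int = 60) -> dict:
    """Sampling settings for a quick smoke test (illustrative values, not from the paper)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "top_p": 0.9,
        "temperature": 0.8,
    }

def generate(model_id: str, prompt: str) -> str:
    """Download a checkpoint from the Hub and return a sampled continuation of `prompt`."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **generation_kwargs())
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call (downloads the checkpoint on first use):
# print(generate(MODEL_IDS[0], "A lingua galega é"))
```

Because the models are not instruction-tuned, prompts should be phrased as text to continue rather than as commands.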

Agent Features

Memory

continual pretraining (reuse base model weights)

Frameworks

HuggingFace Transformers, DeepSpeed

Architectures

decoder-only (GPT-style); 1.3B parameters; 16 attention heads, 24 layers

Optimization Features

Infra Optimization

NVIDIA A100 40GB GPUs at CESGA

Model Optimization

vocabulary adaptation via new BPE tokenizer and embedding re-init
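This summary does not say how the embeddings were re-initialized. One common strategy, shown here purely as an assumption and not as the authors' method, copies vectors for tokens shared by the old and new vocabularies and randomly initializes the rest. The function name, the toy vocabularies, and the 0.02 standard deviation are all hypothetical.

```python
import random

def adapt_embeddings(old_vocab, old_embeddings, new_vocab, dim, seed=0):
    """Build an embedding table for new_vocab, reusing vectors for overlapping tokens.

    old_vocab maps token -> row index in old_embeddings; new_vocab is an ordered token list.
    """
    rng = random.Random(seed)
    new_embeddings = []
    reused = 0
    for token in new_vocab:
        if token in old_vocab:
            # Token exists in the base model: keep its learned vector.
            new_embeddings.append(list(old_embeddings[old_vocab[token]]))
            reused += 1
        else:
            # New token: small random initialization (hypothetical choice).
            new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return new_embeddings, reused

old_vocab = {"the": 0, "de": 1, "a": 2}
old_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_vocab = ["de", "a", "galego", "lingua"]

table, reused = adapt_embeddings(old_vocab, old_embeddings, new_vocab, dim=2)
print(f"{reused} of {len(new_vocab)} tokens reuse a base-model vector")
```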

System Optimization

DeepSpeed ZeRO stage 2; BF16 mixed-precision

Training Optimization

continual pretraining (warm-start); Adam optimizer (β1=0.9, β2=0.999, ε=1e-8, weight decay=0.1); linear learning rate decay
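The system and training settings listed above could be expressed as a DeepSpeed configuration roughly as follows. This is a sketch, not the authors' actual config: the learning rate is a hypothetical placeholder because the summary does not report it, and the linear decay schedule is omitted because its step counts are not given.

```python
import json

# Hypothetical placeholder: the learning rate is not reported in this summary.
PLACEHOLDER_LR = 1e-4

ds_config = {
    # BF16 mixed precision, as listed under System Optimization.
    "bf16": {"enabled": True},
    # ZeRO stage 2, as listed under System Optimization.
    "zero_optimization": {"stage": 2},
    # Adam hyperparameters as listed under Training Optimization.
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": PLACEHOLDER_LR,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.1,
        },
    },
}

print(json.dumps(ds_config, indent=2))
```

A dictionary like this can be written to a JSON file and passed to a DeepSpeed-enabled training script once the missing values are filled in from the paper or the released code.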

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/proxectonos/corpora (CorpusNÓS + cleaning pipeline)

https://huggingface.co/proxectonos/Carballo-bloom-1.3B

https://huggingface.co/proxectonos/Carballo-cerebras-1.3B

Data URLs

https://github.com/proxectonos/corpora (CorpusNÓS)

HuggingFace model pages above for checkpoints and tokenizer

Risks & Boundaries

Limitations

Model size is limited (~1.3B parameters), which restricts few-shot reasoning and instruction following.

No instruction tuning applied; few-shot benchmark scores are near random for many tasks.

When Not To Use

High-stakes factual QA or decision-making without further fine-tuning.

Out-of-the-box instruction-following tasks that require reliable few-shot performance.

Failure Modes

Punctuation and formatting errors inherited from training corpus.

Abrupt topic shifts and minor content incoherences in continuations.

Core Entities

Models

Carballo-bloom-1.3B, Carballo-cerebras-1.3B, FLOR-1.3B, Cerebras-GPT-1.3B, Bloom-1b1, Bloom-1b7, mGPT, Bloom-1.7B

Metrics

Accuracy, human error rates (form/content/register/etc.)

Datasets

CorpusNÓS, Belebele (Galician translation), OpenBookQA (Galician translation), CoLA (Galician translation), Parafrases-gl, PAWS-X (Galician translation)

Benchmarks

Belebele, OpenBookQA, CoLA, Parafrases-gl, PAWS-X

Context Entities

Models

LLaMA 2 (indirectly referenced), MarIA family (Spanish context)

Datasets

CorpusNÓS cleaning pipeline (perplexity-based boilerplate removal)
