Open-source Galician LLMs (1.3B) trained by continual pretraining on a 2.1B-word Galician corpus

June 19, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Pablo Gamallo, Pablo Rodríguez, Iria de-Dios-Flores, Susana Sotelo, Silvia Paniagua, Daniel Bardanca, José Ramom Pichel, Marcos Garcia

Links

Abstract / PDF

Why It Matters For Business

Open Galician LLMs let local apps add Galician text generation or fine-tune models without huge compute budgets. Expect modest gains on targeted tasks, and plan for extra data cleaning and instruction tuning before production.

Summary TLDR

The authors release two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B). Both were created by continual pretraining of existing 1.3B decoder models on CorpusNÓS (2.1B words). Human evaluation shows mostly minor form and content errors; one model (Carballo-cerebras) produces cleaner punctuation than the other. On few-shot benchmarks translated to Galician, the models beat some multilingual baselines on a science QA set but overall few-shot scores are near random for many tasks, indicating the need for instruction tuning, larger models, or targeted fine-tuning before production use.

Problem Statement

Large generative LLMs are English-dominant and under‑serve minority languages. Galician lacked an open generative LLM. The paper asks whether continual pretraining of existing decoder models on a large Galician corpus can produce usable Galician LLMs.

Main Contribution

Two open-source Galician generative LLMs (Carballo-bloom-1.3B and Carballo-cerebras-1.3B) released.

A 2.1B-word cleaned Galician corpus (CorpusNÓS) assembled and made public.

Practical continual-pretraining recipe: new BPE tokenizer, embedding re-initialization for new tokens, and mixed-precision training with DeepSpeed.

Human qualitative evaluation and translated task-based benchmark suite for Galician (five tasks, few-shot).

Key Findings

Two 1.3B-parameter Galician decoder models were produced via continual pretraining on CorpusNÓS.

Numbers: 1.3B params; corpus = 2.13B tokens (2.1B words)

Human evaluation measured form errors in generated continuations: Carballo-bloom had a 41% form-error rate and Carballo-cerebras 22%, against 27% for authentic texts.

Numbers: Carballo-bloom 41% vs Carballo-cerebras 22% vs authentic 27%

On OpenBookQA (five-shot), Carballo-bloom outperformed multilingual baselines.

Numbers: Carballo-bloom 0.364±0.022 vs best baseline FLOR-1.3B 0.342±0.021 (Δ≈+0.022)
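To put the OpenBookQA gap in context, the difference can be compared with the combined uncertainty of the two scores. This is a minimal sketch that assumes the reported ± values are independent standard errors (the summary does not say what they represent); the function name is illustrative.

```python
import math


def difference_in_standard_errors(score_a, err_a, score_b, err_b):
    """Return (delta, z): the score gap and the gap divided by the combined error.

    Assumption: err_a and err_b are independent standard errors.
    """
    delta = score_a - score_b
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return delta, delta / combined


# OpenBookQA, five-shot: Carballo-bloom vs the best baseline, values from the finding above.
delta, z = difference_in_standard_errors(0.364, 0.022, 0.342, 0.021)
print(f"delta={delta:.3f}, z={z:.2f}")
```

Under that assumption the gap is smaller than one combined standard error, so the ranking on this benchmark is best read as suggestive rather than settled.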

Across several few-shot tasks, model scores are near random baselines for 2-choice and 4-choice formats.

Numbers: CoLA ~0.499–0.507 (random 0.5 for 2-choice); Belebele ~0.231–0.271 (4-choice random=0.25)
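The "near random" reading can be checked with simple arithmetic: chance accuracy on a k-choice task is 1/k, and a score can be expressed as a number of reported errors above that level. The sketch below assumes the ± values are standard errors and pairs the CoLA and Belebele scores with the ± values listed under Results; treating OpenBookQA as a 4-choice task is also an assumption of this example.

```python
def chance_accuracy(num_choices):
    """Expected accuracy of uniform random guessing on a k-choice task."""
    return 1.0 / num_choices


def errors_above_chance(accuracy, err, num_choices):
    """How many reported errors the accuracy sits above the chance level."""
    return (accuracy - chance_accuracy(num_choices)) / err


# (label, accuracy, reported ±, number of answer choices)
results = [
    ("CoLA, Carballo-cerebras", 0.502, 0.012, 2),
    ("Belebele, Carballo-cerebras", 0.271, 0.015, 4),
    ("OpenBookQA, Carballo-bloom", 0.364, 0.022, 4),
]

margins = {}
for label, acc, err, k in results:
    margins[label] = errors_above_chance(acc, err, k)
    print(f"{label}: chance={chance_accuracy(k):.2f}, margin={margins[label]:.1f} errors")
```

On these numbers only the OpenBookQA score clears chance by a wide margin; the CoLA and Belebele scores stay within about one and a half reported errors of random guessing.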

Results

Accuracy (OpenBookQA, five-shot)

Value: 0.364±0.022 (Carballo-bloom-1.3B)

Baseline: 0.342±0.021 (FLOR-1.3B)

Accuracy (Belebele)

Value: 0.271±0.015 (Carballo-cerebras-1.3B)

Baseline: 0.234±0.014 (Bloom-1b1)

Accuracy (CoLA)

Value: 0.502±0.012 (Carballo-cerebras-1.3B)

Baseline: 0.507±0.012 (Bloom-1b1)

Form-error rate (human eval)

Value: 41% (Carballo-bloom continuations)

Baseline: 27% (authentic continuations)

Form-error rate (human eval)

Value: 22% (Carballo-cerebras continuations)

Baseline: 27% (authentic continuations)

Who Should Care

What To Try In 7 Days

Download Carballo-bloom and Carballo-cerebras from HuggingFace and run basic generation tests.

Run the provided human-eval example texts to inspect punctuation and register issues.

Evaluate models on a small in‑house Galician QA sample and compare to multilingual baselines using the translated benchmarks provided in the paper.
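A basic generation test along these lines could look like the sketch below. It uses the standard Hugging Face Transformers calls (`AutoTokenizer`, `AutoModelForCausalLM`, `generate`); the repository IDs are assumptions, since this summary names the models but not their Hub paths, and the helper names and prompts are illustrative.

```python
from typing import Callable, Dict, List

# Assumed Hugging Face repository IDs; check the model cards before running.
MODEL_IDS = [
    "proxectonos/Carballo-bloom-1.3B",
    "proxectonos/Carballo-cerebras-1.3B",
]

# Short Galician prompts for a first look at fluency and punctuation.
PROMPTS = [
    "A cidade de Santiago de Compostela é",
    "O galego é unha lingua",
]


def hf_generator(model_id: str, max_new_tokens: int = 60) -> Callable[[str], str]:
    """Load a causal LM and return a function mapping a prompt to its continuation."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    def generate(prompt: str) -> str:
        inputs = tokenizer(prompt, return_tensors="pt")
        # Greedy decoding keeps the test deterministic.
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        # Drop the prompt tokens so only the continuation is returned.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)

    return generate


def run_smoke_test(prompts: List[str], generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Collect prompt/continuation pairs for manual inspection."""
    return [{"prompt": p, "continuation": generate(p)} for p in prompts]


# Usage (downloads the weights on first call):
# for model_id in MODEL_IDS:
#     for row in run_smoke_test(PROMPTS, hf_generator(model_id)):
#         print(model_id, row)
```

Separating `run_smoke_test` from the model loader makes it easy to swap in a multilingual baseline, or a stub during development, without changing the comparison loop.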

Agent Features

Memory

  • continual pretraining (reuse base model weights)

Frameworks

  • HuggingFace Transformers
  • DeepSpeed

Architectures

  • decoder-only (GPT-style)
  • 1.3B parameters
  • 16 attention heads, 24 layers
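As a sanity check on the 1.3B figure, a rough parameter count for a decoder-only transformer can be derived from the layer count. The sketch below uses the common approximation of 12·d² weights per block (4·d² for attention, 8·d² for an MLP with 4x expansion) plus the token embedding matrix. Only the 24 layers come from this summary; the hidden size of 2048 and the vocabulary of 50,257 tokens are assumptions made for illustration.

```python
def approx_decoder_params(n_layers, d_model, vocab_size):
    """Rough weight count, ignoring biases, layer norms and positional embeddings."""
    per_block = 12 * d_model ** 2      # 4*d^2 attention + 8*d^2 feed-forward
    embeddings = vocab_size * d_model  # token embedding matrix
    return n_layers * per_block + embeddings


# n_layers from the summary; d_model and vocab_size are assumed values.
total = approx_decoder_params(n_layers=24, d_model=2048, vocab_size=50257)
print(f"{total / 1e9:.2f}B parameters")
```

With these assumed dimensions the estimate lands close to the stated 1.3B, and 16 heads would give a per-head dimension of 128.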

Optimization Features

Infra Optimization

  • NVIDIA A100 40GB GPUs at CESGA

Model Optimization

  • vocabulary adaptation via new BPE tokenizer and embedding re-init
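Vocabulary adaptation of this kind can be sketched as follows. The summary only says that embeddings are re-initialized for new tokens; copying the vectors of tokens shared by both vocabularies and starting the remaining ones at the mean of the old embeddings is one common strategy and is an assumption here, as are all names in the example.

```python
from typing import Dict, List


def adapt_embeddings(
    old_vocab: Dict[str, int],
    old_embeddings: List[List[float]],
    new_vocab: Dict[str, int],
) -> List[List[float]]:
    """Build an embedding table for new_vocab (ids must be 0..len-1)."""
    dim = len(old_embeddings[0])
    # Mean of all old vectors: the starting point for tokens the base model never saw.
    mean_vector = [
        sum(row[i] for row in old_embeddings) / len(old_embeddings) for i in range(dim)
    ]
    new_embeddings = [list(mean_vector) for _ in range(len(new_vocab))]
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Token exists in both vocabularies: reuse its trained vector.
            new_embeddings[new_id] = list(old_embeddings[old_id])
    return new_embeddings


# Toy example with 2-dimensional embeddings.
old_vocab = {"a": 0, "de": 1, "the": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = {"de": 0, "a": 1, "ñ": 2}
adapted = adapt_embeddings(old_vocab, old_embeddings, new_vocab)
print(adapted)
```

In the toy example "de" and "a" keep their trained vectors under new ids, while "ñ" starts from the mean vector and is learned during continual pretraining.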

System Optimization

  • DeepSpeed ZeRO stage 2
  • BF16 mixed-precision
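A DeepSpeed configuration matching these two settings might look like the sketch below. The `zero_optimization.stage` and `bf16.enabled` keys are standard DeepSpeed options; the batch-size values are hypothetical placeholders, since the summary does not report them.

```python
import json

ds_config = {
    # BF16 mixed precision, as listed above.
    "bf16": {"enabled": True},
    # ZeRO stage 2 partitions optimizer states and gradients across GPUs.
    "zero_optimization": {"stage": 2},
    # Hypothetical values, not taken from the paper.
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Saved to a file, a config like this can be passed to the Hugging Face `Trainer` through the `deepspeed` argument of `TrainingArguments`.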

Training Optimization

  • continual pretraining (warm-start)
  • Adam optimizer (β1=0.9, β2=0.999, ε=1e-8, weight decay=0.1)
  • linear learning rate decay
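The optimizer settings and schedule above can be written down as a small sketch. The Adam hyperparameters are the ones listed; the peak learning rate, the step counts, and the optional warmup are hypothetical, because the summary does not state them.

```python
# Hyperparameters as listed above.
ADAM_KWARGS = {"betas": (0.9, 0.999), "eps": 1e-8, "weight_decay": 0.1}


def linear_decay_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Learning rate at a given step: optional linear warmup, then linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical schedule: 1,000 steps, peak learning rate 1e-4, no warmup.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_decay_lr(step, total_steps=1000, peak_lr=1e-4))
```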

Reproducibility

Data Urls

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Model size limited (~1.3B) — restricts few-shot reasoning and instruction-following.
  • No instruction tuning applied; few-shot benchmark scores are near random for many tasks.
  • Human eval used only 60 held-out texts and 6 evaluators, limiting statistical power.
  • CorpusNÓS still contains punctuation/boilerplate noise that affects output quality.
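The statistical-power limitation can be made concrete with a confidence interval for a proportion. The sketch below computes 95% Wilson score intervals for the reported form-error rates. The sample size of 60 judgments per condition is an assumption for illustration only: the summary gives 60 texts and 6 evaluators but not how many judgments each rate is based on.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N_ASSUMED = 60  # hypothetical number of judgments per condition

rates = [("Carballo-bloom", 0.41), ("Carballo-cerebras", 0.22), ("authentic", 0.27)]
for name, rate in rates:
    low, high = wilson_interval(rate, N_ASSUMED)
    print(f"{name}: {rate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

Under that assumed sample size each interval spans roughly 10 to 12 percentage points on either side of the estimate, so the three rates are not cleanly separated.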

When Not To Use

  • High-stakes factual QA or decision-making without further fine-tuning.
  • Out-of-the-box instruction-following tasks that require reliable few-shot performance.
  • Applications demanding near-perfect formatting and punctuation unless post-processed.

Failure Modes

  • Punctuation and formatting errors inherited from training corpus.
  • Abrupt topic shifts and minor content incoherences in continuations.
  • Low few-shot accuracy on reasoning and acceptability tasks without instruction tuning.

Core Entities

Models

  • Carballo-bloom-1.3B
  • Carballo-cerebras-1.3B
  • FLOR-1.3B
  • Cerebras-GPT-1.3B
  • Bloom-1b1
  • Bloom-1b7
  • mGPT

Metrics

  • Accuracy
  • human error rates (form/content/register/etc.)

Datasets

  • CorpusNÓS
  • Belebele (Galician translation)
  • OpenBookQA (Galician translation)
  • CoLA (Galician translation)
  • Parafrases-gl
  • PAWS-X (Galician translation)

Benchmarks

  • Belebele
  • OpenBookQA
  • CoLA
  • Parafrases-gl
  • PAWS-X

Context Entities

Models

  • LLaMA 2 (indirectly referenced)
  • MarIA family (Spanish context)

Datasets

  • CorpusNÓS cleaning pipeline (perplexity-based boilerplate removal)
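Perplexity-based boilerplate removal can be illustrated with a toy example. The summary only names the technique, so everything below is an assumption: a unigram model with add-one smoothing stands in for whatever language model the pipeline uses, and lines scoring above a hand-picked threshold are dropped.

```python
import math
from collections import Counter


def train_unigram(reference_lines):
    """Count words in clean reference text; returns (counts, total tokens)."""
    counts = Counter(w for line in reference_lines for w in line.lower().split())
    return counts, sum(counts.values())


def perplexity(line, counts, total):
    """Per-word perplexity of a line under an add-one smoothed unigram model."""
    words = line.lower().split()
    if not words:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot for unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))


def drop_boilerplate(lines, counts, total, threshold):
    """Keep only lines whose perplexity is at or below the threshold."""
    return [line for line in lines if perplexity(line, counts, total) <= threshold]


reference = ["o galego é unha lingua romance", "a lingua galega fálase en galicia"]
candidates = ["o galego é unha lingua", "cookies login share subscribe"]
counts, total = train_unigram(reference)
kept = drop_boilerplate(candidates, counts, total, threshold=15.0)
print(kept)
```

The menu-like line consists entirely of words unseen in the reference text, so it scores a much higher perplexity and is filtered out; a real pipeline would use a stronger language model and a threshold tuned on held-out data.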