Juru: a 7B model specialized on 1.9B Brazilian legal tokens that improves legal exam accuracy but harms general knowledge

Overview

Decision Snapshot: Needs Validation

The paper provides clear, reproducible steps and quantitative benchmarks; results are scoped to multiple-choice exams and a single domain, so apply cautiously to other tasks.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 40%

Authors

Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira

Why It Matters For Business

You can cheaply improve an LLM for a legal product by continued pretraining on a modest, high-quality legal corpus, but expect trade-offs: general-purpose capabilities can degrade.

Who Should Care

ML Engineer, Product Manager, CTO, Founder

Summary TLDR

The authors continued pretraining Mistral-7B on 1.9 billion tokens drawn from reputable Brazilian legal sources to produce Juru-7B. Juru raises mean accuracy on Brazilian legal multiple-choice exams by 4.7 percentage points over Mistral-7B, but lowers mean accuracy on Portuguese (−2.4 points) and English (−3.6 points) general-knowledge benchmarks. The model and checkpoints are publicly available on Hugging Face.

Problem Statement

Can a moderately sized general LLM be cheaply specialized for Brazilian law using a small, high-quality legal corpus, and what trade-offs (especially forgetting of general knowledge) arise?

Main Contribution

Collected a curated dataset of legal Portuguese texts (≈1.9B BPE tokens) from academic papers, federal laws, and court decisions.

Continued pretraining Mistral-7B on that dataset to produce Juru-7B, and released the checkpoints on Hugging Face.

Key Findings

Specialization improves legal-exam accuracy vs base model.

Numbers: Mean accuracy +4.7 percentage points (44.5% → 49.2%) on 8 legal exams

Practical Use: If you need better performance on Brazilian legal MCQs, continued pretraining of a general LLM on focused, high-quality legal texts can help with modest compute.

Evidence Ref: Table 5

Specialization causes forgetting on Portuguese general tasks.

Numbers: Mean accuracy −2.4 percentage points (50.5% → 48.1%) across 44 Portuguese exams

Practical Use: Expect small but measurable drops in non-legal Portuguese capabilities after domain-only continued pretraining; test your target tasks before deployment.

Evidence Ref: Table 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	Juru 49.2%	Mistral-7B 44.5%	+4.7%	OAB-2023/OAB-2024/ENAM-2024 (638 Qs)	Table 5 compares Mistral-7B and Juru	Table 5
Accuracy	Juru 48.1%	Mistral-7B 50.5%	−2.4%	ENEM/BLUEX/CPNU/BNDES/REVALIDA/MREX/CFCEQ/CFCES (2,123 Qs)	Table 7 reports per-benchmark accuracies	Table 7
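
The deltas in the table are absolute differences in percentage points, not relative changes. A minimal sketch of the arithmetic, using only the two rows above (the function names are illustrative, not from the paper):

```python
def delta_points(model_acc: float, baseline_acc: float) -> float:
    """Absolute difference in percentage points (model minus baseline)."""
    return round(model_acc - baseline_acc, 1)


def relative_change(model_acc: float, baseline_acc: float) -> float:
    """Change relative to the baseline accuracy, in percent."""
    return round(100.0 * (model_acc - baseline_acc) / baseline_acc, 1)


# (Juru accuracy, Mistral-7B accuracy) as reported in the table above.
results = {
    "legal exams (Table 5)": (49.2, 44.5),
    "Portuguese general exams (Table 7)": (48.1, 50.5),
}

for name, (juru, mistral) in results.items():
    print(name, delta_points(juru, mistral), relative_change(juru, mistral))
```

Reading the gain as a relative change gives a larger-looking number than the reported +4.7, so it helps to state which convention a comparison uses.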

What To Try In 7 Days

Download Juru checkpoints from Hugging Face and run targeted legal MCQ evaluation on your own data.

Compare Juru vs base LLM on the customer tasks that matter (legal QA, summaries) to check real gains.

Run sanity checks on general or multilingual tasks to spot catastrophic forgetting early.
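
The three steps above can be combined into one small harness that scores a base and a specialized model on the same questions and reports per-task deltas. The sketch below is an assumption about how one might organize such a check, not the authors' evaluation code. The `MCQ` type, the task labels, and the stand-in predictors are hypothetical. In practice each `predict` callable would wrap a model loaded from the Hugging Face checkpoint, which is not shown here.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, NamedTuple


class MCQ(NamedTuple):
    task: str           # illustrative label, e.g. "legal" or "general"
    prompt: str         # question text including the listed options
    options: List[str]  # option letters, e.g. ["A", "B", "C", "D"]
    answer: str         # gold option letter


Predictor = Callable[[str, List[str]], str]


def accuracy_by_task(questions: Iterable[MCQ], predict: Predictor) -> Dict[str, float]:
    """Accuracy in percent for each task label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q in questions:
        total[q.task] += 1
        if predict(q.prompt, q.options) == q.answer:
            correct[q.task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def compare(questions: Iterable[MCQ], base: Predictor, tuned: Predictor) -> Dict[str, float]:
    """Per-task delta (tuned minus base) in percentage points."""
    qs = list(questions)
    base_acc = accuracy_by_task(qs, base)
    tuned_acc = accuracy_by_task(qs, tuned)
    return {task: tuned_acc[task] - base_acc[task] for task in base_acc}


# Toy questions and stand-in predictors; replace with real data and real models.
toy = [
    MCQ("legal", "Q1", ["A", "B"], "A"),
    MCQ("legal", "Q2", ["A", "B"], "B"),
    MCQ("general", "Q3", ["A", "B"], "A"),
    MCQ("general", "Q4", ["A", "B"], "A"),
]
base_answers = {"Q1": "A", "Q2": "A", "Q3": "A", "Q4": "A"}
tuned_answers = {"Q1": "A", "Q2": "B", "Q3": "A", "Q4": "B"}

deltas = compare(
    toy,
    lambda prompt, options: base_answers[prompt],
    lambda prompt, options: tuned_answers[prompt],
)
print(deltas)
```

A positive delta on the legal task together with a negative delta on the general task is the pattern to watch for when checking for forgetting.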

Optimization Features

Infra Optimization

TPU v2-256 cluster used; 54.2% MFU (excl. self-attention)
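
For readers unfamiliar with the metric, model FLOPs utilization (MFU) is the ratio of the FLOPs a training run actually achieves to the hardware's theoretical peak. The sketch below assumes the common approximation of 6 FLOPs per parameter per token, which ignores self-attention and so matches the "excl. self-attention" qualifier. The throughput, chip count, and peak values in the example are placeholders, not figures from the paper.

```python
def mfu(tokens_per_second: float, n_params: float,
        num_chips: int, peak_flops_per_chip: float) -> float:
    """Model FLOPs utilization as a fraction between 0 and 1.

    Assumes roughly 6 * n_params FLOPs per token for a forward and
    backward pass, with self-attention FLOPs left out.
    """
    achieved_flops = tokens_per_second * 6.0 * n_params
    peak_flops = num_chips * peak_flops_per_chip
    return achieved_flops / peak_flops


# Placeholder inputs for illustration only.
example = mfu(tokens_per_second=100_000, n_params=7e9,
              num_chips=100, peak_flops_per_chip=1e14)
print(f"{example:.2%}")
```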

Training Optimization

continued pretraining on domain-specific corpus

AdaFactor optimizer

auxiliary logit-damping loss
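
The summary does not say what form the auxiliary loss takes. One common way to damp logits is a z-loss-style penalty on the squared log of the softmax normalizer, added to the usual cross-entropy. The sketch below illustrates that idea under this assumption; the coefficient `coef` is a hypothetical parameter, not a value from the paper.

```python
import math
from typing import Sequence


def log_sum_exp(logits: Sequence[float]) -> float:
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def loss_with_z_penalty(logits: Sequence[float], target: int, coef: float) -> float:
    """Cross-entropy for one token plus coef * (log Z)^2.

    The penalty pushes log Z toward zero, which discourages logits
    from drifting to large magnitudes during training.
    """
    log_z = log_sum_exp(logits)
    cross_entropy = log_z - logits[target]
    return cross_entropy + coef * log_z ** 2


# Shifting every logit by a constant leaves cross-entropy unchanged
# but increases the penalty term.
small = loss_with_z_penalty([0.0, 0.0], target=0, coef=0.1)
large = loss_with_z_penalty([10.0, 10.0], target=0, coef=0.1)
print(small, large)
```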

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/roseval/Juru-7B

Data URLs

https://www.lexml.gov.br

Risks & Boundaries

Limitations

Possible contamination from scraped data near exam publication dates (acknowledged by authors).

Evaluation limited to multiple-choice exams, not open-ended legal tasks like drafting or prediction.

When Not To Use

When you need broad general or English knowledge out-of-the-box.

When the target application requires instruction-following or chat-style behavior (model not fine-tuned for that).

Failure Modes

Catastrophic forgetting: worse accuracy on non-legal tasks after specialization.

Potential data contamination yielding overestimated gains on evaluated exams.

Core Entities

Models

Mistral-7B

Juru-7B

Metrics

Accuracy

Datasets

Curated Brazilian academic legal papers (scraped)

LexML federal laws subset

Sakiyama et al. dataset (Supreme Federal Court decisions and judgments)

Benchmarks

OAB (Brazilian Bar exams) 2023–2024

ENAM-2024

ENEM-2024

BLUEX-2024

CPNU-2024

BNDES-2024

MMLU (college + high school subsets)
