Juru: a 7B model specialized on 1.9B Brazilian legal tokens that improves legal exam accuracy but harms general knowledge

March 26, 2024 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

2

Authors

Roseval Malaquias Junior, Ramon Pires, Roseli Romero, Rodrigo Nogueira

Why It Matters For Business

You can cheaply improve an LLM for a legal product by continued pretraining on a modest, high-quality legal corpus, but expect trade-offs: general-purpose capabilities can degrade.

Summary TLDR

The authors continued pretraining Mistral-7B on 1.9 billion tokens drawn from reputable Brazilian legal sources to produce Juru-7B. Juru raises mean accuracy on Brazilian legal multiple-choice exams by 4.7 percentage points over Mistral-7B, but lowers mean accuracy on Portuguese (−2.4 points) and English (−3.6 points) general-knowledge benchmarks. The model and checkpoints are publicly available on Hugging Face.

Problem Statement

Can a moderately sized general LLM be cheaply specialized for Brazilian law using a small, high-quality legal corpus, and what trade-offs (especially forgetting of general knowledge) arise?

Main Contribution

Collected a curated dataset of legal Portuguese texts (≈1.9B BPE tokens) from academic papers, federal laws, and court decisions.

Continued pretraining Mistral-7B on that dataset to produce Juru-7B, and released the checkpoints on Hugging Face.

Ran a systematic few-shot evaluation on in-domain Brazilian legal exams and out-of-domain general-knowledge suites (Portuguese and English) to quantify gains and forgetting.

Key Findings

Specialization improves legal-exam accuracy vs base model.

Numbers: Mean accuracy +4.7 percentage points (44.5% → 49.2%) on 8 legal exams

Specialization causes forgetting on Portuguese general tasks.

Numbers: Mean accuracy −2.4 percentage points (50.5% → 48.1%) across 44 Portuguese exams

Forgetting is larger on English general knowledge.

Numbers: Mean accuracy −3.6 percentage points (61.6% → 58.0%) on MMLU subsets (4,369 questions)

Pretraining uses modest compute and dataset size.

Numbers: 1.9B tokens gathered; 7.96B tokens processed; 30.61 hours on TPU v2-256; MFU 54.2%
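The compute figures above can be cross-checked with a little arithmetic. The sketch below derives the number of passes over the corpus, the training throughput, and a rough MFU estimate. The token counts and wall-clock time come from this summary; the parameter count, the chip count, and the per-chip peak FLOPs are assumptions made here for illustration (a v2-256 slice read as 256 cores at 2 cores per chip, and the nominal TPU v2 peak). The `6 * N` FLOPs-per-token rule ignores attention, which is at least consistent with the "excl. self-attention" note later in this summary.

```python
# Figures quoted in this summary.
UNIQUE_TOKENS = 1.9e9        # tokens gathered
PROCESSED_TOKENS = 7.96e9    # tokens seen during training
WALL_CLOCK_HOURS = 30.61

# Assumptions, NOT taken from the summary.
PARAMS = 7.24e9              # approximate Mistral-7B parameter count
CHIPS = 128                  # v2-256 read as 256 cores, 2 cores per chip
PEAK_FLOPS_PER_CHIP = 45e12  # nominal TPU v2 peak per chip


def passes_over_corpus(processed: float, unique: float) -> float:
    """How many times the unique corpus was seen, on average."""
    return processed / unique


def tokens_per_second(processed: float, hours: float) -> float:
    """Average training throughput over the whole run."""
    return processed / (hours * 3600)


def estimate_mfu(params: float, tok_per_s: float,
                 chips: int, peak_per_chip: float) -> float:
    """Model FLOPs utilization using the 6 * N FLOPs-per-token rule
    (forward plus backward matmuls, attention FLOPs ignored)."""
    achieved_flops = 6 * params * tok_per_s
    return achieved_flops / (chips * peak_per_chip)


epochs = passes_over_corpus(PROCESSED_TOKENS, UNIQUE_TOKENS)
throughput = tokens_per_second(PROCESSED_TOKENS, WALL_CLOCK_HOURS)
mfu = estimate_mfu(PARAMS, throughput, CHIPS, PEAK_FLOPS_PER_CHIP)

print(f"passes over corpus: {epochs:.2f}")
print(f"throughput: {throughput:,.0f} tokens/s")
print(f"estimated MFU: {mfu:.1%}")
```

Under these assumptions the estimate lands in the same range as the reported MFU, which suggests the quoted figures are mutually consistent; a different reading of the hardware spec would shift the result.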

Results

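Accuracy (Brazilian legal exams)

Value: Juru 49.2%

Baseline: Mistral-7B 44.5%

Accuracy (Portuguese general knowledge)

Value: Juru 48.1%

Baseline: Mistral-7B 50.5%

Accuracy (English MMLU subsets)

Value: Juru 58.0%

Baseline: Mistral-7B 61.6%

Accuracy (legal exams, by training tokens)

Value: 49.2% at 7.1B tokens

Baseline: base at 0B tokens (Mistral-7B)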
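The accuracy pairs reported above are differences in percentage points, which are easy to confuse with relative changes. A minimal sketch that computes both from the baseline and Juru means quoted in this summary; the dictionary labels and the `deltas` helper are names chosen here for illustration.

```python
# (baseline Mistral-7B mean accuracy, Juru mean accuracy), in percent.
RESULTS = {
    "Brazilian legal exams": (44.5, 49.2),
    "Portuguese general knowledge": (50.5, 48.1),
    "English MMLU subsets": (61.6, 58.0),
}


def deltas(baseline: float, specialized: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    points = specialized - baseline
    relative = 100 * points / baseline
    return round(points, 1), round(relative, 1)


for suite, (base, juru) in RESULTS.items():
    points, relative = deltas(base, juru)
    print(f"{suite}: {points:+.1f} points ({relative:+.1f}% relative)")
```

The relative change on the legal exams is roughly twice the size of the absolute one because the baseline sits below 50%, so it matters which of the two a headline number refers to.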

What To Try In 7 Days

Download Juru checkpoints from Hugging Face and run targeted legal MCQ evaluation on your own data.

Compare Juru against the base LLM on the customer tasks that matter (legal QA, summaries) to check whether the gains are real.

Run sanity checks on general or multilingual tasks to spot catastrophic forgetting early.
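The three steps above can be sketched as a small harness: build a few-shot multiple-choice prompt, score a model's first-letter answer, and compare two models across in-domain and general suites. Everything here is hypothetical scaffolding: `predict` stands for whatever function wraps your model (for example a text-generation call on a downloaded checkpoint), the prompt layout is one plausible format rather than the one the authors used, and the 1-point `tolerance` is an arbitrary threshold.

```python
from typing import Callable, Dict, List, Tuple

# (question stem, answer options, gold letter)
Question = Tuple[str, List[str], str]
LETTERS = "ABCDE"


def format_question(stem: str, options: List[str], answer: str = "") -> str:
    """Render one question; few-shot examples include the gold answer."""
    lines = [stem] + [f"{LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)


def build_prompt(shots: List[Question], question: Question) -> str:
    """Few-shot prompt: solved examples followed by the open question."""
    blocks = [format_question(s, o, a) for s, o, a in shots]
    blocks.append(format_question(question[0], question[1]))
    return "\n\n".join(blocks)


def accuracy(predict: Callable[[str], str], shots: List[Question],
             questions: List[Question]) -> float:
    """Percent of questions whose first generated letter matches the gold."""
    correct = 0
    for q in questions:
        guess = predict(build_prompt(shots, q)).strip()[:1].upper()
        correct += guess == q[2]
    return 100 * correct / len(questions)


def compare(base_predict: Callable[[str], str],
            tuned_predict: Callable[[str], str],
            shots: List[Question],
            suites: Dict[str, List[Question]],
            tolerance: float = 1.0) -> Dict[str, dict]:
    """Per-suite accuracy for both models, flagging drops beyond tolerance."""
    report = {}
    for name, questions in suites.items():
        base = accuracy(base_predict, shots, questions)
        tuned = accuracy(tuned_predict, shots, questions)
        report[name] = {
            "base": base,
            "tuned": tuned,
            "delta_points": tuned - base,
            "forgetting": (tuned - base) < -tolerance,
        }
    return report
```

Passing the same `shots` and the same scoring rule to both models keeps the comparison fair; suites flagged with `forgetting` are the ones to inspect before shipping the specialized model.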

Optimization Features

Infra Optimization

  • TPU v2-256 cluster used; 54.2% MFU (excl. self-attention)

Training Optimization

  • continued pretraining on domain-specific corpus
  • AdaFactor optimizer
  • auxiliary logit-damping loss
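This summary does not give the formula for the auxiliary logit-damping loss. One common term of this kind is the z-loss, which penalizes the squared log of the softmax normalizer so that logits do not drift to large magnitudes. The sketch below shows that form in plain Python; treating it as what Juru used is an assumption, and the coefficient `z_coef=1e-4` is only an illustrative default.

```python
import math
from typing import List, Tuple


def log_sum_exp(logits: List[float]) -> float:
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy_with_z_loss(logits: List[float], target: int,
                              z_coef: float = 1e-4
                              ) -> Tuple[float, float, float]:
    """Return (total loss, cross-entropy, z-loss penalty) for one token.

    The penalty z_coef * log(Z)^2 is zero only when log(Z) == 0, so it
    pulls the logits toward a scale where the softmax normalizer is 1.
    """
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_penalty = z_coef * log_z ** 2
    return ce + z_penalty, ce, z_penalty


# Shifting every logit by a constant leaves cross-entropy unchanged
# but increases the penalty, which is what discourages logit drift.
print(cross_entropy_with_z_loss([2.0, 0.5, -1.0], target=0))
print(cross_entropy_with_z_loss([12.0, 10.5, 9.0], target=0))
```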

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Possible contamination from scraped data near exam publication dates (acknowledged by authors).
  • Evaluation limited to multiple-choice exams, not open-ended legal tasks like drafting or prediction.
  • Dataset covers a specific subset of Brazilian legal texts; other legal subdomains could behave differently.
  • Model was not instruction-tuned; performance on conversational/legal advice tasks is untested.

When Not To Use

  • When you need broad general or English knowledge out-of-the-box.
  • When the target application requires instruction-following or chat-style behavior (model not fine-tuned for that).
  • In high-stakes legal decisions without thorough external validation and human oversight.

Failure Modes

  • Catastrophic forgetting: worse accuracy on non-legal tasks after specialization.
  • Potential data contamination yielding overestimated gains on evaluated exams.
  • Uneven performance across specific knowledge areas (some domains drop more than others).
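The contamination risk listed above can be screened for with a simple n-gram overlap check between exam questions and training documents. This is a minimal sketch of one generic heuristic, not the authors' procedure; the function names, the 8-gram window, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Lowercased word n-grams; \\w+ also matches accented Portuguese letters."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, corpus_ngrams: Set[NGram], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & corpus_ngrams) / len(q)


def flag_contaminated(questions: Iterable[str], documents: Iterable[str],
                      n: int = 8, threshold: float = 0.5) -> List[str]:
    """Return questions whose overlap with the corpus meets the threshold."""
    corpus: Set[NGram] = set()
    for doc in documents:
        corpus |= ngrams(doc, n)
    return [q for q in questions if overlap_ratio(q, corpus, n) >= threshold]
```

Re-scoring the model with flagged questions removed gives a rough sense of how much of a measured gain could be due to leakage; a small change suggests the gain is genuine.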

Core Entities

Models

  • Mistral-7B
  • Juru-7B

Metrics

  • Accuracy

Datasets

  • Curated Brazilian academic legal papers (scraped)
  • LexML federal laws subset
  • Sakiyama et al. dataset (Supreme Federal Court decisions and judgments)

Benchmarks

  • OAB (Brazilian Bar exams) 2023–2024
  • ENAM-2024
  • ENEM-2024
  • BLUEX-2024
  • CPNU-2024
  • BNDES-2024
  • MMLU (college + high school subsets)