Overview
Production Readiness
0.4
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
2
Why It Matters For Business
You can cheaply improve an LLM for a legal product by continued pretraining on a modest, high-quality legal corpus, but expect trade-offs: general-purpose capabilities can degrade.
Summary TLDR
The authors continued pretraining Mistral-7B on 1.9 billion tokens drawn from reputable Brazilian legal sources to produce Juru-7B. Juru raises accuracy on Brazilian legal multiple-choice exams (mean +4.7% vs Mistral-7B) but loses accuracy on Portuguese (mean -2.4%) and English (mean -3.6%) general-knowledge benchmarks. The model and checkpoints are publicly available on Hugging Face.
Problem Statement
Can a moderately sized general LLM be cheaply specialized for Brazilian law using a small, high-quality legal corpus, and what trade-offs (especially forgetting) arise for general knowledge?
Main Contribution
Collected a curated dataset of legal Portuguese texts (≈1.9B BPE tokens) from academic papers, federal laws, and court decisions.
Continued the pretraining of Mistral-7B on that dataset to produce Juru-7B, and released the checkpoints on Hugging Face.
Ran a systematic few-shot evaluation on in-domain Brazilian legal exams and out-of-domain general-knowledge suites (Portuguese and English) to quantify gains and forgetting.
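A few-shot multiple-choice evaluation of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' harness: the `MCQ` type, the English prompt template, and the helper names are all hypothetical, and the summary does not state the prompt format or the number of shots used.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCDE"  # assumed option labels; real exams may differ


@dataclass
class MCQ:
    """One multiple-choice item (hypothetical structure)."""
    question: str
    options: List[str]  # option texts, in order
    answer: str         # gold letter, e.g. "B"


def format_item(item: MCQ, with_answer: bool) -> str:
    """Render one item; solved shots include the gold letter."""
    lines = [f"Question: {item.question}"]
    for letter, text in zip(LETTERS, item.options):
        lines.append(f"{letter}) {text}")
    lines.append(f"Answer: {item.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(shots: List[MCQ], target: MCQ) -> str:
    """Concatenate solved examples, then the unsolved target item."""
    blocks = [format_item(s, with_answer=True) for s in shots]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: List[str], items: List[MCQ]) -> float:
    """Fraction of predicted letters that match the gold letters."""
    if len(predictions) != len(items):
        raise ValueError("predictions and items must have the same length")
    if not items:
        return 0.0
    correct = sum(
        pred.strip().upper() == item.answer
        for pred, item in zip(predictions, items)
    )
    return correct / len(items)
```

The prompt string would be passed to whichever model is under test, and the first letter it generates compared against the gold answer with `accuracy`.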
Key Findings
Specialization improves legal-exam accuracy over the base model (mean +4.7% vs Mistral-7B).
Specialization causes forgetting on Portuguese general-knowledge tasks (mean -2.4%).
Forgetting is larger on English general knowledge (mean -3.6%).
Pretraining uses modest compute and a small dataset (≈1.9B tokens).
Results
Accuracy
Who Should Care
What To Try In 7 Days
Download Juru checkpoints from Hugging Face and run targeted legal MCQ evaluation on your own data.
Compare Juru vs base LLM on the customer tasks that matter (legal QA, summaries) to check real gains.
Run sanity checks on general or multilingual tasks to spot catastrophic forgetting early.
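The comparison in the steps above comes down to per-benchmark accuracy deltas, averaged separately for legal and general suites. A minimal sketch, assuming you already have accuracy per benchmark for both models; the function names are hypothetical and the benchmark names and numbers in the usage note are placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, Optional, Set


def accuracy_deltas(
    base: Dict[str, float], specialized: Dict[str, float]
) -> Dict[str, float]:
    """Specialized minus base accuracy, for benchmarks both models were run on."""
    shared = base.keys() & specialized.keys()
    return {name: specialized[name] - base[name] for name in sorted(shared)}


def summarize(
    deltas: Dict[str, float], in_domain: Set[str]
) -> Dict[str, Optional[float]]:
    """Mean delta on in-domain (legal) vs out-of-domain (general) benchmarks.

    A negative out-of-domain mean is a sign of forgetting.
    """
    legal = [d for name, d in deltas.items() if name in in_domain]
    general = [d for name, d in deltas.items() if name not in in_domain]
    return {
        "in_domain_mean_delta": mean(legal) if legal else None,
        "out_of_domain_mean_delta": mean(general) if general else None,
    }
```

For example, `summarize(accuracy_deltas({"legal_exam": 0.50, "general_qa": 0.60}, {"legal_exam": 0.56, "general_qa": 0.57}), {"legal_exam"})` reports a positive in-domain delta and a negative out-of-domain delta, which is the pattern the summary describes for Juru.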
Optimization Features
Infra Optimization
- TPU v2-256 cluster used; 54.2% model FLOPs utilization (MFU), excluding self-attention
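An MFU figure that excludes self-attention is commonly estimated with the approximation of about 6 FLOPs per parameter per training token (forward plus backward pass). The sketch below assumes that approximation; the summary does not give the authors' exact formula, throughput, or hardware peak, so all three inputs are left as parameters and the function names are hypothetical.

```python
def flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token, ignoring self-attention.

    Uses the common 6 * N rule of thumb (2N forward, 4N backward).
    """
    return 6.0 * n_params


def mfu(
    n_params: float,
    tokens_per_second: float,
    peak_flops_per_second: float,
) -> float:
    """Model FLOPs utilization: achieved model FLOPs/s over hardware peak FLOPs/s."""
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved = flops_per_token(n_params) * tokens_per_second
    return achieved / peak_flops_per_second
```

To reproduce a number like the one reported, you would plug in the model's parameter count, the measured training throughput in tokens per second, and the aggregate peak FLOPs of the accelerator slice.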
Training Optimization
- continued pretraining on domain-specific corpus
- AdaFactor optimizer
- auxiliary logit-damping loss
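The summary does not define the auxiliary logit-damping loss. One well-known loss of this kind is the z-loss used in PaLM, which penalizes the squared log of the softmax normalizer so that logits do not drift to large magnitudes. The sketch below shows that form for a single token in plain Python; whether the authors used exactly this loss or this coefficient is an assumption.

```python
import math
from typing import List


def log_sum_exp(logits: List[float]) -> float:
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def cross_entropy(logits: List[float], target: int) -> float:
    """Standard per-token cross-entropy: log Z minus the target logit."""
    return log_sum_exp(logits) - logits[target]


def loss_with_z_penalty(
    logits: List[float], target: int, z_coeff: float = 1e-4
) -> float:
    """Cross-entropy plus z_coeff * (log Z)^2.

    z_coeff = 1e-4 is the value reported for PaLM; it is an assumption here,
    not a setting taken from the Juru paper.
    """
    log_z = log_sum_exp(logits)
    return (log_z - logits[target]) + z_coeff * log_z ** 2
```

The penalty is zero only when log Z is zero, so it nudges the normalizer toward 1 while leaving the relative ordering of logits to the cross-entropy term.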
Reproducibility
Data Urls
Open Source Status
- partial
Risks & Boundaries
Limitations
- Possible contamination from scraped data near exam publication dates (acknowledged by authors).
- Evaluation limited to multiple-choice exams, not open-ended legal tasks like drafting or prediction.
- Dataset covers a specific subset of Brazilian legal texts; other legal subdomains could behave differently.
- Model was not instruction-tuned; performance on conversational/legal advice tasks is untested.
When Not To Use
- When you need broad general or English knowledge out of the box.
- When the target application requires instruction-following or chat-style behavior (model not fine-tuned for that).
- In high-stakes legal decisions without thorough external validation and human oversight.
Failure Modes
- Catastrophic forgetting: worse accuracy on non-legal tasks after specialization.
- Potential data contamination yielding overestimated gains on evaluated exams.
- Uneven performance across specific knowledge areas (some domains drop more than others).
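The contamination risk listed above can be probed with a simple n-gram overlap check between exam questions and training documents. This is a generic sketch, not a procedure described in the source: whitespace tokenization, the 8-gram window, and the function names are all assumptions, and a real check would also need normalization suited to Portuguese legal text.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(question: str, corpus_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that appear in any corpus document.

    Returns 0.0 when the question is shorter than n tokens.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    seen: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        seen |= question_grams & ngrams(doc, n)
    return len(seen) / len(question_grams)
```

Questions with a high ratio are candidates for manual review or exclusion before reporting accuracy gains.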
Core Entities
Models
- Mistral-7B
- Juru-7B
Metrics
- Accuracy
Datasets
- Curated Brazilian academic legal papers (scraped)
- LexML federal laws subset
- Sakiyama et al. dataset (Supreme Federal Court decisions and judgments)
Benchmarks
- OAB (Brazilian Bar exams) 2023–2024
- ENAM-2024
- ENEM-2024
- BLUEX-2024
- CPNU-2024
- BNDES-2024
- MMLU (college + high school subsets)

