Overview
The benchmark and model are useful research and evaluation assets. MoZi shows clear gains after IP tuning. However, absolute performance is low on key tasks, so do not deploy without extra verification or task-specific pipelines.
Citations: 3
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Yes
At A Glance
Cost impact: 40%
Production readiness: 30%
Novelty: 60%
Why It Matters For Business
IP tasks require factual, language-specific knowledge. MoZIP and MoZi show that domain tuning helps, but general LLMs still miss facts, so verify model outputs before relying on them for legal or IP decisions.
Summary TLDR
This paper introduces MoZIP, a multilingual benchmark for intellectual property (IP) tasks (IPQuiz, IPQA, PatentMatch), and MoZi-7b, an IP-finetuned model based on BLOOMZ. MoZi improves substantially over its BLOOMZ-7b base (≈+10.1 percentage points on the IPQuiz average) but still lags behind ChatGPT. Absolute accuracy is low for both (IPQuiz average: MoZi 39.4%, ChatGPT 49.6%), which shows that current LLMs still struggle with IP knowledge and long patent texts.
Problem Statement
There is no standardized multilingual benchmark to measure LLM performance on intellectual property tasks. Off-the-shelf LLMs lack reliable IP knowledge and struggle to compare long patent texts. Practitioners need a focused dataset and a domain-tuned model to quantify gaps and guide improvements.
Main Contribution
MoZIP benchmark: three datasets (IPQuiz, IPQA, PatentMatch) covering seven to nine languages and three task types.
MoZi-7b: a BLOOMZ-MT-7B model further trained on 24M patents, 3M general instructions, and ~59k IP-specific instructions.
Key Findings
Domain tuning on patents and IP instructions raises performance versus the BLOOMZ-7b base.
Even the best evaluated model (ChatGPT) falls short of reliable passing performance on IP tasks.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 39.4% | ChatGPT 49.6% | -10.2 pp vs ChatGPT | IPQuiz (7 languages) | Table 2 shows averages across EN/ZH/XL | Table 2 |
| Accuracy | 27.4% | ChatGPT 38.8% | -11.4 pp vs ChatGPT | PatentMatch (EN & ZH) | Table 3 averages for EN and ZH | Table 3 |
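The Delta column is a plain difference in percentage points between MoZi's accuracy and the ChatGPT baseline. The following minimal sketch recomputes it from the two rows above; the function name `delta_pp` and the `rows` dictionary are illustrative and do not come from the paper.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value_pct - baseline_pct, 1)


# Accuracy figures copied from the Results table: (MoZi, ChatGPT baseline).
rows = {
    "IPQuiz": (39.4, 49.6),
    "PatentMatch": (27.4, 38.8),
}

deltas = {name: delta_pp(value, baseline) for name, (value, baseline) in rows.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} pp vs ChatGPT")
```

A percentage-point delta is an absolute gap, not a relative one: a 10.2 pp gap on a 49.6% baseline is roughly a fifth of the baseline's score.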
What To Try In 7 Days
Run your IP prompts against MoZi (or a domain-finetuned model) and compare answers to a general LLM to spot differences.
Use the IPQuiz subset as a quick internal test to measure baseline IP knowledge in your models.
Add simple retrieval (BM25) and embeddings as a guardrail before asking an LLM to compare patents.
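The retrieval guardrail in the last item could look roughly like the sketch below: score candidate patents against the query patent with BM25 and pass only the top-ranked, non-zero candidates to the LLM. This is a hand-written Okapi BM25 variant with whitespace tokenization, not the paper's pipeline; the function names, the `k1` and `b` defaults, and the sample abstracts are all assumptions, and the embedding step is left out for brevity.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer (an assumption; swap in a real one).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with an Okapi BM25 variant."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(d) for d in tokenized) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def shortlist(query: str, docs: list[str], top_k: int = 2) -> list[int]:
    """Return indices of the top_k candidates that share at least one query term."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:top_k] if scores[i] > 0]


# Hypothetical abstracts, used only to exercise the functions.
candidates = [
    "a battery electrode with a lithium coating",
    "a method for brewing coffee with a paper filter",
    "lithium battery cell with coated electrode layers",
]
query_patent = "lithium battery electrode coating"
print(shortlist(query_patent, candidates))
```

Candidates with a zero score share no terms with the query, so dropping them before the LLM call is a cheap way to avoid asking the model to compare unrelated patents.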
Risks & Boundaries
Limitations
The initial MoZIP release mainly covers 7–9 major languages; coverage of low-resource languages remains scarce.
PatentMatch uses abstracts and long inputs (>1,000 tokens), which many LLMs handle poorly.
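One way to act on the long-input limitation is to flag inputs that are likely to exceed the 1,000-token mark before sending them to a model. The sketch below uses a whitespace word count as a rough proxy. That is an assumption: real subword tokenizers usually produce more tokens than words, so treat the result as a lower bound. The function names and the default limit are illustrative.

```python
def approx_token_count(text: str) -> int:
    # Whitespace word count; a rough lower bound on a subword token count.
    return len(text.split())


def flag_long_inputs(texts: list[str], limit: int = 1000) -> list[int]:
    """Return indices of texts whose approximate token count exceeds the limit."""
    return [i for i, text in enumerate(texts) if approx_token_count(text) > limit]


# Hypothetical inputs: one short abstract and one artificially long text.
samples = ["a short abstract about a valve", "claim " * 1200]
print(flag_long_inputs(samples))
```

Flagged inputs can then be routed to truncation, chunking, or human review, depending on what the task allows.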
When Not To Use
For direct, unverified legal advice or final IP decisions: the models are not accurate enough.
As a single-source production classifier for patent similarity without additional retrieval or human review.
Failure Modes
Hallucinated or incorrect legal facts in generated answers.
Poor performance on long patent texts leading to wrong match choices.

