Overview
Production Readiness
0.3
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
3
Why It Matters For Business
IP tasks need factual, language-specific understanding. MoZIP and MoZi show that domain tuning helps, but general LLMs still miss facts, so verify outputs before relying on them for legal or IP decisions.
Summary TLDR
This paper introduces MoZIP, a multilingual benchmark for intellectual property (IP) tasks (IPQuiz, IPQA, PatentMatch) and an IP-finetuned BLOOMZ-based model called MoZi-7b. MoZi improves substantially over its BLOOMZ-7b base (≈+10.1 percentage points on IPQuiz average) but lags behind ChatGPT. Overall accuracy numbers are low (MoZi ~39% IPQuiz avg, ChatGPT ~49.6%), showing current LLMs still struggle with IP knowledge and long patent texts.
Problem Statement
There is no standardized multilingual benchmark to measure LLM performance on intellectual property tasks. Off-the-shelf LLMs lack reliable IP knowledge and struggle to compare long patent texts. Practitioners need a focused dataset and a domain-tuned model to quantify gaps and guide improvements.
Main Contribution
MoZIP benchmark: three datasets (IPQuiz, IPQA, PatentMatch) covering seven to nine languages and three task types.
MoZi-7b: a BLOOMZ-MT-7B model further trained on 24M patents, 3M general instructions, and ~59k IP-specific instructions.
Baseline evaluation of five LLMs (MoZi, BLOOMZ-7b, BELLE-7b, ChatGLM-6b, ChatGPT) showing large room for improvement in IP tasks.
Public release of code, data, and model checkpoints.
Key Findings
Domain tuning on patents and IP instructions raises performance versus the BLOOMZ-7b base.
Even the best evaluated model (ChatGPT, ~49.6% average accuracy on IPQuiz) falls short of a reliable passing score on IP tasks.
Matching patents by meaning is hard for current LLMs, especially over long inputs.
Human evaluation on IPQA is reasonably consistent across annotators.
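As a rough illustration of the two metrics behind these findings, the sketch below computes multiple-choice accuracy and an agreement score between two annotators. The function names and sample labels are hypothetical, and Cohen's kappa is only one common agreement measure, used here as an assumption; the summary does not state which measure the paper reports.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: kappa stands in for 'inter-annotator agreement' here;
    the source does not name a specific measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical example data, not taken from the benchmark.
quiz_accuracy = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
agreement = cohen_kappa(["win", "win", "loss", "loss"],
                        ["win", "win", "loss", "win"])
```

Reporting chance-corrected agreement alongside raw accuracy helps when the label distribution is skewed, since raw agreement alone can look high by chance.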
Results
Accuracy
Accuracy
IPQA human comparison wins (MoZi vs BLOOMZ-7b)
Who Should Care
What To Try In 7 Days
Run your IP prompts against MoZi (or a domain-finetuned model) and compare answers to a general LLM to spot differences.
Use the IPQuiz subset as a quick internal test to measure baseline IP knowledge in your models.
Add simple retrieval (BM25) and embeddings as a guardrail before asking an LLM to compare patents.
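The retrieval guardrail in the last item can be sketched as a small BM25 prefilter that shortlists candidate abstracts before any LLM comparison. This is a minimal sketch under stated assumptions: the function names, the `k1` and `b` defaults, and the sample abstracts are illustrative, the tokenizer only handles ASCII text (so it would need replacing for Chinese or other non-Latin scripts), and the embedding step is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase ASCII word tokenizer (assumption: English-only input)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    q_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def shortlist(query, docs, top_k=2):
    """Return indices of the top_k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical abstracts for illustration only.
abstracts = [
    "A battery electrode comprising lithium iron phosphate particles",
    "A method for brewing coffee using a pressurized chamber",
    "Lithium battery cathode material with coated phosphate particles",
]
candidates = shortlist("lithium phosphate battery electrode", abstracts)
```

Only the shortlisted abstracts would then be passed to the LLM, which keeps prompts shorter and gives a lexical sanity check on whatever match the model picks.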
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The initial MoZIP release covers mainly 7–9 major languages; low-resource languages are sparsely covered.
- PatentMatch uses abstracts and long inputs (>1,000 tokens), which many LLMs handle poorly.
- MoZi's training data includes proprietary instruction mixes and generated conversations, which may bias its behavior.
When Not To Use
- For direct, unverified legal advice or final IP decisions; the models are not sufficiently accurate.
- As a single-source production classifier for patent similarity without additional retrieval or human review.
- For low-resource language IP tasks not covered in the dataset.
Failure Modes
- Hallucinated or incorrect legal facts in generated answers.
- Poor performance on long patent texts leading to wrong match choices.
- Overfitting to instruction formats used in finetuning; weaker generalization to unseen IP styles.
Core Entities
Models
- MoZi-7b
- BLOOMZ-7b
- BELLE-7b
- ChatGLM-6b
- ChatGPT
Metrics
- Accuracy
- Inter-annotator agreement
Datasets
- MoZIP
- IPQuiz
- IPQA
- PatentMatch
- IPFAQ
- IPACT
- CNIPA patent crawl
- WIPO patent data
Benchmarks
- MoZIP

