MoZIP: a 3-part multilingual benchmark plus an IP-tuned 7B model to test how well LLMs handle patent and IP tasks

February 26, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The benchmark and model are useful research and evaluation assets. MoZi shows clear gains after IP tuning. However, absolute performance is low on key tasks, so do not deploy without extra verification or task-specific pipelines.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Shiwen Ni, Minghuan Tan, Yuelin Bai, Fuqiang Niu, Min Yang, Bowen Zhang, Ruifeng Xu, Xiaojun Chen, Chengming Li, Xiping Hu, Ye Li, Jianping Fan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

IP tasks need factual, language-specific understanding. MoZIP and MoZi show that domain tuning helps, but general LLMs still miss facts, so verify outputs before relying on them for legal or IP decisions.

Summary TLDR

This paper introduces MoZIP, a multilingual benchmark for intellectual property (IP) tasks made up of three datasets (IPQuiz, IPQA, PatentMatch), and MoZi-7b, an IP-fine-tuned model built on BLOOMZ. MoZi improves substantially over its BLOOMZ-7b base on the IPQuiz average (39.4% vs 29.3%, about +10.1 percentage points) but still trails ChatGPT (49.6%). Absolute accuracy is low for every model, which shows that current LLMs still struggle with IP knowledge and long patent texts.

Problem Statement

There is no standardized multilingual benchmark to measure LLM performance on intellectual property tasks. Off-the-shelf LLMs lack reliable IP knowledge and struggle to compare long patent texts. Practitioners need a focused dataset and a domain-tuned model to quantify gaps and guide improvements.

Main Contribution

MoZIP benchmark: three datasets (IPQuiz, IPQA, PatentMatch) covering seven to nine languages and three task types.

MoZi-7b: a BLOOMZ-MT-7B model further trained on 24M patents, 3M general instructions, and ~59k IP-specific instructions.

Key Findings

Domain tuning on patents and IP instructions raises performance versus the BLOOMZ-7b base.

Numbers: IPQuiz average: MoZi 39.4% vs BLOOMZ-7b 29.3% (+10.1 pp)

Practical Use: Fine-tune a multilingual base model with patent text and IP Q&A to get immediate, measurable gains on IP questions.

Evidence Ref: Table 2

Even the best evaluated model (ChatGPT) falls short of reliable passing performance on IP tasks.

Numbers: IPQuiz-en: ChatGPT 60.8%; IPQuiz-average: ChatGPT 49.6%

Practical Use: Do not rely on general-purpose LLMs alone for IP-critical decisions; add verification or specialist tooling.

Evidence Ref: Table 2; text
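
The accuracy comparisons above can be approximated on an internal question set. What follows is a minimal sketch, assuming multiple-choice items with a single gold letter. The item fields (`question`, `answer`), the `accuracy` and `compare` helpers, and the stand-in models are all hypothetical; they are not MoZIP's actual schema or the authors' evaluation script.

```python
from typing import Callable, Dict, List

# Hypothetical item format: field names are illustrative, not MoZIP's schema.
QuizItem = Dict[str, str]
Model = Callable[[str], str]


def accuracy(model: Model, items: List[QuizItem]) -> float:
    """Percentage of items whose first reply character matches the gold letter."""
    if not items:
        raise ValueError("no quiz items to score")
    correct = 0
    for item in items:
        reply = model(item["question"]).strip().upper()
        if reply[:1] == item["answer"].strip().upper():
            correct += 1
    return 100.0 * correct / len(items)


def compare(models: Dict[str, Model], items: List[QuizItem]) -> Dict[str, float]:
    """Score every model on the same items so the numbers are comparable."""
    return {name: accuracy(fn, items) for name, fn in models.items()}


# Placeholder items and stand-in "models" so the sketch runs without any API.
items = [
    {"question": "placeholder question 1 (options A-D)", "answer": "A"},
    {"question": "placeholder question 2 (options A-D)", "answer": "C"},
]
gold = {item["question"]: item["answer"] for item in items}
models = {
    "always_a": lambda question: "A",
    "oracle": lambda question: gold[question],
}
scores = compare(models, items)
print(scores)
```

In practice each stand-in would be replaced by a call to the model under test, and the reply parsing would need to match however that model formats its answers.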

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 39.4% | ChatGPT 49.6% | -10.2 pp vs ChatGPT | IPQuiz (7 languages) | Table 2 shows averages across EN/ZH/XL | Table 2
Accuracy | 27.4% | ChatGPT 38.8% | -11.4 pp vs ChatGPT | PatentMatch (EN & ZH) | Table 3 averages for EN and ZH | Table 3
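
The deltas reported here are plain differences in percentage points, not relative changes. A short sketch of that arithmetic, using only the figures quoted in this summary (the helper names `pp_delta` and `relative_change` are illustrative):

```python
# Accuracy figures (in percent) copied from this summary: (value, baseline).
RESULTS = {
    "IPQuiz avg, MoZi-7b vs BLOOMZ-7b": (39.4, 29.3),
    "IPQuiz avg, MoZi-7b vs ChatGPT": (39.4, 49.6),
    "PatentMatch avg, 27.4% vs ChatGPT": (27.4, 38.8),
}


def pp_delta(value: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - baseline, 1)


def relative_change(value: float, baseline: float) -> float:
    """Change relative to the baseline, as a percentage of the baseline."""
    return round(100.0 * (value - baseline) / baseline, 1)


for name, (value, baseline) in RESULTS.items():
    print(f"{name}: {pp_delta(value, baseline):+.1f} pp, "
          f"{relative_change(value, baseline):+.1f}% relative")
```

Keeping the two measures apart matters when reading the results: a gain of about 10 pp over a base near 29% is a much larger relative improvement than the same 10 pp would be over a base near 50%.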

What To Try In 7 Days

Run your IP prompts against MoZi (or a domain-finetuned model) and compare answers to a general LLM to spot differences.

Use the IPQuiz subset as a quick internal test to measure baseline IP knowledge in your models.

Add simple retrieval (BM25) and embeddings as a guardrail before asking an LLM to compare patents.
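
The retrieval guardrail in the last item can be sketched with a small, dependency-free Okapi BM25 scorer. This is an illustrative implementation under stated assumptions, not the paper's pipeline: the tokenizer is a naive lowercase split suited to English only, `k1` and `b` use common default values, the IDF uses a smoothed non-negative variant, and the sample abstracts are invented. The embedding half of the suggestion is not covered here.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Naive lowercase alphanumeric tokenizer (an assumption; English only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 index over a list of documents."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        if not docs:
            raise ValueError("need at least one document")
        self.k1, self.b, self.n = k1, b, len(docs)
        tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in tokens]
        self.lengths = [len(t) for t in tokens]
        self.avg_len = (sum(self.lengths) / self.n) or 1.0
        doc_freq = Counter()
        for t in tokens:
            doc_freq.update(set(t))
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {
            term: math.log(1 + (self.n - f + 0.5) / (f + 0.5))
            for term, f in doc_freq.items()
        }

    def score(self, query: str, idx: int) -> float:
        tf = self.tfs[idx]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[idx] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int = 3) -> List[Tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(self.n)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented abstracts, for illustration only.
abstracts = [
    "A lithium battery electrode with a silicon coating",
    "A method for brewing coffee using pressurized steam",
    "Silicon anode material for lithium ion battery cells",
]
index = BM25(abstracts)
shortlist = index.top_k("silicon anode for lithium battery", k=2)
print(shortlist)
```

One way to use this as a guardrail is to pass only the shortlisted candidates to the LLM, or to flag for human review any LLM match that falls outside the shortlist.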

Risks & Boundaries

Limitations

MoZIP initial release covers mainly 7–9 major languages; low-resource languages remain scarce.

PatentMatch uses abstracts and long inputs (>1,000 tokens), which many LLMs handle poorly.
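
Given the long-input limitation, it can help to flag oversized inputs before they reach a model. A minimal sketch, assuming a whitespace word count as a rough stand-in for a real tokenizer (subword tokenizers usually produce more tokens than this) and a hypothetical budget of 1,000 taken from the figure above:

```python
from typing import List


def approx_token_count(text: str) -> int:
    """Crude estimate: whitespace-separated words, not real model tokens."""
    return len(text.split())


def flag_long_inputs(texts: List[str], limit: int = 1000) -> List[int]:
    """Return the indices of texts whose estimated length exceeds the budget."""
    return [i for i, t in enumerate(texts) if approx_token_count(t) > limit]


texts = ["a short abstract", "word " * 1500]
print(flag_long_inputs(texts))
```

Flagged inputs could then be chunked, summarized, or routed to human review rather than compared in a single prompt.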

When Not To Use

For direct, unverified legal advice or final IP decisions: the models are not accurate enough.

As a single-source production classifier for patent similarity without additional retrieval or human review.

Failure Modes

Hallucinated or incorrect legal facts in generated answers.

Poor performance on long patent texts leading to wrong match choices.

Core Entities

Models

MoZi-7b, BLOOMZ-7b, BELLE-7b, ChatGLM-6b, ChatGPT

Metrics

Accuracy, inter-annotator agreement

Datasets

MoZIP, IPQuiz, IPQA, PatentMatch, IPFAQ, IPACT, CNIPA patent crawl, WIPO patent data

Benchmarks

MoZIP