MoZIP: a 3-part multilingual benchmark plus an IP-tuned 7B model to test how well LLMs handle patent and IP tasks

Overview

Decision Snapshot: Needs Validation

The benchmark and model are useful research and evaluation assets. MoZi shows clear gains after IP tuning. However, absolute performance is low on key tasks, so do not deploy without extra verification or task-specific pipelines.

Citations: 3

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 40%

Production readiness: 30%

Novelty: 60%

Authors

Shiwen Ni, Minghuan Tan, Yuelin Bai, Fuqiang Niu, Min Yang, Bowen Zhang, Ruifeng Xu, Xiaojun Chen, Chengming Li, Xiping Hu, Ye Li, Jianping Fan

Links

Abstract / PDF / Code / Data

Why It Matters For Business

IP tasks need factual, language-specific understanding. MoZIP and MoZi show that domain tuning helps, but general LLMs still miss facts, so verify outputs before relying on them for legal or IP decisions.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist

Summary TLDR

This paper introduces MoZIP, a multilingual benchmark for intellectual property (IP) tasks (IPQuiz, IPQA, PatentMatch), and MoZi-7b, a BLOOMZ-based model fine-tuned on IP data. MoZi improves substantially over its BLOOMZ-7b base (about +10.1 percentage points on the IPQuiz average) but still lags behind ChatGPT. Absolute accuracy is low for both (IPQuiz average: MoZi 39.4%, ChatGPT 49.6%), showing that current LLMs still struggle with IP knowledge and long patent texts.

Problem Statement

There is no standardized multilingual benchmark to measure LLM performance on intellectual property tasks. Off-the-shelf LLMs lack reliable IP knowledge and struggle to compare long patent texts. Practitioners need a focused dataset and a domain-tuned model to quantify gaps and guide improvements.

Main Contribution

MoZIP benchmark: three datasets (IPQuiz, IPQA, PatentMatch) covering seven to nine languages and three task types.

MoZi-7b: a BLOOMZ-MT-7B model further trained on 24M patents, 3M general instructions, and ~59k IP-specific instructions.

Key Findings

Domain tuning on patents and IP instructions raises performance versus the BLOOMZ-7b base.

Numbers: IPQuiz average: MoZi 39.4% vs BLOOMZ-7b 29.3% (+10.1 pp)

Practical Use: Fine-tune a multilingual base model with patent text and IP Q&A to get immediate, measurable gains on IP questions.

Evidence Ref: Table 2

Even the best evaluated model (ChatGPT) falls short of reliable passing performance on IP tasks.

Numbers: IPQuiz-en: ChatGPT 60.8%; IPQuiz-average: ChatGPT 49.6%

Practical Use: Do not rely on general-purpose LLMs alone for IP-critical decisions; add verification or specialist tooling.

Evidence Ref: Table 2; text

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	39.4%	ChatGPT 49.6%	-10.2 pp vs ChatGPT	IPQuiz (7 languages)	Table 2 shows averages across EN/ZH/XL	Table 2
Accuracy	27.4%	ChatGPT 38.8%	-11.4 pp vs ChatGPT	PatentMatch (EN & ZH)	Table 3 averages for EN and ZH	Table 3
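The Delta column is a plain difference in percentage points between the reported value and the ChatGPT baseline. A minimal sketch of that arithmetic follows, together with an exact-match accuracy helper for scoring your own runs. The function names `accuracy` and `delta_pp` are illustrative and not taken from the paper or its code.

```python
def accuracy(predictions, gold):
    """Return the percentage of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def delta_pp(value, baseline):
    """Difference in percentage points, rounded to one decimal place."""
    return round(value - baseline, 1)


# Averages as listed in the Results table.
ipquiz_delta = delta_pp(39.4, 49.6)
patentmatch_delta = delta_pp(27.4, 38.8)

print(ipquiz_delta, patentmatch_delta)
```

Reporting the gap in percentage points rather than as a relative change keeps it directly comparable with the +10.1 pp gain over BLOOMZ-7b quoted under Key Findings.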

What To Try In 7 Days

Run your IP prompts against MoZi (or a domain-finetuned model) and compare answers to a general LLM to spot differences.

Use the IPQuiz subset as a quick internal test to measure baseline IP knowledge in your models.

Add simple retrieval (BM25) and embeddings as a guardrail before asking an LLM to compare patents.
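The retrieval guardrail in the last item could look roughly like the sketch below. It is a small pure-Python BM25 scorer with whitespace tokenization. It covers only the BM25 half of the suggestion (no embeddings), and the class and helper names (`BM25`, `guardrail_agrees`) and the parameter defaults `k1=1.5`, `b=0.75` are assumptions made for illustration, not part of MoZIP or MoZi.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a real pipeline would use a proper tokenizer.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        # docs must be a non-empty list of strings.
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            f = tf[t]
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Candidate indices, best match first.
        return sorted(range(len(self.docs)), key=lambda i: self.score(query, i), reverse=True)


def guardrail_agrees(query, candidates, llm_choice):
    """True if the LLM's chosen candidate index is also the top BM25 hit."""
    return BM25(candidates).rank(query)[0] == llm_choice


candidates = [
    "a battery electrode made of lithium iron phosphate",
    "a method for brewing coffee with a paper filter",
    "a bicycle frame made of carbon fiber",
]
query = "lithium battery electrode coating"
print(guardrail_agrees(query, candidates, llm_choice=0))
```

When the LLM's pick and the lexical ranking disagree, one option is to route the case to human review instead of trusting either signal on its own.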

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/AI-for-Science/MoZi
https://huggingface.co/datasets/BNNT/IPQuiz
https://huggingface.co/datasets/BNNT/IPQA
https://huggingface.co/datasets/BNNT/PatentMatch
https://huggingface.co/datasets/BNNT/mozi_general_instructions_3m
https://huggingface.co/datasets/BNNT/mozi_IP_instructions

Data URLs

https://huggingface.co/datasets/BNNT/IPQuiz
https://huggingface.co/datasets/BNNT/IPQA
https://huggingface.co/datasets/BNNT/PatentMatch

Risks & Boundaries

Limitations

The initial MoZIP release mainly covers 7–9 major languages; low-resource languages are barely represented.

PatentMatch uses abstracts and long inputs (>1,000 tokens), which many LLMs handle poorly.
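One crude way to cope with the long-input limitation is to check prompt length before calling a model and to trim candidates evenly when the prompt is too long. The sketch below assumes whitespace-separated words as a stand-in for real tokens and a hypothetical budget of 1,000. Even truncation is a technique swapped in here for illustration and is not the method used in the paper.

```python
def approx_tokens(text):
    # Whitespace word count as a rough proxy; real tokenizers will differ.
    return len(text.split())


def fit_to_budget(query, candidates, budget=1000):
    """Trim each candidate to an equal share of the budget left after the query."""
    if not candidates:
        raise ValueError("need at least one candidate")
    remaining = budget - approx_tokens(query)
    per_candidate = remaining // len(candidates)
    if per_candidate < 1:
        raise ValueError("budget too small for the query and candidates")
    # Keep only the first per_candidate words of each candidate.
    return [" ".join(c.split()[:per_candidate]) for c in candidates]


query = " ".join(["claim"] * 10)
candidates = [" ".join(["word"] * 500) for _ in range(4)]
trimmed = fit_to_budget(query, candidates, budget=1000)
total = approx_tokens(query) + sum(approx_tokens(c) for c in trimmed)
print(total)
```

Truncation discards text that may carry the distinguishing detail, so treat a trimmed comparison as lower confidence than a full one.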

When Not To Use

For direct, unverified legal advice or final IP decisions: the models are not sufficiently accurate.

As a single-source production classifier for patent similarity without additional retrieval or human review.

Failure Modes

Hallucinated or incorrect legal facts in generated answers.

Poor performance on long patent texts leading to wrong match choices.

Core Entities

Models

MoZi-7b, BLOOMZ-7b, BELLE-7b, ChatGLM-6b, ChatGPT

Metrics

Accuracy, inter-annotator agreement

Datasets

MoZIP, IPQuiz, IPQA, PatentMatch, IPFAQ, IPACT, CNIPA patent crawl, WIPO patent data

Benchmarks

MoZIP
