MoZIP: a 3-part multilingual benchmark plus an IP-tuned 7B model to test how well LLMs handle patent and IP tasks

February 26, 2024 · 6 min

Overview

Production Readiness

0.3

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

3

Authors

Shiwen Ni, Minghuan Tan, Yuelin Bai, Fuqiang Niu, Min Yang, Bowen Zhang, Ruifeng Xu, Xiaojun Chen, Chengming Li, Xiping Hu, Ye Li, Jianping Fan

Why It Matters For Business

IP tasks need factual, language-specific understanding. MoZIP and MoZi show that domain tuning helps, but general LLMs still miss facts, so verify model outputs before relying on them for legal or IP decisions.

Summary TLDR

This paper introduces MoZIP, a multilingual benchmark for intellectual property (IP) tasks (IPQuiz, IPQA, PatentMatch), and MoZi-7b, a BLOOMZ-based model finetuned on IP data. MoZi improves substantially over its BLOOMZ-7b base (+10.1 percentage points on the IPQuiz average) but lags behind ChatGPT. Overall accuracy is low (IPQuiz average: MoZi 39.4%, ChatGPT 49.6%), showing that current LLMs still struggle with IP knowledge and long patent texts.

Problem Statement

There is no standardized multilingual benchmark to measure LLM performance on intellectual property tasks. Off-the-shelf LLMs lack reliable IP knowledge and struggle to compare long patent texts. Practitioners need a focused dataset and a domain-tuned model to quantify gaps and guide improvements.

Main Contribution

MoZIP benchmark: three datasets (IPQuiz, IPQA, PatentMatch) covering seven to nine languages and three task types.

MoZi-7b: a BLOOMZ-MT-7B model further trained on 24M patents, 3M general instructions, and ~59k IP-specific instructions.

Baseline evaluation of five LLMs (MoZi, BLOOMZ-7b, BELLE-7b, ChatGLM-6b, ChatGPT) showing large room for improvement in IP tasks.

Released code, data, and model checkpoints to the public.

Key Findings

Domain tuning on patents and IP instructions raises performance versus the BLOOMZ-7b base.

Numbers: IPQuiz average: MoZi 39.4% vs BLOOMZ-7b 29.3% (+10.1 pp)

Even the best evaluated model (ChatGPT) falls short of reliable passing performance on IP tasks.

Numbers: IPQuiz-en: ChatGPT 60.8%; IPQuiz average: ChatGPT 49.6%

Matching patents by meaning is hard for current LLMs, especially over long inputs.

Numbers: PatentMatch average: MoZi 27.4%, ChatGPT 38.8%; PatentMatch inputs exceed 1,000 tokens

Human evaluation on IPQA is reasonably consistent.

Numbers: Inter-annotator agreement (tie-discounted) = 81%

Results

Accuracy (IPQuiz average, MoZi)

Value: 39.4%

Baseline: ChatGPT 49.6%

Accuracy (PatentMatch average, MoZi)

Value: 27.4%

Baseline: ChatGPT 38.8%

IPQA human comparison wins (MoZi vs BLOOMZ-7b)

Value: MoZi won 88, lost 3, tied 0 (out of comparisons)

Baseline: BLOOMZ-7b
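
The accuracy figures and percentage-point gaps in this summary can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names `accuracy` and `pp_gap` are hypothetical, and it assumes multiple-choice items scored by exact match on the answer letter, which may differ from the paper's actual scoring script.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that exactly match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one item to score")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def pp_gap(score, baseline):
    """Difference between two percentages, in percentage points."""
    return round(score - baseline, 1)


# Toy example (made-up answers, not benchmark data): 3 of 4 correct.
print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # 75.0

# Gaps computed from the figures quoted in this summary.
print(pp_gap(39.4, 29.3))  # MoZi vs BLOOMZ-7b, IPQuiz average: 10.1
print(pp_gap(39.4, 49.6))  # MoZi vs ChatGPT, IPQuiz average: -10.2
print(pp_gap(27.4, 38.8))  # MoZi vs ChatGPT, PatentMatch average: -11.4
```

Reporting gaps in percentage points rather than relative percent keeps comparisons across models and datasets on the same scale.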

Who Should Care

What To Try In 7 Days

Run your IP prompts against MoZi (or a domain-finetuned model) and compare answers to a general LLM to spot differences.

Use the IPQuiz subset as a quick internal test to measure baseline IP knowledge in your models.

Add simple retrieval (BM25) and embeddings as a guardrail before asking an LLM to compare patents.
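
As a rough illustration of the retrieval suggestion above, here is a minimal pure-Python Okapi BM25 ranker that shortlists candidate patents before any LLM comparison. It is a sketch under stated assumptions, not part of MoZIP or MoZi: the function names, the crude regex tokenizer, the `k1` and `b` defaults, and the sample texts are all made up for illustration, and the embedding step is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so empty documents do not cause a division by zero.
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up abstracts for illustration only.
query_patent = "A lithium battery electrode coated with a ceramic separator layer"
candidates = [
    "Method for brewing coffee using a pressurized capsule",
    "Battery electrode with a ceramic coated separator for lithium cells",
    "Wireless charging pad for mobile telephones",
    "Bicycle frame made of carbon fibre composite",
]
ranked = bm25_rank(query_patent, candidates)
shortlist = [idx for idx, score in ranked if score > 0][:2]
print(shortlist)  # indices of candidates worth passing to the LLM
```

Candidates with a zero score share no terms with the query and can be dropped before the LLM call. For real patent text, a proper tokenizer and a tested BM25 library would likely be a better choice than this hand-rolled version.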

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • MoZIP initial release covers mainly 7–9 major languages; low-resource languages remain scarce.
  • PatentMatch uses abstracts and long inputs (>1,000 tokens), which many LLMs handle poorly.
  • MoZi training details include proprietary instruction mixes and generated conversations, which may bias behavior.

When Not To Use

  • For direct, unverified legal advice or final IP decisions: the models are not accurate enough.
  • As a single-source production classifier for patent similarity without additional retrieval or human review.
  • For low-resource language IP tasks not covered in the dataset.

Failure Modes

  • Hallucinated or incorrect legal facts in generated answers.
  • Poor performance on long patent texts leading to wrong match choices.
  • Overfitting to instruction formats used in finetuning; weaker generalization to unseen IP styles.

Core Entities

Models

  • MoZi-7b
  • BLOOMZ-7b
  • BELLE-7b
  • ChatGLM-6b
  • ChatGPT

Metrics

  • Accuracy
  • inter-annotator agreement

Datasets

  • MoZIP
  • IPQuiz
  • IPQA
  • PatentMatch
  • IPFAQ
  • IPACT
  • CNIPA patent crawl
  • WIPO patent data

Benchmarks

  • MoZIP