Trainable watermarking that injects more bits, preserves meaning, and resists removal

October 18, 2023 · 7 min

Overview

Production Readiness

0.75

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

9

Authors

Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, Farinaz Koushanfar


Why It Matters For Business

A practical watermarking layer lets API owners tag model outputs with recoverable signatures to prove origin, deter plagiarism, and monitor misuse without breaking text quality or adding large latency.

Summary TLDR

REMARK-LLM is a learned watermarking pipeline that embeds binary signatures into LLM outputs while keeping text meaning and readability. It combines a Seq2Seq message encoder, a Gumbel-Softmax reparameterizer that produces sparse token distributions, and a transformer-based decoder that extracts signatures. On benchmark datasets, the method encodes roughly 2× more bits than prior neural baselines, keeps BERTScore near 0.90, runs in about 1.2 s per 80-token segment, and retains strong statistical evidence (z-score ≈ 7.12 for 640 tokens) under editing and paraphrase attacks.

Problem Statement

LLM outputs are valuable IP but easy to reuse or plagiarize. Existing watermarks either break semantics (inference-time green/red lists) or have limited capacity (prior neural schemes). Text is sparse and fragile: it offers few embedding positions, and small edits or rephrases can remove marks. We need a watermark that (1) fits more bits, (2) keeps semantics, and (3) is efficient and robust to removal and detection attacks.

Main Contribution

A trainable three-module watermark pipeline: message encoding, reparameterization (Gumbel-Softmax), and message decoding.

An optimized beam-search inference that balances readability against extraction accuracy.

Training with simulated malicious edits (add/delete/replace) to improve robustness and transferability to unseen LLMs and datasets.
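The reparameterization step can be illustrated with a minimal sketch of Gumbel-Softmax sampling. It uses plain Python lists in place of tensors, and the function name, logits, and temperatures are hypothetical choices for illustration, not the paper's implementation. The point it shows is that a low temperature pushes the sampled distribution toward a near one-hot (sparse) vector over the vocabulary while keeping the operation differentiable in a real framework.

```python
import math
import random


def gumbel_softmax(logits, temperature=0.5, rng=random):
    """Return a relaxed, near one-hot sample over the vocabulary."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U in (0, 1); clamp U to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a 4-token vocabulary (illustration only).
logits = [2.0, 0.5, 0.1, -1.0]

# Same seed, so both calls see the same noise and differ only in temperature.
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))
smooth = gumbel_softmax(logits, temperature=5.0, rng=random.Random(0))

print("low temperature :", [round(p, 3) for p in sharp])
print("high temperature:", [round(p, 3) for p in smooth])
```

With identical noise, the low-temperature sample is never less peaked than the high-temperature one, which is why temperature choice matters for keeping the output close to a discrete token selection.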

Key Findings

REMARK-LLM embeds more signature bits per text than prior neural watermarking.

Numbers: ≈2× more bits than AWT on evaluated segments

Watermarked text preserves semantic quality.

Numbers: average BERT-S ≈ 0.90 on evaluated datasets

Robustness under realistic removal attacks is high.

Numbers: average AUC ≈ 0.85 under edit/rephrase/removal attacks

Strong statistical evidence for long texts.

Numbers: z-score ≈ 7.12 for 640 tokens (p ≈ 5.4×10⁻¹³)

Insertion is practical in time and memory.

Numbers: ≈1.21 s and 5.83 GB GPU memory to embed 8 bits into 80 tokens

Non-watermarked texts do not falsely decode messages.

Numbers: WER ≈ 50% on non-watermarked texts, which is chance level for binary messages
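To make the WER figures concrete, here is a minimal sketch that assumes WER is the fraction of signature bits the decoder recovers correctly. The function name and the bit strings are hypothetical and used only for illustration. Under that assumption, a decoder run on unmarked text should agree with any fixed signature on about half the bits.

```python
def extraction_rate(embedded_bits, decoded_bits):
    """Fraction of signature bits recovered correctly (assumed WER definition)."""
    if len(embedded_bits) != len(decoded_bits):
        raise ValueError("signatures must have the same length")
    if not embedded_bits:
        raise ValueError("signature must not be empty")
    matches = sum(a == b for a, b in zip(embedded_bits, decoded_bits))
    return matches / len(embedded_bits)


# Hypothetical 8-bit signature and two decoder outputs (illustration only).
signature = [1, 0, 1, 1, 0, 0, 1, 0]
clean_decode = [1, 0, 1, 1, 0, 0, 1, 0]  # every bit recovered
noisy_decode = [1, 0, 0, 1, 0, 1, 1, 0]  # two bits flipped by an edit

print(extraction_rate(signature, clean_decode))  # 1.0
print(extraction_rate(signature, noisy_decode))  # 0.75
```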

Results

Embed capacity vs prior neural watermarking

Value: ≈2× more bits per segment than AWT in experiments

Baseline: AWT

Semantic fidelity (BERT-S)

Value: ≈0.90 average BERT-S

Baseline: unaltered text

Robustness under removal attacks (AUC)

Value: ≈0.85 average AUC after attacks

Baseline: compared against AWT, KGW, and EXP

Statistical strength (z-score)

Value: ≈7.12 for 640 tokens

Baseline: a z-score of 4 is the threshold used for a strong watermark

Insertion latency and memory

Value: ≈1.21 s and 5.83 GB GPU memory to embed 8 bits into 80 tokens

Baseline: KGW and EXP need more time and memory
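The z-score figures in this section can be read as one-sided p-values under a standard normal null hypothesis. The sketch below shows that conversion; the function name is hypothetical, and it assumes a one-sided test, which is consistent with the p ≈ 5.4×10⁻¹³ reported for z ≈ 7.12.

```python
import math


def one_sided_p_value(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_long = one_sided_p_value(7.12)      # reported z-score for 640 tokens
p_threshold = one_sided_p_value(4.0)  # the z = 4 threshold for a strong watermark

print(f"z = 7.12 -> p = {p_long:.2e}")
print(f"z = 4.00 -> p = {p_threshold:.2e}")
```

The gap between the two values shows how far the 640-token result sits beyond the z = 4 threshold.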

Who Should Care

What To Try In 7 Days

Run REMARK-LLM on a small subset of your API outputs and measure BERTScore and WER.

Simulate paraphrase and edit attacks (T5-based) to check signature robustness.

Compare insertion latency and GPU memory against any existing token-filtering watermark in your stack.
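Before wiring up a paraphrase model, a quick robustness check can use random token edits of the add/delete/replace kind mentioned above. The following is a rough sketch under stated assumptions: the function name, the 10% edit rate, and the whitespace tokenization are all hypothetical and not taken from the paper.

```python
import random


def simulate_edit_attack(tokens, edit_rate=0.1, vocabulary=None, seed=0):
    """Randomly add, delete, or replace tokens at roughly `edit_rate` of positions."""
    rng = random.Random(seed)
    # Fall back to the text's own tokens as the pool for inserted/replaced words.
    vocabulary = vocabulary or sorted(set(tokens))
    attacked = []
    for token in tokens:
        if rng.random() >= edit_rate:
            attacked.append(token)  # position left untouched
            continue
        operation = rng.choice(("add", "delete", "replace"))
        if operation == "add":
            attacked.extend([token, rng.choice(vocabulary)])
        elif operation == "replace":
            attacked.append(rng.choice(vocabulary))
        # "delete" appends nothing, which drops the token.
    return attacked


# Hypothetical watermarked segment, split on whitespace for illustration.
watermarked = "the model output carries a hidden signature in its word choices".split()
attacked = simulate_edit_attack(watermarked, edit_rate=0.3, seed=42)
print(" ".join(attacked))
```

Decoding the attacked text and comparing the extracted bits with the original signature then gives a first estimate of how much extraction accuracy drops as the edit rate grows.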

Optimization Features

Inference Optimization

  • Accuracy

Reproducibility

Data Urls

  • HC3 (referenced)
  • WikiText-2 (referenced)
  • ChatGPT Abstract (referenced)
  • Human Abstract (referenced)
  • Alpaca prompts (referenced)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires inserting watermarks before delivering responses; not usable if you cannot modify the output stream.
  • Assumes watermarking model and keys remain private to the provider.
  • Evaluations focus on natural-language datasets; domain texts (code, medical) may need extra tuning.
  • Human-heavy rewriting attacks (manual edits) are not fully evaluated.

When Not To Use

  • If you cannot modify model outputs or add a post-processing step.
  • If you need absolute, human-verifiable forensic marks instead of statistical proof.
  • When the adversary has white-box access to the watermark model.

Failure Modes

  • Aggressive re-watermarking and heavy paraphrasing reduce AUC and extraction accuracy.
  • Higher embedding capacity increases semantic distortion if hyperparameters favor message loss.
  • Extreme temperature or masking choices during training can break one-hot reparameterization and reduce WER.

Core Entities

Models

  • T5-small
  • T5-base
  • T5-large
  • OPT-2.7B
  • LLaMA-2-7B
  • OpenOrca-7B
  • GPT-3.5 Turbo
  • GPT-4
  • AWT
  • KGW
  • EXP
  • CATER

Metrics

  • Watermark Extraction Rate (WER)
  • BERT-S (BERTScore)
  • BLEU-4
  • AUC
  • z-score
  • insertion time (s)
  • GPU memory (GB)

Datasets

  • HC3
  • WikiText-2
  • ChatGPT Abstract
  • Human Abstract
  • Alpaca (2k prompts)

Benchmarks

  • Segment-level watermarking (80 tokens)
  • Long-sequence watermarking (640 tokens)