ProUtt: LLM-driven synthesis of preference-labelled intent reasoning to predict users' next utterance

December 24, 2025 · 7 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent

Links

Abstract / PDF

Why It Matters For Business

ProUtt lets teams train compact, privacy-preserving models that predict likely user replies by learning intent-level reasoning, improving selection-based UX while avoiding cloud API costs or massive LLM deployments.

Summary TLDR

ProUtt is a data-synthesis method that uses an LLM to convert multi-turn dialogues into hierarchical user intent trees, then generates paired “preferred” and “non-preferred” intent-reasoning traces and next-utterance candidates. Fine-tuning small models (e.g., Qwen3-8B) on ProUtt data improves proactive next-utterance prediction vs larger prompt-based or simulator baselines. Evaluations use LLM-as-a-judge, embedding similarity, and human pairwise judgments. Code and larger synthesized datasets are released.

Problem Statement

Predicting a user's next utterance ahead of time can improve UX, but cloud APIs raise privacy concerns and large LLMs are costly to deploy. Existing user simulators mimic surface style and lack explicit intent reasoning. The problem is to synthesize training data that teaches small, privacy-friendly models to predict the next user utterance at the intent level rather than copy surface wording.

Main Contribution

ProUtt: a pipeline that converts dialogue history into a hierarchical user intent tree and synthesizes paired preferred/non-preferred reasoning traces for next-utterance prediction.

Empirical demonstration that training small LLMs on ProUtt-synthesized preference data (SFT + alignment) improves next-utterance prediction on four datasets versus simulators, other synthesis methods, and larger prompt-based LLMs.

Public release of the synthesis code and larger ProUtt datasets (LMSYS-ProUtt-10K and CrossWOZ-ProUtt-5K) to enable reproduction and further research.
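To make the hierarchical intent tree in the first contribution concrete, here is a minimal sketch of one possible representation. The class name `IntentNode`, the helper `leaf_paths`, and the example intents are all hypothetical; the released ProUtt code may structure its trees differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IntentNode:
    """One node of a hypothetical user intent tree (not the authors' schema)."""
    label: str
    children: List["IntentNode"] = field(default_factory=list)

    def add(self, label: str) -> "IntentNode":
        """Attach a finer-grained sub-intent and return it."""
        child = IntentNode(label)
        self.children.append(child)
        return child


def leaf_paths(node: IntentNode, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every root-to-leaf path, i.e. each fully refined intent."""
    path = prefix + (node.label,)
    if not node.children:
        return [path]
    paths: List[Tuple[str, ...]] = []
    for child in node.children:
        paths.extend(leaf_paths(child, path))
    return paths


# Illustrative dialogue intents (made up for this sketch).
root = IntentNode("plan a trip to Beijing")
hotel = root.add("find a hotel")
hotel.add("compare prices")
root.add("find restaurants")

for path in leaf_paths(root):
    print(" > ".join(path))
```

Each root-to-leaf path is a candidate direction the user could take next, which is the kind of intent-level signal the synthesized reasoning traces are meant to capture.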

Key Findings

ProUtt improves small-model performance over the unfine-tuned Qwen3-8B backbone on four test sets under LLM-judge scoring.

Numbers: Gains over Qwen3-8B in absolute LLM-judge points (DPO-aligned model): LMSYS +15.28, ShareGPT +6.26, WildChat +13.66, CrossWOZ +13.52

ProUtt beats the strongest data-synthesis baseline by modest margins on all datasets under LLM-judge evaluation.

Numbers: Relative gains vs strongest baseline: LMSYS +4.88%, ShareGPT +3.16%, WildChat +2.46%, CrossWOZ +5.12%

Human and LLM pairwise judgments agree strongly when comparing methods trained with ProUtt.

Numbers: Agreement rate >80%; Cohen's κ generally >0.6, and >0.8 on CrossWOZ

The explicit intent-tree module is the most impactful component in ablations for open-domain data.

Numbers: SFT LMSYS: full ProUtt 58.10 → w/o intent tree 50.80 (−7.30 pts); DPO LMSYS: 60.98 → 52.30 (−8.68 pts)
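The human-versus-LLM agreement finding above reports two statistics: a raw agreement rate and Cohen's κ, which corrects that rate for agreement expected by chance. The sketch below computes both from first principles; the win/tie/loss labels are invented for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of items on which the two raters give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    p_o = agreement_rate(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up pairwise verdicts for ten comparisons.
human = ["win", "win", "tie", "loss", "win", "loss", "win", "tie", "win", "loss"]
judge = ["win", "win", "tie", "loss", "win", "loss", "tie", "tie", "win", "win"]

print(f"agreement={agreement_rate(human, judge):.2f}, kappa={cohen_kappa(human, judge):.2f}")
```

With these toy labels the raters agree on 8 of 10 items, yet κ is noticeably lower than 0.8 because some of that agreement would occur by chance. This is why the two figures are reported separately.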

Results

SFT (LLM-Judge)

Value: LMSYS 58.10%; ShareGPT 50.12%; WildChat 50.80%; CrossWOZ 45.98%

Baseline: Qwen3-8B (prompted): LMSYS 45.70%; ShareGPT 46.40%; WildChat 38.50%; CrossWOZ 37.70%

DPO (LLM-Judge)

Value: LMSYS 60.98%; ShareGPT 52.66%; WildChat 52.16%; CrossWOZ 51.22%

Baseline: Best baseline (varies); see Table I

SFT (Embed-Sim)

Value: LMSYS 76.30%; ShareGPT 69.08%; WildChat 73.22%; CrossWOZ 53.59%

Baseline: Qwen3-8B: LMSYS 69.90%; ShareGPT 67.06%; WildChat 64.10%; CrossWOZ 53.91%
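The improvements over Qwen3-8B quoted under Key Findings can be recomputed from the LLM-judge values listed here. The short check below uses the row whose LMSYS value is 60.98 (the figure Key Findings attributes to DPO) and the prompted Qwen3-8B baseline. It shows that the quoted gains are absolute differences in points, not relative percentages. The variable names are arbitrary.

```python
# LLM-judge scores copied from the Results values (in percent).
baseline = {"LMSYS": 45.70, "ShareGPT": 46.40, "WildChat": 38.50, "CrossWOZ": 37.70}
prout_dpo = {"LMSYS": 60.98, "ShareGPT": 52.66, "WildChat": 52.16, "CrossWOZ": 51.22}

# Absolute gain in points for each test set.
gains = {name: round(prout_dpo[name] - baseline[name], 2) for name in baseline}

# Relative gain, for comparison with the absolute figures.
relative = {name: 100.0 * (prout_dpo[name] - baseline[name]) / baseline[name] for name in baseline}

for name in baseline:
    print(f"{name}: +{gains[name]:.2f} points ({relative[name]:.1f}% relative)")
```

The absolute differences match the +15.28, +6.26, +13.66, and +13.52 figures, so those numbers are best read as point gains.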

Who Should Care

What To Try In 7 Days

Run ProUtt code on a small sample of your chat logs to generate intent-tree preference pairs.

Fine-tune a 4–8B model with LoRA on the 2K ProUtt dataset, then evaluate it with an LLM-judge and a 100-case human pairwise check.

Use DPO alignment after SFT (authors recommend DPO) to amplify preference signals quickly.
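For readers unfamiliar with the DPO step, the sketch below evaluates the standard Direct Preference Optimization loss for a single preferred/non-preferred pair. It is a plain-Python illustration of the published objective, not the authors' training code. The log-probability inputs and the `beta` value are hypothetical placeholders, and in practice a training library would compute this over batches of token log-probabilities.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference (SFT) model.
    """
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities for one preference pair.
untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy equals reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)    # policy now favors the preferred trace

print(f"untrained={untrained:.4f}, improved={improved:.4f}")
```

The loss starts at log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the preferred trace, which is how the preference signal gets amplified after SFT.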

Optimization Features

Training Optimization

  • LoRA
  • SFT

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Cross-domain generalization is weaker: gains on held-out datasets like ShareGPT are smaller.
  • Relies on reliable intent-tree extraction; noisy or very short dialogues reduce benefit.
  • Method encodes intent as trees; graphs or richer relations might be needed for complex dialogues.

When Not To Use

  • You need highly personalized, fine-grained user behavior modeling from single short turns.
  • You cannot run an LLM (even a medium backbone) to synthesize intent trees locally or via trusted infrastructure.
  • You need strict low-latency inference and fine-tuning a model is infeasible.

Failure Modes

  • Threshold misconfiguration (τ_high/τ_low) blurs positive/negative labels and harms learning.
  • Overfitting to tree-derived intents if real user behavior is more surface- or style-driven.
  • Judge bias: reliance on LLM-as-a-judge can inherit its scoring blind spots.
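The threshold failure mode can be illustrated with a small sketch. This summary does not say what quantity τ_high and τ_low are applied to, so the code assumes each candidate carries a score in [0, 1] (for example, similarity to the real next utterance). Under that assumption, candidates at or above `tau_high` become preferred, candidates at or below `tau_low` become non-preferred, and everything in between is discarded. The function names, default thresholds, and example scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def label_candidates(scored: List[Tuple[str, float]],
                     tau_high: float = 0.8,
                     tau_low: float = 0.4) -> Tuple[List[str], List[str]]:
    """Split scored candidates into preferred and non-preferred sets.

    Assumption for this sketch: higher score means closer to the true next utterance.
    """
    if tau_low >= tau_high:
        # Overlapping thresholds would let one candidate be both positive and negative.
        raise ValueError("tau_low must be strictly below tau_high")
    preferred = [text for text, score in scored if score >= tau_high]
    rejected = [text for text, score in scored if score <= tau_low]
    return preferred, rejected


def build_pairs(scored: List[Tuple[str, float]],
                tau_high: float = 0.8,
                tau_low: float = 0.4) -> List[Dict[str, str]]:
    """Form every (chosen, rejected) combination for preference training."""
    preferred, rejected = label_candidates(scored, tau_high, tau_low)
    return [{"chosen": p, "rejected": r} for p in preferred for r in rejected]


# Made-up candidates and scores.
scored = [
    ("Can you compare prices?", 0.91),
    ("What about breakfast?", 0.60),  # falls in the gap and is dropped
    ("Thanks, bye.", 0.22),
]
pairs = build_pairs(scored)
print(pairs)
```

Narrowing the gap between the two thresholds admits borderline candidates on both sides, which is the label blurring this failure mode describes.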

Core Entities

Models

  • Qwen3-8B
  • Doubao-1.5-Pro
  • Qwen3-Max
  • Doubao-Seed-1.6
  • DeepSeek-V3.2-Exp
  • GLM-4.6
  • Socratic
  • USP
  • LLaMA3.1-8B

Metrics

  • LLM-Judge
  • Embed-Sim
  • Win-Tie-Loss (pairwise)

Datasets

  • LMSYS
  • CrossWOZ
  • ShareGPT
  • WildChat
  • LMSYS-ProUtt-2K
  • LMSYS-ProUtt-10K
  • CrossWOZ-ProUtt-2K
  • CrossWOZ-ProUtt-5K

Context Entities

Models

  • Qwen3-Plus
  • GLM-4.5-Air