ProUtt: LLM-driven synthesis of preference-labelled intent reasoning to predict users' next utterance

December 24, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

ProUtt is a practical synthesis recipe: build intent trees, generate preferred/non-preferred reasoning, and use DPO-aligned fine-tuning to get compact models that predict next utterances with better intent consistency.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 60%

Authors

Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent

Links

Abstract / PDF / Code / Data

Why It Matters For Business

ProUtt lets teams train compact, privacy-preserving models that predict likely user replies by learning intent-level reasoning, improving selection-based UX while avoiding cloud API costs or massive LLM deployments.

Who Should Care

Summary TLDR

ProUtt is a data-synthesis method that uses an LLM to convert multi-turn dialogues into hierarchical user intent trees, then generates paired “preferred” and “non-preferred” intent-reasoning traces and next-utterance candidates. Fine-tuning small models (e.g., Qwen3-8B) on ProUtt data improves proactive next-utterance prediction vs larger prompt-based or simulator baselines. Evaluations use LLM-as-a-judge, embedding similarity, and human pairwise judgments. Code and larger synthesized datasets are released.
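Two of the evaluation signals named above, embedding similarity and pairwise win/tie/loss judgments, reduce to simple computations once embeddings and judgments exist. The following is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, the vectors stand in for embeddings from whatever encoder is used, and cosine similarity is assumed as the similarity measure.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed Embed-Sim measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors; an assumption of this sketch
    return dot / (norm_a * norm_b)


def win_tie_loss(judgments):
    """Turn a list of 'win' / 'tie' / 'loss' labels into rates."""
    counts = Counter(judgments)
    total = len(judgments)
    if total == 0:
        return {"win": 0.0, "tie": 0.0, "loss": 0.0}
    return {key: counts.get(key, 0) / total for key in ("win", "tie", "loss")}


# Toy usage with made-up vectors and judgments.
similarity = cosine_similarity([1.0, 0.0], [1.0, 0.0])
rates = win_tie_loss(["win", "win", "tie", "loss"])
```

The LLM-as-a-judge score is not sketched here because it depends on a judge prompt and model that this summary does not specify.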

Problem Statement

Predicting the user's next utterance proactively can improve UX, but cloud APIs raise privacy concerns and large LLMs are costly to deploy. Existing user simulators mimic surface style and lack explicit intent reasoning. The problem: produce training data that lets compact, privacy-friendly models predict the next user utterance at the intent level rather than copying surface wording.

Main Contribution

ProUtt: a pipeline that converts dialogue history into a hierarchical user intent tree and synthesizes paired preferred/non-preferred reasoning traces for next-utterance prediction.

Empirical demonstration that training small LLMs on ProUtt-synthesized preference data (SFT + alignment) improves next-utterance prediction on four datasets versus simulators, other synthesis methods, and larger prompt-based LLMs.
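To make the two artifacts in the pipeline concrete, here is a minimal sketch of how a hierarchical intent tree and a preferred/non-preferred pair might be represented. The class names, field names, and example intents are hypothetical; the actual schema used by ProUtt is not described in this summary. The `prompt` / `chosen` / `rejected` layout is assumed only because it is a common convention for preference data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntentNode:
    """One node of a hierarchical user intent tree (hypothetical schema)."""
    label: str
    children: List["IntentNode"] = field(default_factory=list)

    def add(self, label: str) -> "IntentNode":
        child = IntentNode(label)
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield every root-to-leaf path as a tuple of intent labels."""
        current = prefix + (self.label,)
        if not self.children:
            yield current
        for child in self.children:
            yield from child.paths(current)


@dataclass
class PreferencePair:
    """A preferred and a non-preferred reasoning trace for the same history."""
    history: List[str]
    preferred_reasoning: str
    preferred_utterance: str
    rejected_reasoning: str
    rejected_utterance: str

    def to_record(self) -> dict:
        # Assumed layout: reasoning followed by the predicted utterance.
        return {
            "prompt": "\n".join(self.history),
            "chosen": f"{self.preferred_reasoning}\n{self.preferred_utterance}",
            "rejected": f"{self.rejected_reasoning}\n{self.rejected_utterance}",
        }


# Toy example with invented intents.
root = IntentNode("plan a trip")
booking = root.add("book transport")
booking.add("compare train times")
root.add("find a hotel")
leaf_paths = list(root.paths())
```

Enumerating root-to-leaf paths is one plausible way to feed the tree into a reasoning prompt; other traversals would work equally well.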

Key Findings

ProUtt improves small-model performance over the unfine-tuned Qwen3-8B backbone on four test sets under LLM-judge scoring.

Numbers: Improvements vs Qwen3-8B (SFT + DPO, in percentage points): LMSYS +15.28, ShareGPT +6.26, WildChat +13.66, CrossWOZ +13.52

Practical Use: If you fine-tune a compact model with ProUtt data you can boost intent-level next-utterance accuracy substantially, especially on open-domain data.

Evidence Ref: Table I; Section IV.A pointwise evaluation

ProUtt beats the strongest data-synthesis baseline by modest margins on all datasets under LLM-judge evaluation.

Numbers: Relative gains vs strongest baseline: LMSYS +4.88%, ShareGPT +3.16%, WildChat +2.46%, CrossWOZ +5.12%

Practical Use: Switching to ProUtt-style preference + intent reasoning data gives a measurable advantage over existing synthesis pipelines.

Evidence Ref: Table I; Section IV.A

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| SFT | LMSYS 58.10%; ShareGPT 50.12%; WildChat 50.80%; CrossWOZ 45.98% | Qwen3-8B (prompted): LMSYS 45.70%; ShareGPT 46.40%; WildChat 38.50%; CrossWOZ 37.70% | Up to +12.40 pts vs backbone (LMSYS) | Pointwise test sets (100 samples each) | Table I pointwise SFT results | Table I |
| SFT + DPO | LMSYS 60.98%; ShareGPT 52.66%; WildChat 52.16%; CrossWOZ 51.22% | Best baseline (varies); see Table I | Up to +15.28 pts vs backbone (LMSYS); relative gains vs strongest baseline: 2.46–5.12% across datasets | Pointwise test sets | Table I SFT + DPO rows | Table I |
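The point deltas against the backbone can be recomputed directly from the scores reported above. This is a small arithmetic check using only those values; the function name is illustrative. The strongest-baseline scores are not listed in this summary, so the relative gains cannot be recomputed the same way.

```python
# Scores (in %) as reported in the results above.
backbone = {"LMSYS": 45.70, "ShareGPT": 46.40, "WildChat": 38.50, "CrossWOZ": 37.70}
sft = {"LMSYS": 58.10, "ShareGPT": 50.12, "WildChat": 50.80, "CrossWOZ": 45.98}
sft_dpo = {"LMSYS": 60.98, "ShareGPT": 52.66, "WildChat": 52.16, "CrossWOZ": 51.22}


def point_delta(scores, baseline):
    """Absolute difference in percentage points, per dataset."""
    return {name: round(scores[name] - baseline[name], 2) for name in scores}


sft_delta = point_delta(sft, backbone)
sft_dpo_delta = point_delta(sft_dpo, backbone)

print(sft_delta)      # {'LMSYS': 12.4, 'ShareGPT': 3.72, 'WildChat': 12.3, 'CrossWOZ': 8.28}
print(sft_dpo_delta)  # {'LMSYS': 15.28, 'ShareGPT': 6.26, 'WildChat': 13.66, 'CrossWOZ': 13.52}
```

The +15.28, +6.26, +13.66, and +13.52 figures quoted in Key Findings match the second set of scores (60.98% on LMSYS), while the first set gives +12.40 points on LMSYS.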

What To Try In 7 Days

Run ProUtt code on a small sample of your chat logs to generate intent-tree preference pairs.

Fine-tune a 4–8B model with LoRA using the 2K ProUtt dataset, then evaluate with an LLM-judge and a 100-case human pairwise check.

Use DPO alignment after SFT (authors recommend DPO) to amplify preference signals quickly.
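For intuition on what the DPO step optimizes, the standard DPO objective for a single preference pair can be written in a few lines. This is a sketch in plain Python rather than a training loop: the inputs are summed log-probabilities of the preferred and non-preferred responses under the policy and a frozen reference model, and `beta=0.1` is an assumed value, not one taken from the paper.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # how much the policy upweights the preferred response
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # how much it upweights the non-preferred response
    margin = beta * (chosen_gain - rejected_gain)
    return softplus(-margin)  # identical to -log(sigmoid(margin))


# When the policy equals the reference the loss is log(2);
# it drops as the policy favours the preferred response.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In practice this loss would be computed over batches by a training library, with the SFT checkpoint typically serving as the reference model.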

Optimization Features

Training Optimization
LoRA, SFT

Risks & Boundaries

Limitations

Cross-domain generalization is weaker: gains on held-out datasets like ShareGPT are smaller.

Relies on reliable intent-tree extraction; noisy or very short dialogues reduce benefit.

When Not To Use

You need highly personalized, fine-grained user behavior modeling from single short turns.

You cannot run an LLM (even a medium backbone) to synthesize intent trees locally or via trusted infrastructure.

Failure Modes

Threshold misconfiguration (τ_high/τ_low) blurs positive/negative labels and harms learning.

Overfitting to tree-derived intents if real user behavior is more surface- or style-driven.
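The threshold failure mode is easier to see with a toy labeling rule. This summary does not say which score τ_high and τ_low are applied to, so the sketch below assumes a generic quality score in [0, 1] per candidate; the function name, label strings, and example values are all hypothetical.

```python
def label_candidate(score, tau_high, tau_low):
    """Label a candidate by score; scores between the thresholds are discarded."""
    if tau_low >= tau_high:
        # Overlapping thresholds leave no buffer between the two classes.
        raise ValueError("tau_low must be strictly below tau_high")
    if score >= tau_high:
        return "preferred"
    if score <= tau_low:
        return "non-preferred"
    return "discard"


# Assumed example values: a wide gap keeps ambiguous candidates out of both classes.
scores = [0.92, 0.55, 0.10]
labels = [label_candidate(s, tau_high=0.8, tau_low=0.3) for s in scores]
print(labels)  # ['preferred', 'discard', 'non-preferred']
```

Under this assumption, moving the two thresholds closer together pushes borderline candidates into one class or the other, which is the label blurring described above.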

Core Entities

Models

Qwen3-8B, Doubao-1.5-Pro, Qwen3-Max, Doubao-Seed-1.6, DeepSeek-V3.2-Exp, GLM-4.6, Socratic, USP, LLaMA3.1-8B

Metrics

LLM-Judge, Embed-Sim, Win-Tie-Loss (pairwise)

Datasets

LMSYS, CrossWOZ, ShareGPT, WildChat, LMSYS-ProUtt-2K, LMSYS-ProUtt-10K, CrossWOZ-ProUtt-2K, CrossWOZ-ProUtt-5K

Context Entities

Models

Qwen3-Plus, GLM-4.5-Air