ProUtt: LLM-driven synthesis of preference-labelled intent reasoning to predict users' next utterance

Overview

Decision Snapshot: Needs Validation

ProUtt is a practical synthesis recipe: build intent trees, generate preferred/non-preferred reasoning, and use DPO-aligned fine-tuning to get compact models that predict next utterances with better intent consistency.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 65%

Novelty: 60%

Authors

Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent

Why It Matters For Business

ProUtt lets teams train compact, privacy-preserving models that predict likely user replies by learning intent-level reasoning, improving selection-based UX while avoiding cloud API costs or massive LLM deployments.

Who Should Care

ML Engineer, Product Manager, CTO, Data Scientist

Summary TLDR

ProUtt is a data-synthesis method that uses an LLM to convert multi-turn dialogues into hierarchical user intent trees, then generates paired “preferred” and “non-preferred” intent-reasoning traces and next-utterance candidates. Fine-tuning small models (e.g., Qwen3-8B) on ProUtt data improves proactive next-utterance prediction over larger prompt-based LLMs and user-simulator baselines. Evaluations use LLM-as-a-judge scoring, embedding similarity, and human pairwise judgments. Code is available, and the authors state that larger synthesized datasets (LMSYS-ProUtt-10K and CrossWOZ-ProUtt-5K) will be released.

Problem Statement

Proactively predicting the user's next utterance can improve UX, but cloud APIs raise privacy concerns and large LLMs are costly to deploy. Existing user simulators mimic surface style and lack explicit intent reasoning. The problem is to produce training data that teaches compact, privacy-friendly models to predict the next user utterance at the intent level rather than copy surface wording.

Main Contribution

ProUtt: a pipeline that converts dialogue history into a hierarchical user intent tree and synthesizes paired preferred/non-preferred reasoning traces for next-utterance prediction.

Empirical demonstration that training small LLMs on ProUtt-synthesized preference data (SFT + alignment) improves next-utterance prediction on four datasets versus simulators, other synthesis methods, and larger prompt-based LLMs.
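As a rough illustration of the kind of data this pipeline produces, the sketch below models a hierarchical intent tree and one preference record. The class names, field names, and the example dialogue are hypothetical and are not taken from the ProUtt code; the actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntentNode:
    """One node of a hierarchical user intent tree (hypothetical schema)."""
    label: str
    children: List["IntentNode"] = field(default_factory=list)

    def add(self, label: str) -> "IntentNode":
        """Attach a child intent and return it so calls can be chained."""
        child = IntentNode(label)
        self.children.append(child)
        return child

    def leaves(self) -> List[str]:
        """Return the labels of the leaf intents, left to right."""
        if not self.children:
            return [self.label]
        out: List[str] = []
        for child in self.children:
            out.extend(child.leaves())
        return out


@dataclass
class PreferencePair:
    """A preferred / non-preferred pair for one dialogue (hypothetical schema)."""
    history: List[str]
    chosen_reasoning: str
    chosen_utterance: str
    rejected_reasoning: str
    rejected_utterance: str


# Invented example: a coarse intent with finer-grained sub-intents.
root = IntentNode("plan a trip")
transport = root.add("book transport")
transport.add("compare train prices")
transport.add("check departure times")
root.add("find a hotel")

pair = PreferencePair(
    history=["User: I want to visit Beijing next month.",
             "Assistant: Trains and flights are both available."],
    chosen_reasoning="The user is still under 'book transport', so a price question is likely.",
    chosen_utterance="How much is the train compared to flying?",
    rejected_reasoning="Repeats the surface wording without following the intent tree.",
    rejected_utterance="I want to visit Beijing.",
)
```

Calling `root.leaves()` lists the fine-grained intents that a reasoning trace could walk through before proposing the next utterance.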

Key Findings

ProUtt improves small-model performance over the unfine-tuned Qwen3-8B backbone on four test sets under LLM-judge scoring.

Numbers: Gains vs Qwen3-8B in percentage points (SFT + DPO, LLM-judge): LMSYS +15.28, ShareGPT +6.26, WildChat +13.66, CrossWOZ +13.52

Practical Use: Fine-tuning a compact model on ProUtt data raises LLM-judge scores for intent-level next-utterance prediction by roughly 6 to 15 points, with the smallest gain on ShareGPT.

Evidence Ref: Table I; Section IV.A pointwise evaluation

ProUtt beats the strongest data-synthesis baseline by modest margins on all datasets under LLM-judge evaluation.

Numbers: Relative gains vs strongest baseline: LMSYS +4.88%, ShareGPT +3.16%, WildChat +2.46%, CrossWOZ +5.12%

Practical Use: Switching to ProUtt-style preference and intent-reasoning data gives a small but consistent advantage over existing synthesis pipelines.

Evidence Ref: Table I; Section IV.A

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LLM-Judge (SFT)	LMSYS 58.10%; ShareGPT 50.12%; WildChat 50.80%; CrossWOZ 45.98%	Qwen3-8B (prompted) LMSYS 45.70%; ShareGPT 46.40%; WildChat 38.50%; CrossWOZ 37.70%	Up to +12.40 pts vs backbone (LMSYS)	Pointwise test sets (100 samples each)	Table I pointwise SFT results	Table I
LLM-Judge (SFT + DPO)	LMSYS 60.98%; ShareGPT 52.66%; WildChat 52.16%; CrossWOZ 51.22%	Qwen3-8B (prompted), as above; strongest synthesis baseline varies, see Table I	Up to +15.28 pts vs backbone (LMSYS); relative gains vs strongest baseline: 2.46–5.12% across datasets	Pointwise test sets	Table I SFT + DPO rows	Table I
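The reported deltas can be checked with a few lines of arithmetic. The sketch below uses only the scores listed in the two Results rows (the second row is the one whose evidence cites the SFT + DPO rows of Table I); the helper names are illustrative. It suggests that the headline gains over the Qwen3-8B backbone are absolute percentage-point differences for the SFT + DPO model, and that the SFT-only gains are smaller.

```python
# LLM-judge scores copied from the Results rows (percent).
backbone = {"LMSYS": 45.70, "ShareGPT": 46.40, "WildChat": 38.50, "CrossWOZ": 37.70}
sft = {"LMSYS": 58.10, "ShareGPT": 50.12, "WildChat": 50.80, "CrossWOZ": 45.98}
sft_dpo = {"LMSYS": 60.98, "ShareGPT": 52.66, "WildChat": 52.16, "CrossWOZ": 51.22}


def point_gain(after, before):
    """Absolute gain in percentage points, per dataset."""
    return {name: round(after[name] - before[name], 2) for name in before}


def relative_gain(after, before):
    """Relative gain in percent of the baseline score, per dataset."""
    return {
        name: round(100.0 * (after[name] - before[name]) / before[name], 2)
        for name in before
    }


sft_points = point_gain(sft, backbone)
dpo_points = point_gain(sft_dpo, backbone)
dpo_relative = relative_gain(sft_dpo, backbone)

for name in backbone:
    print(f"{name}: SFT {sft_points[name]:+.2f} pts, SFT+DPO {dpo_points[name]:+.2f} pts")
```

Keeping absolute and relative gains in separate helpers avoids mixing the two conventions, which matters here because the gains over the strongest baseline are stated as relative percentages.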

What To Try In 7 Days

Run ProUtt code on a small sample of your chat logs to generate intent-tree preference pairs.

Fine-tune a 4–8B model with LoRA on the 2K ProUtt dataset, then evaluate with an LLM judge and a 100-case human pairwise check.

Apply DPO alignment after SFT (the authors recommend DPO) to strengthen the preference signal.
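For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preferred/non-preferred pair from summed sequence log-probabilities. It is a plain-Python illustration of the published objective, not the authors' training code; the function name and the default `beta=0.1` are assumptions chosen for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model. beta is an assumed value.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# When the policy matches the reference, the loss is log(2).
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# When the policy raises the preferred response and lowers the other, the loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In practice a training library would compute these log-probabilities in batches; the scalar version only shows why pairs with clearly separated preferred and non-preferred responses give a useful gradient.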

Optimization Features

Training Optimization

LoRA, SFT
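To give a feel for why LoRA keeps fine-tuning cheap, the sketch below compares the number of trainable parameters in one full weight matrix with the two low-rank factors LoRA trains instead. The layer size and rank are hypothetical and are not taken from the Qwen3-8B configuration or from the paper's settings.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Full fine-tuning updates a d_out x d_in matrix. LoRA freezes it and trains
    two factors, A (rank x d_in) and B (d_out x rank).
    """
    full = d_in * d_out
    lora = rank * d_in + d_out * rank
    return full, lora


# Hypothetical layer: 4096 x 4096 projection with rank-16 adapters.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
fraction = lora / full
print(f"full={full}, lora={lora}, fraction={fraction:.4%}")
```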

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/jqwangai/ProUtt

Data URLs

https://github.com/jqwangai/ProUtt (authors state LMSYS-ProUtt-10K and CrossWOZ-ProUtt-5K will be released)

Risks & Boundaries

Limitations

Cross-domain generalization is weaker: gains on held-out datasets such as ShareGPT are smaller (+6.26 points vs +15.28 on LMSYS under LLM-judge scoring).

Depends on reliable intent-tree extraction; noisy or very short dialogues reduce the benefit.

When Not To Use

You need highly personalized, fine-grained user behavior modeling from single short turns.

You cannot run an LLM (even a medium backbone) to synthesize intent trees locally or via trusted infrastructure.

Failure Modes

Threshold misconfiguration (τ_high/τ_low) blurs positive/negative labels and harms learning.

Overfitting to tree-derived intents if real user behavior is more surface- or style-driven.
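The threshold failure mode above can be pictured with a small labeling sketch. This summary does not say how candidates are scored or what values τ_high and τ_low take, so the scoring scale, the default thresholds, and the function name below are all assumptions made for illustration.

```python
def label_candidates(scored, tau_high=0.7, tau_low=0.3):
    """Split (text, score) candidates into preferred, non-preferred, and ambiguous.

    Assumed rule: score >= tau_high is preferred, score <= tau_low is
    non-preferred, and anything in between is set aside as ambiguous.
    """
    if tau_low >= tau_high:
        raise ValueError("tau_low must be strictly below tau_high")
    preferred = [text for text, score in scored if score >= tau_high]
    non_preferred = [text for text, score in scored if score <= tau_low]
    ambiguous = [text for text, score in scored if tau_low < score < tau_high]
    return preferred, non_preferred, ambiguous


# Invented scores on an assumed 0-1 scale.
candidates = [("asks about price", 0.9), ("asks about weather", 0.5), ("repeats request", 0.1)]
preferred, non_preferred, ambiguous = label_candidates(candidates)
```

Under this assumed rule, moving the two thresholds closer together shrinks the ambiguous band, so borderline candidates end up labeled as positive or negative, which is one way the labels could blur.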

Core Entities

Models

Qwen3-8B, Doubao-1.5-Pro, Qwen3-Max, Doubao-Seed-1.6, DeepSeek-V3.2-Exp, GLM-4.6, Socratic, USP, LLaMA3.1-8B

Metrics

LLM-Judge, Embed-Sim, Win-Tie-Loss (pairwise)
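Two of these metrics are simple enough to sketch. The example below computes cosine similarity between two embedding vectors, a common way to implement an embedding-similarity score, and tallies pairwise judgments into win/tie/loss counts. The embedding model, the exact similarity function, and the judgment labels used in the paper are not given in this summary, so the details here are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def win_tie_loss(judgments):
    """Count pairwise outcomes; each judgment is 'win', 'tie', or 'loss'."""
    allowed = ("win", "tie", "loss")
    for judgment in judgments:
        if judgment not in allowed:
            raise ValueError(f"unexpected judgment: {judgment!r}")
    counts = Counter(judgments)
    return {label: counts.get(label, 0) for label in allowed}


# Toy vectors standing in for embeddings of a predicted and a reference utterance.
similarity = cosine_similarity([3.0, 4.0], [4.0, 3.0])
# Invented human judgments for four comparisons.
tally = win_tie_loss(["win", "tie", "win", "loss"])
```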

Datasets

LMSYS, CrossWOZ, ShareGPT, WildChat, LMSYS-ProUtt-2K, LMSYS-ProUtt-10K, CrossWOZ-ProUtt-2K, CrossWOZ-ProUtt-5K

Context Entities

Models

Qwen3-Plus, GLM-4.5-Air
