Tune agents on short, focused conversation segments to improve multi-turn social behavior

January 3, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang

Why It Matters For Business

SDPO makes social agents more effective at multi-turn tasks by focusing training on short key segments, improving goal success and interpersonal outcomes with modest data costs and no RL loop.

Summary TLDR

SDPO (Segment-Level Direct Preference Optimization) is a training procedure that builds and trains on short, key segments of multi-turn social dialogues instead of single turns or whole sessions. By pairing equal-length positive and negative segments and applying a derived SDPO loss, the authors reduce training noise and obtain a principled multi-turn preference objective. On the interactive SOTOPIA benchmark, SDPO improves goal completion and relationship scores over single-turn DPO, session-level methods (ETO/DMPO), and several proprietary LLMs. The method uses GPT-4o to locate errors and pick segments, and the released SDPO dataset contains 1,019 segment pairs.

Problem Statement

Standard Direct Preference Optimization (DPO) optimizes single turns and cannot reliably shape multi-turn social behavior. Session-level DPO variants train on whole dialogues but are coarse: they label many correct turns as bad, which adds noise, and they cannot control length differences between positive and negative samples, which breaks their theoretical guarantees. This paper asks whether picking short, aligned segments can fix both the noise and the theory gap, and thereby better align agents for multi-turn social tasks.

Main Contribution

SDPO: a pipeline to construct segment-level positive/negative preference pairs from multi-turn dialogues.

A theoretical derivation showing equal-length segment selection removes the partition function Z and yields a concise SDPO loss.
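A sketch of the form such a loss could take, assuming it mirrors the standard DPO objective with per-turn log-ratios summed over the segment. The notation is an assumption for illustration, not copied from the paper: $h_t$ is the dialogue history before turn $t$, $y_t$ is the agent's utterance at turn $t$, $e$ is the first turn of the segment, $k+1$ is the shared segment length, and the superscripts $w$ and $l$ mark the positive and negative segments.

```latex
\mathcal{L}_{\mathrm{SDPO}}
  = -\,\mathbb{E}_{(h^{w},\,h^{l})\sim\mathcal{D}}
    \log \sigma\!\left(
      \beta \sum_{t=e}^{e+k}
        \log\frac{\pi_\theta(y_t^{w}\mid h_t^{w})}{\pi_{\mathrm{ref}}(y_t^{w}\mid h_t^{w})}
      \;-\;
      \beta \sum_{t=e}^{e+k}
        \log\frac{\pi_\theta(y_t^{l}\mid h_t^{l})}{\pi_{\mathrm{ref}}(y_t^{l}\mid h_t^{l})}
    \right)
```

Both sums run over the same number of turns. According to the summary, this matching is what allows the partition-function terms to cancel in the difference, so no Z appears in the final expression.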

Empirical validation on SOTOPIA showing consistent gains over DPO, ETO, DMPO, and some proprietary LLMs; plus a public dataset of 1,019 segment pairs.

Key Findings

SDPO improves goal and relationship scores over the behavioral cloning (BC) baseline on Llama-8B.

Numbers: Self-chat Goal +0.75, Relationship +0.64 (Table 1)

SDPO outperforms DPO and session-level methods on the tested benchmark.

Numbers: Average score: SDPO 5.63 vs DPO 5.34 and ETO 5.45 (Table 1)

SDPO generalizes across base models (Llama and Mistral).

Numbers: Mistral: Goal +0.59, Relationship +0.51 vs BC (Table 2)

Most automatically selected segments are short (length 3), and automatic selection beats fixed-length choices.

Numbers: 89% of segments have length 3; automatic selection gives the best results (C.1, Table 3)

Unequal-length segments can destabilize training.

Numbers: Asymmetric lengths [3,1] and [5,3] cause training collapse (Table 3, Section 4.6)

The SDPO dataset size used in experiments is 1,019 segment pairs.

Numbers: 1,019 pairs (C.1)

Results

Self-chat Goal (Llama-8B+BC -> +SDPO)

Value: BC 7.81 -> SDPO 8.56

Baseline: Behavioral Cloning (BC)

Self-chat Relationship (Llama-8B+BC -> +SDPO)

Value: BC 3.05 -> SDPO 3.69

Baseline: Behavioral Cloning (BC)

Average score (Llama-8B+BC+DPO vs +SDPO)

Value: DPO 5.34 -> SDPO 5.63

Baseline: DPO

Mistral base: Self-chat Goal/Relationship (BC -> SDPO)

Value: Goal 7.89 -> 8.48; Rel 2.98 -> 3.49

Baseline: Mistral BC
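The gains quoted under Key Findings follow directly from the score pairs listed in these results. A minimal sketch that recomputes them is shown here; the dictionary keys and the `deltas` helper are hypothetical names chosen for illustration, and only the numbers come from the entries above.

```python
# (baseline, SDPO) score pairs copied from the Results entries above.
# The key names are illustrative labels, not identifiers from the paper.
reported = {
    "llama_goal": (7.81, 8.56),
    "llama_relationship": (3.05, 3.69),
    "llama_avg_vs_dpo": (5.34, 5.63),
    "mistral_goal": (7.89, 8.48),
    "mistral_relationship": (2.98, 3.49),
}


def deltas(scores):
    """Return SDPO minus baseline for each entry, rounded to two decimals."""
    return {
        name: round(after - before, 2)
        for name, (before, after) in scores.items()
    }


gains = deltas(reported)
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}")
```

Rounding to two decimals matches the precision of the reported scores and hides floating-point residue from the subtraction.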

Who Should Care

What To Try In 7 Days

Collect failure sessions and use a powerful judge (e.g., GPT-4o) to mark the first erroneous turn.

Sample a few completions from the preceding history and pick the best positive session.

Extract equal-length segments around the differing turn and form positive/negative pairs (aim for about 3 turns).

Fine-tune with an SDPO loss on an open model for a few epochs and evaluate.
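The segment-extraction step above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: sessions are plain lists of turn strings, `error_turn` is the index of the first erroneous turn as reported by a judge, and a fixed `segment_len` stands in for the judge-driven segment selection described in the summary. `SegmentPair` and `build_segment_pair` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentPair:
    history: List[str]   # shared turns before the first erroneous turn
    positive: List[str]  # segment taken from the better, resampled session
    negative: List[str]  # segment taken from the failed session


def build_segment_pair(
    negative_session: List[str],
    positive_session: List[str],
    error_turn: int,
    segment_len: int = 3,  # assumption: fixed length instead of judge-selected
) -> SegmentPair:
    """Cut equal-length segments that start at the first erroneous turn."""
    if error_turn < 0:
        raise ValueError("error_turn must be non-negative")
    # The positive session is resampled from the same history, so both
    # sessions must agree on every turn before the error.
    if negative_session[:error_turn] != positive_session[:error_turn]:
        raise ValueError("sessions must share the history before the error turn")
    # Clip to the shorter session so both segments keep the same length.
    end = min(error_turn + segment_len, len(negative_session), len(positive_session))
    positive = positive_session[error_turn:end]
    negative = negative_session[error_turn:end]
    if not positive:
        raise ValueError("error_turn lies beyond the end of a session")
    return SegmentPair(negative_session[:error_turn], positive, negative)


# Toy sessions, invented for illustration only.
neg = ["A: hi", "B: hello", "A: give me the car", "B: no", "A: fine, bye"]
pos = ["A: hi", "B: hello", "A: could we share the car?", "B: maybe",
       "A: I can cover fuel", "B: deal"]
pair = build_segment_pair(neg, pos, error_turn=2)
print(len(pair.positive), len(pair.negative))
```

Clipping both segments to the shorter session is one simple way to guarantee equal lengths; discarding pairs that cannot reach the target length would be an equally reasonable choice.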

Agent Features

Memory

  • short-term interaction history (segments)

Planning

  • multi-turn dialogue planning

Tool Use

  • self-chat sampling
  • GPT-4o for annotation and segment selection

Frameworks

  • DPO
  • ETO
  • DMPO
  • SDPO

Is Agentic

true

Architectures

  • LLM-based conversational agent

Collaboration

  • interacts with other agents (self-chat and external interlocutors)

Optimization Features

Token Efficiency

  • SDPO uses words more efficiently in interactions (improved scores at similar token budgets, Section

Training Optimization

  • segment-level preference loss (SDPO)
  • ensure equal-length segments to eliminate partition function Z
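A minimal sketch of how these two points might look for a single training pair, assuming the loss follows the usual DPO shape: a negative log-sigmoid of the scaled difference between summed per-turn log-ratios. The function name, its inputs (per-turn values of log policy probability minus log reference probability), and the default `beta=0.1` are assumptions for illustration, not values taken from the paper.

```python
import math
from typing import Sequence


def sdpo_pair_loss(
    pos_logratios: Sequence[float],
    neg_logratios: Sequence[float],
    beta: float = 0.1,  # assumption: a common DPO default, not from the paper
) -> float:
    """Segment-level preference loss for one positive/negative pair.

    Each input holds one value per turn in the segment:
    log pi_theta(turn | history) - log pi_ref(turn | history).
    """
    # Enforce the equal-length requirement before computing anything.
    if len(pos_logratios) != len(neg_logratios):
        raise ValueError("positive and negative segments must have equal length")
    margin = beta * (sum(pos_logratios) - sum(neg_logratios))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Invented log-ratios for a 3-turn segment pair.
loss = sdpo_pair_loss([0.4, 0.1, 0.3], [-0.2, 0.0, -0.5])
print(round(loss, 4))
```

Raising on mismatched lengths, rather than padding or truncating silently, keeps asymmetric pairs out of training, which is the case the summary reports as causing collapse.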

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • SDPO requires equal-length positive and negative segments; asymmetric segments can collapse training.
  • Experiments are only on SOTOPIA; generalization to other interactive tasks is untested.
  • Relies on GPT-4o for error localization/segment selection; judge errors or bias can affect data quality.
  • Negative segments can still contain irrelevant or error-free turns, leaving residual noise.

When Not To Use

  • Single-turn or static QA tasks where multi-turn alignment is unnecessary.
  • Scenarios where you cannot make equal-length positive/negative segment pairs.
  • When no reliable judge is available to locate errors and pick segments.

Failure Modes

  • Training collapse with asymmetric segment lengths (observed for [3,1], [5,3]).
  • Degraded performance when using out-of-distribution positive samples (GPT-4-turbo positives underperformed self-sampling).
  • Residual noise if negative segments include non-erroneous turns.

Core Entities

Models

  • Llama-3.1-8B-Chat
  • Mistral-Instruct-v0.3
  • Llama-8B (base)
  • GPT-4o
  • GPT-4o-mini
  • GPT-4-turbo
  • GPT-3.5-turbo

Metrics

  • Goal (0-10 int)
  • Relationship (-5 to 5 int)
  • AVG (aggregate score shown in tables)

Datasets

  • SOTOPIA-π (training)
  • SOTOPIA (testing)

Benchmarks

  • SOTOPIA
  • SOTOPIA-Hard