Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.3
Citation Count
0
Why It Matters For Business
How agents phrase decisions affects cooperation and task success; monitoring and nudging tone and explanations can reduce coordination failures and build trust in agentic workflows.
Summary TLDR
The paper introduces a practical framework to measure "Interactional fairness" in multi-agent systems driven by large language models. Interactional fairness splits into Interpersonal fairness (respectful tone) and Informational fairness (explanation quality). The authors adapt human-survey tools (Colquitt's scales, Critical Incident Technique, journaling) into prompt-based tests and a JSON evaluation card. In a controlled negotiation study (24 conditions × 5 runs), respectful tone and clear justification raised acceptance rates and fairness ratings; context changed which signal mattered most (tone in collaborative settings, explanations in competitive ones). The framework is a low-cost, aud
Problem Statement
Existing fairness work for multi-agent systems focuses on outcomes and procedures. As agents talk more, how they speak and explain decisions becomes a separate, measurable fairness axis that can change cooperation and outcomes. We need a practical way to audit and debug communicative fairness in LLM-driven multi-agent systems.
Main Contribution
A conceptual adaptation of Interactional fairness (Interpersonal + Informational) for non-sentient LLM agents, treating fairness as observable communicative behavior.
A mixed-method evaluation pipeline: prompt-based Likert ratings, Critical Incident Technique sketches, Explanation Journaling, and a JSON Interactional Fairness Evaluation Card.
A controlled case study (resource negotiation) showing tone and justification affect acceptance decisions and that the relative importance of those cues shifts with task context.
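A minimal sketch of what one record of the JSON Interactional Fairness Evaluation Card could look like. The source names the card but this summary does not give its schema, so every field name here (`proposer`, `context`, `critical_incident`, and so on) is hypothetical, and the composite score is assumed to be an unweighted mean of the two Likert ratings.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FairnessCard:
    """Hypothetical evaluation card; the field names are illustrative only."""
    proposer: str
    responder: str
    context: str             # e.g. "collaborative" or "competitive"
    split: str               # proposed resource split, e.g. "5:5"
    interpersonal: int       # Likert 1-5: respectful tone
    informational: int       # Likert 1-5: explanation quality
    accepted: bool           # responder's accept/reject decision
    critical_incident: str = ""   # short Critical Incident Technique sketch
    journal_note: str = ""        # Explanation Journaling entry

    def __post_init__(self):
        # Reject ratings outside the 1-5 Likert range.
        for name in ("interpersonal", "informational"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    def composite(self) -> float:
        # Assumption: unweighted mean of the two ratings.
        return (self.interpersonal + self.informational) / 2

    def to_json(self) -> str:
        record = asdict(self)
        record["composite"] = self.composite()
        return json.dumps(record, indent=2)


card = FairnessCard(
    proposer="agent_a",
    responder="agent_b",
    context="collaborative",
    split="5:5",
    interpersonal=4,
    informational=5,
    accepted=True,
    journal_note="Split justified by equal prior contribution.",
)
print(card.to_json())
```

Writing one such record per negotiation turn gives an auditable log that can later be grouped by context or condition.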
Key Findings
Respectful tone and clear justification increase proposal acceptance even when resource splits are identical.
Distributional fairness (the proposed split) remains the strongest predictor, but communicative cues can partially offset inequality.
Which interactional signal matters depends on task framing: tone matters more in collaboration, explanations matter more in competition.
Results
Acceptance rate for equal (5:5) proposals under High-High
Decision Tree feature importance (collaborative)
Logistic regression coefficient for split (Ridge, collaborative)
Who Should Care
What To Try In 7 Days
Run a small negotiation test where agents use the Interactional Fairness Evaluation Card to log tone, explanation scores, and accept/reject decisions.
Add a prompt template that enforces a respectful opening line and a 1–2 sentence justification for proposals and measure acceptance change.
Track acceptance rate by context (collaborative vs competitive) to decide whether to emphasize tone or explanation in policies.
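The three steps above can be sketched as follows. The template wording, the `build_prompt` and `acceptance_by` helpers, and the log values are all illustrative assumptions, not material from the paper; the log rows are made-up placeholders that only show the expected shape of the data.

```python
from collections import defaultdict

# Hypothetical template: respectful opening plus a 1-2 sentence justification.
PROPOSAL_TEMPLATE = (
    "Open with one respectful sentence addressed to {partner}.\n"
    "Propose this split of {total} units: {mine} for you, {theirs} for {partner}.\n"
    "Justify the split in one or two sentences."
)


def build_prompt(partner: str, mine: int, theirs: int) -> str:
    """Fill the proposal template for one negotiation turn."""
    return PROPOSAL_TEMPLATE.format(
        partner=partner, total=mine + theirs, mine=mine, theirs=theirs
    )


def acceptance_by(records, *keys):
    """Return the acceptance rate for each combination of the given keys."""
    counts = defaultdict(lambda: [0, 0])  # group -> [accepted, total]
    for record in records:
        group = tuple(record[k] for k in keys)
        counts[group][0] += int(record["accepted"])
        counts[group][1] += 1
    return {group: acc / total for group, (acc, total) in counts.items()}


# Placeholder log entries (not results from the study).
demo_log = [
    {"context": "collaborative", "template": True, "accepted": True},
    {"context": "collaborative", "template": False, "accepted": False},
    {"context": "collaborative", "template": True, "accepted": True},
    {"context": "competitive", "template": True, "accepted": False},
    {"context": "competitive", "template": True, "accepted": True},
]

rates = acceptance_by(demo_log, "context", "template")
for group, rate in sorted(rates.items()):
    print(group, f"{rate:.2f}")
```

Comparing the rate with and without the template inside each context indicates whether tone or explanation deserves more emphasis for that setting.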
Agent Features
Memory
- one-shot / no long-term memory (study)
- supports journaling for longitudinal logging
Tool Use
- prompt templates
- JSON evaluation card
Frameworks
- Colquitt fairness scales (adapted)
- Critical Incident Technique
- Explanation Journaling
Is Agentic
true
Architectures
- LLM-based agent (prompted LLM)
Collaboration
- Agent Communication
- Multi-agent Coordination
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Simple one-shot negotiation setup limits ecological validity for real multi-step systems.
- Agents self-evaluate with prompts; this can introduce judge bias and circularity.
- Small number of runs per condition (five) limits statistical power.
- No human-in-the-loop validation in the reported study.
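To make the statistical-power limitation concrete, the sketch below computes a 95% Wilson score interval for an acceptance rate estimated from five runs. The Wilson interval is a standard choice for small samples and is used here as an illustration; this summary does not say which interval, if any, the authors report, and the 4-of-5 count is a hypothetical input.

```python
import math


def wilson_interval(accepted: int, runs: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= accepted <= runs:
        raise ValueError("accepted must be between 0 and runs")
    p = accepted / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical condition: 4 of 5 proposals accepted.
low, high = wilson_interval(4, 5)
print(f"observed 0.80, 95% interval roughly {low:.2f} to {high:.2f}")
```

With only five runs the interval spans more than half of the 0 to 1 range, so differences between individual conditions should be read with caution.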
When Not To Use
- As the only fairness check for complex, long-running multi-agent deployments.
- To infer agent sentience or moral understanding; the framework measures observable behavior only.
- As a substitute for outcome-based fairness audits when resource distribution is the primary risk.
Failure Modes
- Agents may be tuned to game the evaluation prompts without genuine improvement in cooperative behavior.
- Context mismatch: a one-size-fits-all communication policy can harm performance when task framing changes.
- Judge bias: prompted agents used as evaluators can reflect the same stylistic biases as proposers.
Core Entities
Models
- GPT-4
Metrics
- Likert interpersonal rating (1-5)
- Likert informational rating (1-5)
- accept/reject rate
- Interactional fairness composite score
Context Entities
Metrics
- acceptance rate by condition
- feature importance from Decision Tree
- logistic regression coefficients
Datasets
- The Fair Divide (resource negotiation simulation)

