Use a learned manager to steer LLM agents by changing who sees what — raising network cooperation without rewiring links

September 16, 2024 · 8 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

2

Authors

Qiliang Chen, Sepehr Ilami, Nunzio Lore, Babak Heydari

Links

Abstract / PDF

Why It Matters For Business

Adaptive control of who sees what is a low-cost governance lever: you can raise coordination among autonomous agents without changing incentives or network wiring, cutting engineering and policy friction.

Summary TLDR

The paper builds a two-layer system where many LLM-based agents play repeated Prisoner's Dilemma games on a fixed network while a reinforcement-learning (RL) manager decides what information each agent sees. By switching between last-action cues and neighborhood cooperation summaries, the RL manager raises cooperation far above static baselines. Key technical pieces: LLaMa3-70B agents (prompted, not fine-tuned), micro-level behavioral validation, and an actor-critic RL manager that maximizes summed rewards. Results (simulated, 20-node networks, 50 random graphs) show the RL manager drives rapid, system-wide cooperation and learns to target well-connected and already-cooperative nodes with “r

Problem Statement

How can designers steer collective behavior in systems of autonomous agents without changing who interacts with whom? The authors ask whether adaptive control of information visibility — which agents see recent actions or neighborhood cooperation rates — can act as a low-cost governance lever to increase cooperation across a fixed interaction network.

Main Contribution

Framework: A two-layer design separating the interaction network (fixed links) from an information network that a learned manager dynamically modulates.

Behavioral modeling: Micro-validation showing LLaMa3-70B agents respond predictably to different prompt information and follow WSLS-like strategies.

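Governance synthesis: An actor-critic RL manager that adaptively picks one of three information tiers (LA: last action only; LA+AR: last action plus agent cooperation ratio; LA+NR: last action plus neighborhood cooperation ratio) to raise social welfare and cooperation above static baselines.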

Empirical insights: The manager learns phased policies and heterogeneous (asymmetric) information targeting, directing richer signals to high-degree and already-cooperative nodes.
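The two-layer idea above can be sketched as a fixed edge list plus a per-node tier assignment that the manager rewrites each round. This is a minimal illustration, not the authors' implementation: the `Tier` enum, the `build_observation` helper, the prompt wording, and the choice of whose cooperation ratio each tier reports are all assumptions made for the example.

```python
from enum import Enum


class Tier(Enum):
    LA = "LA"        # last action only
    LA_AR = "LA+AR"  # last action + agent cooperation ratio
    LA_NR = "LA+NR"  # last action + neighborhood cooperation ratio


def build_observation(tier, last_action, agent_label=None, neighborhood_label=None):
    """Return the text an agent would see under a tier (hypothetical wording)."""
    parts = [f"Last round, your co-player chose {last_action}."]
    if tier is Tier.LA_AR:
        # Assumption: the agent-level ratio describes the co-player's own history.
        parts.append(f"Historically, your co-player cooperates: {agent_label}.")
    elif tier is Tier.LA_NR:
        # Assumption: the neighborhood ratio aggregates the co-player's neighbors.
        parts.append(f"Your co-player's neighbors cooperate: {neighborhood_label}.")
    return " ".join(parts)


# Interaction layer: fixed links, never changed by the manager.
interaction_edges = [(0, 1), (1, 2)]

# Information layer: one tier per node, reassigned by the manager each round.
info_assignment = {0: Tier.LA, 1: Tier.LA_NR, 2: Tier.LA_AR}

for node, tier in info_assignment.items():
    print(node, build_observation(tier, "cooperate", "Sometimes", "Often"))
```

Keeping the edge list and the tier assignment in separate structures mirrors the separation the framework describes: only `info_assignment` changes between rounds.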

Key Findings

A learned RL manager drives full network cooperation in the simulated PD runs.

Numbers: Reaches 100% mutual cooperation (CC) by timestep 10 on average (RL method)

The LLaMa3-70B agents show a win-stay/lose-shift (WSLS) style policy and are sensitive to historical context.

Numbers: After mutual defection, cooperation occurs 49% of the time (micro-validation)
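For reference, the textbook deterministic form of win-stay/lose-shift can be written in a few lines. This is the idealized rule only; the LLM agents are reported as WSLS-like, and the 49% cooperation rate after mutual defection shows they do not follow it deterministically. The function name and the "C"/"D" encoding are assumptions for the example.

```python
def wsls_next_action(my_last: str, partner_last: str) -> str:
    """Deterministic win-stay/lose-shift for the Prisoner's Dilemma.

    Actions are "C" (cooperate) or "D" (defect). A round counts as a "win"
    when the partner cooperated, so the agent repeats its own last action.
    Otherwise it switches.
    """
    if partner_last == "C":
        return my_last  # win: stay
    return "D" if my_last == "C" else "C"  # lose: shift


for mine in "CD":
    for theirs in "CD":
        print(mine, theirs, "->", wsls_next_action(mine, theirs))
```

Under this rule mutual defection is always followed by cooperation, which makes it a useful comparison point for the observed agent behavior.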

The RL manager favors neighborhood-level information after an early exploratory phase.

Numbers: LA+NR dominates ~75% of interventions early and becomes the uniform choice by step 7

Information is targeted: better-connected and more cooperative nodes receive richer network-level signals.

Numbers: Mean degree 5.68 vs 4.64 (p<0.001, Cohen's d=0.60); pre-intervention cooperation 0.908 vs 0.801 (p<0.001, d=2.61)

Qualitative cooperation categories improved signal handling compared to raw numeric rates because LLaMa3-70B handled qualitative labels more reliably.

Numbers: Agents use Rarely/Sometimes/Often buckets (<33%, 33–66%, >66%) for historical rates (micro-validation)
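The bucketing above amounts to a small mapping from a rate to a label. The sketch below uses the thresholds given in the summary; the function name and the handling of values exactly at 33% and 66% are assumptions, since the source does not say which bucket the boundaries fall into.

```python
def cooperation_label(rate: float) -> str:
    """Map a historical cooperation rate in [0, 1] to a qualitative label."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate < 0.33:
        return "Rarely"
    if rate <= 0.66:  # assumption: boundary values count as "Sometimes"
        return "Sometimes"
    return "Often"


print(cooperation_label(0.10))  # Rarely
print(cooperation_label(0.50))  # Sometimes
print(cooperation_label(0.90))  # Often
```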

Results

Final cooperation rate (RL manager)

Value: 100% mutual cooperation by timestep 10 (on average, simulated runs)

Baseline: LA, LA+NR, LA+AR baselines (do not reach 100%)

Micro-level cooperation after mutual cooperation (LA+NR content)

Value: 87% cooperation when coplayer labeled 'Sometimes', 99% when labeled 'Often'

Baseline: LA only behavior

Node targeting — mean degree by intervention

Value: Mean degree 5.68 for LA+NR vs 4.64 for LA

Pre-intervention cooperation by intervention type

Value: Mean 0.908 (LA+NR group) vs 0.801 (LA group)

Who Should Care

What To Try In 7 Days

Run a small simulation (10–50 agents) using your task payoff and an LLM proxy to micro-validate agent prompts.

Implement 2–3 information tiers (last action, agent-history, neighborhood-summary) and measure cooperation rate over 20 steps.

Train a simple actor-critic manager to choose information tiers and compare to fixed baselines.
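As a starting point for the third step, here is a minimal tabular one-step actor-critic that picks among three tiers. Everything about the environment is a toy assumption: `toy_env_step` invents dynamics in which richer tiers nudge cooperation upward, the state is the network cooperation rate split into five bins, and the manager picks a single tier for the whole network rather than one per node. In a real trial, replace `toy_env_step` with your own agent simulation.

```python
import math
import random

TIERS = ["LA", "LA+AR", "LA+NR"]
N_STATES = 5  # cooperation rate discretized into 5 bins (assumption)


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def to_state(coop):
    return min(N_STATES - 1, int(coop * N_STATES))


def toy_env_step(coop, tier_idx, rng):
    """Hypothetical dynamics, NOT from the paper: richer tiers raise cooperation."""
    boost = [0.00, 0.03, 0.06][tier_idx]
    coop = min(1.0, max(0.0, coop + boost + rng.uniform(-0.03, 0.03)))
    return coop, coop  # next cooperation rate, reward (cooperation as welfare proxy)


def train(episodes=300, horizon=20, alpha_actor=0.1, alpha_critic=0.1,
          gamma=0.95, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * len(TIERS) for _ in range(N_STATES)]  # actor preferences
    value = [0.0] * N_STATES                                # critic estimates
    for _ in range(episodes):
        coop = 0.5
        for t in range(horizon):
            s = to_state(coop)
            probs = softmax(theta[s])
            a = rng.choices(range(len(TIERS)), weights=probs)[0]
            coop, reward = toy_env_step(coop, a, rng)
            # No bootstrapping past the final step of the episode.
            bootstrap = 0.0 if t == horizon - 1 else gamma * value[to_state(coop)]
            td_error = reward + bootstrap - value[s]
            value[s] += alpha_critic * td_error
            # Policy-gradient step: grad of log softmax is one_hot(a) - probs.
            for i in range(len(TIERS)):
                grad = (1.0 if i == a else 0.0) - probs[i]
                theta[s][i] += alpha_actor * td_error * grad
    return theta, value


theta, value = train()
for s in range(N_STATES):
    print(s, [round(p, 2) for p in softmax(theta[s])])
```

To compare against fixed baselines, run the same environment with `a` held constant at each tier and record the cooperation rate over the 20 steps.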

Agent Features

Memory

  • last-action (short-term)
  • agent cooperation ratio (longer-term summary)
  • neighborhood cooperation ratio (aggregated memory)

Planning

  • repeated interactions
  • POMDP-based manager

Tool Use

  • LangChain
  • Groq

Frameworks

  • Actor-Critic RL
  • POMDP

Is Agentic

true

Architectures

  • LLaMa3-70B

Collaboration

  • multi-agent

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Evaluation limited to repeated Prisoner's Dilemma; generalization to richer tasks is untested.
  • Experiments use LLaMa3-70B agents (prompted but not fine-tuned); human behavior transfer is assumed but not validated.
  • Compute limits restricted the number of rounds; reported results rely on 50 random graphs and 20 timesteps.

When Not To Use

  • When you must guarantee specific individual-level actions rather than influence aggregate outcomes.
  • In domains where revealing different levels of information violates privacy or legal constraints.
  • If agents do not respond reliably to prompts (noisy or non-language-based agents).

Failure Modes

  • LLM numeric sensitivity: the model misinterprets raw numeric rates, requiring qualitative buckets.
  • Manager overfits to simulation dynamics and selects interventions that fail with real humans or different game payoffs.
  • Dependence on prompt design: poorly crafted prompts can produce erratic agent behavior and mislead the manager.

Core Entities

Models

  • LLaMa3-70B

Metrics

  • cooperation rate
  • social welfare (sum of agent payoffs)
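Both metrics can be computed per round from the edge list and each agent's action. The sketch below is illustrative only: the payoff values are a common textbook choice and are not taken from the paper, and it assumes each agent plays one action per round against all of its neighbors. The `round_metrics` name is hypothetical.

```python
# Assumed Prisoner's Dilemma payoffs (textbook values, not from the paper).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def round_metrics(edges, actions):
    """Compute cooperation rate, mutual-cooperation share, and social welfare.

    edges:   list of (u, v) pairs from the fixed interaction network
    actions: dict mapping node -> "C" or "D" for this round
    """
    cooperation_rate = sum(a == "C" for a in actions.values()) / len(actions)
    welfare = 0
    mutual = 0
    for u, v in edges:
        payoff_u, payoff_v = PAYOFFS[(actions[u], actions[v])]
        welfare += payoff_u + payoff_v  # social welfare sums all agent payoffs
        if actions[u] == "C" and actions[v] == "C":
            mutual += 1
    return {
        "cooperation_rate": cooperation_rate,
        "mutual_cooperation_share": mutual / len(edges),
        "social_welfare": welfare,
    }


print(round_metrics([(0, 1), (1, 2)], {0: "C", 1: "C", 2: "D"}))
```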