Overview
Production Readiness
0.5
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
iAgents assigns one agent per user and lets those agents coordinate across private data without centralizing it. This enables multi-user scheduling, concierge, and workflow automation, but expect higher token costs and privacy trade-offs.
Summary TLDR
This paper defines information asymmetry for multi-agent systems (each agent only sees its user's private data) and proposes iAgents: a system of one agent per user that proactively requests and exchanges only necessary human information. Two core ideas: InfoNav, an explicit plan that tracks which facts (rationales) are unknown and guides multi-turn questions; and Mixed Memory, combining exact-span 'Clear Memory' with embedding-based 'Fuzzy Memory' for retrieval. The authors release InformativeBench (5 datasets) and show GPT-4 achieves ~50% on average while smaller LLMs perform worse. Ablations show InfoNav is critical for small-network reasoning and mixed memory + recursive communication is
Problem Statement
Most multi-agent systems assume a shared context, but real human collaboration is asymmetric: each agent sees only its own user's private information, which breaks coordination. The challenge is to let agents acquire and exchange the facts they need without centralizing private data, while scaling retrieval over many messages and keeping multi-turn communication focused.
Main Contribution
Formulate the problem of information asymmetry in multi-agent collaboration and shift focus from a single shared virtual entity to agents that mirror users.
Propose iAgents: integrates InfoNav (plan-driven communication) and Mixed Memory (Clear + Fuzzy) to retrieve and exchange human information without centralizing all data.
Release InformativeBench, a benchmark with five datasets (Needle/Reasoning pipelines) to evaluate agent collaboration under information asymmetry and provide code/data.
Key Findings
GPT-4-backed iAgents reach roughly 50% on average across InformativeBench, and performance varies strongly with dataset difficulty.
iAgents scaled to a large simulated social network and retrieved many messages during runs.
Design components have measurable impact: InfoNav, Mixed Memory, and recursive communication improved accuracy.
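Privacy settings and pretraining knowledge both affect performance: stronger privacy restrictions reduce accuracy, and pretrained model priors can override user-provided evidence.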
Results
Schedule Easy (precision)
Schedule Medium (precision)
Schedule Hard (precision)
Needle in the Persona (precision)
FriendsTV (precision)
Accuracy
Ablation: w/o InfoNav on Schedule
Impact of recursive communication on FriendsTV
Who Should Care
What To Try In 7 Days
Prototype InfoNav prompts on a small 4–6 person calendar use case to test multi-turn info exchange.
Build a mixed memory of exact spans + session summaries and compare retrieval quality.
Run InformativeBench (NP or ScheduleEasy) with your preferred LLM to measure baseline accuracy and token cost.
Agent Features
Memory
- Mixed Memory: Clear Memory (exact spans)
- Mixed Memory: Fuzzy Memory (session summaries + embeddings)
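A minimal sketch of how the two memory types might sit side by side. It assumes a toy bag-of-words vector in place of a real embedding model and a brute-force scan in place of ANN search. The names `MixedMemory`, `clear_lookup`, and `fuzzy_lookup` are hypothetical and are not taken from the iAgents code.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class MixedMemory:
    messages: list = field(default_factory=list)   # Clear Memory: raw messages
    summaries: list = field(default_factory=list)  # Fuzzy Memory: (summary, vector)

    def add_message(self, text: str) -> None:
        self.messages.append(text)

    def add_summary(self, summary: str) -> None:
        self.summaries.append((summary, embed(summary)))

    def clear_lookup(self, keyword: str) -> list:
        """Exact-span retrieval: messages that contain the keyword verbatim."""
        return [m for m in self.messages if keyword.lower() in m.lower()]

    def fuzzy_lookup(self, query: str, top_k: int = 1) -> list:
        """Similarity retrieval over session summaries (brute force, not ANN)."""
        q = embed(query)
        ranked = sorted(self.summaries, key=lambda s: cosine(q, s[1]), reverse=True)
        return [summary for summary, _ in ranked[:top_k]]


memory = MixedMemory()
memory.add_message("Alice: I am free on Tuesday after 3pm.")
memory.add_message("Bob: The budget review moved to Friday.")
memory.add_summary("Alice and Bob discussed availability for a meeting next week.")
memory.add_summary("Bob shared updates about the budget review schedule.")

exact_hits = memory.clear_lookup("Tuesday")
fuzzy_hits = memory.fuzzy_lookup("when is the budget review")
```

In a real prototype the `embed` function would call an embedding model and the summaries would come from an LLM summarizer. The split between exact and approximate lookup is the part this sketch is meant to show.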
Planning
- InfoNav (explicit plan tracking)
- Consensus reasoning (plan-based merge)
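The plan-tracking idea can be illustrated with a small data structure: a plan holds the rationales a task needs, each either still unknown or filled in, and the next question targets the first unknown one. This is a sketch under assumptions. The class names, fields, and question template are hypothetical, and the paper's prompts drive this through an LLM rather than fixed code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rationale:
    description: str
    value: Optional[str] = None  # None means the fact is still unknown

    @property
    def solved(self) -> bool:
        return self.value is not None


@dataclass
class Plan:
    task: str
    rationales: list = field(default_factory=list)

    def unknown(self) -> list:
        """Rationales that still need to be asked about."""
        return [r for r in self.rationales if not r.solved]

    def next_question(self) -> Optional[str]:
        """Question for the first unknown rationale, or None if nothing is pending."""
        pending = self.unknown()
        return f"Could you tell me: {pending[0].description}?" if pending else None

    def fill(self, description: str, value: str) -> None:
        """Record an answer received from another agent."""
        for r in self.rationales:
            if r.description == description:
                r.value = value

    def ready_for_consensus(self) -> bool:
        """True once every rationale has a value, so agents can merge plans."""
        return not self.unknown()


plan = Plan(
    task="Find a meeting slot for Alice and Bob",
    rationales=[
        Rationale("Alice's free slots", value="Tue 3pm, Wed 10am"),
        Rationale("Bob's free slots"),
    ],
)
first_question = plan.next_question()
plan.fill("Bob's free slots", "Wed 10am")
```

Keeping the unknowns explicit is what keeps a multi-turn exchange focused: each turn either fills a rationale or leaves it visibly open.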
Tool Use
- embedding-based retrieval (ANN)
- LLM summarizer for session-level summaries
Frameworks
- iAgents (InfoNav + Mixed Memory)
- InformativeBench
Is Agentic
true
Architectures
- one-agent-per-user mirroring
- role-play prompt-created agents
Collaboration
- recursive inter-agent communication
- multi-turn autonomous dialogs (max 10 turns in experiments)
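The collaboration pattern above can be sketched as a bounded dialog in which a responder that lacks a fact asks its own contacts before replying. This is an illustrative toy, not the iAgents implementation: the `Agent` class, the `converse` helper, and the dictionary-of-facts representation are all assumptions, and only the 10-turn cap comes from the reported experiments.

```python
MAX_TURNS = 10  # turn cap reported for the experiments


class Agent:
    def __init__(self, name, private_facts, contacts=None):
        self.name = name
        self.private_facts = dict(private_facts)  # visible to this agent only
        self.contacts = contacts or []

    def answer(self, key, visited=None):
        """Return the fact if known; otherwise recursively ask unvisited contacts."""
        if visited is None:
            visited = set()
        visited.add(self.name)
        if key in self.private_facts:
            return self.private_facts[key]
        for contact in self.contacts:
            if contact.name not in visited:
                found = contact.answer(key, visited)
                if found is not None:
                    return found
        return None


def converse(asker, responder, needed_keys, max_turns=MAX_TURNS):
    """One question per turn; stop when keys are exhausted or turns run out."""
    gathered = {}
    turns = 0
    for key in needed_keys:
        if turns >= max_turns:
            break
        turns += 1
        reply = responder.answer(key)
        if reply is not None:
            gathered[key] = reply
    return gathered, turns


carol = Agent("carol", {"venue": "Room 4"})
bob = Agent("bob", {"bob_free": "Wed 10am"}, contacts=[carol])
alice = Agent("alice", {"alice_free": "Wed 10am"}, contacts=[bob])

facts, turns_used = converse(alice, bob, ["bob_free", "venue", "catering"])
```

Here Bob answers one question from his own data, forwards another to Carol, and returns nothing for the third. The `visited` set prevents agents from asking each other in a loop.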
Optimization Features
Token Efficiency
- paper reports ~30k input tokens per task as cost concern
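A back-of-the-envelope estimate shows how the ~30k input tokens per task translate into spend. The per-million-token price below is a placeholder assumption, not a quoted rate for any provider, and output tokens are ignored.

```python
def estimate_input_cost(tasks, tokens_per_task=30_000, usd_per_million_tokens=10.0):
    """Return (total input tokens, USD cost). The price is a placeholder assumption."""
    total_tokens = tasks * tokens_per_task
    cost_usd = total_tokens / 1_000_000 * usd_per_million_tokens
    return total_tokens, cost_usd


# 1,000 tasks at ~30k input tokens each, priced at a hypothetical $10 per million tokens
total_tokens, cost_usd = estimate_input_cost(tasks=1_000)
```

Substituting your provider's actual input price, and adding a similar term for output tokens, gives a rough budget before running a benchmark.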
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Privacy vs. utility trade-off: stronger privacy restrictions noticeably reduce accuracy.
- High token and latency cost: experiments report ~30,000 tokens per task.
- Dependence on closed LLM backends for top performance; smaller models lag far behind.
- No statistical error bars reported; results are point estimates on the provided datasets.
When Not To Use
- When absolute local-only privacy is required (L3 level) and edge models cannot match performance.
- For tiny tasks where centralizing data is simpler and cheaper.
- When token cost or latency must be minimal.
Failure Modes
- Agents hallucinate 'fake solved' rationales and pass incorrect facts into consensus.
- Pretrained model priors override user-provided evidence, leading to prior-distraction errors.
- Excessive retrieval noise from fuzzy memory if summaries lose critical span details.
Core Entities
Models
- gpt-4-0125-preview
- gpt-3.5-turbo-16k
- gemini-1.0-pro-latest
- claude-sonnet 2
Metrics
- Precision
- F1
- IoU
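For reference, the three metrics can be computed over sets of predicted and gold answers as follows. These are the standard set-based definitions. How InformativeBench applies them to each dataset is not specified in this summary, so treat the function names and the example data as illustrative.

```python
def precision(predicted, gold):
    """Fraction of predicted items that are correct."""
    predicted, gold = set(predicted), set(gold)
    return len(predicted & gold) / len(predicted) if predicted else 0.0


def recall(predicted, gold):
    """Fraction of gold items that were predicted."""
    predicted, gold = set(predicted), set(gold)
    return len(predicted & gold) / len(gold) if gold else 0.0


def f1(predicted, gold):
    """Harmonic mean of precision and recall."""
    p, r = precision(predicted, gold), recall(predicted, gold)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def iou(predicted, gold):
    """Intersection over union of the two sets."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 1.0


predicted_slots = {"Tue 3pm", "Wed 10am"}
gold_slots = {"Wed 10am", "Fri 1pm"}

p = precision(predicted_slots, gold_slots)
f = f1(predicted_slots, gold_slots)
j = iou(predicted_slots, gold_slots)
```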
Datasets
- InformativeBench
- Needle in the Persona (NP)
- FriendsTV
- Schedule (Easy/Medium/Hard)
Benchmarks
- InformativeBench
Context Entities
Models
- role-play prompting agents (prior MAS baselines referenced)

