Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
Designing LLM teams with shared memory and structured communication reduces reasoning failures on complex problems, improving solution quality for data analysis and math tasks while requiring careful tuning to avoid extra coordination cost.
Summary TLDR
LLMs struggle when a task forces them to hold and integrate many interacting facts at once. The authors map human Cognitive Load Theory (CLT) onto LLMs, treating attention as the analogue of working memory, identify two diagnostic signals (attention entropy and perplexity), and introduce CoThinker, a multi-agent in‑context system that (1) assigns dynamic thinking styles, (2) keeps a shared transactive memory, and (3) moderates peer communication with a small‑world graph. On challenging benchmarks (LiveBench, CommonGen‑Hard), CoThinker improves math/reasoning and concept‑integration tasks over single-agent and debate baselines, but it can hurt simple instruction following because of coordination overhead.
Problem Statement
LLMs hit a performance ceiling on multi-faceted tasks because in‑context examples and constraints overload the model's selective attention (its working-memory analogue). The paper argues that this "cognitive overload" explains degeneration, lack of diversity, and failure to meet multiple constraints, and that multi-agent coordination designed with CLT principles can mitigate the problem.
Main Contribution
Formalized a mapping from human Cognitive Load Theory to LLM attention and in‑context limits, and validated it with attention entropy and perplexity probes.
Designed CoThinker, a CLT‑grounded multi‑agent architecture with dynamic thinking styles, a transactive memory system (TMS), and a communication moderator that enforces a small‑world communication graph.
Empirically validated CoThinker across LiveBench and CommonGen‑Hard on multiple LLMs, with ablations showing how N (references), β (exploration), and M (agents) trade off performance and coordination cost.
Key Findings
Attention entropy rises with task complexity, consistent with higher working‑memory demands.
Structured instructions reduce uncertainty for hard tasks but add cost for easy tasks.
CoThinker improves average scores on high‑CL (cognitive load) benchmarks versus single‑agent input-output (IO) baselines.
The transactive memory and thinking-style components lower model perplexity, suggesting that peer outputs become easier to process.
Communication hyperparameters have clear optima: N ≈ 2–3; β trades off echo chambers vs integration cost; M has non‑monotonic returns.
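The attention entropy signal in the first finding can be illustrated with a small calculation. The sketch below assumes the signal is the Shannon entropy of an attention distribution, averaged over query positions; the function names and the toy weight vectors are illustrative, and the paper's exact aggregation over heads and layers may differ.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    entropy = 0.0
    for w in weights:
        p = w / total  # renormalize defensively
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def mean_attention_entropy(rows):
    """Average entropy over several query positions (rows of an attention matrix)."""
    if not rows:
        raise ValueError("need at least one attention row")
    return sum(attention_entropy(r) for r in rows) / len(rows)


# Toy rows (illustrative only): one sharply focused, one spread evenly.
focused = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]
```

Under this reading, a diffuse row scores higher than a focused one, which matches the claim that entropy rises when the model must attend to many interacting pieces of context.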
Results
Attention Entropy
Perplexity (Hard tasks)
LiveBench average (normalized)
Perplexity reduction from components
Who Should Care
What To Try In 7 Days
Run a small CoThinker prototype (M=6, N=2–3, β≈0.3) on one high‑complexity task to compare vs single-agent baselines.
Add a concise transactive memory summary step to your agent pipeline to avoid redundant recomputation.
Use style prompts (1–2 sentences) to diversify agent approaches instead of fixed heavy role personas.
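A prototype along these lines might start from a small configuration object and a set of short style prompts. The sketch below is only a starting point: `PrototypeConfig`, `STYLE_PROMPTS`, and `build_agent_prompts` are hypothetical names, the style texts are invented for illustration, and only the numeric defaults (M=6, N=2, β=0.3, three rounds) come from the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrototypeConfig:
    num_agents: int = 6   # M
    num_refs: int = 2     # N (the text suggests 2-3)
    beta: float = 0.3     # rewiring / exploration probability
    max_rounds: int = 3   # refinement rounds, per the default listed under Planning

    def __post_init__(self):
        if self.num_agents < 2:
            raise ValueError("need at least two agents")
        if not 0 <= self.num_refs < self.num_agents:
            raise ValueError("num_refs must be in [0, num_agents)")
        if not 0.0 <= self.beta <= 1.0:
            raise ValueError("beta must be a probability")
        if self.max_rounds < 1:
            raise ValueError("need at least one round")


# Hypothetical one-sentence style prompts (not taken from the paper).
STYLE_PROMPTS = [
    "Work step by step and verify each intermediate result.",
    "Look for a simpler reformulation of the problem first.",
    "Actively search for counterexamples and edge cases.",
    "Start from the constraints and work backwards.",
    "Estimate the answer roughly, then refine it.",
    "Compare two different solution strategies before committing.",
]


def build_agent_prompts(task, config, styles=STYLE_PROMPTS):
    """Pair each agent with a style prompt, cycling if there are fewer styles than agents."""
    return [
        f"{styles[i % len(styles)]}\n\nTask: {task}"
        for i in range(config.num_agents)
    ]
```

Keeping each style to a sentence or two follows the advice above to prefer light style prompts over heavy role personas.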
Agent Features
Memory
- collective working memory (TMS summary)
- expertise directory ('who knows what')
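The two memory features listed here could be held in a very small data structure. The following is a minimal sketch, assuming the transactive memory is a shared text summary plus a topic-to-agents directory; the class and method names are hypothetical and the paper may store and update this state differently (for example, by asking an LLM to rewrite the summary each round).

```python
from dataclasses import dataclass, field


@dataclass
class TransactiveMemory:
    """Shared summary plus a 'who knows what' directory (illustrative layout)."""
    summary: str = ""
    expertise: dict = field(default_factory=dict)  # topic -> set of agent ids

    def record_expertise(self, agent_id, topic):
        self.expertise.setdefault(topic, set()).add(agent_id)

    def who_knows(self, topic):
        return sorted(self.expertise.get(topic, set()))

    def update_summary(self, new_summary):
        # Replace rather than append, so the shared context stays concise.
        self.summary = new_summary.strip()

    def render_for_prompt(self):
        """Compact text block that could be prepended to an agent's prompt."""
        lines = [f"Shared summary: {self.summary or '(empty)'}"]
        for topic in sorted(self.expertise):
            agents = ", ".join(self.who_knows(topic))
            lines.append(f"{topic}: {agents}")
        return "\n".join(lines)
```

A usage pattern would be to call `update_summary` once per round and inject `render_for_prompt()` into every agent prompt, so agents do not re-derive what a peer has already established.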
Planning
- iterative refinement rounds (T_max = 3 by default)
- dynamic thinking style orchestration
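The round structure can be sketched as a plain loop. This is a simplified illustration under several assumptions: `llm` is a hypothetical callable taking a prompt and a temperature, the round count includes the initial drafting round, peers are chosen as the next agents on a ring rather than by a moderator, and the temperatures are placeholder values.

```python
from typing import Callable, List, Sequence

# Hypothetical LLM interface: (prompt, temperature) -> completion text.
LLM = Callable[[str, float], str]


def run_rounds(task: str, styles: Sequence[str], llm: LLM,
               t_max: int = 3, n_refs: int = 2) -> str:
    m = len(styles)
    if m == 0 or t_max < 1:
        raise ValueError("need at least one agent and one round")

    # Round 1: each agent drafts independently under its thinking style.
    drafts: List[str] = [
        llm(f"Thinking style: {s}\nTask: {task}\nDraft an answer.", 1.0)
        for s in styles
    ]

    # Remaining rounds: each agent revises after reading up to n_refs peer drafts.
    k = min(n_refs, m - 1)
    for _ in range(t_max - 1):
        updated = []
        for i, s in enumerate(styles):
            # Placeholder peer choice: the next k agents on a ring.
            peers = [drafts[(i + d) % m] for d in range(1, k + 1)]
            peer_text = "\n".join(f"- {p}" for p in peers)
            prompt = (
                f"Thinking style: {s}\nTask: {task}\n"
                f"Your draft: {drafts[i]}\nPeer drafts:\n{peer_text}\n"
                "Revise your answer."
            )
            updated.append(llm(prompt, 0.3))
        drafts = updated

    # Final step: a synthesizer call merges the last-round drafts.
    joined = "\n".join(f"- {d}" for d in drafts)
    return llm(
        f"Task: {task}\nCandidate answers:\n{joined}\nWrite the final answer.", 0.0
    )
```

With three agents and three rounds this makes ten calls in total (three drafts, six revisions, one synthesis), which is a useful number to keep in mind when budgeting a trial.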
Tool Use
- LLM APIs (various commercial and open models)
- semantic embeddings for cognitive distance
Frameworks
- can augment AutoGen
- compatible with MetaGPT-style pipelines
Is Agentic
true
Architectures
- multi-agent in-context learning
- small-world communication graph
- transactive memory system (collective WM)
Collaboration
- communication moderator selecting N references
- probabilistic rewiring (β) for diversity
- synthesizer agent for final solution
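One plausible reading of the moderator is a Watts-Strogatz-style procedure: start each agent with its N nearest peers by embedding distance, then rewire each slot to a random other peer with probability β. The sketch below implements that reading only; it is not confirmed as the authors' procedure, and `cosine_distance` and `select_references` are hypothetical helper names.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return 1.0 - dot / (nu * nv)


def select_references(embeddings, n_refs, beta, rng):
    """Return {agent index: list of n_refs peer indices} with fixed in-degree."""
    m = len(embeddings)
    if not 0 <= n_refs < m:
        raise ValueError("n_refs must be in [0, number of agents)")
    refs = {}
    for i in range(m):
        others = [j for j in range(m) if j != i]
        # Nearest peers first (smallest cognitive distance).
        others.sort(key=lambda j: cosine_distance(embeddings[i], embeddings[j]))
        chosen = others[:n_refs]
        for slot in range(n_refs):
            # With probability beta, swap this slot for a random unchosen peer.
            if rng.random() < beta:
                pool = [j for j in others if j not in chosen]
                if pool:
                    chosen[slot] = rng.choice(pool)
        refs[i] = chosen
    return refs
```

With β = 0 every agent reads only its most similar peers, which is the echo-chamber regime; raising β injects more distant views while the per-agent input stays capped at N drafts.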
Optimization Features
Token Efficiency
- fixed in-degree N to cap per-agent input processing
System Optimization
- temperature scheduling (diverse initial round, focused refinement rounds)
- reference selection to limit extraneous load
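The temperature schedule described here can be as simple as a two-level rule. The sketch below assumes a zero-based round index and uses placeholder temperature values; the summary does not state the values the authors used.

```python
def temperature_for_round(round_idx, initial=1.0, refinement=0.3):
    """Higher temperature for the first round, lower for all refinement rounds.

    The default values are illustrative assumptions, not reported settings.
    """
    if round_idx < 0:
        raise ValueError("round_idx must be non-negative")
    return initial if round_idx == 0 else refinement


# Schedule for a three-round run.
schedule = [temperature_for_round(r) for r in range(3)]
```

The intent is that the first round spreads agents across different solution paths and later rounds converge on refining them.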
Reproducibility
Data Urls
- LiveBench (White et al., 2025)
- CommonGen-Hard (Madaan et al., 2023)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Attention entropy and perplexity are diagnostic proxies, not universal test‑time signals.
- CoThinker can add extraneous coordination cost and underperform on low‑intrinsic‑load tasks like simple instruction following.
- TMS benefits depend on model willingness to produce intermediate steps; some models refuse step-by-step outputs.
- Computational and API cost increase with agent count and rounds.
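The cost limitation can be made concrete with a back-of-the-envelope estimate. The sketch below assumes every agent is called once per round, one synthesizer call at the end, and that each refinement prompt contains the task plus N peer drafts; it ignores the agent's own prior draft, the memory summary, and prompt boilerplate, so real token counts will be higher. All parameter names are illustrative.

```python
def estimate_cost(num_agents, rounds, num_refs, avg_output_tokens, task_tokens):
    """Rough LLM call and token counts for one multi-agent run."""
    calls = num_agents * rounds + 1  # +1 for the synthesizer
    initial_input = num_agents * task_tokens
    refine_input = num_agents * (rounds - 1) * (
        task_tokens + num_refs * avg_output_tokens
    )
    synth_input = task_tokens + num_agents * avg_output_tokens
    return {
        "calls": calls,
        "input_tokens": initial_input + refine_input + synth_input,
        "output_tokens": calls * avg_output_tokens,
    }


# Example with assumed sizes: 6 agents, 3 rounds, 2 references,
# 300-token drafts, 200-token task description.
example = estimate_cost(6, 3, 2, 300, 200)
```

Under these assumptions the example comes to 19 calls, 12,800 input tokens, and 5,700 output tokens, compared with a single call for a single-agent baseline.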
When Not To Use
- Simple execution or instruction‑following tasks with low intrinsic cognitive load.
- When compute or API budget is tight and latency matters.
- When base models refuse to expose intermediate reasoning or steps.
Failure Modes
- Echo chambers if β is too low (agents become over‑similar and converge prematurely).
- Overload from too many agents or too large reference sets (extraneous CL outweighs benefits).
- TMS is ineffective if agents withhold intermediate reasoning or produce only terse outputs.
Core Entities
Models
- Gemini-1.5-Flash-8B
- Gemini-1.5-Flash
- Gemini-1.5-Pro
- GPT5-Nano
- Qwen3-30B-A3B
- GPT-OSS-20B
- Mistral-7B
- Qwen3-8B
Metrics
- normalized score
- attention entropy
- perplexity (PPL)
- 10-dim CommonGen rubric
- task-specific raw scores
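Of the metrics above, perplexity has a standard definition: the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of natural-log token probabilities, which is an assumed input format; how the authors obtained log-probabilities for peer outputs is not stated in this summary.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Toy input: four tokens, each assigned probability 0.5 by the model.
toy_logprobs = [math.log(0.5)] * 4
toy_ppl = perplexity(toy_logprobs)
```

For the toy input the perplexity is 2, meaning the model is on average as uncertain as a fair choice between two tokens; lower values indicate text the model finds easier to process.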
Datasets
- LiveBench
- CommonGen-Hard
- AMPS
- FLASK
- AMPS-Hard
Benchmarks
- LiveBench
- CommonGen-Hard

