CoThinker: use Cognitive Load Theory to make LLM teams solve high‑load tasks

June 7, 2025 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

1

Authors

HaoYang Shang, Xuan Liu, Zi Liang, Jie Zhang, Haibo Hu, Song Guo

Why It Matters For Business

Designing LLM teams with shared memory and structured communication reduces reasoning failures on complex problems, improving solution quality for data analysis and math tasks while requiring careful tuning to avoid extra coordination cost.

Summary TLDR

LLMs struggle when a task forces them to hold and integrate many interacting facts at once. The authors map human Cognitive Load Theory (CLT) to LLMs (attention as working memory), show diagnostic signals (attention entropy and perplexity), and introduce CoThinker: a multi-agent in‑context system that (1) assigns dynamic thinking styles, (2) keeps a shared transactive memory, and (3) moderates peer communication with a small‑world graph. On challenging benchmarks (LiveBench, CommonGen‑Hard) CoThinker improves math/reasoning and concept‑integration tasks versus single-agent and debate baselines, but it can hurt simple instruction-following due to coordination overhead.

Problem Statement

LLMs hit a performance ceiling on multi-faceted tasks because in‑context examples and constraints overload the model's selective attention (its working memory analogue). The paper argues that this "cognitive overload" explains degeneration, lack of diversity, and failure to meet multiple constraints, and that multi-agent coordination designed with CLT principles can mitigate the problem.

Main Contribution

Formalized a mapping from human Cognitive Load Theory to LLM attention and in‑context limits, and validated it with attention entropy and perplexity probes.

Designed CoThinker, a CLT‑grounded multi‑agent architecture with dynamic thinking styles, a transactive memory system (TMS), and a communication moderator that enforces a small‑world communication graph.

Empirically validated CoThinker across LiveBench and CommonGen‑Hard on multiple LLMs, with ablations showing how N (references), β (exploration), and M (agents) trade off performance and coordination cost.

Key Findings

Attention entropy rises with task complexity, consistent with higher working‑memory demands.

Numbers: Attention entropy: Level1=4.44 → Level3=5.04 → Level4=6.10
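This summary does not give the paper's exact entropy definition, so the following is only a minimal sketch that assumes Shannon entropy (in nats) over a single row of attention weights. The function name `attention_entropy` and the toy weight vectors are illustrative, not taken from the paper. It shows why attention spread thinly over many tokens scores higher than attention concentrated on a few.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (nats) of one attention row.

    Assumption: weights are non-negative; they are normalized here so the
    function also accepts unnormalized scores.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy rows (hypothetical): one focused on a single token, one fully diffuse.
focused = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]  # uniform over 4 tokens: entropy is ln(4)

focused_entropy = attention_entropy(focused)
diffuse_entropy = attention_entropy(diffuse)
```

Under this definition, the diffuse row has the larger entropy, which is the direction of the trend reported across difficulty levels.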

Structured instructions reduce uncertainty for hard tasks but add cost for easy tasks.

Numbers: Perplexity (Hard): 120.5 → 85.35 as instruction complexity increases (levels 1→3); (Easy) stays ~3.37–3.45
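As a reminder of what the perplexity numbers measure, here is a minimal sketch using the standard definition: the exponential of the mean negative log-likelihood per token. The helper name and the toy log-probabilities are assumptions for illustration; how the authors obtained token log-probabilities is not described in this summary.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(logprob))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical sequence where every token was assigned probability 0.5.
example_ppl = perplexity([math.log(0.5)] * 4)
```

A model that assigns probability 0.5 to every token has perplexity 2, so lower values mean the model finds the text easier to predict.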

CoThinker improves average scores on high‑CL benchmarks versus single‑agent IO baselines.

Numbers: LiveBench Avg (normalized): Gemini‑1.5‑Flash‑8B IO=1.00 → CoThinker=1.07; Gemini‑1.5‑Pro IO=1.85 → CoThinker=2.09

Transactive Memory and Styles cut model perplexity (easier processing of peer outputs).

Numbers: Qwen3-8B PPL: Baseline 6.56 → Styles 3.58 → TMS 1.69

Communication hyperparameters have clear optima: N ≈ 2–3; β trades off echo chambers vs integration cost; M has non‑monotonic returns.

Numbers: Ablation: optimal N=2–3; moderate β (0.3) recommended; increasing M helps then harms

Results

Attention Entropy

Value: 4.44 → 6.10 across difficulty levels

Baseline: Level 1

Perplexity (Hard tasks)

Value: 120.50 → 85.35 (instruction levels 1→3)

Baseline: Level 1 instruction

LiveBench average (normalized)

Value: Gemini‑1.5‑Flash‑8B: IO=1.00 → CoThinker=1.07

Baseline: Single-Agent IO

LiveBench average (normalized)

Value: Gemini‑1.5‑Pro: IO=1.85 → CoThinker=2.09

Baseline: Single-Agent IO

Perplexity reduction from components

Value: Qwen3-8B PPL: 6.56 → Styles 3.58 → TMS 1.69

Baseline: Baseline

Who Should Care

What To Try In 7 Days

Run a small CoThinker prototype (M=6, N=2–3, β≈0.3) on one high‑complexity task and compare it against a single‑agent baseline.

Add a concise transactive memory summary step to your agent pipeline to avoid redundant recomputation.

Use style prompts (1–2 sentences) to diversify agent approaches instead of fixed heavy role personas.

Agent Features

Memory

  • collective working memory (TMS summary)
  • expertise directory ('who knows what')
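The two memory components listed above could be held in a structure like the one below. This is a hypothetical sketch, not the authors' implementation: the class name, field names, and the character cap on the summary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TransactiveMemory:
    """Shared state that every agent reads at the start of a round."""

    summary: str = ""  # collective working memory (TMS summary)
    expertise: Dict[str, Set[str]] = field(default_factory=dict)  # who knows what

    def record_expertise(self, agent_id: str, topics: List[str]) -> None:
        """Note that an agent has worked on the given topics."""
        self.expertise.setdefault(agent_id, set()).update(topics)

    def who_knows(self, topic: str) -> List[str]:
        """Return the agents associated with a topic, in sorted order."""
        return sorted(a for a, known in self.expertise.items() if topic in known)

    def update_summary(self, new_summary: str, max_chars: int = 500) -> None:
        """Replace the summary; the cap (an assumption) keeps it concise."""
        self.summary = new_summary[:max_chars]


tms = TransactiveMemory()
tms.record_expertise("agent_0", ["algebra", "edge cases"])
tms.record_expertise("agent_1", ["algebra"])
tms.update_summary("Both agents agree on the substitution step.")
```

Keeping the summary short is what lets agents avoid re-deriving shared conclusions without inflating each prompt.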

Planning

  • iterative refinement rounds (T_max = 3 by default)
  • dynamic thinking style orchestration
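The round structure can be sketched as a simple loop. Everything here is an assumption layered on the bullets above: `call_llm` is a hypothetical callable you supply (prompt in, text out), the prompt wording is invented, and refreshing the shared summary with one extra LLM call per refinement round is a design guess rather than something this summary confirms.

```python
from typing import Callable, Dict, List


def run_rounds(
    task: str,
    styles: List[str],
    references: Dict[int, List[int]],
    call_llm: Callable[[str], str],
    max_rounds: int = 3,
) -> str:
    """Draft once, refine for the remaining rounds, then synthesize.

    styles: one short thinking-style prompt per agent.
    references: for each agent index, the peer indices whose drafts it reads.
    """
    # Round 0: each agent drafts independently under its own style.
    drafts = [call_llm(f"{style}\nTask: {task}") for style in styles]

    for _ in range(1, max_rounds):
        # Assumed step: refresh the shared summary from the current drafts.
        summary = call_llm("Summarize shared progress:\n" + "\n".join(drafts))
        new_drafts = []
        for i, style in enumerate(styles):
            peers = "\n".join(drafts[j] for j in references[i])
            new_drafts.append(
                call_llm(
                    f"{style}\nTask: {task}\nShared memory: {summary}\n"
                    f"Peer drafts:\n{peers}\nRefine your answer."
                )
            )
        drafts = new_drafts

    # A final synthesizer call merges the last-round drafts.
    return call_llm("Synthesize a final answer:\n" + "\n".join(drafts))
```

With M agents and T rounds this sketch makes M + (T − 1)(M + 1) + 1 calls, which is one way to see why cost grows with both agent count and rounds.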

Tool Use

  • LLM APIs (various commercial and open models)
  • semantic embeddings for cognitive distance
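This summary does not say which metric turns embeddings into a "cognitive distance". The sketch below assumes cosine distance, a common choice, and uses toy two-dimensional vectors in place of real embeddings; the function name is illustrative.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of two agents' drafts.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
```

A moderator could rank peers by this distance when choosing which drafts an agent should read.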

Frameworks

  • can augment AutoGen
  • compatible with MetaGPT-style pipelines

Is Agentic

true

Architectures

  • multi-agent in-context learning
  • small-world communication graph
  • transactive memory system (collective WM)

Collaboration

  • communication moderator selecting N references
  • probabilistic rewiring (β) for diversity
  • synthesizer agent for final solution
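One way to picture the moderator is a Watts-Strogatz-style construction: start from a ring where each agent reads its N nearest successors, then replace each reference with a random other agent with probability β. This is a sketch under that assumption; the paper's actual selection rule (for example, how cognitive distance enters) is not spelled out in this summary, and `build_references` is a hypothetical name.

```python
import random
from typing import Dict, List, Optional


def build_references(
    num_agents: int, n_refs: int, beta: float, seed: Optional[int] = None
) -> Dict[int, List[int]]:
    """Map each agent to the n_refs peers whose outputs it will read."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    if not 0 < n_refs < num_agents:
        raise ValueError("need 0 < n_refs < num_agents")
    rng = random.Random(seed)
    refs: Dict[int, List[int]] = {}
    for i in range(num_agents):
        # Regular ring lattice: the next n_refs agents around the ring.
        chosen = [(i + k) % num_agents for k in range(1, n_refs + 1)]
        for idx in range(n_refs):
            if rng.random() < beta:
                # Rewire to a random agent that is neither self nor already read.
                candidates = [
                    j for j in range(num_agents) if j != i and j not in chosen
                ]
                if candidates:
                    chosen[idx] = rng.choice(candidates)
        refs[i] = chosen
    return refs


# Settings mirroring the suggested prototype: M=6 agents, N=2, beta=0.3.
graph = build_references(num_agents=6, n_refs=2, beta=0.3, seed=0)
```

With β = 0 every agent only ever reads its ring neighbours, which matches the echo-chamber risk; higher β mixes in distant peers while each agent still reads exactly N drafts.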

Optimization Features

Token Efficiency

  • fixed in-degree N to cap per-agent input processing
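A back-of-the-envelope sketch of why a fixed in-degree caps input size: per-agent input grows with N but not with the number of agents M. The formula and every token count below are assumptions chosen for illustration, not measurements from the paper.

```python
def per_agent_input_tokens(
    task_tokens: int, summary_tokens: int, ref_tokens: int, n_refs: int
) -> int:
    """Rough input size for one agent in a refinement round.

    Assumes the prompt holds the task, the shared summary, and n_refs peer
    drafts of about ref_tokens each. Note that M does not appear.
    """
    return task_tokens + summary_tokens + n_refs * ref_tokens


# Hypothetical sizes: 500-token task, 200-token summary, 400-token drafts.
with_two_refs = per_agent_input_tokens(500, 200, 400, n_refs=2)
with_three_refs = per_agent_input_tokens(500, 200, 400, n_refs=3)
```

Adding agents raises the total number of calls, but under this model each individual prompt stays the same size as long as N is fixed.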

System Optimization

  • temperature scheduling (diverse initial round, focused refinement rounds)
  • reference selection to limit extraneous load
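The temperature schedule described above can be as simple as a two-level rule. The specific values (1.0 for the first round, 0.3 afterwards) and the function name are assumptions; this summary only states that the initial round is diverse and later rounds are focused.

```python
def temperature_for_round(
    round_index: int, explore_temp: float = 1.0, refine_temp: float = 0.3
) -> float:
    """Higher temperature for the initial round, lower for refinement rounds."""
    if round_index < 0:
        raise ValueError("round_index must be non-negative")
    return explore_temp if round_index == 0 else refine_temp


# Schedule for three rounds, indexed from 0.
schedule = [temperature_for_round(r) for r in range(3)]
```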

Reproducibility

Data Urls

  • LiveBench (White et al., 2025)
  • CommonGen-Hard (Madaan et al., 2023)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Attention entropy and perplexity are diagnostic proxies, not universal test‑time signals.
  • CoThinker can add extraneous coordination cost and underperform on low‑intrinsic‑load tasks like simple instruction following.
  • TMS benefits depend on model willingness to produce intermediate steps; some models refuse step-by-step outputs.
  • Computational and API cost increase with agent count and rounds.

When Not To Use

  • Simple execution or instruction‑following tasks with low intrinsic cognitive load.
  • When compute or API budget is tight and latency matters.
  • When base models refuse to expose intermediate reasoning or steps.

Failure Modes

  • Echo chambers if β is too low (agents over‑similar and converge prematurely).
  • Overload from too many agents or too large reference sets (extraneous CL outweighs benefits).
  • TMS ineffective if agents do not share intermediate reasoning or produce terse outputs.

Core Entities

Models

  • Gemini-1.5-Flash-8B
  • Gemini-1.5-Flash
  • Gemini-1.5-Pro
  • GPT5-Nano
  • Qwen3-30B-A3B
  • GPT-OSS-20B
  • Mistral-7B
  • Qwen3-8B

Metrics

  • normalized score
  • attention entropy
  • perplexity (PPL)
  • 10-dim CommonGen rubric
  • task-specific raw scores

Datasets

  • LiveBench
  • CommonGen-Hard
  • AMPS
  • FLASK
  • AMPS-Hard

Benchmarks

  • LiveBench
  • CommonGen-Hard