Short-term Memory Papers — Parsed & Scored for Practitioners

AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

0.60

0.70

13

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Key finding

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld Table 2).

LLMs are powerful text engines but lack the grounded action and world models needed for true AGI

0.40

0.45

0.40

10

LLMs are strong language tools but not reliable autonomous reasoners; businesses should treat them as assistants, validate critical outputs, and invest in grounded data and robust evaluation before automating decisions.

Key finding

LLMs score highly on many language-style standardized exams but fall behind on reasoning-heavy subjects.

Numbers: GPT-4: GRE Verbal ~169/170 (~99th percentile), SAT Math ~700/800 (~89th percentile); poor on Gaokao/JEE (see AGIEval/JEEBench)

RoleLLM: a dataset and recipe to teach LLMs character-level role-playing

0.60

0.50

8

RoleBench and RoCIT let teams fine-tune open LLMs to mimic character voices and embed role facts, reducing dependence on costly closed-source APIs and long prompts.

Key finding

Context-Instruct substantially boosts role-specific knowledge (SPE metric).

Numbers: SPE: 21.4 -> 38.1

An LLM trading agent that uses working + layered long-term memory and a dynamic trader profile to beat standard baselines on backtests

0.50

0.70

0.60

7

FINMEM shows LLM agents with structured, time-aware memory can produce better risk-adjusted returns in backtests while using shorter training histories—helpful for trading newer stocks or fast deployment.

Key finding

FINMEM achieved the highest backtested cumulative return and risk-adjusted performance across tested stocks.

Numbers: TSLA cumulative return = 61.78%, Sharpe = 2.6789 (Table 2)

RAISE: add a short-term scratchpad and retrieved examples, then fine-tune an LLM for better multi-turn agents

0.60

0.50

0.60

6

RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, and then fine-tuning on under 1k curated scenes.

Key finding

RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.

Numbers: Overall quality 7.71 (RAISE, fine-tuned Qwen-14B-Chat) on 100 eval scenes

Call web search, code execution and a 'Mind‑Map' memory agent to make LLMs do long, research-style reasoning

0.60

0.70

0.60

5

Adding a small set of high-quality agents (search, coding, structured memory) can raise correctness on complex, knowledge‑intensive tasks by ~10 percentage points, enabling faster research and automation at the cost of higher compute and external data reliance.

Key finding

Agentic Reasoning raised Humanity's Last Exam accuracy to 23.8%, improving the base model by 14.4 percentage points.

Numbers: 23.8% (Agentic w/ DeepSeek-R1) vs 9.4% (DeepSeek-R1); +14.4

Proposes a centralized Working Memory Hub plus Episodic Buffer to give LLM agents persistent, episode-level memory

0.30

0.60

0.50

5

Adding a persistent memory layer lets agents remember prior interactions and coordinate across agents, improving consistency in multi-step workflows and reducing repeated user prompts.

Key finding

Most current LLM agent designs treat interactions as isolated episodes without linked episodic memory.

SOCIALGYM 2.0: configurable MARL simulator and benchmark for multi-robot social navigation

0.45

0.60

0.50

3

SOCIALGYM 2.0 shortens iteration time for multi-robot navigation by providing modular, configurable simulations, reducing risky trial-and-error on hardware and enabling targeted benchmarking for specific crowded scenarios.

Key finding

There is no single MARL algorithm that consistently wins across social mini-games.

Numbers: Collision rates and lengths vary widely; example: collision rate range 0–2.8 across Table II

A stateful, conversational benchmark that tests LLMs using tools in live multi-turn dialogs

0.60

0.70

0.50

2

ToolSandbox tests realistic, multi-turn tool use and highlights where agents hallucinate, fail to sequence dependent actions, or mis-handle time—insights you need before putting LLM agents in customer-facing automation.

Key finding

Proprietary models outperform open-source models by a large margin on ToolSandbox tasks.

Numbers: Top scores: GPT-4o 73.0 vs Hermes 31.4 (Table 5)

RAFA: plan several steps with an LLM, execute only the first, replan — provable √T regret and strong sample efficiency

0.60

0.80

0.60

2

RAFA reduces costly environment trials by using LLMs as in-context model estimators and planning ahead, so you can ship agents that learn faster without fine-tuning models.

Key finding

RAFA achieves state-of-the-art success on ALFWorld.

Numbers: 99.25% total success rate (ALFWorld tasks)
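The plan, execute-first, replan loop named in the title can be illustrated with a toy sketch. This is not the paper's implementation: `LineWorld` is a hypothetical one-dimensional environment and `stub_planner` is a deterministic stand-in for the LLM planner. The sketch omits RAFA's memory buffer and any model estimation, and shows only the control flow.

```python
from typing import Callable, List, Tuple


class LineWorld:
    """Hypothetical toy environment: walk along a number line from 0 to a goal."""

    def __init__(self, goal: int) -> None:
        self.state = 0
        self.goal = goal

    def step(self, action: int) -> int:
        # action is +1 or -1; returns the new state
        self.state += action
        return self.state

    def done(self) -> bool:
        return self.state == self.goal


def stub_planner(state: int, goal: int, horizon: int) -> List[int]:
    """Stand-in for an LLM planner (assumption): propose up to `horizon` actions."""
    direction = 1 if goal > state else -1
    steps = min(horizon, abs(goal - state))
    return [direction] * steps


def plan_execute_first_replan(
    env: LineWorld,
    planner: Callable[[int, int, int], List[int]],
    horizon: int = 3,
    max_steps: int = 20,
) -> List[Tuple[int, int]]:
    """Plan several steps ahead, execute only the first action, then replan."""
    history: List[Tuple[int, int]] = []
    for _ in range(max_steps):
        if env.done():
            break
        plan = planner(env.state, env.goal, horizon)  # multi-step plan
        if not plan:
            break
        first_action = plan[0]  # discard the rest of the plan
        new_state = env.step(first_action)
        history.append((first_action, new_state))
    return history


env = LineWorld(goal=4)
trace = plan_execute_first_replan(env, stub_planner)
print(trace)
```

Replanning after every action means each new plan starts from the freshly observed state, so a plan that has gone stale is dropped after one step.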

Use an LLM-based agent with memory, world knowledge, and a graph tool to improve zero-shot next-location prediction

0.50

0.70

0.30

1

AgentMove improves zero-shot location ranking across cities without retraining a local model, so companies can prototype personalized recommendations or prefetching where labeled local mobility data is scarce.

Key finding

AgentMove wins most metrics versus baselines in zero-shot tests.

Numbers: Best results in 8 of 12 metrics; improvements range 3.33%–8.57%

Train agents to internalize human hints so they stop relying on ever-growing prompts

0.70

0.60

0.80

1

You can convert repeated human guidance into model updates that reduce prompt length, cut inference cost, and raise multi-tool task reliability with modest annotation work.

Key finding

After three rounds MNM achieves 97.9% success on ToolQA.

Numbers: 97.9% success (Table 2, Round 3)

Multi‑agent ReAct Game Master outperforms prompt‑only GM in solo RPGs

0.40

0.60

0.50

1

An agentic ReAct design with a memory agent measurably raises player immersion, coherence, and replay intent; studios can add AI DMs that scale solo-play experiences and increase engagement.

Key finding

Players rated the agentic v2 higher on multiple engagement measures.

Numbers: N=12; 9/14 constructs significant; example: Mastery 0.68→2.33 (p=0.004)

Split planning, decision and reflection agents to boost mobile UI automation and task completion

0.60

0.50

1

Splitting UI automation into planning, decision, and reflection raises task success for real-device workflows, reducing manual scripting and improving coverage for multi-app automation and testing.

Key finding

Multi-agent design raises task completion vs single-agent Mobile-Agent.

Numbers: average SR +27% (across English/Chinese evals)

StreamChat: real-time streaming video QA with hierarchical memory and sub-second latency

0.60

0.65

0.50

1

StreamChat makes interactive video assistants and robotics feasible by cutting latency below 1s and improving streaming QA accuracy, reducing user wait and raising answer quality in live settings.

Key finding

STREAMCHAT (Slow) achieves 64.7% accuracy on STREAMBENCH in the online setting.

Numbers: 64.7% acc on STREAMBENCH (Slow)

Which memory formats and retrievers best help LLM agents reason over long text

0.70

0.50

0.60

1

Choosing the right memory format and retriever raises agent accuracy and robustness for long documents; mixed memories plus iterative retrieval improve multi-hop and noisy scenarios while tuning retrieval size controls cost.

Key finding

Mixed memory (chunks + triples + atomic facts + summaries) gives the most balanced performance across tasks.

Numbers: F1=82.11% on HotPotQA, F1=68.15% on 2Wiki (iterative + mixed)

Giving LLM agents a memory 'norm' can sometimes make their emotions more human-like — but results are mixed

0.30

0.60

0.40

1

Adding an explicit memory-summary step can make agent responses more context-aware and slightly more willing to register negative emotions, which matters for chatbots, virtual characters, and user-state tracking but needs human validation before deployment.

Key finding

Adding the norm increased average negative affect across EmotionBench compared to no-norm agents.

Numbers: +1.6 (overall negative affect increase with norm)

SmartPlay: a multi-game benchmark to test LLMs as interactive agents

0.60

0.50

1

SmartPlay gives a quick, standardized way to test LLMs on interactive tasks that matter to automation: planning, handling randomness, and navigation—use it to find failure modes before deploying agents.

Key finding

GPT-4 variants outperform other LLMs on SmartPlay games.

Numbers: >20% gap vs other proprietary models on most games

BARL: Bayes-adaptive RL that makes LLMs reflectively switch strategies by maintaining and updating hypotheses

0.50

0.70

0.60

0

BARL improves test-time generalization and reduces inference token use, which can lower cloud compute costs for deployed reasoning models while modestly improving accuracy.

Key finding

Bayesian RL can produce policies that outperform any Markovian policy in uncertain tasks.

Numbers: Didactic tree: adaptive return 1.0 vs Markovian 0.25

Train LLMs with compressed KV caches: keep most performance while cutting rollout memory by ~35–53%

0.70

0.60

0.80

0

You can cut rollout memory and enable larger RL batch sizes with minimal accuracy loss, lowering GPU cost and enabling RL experiments on smaller clusters.

Key finding

Sparse-RL keeps most dense performance while saving KV memory.

Numbers: Qwen2.5-7B retains 96.8% of dense avg score (51.4 vs 53.1)
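As a quick check on the figures above, the retained share is the sparse average divided by the dense average. The helper name `retention_pct` is illustrative only and does not come from the paper.

```python
def retention_pct(sparse_score: float, dense_score: float) -> float:
    """Share of the dense score that the sparse run retains, in percent."""
    return 100.0 * sparse_score / dense_score


sparse_avg, dense_avg = 51.4, 53.1  # averages quoted in the Numbers line
retained = retention_pct(sparse_avg, dense_avg)
gap = dense_avg - sparse_avg  # absolute gap in score points

print(f"retained {retained:.1f}% of dense; absolute gap {gap:.1f} points")
```

The 96.8% figure is a ratio of average scores, so the corresponding absolute loss is 1.7 points, not 3.2.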

ACE: an agentic Retrieve‑or‑Think loop that keeps context concise and boosts multi-hop QA accuracy

0.60

0.70

0.60

0

ACE gives higher accuracy on complex question answering while avoiding many costly retrieval calls; this can reduce cloud costs and improve product accuracy for knowledge-intensive features.

Key finding

Large accuracy gains on HotpotQA compared to single-step RAG.

Numbers: HotpotQA Acc ACE 62.8% vs RAG 38.9% (+23.9 pp)

Continuum Memory: make agent memory persistent, mutable, and associative

0.60

0.70

0.60

0

CMA makes assistants keep facts up to date, recall what happened around events, and answer multi-hop queries—improving trust and utility for long-running workflows, at the cost of higher latency and added governance needs.

Key finding

Selective retention: CMA surfaces corrected facts instead of stale ones.

Numbers: CMA won 38/40 queries; Cohen's d = 1.84

Design agents around files + code to make them more composable, auditable, and maintainable

0.60

0.50

0.60

0

Treating resources as files and actions as code reduces integration work, makes agent behavior auditable, and lets teams reuse existing DevOps practices to manage agent artifacts.

Key finding

Practitioners are converging on filesystem and code abstractions for agent context and actions.

Deep Research Agents often break earlier content and citations when asked to revise reports

0.30

0.70

0.25

0

If you deploy agents to draft or revise long reports, expect them to follow edits but also to unintentionally remove or weaken unrelated content and citations, so add verification and human review steps.

Key finding

Agents follow requested edits but then break unrelated content.

Numbers: Incorporation rates mostly >90%; break rates average 31% (content) and 21% (format).