Long-term Memory Papers — Parsed & Scored for Practitioners

Survey of LLM-based medical agents: architectures, applications, and safety gaps

0.40

0.60

8

LLM agents can cut clinician workload and improve documentation and training, but current models need workflow-style validation, bias checks, and human oversight before clinical deployment.

Key finding

Surveyed literature size and scope.

Numbers: 60 studies reviewed (from ~300 initial hits, 80 shortlisted).

An LLM trading agent that uses working + layered long-term memory and a dynamic trader profile to beat standard baselines on backtests

0.50

0.70

0.60

7

FINMEM shows LLM agents with structured, time-aware memory can produce better risk-adjusted returns in backtests while using shorter training histories—helpful for trading newer stocks or fast deployment.

Key finding

FINMEM achieved the highest backtested cumulative return and risk-adjusted performance across tested stocks.

Numbers: TSLA cumulative return = 61.78%, Sharpe = 2.6789 (Table 2)
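
For readers less familiar with the two metrics quoted above, here is a minimal sketch of how cumulative return and an annualized Sharpe ratio are commonly computed from daily returns. The function names, the 252-trading-day convention, and the zero risk-free rate are assumptions for illustration; this summary does not state which conventions FINMEM's Table 2 uses.

```python
import math
from statistics import mean, stdev


def cumulative_return(daily_returns):
    """Compound simple daily returns into one total return."""
    total = 1.0
    for r in daily_returns:
        total *= 1.0 + r
    return total - 1.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = stdev(excess)  # sample standard deviation (needs >= 2 points)
    if sd == 0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return mean(excess) / sd * math.sqrt(periods_per_year)


# Toy data, not taken from the paper.
returns = [0.01, -0.005, 0.02, 0.0, -0.01]
print(round(cumulative_return(returns), 4))
print(round(sharpe_ratio(returns), 2))
```

Because the Sharpe ratio divides by volatility, a strategy can post a high cumulative return and still score poorly if its daily returns swing widely.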

A small GPT‑2 with recurrent memory reads 11 million tokens and finds facts big LLMs miss

0.60

0.70

0.60

7

When you must locate rare facts across very long documents, memory‑augmented models scale better and are cheaper than relying on huge LLM windows or naive RAG, so consider memory models for long‑document search and auditing.

Key finding

Recurrent memory model processes record-length inputs.

Numbers: Processed up to ~11,000,000 tokens (paper claims 11M)

A-MEM: LLM agents that build and evolve a Zettelkasten-style linked memory

0.70

0.60

0.80

6

A-MEM cuts token and inference cost by ~85–93% per memory operation while improving multi-session reasoning, making long-term conversational agents materially cheaper and more capable to run at scale.

Key finding

A-MEM improves DialSim QA F1 over baselines.

Numbers: DialSim F1: A-MEM 3.45 vs LoCoMo 2.55 (+35%) vs MemGPT 1.18 (+192%)
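
The percentages in parentheses are relative gains over each baseline, not absolute F1 differences. A minimal arithmetic check, with `relative_gain` as a hypothetical helper name:

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


a_mem, locomo, memgpt = 3.45, 2.55, 1.18  # DialSim F1 values quoted above
print(round(relative_gain(a_mem, locomo)))  # 35
print(round(relative_gain(a_mem, memgpt)))  # 192
```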

Add an editable triplet memory to LLMs via a read/write API and vector lookup

0.30

0.60

0.40

6

An external editable memory lets products keep facts up to date, audit what an LLM used to answer, and combine scattered facts without retraining the model.

Key finding

In qualitative examples, RET-LLM produced correct answers while the base Alpaca-7B produced incorrect answers despite having the same contextual text.
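
To make the idea of an editable triplet memory concrete, here is a minimal sketch under stated assumptions. The class name `TripletMemory`, its `write`/`read` methods, and the 0.8 threshold are hypothetical and are not RET-LLM's actual API. String similarity from Python's `difflib` stands in for the vector lookup described in the title.

```python
from difflib import SequenceMatcher


class TripletMemory:
    """Toy editable store of (subject, relation, object) facts."""

    def __init__(self, threshold=0.8):
        self.facts = {}  # (subject, relation) -> object
        self.threshold = threshold

    def write(self, subject, relation, obj):
        # Writing the same (subject, relation) again overwrites the old
        # object, which is what makes the memory editable.
        self.facts[(subject.lower(), relation.lower())] = obj

    def read(self, subject, relation):
        # Fuzzy key match; string similarity stands in for vector lookup.
        query = f"{subject} {relation}".lower()
        best_obj, best_score = None, 0.0
        for (s, r), obj in self.facts.items():
            score = SequenceMatcher(None, query, f"{s} {r}").ratio()
            if score > best_score:
                best_obj, best_score = obj, score
        return best_obj if best_score >= self.threshold else None


memory = TripletMemory()
memory.write("Alice", "works at", "Acme")
memory.write("Alice", "works at", "Globex")  # edit: replaces the old fact
print(memory.read("alice", "works at"))  # Globex
```

Because every answer traces back to one stored triplet, the same structure supports auditing: logging the matched key shows which fact the model was given.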

Proposes a centralized Working Memory Hub plus Episodic Buffer to give LLM agents persistent, episode-level memory

0.30

0.60

0.50

5

Adding a persistent memory layer lets agents remember prior interactions and coordinate across agents, improving consistency in multi-step workflows and reducing repeated user prompts.

Key finding

Most current LLM agent designs treat interactions as isolated episodes without linked episodic memory.

AriGraph: combine a semantic knowledge graph and episodic memory so an LLM agent remembers and plans across long, partially observed text‑en

0.60

0.75

0.70

4

Structured, updateable graph memory lets LLM agents remember facts and episodes efficiently, improving long-horizon planning while reducing costly prompt tokens compared to heavy RAG systems.

Key finding

On Treasure Hunt (TextWorld), AriGraph achieved the full normalized score of 1.0 while Full History scored 0.47.

Numbers: AriGraph 1.0 vs Full History 0.47 (Table 4)

Zep: temporal knowledge-graph memory for agents — faster retrieval and better long-term accuracy

0.80

0.70

0.80

4

Zep returns smaller, temporally-correct context to LLMs, so agents answer complex multi-session and time-sensitive questions more accurately while cutting latency and token costs.

Key finding

Zep edges out MemGPT on DMR with gpt-4-turbo.

Numbers: 94.8% vs 93.4% (DMR, gpt-4-turbo)
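
One way to picture "temporally-correct context" is a fact store where each fact carries a validity interval and queries are answered as of a given date. The sketch below illustrates that general idea only; `Fact`, `TemporalMemory`, and their fields are hypothetical names, not Zep's API, and the code assumes facts are added in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still valid


class TemporalMemory:
    def __init__(self):
        self.facts = []

    def add(self, subject, relation, obj, when):
        # Close any currently open fact for the same subject and relation
        # rather than deleting it, so history stays queryable.
        for f in self.facts:
            if (f.subject, f.relation) == (subject, relation) and f.valid_to is None:
                f.valid_to = when
        self.facts.append(Fact(subject, relation, obj, when))

    def query(self, subject, relation, as_of):
        # Return the object that was valid on the date `as_of`, if any.
        for f in self.facts:
            if (f.subject, f.relation) != (subject, relation):
                continue
            if f.valid_from <= as_of and (f.valid_to is None or as_of < f.valid_to):
                return f.obj
        return None


mem = TemporalMemory()
mem.add("user", "lives_in", "Berlin", date(2023, 1, 1))
mem.add("user", "lives_in", "Lisbon", date(2024, 6, 1))
print(mem.query("user", "lives_in", date(2023, 12, 31)))  # Berlin
print(mem.query("user", "lives_in", date(2024, 7, 1)))    # Lisbon
```

Returning only the fact valid at the query date is what keeps the context small: superseded facts stay in storage but never reach the prompt.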

EHRAgent — an LLM that writes, runs, and debugs Python to answer complex EHR table queries with four-shot prompts

0.60

0.50

2

EHRAgent reduces dependence on data engineers by letting clinicians ask EHR questions in plain language and getting accurate answers; this can speed workflows but increases runtime/API calls and needs privacy safeguards.

Key finding

EHRAgent substantially improves EHR multi-table QA success rates versus prior LLM agent baselines.

Numbers: Up to +29.6 percentage points success rate (TREQS) vs strongest baseline

LLM agent that perceives landmarks, stores memories, and plans to navigate cities without step-by-step instructions

0.60

0.70

0.50

2

PReP shows you can build autonomous navigation agents that operate without explicit step-by-step instructions and with far less RL data, enabling faster prototyping for navigation assistants, accessibility tools, and search-and-rescue prototypes.

Key finding

PReP substantially improves success rate over reactive and other LLM prompting baselines.

Numbers: Average SR ≈ 54% across four city test sets

ALAS: modular LLM agents with persistent memory that locally repair plans under runtime disruptions

0.60

0.70

0.60

2

ALAS turns LLMs into practical schedulers for dynamic operations by keeping state, validating plans, and repairing disruptions locally—reducing rework, travel, and missed deadlines in logistics and operations.

Key finding

ALAS produces shorter ride-sharing routes than standalone LLM baselines on the URS task.

Numbers: Average distance 95.1 km vs 118.9 km (20% reduction, p<0.01)

LOCOMO: a benchmark of very long, multimodal conversations to test LLM memory

0.30

0.70

0.35

2

Memory across many sessions matters for user retention and personalization; current LLMs make many factual and temporal errors, so products should combine retrieval of compact facts with human oversight for critical flows.

Key finding

Humans far outperform models on long-term QA.

Numbers: Human overall F1 87.9 vs best model ~37.8 (gpt-3.5-16k)

Share a recurrent transformer memory so agents coordinate implicitly and solve long narrow‑corridor pathfinding

0.60

0.50

1

SRMT offers a lightweight way to improve decentralized multi-robot coordination without centralized control; it can cut coordination failures and extend policies trained on small maps to larger deployments.

Key finding

SRMT keeps near-perfect cooperative success on long corridors after training on short ones.

Numbers: trained on corridors 3–30 cells; CSR ≈1.0 up to 400 cells, drops to 0.8 beyond 400

A small, formal language that turns vague memory commands into safe, verifiable operations for LLM agents

0.60

0.50

1

Text2Mem makes memory commands predictable and auditable. That reduces bugs from inconsistent agent behavior, improves portability across memory backends, and makes long-running agent behavior testable and repeatable.

Key finding

Text2Mem defines a fixed inventory of twelve memory operations covering encoding, storage, and retrieval.

Numbers: 12 operations (Table I; encoding/storage/retrieval split)

StreamChat: real-time streaming video QA with hierarchical memory and sub-second latency

0.60

0.65

0.50

1

StreamChat makes interactive video assistants and robotics feasible by cutting latency below 1s and improving streaming QA accuracy, reducing user wait and raising answer quality in live settings.

Key finding

STREAMCHAT (Slow) achieves 64.7% accuracy on STREAMBENCH in the online setting.

Numbers: 64.7% acc on STREAMBENCH (Slow)

LIFESTATE-BENCH: fact-based episodic tests that measure whether LLMs form and keep story-like memory

0.40

0.60

0.35

1

If your product uses chat agents over long sessions, external context or retrieval beats one-off parameter edits for keeping factual state; otherwise agents will lose facts and relationship context as conversations grow.

Key finding

Non-parametric context methods beat parametric tuning for episodic memory tasks

Numbers: DeepSeek-R1 Hamlet direct concat 67.3% vs Llama3.1 LoRA ~25% (on same dataset)

Teach agents reusable web workflows from past traces to boost web-navigation success

0.60

0.65

0.45

1

Inducing and reusing compact workflows turns past agent traces into practical, reusable skills that increase success rates and reduce execution steps on web automation tasks, saving time and API costs.

Key finding

AWM raises overall success rate on WebArena versus a strong autonomous baseline.

Numbers: AWM 35.5 SR vs BrowserGym 23.5 SR; +12.0 abs (+51.1% rel)

Which memory formats and retrievers best help LLM agents reason over long text

0.70

0.50

0.60

1

Choosing the right memory format and retriever raises agent accuracy and robustness for long documents; mixed memories plus iterative retrieval improve multi-hop and noisy scenarios while tuning retrieval size controls cost.

Key finding

Mixed memory (chunks + triples + atomic facts + summaries) gives the most balanced performance across tasks.

Numbers: F1=82.11% on HotPotQA, F1=68.15% on 2Wiki (iterative + mixed)

AI PERSONA: lightweight, retrain-free framework for life‑long LLM personalization

0.70

0.60

0.80

1

Provides scalable personalization without retraining large models: store tiny per‑user configs and update them via prompts, which improves satisfaction and shortens conversations.

Key finding

Updating persona every 3 sessions (k=3) yields near‑golden personalization.

Numbers: Helpfulness 8.29 vs Golden 8.34; Personalization 7.63 vs 7.78 (Table 1)
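
The "update every k sessions" schedule can be sketched as a small per-user store that buffers session observations and refreshes the persona only when k of them have accumulated. All names here (`PersonaStore`, `end_session`, `merge_observations`) are hypothetical, and the dictionary merge is a stand-in for the prompt-based update the framework uses.

```python
from collections import defaultdict


def merge_observations(persona, sessions):
    """Stand-in for the prompt-based update; here, later values just win."""
    updated = dict(persona)
    for observed in sessions:
        updated.update(observed)
    return updated


class PersonaStore:
    def __init__(self, k=3, updater=merge_observations):
        self.k = k
        self.updater = updater
        self.personas = defaultdict(dict)  # user_id -> small config dict
        self.pending = defaultdict(list)   # user_id -> sessions since update

    def end_session(self, user_id, observed):
        # Buffer what this session revealed; refresh the persona only
        # once k sessions have accumulated.
        self.pending[user_id].append(observed)
        if len(self.pending[user_id]) >= self.k:
            self.personas[user_id] = self.updater(
                self.personas[user_id], self.pending[user_id]
            )
            self.pending[user_id] = []
        return self.personas[user_id]


store = PersonaStore(k=3)
store.end_session("u1", {"tone": "formal"})
store.end_session("u1", {"language": "en"})
print(store.end_session("u1", {"tone": "casual"}))
```

A larger k means fewer update calls per user at the cost of a persona that lags recent sessions.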

Open-source projects store agent instructions in special README-like files, but those files focus on how to run code and rarely specify non‑

0.60

0.45

0.60

0

Agent context files control what AI developers do in your codebase. If they lack security or performance rules, agents will likely produce code that works but is vulnerable or inefficient. Treat these files like configuration and governance documents so agents follow team standards.

Key finding

Collected 2,303 agent context files across 1,925 repositories.

Numbers: 2,303 files; 1,925 repos

Make software teams of humans + autonomous, norm-aware AI agents that plan, remember, and self-regulate

0.20

0.60

0.40

0

This design lets teams scale software work with many specialized AI agents while enforcing rules (privacy, security, legal). That reduces manual coordination and speeds routine work, but requires rule design and human oversight.

Key finding

BDIM-SE extends BDI agents by adding persistent memory and direct LLM queries to support longitudinal reasoning.

Keep agent context small forever by storing task state as files, yielding more stable long-run behavior for research workflows

0.60

0.70

0.60

0

If your workflows involve long document processing or multi-step knowledge work, state management matters more than raw model size. A file-centric agent design can make smaller, cheaper models far more reliable over long runs and reduce costly re-runs.

Key finding

InfiAgent (gpt-oss-20b) scores 41.45 on DeepResearch using no task-specific fine-tuning.

Numbers: DeepResearch overall = 41.45 (Table 2)
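
A file-centric design can be illustrated with a small state file that holds the full task history while each model call sees only a bounded view. This is a sketch of the general pattern, not InfiAgent's implementation; `FileState`, the `state.json` layout, and the method names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class FileState:
    """Keeps task state on disk so the prompt context can stay small."""

    def __init__(self, workdir):
        self.path = Path(workdir) / "state.json"
        if not self.path.exists():
            self._save({"done": [], "todo": [], "notes": {}})

    def _save(self, state):
        self.path.write_text(json.dumps(state, indent=2))

    def load(self):
        return json.loads(self.path.read_text())

    def add_task(self, task):
        state = self.load()
        state["todo"].append(task)
        self._save(state)

    def complete(self, task, note):
        state = self.load()
        state["todo"].remove(task)
        state["done"].append(task)
        state["notes"][task] = note
        self._save(state)

    def context(self):
        # Bounded view for the next model call: the next task plus counts,
        # not the full history.
        state = self.load()
        nxt = state["todo"][0] if state["todo"] else None
        return {"next": nxt, "done": len(state["done"]), "todo": len(state["todo"])}


with tempfile.TemporaryDirectory() as tmp:
    fs = FileState(tmp)
    fs.add_task("read paper")
    fs.add_task("write summary")
    fs.complete("read paper", "3 key claims found")
    print(fs.context())
```

The size of `context()` does not grow with the number of completed tasks, which is the property that keeps long runs from overflowing the model's window.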

ProMem: iterative self-questioning to recover missing facts and cut downstream errors

0.60

0.50

0

Improving what an agent saves (more complete, grounded memories) raises answer quality and reduces long-term error costs; pay once for extraction, benefit many reads.

Key finding

ProMem raises memory integrity on HaluMem to 73.80%, outperforming common summary baselines.

Numbers: Memory Integrity: ProMem 73.80% vs Mem0/Supermemory ~42%

Amory: build narrative episodic memory that matches full-context quality while halving latency

0.60

0.65

0.50

0

Amory raises long-conversation answer quality substantially while avoiding full-history cost; that improves product usefulness for persistent assistants with acceptable latency.

Key finding

Combining episodic and semantic memory yields large quality gains over prior working-memory baselines.

Numbers: EM+SM overall J-score 87.7% vs Mem0 59.9% (+27.8 percentage points abs)