Retrieval Memory Papers — Parsed & Scored for Practitioners

AgentPoison: a stealthy backdoor that poisons agent memories or RAG to hijack LLM agents

0.70

9

If agents fetch data from third-party or writable corpora, an attacker can inject a few poisoned records to trigger dangerous actions while leaving overall accuracy unchanged, creating a low-noise safety and legal risk.

Key finding

AGENTPOISON forces retrieval of poisoned demonstrations with high probability.

Numbers: Average ASR-r ≈ 81.2% (retrieval success)

A-MEM: LLM agents that build and evolve a Zettelkasten-style linked memory

0.70

0.60

0.80

6

A-MEM cuts token and inference cost by ~85–93% per memory while improving multi-session reasoning, making long-term conversational agents materially cheaper and more capable to run at scale.

Key finding

A-MEM improves DialSim QA F1 over baselines.

Numbers: DialSim F1: A-MEM 3.45 vs LoCoMo 2.55 (+35%) vs MemGPT 1.18 (+192%)
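
The percentages in parentheses read as relative gains of A-MEM over each baseline. A minimal sketch of that arithmetic, using only the three F1 values quoted above (the function name `relative_gain_pct` is illustrative, not from the paper):

```python
def relative_gain_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# DialSim F1 values as quoted in the entry above.
a_mem, locomo, memgpt = 3.45, 2.55, 1.18

print(round(relative_gain_pct(a_mem, locomo)))  # 35
print(round(relative_gain_pct(a_mem, memgpt)))  # 192
```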

RAISE: add a short-term scratchpad and retrieved examples, then fine-tune an LLM for better multi-turn agents

0.60

0.50

0.60

6

RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, and then fine-tuning on under 1k curated scenes.

Key finding

RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.

Numbers: Overall quality 7.71 (RAISE, fine-tuned Qwen-14B-Chat) on 100 eval scenes

Add an editable triplet memory to LLMs via a read/write API and vector lookup

0.30

0.60

0.40

6

An external editable memory lets products keep facts up to date, audit what an LLM used to answer, and combine scattered facts without retraining the model.

Key finding

In qualitative examples, RET-LLM produced correct answers while the base Alpaca-7B produced incorrect answers despite having the same contextual text.
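
To make the idea of an editable triplet memory concrete, here is a minimal sketch. It is not RET-LLM's actual API: the class and method names are hypothetical, and exact string matching stands in for the vector lookup mentioned in the title.

```python
from typing import Optional

Triplet = tuple[str, str, str]  # (subject, relation, object)


class TripletMemory:
    """Hypothetical editable store of (subject, relation, object) facts."""

    def __init__(self) -> None:
        self._facts: set[Triplet] = set()

    def write(self, subj: str, rel: str, obj: str) -> None:
        self._facts.add((subj, rel, obj))

    def delete(self, subj: str, rel: str, obj: str) -> None:
        self._facts.discard((subj, rel, obj))

    def read(self, subj: Optional[str] = None, rel: Optional[str] = None,
             obj: Optional[str] = None) -> list[Triplet]:
        """Return facts matching every field that is not None (exact match)."""
        query = (subj, rel, obj)
        return sorted(
            fact for fact in self._facts
            if all(q is None or q == f for q, f in zip(query, fact))
        )


memory = TripletMemory()
memory.write("Alice", "works_at", "Acme")
memory.write("Acme", "located_in", "Berlin")
# Edit a fact without retraining: drop the stale triplet, write the new one.
memory.delete("Alice", "works_at", "Acme")
memory.write("Alice", "works_at", "Globex")
print(memory.read(subj="Alice"))  # [('Alice', 'works_at', 'Globex')]
```

Because every answer can be traced to the triplets returned by `read`, a store like this also supports the audit use case described above.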

Train LLM-based agents end-to-end with RL and let them ask humans for help

0.60

0.70

0.60

4

AGILE lets production agents learn when to call humans and when to act, improving accuracy while controlling human cost. That makes it practical for customer support, medical QA, and recommendation systems where mistakes are costly.

Key finding

AGILE (agile-vic13b-ppo) achieves a higher average total score on ProductQA than the GPT-4 agent.

Numbers: Total score (short answers) 0.784 vs agile-gpt4-prompt 0.718; +9.2% rel. (Table 4)

DrugPilot: LLM agent with a key-value memory pool for reliable drug-discovery tool calling

0.60

0.50

3

DrugPilot cuts manual tool switching and context failures by structuring inputs as key-value parameters, improving automation accuracy and runtime for multi-step drug workflows.

Key finding

High task-completion on TCDD tool-calling benchmark.

Numbers: Task completion: 98.0% (simple), 93.5% (multi-tool), 64.0% (multi-turn)
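
One way to picture a key-value memory pool for tool calling is a shared dictionary that tools read their parameters from and write their results to. The sketch below is an assumption about the general pattern, not DrugPilot's implementation; `ParameterPool`, `count_letters`, and `is_small` are hypothetical names, and the toy tools do no real chemistry.

```python
from typing import Any, Callable


class ParameterPool:
    """Hypothetical key-value pool shared across tool calls."""

    def __init__(self) -> None:
        self._values: dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._values[key] = value

    def call(self, tool: Callable[..., Any], needs: list[str], out_key: str) -> Any:
        """Run `tool` with the named parameters and store its result."""
        missing = [k for k in needs if k not in self._values]
        if missing:
            raise KeyError(f"missing parameters: {missing}")
        result = tool(**{k: self._values[k] for k in needs})
        self._values[out_key] = result
        return result


# Toy stand-ins for real tools (illustrative only).
def count_letters(smiles: str) -> int:
    return sum(ch.isalpha() for ch in smiles)


def is_small(letter_count: int, limit: int) -> bool:
    return letter_count <= limit


pool = ParameterPool()
pool.put("smiles", "CCO")
pool.put("limit", 10)
pool.call(count_letters, needs=["smiles"], out_key="letter_count")
print(pool.call(is_small, needs=["letter_count", "limit"], out_key="small"))  # True
```

The second tool never sees free-form context: it receives only the named parameters it declared, which is the property the entry credits for fewer context failures.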

Make LLM-based coding agents learn from and prune past shortcuts to improve code quality and stability

0.45

0.60

0.45

2

Iteratively refining and pruning agent experiences cuts noisy guidance, raises code quality by ~10% on the tested benchmark, and shrinks the stored experience set to ~11.5% of its original size, saving storage and retrieval costs.

Key finding

IER improves end-to-end software quality compared to prior experience-based methods on SRDD.

Numbers: Quality: IER-Successive 0.6372 vs ECL 0.5775 (+10.3% rel)

Which memory formats and retrievers best help LLM agents reason over long text

0.70

0.50

0.60

1

Choosing the right memory format and retriever raises agent accuracy and robustness for long documents; mixed memories plus iterative retrieval improve multi-hop and noisy scenarios while tuning retrieval size controls cost.

Key finding

Mixed memory (chunks + triples + atomic facts + summaries) gives the most balanced performance across tasks.

Numbers: F1=82.11% on HotPotQA, F1=68.15% on 2Wiki (iterative + mixed)
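
A rough sketch of iterative retrieval over a mixed memory follows. It assumes a toy word-overlap scorer in place of a real retriever, and the memory entries, question, and function names are invented for illustration; the paper's actual pipeline is not described in this entry.

```python
def tokens(text: str) -> set[str]:
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?").lower() for w in text.split()}


def retrieve(terms: set[str], memory: list[tuple[str, str]], seen: set[int]):
    """Index of the unseen entry with the largest word overlap, or None."""
    best, best_score = None, 0
    for i, (_, text) in enumerate(memory):
        score = len(terms & tokens(text))
        if i not in seen and score > best_score:
            best, best_score = i, score
    return best


def iterative_retrieve(question: str, memory: list[tuple[str, str]], rounds: int = 2):
    """Each round expands the query with words from the entry just retrieved."""
    terms, seen, picked = tokens(question), set(), []
    for _ in range(rounds):
        idx = retrieve(terms, memory, seen)
        if idx is None:
            break
        seen.add(idx)
        picked.append(memory[idx])
        terms |= tokens(memory[idx][1])
    return picked


# Mixed memory: each entry is (format, text).
memory = [
    ("chunk", "Marie Curie was born in Warsaw."),
    ("triple", "Warsaw capital_of Poland"),
    ("summary", "Article overview: physics prizes."),
]
question = "Which country contains the city where Marie Curie was born?"
print([kind for kind, _ in iterative_retrieve(question, memory)])
# ['chunk', 'triple']
```

The triple shares no words with the question, so a single retrieval pass misses it; it only becomes reachable after the first hop adds "Warsaw" to the query terms.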

A governed multi-agent runtime that makes retrieval, tools, and agent roles auditable and safe for lab-scale science

0.70

0.50

0.40

0

AISAC reduces operational risk when deploying agentic AI in regulated lab settings. It enforces auditable tool use, explicit data indexing, and per-agent knowledge scopes so teams can adopt LLM-driven assistants without losing provenance or control.

Key finding

AISAC enforces four structural guarantees for scientific reasoning.

Numbers: 4 guarantees (declared in abstract)

A multi-agent system uses LLM planning, retrieval, and large-scale simulation to design peptide/protein binders for disordered proteins on a

0.60

0.70

0

Automating the end-to-end design loop lets teams generate and triage thousands of candidate biologics quickly. This cuts the early discovery cycle time and lets experimental teams focus on a smaller, higher-quality set for wet-lab testing. The system also shows how to map compute cost vs. value by filtering cheaply and

Key finding

Der f 21: 50.98% of 787 in-silico validated designs had more favorable MM-PBSA binding free energy than the literature reference.

Numbers: 50.98% of 787 designs; 'more favorable' = mean ≤ -145.25 kcal/mol

Train multimodal LLM agents to ask or recall before moving, halving physical search cost in simulation.

0.35

0.70

0.80

0

Robots or service agents that ask and recall before moving save time and energy. Halving navigation cost directly lowers operational expense and extends robot lifespan. Better trade-offs also reduce user annoyance from frequent interruptions.

Key finding

ESearch-R1 cuts average operational cost by about half compared to a strong ReAct baseline.

Numbers: TTC reduced from 3.3 to 1.6 (≈50%) vs ReAct Qwen2.5-VL-32B on ESearch-Bench

Add a semantic timeline and durative summaries so agents recall events at the right time

0.70

0.65

0.40

0

TSM helps assistants anchor recalled facts to the time they actually occurred, improving time-sensitive answers and multi-session personalization. This can reduce wrong or stale recommendations in customer support and personal assistants.

Key finding

TSM raises overall QA accuracy on LONGMEMEVAL_S to 74.80%.

Numbers: TSM 74.80% vs A-MEM 62.60% (+12.20 pp)
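
As an illustration of the timeline idea, the sketch below stores point events alongside durative summaries (facts that hold over an interval) and recalls both for a queried time window. The data structures and names are assumptions made for this example; the entry does not describe TSM's internals.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    day: date
    text: str


@dataclass(frozen=True)
class DurativeSummary:
    start: date
    end: date
    text: str


class Timeline:
    """Hypothetical timeline holding point events and interval summaries."""

    def __init__(self) -> None:
        self.events: list[Event] = []
        self.summaries: list[DurativeSummary] = []

    def recall(self, start: date, end: date) -> list[str]:
        """Summaries overlapping [start, end], then events inside it by date."""
        hits = [s.text for s in self.summaries if s.start <= end and s.end >= start]
        hits += [e.text for e in sorted(self.events, key=lambda e: e.day)
                 if start <= e.day <= end]
        return hits


tl = Timeline()
tl.events.append(Event(date(2024, 3, 2), "User moved to Lisbon"))
tl.events.append(Event(date(2024, 7, 9), "User adopted a dog"))
tl.summaries.append(
    DurativeSummary(date(2024, 1, 1), date(2024, 4, 30), "User was job hunting")
)
print(tl.recall(date(2024, 3, 1), date(2024, 3, 31)))
# ['User was job hunting', 'User moved to Lisbon']
```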

Learn memory as decisions: train an agent to choose Create/Read/Update/Delete

0.60

0.40

0

Dynamic, learnable memory policies reduce wasted retrievals and scale better when inputs are long, shuffled, or noisy. That means more reliable answers and lower latency for applications that process many documents or multi-question sessions.

Key finding

AtomMem with RL (AtomMem-RL) achieves higher end-task exact-match (EM) than prior memory agents on evaluated benchmarks.

Numbers: Average EM: AtomMem-RL 64.0% vs MemAgent 61.7% (Table 1)
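
To show what treating memory as decisions can look like, here is a minimal sketch of a store whose only interface is the four actions named in the title. The policy is a hand-written script here, where the paper trains one with RL; all class and method names are hypothetical.

```python
class CrudMemory:
    """Hypothetical memory whose only interface is four atomic actions."""

    def __init__(self) -> None:
        self._items: dict[int, str] = {}
        self._next_id = 0

    def create(self, text: str) -> int:
        item_id = self._next_id
        self._items[item_id] = text
        self._next_id += 1
        return item_id

    def read(self, keyword: str) -> list[str]:
        return [t for t in self._items.values() if keyword.lower() in t.lower()]

    def update(self, item_id: int, text: str) -> None:
        if item_id not in self._items:
            raise KeyError(item_id)
        self._items[item_id] = text

    def delete(self, item_id: int) -> None:
        self._items.pop(item_id, None)

    def apply(self, action: str, **kwargs):
        """Dispatch one action chosen by a policy."""
        handlers = {"create": self.create, "read": self.read,
                    "update": self.update, "delete": self.delete}
        return handlers[action](**kwargs)


mem = CrudMemory()
# A trained policy would emit these actions; they are scripted by hand here.
script = [
    ("create", {"text": "Paris is the capital of France"}),
    ("create", {"text": "Unrelated distractor sentence"}),
    ("update", {"item_id": 0, "text": "Paris is the capital and largest city of France"}),
    ("delete", {"item_id": 1}),
    ("read", {"keyword": "capital"}),
]
results = [mem.apply(action, **kwargs) for action, kwargs in script]
print(results[-1])  # ['Paris is the capital and largest city of France']
```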

Continuum Memory: make agent memory persistent, mutable, and associative

0.60

0.70

0.60

0

CMA makes assistants keep facts up to date, recall what happened around events, and answer multi-hop queries—improving trust and utility for long-running workflows, at the cost of higher latency and added governance needs.

Key finding

Selective retention: CMA surfaces corrected facts instead of stale ones.

Numbers: CMA won 38/40 queries; Cohen's d = 1.84

Trace which memory or tool input actually drove an LLM agent's action.

0.60

0.55

0.35

0

You can audit autonomous agents to see which past memory or tool output caused a decision—useful for compliance, debugging, and fixing business rule violations without needing explicit failures.

Key finding

Prob. Drop&Hold hits the human-labelled top sentence 93.75% of the time (Hit@1).

Numbers: Hit@1 = 0.9375 (Table 1)
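
Hit@1 is the fraction of cases in which the top-ranked item matches the labelled one. A small sketch of the metric on made-up rankings (the sentence ids and data are illustrative, not from the paper):

```python
def hit_at_k(ranked: list[list[str]], gold: list[str], k: int = 1) -> float:
    """Fraction of cases whose gold item appears in the top-k ranked items."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)


# Toy data: each inner list is an attribution ranking over sentence ids.
ranked = [["s2", "s1"], ["s1", "s3"], ["s4", "s2"], ["s3", "s1"]]
gold = ["s2", "s1", "s2", "s3"]

print(hit_at_k(ranked, gold, k=1))  # 0.75
print(hit_at_k(ranked, gold, k=2))  # 1.0
```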

ShardMemo: budgeted, scope-correct sharded memory using masked MoE routing

0.70

0.60

0.70

0

ShardMemo reduces retrieval cost and tail latency while improving accuracy for agent workflows, making LLM-based agents faster and more reliable under budgeted memory access.

Key finding

ShardMemo improves LoCoMo QA F1 over the strongest baseline (GAM).

Numbers: Single-hop F1 64.08 vs 58.38 (+5.70) (Table 1)

AgentOS: treat LLM context as addressable memory and orchestrate sync pulses for coherent multi-agent intelligence

0.30

0.80

0.60

0

AgentOS frames LLMs as manageable systems so companies can scale multi-agent workflows with fewer hallucinations and lower token waste, but expect new engineering costs for paging and synchronization.

Key finding

A tiered memory model (L1 attention, L2 semantic RAM, L3 knowledge base) reduces reliance on a single flat context window.
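
The tiered model can be pictured as a cache hierarchy. The sketch below is only an analogy under stated assumptions: plain dictionaries with least-recently-used eviction stand in for attention (L1), semantic RAM (L2), and the knowledge base (L3), and none of the names come from AgentOS.

```python
from collections import OrderedDict


class TieredMemory:
    """Hypothetical three-tier lookup: small L1, larger L2, unbounded L3."""

    def __init__(self, l1_size: int, l2_size: int, l3: dict[str, str]) -> None:
        self.l1: OrderedDict[str, str] = OrderedDict()
        self.l2: OrderedDict[str, str] = OrderedDict()
        self.l3 = l3
        self.l1_size, self.l2_size = l1_size, l2_size

    def _insert(self, tier: OrderedDict, size: int, key: str, value: str):
        """Insert as most recent; return the evicted (key, value) or None."""
        tier[key] = value
        tier.move_to_end(key)
        return tier.popitem(last=False) if len(tier) > size else None

    def get(self, key: str) -> tuple[str, str]:
        """Return (value, tier that served it) and promote the entry to L1."""
        if key in self.l1:
            value, source = self.l1[key], "L1"
        elif key in self.l2:
            value, source = self.l2.pop(key), "L2"
        else:
            value, source = self.l3[key], "L3"  # KeyError if unknown
        evicted = self._insert(self.l1, self.l1_size, key, value)
        if evicted is not None:  # demote the least recently used L1 entry
            self._insert(self.l2, self.l2_size, *evicted)
        return value, source


mem = TieredMemory(l1_size=1, l2_size=2, l3={"a": "alpha", "b": "beta"})
print(mem.get("a"))  # ('alpha', 'L3')
print(mem.get("a"))  # ('alpha', 'L1')
print(mem.get("b"))  # ('beta', 'L3'), and "a" is demoted to L2
print(mem.get("a"))  # ('alpha', 'L2')
```

The promotion and demotion steps are where the paging cost mentioned above would show up in a real system.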

Ground LLM agents' memories at coarse and fine levels to improve planning and recovery

0.60

0

CFGM makes LLM-driven agents more reliable and cheaper in long-horizon interactive tasks by improving success rates and reducing unnecessary steps. This helps automation in web navigation, virtual assistants, and simulated-process tasks where repeated interactions are costly.

Key finding

CFGM raises AlfWorld success rate to 91.00% versus 80.60% for ReAct.

Numbers: SR 91.00% vs 80.60% (+10.40 pp)

Use a replay graph + reflective memory to turn web navigation from guesswork into a searchable map.

0.60

0.50

0

Caching past page states and corrective traces cuts live web interactions and increases task reliability, lowering latency and operational cost for automated customer-service or data-extraction agents.

Key finding

R2D2 improves overall task success on WebArena compared to baselines.

Numbers: Total SR R2D2 27.3% vs Tree-Search 19.0% (Table 1)

Argues that adding episodic (instance-specific, single-shot) memory will enable LLM agents to learn and act reliably over long timescales

0.20

0.60

0

Episodic memory would let agentic systems remember client-specific events, adapt from single interactions, and improve over time without continually growing per-request compute costs.

Key finding

Episodic memory requires five properties beyond working or semantic memory: long-term storage, explicit reasoning, single-shot acquisition, instance specificity, and contextual relations.

Train a single LLM to ask itself clarifying questions and point to exact paragraphs to solve multi‑step questions in 128K contexts

0.60

0.70

0.60

0

AgenticLU cuts long-document QA failures by teaching one model to ask clarifying questions and point to exact paragraphs, yielding large accuracy gains with small inference overhead and reasonable finetuning cost.

Key finding

AgenticLU-8B raises HotpotQA (128K) accuracy from 40.0% to 71.1%.

Numbers: HotpotQA 128K: base 40.0% → AgenticLU 71.1% (+31.1 pts).

Teach coding agents from past runs: extract and reuse 'shortcuts' to speed multi-agent software development

0.50

0.60

0

Reusing vetted past fixes reduces developer iteration time and increases the chance generated prototypes are runnable, cutting manual triage and speeding prototyping.

Key finding

Experience reuse raises the holistic software quality metric by about 71% over a strong multi-agent baseline.

Numbers: Quality 0.4267 -> 0.7304 (test set)

Query-aware indexing cuts memory search to ~11ms — 47× faster while keeping competitive accuracy

0.70

0.80

0

Cut memory search latency into the low tens of milliseconds so memory-augmented agents respond in real time while lowering infrastructure and throughput costs.

Key finding

Search latency reduced to ~11 ms per query on LoCoMo.

Numbers: Search = 11 ms (Table 2)

A practical survey of memory in LLMs: implicit weights, external retrieval, and agent memory

0.60

0.50

0.60

0

Memory systems let AI keep facts current, personalize across sessions, and reduce costly retraining; pick simple vector RAG for factual QA and reserve heavy platforms for offline analytics or research.

Key finding

Memory in LLMs is usefully grouped into three families: implicit (weights), explicit (external stores), and agentic (persistent agent memory).