Learn memory as decisions: train an agent to choose Create/Read/Update/Delete

January 13, 2026 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.4

Citation Count

0

Authors

Yupeng Huo, Yaxi Lu, Zhong Zhang, Haotian Chen, Yankai Lin

Why It Matters For Business

Dynamic, learnable memory policies reduce wasted retrievals and scale better when inputs are long, shuffled, or noisy. That means more reliable answers and lower latency for applications that process many documents or multi-question sessions.

Summary TLDR

AtomMem turns agent memory into a learnable policy over atomic CRUD (Create/Read/Update/Delete) actions. After supervised fine-tuning (SFT) and reinforcement learning (GRPO), a Qwen3-8B agent learns when to store, retrieve, revise, or delete memory entries. On multi-hop long-context QA it reaches 64.0% average exact match versus 61.7% for MemAgent, and it scales better to very long inputs.

Problem Statement

Most agent memories use fixed, hand-crafted workflows that assume one pattern fits all tasks. That rigidity wastes retrievals, forces unnecessary updates, and fails when relevant information is sparse or shuffled across very long inputs.

Main Contribution

Reformulate agent memory as a sequential decision problem and expose four atomic operations (Create, Read, Update, Delete) as actions.

Introduce a scratchpad plus external vector DB memory and train policies with a two-stage pipeline: SFT then on-policy RL (GRPO).

Show empirical gains on three multi-hop, long-context QA benchmarks and analyze how the trained policy shifts operation usage (more Create/Update/Delete, fewer Reads).

Provide ablations showing Update is crucial and that scratchpad+storage together are necessary for top performance.
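To make the "memory as a sequential decision problem" framing concrete, here is a minimal sketch in which each step of a trajectory is one atomic memory action applied to a store. The `Action` and `MemoryEnv` names, the integer entry ids, and the word-overlap Read are illustrative assumptions, not the paper's implementation (which uses an external vector DB for retrieval).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Action:
    op: str                          # "create" | "read" | "update" | "delete"
    text: str = ""                   # entry text, or the query for a read
    entry_id: Optional[int] = None   # target entry for update/delete


@dataclass
class MemoryEnv:
    """Hypothetical toy environment: a dict-backed memory the policy acts on."""
    entries: Dict[int, str] = field(default_factory=dict)
    next_id: int = 1

    def step(self, action: Action) -> List[str]:
        """Apply one atomic action and return the observation (read results)."""
        if action.op == "create":
            self.entries[self.next_id] = action.text
            self.next_id += 1
            return []
        if action.op == "read":
            # Stand-in for semantic retrieval: keep entries sharing any word.
            words = set(action.text.lower().split())
            return [t for t in self.entries.values()
                    if words & set(t.lower().split())]
        if action.op == "update":
            if action.entry_id in self.entries:
                self.entries[action.entry_id] = action.text
            return []
        if action.op == "delete":
            self.entries.pop(action.entry_id, None)
            return []
        raise ValueError(f"unknown op: {action.op}")


env = MemoryEnv()
trajectory = [
    Action("create", "Marie Curie was born in Warsaw"),
    Action("create", "Warsaw is in Germany"),
    Action("update", "Warsaw is the capital of Poland", entry_id=2),
    Action("delete", entry_id=1),
    Action("read", "Where is Warsaw"),
]
observations = [env.step(a) for a in trajectory]
print(observations[-1])  # ['Warsaw is the capital of Poland']
```

In a trained system the trajectory would come from the policy's outputs rather than a hand-written list; the point of the sketch is only that every memory change is an explicit, inspectable action.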

Key Findings

AtomMem with RL (AtomMem-RL) achieves higher end-task exact-match (EM) than prior memory agents on evaluated benchmarks.

Numbers: Average EM: AtomMem-RL 64.0% vs MemAgent 61.7% (Table 1)

Training with RL yields a large performance boost over supervised initialization.

Numbers: Average EM: SFT 53.9% -> RL 64.0% (+10.1 percentage points)

Removing the Update operation harms performance substantially.

Numbers: HotpotQA EM drops from 77.8% to 71.4% (-6.4pp) when Update is disabled (Table 2)

Scratchpad and external storage are complementary; removing both collapses performance.

Numbers: Removing both components causes a drop of more than 40 EM points averaged across tasks (Table 2)

Results

Average EM across benchmarks (200/400/800 doc settings)

Value: 64.0%

Baseline: MemAgent 61.7%

SFT

Value: 53.9% -> 64.0% (+10.1pp)

Impact of removing Update (HotpotQA EM)

Value: 77.8% -> 71.4% (-6.4pp)

Baseline: AtomMem full

Who Should Care

What To Try In 7 Days

Add a simple memory API with Create/Read/Update/Delete tags and log actions used during runs.

Train a small policy via supervised examples to follow that API, then try RL fine-tuning on a tiny task set to see if it reduces retrieval calls.

Implement a scratchpad (always-retrieved short summary) plus a vector DB and compare read counts and end-task accuracy to your current pipeline.
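The first step above (a tagged memory API plus action logging) could start as small as the sketch below. The tag syntax (`<create>…</create>`, `<update id="N">…</update>`, and so on) and the helper names are assumptions made for illustration; the summary does not specify the exact output format AtomMem uses.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional

# Hypothetical tag format: <op>payload</op>, with an optional id="N" attribute.
TAG_RE = re.compile(
    r'<(create|read|update|delete)(?:\s+id="(\d+)")?>(.*?)</\1>',
    re.DOTALL,
)


class MemoryAction(NamedTuple):
    op: str
    entry_id: Optional[int]
    payload: str


def parse_actions(model_output: str) -> List[MemoryAction]:
    """Extract every tagged memory action from one model response."""
    return [
        MemoryAction(op, int(entry_id) if entry_id else None, payload.strip())
        for op, entry_id, payload in TAG_RE.findall(model_output)
    ]


def count_ops(outputs: Iterable[str]) -> Counter:
    """Tally how often each operation appears across a run's responses."""
    counts: Counter = Counter()
    for out in outputs:
        counts.update(action.op for action in parse_actions(out))
    return counts


run_outputs = [
    "<create>Alice was born in 1990</create> <read>Alice birthplace</read>",
    '<update id="1">Alice was born in 1990 in Oslo</update><delete id="2"></delete>',
]
print(count_ops(run_outputs))
```

Comparing these per-operation counts before and after fine-tuning is a cheap way to check whether the policy's Read usage actually changes.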

Agent Features

Memory

  • atomic CRUD operations
  • scratchpad (always retrieved)
  • external vector storage (top-k retrieval)
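A rough sketch of how the scratchpad and top-k retrieval might be combined into one step's context follows. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and `build_context` and its line prefixes are hypothetical names, not part of AtomMem.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_context(scratchpad: str, entries: List[str], query: str, k: int = 6) -> str:
    """Always include the scratchpad, then the k entries most similar to the query."""
    q = embed(query)
    top = sorted(entries, key=lambda e: cosine(q, embed(e)), reverse=True)[:k]
    lines = ["[scratchpad] " + scratchpad] + ["[memory] " + e for e in top]
    return "\n".join(lines)


entries = [
    "the eiffel tower is in paris",
    "mount fuji is in japan",
    "paris is the capital of france",
]
print(build_context("Goal: find the country of the Eiffel Tower",
                    entries, "eiffel tower paris", k=2))
```

The scratchpad line is unconditional while the memory lines depend on the query, which mirrors the "always retrieved" versus "top-k retrieval" split listed above.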

Planning

  • policy over atomic memory ops (CRUD)

Tool Use

  • vector database (semantic retrieval via embeddings)

Frameworks

  • SFT
  • GRPO

Is Agentic

true

Architectures

  • LLM agent (Qwen3-8B) with external vector DB

Optimization Features

Token Efficiency

  • Chunking (4k tokens default) reduces per-step context to manageable size

Training Optimization

  • GRPO
  • Task-level terminal reward (exact match) used for RL
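As an illustration of how a terminal exact-match reward feeds GRPO, the sketch below computes group-relative advantages: several rollouts are sampled for the same question, each gets a reward of 1.0 or 0.0, and each reward is normalized against its group. This covers only the advantage step (the clipped policy-ratio objective and any KL term are omitted), and the use of population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's terminal reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: reward is 1.0 on exact match, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every rollout in a group receives the same reward, all advantages are zero, so that group contributes no learning signal.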

Inference Optimization

  • Stream input in fixed-length chunks to support arbitrarily long contexts
  • Retrieve the top k=6 entries by default, sized for tasks of roughly 2-4 hops
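Streaming input in fixed-length chunks can be sketched as a simple generator. The `stream_chunks` name is hypothetical, whitespace splitting stands in for a real tokenizer, and reading "4k" as 4096 tokens is an assumption.

```python
from typing import Iterator, List, Sequence


def stream_chunks(tokens: Sequence[str], chunk_size: int = 4096) -> Iterator[List[str]]:
    """Yield consecutive fixed-length chunks; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield list(tokens[start:start + chunk_size])


document = "one two three four five six seven eight nine ten"
tokens = document.split()  # stand-in for a real tokenizer
for step, chunk in enumerate(stream_chunks(tokens, chunk_size=4)):
    print(step, chunk)
```

Because the agent only ever sees one chunk plus its retrieved memory, per-step context stays bounded no matter how long the full input is.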

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • RL training is compute-heavy: ~2–3 days on an 8-GPU A800 cluster (paper-reported).
  • Evaluations are limited to three multi-hop QA datasets; generalization to other domains is not shown.
  • Read has a one-step latency: retrieval requested at t-1 becomes visible at t, which may complicate tight real-time loops.
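The one-step Read latency noted above can be simulated in a few lines: a query issued at step t-1 only produces visible results at step t. The function name and the `retrieve` callback are hypothetical; the sketch illustrates the timing only, not AtomMem's code.

```python
from typing import Callable, List, Optional


def run_with_delayed_reads(
    queries: List[Optional[str]],
    retrieve: Callable[[str], List[str]],
) -> List[List[str]]:
    """queries[t] is the Read issued at step t (None if no Read was issued).

    Returns what the agent can see at each step: the results of the Read
    issued one step earlier, or an empty list if there was none.
    """
    visible: List[List[str]] = []
    pending: Optional[str] = None
    for query in queries:
        visible.append(retrieve(pending) if pending is not None else [])
        pending = query  # becomes visible on the next step
    return visible


def fake_retrieve(query: str) -> List[str]:
    return [f"result for {query}"]


print(run_with_delayed_reads(["a", None, "b"], fake_retrieve))
# [[], ['result for a'], []]
```

Note that the Read issued on the final step never becomes visible, so a policy that needs retrieved evidence for its answer has to request it at least one step before answering.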

When Not To Use

  • If you lack RL training budget or infrastructure.
  • For tasks where a simple retrieval baseline already meets requirements and gains are marginal.
  • When memory storage is extremely constrained and update semantics cannot be supported.

Failure Modes

  • Policy can over-rely on Create/Update and bloat storage if reward does not penalize size.
  • If retrieval quality is poor, learned policies may still make suboptimal Read/Update choices.
  • Task-specific RL may overfit to the training distribution of document shuffling and not transfer well.

Core Entities

Models

  • Qwen3-8B
  • Qwen3-embedding-0.6B

Metrics

  • Exact Match (EM) percentage
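For reference, exact match is often computed after light answer normalization. The sketch below follows the common SQuAD-style convention (lowercase, strip punctuation and English articles, collapse whitespace); whether the paper uses exactly this normalization is not stated in this summary, so treat it as an assumption.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """Return 1.0 if the prediction matches any gold answer after normalization."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))


def em_percentage(predictions: List[str], gold_lists: List[List[str]]) -> float:
    """Average exact match over a dataset, expressed as a percentage."""
    scores = [exact_match(p, g) for p, g in zip(predictions, gold_lists)]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0


print(em_percentage(["The Eiffel Tower.", "Rome"], [["eiffel tower"], ["Milan"]]))  # 50.0
```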

Datasets

  • HotpotQA
  • 2WikiMultiHopQA
  • MuSiQue

Benchmarks

  • Needle-in-a-Haystack long-context (RULER-style augmentation)