Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.4
Citation Count
0
Why It Matters For Business
Dynamic, learnable memory policies reduce wasted retrievals and scale better when inputs are long, shuffled, or noisy. That means more reliable answers and lower latency for applications that process many documents or multi-question sessions.
Summary TLDR
AtomMem turns agent memory into a learnable policy over atomic CRUD (Create/Read/Update/Delete) actions. After supervised fine-tuning (SFT) followed by reinforcement learning (GRPO), an 8B Qwen3 agent learns when to store, retrieve, revise, or delete memory entries, improving multi-hop long-context QA by a few percentage points and scaling better to very long inputs.
Problem Statement
Most agent memories use fixed, hand-crafted workflows that assume one pattern fits all tasks. That rigidity wastes retrievals, forces unnecessary updates, and fails when relevant information is sparse or shuffled across very long inputs.
Main Contribution
Reformulate agent memory as a sequential decision problem and expose four atomic operations (Create, Read, Update, Delete) as actions.
Introduce a scratchpad plus external vector DB memory and train policies with a two-stage pipeline: SFT then on-policy RL (GRPO).
Show empirical gains on three multi-hop, long-context QA benchmarks and analyze how the trained policy shifts operation usage (more Create/Update/Delete, fewer Reads).
Provide ablations showing that Update is crucial and that the scratchpad and external storage are both necessary for top performance.
Key Findings
AtomMem with RL (AtomMem-RL) achieves higher end-task exact-match (EM) scores than prior memory agents on the evaluated benchmarks.
Training with RL yields a large performance boost over supervised initialization.
Removing the Update operation harms performance substantially.
Scratchpad and external storage are complementary; removing both collapses performance.
Results
Average EM across benchmarks (200/400/800 doc settings)
SFT
Impact of removing Update (HotpotQA EM)
Who Should Care
What To Try In 7 Days
Add a simple memory API with Create/Read/Update/Delete tags and log actions used during runs.
Train a small policy via supervised examples to follow that API, then try RL fine-tuning on a tiny task set to see if it reduces retrieval calls.
Implement a scratchpad (always-retrieved short summary) plus a vector DB and compare read counts and end-task accuracy to your current pipeline.
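The first step above (a memory API with Create/Read/Update/Delete actions and an action log) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `LoggedMemory`, its method names, and the substring-based `read` are all hypothetical stand-ins, and a real setup would back `read` with semantic retrieval.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoggedMemory:
    """Toy in-process memory exposing four atomic operations (hypothetical API)."""
    entries: dict = field(default_factory=dict)
    action_log: list = field(default_factory=list)
    _next_id: int = 0

    def create(self, text: str) -> int:
        """Store a new entry and return its id."""
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = text
        self.action_log.append("create")
        return entry_id

    def read(self, query: str) -> list:
        """Placeholder retrieval: substring match stands in for semantic search."""
        self.action_log.append("read")
        return [t for t in self.entries.values() if query.lower() in t.lower()]

    def update(self, entry_id: int, text: str) -> bool:
        """Overwrite an existing entry; returns False if the id is unknown."""
        self.action_log.append("update")
        if entry_id not in self.entries:
            return False
        self.entries[entry_id] = text
        return True

    def delete(self, entry_id: int) -> bool:
        """Remove an entry; returns False if the id is unknown."""
        self.action_log.append("delete")
        return self.entries.pop(entry_id, None) is not None

    def action_counts(self) -> Counter:
        """Per-operation usage counts, useful for comparing runs."""
        return Counter(self.action_log)


memory = LoggedMemory()
eid = memory.create("Paris is the capital of France.")
memory.update(eid, "Paris is the capital and largest city of France.")
hits = memory.read("capital")
memory.delete(eid)
print(memory.action_counts())
```

Comparing `action_counts()` before and after fine-tuning gives a rough view of whether the policy is issuing fewer reads.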
Agent Features
Memory
- atomic CRUD operations
- scratchpad (always retrieved)
- external vector storage (top-k retrieval)
Planning
- policy over atomic memory ops (CRUD)
Tool Use
- vector database (semantic retrieval via embeddings)
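To make the memory layout listed above concrete, here is a small sketch of how a per-step context might be assembled from an always-included scratchpad plus top-k entries from a store. The function names are hypothetical, and `embed` is a bag-of-words stand-in for a real embedding model such as the one named under Core Entities; only the overall shape (scratchpad first, then the k most similar entries) reflects the description.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_context(scratchpad, store, query, k=6):
    """Return the scratchpad followed by the k entries most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda entry: cosine(q, embed(entry)), reverse=True)
    return [scratchpad] + ranked[:k]


store = [
    "Marie Curie was born in Warsaw",
    "Warsaw is the capital of Poland",
    "The Nile flows through Egypt",
]
context = build_context("Goal: find Curie's birthplace", store, "Where was Marie Curie born", k=2)
print(context)
```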
Frameworks
- SFT
- GRPO
Is Agentic
true
Architectures
- LLM agent (Qwen3-8B) with external vector DB
Optimization Features
Token Efficiency
- Chunking (4k tokens by default) reduces per-step context to a manageable size
Training Optimization
- GRPO
- Task-level terminal reward (exact match) used for RL
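As a rough illustration of the two items above, the sketch below computes a terminal exact-match reward and the group-relative advantages that GRPO derives from a group of sampled rollouts for the same question. The answer normalization follows the common SQuAD-style recipe, which is an assumption here; the paper's exact normalization, and any extra terms in its GRPO objective, are not specified in this summary. Function names are hypothetical.

```python
import re
import statistics
import string


def normalize(text):
    """SQuAD-style normalization (assumed): lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction, gold_answers):
    """Terminal reward: 1.0 if the prediction matches any gold answer, else 0.0."""
    golds = {normalize(g) for g in gold_answers}
    return 1.0 if normalize(prediction) in golds else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rollouts = ["The Eiffel Tower.", "Louvre", "Notre-Dame", "eiffel tower"]
rewards = [exact_match_reward(p, ["Eiffel Tower"]) for p in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the reward only arrives at the end of an episode, every memory action in a rollout shares that rollout's advantage; if all rollouts in a group score the same, the advantages are all zero and the group contributes no learning signal.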
Inference Optimization
- Stream input in fixed-length chunks to support arbitrarily long contexts
- Retrieve the top k=6 entries by default, sized for tasks of roughly 2-4 hops
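Streaming the input in fixed-length chunks can be sketched with a simple generator. This is an illustrative sketch only: `stream_chunks` is a hypothetical name, and whitespace-split words stand in for tokenizer tokens, so the default of 4096 mirrors the 4k setting only loosely.

```python
from typing import Iterable, Iterator, List


def stream_chunks(tokens: Iterable[str], chunk_size: int = 4096) -> Iterator[List[str]]:
    """Yield consecutive fixed-length chunks; the final chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer


chunks = list(stream_chunks("a b c d e".split(), chunk_size=2))
print(chunks)
```

Since the generator holds only one chunk in memory at a time, the per-step context stays bounded no matter how long the input is.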
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- RL training is compute-heavy: ~2–3 days on an 8-GPU A800 cluster (paper-reported).
- Evaluations are limited to three multi-hop QA datasets; generalization to other domains is not shown.
- Read has a one-step latency: retrieval requested at t-1 becomes visible at t, which may complicate tight real-time loops.
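The one-step Read latency noted above can be illustrated with a toy memory that defers results to the next step. The class and method names are hypothetical, and substring matching stands in for real retrieval; the sketch only shows the timing, where a read requested during step t-1 is delivered at the start of step t.

```python
class DelayedReadMemory:
    """Toy memory whose read results arrive one step after the request."""

    def __init__(self, entries):
        self.entries = list(entries)
        self._pending_query = None

    def request_read(self, query):
        """Queue a read; nothing is returned during the current step."""
        self._pending_query = query

    def begin_step(self):
        """Return results for the read requested in the previous step, if any."""
        query, self._pending_query = self._pending_query, None
        if query is None:
            return []
        return [e for e in self.entries if query.lower() in e.lower()]


mem = DelayedReadMemory(["Alice works at Acme", "Bob lives in Oslo"])
step0 = mem.begin_step()   # no read was pending
mem.request_read("alice")  # issued during step 0
step1 = mem.begin_step()   # results become visible at step 1
step2 = mem.begin_step()   # queue was cleared after delivery
print(step0, step1, step2)
```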
When Not To Use
- If you lack RL training budget or infrastructure.
- For tasks where a simple retrieval baseline already meets requirements and gains are marginal.
- When memory storage is extremely constrained and update semantics cannot be supported.
Failure Modes
- The policy can over-rely on Create/Update and bloat storage if the reward does not penalize memory size.
- If retrieval quality is poor, learned policies may still make suboptimal Read/Update choices.
- Task-specific RL may overfit to the training distribution of document shuffling and not transfer well.
Core Entities
Models
- Qwen3-8B
- Qwen3-embedding-0.6B
Metrics
- Exact Match (EM) percentage
Datasets
- HotpotQA
- 2WikiMultiHopQA
- MuSiQue
Benchmarks
- Needle-in-a-Haystack long-context (RULER-style augmentation)

