Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
Improving what an agent saves (more complete, grounded memories) raises answer quality and reduces long-term error costs; pay once for extraction, benefit many reads.
Summary TLDR
The paper introduces ProMem, an iterative memory-extraction pipeline for LLM agents that adds a feedback loop of semantic alignment and self-questioning to the usual one-shot summarization. On HaluMem and LongMemEval, ProMem raises memory recall (integrity) from ~41% (typical baselines) to 73.8% and boosts downstream QA (62.26% on HaluMem, 69.57% on LongMemEval). ProMem is robust to heavy token compression and can run with smaller LLMs (Llama3-8B) to reduce cost.
Problem Statement
Current agent memory systems compress dialogues in a single summarization pass. This 'ahead-of-time', 'one-off' extraction misses small but important facts and locks in hallucinations. As a result, stored memories are incomplete or incorrect, which reduces accuracy on later queries.
Main Contribution
Propose ProMem: make extraction iterative with semantic alignment and self-questioning verification.
Show large gains in memory completeness and downstream QA on HaluMem and LongMemEval.
Demonstrate robustness to token compression and viable use with smaller LLMs to lower cost.
Key Findings
ProMem raises memory integrity on HaluMem to 73.80%, outperforming common summary baselines.
Downstream QA accuracy improves: ProMem gets 62.26% on HaluMem and 69.57% on LongMemEval.
ProMem keeps reasonable performance when input tokens are heavily compressed.
ProMem works with smaller LLMs: with Llama3-8B it yields better integrity and QA than Mem0 using the same small model.
Both memory completion and verification modules matter: removing them drops integrity/QA substantially.
Results
Memory Integrity
Accuracy
Robustness under token compression
SLM experiment (Llama3-8B)
Who Should Care
What To Try In 7 Days
Add a semantic-match step to map summaries back to dialogue turns and re-extract uncovered turns.
Implement a self-questioning loop: generate verification questions for extracted facts and validate against raw turns.
Run a compressed-input test: drop tokens and compare QA before/after adding ProMem steps.
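The first step in the list above (mapping stored memories back to dialogue turns and flagging turns that nothing covers) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words cosine is a stand-in for a real embedding model, and the function names, the 0.3 threshold, and the sample dialogue are all hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def uncovered_turns(turns, memories, threshold=0.3):
    """Return indices of turns whose best match against any memory is below threshold.

    The threshold value is an assumption and would need tuning per embedding model.
    """
    mem_vecs = [bow_vector(m) for m in memories]
    missing = []
    for i, turn in enumerate(turns):
        turn_vec = bow_vector(turn)
        best = max((cosine(turn_vec, mv) for mv in mem_vecs), default=0.0)
        if best < threshold:
            missing.append(i)
    return missing


# Hypothetical dialogue: the second turn is not reflected in any stored memory.
turns = [
    "I moved to Lisbon last spring.",
    "My sister Ana is allergic to peanuts.",
]
memories = ["User moved to Lisbon last spring."]
missing = uncovered_turns(turns, memories)
```

Turns returned by `uncovered_turns` would then be passed back to the extractor for a second pass. In a real system the similarity would come from an embedding model rather than word overlap.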
Agent Features
Memory
- proactive extraction
- semantic matching to turns
- recurrent verification loop
Tool Use
- embedding-based retrieval
- self-questioning (auto-generated probes)
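A self-questioning check of the kind listed above might be structured as in the sketch below. The three callables would normally be LLM calls; here they are injected as parameters so the control flow can be shown without assuming any particular model API. All names and the toy stand-ins are hypothetical and not taken from ProMem.

```python
from typing import Callable, List


def verify_memories(
    memories: List[str],
    turns: List[str],
    make_question: Callable[[str], str],
    answer_from_turns: Callable[[str, List[str]], str],
    is_supported: Callable[[str, str], bool],
) -> List[str]:
    """Keep only memories whose probe question, answered from the raw turns, supports them."""
    kept = []
    for memory in memories:
        question = make_question(memory)             # generate a probe for this fact
        answer = answer_from_turns(question, turns)  # answer it from the raw dialogue only
        if is_supported(memory, answer):             # compare the answer with the stored fact
            kept.append(memory)
    return kept


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_question(memory: str) -> str:
    return "What does the dialogue say about: " + memory


def toy_answer(question: str, turns: List[str]) -> str:
    # A real system would retrieve relevant turns and ask a model; here we return all turns.
    return " ".join(turns)


def toy_supported(memory: str, answer: str) -> bool:
    # Crude check: the last word of the memory must appear in the evidence.
    key = memory.rstrip(".").split()[-1].lower()
    return key in answer.lower()


turns = ["I live in Lisbon now.", "I have two cats."]
memories = ["User lives in Lisbon.", "User owns a dog."]
kept = verify_memories(memories, turns, toy_question, toy_answer, toy_supported)
```

Passing the model calls in as parameters also makes it easy to swap a smaller model into the verification step and compare results.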
Frameworks
- ProMem
Is Agentic
true
Architectures
- LLM-based agent
Optimization Features
Token Efficiency
- robust to heavy token drop (tested to 0.2)
- write-once, read-many tradeoff reduces amortized cost
Inference Optimization
- use SLMs for verification steps
- apply token compression before extraction
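The write-once, read-many tradeoff noted above comes down to simple arithmetic: the extraction cost is paid once and spread over every later read. The sketch below illustrates this with placeholder token counts; they are assumptions chosen for the example, not measurements from the paper.

```python
def amortized_cost_per_read(extraction_tokens, read_tokens, num_reads):
    """Average token cost per read when extraction is paid once up front."""
    if num_reads <= 0:
        raise ValueError("num_reads must be positive")
    return extraction_tokens / num_reads + read_tokens


# Placeholder figures (assumed): iterative extraction costs 3x a one-shot summary.
one_shot_extraction = 2000
iterative_extraction = 6000
tokens_per_read = 500

single_read_gap = (
    amortized_cost_per_read(iterative_extraction, tokens_per_read, 1)
    - amortized_cost_per_read(one_shot_extraction, tokens_per_read, 1)
)
hundred_read_gap = (
    amortized_cost_per_read(iterative_extraction, tokens_per_read, 100)
    - amortized_cost_per_read(one_shot_extraction, tokens_per_read, 100)
)
```

With these assumed numbers the per-read overhead of iterative extraction is 4000 tokens for a memory read once and 40 tokens for a memory read 100 times, which is why the approach suits memories that are queried repeatedly.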
Reproducibility
Data Urls
- HaluMem (Chen et al., 2025)
- LongMemEval (Wu et al., 2025)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Higher token and latency cost from iterative verification.
- Effectiveness depends on backbone LLM reasoning quality.
- Not yet integrated with lifelong update/forgetting mechanisms.
When Not To Use
- Hard real-time systems that cannot tolerate extra extraction latency.
- Very small/weaker LMs that cannot generate reliable verification questions.
- Severely compute- or token-constrained deployments without amortization opportunities.
Failure Modes
- Backbone LLM hallucinations produce incorrect verification Q/A and corrupt memory.
- High token cost if extraction is repeated without amortization.
- Incorrect similarity thresholds may either miss uncovered turns or create noisy supplementary entries.
Core Entities
Models
- GPT-4o-mini
- GPT-4o
- Llama3-8B
- Qwen3-Embedding-8B
Metrics
- Memory Integrity
- Accuracy
Datasets
- HaluMem
- LongMemEval
Benchmarks
- HaluMem
- LongMemEval
Context Entities
Models
- Mem0
- LightMem
- Memobase
- Supermemory
- NativeRAG

