ProMem: iterative self-questioning to recover missing facts and cut downstream errors

January 8, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Chengyuan Yang, Zequn Sun, Wei Wei, Wei Hu

Links

Abstract / PDF

Why It Matters For Business

Improving what an agent saves (more complete, grounded memories) raises answer quality and reduces long-term error costs; pay once for extraction, benefit many reads.

Summary TLDR

The paper introduces ProMem, an iterative memory-extraction pipeline for LLM agents that adds a feedback loop of semantic alignment and self-questioning to the usual one-shot summarization. On HaluMem, ProMem raises memory recall (integrity) from ~41% for typical baselines to 73.8%; downstream QA accuracy reaches 62.26% on HaluMem and 69.57% on LongMemEval. ProMem remains robust under heavy token compression and can run on a smaller LLM (Llama3-8B) to reduce cost.

Problem Statement

Current agent memory systems compress dialogues in a single summarization pass. That 'ahead-of-time' and 'one-off' extraction misses small but important facts and locks in hallucinations. The result: incomplete or incorrect stored memories reduce accuracy on later queries.

Main Contribution

Propose ProMem: make extraction iterative with semantic alignment and self-questioning verification.

Show large gains in memory completeness and downstream QA on HaluMem and LongMemEval.

Demonstrate robustness to token compression and viable use with smaller LLMs to lower cost.

Key Findings

ProMem raises memory integrity on HaluMem to 73.80%, outperforming common summary baselines.

Numbers: Memory Integrity: ProMem 73.80% vs Mem0/Supermemory ~42%

Downstream QA accuracy improves: ProMem gets 62.26% on HaluMem and 69.57% on LongMemEval.

Numbers: QA Accuracy: 62.26% (HaluMem), 69.57% (LongMemEval)

ProMem keeps reasonable performance when input tokens are heavily compressed.

Numbers: At a 0.2 compression ratio, QA 37.20% and integrity 57.20%

ProMem works with smaller LLMs: with Llama3-8B it yields better integrity and QA than Mem0 running on the same small model.

Numbers: With SLM (Llama3-8B): ProMem integrity 43.09% vs Mem0 30.59%; QA 49.15% vs 38.41%

Both memory completion and verification modules matter: removing them drops integrity/QA substantially.

Numbers: Ablation: removing memory completion (MC) and memory verification (MV) gives integrity 54.03% and QA 50.6% (vs. full 73.8% / 62.12%)

Results

Memory Integrity

Value: 73.80%

Baseline: Mem0 / Supermemory ~42%

Accuracy

Value: 89.47%

Baseline: Memobase 92.24%

QA Accuracy (HaluMem)

Value: 62.26%

Baseline: LightMem 56.6% (reported) / Mem0 53.02%

QA Accuracy (LongMemEval)

Value: 69.57%

Baseline: LightMem (reported lower)

Robustness under token compression

Value: Integrity 57.20%, QA 37.20% at a 0.2 compression ratio

Baseline: Mem0 integrity 23.28%, QA 21.34% at a 0.2 compression ratio

SLM experiment (Llama3-8B)

Value: Integrity 43.09%, QA 49.15%

Baseline: Mem0 with the same SLM: integrity 30.59%, QA 38.41%
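
To compare the rows above, it can help to put each ProMem figure next to its Mem0 baseline as an absolute gain (percentage points) and a relative gain. The sketch below does only that arithmetic on numbers copied from this page; the dictionary labels and the `gains` helper are illustrative names, and the ~42% baseline is the approximate value quoted above, not an exact figure.

```python
# (ProMem, baseline) pairs in percent, copied from the figures reported above.
REPORTED = {
    "Memory integrity vs Mem0/Supermemory (~42%)": (73.80, 42.0),
    "HaluMem QA vs Mem0": (62.26, 53.02),
    "Integrity at 0.2 compression vs Mem0": (57.20, 23.28),
    "QA at 0.2 compression vs Mem0": (37.20, 21.34),
    "Integrity with Llama3-8B vs Mem0": (43.09, 30.59),
    "QA with Llama3-8B vs Mem0": (49.15, 38.41),
}


def gains(ours: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    for name, (ours, base) in REPORTED.items():
        abs_pts, rel_pct = gains(ours, base)
        print(f"{name}: +{abs_pts} points ({rel_pct}% relative)")
```

Absolute gains are the safer number to quote when baselines differ widely, since a low baseline inflates the relative figure.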

Who Should Care

What To Try In 7 Days

Add a semantic-match step to map summaries back to dialogue turns and re-extract uncovered turns.

Implement a self-questioning loop: generate verification questions for extracted facts and validate against raw turns.

Run a compressed-input test: drop tokens and compare QA before/after adding ProMem steps.
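
The semantic-match step in the first item above can be prototyped before any embedding model is wired in. The following is a minimal sketch, assuming a toy bag-of-words cosine similarity as a stand-in for real embeddings; `bow`, `cosine`, `uncovered_turns`, and the 0.3 threshold are hypothetical choices for illustration, not the paper's implementation.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def uncovered_turns(
    turns: list[str], memories: list[str], threshold: float = 0.3
) -> list[int]:
    """Indices of turns whose best match against any memory is below threshold."""
    mem_vecs = [bow(m) for m in memories]
    missing = []
    for i, turn in enumerate(turns):
        turn_vec = bow(turn)
        best = max((cosine(turn_vec, mv) for mv in mem_vecs), default=0.0)
        if best < threshold:
            missing.append(i)
    return missing


turns = [
    "I moved to Lisbon last spring for a new job.",
    "My sister Ana is allergic to peanuts.",
    "Anyway, the weather has been great.",
]
memories = ["User moved to Lisbon last spring for a new job."]
print(uncovered_turns(turns, memories))  # [1, 2]
```

Turns 1 and 2 are flagged because no stored memory resembles them. A second extraction pass over only those turns would then decide that the allergy is worth storing and the weather remark is not.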

Agent Features

Memory

  • proactive extraction
  • semantic matching to turns
  • recurrent verification loop
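
A verification loop of this kind can be sketched as three steps per extracted fact: generate a probe question, answer it from the raw turns only, and keep the fact only if the answer supports it. In the sketch below, `make_question` and `toy_answerer` are hypothetical stand-ins for LLM calls (a template and a word-overlap lookup), and the `Fact` structure is an assumed representation rather than the paper's.

```python
import re
from typing import Callable, NamedTuple


class Fact(NamedTuple):
    subject: str
    attribute: str
    value: str


STOP = {"what", "is", "s", "the", "a", "of"}


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def make_question(fact: Fact) -> str:
    """Stand-in for an LLM call that turns a fact into a probe question."""
    return f"What is {fact.subject}'s {fact.attribute}?"


def toy_answerer(question: str, turns: list[str]) -> str:
    """Stand-in for an LLM answering from raw turns only.

    Returns the turn sharing the most content words with the question,
    or an empty string when no turn shares any.
    """
    q = words(question) - STOP
    best, best_overlap = "", 0
    for turn in turns:
        overlap = len(q & words(turn))
        if overlap > best_overlap:
            best, best_overlap = turn, overlap
    return best


def verify(
    facts: list[Fact],
    turns: list[str],
    ask: Callable[[Fact], str] = make_question,
    answer: Callable[[str, list[str]], str] = toy_answerer,
) -> tuple[list[Fact], list[Fact]]:
    """Split facts into (kept, rejected) based on evidence in the raw turns."""
    kept, rejected = [], []
    for fact in facts:
        evidence = answer(ask(fact), turns)
        if fact.value.lower() in evidence.lower():
            kept.append(fact)
        else:
            rejected.append(fact)
    return kept, rejected


turns = ["My sister Ana is allergic to peanuts.", "I moved to Lisbon last spring."]
facts = [Fact("Ana", "allergy", "peanuts"), Fact("user", "city", "Porto")]
kept, rejected = verify(facts, turns)
print("kept:", kept)
print("rejected:", rejected)
```

Passing `ask` and `answer` as callables keeps the loop testable offline and lets the two LLM calls be swapped in later without touching the control flow.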

Tool Use

  • embedding-based retrieval
  • self-questioning (auto-generated probes)

Frameworks

  • ProMem

Is Agentic

true

Architectures

  • LLM-based agent

Optimization Features

Token Efficiency

  • robust to heavy token drop (tested down to a 0.2 compression ratio)
  • write-once, read-many tradeoff reduces amortized cost
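
The write-once, read-many argument is simple arithmetic: the one-time extraction cost is spread over every later read. A minimal sketch, with all token counts below being made-up illustrative values rather than measurements from the paper:

```python
def amortized_tokens_per_read(write_tokens: int, read_tokens: int, reads: int) -> float:
    """Average tokens spent per read when one write is shared by `reads` reads."""
    if reads <= 0:
        raise ValueError("reads must be positive")
    return write_tokens / reads + read_tokens


# Hypothetical budgets: iterative extraction costs 3x a one-shot summary.
ONE_SHOT_WRITE = 2_000
ITERATIVE_WRITE = 6_000
READ = 500

for reads in (1, 10, 100):
    one_shot = amortized_tokens_per_read(ONE_SHOT_WRITE, READ, reads)
    iterative = amortized_tokens_per_read(ITERATIVE_WRITE, READ, reads)
    print(f"{reads:>3} reads: one-shot {one_shot:.0f}, iterative {iterative:.0f}, "
          f"overhead {iterative - one_shot:.0f} tokens/read")
```

Under these assumed numbers the per-read overhead of iterative extraction falls from 4,000 tokens at one read to 40 tokens at 100 reads, which is why the tradeoff only pays off for memories that are queried repeatedly.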

Inference Optimization

  • use SLMs for verification steps
  • apply token compression before extraction
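
A compressed-input comparison needs two pieces: a way to shrink the dialogue to a target ratio, and a harness that scores each pipeline on the same shrunken inputs. The sketch below assumes seeded random token dropping as a crude stand-in for a real compressor (this page does not say which compressor the paper used), and `keyword_recall` is a toy scorer standing in for a full extraction-plus-QA run.

```python
import random
from typing import Callable


def compress(text: str, keep_ratio: float, seed: int = 0) -> str:
    """Keep about keep_ratio of the whitespace tokens, in original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    tokens = text.split()
    if not tokens:
        return ""
    k = max(1, round(len(tokens) * keep_ratio))
    keep = sorted(random.Random(seed).sample(range(len(tokens)), k))
    return " ".join(tokens[i] for i in keep)


def sweep(
    dialogue: str,
    pipelines: dict[str, Callable[[str], float]],
    ratios: tuple[float, ...] = (1.0, 0.5, 0.2),
) -> dict[str, dict[float, float]]:
    """Score every pipeline on identical compressed inputs at each ratio."""
    return {
        name: {r: score(compress(dialogue, r)) for r in ratios}
        for name, score in pipelines.items()
    }


def keyword_recall(text: str, keywords: tuple[str, ...] = ("Lisbon", "peanuts")) -> float:
    """Toy scorer: share of key facts whose keyword survives in the text."""
    return sum(k in text for k in keywords) / len(keywords)


dialogue = "I moved to Lisbon last spring and my sister Ana is allergic to peanuts"
print(sweep(dialogue, {"toy": keyword_recall}))
```

Fixing the seed means every pipeline sees exactly the same dropped tokens, so score differences come from the pipeline and not from the sampling.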

Reproducibility

Data Urls

  • HaluMem (Chen et al., 2025)
  • LongMemEval (Wu et al., 2025)

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Higher token and latency cost from iterative verification.
  • Effectiveness depends on backbone LLM reasoning quality.
  • Not yet integrated with lifelong update/forgetting mechanisms.

When Not To Use

  • Hard real-time systems that cannot tolerate extra extraction latency.
  • Very small or weak LMs that cannot generate reliable verification questions.
  • Severely compute- or token-constrained deployments without amortization opportunities.

Failure Modes

  • Backbone LLM hallucinations produce incorrect verification Q/A and corrupt memory.
  • High token cost if extraction is repeated without amortization.
  • Incorrect similarity thresholds may either miss uncovered turns or create noisy supplementary entries.
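
The threshold failure mode is easy to reproduce on a toy example by sweeping the cutoff and counting how many turns get flagged for re-extraction. This sketch assumes Jaccard word overlap as a placeholder similarity; the helper names and threshold values are illustrative only.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; a placeholder for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def flagged(turns: list[str], memories: list[str], threshold: float) -> list[str]:
    """Turns whose best similarity to any memory falls below the threshold."""
    return [
        t for t in turns
        if max((jaccard(t, m) for m in memories), default=0.0) < threshold
    ]


turns = [
    "I moved to Lisbon last spring.",         # already covered by the memory
    "My sister Ana is allergic to peanuts.",  # important, not yet stored
    "Thanks, talk to you later.",             # chit-chat
]
memories = ["User moved to Lisbon last spring."]

for threshold in (0.05, 0.3, 0.95):
    print(threshold, len(flagged(turns, memories, threshold)))
```

At 0.05 nothing is flagged, so the allergy is silently lost; at 0.95 even the already-covered turn is flagged, producing redundant entries. Tuning the cutoff on a small labeled sample of turns is one way to pick a value between those extremes.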

Core Entities

Models

  • GPT-4o-mini
  • GPT-4o
  • Llama3-8B
  • Qwen3-Embedding-8B

Metrics

  • Memory Integrity
  • Accuracy

Datasets

  • HaluMem
  • LongMemEval

Benchmarks

  • HaluMem
  • LongMemEval

Context Entities

Models

  • Mem0
  • LightMem
  • Memobase
  • Supermemory
  • NativeRAG