Ground LLM agents' memories at coarse and fine levels to improve planning and recovery

August 21, 2025 · 9 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, Bo Xu

Why It Matters For Business

CFGM makes LLM-driven agents more reliable and cheaper in long-horizon interactive tasks by improving success rates and reducing unnecessary steps. This helps automation in web navigation, virtual assistants, and simulated-process tasks where repeated interactions are costly.

Summary TLDR

The paper introduces CFGM, a memory system for language-model agents that 1) uses LLMs to generate coarse focus points to guide collection of experiences, 2) extracts hybrid-grained (coarse + fine) tips from those experiences, and 3) performs fine-grained key-information extraction and self-QA during inference to correct plans. Across three interactive benchmarks, CFGM improves success rates and reduces interaction turns compared to several memory-augmented baselines.

Problem Statement

Existing memory-augmented LLM agents store knowledge at a single granularity (e.g., whole trajectories or single tips). That limits the diversity and usefulness of memories, hurts exploration, and makes short-term reflection brittle. The authors aim to make memories more useful by grounding them at coarse and fine levels using the LLM itself.

Main Contribution

CFGM framework: a three-stage memory grounding pipeline (coarse focus points → hybrid-grained tips → fine-grained key-info reflection) that uses the LLM to build and use memories.

Empirical validation: shows consistent gains on AlfWorld, WebShop and ScienceWorld versus ReAct and other memory-augmented baselines.

Ablations and analyses: show that each component (focus points, tips, key-info reflection) helps, and that a moderate top-k retrieval (k=2–3) yields the best tip quality.
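
The three stages listed above can be pictured as a simple data flow. The following is a minimal sketch, not the authors' implementation: the `LLM` callable, the function names, and every prompt string are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to model text.
LLM = Callable[[str], str]


def focus_points(llm: LLM, task: str) -> List[str]:
    """Stage 1: ask the model for coarse focus points, one per line."""
    reply = llm(f"List the key points to focus on for this task:\n{task}")
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def hybrid_tips(llm: LLM, trajectory: str) -> Dict[str, str]:
    """Stage 2: distill one coarse and one fine tip from an experience."""
    return {
        "coarse": llm(f"Give one high-level tip from this trajectory:\n{trajectory}"),
        "fine": llm(f"Give one step-level tip from this trajectory:\n{trajectory}"),
    }


def key_info_reflection(llm: LLM, trajectory: str) -> str:
    """Stage 3: extract key information, then self-question to revise the plan."""
    key_info = llm(f"Extract the key information from this trajectory:\n{trajectory}")
    return llm(f"Given this key information, what should change in the plan?\n{key_info}")
```

Passing the model in as a plain callable keeps the sketch independent of any particular provider; a stub function can stand in for the model while wiring up the stages.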

Key Findings

CFGM raises AlfWorld success rate to 91.00% versus 80.60% for ReAct.

Numbers: SR 91.00% vs 80.60% (+10.40 percentage points)

CFGM improves WebShop success rate from 37% to 57% and raises reward from 0.586 to 0.85.

Numbers: SR 57% vs 37% (+20 percentage points); Reward 0.85 vs 0.586

CFGM lifts ScienceWorld success rate from 43% to 74% on evaluated tasks.

Numbers: SR 74% vs 43% (+31 percentage points)

Each CFGM component adds value and the full system is best: combining focus points (FP), tips (ET), and key information reflection (KIR) yields 91.00% SR on AlfWorld.

Numbers: ablation SRs: baseline 80.60% → FP / ET / KIR variants 85.82% / 86.57% / 85.82% → full 91.00%

A small, focused retrieval scope works best: top-k = 2 or 3 gives the highest AlfWorld success rate, and retrieving more experiences hurts.

Numbers: SR at k=0: 80.60% → k=1: 82.09% → k=2 or 3: 86.57% → k=5: 81.34%
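
As a small worked example of reading the sweep above, the snippet below picks the smallest k that reaches the best reported success rate. The success rates are copied from the numbers above; the helper names are illustrative only.

```python
# Reported AlfWorld success rates (%) by top-k, copied from the numbers above.
sr_by_k = {0: 80.60, 1: 82.09, 2: 86.57, 3: 86.57, 5: 81.34}


def best_k(results):
    """Return the smallest k that achieves the highest success rate."""
    top = max(results.values())
    return min(k for k, sr in results.items() if sr == top)


def gain_over_no_tips(results, k):
    """Absolute gain in percentage points over k=0 (no tips)."""
    return round(results[k] - results[0], 2)
```

Preferring the smallest tied k is a design choice made here, on the assumption that fewer retrieved experiences mean a shorter prompt.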

CFGM reduces average online interaction turns relative to ExpeL and ReAct while keeping offline token cost comparable to ExpeL.

Numbers: online turns 14.32 (CFGM) vs 17.32 (ExpeL) and 19.01 (ReAct); offline tokens 4068.5 (CFGM) vs 3888.2 (ExpeL)

Hybrid-grained tips transfer to a related but different shopping environment better than baselines.

Numbers: transferred-tips SR 25.1% vs 18.5% (ExpeL) and 10.2% (ReAct) on WebArena-Shopping

Key Information Reflection (KIR) outperforms fixed-question and naive self-QA reflection.

Numbers: KIR SR 85.82% vs 84.33% (Self-QA) and 84.33% (QA(Fixed))

Results

Success Rate (AlfWorld)

Value: 91.00% ± 0.82%

Baseline: ReAct 80.60% ± 0.68%

Success Rate (WebShop)

Value: 57% ± 3%

Baseline: ReAct 37% ± 2%

Reward (WebShop)

Value: 0.85 ± 0.013

Baseline: ReAct 0.586 ± 0.01 (reported as 58.6 in the paper's table)

Success Rate (ScienceWorld)

Value: 74% ± 2%

Baseline: ReAct 43% ± 1%

Average online interaction turns

Value: 14.32

Baseline: ReAct 19.01

Top-k retrieval effect (AlfWorld SR)

Value: 86.57% at k=2 or k=3

Baseline: k=0 (no tips) 80.60%

Out-of-domain transfer (WebArena-Shopping SR)

Value: 25.1% ± 0.8%

Baseline: ExpeL 18.5% ± 0.9%
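
The success-rate gains reported here are absolute differences in percentage points, which can look quite different from relative improvements. The short calculation below shows both, using only the CFGM and ReAct success rates listed above; the function and variable names are illustrative.

```python
# (CFGM SR %, ReAct SR %) pairs copied from the results above.
results = {
    "AlfWorld": (91.00, 80.60),
    "WebShop": (57.0, 37.0),
    "ScienceWorld": (74.0, 43.0),
}


def gains(cfgm, react):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = cfgm - react
    relative = 100 * absolute / react
    return round(absolute, 2), round(relative, 1)


for name, (cfgm, react) in results.items():
    pp, rel = gains(cfgm, react)
    print(f"{name}: +{pp} pp absolute, +{rel}% relative")
```

For example, the 20-point WebShop gain is a relative improvement of about 54% over the ReAct baseline, because that baseline starts from a low success rate.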

Who Should Care

What To Try In 7 Days

Implement a simple two-level memory: use LLM prompts to extract 3–5 coarse focus points from your task descriptions and store a small experience pool.

Distill two hybrid-grained tips per successful experience (one coarse, one fine) and test top-k=2 retrieval during inference.

Add a simple key-information extractor that summarizes current trajectory state and run a short self-QA before repeating failing actions.
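
A first prototype of the tip store and top-k retrieval might look like the sketch below. It is an assumption-laden stand-in, not the paper's method: the `Tip` fields and example tips are made up for illustration, and bag-of-words cosine similarity replaces the embedding-based retrieval a real system would use.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Tip:
    task: str    # description of the task the tip was distilled from
    coarse: str  # high-level strategy tip
    fine: str    # step-level detail tip


def _bow(text):
    """Lowercased bag-of-words counts (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # Counter yields 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_tips(memory, query, k=2):
    """Return the k tips whose source task is most similar to the query."""
    q = _bow(query)
    ranked = sorted(memory, key=lambda tip: _cosine(q, _bow(tip.task)), reverse=True)
    return ranked[:k]


# Hypothetical tips for illustration only.
memory = [
    Tip("put a clean mug on the shelf", "Find the object, clean it, then place it.", "Clean items at the sinkbasin."),
    Tip("heat an egg and put it on the table", "Find, heat, then place.", "Use the microwave to heat food."),
    Tip("buy a red cotton shirt under 30 dollars", "Search, filter, then check options.", "Select color and size before buying."),
]
top = retrieve_tips(memory, "put a clean plate on the shelf", k=2)
```

Keeping `k` as a parameter makes it easy to rerun the same evaluation at k=1, 2, 3, and 5 and check whether the small-k pattern holds on your own tasks.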

Agent Features

Memory

  • Coarse-to-fine grounded memory
  • Experience pool (offline trajectories)
  • Tips dictionary (hybrid-grained)
  • Key-information short-term reflection

Planning

  • Planning with LLMs
  • Task Decomposition
  • Short-term Memory
  • Long-term Memory
  • Retrieval Memory

Tool Use

  • Tool Selection
  • Function Calling

Frameworks

  • CFGM

Is Agentic

true

Architectures

  • single-agent

Collaboration

  • self-reflection (single-agent)

Optimization Features

Token Efficiency

  • CFGM reports fewer average interaction turns (14.32) and competitive offline token cost

System Optimization

  • Faiss-based retrieval for fast experience lookup

Training Optimization

  • LoRA

Inference Optimization

  • Top-k experience retrieval (k=2–3 recommended)
  • Key-info triggered reflection to avoid repeated invalid actions
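
One way to decide when reflection should fire is to watch for the same action failing repeatedly. The sketch below assumes the environment signals an invalid action with the text "Nothing happens" in the observation; that string, the class name, and the window and threshold defaults are all assumptions to adapt to your environment, not details taken from the paper.

```python
from collections import deque


class ReflectionTrigger:
    """Flags when the same action has failed several times recently."""

    def __init__(self, window=4, max_repeats=2, failure_marker="nothing happens"):
        self.recent = deque(maxlen=window)  # last few (action, failed) pairs
        self.max_repeats = max_repeats
        self.failure_marker = failure_marker  # assumed invalid-action signal

    def observe(self, action, observation):
        """Record one step; return True when reflection should run."""
        failed = self.failure_marker in observation.lower()
        self.recent.append((action, failed))
        repeats = sum(1 for a, f in self.recent if f and a == action)
        return failed and repeats >= self.max_repeats
```

When `observe` returns True, the agent would summarize the key information in the current trajectory and run its self-QA step before choosing the next action, rather than retrying the failing one.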

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • CFGM still needs enough training tasks; very small training sets can limit effective experience collection.
  • Long experience trajectories can introduce redundancy or interference; performance drops when similar experiences are very long.
  • Relies on strong LLMs (authors use GPT-4 family); behavior may vary with weaker or unavailable models.

When Not To Use

  • When you only have a handful of training tasks and cannot collect diverse experiences.
  • When experience trajectories are extremely long and you cannot apply dynamic filtering to remove redundant steps.
  • When you cannot access sufficiently capable LLMs for generating focus points and tips.

Failure Modes

  • Retrieving too many experiences or low-quality experiences can add noise and reduce performance.
  • Poorly aligned or conflicting tips can cause the agent to oscillate instead of converging to a plan.
  • Over-reliance on closed-source LLM behavior may reduce reproducibility or increase operational cost.

Core Entities

Models

  • gpt-4-turbo
  • gpt-4o
  • GPT-4.1
  • Qwen2.5-7B-Instruct
  • Qwen

Metrics

  • Success Rate (SR)
  • Reward
  • On. Tokens (online tokens)
  • Off. Tokens (offline tokens)
  • On. Turns (online interaction turns)

Datasets

  • AlfWorld
  • WebShop
  • ScienceWorld
  • WebArena-Shopping

Benchmarks

  • AlfWorld
  • WebShop
  • ScienceWorld
  • WebArena-Shopping

Context Entities

Models

  • gpt-4-turbo-2024-04-09
  • gpt-4o-2024-08-06
  • gpt-4-2024-04-09

Metrics

  • Success Rate reported as mean ± standard error (s.e.)
  • Reward function used in WebShop

Datasets

  • AlfWorld tasks subset (134 tasks)
  • WebShop 100 tasks
  • ScienceWorld 100 tasks
  • WebArena-Shopping 98 tasks

Benchmarks

  • AlfWorld (household planning)
  • WebShop (shopping navigation)
  • ScienceWorld (text-based science tasks)