Ground LLM agents' memories at coarse and fine levels to improve planning and recovery

August 21, 2025 · 9 min

Overview

Decision Snapshot: Ready For Pilot

The paper evaluates CFGM on three interactive benchmarks, with ablations and transfer tests. Results are consistent, but they rely on public benchmarks and, for some components, closed-source LLMs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, Bo Xu

Why It Matters For Business

CFGM makes LLM-driven agents more reliable and cheaper in long-horizon interactive tasks by improving success rates and reducing unnecessary steps. This helps automation in web navigation, virtual assistants, and simulated-process tasks where repeated interactions are costly.

Summary (TL;DR)

The paper introduces CFGM, a memory system for language-model agents that 1) uses LLMs to generate coarse focus points to guide collection of experiences, 2) extracts hybrid-grained (coarse + fine) tips from those experiences, and 3) performs fine-grained key-information extraction and self-QA during inference to correct plans. Across three interactive benchmarks, CFGM improves success rates and reduces interaction turns compared to several memory-augmented baselines.

Problem Statement

Existing memory-augmented LLM agents store knowledge at a single granularity (e.g., whole trajectories or single tips). That limits the diversity and usefulness of memories, hurts exploration, and makes short-term reflection brittle. The authors aim to make memories more useful by grounding them at coarse and fine levels using the LLM itself.

Main Contribution

CFGM framework: a three-stage memory grounding pipeline (coarse focus points → hybrid-grained tips → fine-grained key-info reflection) that uses the LLM to build and use memories.

Empirical validation: shows consistent gains on AlfWorld, WebShop and ScienceWorld versus ReAct and other memory-augmented baselines.

Key Findings

CFGM raises AlfWorld success rate to 91.00% versus 80.60% for ReAct.

Numbers: SR 91.00% vs 80.60% (+10.40 percentage points)

Practical Use: If you add CFGM-style memory grounding to an LLM agent on household planning tasks, expect a gain of roughly 10 percentage points in success rate on the evaluated AlfWorld tasks.

Evidence Ref: Table 1 (AlfWorld SR)

CFGM improves WebShop success rate from 37% to 57% (and reports a higher reward).

Numbers: SR 57% vs 37% (+20 percentage points); Reward 0.85 reported

Practical Use: For web-navigation shopping tasks, CFGM can substantially boost exact-match purchases by using experience tips and focused retrieval.

Evidence Ref: Table 1 (WebShop SR & reward)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Success Rate (AlfWorld) | 91.00% ± 0.82% | ReAct 80.60% ± 0.68% | +10.40% | AlfWorld (134 tasks) | Main Table 1 comparing methods | Table 1
Success Rate (WebShop) | 57% ± 3% | ReAct 37% ± 2% | +20% | WebShop (100 tasks) | Main Table 1 comparing methods | Table 1
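
The Delta column reports absolute differences in success rate, which are percentage points rather than relative improvements. A minimal sketch, using only the success rates reported above, shows both readings side by side; the function names are illustrative and not from the paper.

```python
def point_gain(new_pct: float, base_pct: float) -> float:
    """Absolute gain in percentage points."""
    return round(new_pct - base_pct, 2)


def relative_gain(new_pct: float, base_pct: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    return round((new_pct - base_pct) / base_pct * 100, 2)


# (CFGM, ReAct) success rates taken from the results above.
results = {"AlfWorld": (91.00, 80.60), "WebShop": (57.0, 37.0)}

for name, (cfgm, react) in results.items():
    print(f"{name}: +{point_gain(cfgm, react)} points, "
          f"+{relative_gain(cfgm, react)}% relative to ReAct")
```

Read this way, the WebShop gain of 20 points is a much larger relative change than the AlfWorld gain of 10.4 points, because the WebShop baseline is lower.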

What To Try In 7 Days

Implement a simple two-level memory: use LLM prompts to extract 3–5 coarse focus points from your task descriptions and store a small experience pool.
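
One way to start on the focus-point step is sketched below. The prompt wording, the function names, and the numbered-list reply format are all assumptions made for illustration; the paper's actual prompts are not reproduced here, and the LLM call itself is left out.

```python
import re


def build_focus_prompt(task_description: str, n_points: int = 4) -> str:
    """Build a prompt asking an LLM for coarse focus points (wording is hypothetical)."""
    return (
        f"Task: {task_description}\n"
        f"List {n_points} short focus points an agent should pay attention to "
        "while attempting this task. Answer as a numbered list."
    )


def parse_focus_points(reply: str, max_points: int = 5) -> list[str]:
    """Extract items from a numbered or bulleted list in the LLM reply."""
    points = []
    for line in reply.splitlines():
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)", line)
        if match:
            points.append(match.group(1))
    return points[:max_points]


# A hand-written reply standing in for real LLM output.
sample_reply = (
    "1. Locate the target object first\n"
    "2. Check containers before open surfaces\n"
    "- Confirm the object state before placing it"
)
print(build_focus_prompt("put a clean mug in the cabinet"))
print(parse_focus_points(sample_reply))
```

Capping the parsed list keeps the number of focus points within the 3 to 5 range suggested above, even if the model returns more.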

Distill two hybrid-grained tips per successful experience (one coarse, one fine) and test top-k=2 retrieval during inference.
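
A small sketch of top-k tip retrieval follows. It swaps in a standard-library bag-of-words cosine similarity so it runs without dependencies; it is not the Faiss-based retrieval listed under System Optimization. The stored tasks and tips are made up for illustration.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_tips(query: str, memory: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored experiences whose task text is most similar to the query."""
    q = bow(query)
    ranked = sorted(memory, key=lambda m: cosine(q, bow(m["task"])), reverse=True)
    return ranked[:k]


# Hypothetical memory: one coarse and one fine tip per successful experience.
memory = [
    {"task": "put a clean mug in the cabinet",
     "coarse": "Find the object, clean it, then place it.",
     "fine": "Mugs are usually on the countertop or in the sink."},
    {"task": "heat an egg and put it on the table",
     "coarse": "Find the object, heat it, then place it.",
     "fine": "Use the microwave to heat food."},
    {"task": "buy red running shoes under 50 dollars",
     "coarse": "Search, filter, then check options.",
     "fine": "Select the color option before clicking buy."},
]

for hit in retrieve_tips("put a clean plate in the cabinet", memory, k=2):
    print(hit["coarse"], "|", hit["fine"])
```

In a real pilot the similarity function would likely be replaced by embedding search, while the top-k interface stays the same.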

Add a simple key-information extractor that summarizes current trajectory state and run a short self-QA before repeating failing actions.
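
The key-information step can be prototyped with plain rules before involving an LLM. The sketch below is one possible shape, not the paper's extractor: the `failure_marker` string, the dictionary fields, and the prompt wording are assumptions.

```python
def extract_key_info(
    trajectory: list[tuple[str, str]],
    last_n: int = 3,
    failure_marker: str = "nothing happens",
) -> dict:
    """Summarize recent (action, observation) pairs into a compact state."""
    recent = trajectory[-last_n:]
    # An action counts as failed if its observation contains the marker (assumed convention).
    failed = [action for action, obs in recent if failure_marker in obs.lower()]
    return {
        "steps_taken": len(trajectory),
        "recent_actions": [action for action, _ in recent],
        "failed_actions": failed,
    }


def needs_self_qa(key_info: dict, next_action: str) -> bool:
    """Trigger reflection when the proposed action already failed recently."""
    return next_action in key_info["failed_actions"]


def self_qa_prompt(key_info: dict, next_action: str) -> str:
    """Build a short self-questioning prompt (wording is hypothetical)."""
    return (
        f"You are about to repeat '{next_action}', which failed recently.\n"
        f"Recent actions: {key_info['recent_actions']}\n"
        "Q: Why did it fail? Q: What precondition is missing? "
        "Q: What should be done instead?"
    )


trajectory = [
    ("go to fridge 1", "You arrive at fridge 1."),
    ("take mug 1", "Nothing happens."),
]
info = extract_key_info(trajectory)
if needs_self_qa(info, "take mug 1"):
    print(self_qa_prompt(info, "take mug 1"))
```

Gating the self-QA on a repeated failure keeps the extra LLM call off the common path, which matters when each interaction turn is costly.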

Agent Features

Memory
Coarse-to-fine grounded memory; Experience pool (offline trajectories); Tips dictionary (hybrid-grained); Key-information short-term reflection
Planning
Planning with LLMs; Task Decomposition; Short-term Memory; Long-term Memory; Retrieval Memory
Tool Use
Tool Selection; Function Calling
Frameworks
CFGM
Is Agentic

Yes

Architectures
single-agent
Collaboration
self-reflection (single-agent)

Optimization Features

Token Efficiency
CFGM reports fewer average interaction turns (14.32) and competitive offline token cost
System Optimization
Faiss-based retrieval for fast experience lookup
Training Optimization
LoRA
Inference Optimization
Top-k experience retrieval (k=2–3 recommended); Key-info triggered reflection to avoid repeated invalid actions

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

CFGM still needs enough training tasks; very small training sets can limit effective experience collection.

Long experience trajectories can introduce redundancy or interference; performance drops when similar experiences are very long.

When Not To Use

When you only have a handful of training tasks and cannot collect diverse experiences.

When experience trajectories are extremely long and you cannot apply dynamic filtering to remove redundant steps.
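
As a rough illustration of what filtering redundant steps could look like, the sketch below drops exact repeats of an earlier step and caps the trajectory length. This is a hypothetical heuristic, not the dynamic filtering described in the paper; the function name and the `max_steps` default are assumptions.

```python
def filter_trajectory(
    steps: list[tuple[str, str]], max_steps: int = 20
) -> list[tuple[str, str]]:
    """Drop repeats of an earlier (action, observation) pair, then keep the last max_steps."""
    seen = set()
    kept = []
    for step in steps:
        if step in seen:
            continue  # identical action with identical outcome adds no new information
        seen.add(step)
        kept.append(step)
    return kept[-max_steps:]


steps = [
    ("look", "You are in the kitchen."),
    ("look", "You are in the kitchen."),
    ("open drawer 1", "The drawer is empty."),
    ("look", "You are in the kitchen."),
    ("take key 1", "You pick up the key."),
]
for action, observation in filter_trajectory(steps):
    print(action, "->", observation)
```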

Failure Modes

Retrieving too many experiences or low-quality experiences can add noise and reduce performance.

Poorly aligned or conflicting tips can cause the agent to oscillate instead of converging to a plan.

Core Entities

Models

gpt-4-turbo, gpt-4o, GPT-4.1, Qwen2.5-7B-Instruct, Qwen

Metrics

Success Rate (SR), Reward, On. Tokens, Off. Tokens, On. Turns

Datasets

AlfWorld, WebShop, ScienceWorld, WebArena-Shopping

Benchmarks

AlfWorld, WebShop, ScienceWorld, WebArena-Shopping

Context Entities

Models

gpt-4-turbo-2024-04-09, gpt-4o-2024-08-06, gpt-4-2024-04-09

Metrics

Success Rate reported with mean ± s.e.; Reward function used in WebShop
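
For readers reproducing a "mean ± s.e." figure on their own runs, a minimal sketch follows. It assumes the standard error is the sample standard deviation divided by the square root of the number of runs; this summary does not state the number of runs or the exact definition used. The run values are hypothetical, not figures from the paper.

```python
import math
import statistics


def mean_and_se(values: list[float]) -> tuple[float, float]:
    """Sample mean and standard error (sample std / sqrt(n))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Hypothetical success rates (in %) from three independent runs.
runs = [90.0, 91.0, 92.0]
mean, se = mean_and_se(runs)
print(f"{mean:.2f}% ± {se:.2f}%")
```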

Datasets

AlfWorld tasks subset (134 tasks); WebShop 100 tasks; ScienceWorld 100 tasks; WebArena-Shopping 98 tasks

Benchmarks

AlfWorld (household planning); WebShop (shopping navigation); ScienceWorld (text-based science tasks)