Ground LLM agents' memories at coarse and fine levels to improve planning and recovery

Overview

Decision Snapshot: Ready For Pilot

The paper evaluates CFGM across three interactive benchmarks with ablations and transfer tests. Results are consistent but depend on public benchmarks and, for some components, on closed-source LLMs.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 8/8

Findings with evidence refs: 8/8

Results with explicit delta: 7/7

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Wei Yang, Jinwei Xiao, Hongming Zhang, Qingyang Zhang, Yanna Wang, Bo Xu

Why It Matters For Business

CFGM makes LLM-driven agents more reliable and cheaper in long-horizon interactive tasks by improving success rates and reducing unnecessary steps. This helps automation in web navigation, virtual assistants, and simulated-process tasks where repeated interactions are costly.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO

Summary TLDR

The paper introduces CFGM, a memory system for language-model agents that 1) uses LLMs to generate coarse focus points to guide collection of experiences, 2) extracts hybrid-grained (coarse + fine) tips from those experiences, and 3) performs fine-grained key-information extraction and self-QA during inference to correct plans. Across three interactive benchmarks, CFGM improves success rates and reduces interaction turns compared to several memory-augmented baselines.

Problem Statement

Existing memory-augmented LLM agents store knowledge at a single granularity (e.g., whole trajectories or single tips). That limits the diversity and usefulness of memories, hurts exploration, and makes short-term reflection brittle. The authors aim to make memories more useful by grounding them at coarse and fine levels using the LLM itself.

Main Contribution

CFGM framework: a three-stage memory grounding pipeline (coarse focus points → hybrid-grained tips → fine-grained key-info reflection) that uses the LLM to build and use memories.

Empirical validation: shows consistent gains on AlfWorld, WebShop and ScienceWorld versus ReAct and other memory-augmented baselines.
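
The three stages can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable type, the prompt wording, and the function names `focus_points`, `extract_tips`, and `reflect` are all hypothetical, and a stub stands in for a real model so the sketch runs offline.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def focus_points(llm: LLM, task: str, n: int = 3) -> List[str]:
    """Stage 1: ask the LLM for coarse focus points that guide experience collection."""
    reply = llm(f"List {n} things to focus on for this task, one per line:\n{task}")
    return [line.strip() for line in reply.splitlines() if line.strip()][:n]


def extract_tips(llm: LLM, trajectory: str) -> dict:
    """Stage 2: distill one coarse and one fine tip from a collected trajectory."""
    coarse = llm(f"Give one high-level strategy tip from this trajectory:\n{trajectory}")
    fine = llm(f"Give one step-level tip from this trajectory:\n{trajectory}")
    return {"coarse": coarse.strip(), "fine": fine.strip()}


def reflect(llm: LLM, history: List[str]) -> str:
    """Stage 3: extract key information from recent steps, then self-question the plan."""
    key_info = llm("Summarize the key facts so far:\n" + "\n".join(history))
    return llm(f"Given these facts, what should change in the plan?\n{key_info}").strip()


def stub_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs without an API; returns canned text."""
    if prompt.startswith("List"):
        return "object location\nrequired tool\ngoal state"
    return "canned reply"


points = focus_points(stub_llm, "put a clean mug on the desk")
tips = extract_tips(stub_llm, "go to sink; clean mug; go to desk; put mug")
advice = reflect(stub_llm, ["go to sink", "Nothing happens."])
```

Passing the model in as a plain callable keeps the stages testable and lets the same code run against either a hosted or a local model.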

Key Findings

CFGM raises AlfWorld success rate to 91.00% versus 80.60% for ReAct.

Numbers: SR 91.00% vs 80.60% (+10.40 percentage points)

Practical Use: If you add CFGM-style memory grounding to an LLM agent on household planning tasks, expect a success-rate gain of roughly 10 percentage points on the evaluated AlfWorld tasks.

Evidence Ref: Table 1 (AlfWorld SR)

CFGM improves WebShop success rate from 37% to 57% (and reports a higher reward).

Numbers: SR 57% vs 37% (+20 percentage points); Reward 0.85 reported

Practical Use: For web-navigation shopping tasks, CFGM can substantially boost exact-match purchases by using experience tips and focused retrieval.

Evidence Ref: Table 1 (WebShop SR & reward)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Success Rate (AlfWorld)	91.00% ± 0.82%	ReAct 80.60% ± 0.68%	+10.40 pp	AlfWorld (134 tasks)	Main Table 1 comparing methods	Table 1
Success Rate (WebShop)	57% ± 3%	ReAct 37% ± 2%	+20 pp	WebShop (100 tasks)	Main Table 1 comparing methods	Table 1
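
The deltas in this table are absolute differences between two success rates, so they are percentage points rather than relative improvements. The short calculation below shows both readings for the reported figures; the helper name `gains` is illustrative only.

```python
def gains(method_sr: float, baseline_sr: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = method_sr - baseline_sr
    relative = 100.0 * absolute / baseline_sr
    return round(absolute, 2), round(relative, 2)


alfworld = gains(91.00, 80.60)  # CFGM vs ReAct on AlfWorld
webshop = gains(57.0, 37.0)     # CFGM vs ReAct on WebShop
```

On these numbers the AlfWorld gain is 10.4 points (about 12.9% relative to ReAct), and the WebShop gain is 20 points (about 54% relative), which is why the WebShop result looks much larger in relative terms.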

What To Try In 7 Days

Implement a simple two-level memory: use LLM prompts to extract 3–5 coarse focus points from your task descriptions and store a small experience pool.

Distill two hybrid-grained tips per successful experience (one coarse, one fine) and test top-k=2 retrieval during inference.

Add a simple key-information extractor that summarizes current trajectory state and run a short self-QA before repeating failing actions.
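
A first version of the tip store and top-k retrieval from the steps above might look like the sketch below. It is an assumption-laden starting point, not the paper's method: the `TipStore` class and its methods are hypothetical, and a toy bag-of-words cosine similarity stands in for embedding search so the example has no dependencies.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bow(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use learned embeddings."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TipStore:
    """Hypothetical store mapping a past task description to its two tips."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, Dict[str, str]]] = []

    def add(self, task: str, coarse: str, fine: str) -> None:
        self.entries.append((task, {"coarse": coarse, "fine": fine}))

    def retrieve(self, query: str, k: int = 2) -> List[Dict[str, str]]:
        """Return tips for the k stored tasks most similar to the query."""
        q = bow(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, bow(e[0])), reverse=True)
        return [tips for _, tips in ranked[:k]]


store = TipStore()
store.add("put a clean mug on the desk", "clean before placing", "mugs are often near the sink")
store.add("heat an egg and put it on the table", "heat before placing", "use the microwave")
store.add("buy a red cotton shirt under 20 dollars", "filter by attributes first", "check the price on the item page")

top = store.retrieve("put a clean plate on the desk", k=2)
```

Here the mug task ranks first and the egg task second, so the shopping tips are never injected into a household prompt. Swapping `bow` and `cosine` for an embedding model and a vector index changes the ranking quality but not the interface.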

Agent Features

Memory

Coarse-to-fine grounded memory

Experience pool (offline trajectories)

Tips dictionary (hybrid-grained)

Key-information short-term reflection

Planning

Planning with LLMs, Task Decomposition, Short-term Memory, Long-term Memory, Retrieval Memory

Tool Use

Tool Selection, Function Calling

Frameworks

CFGM

Is Agentic

Yes

Architectures

single-agent

Collaboration

self-reflection (single-agent)

Optimization Features

Token Efficiency

CFGM reports fewer average interaction turns (14.32) and competitive offline token cost

System Optimization

Faiss-based retrieval for fast experience lookup

Training Optimization

LoRA

Inference Optimization

Top-k experience retrieval (k=2–3 recommended)

Key-info triggered reflection to avoid repeated invalid actions
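
One simple way to trigger reflection before an invalid action is repeated is to check the episode history for an earlier failed attempt at the same action. The sketch below assumes an AlfWorld-style environment that answers invalid actions with "Nothing happens."; the function name `should_reflect` and the history format are hypothetical, and the paper's actual trigger may differ.

```python
from typing import List, Tuple

# Assumed failure marker: AlfWorld-style environments reply "Nothing happens." to
# invalid actions; other environments will need a different check.
INVALID_OBSERVATION = "Nothing happens."


def should_reflect(history: List[Tuple[str, str]], next_action: str) -> bool:
    """Return True if next_action already failed earlier in this episode."""
    return any(
        action == next_action and observation == INVALID_OBSERVATION
        for action, observation in history
    )


history = [
    ("go to sink 1", "You arrive at sink 1."),
    ("take mug 1 from sink 1", "Nothing happens."),
]
retry = should_reflect(history, "take mug 1 from sink 1")  # same action failed before
fresh = should_reflect(history, "open cabinet 1")          # not tried yet
```

When the check fires, the agent would run its key-information summary and self-QA step instead of spending another turn on the same failing action.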

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

CFGM still needs enough training tasks; very small training sets can limit effective experience collection.

Long experience trajectories can introduce redundancy or interference; performance drops when similar experiences are very long.

When Not To Use

When you only have a handful of training tasks and cannot collect diverse experiences.

When experience trajectories are extremely long and you cannot apply dynamic filtering to remove redundant steps.
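
If long trajectories are the concern, a crude filter can be tried before storing or retrieving experiences. The sketch below is one possible heuristic, not the dynamic filtering the paper refers to: it drops exact repeats of earlier steps and then keeps only the most recent steps. The name `compress_trajectory` and the `max_steps` default are assumptions.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (action, observation)


def compress_trajectory(steps: List[Step], max_steps: int = 20) -> List[Step]:
    """Drop exact repeats of earlier steps, then keep only the most recent max_steps."""
    seen = set()
    kept: List[Step] = []
    for step in steps:
        if step not in seen:
            seen.add(step)
            kept.append(step)
    return kept[-max_steps:]


raw = [
    ("look", "You are in the kitchen."),
    ("open fridge 1", "Nothing happens."),
    ("open fridge 1", "Nothing happens."),
    ("go to fridge 1", "You arrive at fridge 1."),
    ("open fridge 1", "You open the fridge 1."),
]
short = compress_trajectory(raw)
```

The duplicate failed attempt is removed while the first failure and the later success are both kept, since the contrast between them is often the useful part of the experience.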

Failure Modes

Retrieving too many experiences or low-quality experiences can add noise and reduce performance.

Poorly aligned or conflicting tips can cause the agent to oscillate instead of converging to a plan.

Core Entities

Models

gpt-4-turbo, gpt-4o, GPT-4.1, Qwen2.5-7B-Instruct, Qwen

Metrics

Success Rate (SR), Reward, On. Tokens, Off. Tokens, On. Turns

Datasets

AlfWorld, WebShop, ScienceWorld, WebArena-Shopping

Benchmarks

AlfWorld, WebShop, ScienceWorld, WebArena-Shopping

Context Entities

Models

gpt-4-turbo-2024-04-09, gpt-4o-2024-08-06, gpt-4-2024-04-09

Metrics

Success Rate reported with mean ± s.e.

Reward function used in WebShop

Datasets

AlfWorld tasks subset (134 tasks), WebShop 100 tasks, ScienceWorld 100 tasks, WebArena-Shopping 98 tasks

Benchmarks

AlfWorld (household planning), WebShop (shopping navigation), ScienceWorld (text-based science tasks)
