Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
CFGM makes LLM-driven agents more reliable and cheaper in long-horizon interactive tasks by improving success rates and reducing unnecessary steps. This helps automation in web navigation, virtual assistants, and simulated-process tasks where repeated interactions are costly.
Summary TLDR
The paper introduces CFGM, a coarse-to-fine grounded memory system for language-model agents that (1) uses LLMs to generate coarse focus points that guide experience collection, (2) extracts hybrid-grained (coarse and fine) tips from those experiences, and (3) performs fine-grained key-information extraction and self-QA during inference to correct plans. Across three interactive benchmarks, CFGM improves success rates and reduces interaction turns compared with several memory-augmented baselines.
Problem Statement
Existing memory-augmented LLM agents store knowledge at a single granularity (e.g., whole trajectories or single tips). That limits the diversity and usefulness of memories, hurts exploration, and makes short-term reflection brittle. The authors aim to make memories more useful by grounding them at coarse and fine levels using the LLM itself.
Main Contribution
CFGM framework: a three-stage memory grounding pipeline (coarse focus points → hybrid-grained tips → fine-grained key-info reflection) that uses the LLM to build and use memories.
Empirical validation: shows consistent gains on AlfWorld, WebShop and ScienceWorld versus ReAct and other memory-augmented baselines.
Ablations and analyses: show that each component (focus points, tips, key-info reflection) helps and that a moderate top-k retrieval (k=2–3) yields the best tip quality.
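The three stages can be sketched as plain functions around an LLM call. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the function names, and every prompt string are hypothetical and only mirror the stage descriptions above.

```python
from typing import Callable, Dict, List

# Hypothetical LLM interface: takes a prompt string, returns the reply text.
LLM = Callable[[str], str]


def generate_focus_points(llm: LLM, task_description: str) -> List[str]:
    """Stage 1: ask the LLM for coarse focus points, one per line."""
    reply = llm(f"List the key focus points for this task, one per line:\n{task_description}")
    # Drop empty lines and leading bullet characters from the reply.
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]


def extract_tips(llm: LLM, trajectory: str) -> Dict[str, str]:
    """Stage 2: distill one coarse and one fine tip from a collected experience."""
    coarse = llm(f"Give one general strategy tip learned from this trajectory:\n{trajectory}")
    fine = llm(f"Give one step-level tip learned from this trajectory:\n{trajectory}")
    return {"coarse": coarse.strip(), "fine": fine.strip()}


def key_info_reflection(llm: LLM, trajectory_so_far: str) -> str:
    """Stage 3: extract key facts, pose a self-question, and answer it."""
    key_info = llm(f"Extract the key facts from this trajectory:\n{trajectory_so_far}")
    question = llm(f"Ask one question that checks the current plan against these facts:\n{key_info}")
    return llm(f"Facts:\n{key_info}\nQuestion: {question}\nAnswer the question and revise the plan.")
```

Passing the model as a callable keeps the sketch independent of any particular provider and makes each stage testable with a stub.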
Key Findings
CFGM raises AlfWorld success rate to 91.00% versus 80.60% for ReAct.
CFGM improves WebShop success rate from 37% to 57% (and reports a higher reward).
CFGM lifts ScienceWorld success rate from 43% to 74% on evaluated tasks.
Each CFGM component adds value, and the full system performs best: combining all three (FP+ET+KIR) yields a 91.00% success rate (SR) on AlfWorld.
A small, focused retrieval scope gives the best tip quality: top-k = 2 or 3 works best on AlfWorld.
CFGM reduces average interaction turns versus memory-augmented baselines and keeps offline token cost competitive.
Hybrid-grained tips transfer to a related but different shopping environment better than baselines.
Key Information Reflection (KIR) outperforms fixed-question and naive self-QA reflection.
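The headline success rates above can be restated as absolute and relative gains. The short calculation below uses only the percentages quoted in this list and treats the lower figure for each benchmark as the baseline. The relative gains are derived here and are not figures reported in the summary.

```python
# (baseline SR %, CFGM SR %) as quoted in the findings above.
reported = {
    "AlfWorld": (80.60, 91.00),
    "WebShop": (37.0, 57.0),
    "ScienceWorld": (43.0, 74.0),
}


def gains(baseline: float, cfgm: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = cfgm - baseline
    relative = 100 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


for name, (baseline, cfgm) in reported.items():
    absolute, relative = gains(baseline, cfgm)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```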
Results
Success Rate (AlfWorld)
Success Rate (WebShop)
Reward (WebShop)
Success Rate (ScienceWorld)
Average online interaction turns
Top-k retrieval effect (AlfWorld SR)
Out-of-domain transfer (WebArena-Shopping SR)
Who Should Care
What To Try In 7 Days
Implement a simple two-level memory: use LLM prompts to extract 3–5 coarse focus points from your task descriptions and store a small experience pool.
Distill two hybrid-grained tips per successful experience (one coarse, one fine) and test top-k=2 retrieval during inference.
Add a simple key-information extractor that summarizes current trajectory state and run a short self-QA before repeating failing actions.
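A starting point for the first two steps is a small in-memory store that keeps one coarse and one fine tip per successful experience and retrieves the top-k most similar tasks. This is a sketch under assumptions: the `Experience` and `TipMemory` names are hypothetical, and word-overlap (Jaccard) similarity stands in for whatever embedding-based retrieval you would use in practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    task: str                 # task description the experience came from
    focus_points: List[str]   # coarse focus points used during collection
    coarse_tip: str           # general strategy tip
    fine_tip: str             # step-level tip


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class TipMemory:
    def __init__(self) -> None:
        self.pool: List[Experience] = []

    def add(self, experience: Experience) -> None:
        self.pool.append(experience)

    def retrieve(self, task: str, k: int = 2) -> List[Experience]:
        """Return the k stored experiences whose tasks best match `task`."""
        ranked = sorted(self.pool, key=lambda e: jaccard(task, e.task), reverse=True)
        return ranked[:k]

    def prompt_context(self, task: str, k: int = 2) -> str:
        """Format the retrieved tips for inclusion in the agent prompt."""
        lines = []
        for exp in self.retrieve(task, k):
            lines.append(f"[coarse] {exp.coarse_tip}")
            lines.append(f"[fine] {exp.fine_tip}")
        return "\n".join(lines)
```

Keeping `k` as a parameter makes it easy to compare k=2 against larger values on your own tasks before committing to one setting.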
Agent Features
Memory
- Coarse-to-fine grounded memory
- Experience pool (offline trajectories)
- Tips dictionary (hybrid-grained)
- Key-information short-term reflection
Planning
- Planning with LLMs
- Task Decomposition
- Short-term Memory
- Long-term Memory
- Retrieval Memory
Tool Use
- Tool Selection
- Function Calling
Frameworks
- CFGM
Is Agentic
true
Architectures
- single-agent
Collaboration
- self-reflection (single-agent)
Optimization Features
Token Efficiency
- CFGM reports fewer average interaction turns (14.32) and competitive offline token cost
System Optimization
- Faiss-based retrieval for fast experience lookup
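To show what vector-based experience lookup does, here is a dependency-free sketch of exact cosine-similarity search. It is a pure-Python stand-in for illustration, not Faiss itself; the `FlatIndex` class and its methods are hypothetical names, and a real deployment would hand the same vectors to a proper vector index.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


class FlatIndex:
    """Brute-force index: compares the query against every stored vector."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []

    def add(self, v: Sequence[float]) -> None:
        self.vectors.append(normalize(v))

    def search(self, query: Sequence[float], k: int) -> List[Tuple[float, int]]:
        """Return up to k (score, index) pairs, highest score first."""
        q = normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self.vectors)
        )
        return heapq.nlargest(k, scored)
```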
Training Optimization
- LoRA
Inference Optimization
- Top-k experience retrieval (k=2–3 recommended)
- Key-info triggered reflection to avoid repeated invalid actions
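One simple way to decide when to run reflection is to count how often the same action fails in a row. The sketch below is an assumption about how such a trigger could look, not the paper's mechanism: the `ReflectionTrigger` class, the `max_repeats` threshold, and the failure phrases in `INVALID_MARKERS` are all hypothetical and should be adapted to your environment's feedback.

```python
from collections import Counter

# Assumed phrases that mark an action as invalid; adjust per environment.
INVALID_MARKERS = ("nothing happens", "invalid action")


class ReflectionTrigger:
    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self.failed = Counter()  # action -> consecutive failure count

    def observe(self, action: str, observation: str) -> bool:
        """Record one step; return True when reflection should run."""
        if any(marker in observation.lower() for marker in INVALID_MARKERS):
            self.failed[action] += 1
            return self.failed[action] >= self.max_repeats
        # The action succeeded, so forget its earlier failures.
        self.failed.pop(action, None)
        return False
```

When `observe` returns True, the agent would pause, summarize the key information in the trajectory so far, and run its self-QA step before choosing the next action.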
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- CFGM still needs enough training tasks; very small training sets can limit effective experience collection.
- Long experience trajectories can introduce redundancy or interference; performance drops when similar experiences are very long.
- Relies on strong LLMs (authors use GPT-4 family); behavior may vary with weaker or unavailable models.
When Not To Use
- When you only have a handful of training tasks and cannot collect diverse experiences.
- When experience trajectories are extremely long and you cannot apply dynamic filtering to remove redundant steps.
- When you cannot access sufficiently capable LLMs for generating focus points and tips.
Failure Modes
- Retrieving too many experiences or low-quality experiences can add noise and reduce performance.
- Poorly aligned or conflicting tips can cause the agent to oscillate instead of converging to a plan.
- Over-reliance on closed-source LLM behavior may reduce reproducibility or increase operational cost.
Core Entities
Models
- gpt-4-turbo
- gpt-4o
- GPT-4.1
- Qwen2.5-7B-Instruct
- Qwen
Metrics
- Success Rate (SR)
- Reward
- Online tokens (On. Tokens)
- Offline tokens (Off. Tokens)
- Online turns (On. Turns)
Datasets
- AlfWorld
- WebShop
- ScienceWorld
- WebArena-Shopping
Benchmarks
- AlfWorld
- WebShop
- ScienceWorld
- WebArena-Shopping
Context Entities
Models
- gpt-4-turbo-2024-04-09
- gpt-4o-2024-08-06
- gpt-4-2024-04-09
Metrics
- Success Rate reported with mean ± s.e.
- Reward function used in WebShop
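For readers reproducing a "mean ± s.e." figure, the usual calculation is the sample standard deviation divided by the square root of the number of runs. The sketch below assumes the standard error is taken across repeated runs, which this summary does not confirm, and the run values are made-up illustrative numbers rather than results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated-run scores."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator) over sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Illustrative success rates from three hypothetical runs.
runs = [0.90, 0.92, 0.91]
mean, se = mean_and_standard_error(runs)
print(f"SR = {100 * mean:.2f} ± {100 * se:.2f}")
```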
Datasets
- AlfWorld tasks subset (134 tasks)
- WebShop 100 tasks
- ScienceWorld 100 tasks
- WebArena-Shopping 98 tasks
Benchmarks
- AlfWorld (household planning)
- WebShop (shopping navigation)
- ScienceWorld (text-based science tasks)

