Overview
Production Readiness
0.4
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
G-MemLLM improves evidence-grounded QA and relation extraction with a small trainable memory add-on (under 3% extra parameters), delivering accuracy gains without fine-tuning the full model.
Summary TLDR
G-MemLLM attaches a small trainable latent memory bank to a frozen LLM and uses a GRU-style gate to decide when to write to a memory slot and when to keep it. The module improves multi-hop QA and zero-shot relation extraction across model sizes. Example gains: +8.56 Answer F1 on HotpotQA (GPT-2), +6.89 Supporting Fact F1 on HotpotQA (Llama 3.1-8B), and +13.3 percentage points of accuracy on ZsRE (Llama 3.1-8B). The memory adds less than 3% extra parameters and is trained with a composite loss (task + sparsity + entropy) that encourages focused, diverse slot usage.
Problem Statement
Transformers hit a practical ceiling on long-context tasks: self-attention cost grows quadratically with sequence length, and compressed or recurrent context methods lose specific facts over long horizons. Models need a small, persistent working memory that selectively preserves multi-hop facts without fine-tuning the whole LLM.
Main Contribution
G-MemLLM architecture: frozen LLM + trainable latent memory bank with GRU-style gated updates.
Training objective combining task loss with sparsity and entropy regularizers to focus and diversify memory usage.
Empirical scaling study showing consistent gains on HotpotQA and ZsRE from GPT-2 (124M) to Llama 3.1 (8B).
Key Findings
G-MemLLM raises ZsRE accuracy by 13.3 percentage points on Llama 3.1-8B.
GPT-2 (124M) Answer F1 on HotpotQA improved by 8.56 points after adding G-MemLLM.
Llama 3.1-8B Supporting Fact F1 increased by 6.89 points with the memory module.
Using 1024 memory slots gave near-optimal ZsRE gains; doubling to 2048 slots added only a further 0.28%.
Results
[Charts, values not recoverable: Accuracy; Answer F1 (HotpotQA, GPT-2); Supporting Fact F1 (HotpotQA, Llama 3.1-8B); ZsRE score vs. slot count (Llama 3.1-8B)]
Who Should Care
What To Try In 7 Days
Attach a gated latent memory module to a frozen LLM and train only the memory module on a downstream task.
Start with 1024 memory slots and the composite loss (task + sparsity + entropy).
Evaluate Supporting Fact F1 and Answer F1 on your multi-hop QA data to measure the benefit.
Agent Features
Memory
- trainable latent memory bank (slot-based)
- GRU-style gated update to prevent memory drift
- encoder/decoder for dimension mapping
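The gated update listed above can be illustrated with a minimal sketch. It shows only the interpolation step of a GRU-style update gate: a gate near 0 keeps the old slot, a gate near 1 overwrites it with the candidate. The function names, the per-slot scalar gate, and the fact that the gate logits are passed in directly (instead of being produced by a learned projection) are assumptions made for illustration, not details taken from the paper.

```python
import math

def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(memory, candidate, gate_logits):
    """Blend each memory slot with its candidate using a per-slot gate.

    Assumption: one scalar gate per slot. A gate near 0 keeps the old
    slot; a gate near 1 replaces it with the candidate content.
    """
    updated = []
    for old_slot, new_slot, logit in zip(memory, candidate, gate_logits):
        g = sigmoid(logit)
        updated.append([(1.0 - g) * o + g * n for o, n in zip(old_slot, new_slot)])
    return updated

# Toy example: two slots of dimension 2.
memory = [[1.0, 1.0], [0.0, 0.0]]
candidate = [[5.0, 5.0], [2.0, 2.0]]
gate_logits = [-20.0, 20.0]  # slot 0: keep, slot 1: write
new_memory = gated_update(memory, candidate, gate_logits)
```

With these toy logits, slot 0 stays close to its old value and slot 1 moves to its candidate, which is the behavior that would protect stored facts from drifting when the gate stays closed.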
Planning
- memory loop: extract→retrieve→inject→consolidate
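The order of the four stages in the loop above can be sketched as a small driver function. Only the stage order (extract, retrieve, inject, consolidate) comes from the summary; what each stage receives and returns, and the toy stage implementations on scalar values, are hypothetical and chosen only to make the sketch runnable.

```python
def memory_step(hidden, memory, extract, retrieve, inject, consolidate):
    """Run one pass of the loop; every stage is a caller-supplied function."""
    summary = extract(hidden)                   # compress the LLM hidden states
    recalled = retrieve(summary, memory)        # read relevant memory content
    enriched = inject(hidden, recalled)         # feed recalled content back in
    new_memory = consolidate(memory, summary)   # decide what to write or keep
    return enriched, new_memory

# Toy stages on scalars (assumptions for illustration only).
def extract(hidden):
    return sum(hidden) / len(hidden)

def retrieve(summary, memory):
    return min(memory, key=lambda slot: abs(slot - summary))

def inject(hidden, recalled):
    return [h + recalled for h in hidden]

def consolidate(memory, summary):
    return [0.5 * slot + 0.5 * summary for slot in memory]

hidden = [1.0, 3.0]
memory = [0.0, 3.0]
enriched, new_memory = memory_step(hidden, memory, extract, retrieve, inject, consolidate)
```

Passing the stages in as functions keeps the frozen backbone and the trainable memory decoupled in the sketch: any stage can be swapped without touching the loop itself.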
Frameworks
- cross-attention retrieval
- gated injection layer
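A minimal sketch of the two components above, assuming single-head scaled dot-product attention in which the memory slots act as both keys and values, and a scalar injection gate. A real module would likely use learned query, key, and value projections and a learned gate; those are omitted here, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, memory):
    """Scaled dot-product attention of one query vector over memory slots."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, slot) / scale for slot in memory])
    dim = len(memory[0])
    read = [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]
    return read, weights

def inject(hidden, read, gate):
    """Add the retrieved vector to the hidden state, scaled by a gate in [0, 1]."""
    return [h + gate * r for h, r in zip(hidden, read)]

memory = [[1.0, 0.0], [0.0, 1.0]]
query = [10.0, 0.0]
read, weights = retrieve(query, memory)
hidden_out = inject([0.5, 0.5], read, gate=0.0)  # closed gate leaves hidden unchanged
```

Because the query attends over a fixed number of slots instead of every context token, the read cost in this sketch depends on the slot count, not on the length of the original context.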
Is Agentic
true
Architectures
- frozen LLM backbone
- trainable latent memory bank
Optimization Features
Token Efficiency
- reduces need to include all context tokens in the LLM context window
Model Optimization
- adds <3% extra parameters (memory-only trainable)
System Optimization
- decouples language processing and state storage for cheaper updates
Training Optimization
- composite loss: task cross-entropy + sparsity (L1) + entropy regularizer
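The composite objective can be written out as a short sketch. The summary only names the three terms, so the details here are assumptions: the L1 penalty is applied to gate activations, the entropy is taken over a slot-usage distribution and subtracted so that spread-out usage is rewarded, and the weights `lam_sparse` and `lam_entropy` are hypothetical values.

```python
import math

def cross_entropy(probs, target_index):
    """Task loss: negative log-probability of the correct class."""
    return -math.log(probs[target_index])

def l1_sparsity(gates):
    """Mean absolute gate activation; smaller means fewer active writes."""
    return sum(abs(g) for g in gates) / len(gates)

def entropy(usage):
    """Shannon entropy of a slot-usage distribution (natural log)."""
    return -sum(p * math.log(p) for p in usage if p > 0.0)

def composite_loss(probs, target_index, gates, usage,
                   lam_sparse=0.01, lam_entropy=0.01):
    """Task loss + sparsity penalty - entropy bonus (signs are assumptions)."""
    task = cross_entropy(probs, target_index)
    return task + lam_sparse * l1_sparsity(gates) - lam_entropy * entropy(usage)

probs = [0.5, 0.5]
gates = [1.0, 0.0, 0.0, 0.0]
loss_uniform = composite_loss(probs, 0, gates, usage=[0.25, 0.25, 0.25, 0.25])
loss_peaked = composite_loss(probs, 0, gates, usage=[1.0, 0.0, 0.0, 0.0])
```

Under these assumptions, usage concentrated on a single slot yields a higher loss than uniform usage, so the entropy term pushes toward diverse slots while the L1 term keeps individual writes sparse.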
Inference Optimization
- memory retrieval via cross-attention limits full-context attention costs
Reproducibility
Data Urls
- HotpotQA (public)
- ZsRE (public)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Experiments limited to HotpotQA and ZsRE; generality to other tasks untested.
- Adds compute and implementation complexity at training and inference.
- No public code release noted in paper; reproduction details limited.
- Memory benefits saturate with very large slot counts (diminishing returns).
When Not To Use
- When you only need short-context inference and cannot afford extra latency.
- When external retrieval of large corpora (RAG) is the required capability.
- When you require fully open-source, production-ready implementations not provided here.
Failure Modes
- Gate mislearning could cause useful facts to be overwritten or never stored.
- Memory saturation or slot noise can dilute stored facts if sparsity/entropy weights are mis-tuned.
- Performance gains may not transfer beyond the tested tasks or domains.
Core Entities
Models
- GPT-2 (124M)
- Llama 3.1 (8B)
- G-MemLLM module
Metrics
- Answer EM
- Answer F1
- Supporting Fact EM
- Supporting Fact F1
- Joint EM
- Joint F1
- Accuracy
Datasets
- HotpotQA
- ZsRE
Benchmarks
- HotpotQA
- ZsRE
Context Entities
Models
- Recurrent Memory Transformer (RMT)
- MemoryLLM
- M+
Metrics
- Context rot (qualitative)
Datasets
- LOCCO (cited for context rot)
Benchmarks
- LOCCO (related literature)

