Add a small gated latent memory to frozen LLMs to improve multi-hop reasoning and relation extraction

Overview

Production Readiness

0.4

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

Authors

Xun Xu

Why It Matters For Business

G-MemLLM boosts evidence-grounded QA and relation extraction with a tiny, trainable memory add-on, offering notable accuracy gains without full-model finetuning or large parameter increases.

Summary TLDR

G-MemLLM attaches a small trainable latent memory bank to a frozen LLM and uses a GRU-style gate to decide when to write to memory and when to keep it. The module improves multi-hop QA and zero-shot relation extraction across model sizes. Example gains: +8.56 Answer F1 on HotpotQA (GPT-2), +6.89 Supporting Fact F1 on HotpotQA (Llama 3.1-8B), and +7.40 accuracy points on ZsRE (Llama 3.1-8B; 13.3% relative). The memory adds under 3% extra parameters and is trained with a composite loss (task + sparsity + entropy) that encourages focused, diverse slots.

Problem Statement

Transformers hit a practical ceiling for long-context tasks because attention scales poorly and compressed or recurrent context methods lose specific facts over long horizons. Models need a small, persistent working memory that selectively preserves multi-hop facts without fine-tuning the whole LLM.

Main Contribution

G-MemLLM architecture: frozen LLM + trainable latent memory bank with GRU-style gated updates.

Training objective combining task loss with sparsity and entropy regularizers to focus and diversify memory usage.

Empirical scaling study showing consistent gains on HotpotQA and ZsRE from GPT-2 (124M) to Llama 3.1 (8B).

Key Findings

G-MemLLM raises ZsRE accuracy by 7.40 percentage points (13.3% relative) on Llama 3.1-8B.

Numbers: ZsRE 55.63 -> 63.03 (+7.40 points, +13.3% relative)

GPT-2 (124M) Answer F1 on HotpotQA improved by 8.56 points after adding G-MemLLM.

Numbers: Answer F1 45.52 -> 54.08 (+8.56)

Llama 3.1-8B Supporting Fact F1 increased by 6.89 points with the memory module.

Numbers: Sup Fact F1 76.53 -> 83.42 (+6.89)

Using 1024 memory slots gave near-optimal ZsRE gains; doubling to 2048 slots added only 0.18 points (+0.28% relative).

Numbers: Slots 1024 -> Score 63.03; 2048 -> 63.21 (+0.18 points, +0.28% relative)
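The figures above mix absolute point gains with relative percentages, so it helps to recompute both from the quoted scores. The following is a minimal sketch; the helper name `gains` is ours and is not from the paper.

```python
def gains(baseline, value):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Scores as quoted in the findings above.
reported = [
    ("ZsRE accuracy, Llama 3.1-8B", 55.63, 63.03),
    ("HotpotQA Answer F1, GPT-2", 45.52, 54.08),
    ("HotpotQA Sup Fact F1, Llama 3.1-8B", 76.53, 83.42),
    ("ZsRE, 1024 -> 2048 slots", 63.03, 63.21),
]

for name, base, val in reported:
    absolute, relative = gains(base, val)
    print(f"{name}: {absolute:+.2f} points, {relative:+.1f}% relative")
```

Run on the ZsRE scores, this gives a gain of about 7.40 points, which corresponds to the quoted 13.3% only when read as a relative improvement.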

Results

Accuracy (ZsRE, Llama 3.1-8B)

Value: 63.03 (G-MemLLM)

Baseline: 55.63 (Vanilla Llama 3.1-8B)

Answer F1 (HotpotQA, GPT-2)

Value: 54.08 (G-MemLLM)

Baseline: 45.52 (Vanilla GPT-2)

Supporting Fact F1 (HotpotQA, Llama 3.1-8B)

Value: 83.42 (G-MemLLM)

Baseline: 76.53 (Vanilla Llama 3.1-8B)

ZsRE score vs slot count (Llama 3.1-8B)

Value: 63.03 (1024 slots)

Baseline: 58.53 (0 slots, vanilla)

Who Should Care

ML Engineer, Data Scientist, Product Manager, Engineering Lead, CTO

What To Try In 7 Days

Attach a gated latent memory module to a frozen LLM and train only the memory module on a downstream task.

Start with 1024 memory slots and the composite loss (task + sparsity + entropy).

Evaluate Supporting Fact F1 and Answer F1 on your multi-hop QA data, with and without the memory module, to measure the benefit.
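For the evaluation step, the two F1 metrics can be computed as sketched below. This assumes the common SQuAD-style token-overlap convention for Answer F1 and a set-overlap F1 over (title, sentence index) pairs for supporting facts; the paper's exact evaluation script is not described in this summary, and the function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def answer_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def supporting_fact_f1(predicted, gold):
    """Set-level F1 over (title, sentence_index) pairs."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


print(answer_f1("the Eiffel Tower", "Eiffel Tower"))
print(supporting_fact_f1([("A", 0), ("B", 1)], [("A", 0), ("C", 2)]))
```

Averaging these per-example scores over the same evaluation set for the frozen baseline and for the memory-augmented model gives directly comparable numbers.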

Agent Features

Memory

trainable latent memory bank (slot-based)
GRU-style gated update to prevent memory drift
encoder/decoder for dimension mapping
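The gated update listed above can be illustrated with a small sketch. It assumes one scalar gate per slot computed from the current slot and a candidate vector; the paper's actual gate parameterization is not given in this summary, so the weights `w_mem`, `w_cand` and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_update(memory, candidate, w_mem, w_cand, bias):
    """Blend each memory slot with its candidate using a per-slot gate.

    gate near 0 keeps the old slot (protects against drift);
    gate near 1 overwrites the slot with the candidate.
    """
    updated, gates = [], []
    for slot, cand in zip(memory, candidate):
        g = sigmoid(dot(w_mem, slot) + dot(w_cand, cand) + bias)
        updated.append([(1.0 - g) * m + g * c for m, c in zip(slot, cand)])
        gates.append(g)
    return updated, gates


memory = [[1.0, 0.0], [0.0, 1.0]]      # two slots of dimension 2
candidate = [[0.0, 1.0], [1.0, 0.0]]   # proposed new contents
zero = [0.0, 0.0]

# A strongly negative bias closes the gate, so the memory is retained.
kept, _ = gated_update(memory, candidate, zero, zero, bias=-20.0)
# A strongly positive bias opens the gate, so the candidate is written.
written, _ = gated_update(memory, candidate, zero, zero, bias=20.0)
print(kept)
print(written)
```

In a trained module the gate weights would be learned, so the model decides per slot whether incoming information is worth storing.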

Planning

memory loop: extract→retrieve→inject→consolidate

Frameworks

cross-attention retrieval
gated injection layer
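The two components above can be sketched as a single read path: a hidden state attends over the memory slots, and the retrieved vector is added back through a gate. This is a simplified illustration, assuming single-head scaled dot-product attention with the slots serving as both keys and values and a scalar injection gate; none of these details are confirmed by the summary.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, memory):
    """Scaled dot-product attention of one query vector over memory slots."""
    dim = len(query)
    scores = [
        sum(q * m for q, m in zip(query, slot)) / math.sqrt(dim)
        for slot in memory
    ]
    weights = softmax(scores)
    read = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(dim)
    ]
    return read, weights


def inject(hidden, read, gate_logit):
    """Add the retrieved vector to the hidden state, scaled by a sigmoid gate."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [h + gate * r for h, r in zip(hidden, read)]


memory = [[1.0, 0.0], [0.0, 1.0]]
hidden = [10.0, 0.0]                      # query strongly aligned with slot 0
read, weights = retrieve(hidden, memory)
print(weights)                            # most of the weight falls on slot 0
print(inject(hidden, read, gate_logit=0.0))
```

Because the query attends over a fixed number of slots rather than over every context token, the read cost depends on the slot count and not on the length of the original context.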

Is Agentic

true

Architectures

frozen LLM backbone
trainable latent memory bank

Optimization Features

Token Efficiency

reduces need to include all context tokens in the LLM context window

Model Optimization

adds <3% extra parameters (memory-only trainable)
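A rough parameter budget shows how a memory add-on can stay under the 3% mark. The layer layout below (slot bank, linear encoder and decoder with biases, and a gate over concatenated slot and candidate vectors) and the 768-dimensional memory are assumptions made for illustration, not the paper's reported configuration; 124M is the GPT-2 size quoted in this summary and 768 is GPT-2's hidden size.

```python
def memory_overhead(backbone_params, slots, mem_dim, hidden_dim):
    """Estimate added parameters for a hypothetical memory module layout."""
    bank = slots * mem_dim                       # one vector per slot
    encoder = hidden_dim * mem_dim + mem_dim     # hidden -> memory, with bias
    decoder = mem_dim * hidden_dim + hidden_dim  # memory -> hidden, with bias
    gate = 2 * mem_dim * mem_dim + mem_dim       # gate over [slot; candidate]
    added = bank + encoder + decoder + gate
    return added, 100.0 * added / backbone_params


added, percent = memory_overhead(
    backbone_params=124_000_000,  # GPT-2 (124M), as quoted above
    slots=1024,
    mem_dim=768,                  # assumed memory width
    hidden_dim=768,               # GPT-2 hidden size
)
print(f"{added:,} added parameters ({percent:.2f}% of backbone)")
```

Under these assumptions the add-on comes to roughly 3.1M parameters, or about 2.5% of the backbone; a different layer layout would change the figure.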

System Optimization

decouples language processing and state storage for cheaper updates

Training Optimization

composite loss: task cross-entropy + sparsity (L1) + entropy regularizer
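The summary names the three loss terms but not their exact form or weights. The following is a minimal sketch, assuming the sparsity term is the mean absolute gate value and the entropy term is the Shannon entropy of the slot-usage distribution, subtracted so that spreading usage across slots is rewarded. The sign convention and the weights `lam_sparse` and `lam_entropy` are assumptions.

```python
import math


def cross_entropy(probs, target):
    """Task loss: negative log-probability of the correct class."""
    return -math.log(probs[target])


def l1_sparsity(gates):
    """Mean absolute gate activation; small when few slots are written."""
    return sum(abs(g) for g in gates) / len(gates)


def usage_entropy(weights):
    """Shannon entropy of slot-usage weights; large when usage is diverse."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def composite_loss(probs, target, gates, weights,
                   lam_sparse=0.01, lam_entropy=0.01):
    task = cross_entropy(probs, target)
    sparse = l1_sparsity(gates)
    entropy = usage_entropy(weights)
    # Penalize dense writes, reward diverse slot usage (assumed sign).
    return task + lam_sparse * sparse - lam_entropy * entropy


loss = composite_loss(
    probs=[0.7, 0.2, 0.1], target=0,
    gates=[0.9, 0.05, 0.0, 0.05],
    weights=[0.25, 0.25, 0.25, 0.25],
)
print(loss)
```

Mis-tuned weights are one of the failure modes noted for this approach, so in practice both coefficients would need a small sweep on validation data.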

Inference Optimization

memory retrieval via cross-attention limits full-context attention costs

Reproducibility

Data Urls

HotpotQA (public)
ZsRE (public)

Data Available

Open Source Status

unknown

Risks & Boundaries

Limitations

Experiments limited to HotpotQA and ZsRE; generality to other tasks untested.
Adds compute and implementation complexity at training and inference.
No public code release noted in paper; reproduction details limited.
Memory benefits saturate at large slot counts: going from 1024 to 2048 slots yields diminishing returns.

When Not To Use

When you only need short-context inference and cannot afford extra latency.
When external retrieval of large corpora (RAG) is the required capability.
When you require fully open-source, production-ready implementations not provided here.

Failure Modes

Gate mislearning could cause useful facts to be overwritten or never stored.
Memory saturation or slot noise can dilute stored facts if sparsity/entropy weights are mis-tuned.
Performance gains may not transfer beyond the tested tasks or domains.

Core Entities

Models

GPT-2 (124M)
Llama 3.1 (8B)
G-MemLLM module

Metrics

Answer EM
Answer F1
Supporting Fact EM
Supporting Fact F1
Joint EM
Joint F1
Accuracy

Datasets

HotpotQA
ZsRE

Benchmarks

HotpotQA
ZsRE

Context Entities

Models

Recurrent Memory Transformer (RMT)
MemoryLLM
M+

Metrics

Context rot (qualitative)

Datasets

LOCCO (cited for context rot)

Benchmarks

LOCCO (related literature)
