Hierarchical Cognitive Caching lets an agent sustain multi-day ML experiments and improve results

January 15, 2026 · 7 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

0

Authors

Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, Yanfeng Wang

Why It Matters For Business

Hierarchical caching cuts token costs and raises success rates for long-running ML automation, reducing expensive manual cycles and accelerating model development.

Summary TLDR

ML-Master 2.0 introduces Hierarchical Cognitive Caching (HCC): a three-tier cache (L1 evolving experience, L2 refined knowledge, L3 prior wisdom) plus policies to promote, retrieve, and prefetch context. HCC compresses transient traces into stable summaries so an agent can run multi-day ML engineering loops without context overload. On OpenAI's MLE-Bench (75 Kaggle-like tasks, 24-hour budgets) it reaches a 56.44% average medal rate (75.8% on low-, 50.9% on medium-, and 42.2% on high-complexity tasks) and reduces peak context from more than 200k to about 70k tokens. Ablations indicate that each cache layer contributes to the result.

Problem Statement

LLM agents hit a bottleneck on ultra-long-horizon experiments because execution logs and trial-and-error explode context size and break strategic coherence. The paper reframes context management as "cognitive accumulation": distill raw traces into phase-level knowledge and task-level wisdom so agents can plan and transfer over hours to days.

Main Contribution

A conceptual framework called cognitive accumulation that views long-horizon autonomy as evolving experience → validated knowledge → transferable wisdom.

Hierarchical Cognitive Caching (HCC): a three-tier cache (L1 evolving experience, L2 refined knowledge, L3 prior wisdom) plus policies for prefetching, hits, and promotion using LLM-based summarization and embeddings.

Empirical validation on MLE-Bench (24h budgets): state-of-the-art medal rates and ablations showing each cache tier contributes to long-horizon performance.
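The three tiers and the promotion step described above can be pictured with a small data structure. This is a minimal sketch, not the paper's implementation: the class name `HierarchicalCache`, its field names, and the `naive_summarize` stand-in are all assumptions made for illustration, and the summarizer that the paper delegates to an LLM is replaced here by a trivial function.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalCache:
    """Illustrative three-tier store; all names here are assumptions."""
    l1_experience: list[str] = field(default_factory=list)   # raw traces of the current phase
    l2_knowledge: list[str] = field(default_factory=list)    # one summary per finished phase
    l3_wisdom: dict[str, str] = field(default_factory=dict)  # task id -> distilled strategy

    def record(self, trace: str) -> None:
        """Append a raw execution trace to L1."""
        self.l1_experience.append(trace)

    def promote_phase(self, summarize) -> str:
        """Compress L1 into one L2 summary, then clear L1.

        `summarize` is any callable mapping list[str] -> str. An LLM would
        play this role in practice; here it is a placeholder.
        """
        summary = summarize(self.l1_experience)
        self.l2_knowledge.append(summary)
        self.l1_experience.clear()
        return summary

    def promote_task(self, task_id: str, summarize) -> str:
        """Distill all L2 summaries into one L3 entry for later tasks."""
        wisdom = summarize(self.l2_knowledge)
        self.l3_wisdom[task_id] = wisdom
        return wisdom


def naive_summarize(items: list[str]) -> str:
    """Stand-in summarizer: report the count and keep the last item."""
    return f"{len(items)} items; last: {items[-1]}" if items else "0 items"


cache = HierarchicalCache()
cache.record("baseline xgboost, cv=0.81")
cache.record("added target encoding, cv=0.84")
cache.promote_phase(naive_summarize)
```

After `promote_phase`, L1 is empty and L2 holds a single compact entry, which is the mechanism that keeps the active context small as phases accumulate.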

Key Findings

ML-Master 2.0 achieves a 56.44% average medal rate on MLE-Bench under a 24-hour budget.

Numbers: 56.44% avg medal rate (MLE-Bench, 24h)

Performance roughly doubled vs the original ML-Master baseline (29.3% → 56.4%).

Numbers: 29.3% → 56.4% avg (≈92.7% relative)

HCC reduces peak context tokens from more than 200k to about 70k while retaining useful signals.

Numbers: >200k → ~70k tokens (peak)

Ablations show each cache layer matters: removing any one of L1, L2, or L3 degrades key metrics.

Numbers: Removing L1 gives a 54.5% valid submission rate and a 22.7% any-medal rate, versus a 72.7% any-medal rate for the full system (MLE-Bench-Lite)

Results

Avg medal rate

Value: 56.44%

Baseline: 29.3% (ML-Master)

Low complexity medal rate

Value: 75.8%

Baseline: 48.5% (ML-Master variant Deepseek-R1)

Medium complexity medal rate

Value: 50.9%

Baseline: 20.2% (ML-Master variant Deepseek-R1)

High complexity medal rate

Value: 42.2%

Baseline: 24.4% (ML-Master variant Deepseek-R1)

Peak context token length (example task)

Value: ~70k tokens (with HCC)

Baseline: >200k tokens (no HCC)

Who Should Care

What To Try In 7 Days

Add a three-tier context store: working traces, phase summaries, and cross-task wisdom.

Use an embedding-based prefetch step to warm-start new tasks from similar past tasks.

Implement phase-end summarization prompts to compress logs into compact strategy notes.
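For the phase-end summarization step, one possible starting point is a prompt builder that keeps only the most recent traces that fit a size budget. The function name `build_phase_summary_prompt`, the character budget, and the wording of the prompt are all assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
def build_phase_summary_prompt(phase_name: str, traces: list[str], max_chars: int = 4000) -> str:
    """Assemble a summarization prompt from the newest traces that fit max_chars."""
    kept: list[str] = []
    used = 0
    # Walk backwards so the newest traces survive when the budget is tight.
    for trace in reversed(traces):
        if used + len(trace) > max_chars:
            break
        kept.append(trace)
        used += len(trace)
    kept.reverse()  # restore chronological order
    log_block = "\n".join(f"- {t}" for t in kept)
    return (
        f"You are summarizing the phase '{phase_name}' of an ML experiment.\n"
        "From the log below, write compact strategy notes covering:\n"
        "1. What was tried\n"
        "2. What worked, with metrics\n"
        "3. What failed and why\n"
        "4. Recommended next step\n\n"
        f"Log:\n{log_block}\n"
    )


prompt = build_phase_summary_prompt(
    "feature engineering",
    ["baseline cv=0.81", "target encoding cv=0.84", "leaky feature removed cv=0.83"],
)
```

The returned string can be sent to whichever LLM the agent already uses; the reply becomes the compact strategy note for that phase.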

Agent Features

Memory

  • L1 evolving experience (raw traces)
  • L2 refined knowledge (phase summaries)
  • L3 prior wisdom (task-level distilled strategies)

Planning

  • hierarchical research planning (m directions × q suggestions)
  • LoRA
  • phase-boundary consolidation
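The "m directions × q suggestions" planning item above can be read as a two-level plan that is flattened into a work list. The sketch below illustrates only that shape under stated assumptions: the names `Suggestion` and `expand_plan`, and the example directions, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    direction: str   # high-level research direction
    text: str        # one concrete suggestion under that direction


def expand_plan(directions: dict[str, list[str]], m: int, q: int) -> list[Suggestion]:
    """Keep the first m directions and the first q suggestions of each."""
    plan: list[Suggestion] = []
    for direction in list(directions)[:m]:
        for text in directions[direction][:q]:
            plan.append(Suggestion(direction, text))
    return plan


# Made-up proposals, as a planner might return them.
proposed = {
    "feature engineering": ["target encoding", "date parts", "text length"],
    "model choice": ["gradient boosting", "linear baseline"],
    "validation": ["grouped k-fold"],
}
plan = expand_plan(proposed, m=2, q=2)  # at most 2 x 2 = 4 work items
```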

Tool Use

  • LLMs for code generation and summarization
  • semantic embeddings for retrieval (cosine similarity)
  • GPUs for model training and evaluation
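Embedding retrieval by cosine similarity, as listed above, can be sketched in plain Python. The embeddings here are tiny made-up vectors and the `threshold` and `top_k` defaults are arbitrary assumptions; a real system would obtain vectors from an embedding model and tune the threshold on its own data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prefetch(query: list[float], store: dict[str, list[float]],
             threshold: float = 0.8, top_k: int = 3) -> list[str]:
    """Return ids of stored tasks whose similarity to the query meets the threshold."""
    scored = [(cosine_similarity(query, emb), task_id) for task_id, emb in store.items()]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(reverse=True)  # highest similarity first
    return [task_id for _, task_id in hits[:top_k]]


# Toy 2-d "embeddings" of past tasks (made-up values).
store = {
    "titanic": [1.0, 0.0],
    "house-prices": [0.9, 0.1],
    "digit-recognizer": [0.0, 1.0],
}
similar = prefetch([1.0, 0.0], store)  # -> ["titanic", "house-prices"]
```

Raising `threshold` trades recall for precision, which is the knob that guards against prefetching unrelated prior tasks.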

Frameworks

  • HCC (Hierarchical Cognitive Caching)
  • context promotion and prefetch operators

Is Agentic

true

Architectures

  • Hierarchical Cognitive Caching (L1-L2-L3)
  • phase-based hierarchical research planning
  • LLM-driven context promotion pipeline

Collaboration

  • not emphasized

Optimization Features

Token Efficiency

  • phase-level summarization to limit active context
  • fallback to compact L2 summaries when raw traces are absent
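The two token-efficiency items above can be combined in a small context-assembly routine. This is a sketch under assumptions: the phase dictionaries, the whitespace-based token count, and the extra rule of falling back to the summary when raw traces exceed the remaining budget are illustrative choices, not details taken from the paper.

```python
def assemble_context(phases: list[dict], token_budget: int) -> list[str]:
    """Pick one text block per phase, newest first, until the budget is spent.

    Each phase dict has a 'summary' (L2) and optionally 'traces' (L1).
    Raw traces are used when present; otherwise the summary is the fallback.
    """
    def n_tokens(text: str) -> int:
        return len(text.split())  # crude whitespace proxy, not a real tokenizer

    chosen: list[str] = []
    remaining = token_budget
    for phase in reversed(phases):
        traces = phase.get("traces")
        block = "\n".join(traces) if traces else phase["summary"]
        # If the raw traces do not fit, try the compact summary instead.
        if n_tokens(block) > remaining:
            block = phase["summary"]
        if n_tokens(block) > remaining:
            break  # nothing from this or older phases fits
        chosen.append(block)
        remaining -= n_tokens(block)
    chosen.reverse()  # return blocks in chronological order
    return chosen


phases = [
    {"summary": "phase one summary"},
    {"summary": "phase two summary", "traces": ["run a ok", "run b failed"]},
]
context = assemble_context(phases, token_budget=10)
```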

Infra Optimization

  • parallel evaluation of multiple implementation directions
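Parallel evaluation of several implementation directions can be sketched with the standard library. The scorer below is a hypothetical placeholder that returns a dummy number; a real one would train and validate a model. Using threads is also an assumption, since GPU-bound work would more likely be dispatched to separate processes or a job scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_direction(direction: str) -> float:
    """Hypothetical scorer returning a dummy score derived from the name."""
    return len(direction) / 100.0


def evaluate_in_parallel(directions: list[str], scorer=evaluate_direction,
                         max_workers: int = 4) -> dict[str, float]:
    """Score every direction concurrently and return {direction: score}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(scorer, directions))  # map preserves input order
    return dict(zip(directions, scores))


def best_direction(results: dict[str, float]) -> str:
    """Return the direction with the highest score."""
    return max(results, key=results.get)


results = evaluate_in_parallel(["gbdt", "fine-tune vit"])
winner = best_direction(results)
```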

System Optimization

  • prefetching relevant prior wisdom to reduce cold-start costs

Reproducibility

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Requires substantial compute and GPU resources for 24-hour-per-task evaluations.
  • Relies on LLM summarization and embedding quality; hallucinations or poor summaries can harm promotions.
  • No public code release, exact prompts, or weights are provided in the paper, so results cannot be fully reproduced.

When Not To Use

  • When compute budget or token budget is very tight.
  • When experiments require physical lab steps or non-simulatable validation.
  • When strict reproducibility is mandatory and closed-source models are disallowed.

Failure Modes

  • Context promotion may omit low-level debugging details needed later.
  • A poorly tuned embedding-retrieval threshold could prefetch irrelevant wisdom and bias exploration.
  • LLM summarization may hallucinate conclusions that mislead later planning.

Core Entities

Models

  • Deepseek-V3.2-Speciale
  • Deepseek-V3.2
  • Deepseek-R1
  • gpt-4o-24-08
  • gpt-5
  • Gemini-2.5-Pro

Metrics

  • avg medal rate
  • medal rate (low/medium/high)
  • valid submission rate
  • above-median rate
  • silver+ rate
  • gold rate
  • peak context token length
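Medal rates per complexity tier, as in the metrics listed above, reduce to simple counting. The sketch below assumes each run is recorded as a `(complexity, medalled)` pair and uses made-up toy data; it does not reproduce the benchmark's official grading code, and the function name `medal_rates` is hypothetical.

```python
from collections import defaultdict


def medal_rates(results: list[tuple[str, bool]]) -> tuple[dict[str, float], float]:
    """Return (per-tier medal rate in %, overall medal rate in %)."""
    totals: dict[str, int] = defaultdict(int)
    medals: dict[str, int] = defaultdict(int)
    for complexity, medalled in results:
        totals[complexity] += 1
        medals[complexity] += int(medalled)
    per_tier = {tier: 100.0 * medals[tier] / totals[tier] for tier in totals}
    n_runs = sum(totals.values())
    overall = 100.0 * sum(medals.values()) / n_runs if n_runs else 0.0
    return per_tier, overall


# Toy results (made-up): two low-complexity runs, one high-complexity run.
per_tier, overall = medal_rates([("low", True), ("low", False), ("high", True)])
```

Note that the overall rate is computed over all runs, so it is weighted by how many tasks fall in each tier and is not the plain mean of the three tier rates.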

Datasets

  • MLE-Bench (75 Kaggle tasks)
  • MLE-Bench-Lite
  • 407 external Kaggle competitions (warmup)

Benchmarks

  • MLE-Bench