Train agents to internalize human hints so they stop relying on ever-growing prompts

February 3, 2025 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

1

Authors

Minttu Alakuijala, Ya Gao, Georgy Ananov, Samuel Kaski, Pekka Marttinen, Alexander Ilin, Harri Valpola

Links

Abstract / PDF

Why It Matters For Business

You can convert repeated human guidance into model updates that reduce prompt length, cut inference cost, and raise multi-tool task reliability with modest annotation work.

Summary TLDR

The paper introduces Memento No More (MNM), an iterative training recipe that turns human-written ‘hints’ (short corrective guidance) into model weights so that an LLM agent can solve many tool-using tasks without long prompts. The authors train a Llama-3.1-70B agent through three rounds of context distillation (LoRA adapters + forward-KL distillation) on ToolQA and OfficeBench. After three rounds the agent reaches 97.9% success on ToolQA and 90.3% on OfficeBench, matches or outperforms larger prompting-based agents (GPT-4o, DeepSeek-V3) on these benchmarks, reduces input-token usage by ~10x, and shows no drop on standard coding and reasoning tests. The method uses small sets of targeted corrective hints (36 for ToolQA, 24 for OfficeBench).

Problem Statement

Current LLM agents often handle many tasks by appending ever more hints and tool documentation to their prompts. Long prompts slow inference, cause information overload, and prevent the model from ‘internalizing’ skills. The paper asks: can a single LLM agent absorb tool usage rules and targeted corrective guidance into its weights, so that it performs many tool-using tasks without huge per-task prompts?

Main Contribution

A practical iterative coaching pipeline that converts human corrective hints into model weights via context distillation and LoRA adapters.

A working implementation on Llama-3.1-70B that internalizes tool docs and targeted fixes in three training rounds, reducing prompt dependency.

Demonstration on two multi-tool benchmarks (ToolQA and OfficeBench) with strong gains vs long-prompt baselines and comparable or better performance than larger prompting-based models.

Operational elements: automated LLM-based reviewers to find mistake states, a small set of reusable corrective hints, probabilistic hint dropout (p=0.9), and data balancing to avoid forgetting.
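The hint dropout listed among the operational elements can be pictured with a short sketch. It is an illustration only: this summary gives p=0.9 but not the exact mechanics, so the sketch assumes p is the probability that each hint is independently removed, and that dropout is applied when building the student's prompt so that some hints occasionally remain visible. The function names, the prompt layout, and the example hints are all hypothetical.

```python
import random


def apply_hint_dropout(hints, drop_prob=0.9, rng=None):
    """Return the hints that survive independent per-hint dropout.

    Assumption: drop_prob is the chance that each hint is removed.
    """
    rng = rng or random.Random()
    return [h for h in hints if rng.random() >= drop_prob]


def build_prompt(task, hints):
    """Hypothetical prompt layout: surviving hints first, then the task."""
    lines = [f"Hint: {h}" for h in hints]
    lines.append(f"Task: {task}")
    return "\n".join(lines)


# Made-up hints and task, for illustration only.
example_hints = [
    "Check the date format before querying.",
    "Call complete_task exactly once at the end.",
]
kept = apply_hint_dropout(example_hints, drop_prob=0.9, rng=random.Random(0))
print(build_prompt("Find the average delay for a given flight.", kept))
```

With a drop probability of 0.9, most training examples contain no hints at all, while a minority still carry one or more, which is one plausible way to keep the model attending to whatever guidance does appear in the prompt.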

Key Findings

After three rounds MNM achieves 97.9% success on ToolQA.

Numbers: 97.9% success (Table 2, Round 3)

MNM reaches 90.3% success on OfficeBench after three rounds, slightly above GPT-4o on multi-app tasks.

Numbers: 90.3% vs GPT-4o 89.9% (Table 4, Round 3)

MNM reduces input prompt token usage dramatically while improving speed.

Numbers: ToolQA input tokens: MNM 5,564 vs GPT-4o 77,736 (Table 9); inference 3–4x faster (Appendix D)

Only a small set of corrective hints was required for strong gains.

Numbers: 36 corrective hints for ToolQA, 24 for OfficeBench (Section 6)

No measurable degradation on standard coding and math benchmarks after training.

Numbers: HumanEval and GSM8K scores unchanged (Table 6)

Results

ToolQA success rate (MNM, after Round 3)

Value: 97.9%

Baseline: Combined-prompt Llama: 61.0%; task-specific hint Llama: 96.9%; GPT-4o: 92.8%; DeepSeek-V3: 87.5%

OfficeBench success rate (MNM, after Round 3)

Value: 90.3%

Baseline: Combined-prompt Llama: 14.6%; GPT-4o: 89.9%; DeepSeek-V3: 86.9%

Input tokens per test task (ToolQA)

Value: 5,564 tokens (MNM Round 3)

Baseline: GPT-4o: 77,736; Llama-3.1-70B: 74,950
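As a quick check on the ToolQA token figures, the reduction factors can be computed directly. The script below uses only the three counts quoted in this section; the variable and function names are illustrative.

```python
mnm_tokens = 5_564
baselines = {"GPT-4o": 77_736, "Llama-3.1-70B": 74_950}


def reduction(baseline_tokens, trained_tokens):
    """Return (how many times fewer tokens, share of baseline tokens)."""
    factor = baseline_tokens / trained_tokens
    share = trained_tokens / baseline_tokens
    return factor, share


for name, tokens in baselines.items():
    factor, share = reduction(tokens, mnm_tokens)
    print(f"{name}: {factor:.1f}x fewer input tokens ({share:.1%} of baseline)")
```

On these ToolQA numbers the trained agent uses roughly 13 to 14 times fewer input tokens, or a little over 7% of the baseline input.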

Inference speed & cost

Value: 3–4x faster inference; large token cost reduction

Baseline: Untrained Llama agent and prompting baselines

Human annotation effort (corrective hints)

Value: 36 hints (ToolQA) and 24 hints (OfficeBench)

Backbone benchmark stability

Value: No degradation on HumanEval / GSM8K

Baseline: Llama-3.1-70B original

Who Should Care

What To Try In 7 Days

Run teacher-student distillation: collect a few teacher trajectories generated with task-specific hints in the prompt, then distill them into a LoRA adapter trained without those hints.

Automate error localization: implement simple filters or an LLM reviewer to find recurring failure states and write short corrective hints.

Measure token and latency gains: compare input-token counts and latency before/after adapter merge to evaluate cost savings.
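For the measurement step, a small harness along the following lines may be enough to get started. It is a sketch under assumptions: `benchmark`, `whitespace_tokens`, and `fake_agent` are hypothetical stand-ins so the example runs without a model, and in practice the real tokenizer and agent call would replace them.

```python
import statistics
import time


def benchmark(run_task, prompts, count_tokens):
    """Return mean input tokens and mean latency (seconds) over prompts."""
    token_counts, latencies = [], []
    for prompt in prompts:
        token_counts.append(count_tokens(prompt))
        start = time.perf_counter()
        run_task(prompt)
        latencies.append(time.perf_counter() - start)
    return statistics.mean(token_counts), statistics.mean(latencies)


# Stand-ins so the sketch runs on its own; swap in the real tokenizer and agent.
def whitespace_tokens(prompt):
    return len(prompt.split())


def fake_agent(prompt):
    return prompt.upper()


long_prompts = ["tool docs " * 50 + "task: add two numbers"]
short_prompts = ["task: add two numbers"]
before = benchmark(fake_agent, long_prompts, whitespace_tokens)
after = benchmark(fake_agent, short_prompts, whitespace_tokens)
print(f"mean input tokens: {before[0]:.0f} -> {after[0]:.0f}")
```

Running the same task set through the long-prompt agent and the adapter-merged agent with an identical harness keeps the comparison fair.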

Agent Features

Memory

  • Internalize hints into weights via context distillation (weights act as memory)
  • Probabilistic hint dropout (p=0.9) to preserve prompt attention

Planning

  • Generates inner monologue for step planning
  • Uses Python code actions to execute tools

Tool Use

  • Tools invoked as Python functions
  • complete_task(report, answer) finalizer function
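To make the tool-use interface concrete, here is a minimal sketch of how Python code actions and the `complete_task(report, answer)` finalizer might fit together. Only the `complete_task` signature comes from the summary; the `make_environment` and `run_action` helpers, the `lookup_price` tool, and the use of `exec` are illustrative assumptions rather than the authors' implementation. A real deployment would need proper sandboxing.

```python
import contextlib
import io


def make_environment(tools):
    """Build a namespace exposing tools plus a complete_task finalizer."""
    result = {"done": False, "report": None, "answer": None}

    def complete_task(report, answer):
        # Finalizer: records the agent's report and answer and marks the episode done.
        result.update(done=True, report=report, answer=answer)

    namespace = dict(tools)
    namespace["complete_task"] = complete_task
    return namespace, result


def run_action(code, namespace):
    """Execute one code action and return what it printed (the observation)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOTE: unsandboxed, for illustration only
    except Exception as exc:
        buffer.write(f"Error: {exc!r}")
    return buffer.getvalue()


# Hypothetical tool, not taken from ToolQA or OfficeBench.
def lookup_price(item):
    return {"coffee": 3.5, "tea": 2.0}[item]


namespace, result = make_environment({"lookup_price": lookup_price})
observation = run_action("price = lookup_price('coffee')\nprint(price)", namespace)
run_action("complete_task(report='Looked up coffee.', answer=price)", namespace)
```

Because every action runs in the same namespace, variables set in one step (such as `price`) stay available to later steps, and errors are returned as observations instead of crashing the loop.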

Frameworks

  • Context distillation (teacher sees hints, student does not)
  • LoRA
  • LLM-based automated reviewers to localize mistakes

Is Agentic

true

Architectures

  • ReAct (reasoning trace + executable code actions)
  • LLM instruction-tuned backbone (Llama-3.1-70B)

Optimization Features

Token Efficiency

  • Input tokens reduced to ~7–10% of prompting baselines (Table 9)

Infra Optimization

  • Rank and adapter size chosen to fit training on a 4-node GPU setup (MI250X)

Model Optimization

  • LoRA
  • Adapters merged to base weights after training
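For readers unfamiliar with adapter merging, the standard LoRA formulation (general background, not specific to this paper) adds a scaled low-rank product to each adapted weight matrix. After merging, the model has the same shape and inference cost as the base model. The rank r and scaling α are hyperparameters; the values used in the paper are not given in this summary.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```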

Training Optimization

  • Forward-KL distillation from teacher to student
  • Per-hint dropout p=0.9 to prevent prompt attention collapse
  • Data balancing to preserve underrepresented tasks
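The forward-KL objective listed above can be illustrated on a single next-token distribution. In this sketch the teacher logits stand for the model conditioned on hints and the student logits for the same model without them. Forward KL is taken to mean KL(teacher || student), the usual reading, although the summary does not spell out the formula. The logits are made-up numbers and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) = sum_i p_i * (log p_i - log q_i)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]  # made-up logits: model that sees the hints
student = [0.1, 0.2, 0.0]   # made-up logits: model without the hints
print(f"forward KL: {forward_kl(teacher, student):.4f}")
```

In training, a loss of this form would be averaged over the tokens of each teacher trajectory and minimized with respect to the student's LoRA parameters only, pushing the hint-free student toward the hinted teacher's behavior.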

Inference Optimization

  • Smaller prompts after internalization → fewer input tokens
  • Faster runtime observed (3–4×) on test tasks

Reproducibility

Data Urls

  • ToolQA (Zhuang et al., 2023)
  • OfficeBench (Wang et al., 2024b)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Human supervision required: the pipeline depends on writing and reusing targeted hints, which may scale with task variety.
  • Experiments limited to two related benchmarks; cross-domain continual training (transfer between domains) was not evaluated.
  • Human reviewers in experiments were the authors; real-world annotation quality may vary.

When Not To Use

  • When no fine-tuning of the deployed model is possible (closed API only)
  • When you lack any labeled or checkable training tasks to detect failures
  • When tasks require broad open-web access without safe sandboxing

Failure Modes

  • Overfitting to hint heuristics if dropout or balancing are misconfigured
  • Missing rare failure modes if reviewers or filters do not catch them
  • Hint conflicts accumulating across rounds if hints are inconsistent

Core Entities

Models

  • Llama-3.1-70B-Instruct (used as base)
  • GPT-4o (prompting baseline)
  • DeepSeek-V3 (baseline)

Metrics

  • Success rate (task solved)
  • Input tokens (per task)
  • Inference speed (relative)
  • Human annotation count

Datasets

  • ToolQA (Zhuang et al., 2023)
  • OfficeBench (Wang et al., 2024b)

Benchmarks

  • HumanEval
  • GSM8K