Train agents to internalize human hints so they stop relying on ever-growing prompts

Overview

Decision Snapshot: Ready For Pilot

The method is practical: it trains adapters with a small number of corrective hints, shows large gains on multi-tool benchmarks, and reduces token and latency costs; risk comes from human effort scaling and deployment guardrails.

Citations: 1

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 5/6

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 60%

Authors

Minttu Alakuijala, Ya Gao, Georgy Ananov, Samuel Kaski, Pekka Marttinen, Alexander Ilin, Harri Valpola

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can convert repeated human guidance into model updates that reduce prompt length, cut inference cost, and raise multi-tool task reliability with modest annotation work.

Who Should Care

Product Manager ML Engineer Engineering Lead Founder

Summary TLDR

The paper introduces Memento No More (MNM), an iterative training recipe that turns human-written ‘hints’ (short corrective guidance) into model weights so an LLM agent can solve many tool-using tasks without long prompts. They run a Llama-3.1-70B agent through three rounds of context-distillation training (LoRA adapters + forward-KL distillation) on ToolQA and OfficeBench. After three rounds the agent reaches 97.9% success on ToolQA and 90.3% on OfficeBench, outperforms or matches larger prompting-based agents (GPT-4o, DeepSeek-V3) on these benchmarks, reduces input-token usage by ~10x, and shows no drop on standard coding/reasoning tests. The method uses small sets of targeted corrective “

Problem Statement

Current LLM agents often keep appending hints and tool docs into prompts to handle many tasks. Long prompts slow inference, cause information overload, and don’t let the model ‘internalize’ skills. The paper asks: can we teach a single LLM agent to absorb tool usage rules and targeted corrective guidance into its weights so it performs many tool-using tasks without huge per-task prompts?

Main Contribution

A practical iterative coaching pipeline that converts human corrective hints into model weights via context distillation and LoRA adapters.

A working implementation on Llama-3.1-70B that internalizes tool docs and targeted fixes in three training rounds, reducing prompt dependency.

Key Findings

After three rounds MNM achieves 97.9% success on ToolQA.

Numbers: 97.9% success (Table 2, Round 3)

Practical Use: You can train a single LLM to handle diverse retrieval-and-tool tasks with minimal prompt hints by running a few targeted distillation rounds.

Evidence Ref: Table 2 (ToolQA results)

MNM reaches 90.3% success on OfficeBench after three rounds, slightly above GPT-4o on multi-app tasks.

Numbers: 90.3% vs GPT-4o 89.9% (Table 4, Round 3)

Practical Use: For office-style workflows, coaching via hints can match or beat larger prompting-first systems while keeping the model localizable and trainable.

Evidence Ref: Table 4 (OfficeBench results)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
ToolQA success rate (MNM, after Round 3)	97.9%	Combined-prompt Llama: 61.0%; task-specific hint Llama: 96.9%; GPT-4o: 92.8%; DeepSeek-V3: 87.5%	↑36.9 pp vs combined-prompt Llama; ↑5.1 pp vs GPT-4o	ToolQA test tasks	Table 2 (Round 3)	Table 2
OfficeBench success rate (MNM, after Round 3)	90.3%	Combined-prompt Llama: 14.6%; GPT-4o: 89.9%; DeepSeek-V3: 86.9%	↑75.7 pp vs combined-prompt Llama; ≈+0.4 pp vs GPT-4o	OfficeBench test tasks	Table 4 (Round 3)	Table 4

What To Try In 7 Days

Run teacher-student distillation: collect a few teacher trajectories with task-specific hints and distill them into a LoRA adapter.

Automate error localization: implement simple filters or an LLM reviewer to find recurring failure states and write short corrective hints.

Measure token and latency gains: compare input-token counts and latency before/after adapter merge to evaluate cost savings.
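The measurement step above can be sketched as a small comparison helper. This is only an illustration: the function name `summarize_savings`, the log format, and the sample numbers are hypothetical and do not come from the paper.

```python
from statistics import mean


def summarize_savings(baseline_runs, adapted_runs):
    """Compare mean input tokens and latency between two sets of runs.

    Each run is a dict with 'input_tokens' and 'latency_s' keys
    (a hypothetical log format, not one defined by the paper).
    """
    base_tokens = mean(r["input_tokens"] for r in baseline_runs)
    new_tokens = mean(r["input_tokens"] for r in adapted_runs)
    base_latency = mean(r["latency_s"] for r in baseline_runs)
    new_latency = mean(r["latency_s"] for r in adapted_runs)
    return {
        # Fraction of the baseline input tokens still needed after training.
        "token_ratio": new_tokens / base_tokens,
        # How many times faster the adapted agent is on average.
        "speedup": base_latency / new_latency,
    }


# Made-up measurements, for illustration only.
baseline = [
    {"input_tokens": 12000, "latency_s": 8.0},
    {"input_tokens": 8000, "latency_s": 6.0},
]
adapted = [
    {"input_tokens": 1100, "latency_s": 2.0},
    {"input_tokens": 900, "latency_s": 1.5},
]
savings = summarize_savings(baseline, adapted)
print(savings)
```

Run both agents on the same task set so the two averages are comparable; a per-task breakdown may be more informative if task lengths vary widely.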

Agent Features

Memory

Internalize hints into weights via context distillation (weights act as memory)

Probabilistic hint dropout (p=0.9) to preserve prompt attention
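A minimal sketch of probabilistic hint dropout follows. It assumes that p=0.9 is the probability of dropping each hint independently; this summary does not say whether 0.9 is the drop or the keep probability, nor which prompt the dropout is applied to. The function name `apply_hint_dropout` is hypothetical.

```python
import random


def apply_hint_dropout(hints, drop_prob=0.9, rng=None):
    """Return the hints that survive independent per-hint dropout.

    Assumption: drop_prob is the chance that each hint is removed.
    """
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so a hint is kept with probability 1 - drop_prob.
    return [h for h in hints if rng.random() >= drop_prob]


hints = [f"hint {i}" for i in range(10000)]
kept = apply_hint_dropout(hints, drop_prob=0.9, rng=random.Random(0))
print(len(kept) / len(hints))  # fraction of hints kept in this sample
```

Passing a seeded `random.Random` keeps the sampling reproducible across training runs.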

Planning

Generates inner monologue for step planning

Uses Python code actions to execute tools

Tool Use

Tools invoked as Python functions

complete_task(report, answer) finalizer function

Frameworks

Context distillation (teacher sees hints, student does not)

LoRA

LLM-based automated reviewers to localize mistakes
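The teacher/student split in context distillation can be sketched as two prompts built for the same task. The prompt layout and the function names `build_prompt` and `make_distillation_pair` are assumptions for illustration. The sketch also assumes the student sees only the task, while the teacher additionally sees tool documentation and hints.

```python
def build_prompt(task, tool_docs="", hints=()):
    """Assemble a prompt from optional tool docs, optional hints, and the task."""
    parts = []
    if tool_docs:
        parts.append("Tool documentation:\n" + tool_docs)
    if hints:
        parts.append("Hints:\n" + "\n".join("- " + h for h in hints))
    parts.append("Task:\n" + task)
    return "\n\n".join(parts)


def make_distillation_pair(task, tool_docs, hints):
    """Return (teacher_prompt, student_prompt) for one training task."""
    teacher_prompt = build_prompt(task, tool_docs, hints)  # full context
    student_prompt = build_prompt(task)  # context must come from the weights
    return teacher_prompt, student_prompt


teacher, student = make_distillation_pair(
    task="How many meetings are scheduled on 2024-05-01?",
    tool_docs="search_calendar(date) -> list of events",
    hints=["Always check the date format before querying."],
)
print(student)
```

The student is then trained to reproduce the teacher's output distribution on the teacher's trajectories, so the guidance ends up in the adapter weights rather than in the prompt.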

Is Agentic

Yes

Architectures

ReAct (reasoning trace + executable code actions)

LLM instruction-tuned backbone (Llama-3.1-70B)

Optimization Features

Token Efficiency

Input tokens reduced to ~7–10% of prompting baselines (Table 9)

Infra Optimization

Rank and adapter size chosen to fit training on a 4-node GPU setup (MI250X)

Model Optimization

LoRA

Adapters merged to base weights after training

Training Optimization

Forward-KL distillation from teacher to student

Per-hint dropout p=0.9 to prevent prompt attention collapse

Data balancing to preserve underrepresented tasks
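As a rough sketch of the forward-KL objective, the snippet below computes KL(teacher || student) per token position from raw logits and averages it over a sequence. It uses plain Python for clarity; a real training loop would use a tensor library and batch over the vocabulary. The function names and the simple mean over positions are assumptions, not the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    # The expectation is taken under the teacher distribution p.
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def sequence_loss(teacher_seq, student_seq):
    """Mean forward KL over all token positions of one trajectory."""
    per_token = [forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token)


# Toy logits over a 3-token vocabulary at two positions.
teacher_seq = [[2.0, 0.5, 0.1], [0.0, 1.0, 3.0]]
student_seq = [[1.5, 0.7, 0.2], [0.2, 0.9, 2.5]]
print(sequence_loss(teacher_seq, student_seq))
```

Because the expectation is under the teacher, the student is penalized wherever it assigns low probability to tokens the teacher considers likely.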

Inference Optimization

Smaller prompts after internalization → fewer input tokens

Faster runtime observed (3–4×) on test tasks

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/minttusofia/memento-no-more

Data URLs

ToolQA (Zhuang et al., 2023)

OfficeBench (Wang et al., 2024b)

Risks & Boundaries

Limitations

Human supervision required: the pipeline depends on writing and reusing targeted hints, which may scale with task variety.

Experiments limited to two related benchmarks; cross-domain continual training (transfer between domains) was not evaluated.

When Not To Use

When no fine-tuning of the deployed model is possible (closed API only)

When you lack any labeled or checkable training tasks to detect failures

Failure Modes

Overfitting to hint heuristics if dropout or balancing are misconfigured

Missing rare failure modes if reviewers or filters do not catch them

Core Entities

Models

Llama-3.1-70B-Instruct (used as base)

GPT-4o (prompting baseline)

DeepSeek-V3 (baseline)

Metrics

Success rate (task solved)

Input tokens (per task)

Inference speed (relative)

Human annotation count

Datasets

ToolQA (Zhuang et al., 2023)

OfficeBench (Wang et al., 2024b)

Benchmarks

HumanEval

GSM8K
