RAISE: add a short-term scratchpad and retrieved examples, then fine-tune an LLM for better multi-turn agents

Overview

Decision Snapshot: Needs Validation

The architecture and training recipe are practical and show gains on a specific internal dataset, but evidence is limited to a single domain, small eval set, and no public code or data.

Citations: 6

Evidence Strength: 0.40

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Na Liu, Liangyu Chen, Xiaoyu Tian, Wei Zou, Kaijiang Chen, Ming Cui

Links

Abstract / PDF

Why It Matters For Business

RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, and then fine-tuning on under 1k curated scenes.

Who Should Care

Product Manager, ML Engineer, Founder, Engineering Lead

Summary TLDR

RAISE is an agent architecture that pairs a scratchpad (short-term memory) with an examples pool (retrieval-based long-term memory) and trains LLMs on small, high-quality agent dialogues. On an in-house real-estate chat dataset (948 scenes, 848 train / 100 eval), fine-tuned Qwen-14B-Chat in the RAISE setup achieved the best overall quality (7.71/10) and the fewest action steps (0.26) among the ablated variants. The paper argues that fine-tuning plus memory gives more controllable, efficient agents than prompt-only methods. Results are limited to a single domain and an internal dataset.

Problem Statement

LLMs are strong at isolated tasks, but building conversational agents that keep context and act reliably in long, multi-turn dialogs is hard. The paper aims to add memory and targeted fine-tuning so agents stay context-aware, controllable, and efficient in domain dialogues.

Main Contribution

RAISE architecture: combines a scratchpad (short-term memory) and retrieved examples (long-term memory) to keep context and guide actions.

A pipeline to build small, high-quality agent training data: Conversation Selection, Scene Extraction, CoT (chain-of-thought) completion, and Scene Augmentation.
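
The memory design in the first contribution can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names `Scratchpad` and `ExamplesPool`, the `build_prompt` layout, and the toy bag-of-words `embed` function (a stand-in for a real embedding model) are all assumptions made for the example.

```python
import math
from collections import Counter, deque


def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Scratchpad:
    """Short-term memory: keeps only the most recent notes (hypothetical design)."""

    def __init__(self, max_items=5):
        self.items = deque(maxlen=max_items)

    def add(self, note):
        self.items.append(note)

    def render(self):
        return "\n".join(self.items)


class ExamplesPool:
    """Long-term memory: stored example dialogs retrieved by similarity."""

    def __init__(self, examples):
        self.examples = [(ex, embed(ex)) for ex in examples]

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.examples, key=lambda pair: cosine(q, pair[1]), reverse=True)
        return [ex for ex, _ in ranked[:k]]


def build_prompt(query, scratchpad, pool, k=1):
    """Assemble retrieved examples, scratchpad contents, and the new user turn."""
    examples = "\n".join(pool.retrieve(query, k))
    return f"Examples:\n{examples}\n\nScratchpad:\n{scratchpad.render()}\n\nUser: {query}"


pool = ExamplesPool([
    "User asks for the price of a two bedroom flat and the agent calls the house info tool",
    "User asks about school districts and the agent answers from listing notes",
])
pad = Scratchpad(max_items=2)
pad.add("tool result: listing A has two bedrooms")
pad.add("user budget confirmed")
pad.add("user prefers ground floor")  # the oldest note is evicted here
prompt = build_prompt("what is the price of this flat", pad, pool)
```

The bounded `deque` keeps the scratchpad short, while the pool is searched on every turn so only the most relevant stored example enters the prompt.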

Key Findings

RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.

Numbers: Overall quality 7.71 (RAISE, fine-tuned Qwen-14B-Chat) on 100 eval scenes

Practical Use: For domain chatbots, combine scratchpad+examples and fine-tune an LLM to raise response quality on real multi-turn dialogs.

Evidence Ref: Table 3 (Fine-tuning section)

Fine-tuning the same model beat prompting on measured quality and efficiency.

Numbers: Qwen prompting overall 6.68 -> fine-tuned Qwen overall 7.71; action steps 1.2 -> 0.26

Practical Use: If you can invest in ~1k labeled scenes, fine-tuning yields better and cheaper inference than relying on prompts alone.

Evidence Ref: Table 4 (Prompting vs. Fine-tuning)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall Quality Score (RAISE, fine-tuned)	7.71	ReAct (fine-tuned) 7.52	+0.19	in-house 100 eval scenes	RAISE fine-tuned overall quality 7.71 in Table 3	Table 3
Action Steps (RAISE, fine-tuned)	0.26	ReAct (fine-tuned) 0.88	-0.62	in-house 100 eval scenes	RAISE action steps 0.26 in Table 3	Table 3
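
The deltas reported above can be re-derived with a few lines of arithmetic. The sketch below uses only the scores quoted in this summary; the relative percentages it prints are derived here for convenience and are not figures reported in the paper, and the `delta` helper is a name chosen for this example.

```python
def delta(value, baseline):
    """Return (absolute change, relative change in percent) against a baseline."""
    absolute = round(value - baseline, 2)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


# (value, baseline) pairs taken from the numbers quoted in this summary.
comparisons = {
    "quality, RAISE vs ReAct (both fine-tuned)": (7.71, 7.52),
    "action steps, RAISE vs ReAct (both fine-tuned)": (0.26, 0.88),
    "quality, fine-tuned vs prompted Qwen": (7.71, 6.68),
    "action steps, fine-tuned vs prompted Qwen": (0.26, 1.2),
}

for name, (value, baseline) in comparisons.items():
    absolute, relative_pct = delta(value, baseline)
    print(f"{name}: {absolute:+.2f} ({relative_pct:+.1f}%)")
```

Read this way, the quality gain over fine-tuned ReAct is small in absolute terms, while the drop in action steps is the larger effect; with only 100 eval scenes, both should be treated as indicative.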

What To Try In 7 Days

Collect and anonymize ~200–1000 real multi-turn dialogs for your domain.

Add a scratchpad that logs recent observations and a small examples pool with vector retrieval.

Fine-tune an open-source 14B chat model (LoRA or full SFT) on the curated scenes and measure overall quality and action steps.
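
For the measurement step, a small aggregation script is usually enough to start. The sketch below assumes each eval scene has already been scored (by human raters or an LLM judge) and that "action steps" means the number of tool calls made in the scene; the record layout, field names, and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean

# Hypothetical per-scene eval records; scores and step counts are invented.
eval_records = [
    {"scene_id": 1, "quality": 8.0, "action_steps": 0},
    {"scene_id": 2, "quality": 7.0, "action_steps": 1},
    {"scene_id": 3, "quality": 6.5, "action_steps": 0},
    {"scene_id": 4, "quality": 7.5, "action_steps": 0},
]


def summarize(records):
    """Average quality score and average tool calls per scene."""
    return {
        "overall_quality": round(mean(r["quality"] for r in records), 2),
        "action_steps": round(mean(r["action_steps"] for r in records), 2),
    }


summary = summarize(eval_records)
print(summary)
```

Running the same script on the prompted baseline and on the fine-tuned model gives a like-for-like comparison on both metrics.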

Agent Features

Memory

Scratchpad (short-term scratch memory for recent interactions)

Examples pool retrieved by vector search (long-term memory)

Conversation history

Task trajectory (decision steps, tool outcomes)

Planning

Chain-of-Thought (CoT) reasoning (explain planning in steps)

Task planning with template-based prompts

Tool Use

Tool invocation with named parameters

12 domain tools (house info, market analysis, recommend listings, value report)
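
Tool invocation with named parameters can be illustrated with a small dispatcher. This is a sketch under assumptions: the JSON action format, the `invoke` function, and the two stub tools (`house_info`, `market_analysis`) with their parameters are hypothetical and only loosely modeled on the tool categories listed here.

```python
import json


def house_info(listing_id):
    """Stub tool: returns canned data for a listing (hypothetical)."""
    return {"listing_id": listing_id, "bedrooms": 2}


def market_analysis(district, months=6):
    """Stub tool: returns a canned market summary (hypothetical)."""
    return {"district": district, "months": months, "trend": "flat"}


# Registry mapping tool names to callables.
TOOLS = {"house_info": house_info, "market_analysis": market_analysis}


def invoke(action_json):
    """Parse a model-emitted action and call the tool with keyword arguments."""
    action = json.loads(action_json)
    name = action["tool"]
    if name not in TOOLS:
        # Reject tools outside the registry instead of guessing.
        return {"error": f"unknown tool: {name}"}
    try:
        return TOOLS[name](**action.get("params", {}))
    except TypeError as exc:
        # Raised when the model supplies wrong or missing parameter names.
        return {"error": str(exc)}


result = invoke('{"tool": "house_info", "params": {"listing_id": "A1"}}')
print(result)
```

Passing parameters by name makes malformed calls fail loudly, and the returned error can be written back to the scratchpad so the model can correct itself on the next step.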

Frameworks

ReAct

RAISE

Is Agentic

Yes

Architectures

RAISE

ReAct

Act-Only

ReAct+Scratchpad

ReAct+Examples

Collaboration

Possible multi-LLM/tool collaboration (described as supported)

Optimization Features

Infra Optimization

Inference performed on NVIDIA A100 80GB for Qwen-14B-Chat

Training Optimization

SFT

Use GPT-4 to generate CoT, then validate manually

Inference Optimization

Reduced action steps (fewer tool calls) after fine-tuning

Fewer plan steps lead to fewer internal loops

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Experiments use a single in-house real-estate dataset; generalization to other domains is untested.

Hallucination risks: role and knowledge hallucination are noted and partially addressed via data augmentation.

When Not To Use

When you need a general-purpose agent across many domains without domain data.

In safety-critical systems where hallucinations cannot be tolerated and external audits are required.

Failure Modes

Role hallucination: agent performs tasks outside its intended role.

Knowledge hallucination: agent fabricates facts when working memory or tools lack correct data.

Core Entities

Models

Qwen-14B-Chat

GPT-4

GPT-3.5

Metrics

Overall Quality Score

Specificity

Factuality

Coherence

Naturalness

Plan Steps

Action Steps

Inference Speed (s)
