RAISE: add a short-term scratchpad and retrieved examples, then fine-tune an LLM for better multi-turn agents

January 5, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The architecture and training recipe are practical and show gains on a specific internal dataset, but evidence is limited to a single domain, small eval set, and no public code or data.

Citations: 6

Evidence Strength: 0.40

Confidence: 0.75

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Na Liu, Liangyu Chen, Xiaoyu Tian, Wei Zou, Kaijiang Chen, Ming Cui

Why It Matters For Business

RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, and then fine-tuning on under 1k curated scenes.

Who Should Care

Summary TLDR

RAISE is an agent architecture that pairs a scratchpad (short-term memory) with an examples pool (retrieval-based long-term memory) and trains LLMs on a small set of high-quality agent dialogues. On an in-house real-estate chat dataset (948 scenes; 848 train / 100 eval), a fine-tuned Qwen-14B-Chat in the RAISE setup achieved the best overall quality (7.71/10) and fewer action steps (0.26) than the ablated variants. The paper argues that fine-tuning plus memory gives more controllable, efficient agents than prompt-only methods. Results are limited to a single domain and an internal dataset.

Problem Statement

LLMs are strong at isolated tasks, but building conversational agents that keep context and act reliably in long, multi-turn dialogs is hard. The paper aims to add memory and targeted fine-tuning so agents stay context-aware, controllable, and efficient in domain dialogues.

Main Contribution

RAISE architecture: combines a scratchpad (short-term memory) and retrieved examples (long-term memory) to keep context and guide actions.

A pipeline to build small, high-quality agent training data: Conversation Selection, Scene Extraction, CoT (chain-of-thought) completion, and Scene Augmentation.
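The four pipeline stages can be pictured as a chain of small functions. The following is a minimal sketch, not the authors' implementation: the `Scene` fields, the minimum-turn selection rule, the one-scene-per-agent-turn extraction rule, and the history-truncation augmentation are all assumptions made for illustration. CoT completion is stubbed with a placeholder string where an LLM call plus manual validation would go.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker, text); speaker is "user" or "agent"


@dataclass(frozen=True)
class Scene:
    history: Tuple[Turn, ...]  # dialogue context before the agent reply
    response: str              # the agent reply the model should learn
    cot: str = ""              # reasoning text, filled in by complete_cot


def select_conversations(convs: List[List[Turn]], min_turns: int = 4) -> List[List[Turn]]:
    # Conversation Selection (assumed rule): keep dialogues with enough turns.
    return [c for c in convs if len(c) >= min_turns]


def extract_scenes(conv: List[Turn]) -> List[Scene]:
    # Scene Extraction (assumed rule): one scene per agent turn that has context.
    return [
        Scene(tuple(conv[:i]), text)
        for i, (speaker, text) in enumerate(conv)
        if speaker == "agent" and i > 0
    ]


def complete_cot(scene: Scene) -> Scene:
    # CoT completion: placeholder for an LLM call followed by human review.
    return replace(scene, cot=f"TODO: reasoning that leads to: {scene.response}")


def augment(scene: Scene) -> List[Scene]:
    # Scene Augmentation (assumed rule): add a variant with a truncated history.
    if len(scene.history) <= 1:
        return [scene]
    return [scene, replace(scene, history=scene.history[-1:])]


def build_dataset(convs: List[List[Turn]]) -> List[Scene]:
    scenes: List[Scene] = []
    for conv in select_conversations(convs):
        for scene in extract_scenes(conv):
            scenes.extend(augment(complete_cot(scene)))
    return scenes


demo = [
    [("user", "Hi"), ("agent", "Hello, how can I help?"),
     ("user", "Price of unit 12?"), ("agent", "Let me check.")],
    [("user", "Hi"), ("agent", "Hello")],  # too short, dropped by selection
]
dataset = build_dataset(demo)
```

Keeping each stage as a pure function makes it easy to swap in a real selection heuristic or a real CoT generator without touching the rest of the chain.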

Key Findings

RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.

Numbers: Overall quality 7.71 (RAISE, fine-tuned Qwen-14B-Chat) on 100 eval scenes

Practical Use: For domain chatbots, combine scratchpad+examples and fine-tune an LLM to raise response quality on real multi-turn dialogs.

Evidence Ref: Table 3 (Fine-tuning section)

Fine-tuning the same model beat prompting on measured quality and efficiency.

Numbers: Qwen prompting overall 6.68 -> fine-tuned Qwen overall 7.71; action steps 1.2 -> 0.26

Practical Use: If you can invest in ~1k labeled scenes, fine-tuning yields better and cheaper inference than relying on prompts alone.

Evidence Ref: Table 4 (Prompting vs. Fine-tuning)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Overall Quality Score (RAISE, fine-tuned) | 7.71 | ReAct (fine-tuned) 7.52 | +0.19 | in-house 100 eval scenes | RAISE fine-tuned overall quality 7.71 in Table 3 | Table 3 |
| Action Steps (RAISE, fine-tuned) | 0.26 | ReAct (fine-tuned) 0.88 | -0.62 | in-house 100 eval scenes | RAISE action steps 0.26 in Table 3 | Table 3 |
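The deltas reported here and in Key Findings are simple differences against a baseline. As a quick check, the sketch below recomputes them from the reported values and adds a relative change; the relative-change column is not in the source and is included only to put the differences in proportion. The helper names are illustrative.

```python
def delta(value: float, baseline: float) -> float:
    # Absolute difference, rounded to the two decimals used in the report.
    return round(value - baseline, 2)


def relative_change(value: float, baseline: float) -> float:
    # Percentage change relative to the baseline, one decimal place.
    return round(100 * (value - baseline) / baseline, 1)


# (value, baseline) pairs taken from the reported numbers.
comparisons = {
    "quality: RAISE vs ReAct (both fine-tuned)": (7.71, 7.52),
    "action steps: RAISE vs ReAct (both fine-tuned)": (0.26, 0.88),
    "quality: fine-tuned vs prompted Qwen": (7.71, 6.68),
    "action steps: fine-tuned vs prompted Qwen": (0.26, 1.2),
}

for name, (value, baseline) in comparisons.items():
    print(f"{name}: delta {delta(value, baseline):+.2f}, "
          f"relative {relative_change(value, baseline):+.1f}%")
```

The quality gain over fine-tuned ReAct is small in relative terms, while the drop in action steps is large, which is worth keeping in mind given the 100-scene eval set.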

What To Try In 7 Days

Collect and anonymize ~200–1000 real multi-turn dialogs for your domain.

Add a scratchpad that logs recent observations and a small examples pool with vector retrieval.

Fine-tune an open-source 14B chat model (LoRA or full SFT) on the curated scenes and measure overall quality and action steps.
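The scratchpad and examples-pool step can be prototyped in a few lines before any model work. This is a rough sketch under stated assumptions: the `Scratchpad`, `ExamplesPool`, and `build_prompt` names, the prompt layout, and the sample strings are all hypothetical, and the bag-of-words `embed` function is a toy stand-in for a real sentence-embedding model and vector store.

```python
import math
from collections import Counter, deque
from typing import Deque, List, Tuple


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class Scratchpad:
    """Short-term memory: keeps only the most recent observations."""

    def __init__(self, max_items: int = 5) -> None:
        self.items: Deque[str] = deque(maxlen=max_items)

    def log(self, observation: str) -> None:
        self.items.append(observation)


class ExamplesPool:
    """Long-term memory: stored (query, response) pairs ranked by similarity."""

    def __init__(self, examples: List[Tuple[str, str]]) -> None:
        self.examples = [(q, r, embed(q)) for q, r in examples]

    def retrieve(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        qv = embed(query)
        ranked = sorted(self.examples, key=lambda e: cosine(qv, e[2]), reverse=True)
        return [(q, r) for q, r, _ in ranked[:k]]


def build_prompt(query: str, pad: Scratchpad, pool: ExamplesPool, k: int = 1) -> str:
    # Assemble retrieved examples and recent observations around the user query.
    examples = "\n".join(f"Q: {q}\nA: {r}" for q, r in pool.retrieve(query, k))
    notes = "\n".join(f"- {o}" for o in pad.items)
    return f"Examples:\n{examples}\n\nScratchpad:\n{notes}\n\nUser: {query}\nAgent:"


pool = ExamplesPool([
    ("what is the price of this house", "Call the house info tool."),
    ("recommend listings near the school", "Call the recommend listings tool."),
])
pad = Scratchpad(max_items=2)
pad.log("user budget noted")
pad.log("house info tool returned details")
pad.log("user prefers south-facing units")
prompt = build_prompt("what is the price", pad, pool)
```

The bounded `deque` is what makes the scratchpad short-term: once it is full, the oldest observation is dropped, so the prompt size stays predictable across long dialogs.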

Agent Features

Memory
Scratchpad (short-term scratch memory for recent interactions)
Examples pool retrieved by vector search (long-term memory)
Conversation history
Task trajectory (decision steps, tool outcomes)
Planning
Chain-of-Thought (CoT) reasoning (explain planning in steps)
Task planning with template-based prompts
Tool Use
Tool invocation with named parameters
12 domain tools (house info, market analysis, recommend listings, value report)
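Tool invocation with named parameters usually means the model emits a tool name plus a dictionary of arguments, which the runtime dispatches as keyword arguments. The sketch below illustrates that pattern only; the registry, the `invoke` function, the action format, and both stub tools (with their parameter names) are assumptions and are not taken from the paper.

```python
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    # Register a function under its own name so the agent can call it by name.
    TOOLS[fn.__name__] = fn
    return fn


@tool
def house_info(house_id: str) -> dict:
    # Hypothetical stub; a real tool would query a listings backend.
    return {"house_id": house_id, "status": "stub"}


@tool
def recommend_listings(district: str, max_price: int) -> list:
    # Hypothetical stub returning a single placeholder listing label.
    return [f"{district}-listing-under-{max_price}"]


def invoke(action: dict) -> Any:
    # Dispatch {"tool": name, "params": {...}} as a keyword-argument call.
    name = action["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**action.get("params", {}))


result = invoke({"tool": "recommend_listings",
                 "params": {"district": "north", "max_price": 500}})
```

Passing parameters by name rather than by position means a misspelled or missing argument fails loudly with a `TypeError`, which is easier to detect and log than a silently misordered call.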
Frameworks
ReAct
RAISE
Is Agentic

Yes

Architectures
RAISE
ReAct
Act-Only
ReAct+Scratchpad
ReAct+Examples
Collaboration
Possible multi-LLM/tool collaboration (described as supported)

Optimization Features

Infra Optimization
Inference performed on NVIDIA A100 80GB for Qwen-14B-Chat
Training Optimization
SFT
Use GPT-4 to generate CoT, then validate manually
Inference Optimization
Reduced action steps (fewer tool calls) after fine-tuning
Fewer plan steps lead to fewer internal loops

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Experiments use a single in-house real-estate dataset; generalization to other domains is untested.

Hallucination risks: role and knowledge hallucination are noted and partially addressed via data augmentation.

When Not To Use

When you need a general-purpose agent across many domains without domain data.

In safety-critical systems where hallucinations cannot be tolerated and external audits are required.

Failure Modes

Role hallucination: agent performs tasks outside its intended role.

Knowledge hallucination: agent fabricates facts when working memory or tools lack correct data.

Core Entities

Models

Qwen-14B-Chat
GPT-4
GPT-3.5

Metrics

Overall Quality Score
Specificity
Factuality
Coherence
Naturalness
Plan Steps
Action Steps
Inference Speed (s)

Datasets

in-house real-estate IM dataset (948 scenes; 848 train / 100 eval)
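For a dataset of this size, holding out a fixed evaluation set is a one-function job. The summary does not say how the 848 / 100 split was drawn, so the seeded random shuffle below is an assumption, and `split_scenes` is an illustrative name; integer IDs stand in for the scenes.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_scenes(scenes: Sequence[T], n_eval: int = 100, seed: int = 0) -> Tuple[List[T], List[T]]:
    # Shuffle a copy with a fixed seed so the split is reproducible,
    # then hold out the first n_eval items for evaluation.
    if n_eval > len(scenes):
        raise ValueError("n_eval exceeds the number of scenes")
    shuffled = list(scenes)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]


scene_ids = list(range(948))  # stand-in IDs for the 948 scenes
train, eval_set = split_scenes(scene_ids)
print(len(train), len(eval_set))
```

If scenes are extracted from the same conversation, splitting by conversation rather than by scene would avoid leakage between train and eval; whether the paper did so is not stated here.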

Context Entities

Models

OpenAI GPT-4
OpenAI GPT-3.5