RAISE: add a short-term scratchpad and retrieved examples, then fine-tune an LLM for better multi-turn agents

January 5, 20247 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

6

Authors

Na Liu, Liangyu Chen, Xiaoyu Tian, Wei Zou, Kaijiang Chen, Ming Cui

Links

Abstract / PDF

Why It Matters For Business

RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, and then fine-tuning on under 1k curated scenes.

Summary TLDR

RAISE is an agent architecture that pairs a scratchpad (short-term memory) with an examples pool (retrieval long-term memory) and trains LLMs on small, high-quality agent dialogues. On an in-house real-estate chat dataset (948 scenes, 848 train / 100 eval) fine-tuned Qwen-14B-Chat in the RAISE setup achieved the best overall quality (7.71/10) and fewer action steps (0.26) compared to ablated variants. The paper argues fine-tuning plus memory gives more controllable, efficient agents than prompt-only methods. Results are limited to a single domain and an internal dataset.

Problem Statement

LLMs are strong at isolated tasks, but building conversational agents that keep context and act reliably in long, multi-turn dialogs is hard. The paper aims to add memory and targeted fine-tuning so agents stay context-aware, controllable, and efficient in domain dialogues.

Main Contribution

RAISE architecture: combines a scratchpad (short-term memory) and retrieved examples (long-term memory) to keep context and guide actions.

A pipeline to build small, high-quality agent training data: Conversation Selection, Scene Extraction, CoT (chain-of-thought) completion, and Scene Augmentation.

Empirical comparison showing fine-tuning with RAISE (on Qwen-14B-Chat) improves agent quality and reduces action steps versus prompting and ablations on an in-house real-estate test set.

Key Findings

RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.

NumbersOverall quality 7.71 (RAISE, fine-tuned Qwen-14B-Chat) on 100 eval scenes

Fine-tuning the same model beat prompting on measured quality and efficiency.

NumbersQwen prompting overall 6.68 -> fine-tuned Qwen overall 7.71; action steps 1.2 -> 0.26

A small, high-quality dataset sufficed for improvement.

NumbersTotal 948 scenes produced; 848 train / 100 eval used for fine-tuning

Memory mechanisms reduce agent actions and planning steps.

NumbersRAISE (fine-tuned) Action Steps 0.26, Plan Steps 1.26 vs Act-Only Action Steps 0.66 (fine-tuned) or 1.29 (prompting)

Results

Overall Quality Score (RAISE, fine-tuned)

Value7.71

BaselineReAct (fine-tuned) 7.52

Action Steps (RAISE, fine-tuned)

Value0.26

BaselineReAct (fine-tuned) 0.88

Overall quality: prompting vs. fine-tuning (Qwen-14B-Chat)

ValuePrompting 6.68 -> Fine-tuning 7.71

BaselinePrompting (Qwen) 6.68

Dataset size used for fine-tuning

Value948 scenes (848 train / 100 eval)

Who Should Care

What To Try In 7 Days

Collect and anonymize ~200–1000 real multi-turn dialogs for your domain.

Add a scratchpad that logs recent observations and a small examples pool with vector retrieval.

Fine-tune an open-source 14B chat model (LoRA or full SFT) on the curated scenes and measure overall quality and action steps.

Agent Features

Memory

  • Scratchpad (short-term scratch memory for recent interactions)
  • Examples pool retrieved by vector search (long-term memory)
  • Conversation history
  • Task trajectory (decision steps, tool outcomes)

Planning

  • Chain-of-Thought (CoT) reasoning (explain planning in steps)
  • Task planning with template-based prompts

Tool Use

  • Tool invocation with named parameters
  • 12 domain tools (house info, market analysis, recommend listings, value report)

Frameworks

  • ReAct
  • RAISE

Is Agentic

true

Architectures

  • RAISE
  • ReAct
  • Act-Only
  • ReAct+Scratchpad
  • ReAct+Examples

Collaboration

  • Possible multi-LLM/tool collaboration (described as supported)

Optimization Features

Infra Optimization

  • Inference performed on NVIDIA A100 80GB for Qwen-14B-Chat

Training Optimization

  • SFT
  • Use GPT-4 to generate CoT then manual validation

Inference Optimization

  • Reduced action steps (fewer tool calls) after fine-tuning
  • Lower plan steps leads to fewer internal loops

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments use a single in-house real-estate dataset; generalization to other domains is untested.
  • Hallucination risks: role and knowledge hallucination are noted and partially addressed via data augmentation.
  • Complex logical reasoning remains a challenge; paper notes difficulty with hard logic problems.
  • No public release of code or dataset limits reproducibility and independent verification.

When Not To Use

  • When you need a general-purpose agent across many domains without domain data.
  • In safety-critical systems where hallucinations cannot be tolerated and external audits are required.
  • When you cannot collect at least a few hundred high-quality domain dialogs for fine-tuning.

Failure Modes

  • Role hallucination: agent performs tasks outside its intended role.
  • Knowledge hallucination: agent fabricates facts when working memory or tools lack correct data.
  • Tool execution failures: incorrect parameters or tool outputs leading to wrong actions.
  • Long-range logic errors: incorrect multi-step planning or inconsistent chains of thought.

Core Entities

Models

  • Qwen-14B-Chat
  • GPT-4
  • GPT-3.5

Metrics

  • Overall Quality Score
  • Specificity
  • Factuality
  • Coherence
  • Naturalness
  • Plan Steps
  • Action Steps
  • Inference Speed (s)

Datasets

  • in-house real-estate IM dataset (948 scenes; 848 train / 100 eval)

Context Entities

Models

  • OpenAI GPT-4
  • OpenAI GPT-3.5