Overview
The architecture and training recipe are practical and show gains on one internal dataset, but the evidence is limited to a single domain and a 100-scene eval set, with no public code or data.
Citations: 6
Evidence Strength: 0.40
Confidence: 0.75
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
RAISE suggests that domain chatbots can be made better and cheaper to run by adding a short-term scratchpad and retrieved examples, then fine-tuning on fewer than 1,000 curated scenes.
Who Should Care
Summary TLDR
RAISE is an agent architecture that pairs a scratchpad (short-term memory) with an examples pool (retrieval-based long-term memory) and trains LLMs on a small set of high-quality agent dialogues. On an in-house real-estate chat dataset (948 scenes: 848 train, 100 eval), a fine-tuned Qwen-14B-Chat in the RAISE setup achieved the best overall quality (7.71/10) and used fewer action steps (0.26) than the ablated variants. The paper argues that fine-tuning plus memory gives more controllable, efficient agents than prompt-only methods. Results are limited to a single domain and an internal dataset.
Problem Statement
LLMs are strong at isolated tasks, but building conversational agents that keep context and act reliably across long, multi-turn dialogues is hard. The paper aims to add memory and targeted fine-tuning so that agents stay context-aware, controllable, and efficient in domain dialogues.
Main Contribution
RAISE architecture: combines a scratchpad (short-term memory) and retrieved examples (long-term memory) to keep context and guide actions.
A pipeline to build small, high-quality agent training data: Conversation Selection, Scene Extraction, CoT (chain-of-thought) completion, and Scene Augmentation.
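The two memory components above can be illustrated with a minimal sketch. It is not the paper's implementation: the class names `Scratchpad` and `ExamplesPool`, the `build_prompt` helper, the prompt layout, and the bag-of-words similarity (a stand-in for a real embedding model) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, deque
from dataclasses import dataclass, field


def _bow(text: str) -> Counter:
    """Bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Scratchpad:
    """Short-term memory: keeps only the most recent notes (hypothetical design)."""
    max_items: int = 5
    notes: deque = field(default_factory=deque)

    def add(self, note: str) -> None:
        self.notes.append(note)
        while len(self.notes) > self.max_items:
            self.notes.popleft()  # drop the oldest note first


@dataclass
class ExamplesPool:
    """Long-term memory: stored example dialogues, retrieved by similarity."""
    examples: list = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list:
        q = _bow(query)
        ranked = sorted(self.examples, key=lambda e: _cosine(q, _bow(e)), reverse=True)
        return ranked[:k]


def build_prompt(query: str, scratchpad: Scratchpad, pool: ExamplesPool, k: int = 2) -> str:
    """Assemble working memory: retrieved examples, recent notes, then the query."""
    examples = "\n".join(f"- {e}" for e in pool.retrieve(query, k))
    notes = "\n".join(f"- {n}" for n in scratchpad.notes)
    return f"Examples:\n{examples}\n\nScratchpad:\n{notes}\n\nUser: {query}"


# Illustrative data only; not taken from the paper's dataset.
pool = ExamplesPool([
    "user asks about price of the flat; agent quotes the listing price",
    "user asks to book a viewing; agent proposes a time slot",
])
pad = Scratchpad(max_items=2)
for note in ["user greeted agent", "user wants two bedrooms", "budget not yet stated"]:
    pad.add(note)
prompt = build_prompt("price of the flat", pad, pool, k=1)
```

Bounding the scratchpad keeps the prompt short, while the examples pool supplies task-relevant demonstrations without retraining; in practice the similarity function would be replaced by vector search over embeddings.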
Key Findings
RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.
Fine-tuning the same model beat prompting on measured quality and efficiency.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall Quality Score (RAISE, fine-tuned) | 7.71 | ReAct (fine-tuned) 7.52 | +0.19 | in-house 100 eval scenes | RAISE fine-tuned overall quality 7.71 in Table 3 | Table 3 |
| Action Steps (RAISE, fine-tuned) | 0.26 | ReAct (fine-tuned) 0.88 | -0.62 | in-house 100 eval scenes | RAISE action steps 0.26 in Table 3 | Table 3 |
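The Delta column is the RAISE value minus the fine-tuned ReAct baseline. The short check below reproduces those deltas and also derives relative changes; the relative figures are simple arithmetic on the table values, not numbers reported in the source, and the `delta` helper is a hypothetical name.

```python
def delta(value: float, baseline: float) -> tuple:
    """Return (absolute change, relative change in percent) versus a baseline."""
    abs_change = round(value - baseline, 2)
    rel_change = round(100 * (value - baseline) / baseline, 1)
    return abs_change, rel_change


quality = delta(7.71, 7.52)  # overall quality: RAISE vs ReAct, both fine-tuned
steps = delta(0.26, 0.88)    # action steps: RAISE vs ReAct, both fine-tuned

print(quality)  # (0.19, 2.5)
print(steps)    # (-0.62, -70.5)
```

The quality gain is small in relative terms, while the reduction in action steps is large, which is where the efficiency claim comes from.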
What To Try In 7 Days
Collect and anonymize ~200–1000 real multi-turn dialogs for your domain.
Add a scratchpad that logs recent observations and a small examples pool with vector retrieval.
Fine-tune an open-source 14B chat model (LoRA or full SFT) on the curated scenes and measure overall quality and action steps.
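To make the last step measurable, a held-out split and a small aggregation routine are enough to start. The sketch below assumes each evaluated scene yields a record with a `quality` score (0 to 10) and an `action_steps` count; those field names, the `split_scenes` and `summarize` helpers, and the fixed seed are illustrative assumptions, not the paper's evaluation code.

```python
import random
from statistics import mean


def split_scenes(scenes: list, n_eval: int, seed: int = 0) -> tuple:
    """Shuffle deterministically and hold out n_eval scenes for evaluation."""
    rng = random.Random(seed)
    shuffled = list(scenes)
    rng.shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)


def summarize(results: list) -> dict:
    """Average per-scene scores into the two headline metrics."""
    return {
        "overall_quality": mean(r["quality"] for r in results),
        "action_steps": mean(r["action_steps"] for r in results),
    }


# Mirror the split sizes reported in the summary (948 scenes, 100 held out).
scenes = [f"scene-{i}" for i in range(948)]
train, held_out = split_scenes(scenes, n_eval=100)

# Hypothetical per-scene results, for illustration only.
report = summarize([
    {"quality": 8, "action_steps": 0},
    {"quality": 6, "action_steps": 1},
])
```

Fixing the seed keeps the eval set stable across runs, so a prompt-only baseline and the fine-tuned model are scored on the same held-out scenes.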
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Infra Optimization
Training Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Experiments use a single in-house real-estate dataset; generalization to other domains is untested.
Hallucination risks: role and knowledge hallucination are noted and partially addressed via data augmentation.
When Not To Use
When you need a general-purpose agent across many domains without domain data.
In safety-critical systems where hallucinations cannot be tolerated and external audits are required.
Failure Modes
Role hallucination: agent performs tasks outside its intended role.
Knowledge hallucination: agent fabricates facts when working memory or tools lack correct data.

