Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
RAISE shows you can get better, cheaper domain chatbots by adding a short-term scratchpad and retrieved examples, then fine-tuning on fewer than 1,000 curated scenes.
Summary TLDR
RAISE is an agent architecture that pairs a scratchpad (short-term memory) with an examples pool (retrieval-based long-term memory) and trains LLMs on small, high-quality agent dialogues. On an in-house real-estate chat dataset (948 scenes; 848 train / 100 eval), a fine-tuned Qwen-14B-Chat in the RAISE setup achieved the best overall quality (7.71/10) and used fewer action steps (0.26) than the ablated variants. The paper argues that fine-tuning plus memory gives more controllable, efficient agents than prompt-only methods. Results are limited to a single domain and an internal dataset.
Problem Statement
LLMs are strong at isolated tasks, but building conversational agents that keep context and act reliably in long, multi-turn dialogs is hard. The paper aims to add memory and targeted fine-tuning so agents stay context-aware, controllable, and efficient in domain dialogues.
Main Contribution
RAISE architecture: combines a scratchpad (short-term memory) and retrieved examples (long-term memory) to keep context and guide actions.
A pipeline to build small, high-quality agent training data: Conversation Selection, Scene Extraction, CoT (chain-of-thought) completion, and Scene Augmentation.
Empirical comparison showing fine-tuning with RAISE (on Qwen-14B-Chat) improves agent quality and reduces action steps versus prompting and ablations on an in-house real-estate test set.
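The four data-pipeline stages named above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the `min_turns` filter, the scene layout, and the caller-supplied `annotate` and `augmenters` hooks are all assumptions made for the example.

```python
def select_conversations(conversations, min_turns=4):
    # Conversation Selection (assumed criterion): keep dialogs with enough turns.
    return [c for c in conversations if len(c) >= min_turns]

def extract_scenes(conversation):
    # Scene Extraction (assumed layout): one scene per agent turn, with the
    # preceding turns as context and the agent reply as the training target.
    scenes = []
    for i, (speaker, text) in enumerate(conversation):
        if speaker == "agent" and i > 0:
            scenes.append({"context": conversation[:i], "target": text})
    return scenes

def complete_cot(scene, annotate):
    # CoT completion: `annotate` is a hypothetical stand-in for an LLM call
    # followed by manual validation.
    return {**scene, "cot": annotate(scene)}

def augment_scene(scene, augmenters):
    # Scene Augmentation: keep the original and add one variant per augmenter.
    return [scene] + [fn(scene) for fn in augmenters]

def build_dataset(conversations, annotate, augmenters=()):
    dataset = []
    for conv in select_conversations(conversations):
        for scene in extract_scenes(conv):
            dataset.extend(augment_scene(complete_cot(scene, annotate), augmenters))
    return dataset

# Toy input: each conversation is a list of (speaker, text) turns.
demo_conversations = [
    [("user", "hi"), ("agent", "hello"), ("user", "price?"), ("agent", "let me check")],
    [("user", "hi"), ("agent", "hello")],  # too short, dropped by selection
]
demo_dataset = build_dataset(demo_conversations, annotate=lambda s: "placeholder reasoning")
```

Passing the annotation and augmentation steps in as functions keeps the sketch independent of any particular model or labeling tool.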
Key Findings
RAISE with fine-tuning gave the highest overall quality among tested frameworks on the in-house real-estate eval.
Fine-tuning the same model beat prompting on measured quality and efficiency.
A small, high-quality dataset sufficed for improvement.
Memory mechanisms reduce agent actions and planning steps.
Results
Overall Quality Score (RAISE, fine-tuned): 7.71/10
Action Steps (RAISE, fine-tuned): 0.26
Overall quality: prompting vs. fine-tuning (Qwen-14B-Chat)
Dataset size used for fine-tuning: 848 training scenes (948 total; 100 held out for evaluation)
Who Should Care
What To Try In 7 Days
Collect and anonymize ~200–1000 real multi-turn dialogs for your domain.
Add a scratchpad that logs recent observations and a small examples pool with vector retrieval.
Fine-tune an open-source 14B chat model (LoRA or full SFT) on the curated scenes and measure overall quality and action steps.
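As a starting point for the second step, here is a minimal sketch of a bounded scratchpad and an examples pool with similarity-based retrieval. It is an illustration only: the class names are hypothetical, and `embed` is a toy bag-of-words stand-in for a real embedding model and vector index.

```python
import math
from collections import Counter, deque

def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ExamplesPool:
    """Long-term memory: stores examples and returns the most similar ones."""
    def __init__(self, examples):
        self.items = [(ex, embed(ex)) for ex in examples]

    def retrieve(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [ex for ex, _ in ranked[:k]]

class Scratchpad:
    """Short-term memory: keeps only the most recent observations."""
    def __init__(self, max_entries=5):
        self.entries = deque(maxlen=max_entries)

    def log(self, observation):
        self.entries.append(observation)

    def render(self):
        return "\n".join(self.entries)

pool = ExamplesPool([
    "user asks price of a two bedroom apartment -> call house info tool",
    "user asks about market trend in the district -> call market analysis tool",
    "user wants similar listings -> call recommend listings tool",
])
pad = Scratchpad(max_entries=2)
pad.log("user: what is the price of this apartment")
pad.log("tool: house info returned a listing record")
pad.log("user: how is the market trend here")  # evicts the oldest entry
top = pool.retrieve("how is the market trend here", k=1)
```

The bounded `deque` is one simple way to keep the scratchpad short; the right window size depends on the model's context budget.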
Agent Features
Memory
- Scratchpad (short-term memory for recent interactions)
- Examples pool retrieved by vector search (long-term memory)
- Conversation history
- Task trajectory (decision steps, tool outcomes)
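The four memory components listed above could be gathered into one working-memory object that is rendered into the model prompt on each turn. The sketch below assumes a simple bracketed-section prompt layout; the class name, field names, and format are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    conversation_history: list = field(default_factory=list)
    scratchpad: list = field(default_factory=list)
    examples: list = field(default_factory=list)
    task_trajectory: list = field(default_factory=list)

    def to_prompt(self, user_query):
        # Render each non-empty component as its own section, then the query.
        sections = [
            ("Conversation history", self.conversation_history),
            ("Scratchpad", self.scratchpad),
            ("Retrieved examples", self.examples),
            ("Task trajectory", self.task_trajectory),
        ]
        parts = []
        for title, lines in sections:
            if lines:
                parts.append(f"[{title}]\n" + "\n".join(lines))
        parts.append(f"[Current query]\n{user_query}")
        return "\n\n".join(parts)

memory = WorkingMemory(
    conversation_history=["user: hi", "agent: hello, how can I help?"],
    scratchpad=["tool: house_info returned one record"],
)
prompt = memory.to_prompt("what is the price?")
```

Skipping empty sections keeps the prompt compact early in a dialog, before any examples or tool results exist.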
Planning
- Chain-of-Thought (CoT) reasoning (explain planning in steps)
- Task planning with template-based prompts
Tool Use
- Tool invocation with named parameters
- 12 domain tools (house info, market analysis, recommend listings, value report)
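Tool invocation with named parameters can be illustrated with a small registry that validates arguments before calling a tool. This is a sketch under assumptions: the two tool functions are stubs with invented signatures, and the `{"tool": ..., "args": ...}` call format is hypothetical rather than the paper's.

```python
import inspect

TOOLS = {}

def tool(fn):
    # Register a function under its own name so the agent can call it by name.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def house_info(listing_id: str) -> dict:
    # Stub: a real tool would query a listings backend.
    return {"listing_id": listing_id, "status": "stub"}

@tool
def recommend_listings(district: str, max_results: int = 3) -> list:
    # Stub: returns placeholder listing names.
    return [f"{district}-listing-{i}" for i in range(1, max_results + 1)]

def invoke(call):
    """Dispatch a call of the form {"tool": name, "args": {param: value}}."""
    name = call["tool"]
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    fn = TOOLS[name]
    try:
        # bind() raises TypeError on missing or unexpected parameter names.
        bound = inspect.signature(fn).bind(**call.get("args", {}))
    except TypeError as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "result": fn(*bound.args, **bound.kwargs)}

good = invoke({"tool": "recommend_listings", "args": {"district": "north", "max_results": 2}})
bad_args = invoke({"tool": "house_info", "args": {"id": "A1"}})
unknown = invoke({"tool": "value_report", "args": {}})
```

Returning an error record instead of raising lets the agent log the failure to its scratchpad and retry, which is relevant to the tool-execution failure mode noted under Failure Modes.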
Frameworks
- ReAct
- RAISE
Is Agentic
true
Architectures
- RAISE
- ReAct
- Act-Only
- ReAct+Scratchpad
- ReAct+Examples
Collaboration
- Possible multi-LLM/tool collaboration (described as supported)
Optimization Features
Infra Optimization
- Inference performed on NVIDIA A100 80GB for Qwen-14B-Chat
Training Optimization
- SFT
- GPT-4 generates the CoT, followed by manual validation
Inference Optimization
- Reduced action steps (fewer tool calls) after fine-tuning
- Fewer plan steps lead to fewer internal loops
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Experiments use a single in-house real-estate dataset; generalization to other domains is untested.
- Hallucination risks: role and knowledge hallucination are noted and partially addressed via data augmentation.
- Complex logical reasoning remains a challenge; the paper notes difficulty with hard logic problems.
- No public release of code or dataset limits reproducibility and independent verification.
When Not To Use
- When you need a general-purpose agent across many domains without domain data.
- In safety-critical systems where hallucinations cannot be tolerated and external audits are required.
- When you cannot collect at least a few hundred high-quality domain dialogs for fine-tuning.
Failure Modes
- Role hallucination: agent performs tasks outside its intended role.
- Knowledge hallucination: agent fabricates facts when working memory or tools lack correct data.
- Tool execution failures: incorrect parameters or tool outputs leading to wrong actions.
- Long-range logic errors: incorrect multi-step planning or inconsistent chains of thought.
Core Entities
Models
- Qwen-14B-Chat
- GPT-4
- GPT-3.5
Metrics
- Overall Quality Score
- Specificity
- Factuality
- Coherence
- Naturalness
- Plan Steps
- Action Steps
- Inference Speed (s)
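When reproducing this kind of evaluation, per-scene scores for the metrics above can be averaged as in the sketch below. The record fields and the numbers are made-up illustrations, not values from the paper, and the paper's exact scoring procedure is not described here.

```python
from statistics import mean

def summarize(records, metrics):
    # Average each named metric over all evaluated scenes.
    return {m: mean(r[m] for r in records) for m in metrics}

# Hypothetical per-scene evaluation records (illustrative values only).
eval_records = [
    {"overall_quality": 8.0, "plan_steps": 1, "action_steps": 0},
    {"overall_quality": 7.0, "plan_steps": 2, "action_steps": 1},
    {"overall_quality": 8.0, "plan_steps": 1, "action_steps": 0},
    {"overall_quality": 7.0, "plan_steps": 2, "action_steps": 0},
]
summary = summarize(eval_records, ["overall_quality", "plan_steps", "action_steps"])
```

An average action-step count below 1, as reported for RAISE, means that many turns are answered without any tool call.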
Datasets
- in-house real-estate IM dataset (948 scenes; 848 train / 100 eval)
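A split of the same size (848 train / 100 eval out of 948 scenes) can be reproduced on your own data as sketched below. How the paper chose its evaluation scenes is not stated here, so the seeded random shuffle is an assumption.

```python
import random

def split_scenes(scene_ids, eval_size=100, seed=0):
    # Shuffle a copy deterministically, then hold out the first eval_size items.
    rng = random.Random(seed)
    shuffled = list(scene_ids)
    rng.shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]

train_ids, eval_ids = split_scenes(range(948))
```

Fixing the seed keeps the held-out set stable across runs, so scores from different fine-tuning attempts stay comparable.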
Context Entities
Models
- OpenAI GPT-4
- OpenAI GPT-3.5

