ChatCRS: add a knowledge retriever and a goal planner to make LLMs useful conversational recommenders

May 3, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.4

Citation Count

2

Authors

Chuang Li, Yang Deng, Hengchang Hu, Min-Yen Kan, Haizhou Li

Why It Matters For Business

If you want LLMs to make real product recommendations in a specific domain, wrap them with a KB retriever and a goal planner; that combination turns an LLM from a brittle zero-shot text generator into a materially better recommender on the evaluated datasets.

Summary TLDR

LLMs alone struggle with domain-specific conversational recommendation. ChatCRS is a modular framework that wraps an LLM with (1) a relation-based knowledge retrieval agent and (2) a LoRA fine-tuned goal-planning agent. Both agents feed external inputs into few-shot in-context prompts. On two multi-goal Chinese CRS datasets (DuRecDial, TG-Redial), ChatCRS raises human-rated informativeness (~+17%) and proactivity (~+27%) and improves recommendation NDCG/MRR over few-shot LLM baselines by roughly an order of magnitude, approaching fully trained baselines.

Problem Statement

LLMs produce fluent text but lack the reliable domain facts and explicit dialogue goals needed for conversational recommendation. Without external knowledge and goal guidance, they give wrong facts, poor recommendations, or unproductive dialogue turns in domain-specific CRS.

Main Contribution

Empirical study showing external knowledge and explicit goals are necessary to make LLMs work for conversational recommendation in a domain (Chinese movies).

ChatCRS: a three-agent design (relation-based knowledge retriever, LoRA-based goal planner, and an LLM conversational agent) that adds knowledge and goals without heavy LLM fine-tuning.

Demonstration on DuRecDial and TG-Redial that ChatCRS improves automatic metrics and human-evaluated informativeness/proactivity and boosts recommendation accuracy over few-shot LLM baselines.
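The three-agent flow described above can be sketched roughly as one function that calls a planner, a retriever, and a generator in turn. This is only an illustration under stated assumptions: `chatcrs_turn`, the stub functions, the prompt layout, and the toy triple are all hypothetical and are not taken from the authors' code.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail)


def chatcrs_turn(
    history: List[str],
    plan_goal: Callable[[List[str]], str],
    retrieve: Callable[[List[str]], List[Triple]],
    generate: Callable[[str], str],
) -> str:
    """Run one turn: plan a goal, fetch knowledge, then prompt the LLM."""
    goal = plan_goal(history)
    triples = retrieve(history)
    facts = "\n".join(f"- {h} | {r} | {t}" for h, r, t in triples) or "- (none)"
    prompt = (
        f"Dialogue goal: {goal}\n"
        f"Knowledge:\n{facts}\n"
        "Dialogue so far:\n" + "\n".join(history) + "\nSystem:"
    )
    return generate(prompt)


# Hypothetical stubs so the sketch runs without any model or KB.
def stub_planner(history: List[str]) -> str:
    # Stands in for the LoRA fine-tuned goal planner (assumption).
    return "question answering" if history[-1].endswith("?") else "chit-chat"


def stub_retriever(history: List[str]) -> List[Triple]:
    # Stands in for the relation-based knowledge retriever (assumption).
    return [("Inception", "director", "Christopher Nolan")]


def echo_llm(prompt: str) -> str:
    # Stands in for the conversational LLM; it just returns the prompt.
    return prompt


reply = chatcrs_turn(
    ["User: Who directed Inception?"], stub_planner, stub_retriever, echo_llm
)
print(reply)
```

Passing the agents in as callables keeps each piece swappable, which matches the modular framing of the summary, though the real system may wire them together differently.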

Key Findings

External knowledge massively improves recommendation ranking for LLMs on DuRecDial.

Numbers: ChatGPT NDCG@10: DG 0.024 -> Oracle 0.617 (DuRecDial, Table 1)

Goal guidance and knowledge together improve response quality and dialog flow.

Numbers: Human scores: ChatCRS Info 1.76 vs ChatGPT 1.50 (+0.26, ≈17%); Pro 1.69 vs 1.30 (+0.39, ≈30%) (Table 6)

Both factual triples and item-based triples are needed; removing either harms both tasks.

Numbers: Recommendation NDCG@10 (ChatGPT + knowledge): both 0.617; -w/o factual 0.272; -w/o item-based 0.376 (Table 3)

Results

NDCG@10

Value: ChatCRS 0.549

Baseline: ChatGPT 0.024 (3-shot)

MRR@10

Value: ChatCRS 0.543

Baseline: ChatGPT 0.018 (3-shot)

Human Informativeness score

Value: ChatCRS 1.76

Baseline: ChatGPT 1.50

Human Proactivity score

Value: ChatCRS 1.69

Baseline: ChatGPT 1.30

Knowledge retrieval F1

Value: ChatCRS 0.553

Baseline: ChatGPT 0.015
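For readers unfamiliar with the ranking metrics above, here is a minimal sketch of NDCG@k and MRR@k. It assumes exactly one ground-truth item per turn, in which case the ideal DCG is 1; the paper's exact evaluation protocol is not given in this summary, and the function names and toy rankings are hypothetical.

```python
import math
from typing import List, Sequence


def ndcg_at_k(ranked: Sequence[str], target: str, k: int = 10) -> float:
    """NDCG@k with a single relevant item: 1 / log2(rank + 1) if it is in the top k."""
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    rank = top.index(target) + 1  # ranks start at 1
    return 1.0 / math.log2(rank + 1)


def mrr_at_k(ranked: Sequence[str], target: str, k: int = 10) -> float:
    """Reciprocal rank of the target, or 0 if it is outside the top k."""
    top = list(ranked[:k])
    if target not in top:
        return 0.0
    return 1.0 / (top.index(target) + 1)


def average(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Toy rankings for three turns; the target item is "a" each time.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "d", "e"]]
ndcg = average([ndcg_at_k(r, "a") for r in rankings])
mrr = average([mrr_at_k(r, "a") for r in rankings])
print(f"NDCG@10={ndcg:.3f}  MRR@10={mrr:.3f}")
```

Both metrics reward placing the correct item near the top; NDCG discounts by the logarithm of the rank, while MRR discounts by the rank itself.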

What To Try In 7 Days

Add a lightweight relation-based KB retriever that returns entity-relation triples and feed top triples into the LLM prompt.

Fine-tune a small goal planner via LoRA on your dialog-goal labels and use it to steer LLM replies.

Run a small human evaluation (100 dialogs) measuring informativeness and proactivity before/after adding KB+goals.
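For the before/after human evaluation in the last step, a small script can average annotator ratings and report the relative lift. This is a sketch under assumptions: the rating keys, the toy ratings, and the helper names are hypothetical, and the rating scale is whatever your annotation guideline defines.

```python
from statistics import mean
from typing import Dict, List


def mean_scores(ratings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each rated dimension over all annotated dialogs."""
    return {key: mean(r[key] for r in ratings) for key in ratings[0]}


def relative_lift(before: float, after: float) -> float:
    """Relative change from the baseline score, e.g. 0.17 means +17%."""
    return (after - before) / before


# Toy ratings (hypothetical): one dict per annotated dialog.
baseline = [{"info": 1, "pro": 1}, {"info": 2, "pro": 1}]
augmented = [{"info": 2, "pro": 2}, {"info": 2, "pro": 1}]

before, after = mean_scores(baseline), mean_scores(augmented)
for key in before:
    print(f"{key}: {before[key]:.2f} -> {after[key]:.2f} "
          f"({relative_lift(before[key], after[key]):+.0%})")
```

Applied to the Table 6 scores quoted in Key Findings, `relative_lift(1.50, 1.76)` gives about 17% and `relative_lift(1.30, 1.69)` gives about 30%.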

Agent Features

Memory

  • short-term dialog history via prompts

Planning

  • goal planning for next-turn dialogue goal
  • relation-based planning to choose KB relation
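A rough sketch of relation-based, single-hop retrieval: list the relations the KB holds for an entity, pick one, then return the matching triples. The keyword-overlap scorer below is only a stand-in for the LLM-based relation choice, and the toy KB and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail)

# Toy KB (hypothetical); a real system would query a domain knowledge base.
KB: List[Triple] = [
    ("Inception", "director", "Christopher Nolan"),
    ("Inception", "release year", "2010"),
    ("Inception", "genre", "science fiction"),
]


def candidate_relations(entity: str, kb: List[Triple]) -> List[str]:
    """All relations the KB stores for this entity, sorted for determinism."""
    return sorted({rel for head, rel, _ in kb if head == entity})


def choose_relation(utterance: str, relations: List[str]) -> str:
    """Stand-in for the LLM choice: pick the relation sharing most words."""
    words = set(utterance.lower().replace("?", " ").split())
    return max(relations, key=lambda rel: len(words & set(rel.lower().split())))


def retrieve(entity: str, utterance: str, kb: List[Triple]) -> List[Triple]:
    """Single-hop retrieval: one entity, one chosen relation."""
    relations = candidate_relations(entity, kb)
    if not relations:
        return []
    relation = choose_relation(utterance, relations)
    return [t for t in kb if t[0] == entity and t[1] == relation]


print(retrieve("Inception", "Who was the director of Inception?", KB))
```

Restricting the choice to relations that actually exist for the entity keeps the selection step small, but a wrong relation pick still returns the wrong facts, which is the first failure mode listed under Risks & Boundaries.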

Tool Use

  • KB retrieval agent
  • LoRA
  • ICL prompting as tool interface

Frameworks

  • tool-augmented LLM (LLM calls agents)
  • in-context learning (ICL) orchestration

Is Agentic

true

Architectures

  • multi-agent (retriever + planner + conversational LLM)

Collaboration

  • agents coordinate: planner and retriever feed LLM

Optimization Features

Token Efficiency

  • limit to 50 item-based triples due to prompt token length
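The 50-triple cap can be enforced with a small sampling helper before the prompt is built. This is a sketch assuming seeded uniform random sampling; the summary only says triples are sampled up to 50, so the sampling strategy, `cap_triples`, and the toy data are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail)

MAX_ITEM_TRIPLES = 50  # cap quoted in the summary


def cap_triples(
    triples: Sequence[Triple], limit: int = MAX_ITEM_TRIPLES, seed: int = 0
) -> List[Triple]:
    """Return all triples if under the limit, else a reproducible random sample."""
    if len(triples) <= limit:
        return list(triples)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(list(triples), limit)


# Toy data (hypothetical): 120 item-based triples for one actor.
many = [("Actor X", "stars in", f"Movie {i}") for i in range(120)]
few = many[:3]

print(len(cap_triples(many)), len(cap_triples(few)))
```

Uniform sampling is the simplest choice; ranking triples by relevance to the dialogue before truncating would be an alternative if dropped facts turn out to hurt recommendations.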

Infra Optimization

  • runs evaluated on a single A100 or via the OpenAI API (cost ≈ US$20 for the dataset)

Model Optimization

  • LoRA

System Optimization

  • agent decomposition to reduce LLM input load

Training Optimization

  • LoRA

Inference Optimization

  • few-shot in-context learning (3-shot) to avoid full LLM fine-tune
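A 3-shot in-context prompt can be assembled by prepending three solved examples to the current turn. The field names, prompt layout, and example dialogs below are hypothetical; the paper's actual prompt template is not reproduced in this summary.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    goal: str
    knowledge: str
    context: str
    response: str


def format_block(goal: str, knowledge: str, context: str, response: str = "") -> str:
    """Render one demonstration, or the open query when response is empty."""
    block = (
        f"Goal: {goal}\nKnowledge: {knowledge}\n"
        f"Context: {context}\nResponse: {response}"
    )
    return block.rstrip()


def build_few_shot_prompt(
    examples: List[Example], goal: str, knowledge: str, context: str, n_shots: int = 3
) -> str:
    """Join the first n_shots demonstrations with the unanswered current turn."""
    shots = [format_block(*ex) for ex in examples[:n_shots]]
    query = format_block(goal, knowledge, context)
    return "\n\n".join(shots + [query])


# Toy demonstrations (hypothetical); only the first three are used.
demos = [
    Example("chit-chat", "(none)", "User: Hi!", "Hello! Seen any good films lately?"),
    Example("question answering", "Movie A | director | Person B",
            "User: Who directed Movie A?", "Movie A was directed by Person B."),
    Example("movie recommendation", "Person B | directed | Movie C",
            "User: Anything similar?", "You might enjoy Movie C, also by Person B."),
    Example("chit-chat", "(none)", "User: Thanks!", "You're welcome!"),
]

prompt = build_few_shot_prompt(
    demos, "movie recommendation", "Person B | directed | Movie D", "User: What next?"
)
print(prompt)
```

Because the prompt ends with an open `Response:` field, the model is nudged to continue in the same format as the demonstrations.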

Reproducibility

Code Urls

  • Git4ChatCRS (paper states code publicly available at Git4ChatCRS)

Data Urls

  • DuRecDial
  • TG-Redial
  • KBCNpedia (cited for TG-Redial)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Experiments focus on Chinese movie datasets; results may not generalize to other domains.
  • Knowledge retrieval is single-hop only; multi-hop needs are untested.
  • The study uses few-shot ICL with small or closed LLMs (ChatGPT, LLaMA-7b/13b); it includes no fully fine-tuned LLM baseline other than UniMIND.
  • Token-length constraint forces sampling up to 50 item triples; this truncation may drop key facts.

When Not To Use

  • If you require multi-hop reasoning over the KB.
  • For production systems that require full, collaborative-filtering recommendations based on rich user logs without KB signals.
  • When no structured KB of domain facts/items exists.

Failure Modes

  • Incorrect relation selection yields wrong retrieved facts and factual errors in replies.
  • Goal planner misprediction leads to unproductive dialog turns or wrong recommendations.
  • KB coverage gaps still leave LLMs guessing and hallucinating.
  • Token limits (50 items) can omit relevant items and degrade recommendations.

Core Entities

Models

  • ChatCRS
  • ChatGPT (gpt-3.5-turbo variants)
  • LLaMA-7b
  • LLaMA-13b
  • UniMIND
  • MGCG
  • TPNet
  • SASRec

Metrics

  • NDCG@10
  • NDCG@50
  • MRR@10
  • MRR@50
  • BLEU-1
  • BLEU-2
  • F1
  • Dist-1/2
  • Human: Fluency/Coherence/Informativeness/Proactivity
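Two of the automatic metrics listed above are easy to sketch: Dist-n (ratio of distinct n-grams to all n-grams across generated responses) and a set-based F1. The exact variants used in the paper are not specified here, so treat these as common textbook forms. Whitespace tokenization is an assumption that would not suit Chinese text, which needs word segmentation first.

```python
from typing import Sequence, Set


def distinct_n(responses: Sequence[str], n: int) -> float:
    """Corpus-level Dist-n: distinct n-grams divided by total n-grams."""
    seen = set()
    total = 0
    for text in responses:
        tokens = text.split()  # whitespace tokenization (assumption)
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over two sets of labels."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


responses = ["i like this movie", "i like that movie"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
print(set_f1({"director", "genre"}, {"director"}))
```

Low Dist-n values indicate repetitive, generic responses, which is why the metric is usually reported alongside BLEU rather than instead of it.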

Datasets

  • DuRecDial
  • TG-Redial
  • KBCNpedia (used for TG-Redial)

Benchmarks

  • DuRecDial CRS benchmark
  • TG-Redial CRS benchmark