Overview
The system was tested on real chats, deployed as an API, and load-tested. Results show clear gains over the existing BERT-based system, with trade-offs in latency and a residual hallucination risk.
Citations: 5
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 5/6
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 40%
Why It Matters For Business
RAG-grounded LLMs give agents more accurate, relevant replies than a BERT pair-matching system, cutting agent search time and likely reducing handling time.
Summary TLDR
The authors build and deploy a Retrieval-Augmented Generation (RAG) system that suggests customer responses for contact-center agents. They test embedding and retrieval choices, a retrieval threshold, and prompting strategies. On internal company chat data, their RAG+PaLM2 setup beats an existing BERT-based suggestion system on multiple automated and human metrics, with large gains in accuracy, relevance, and specificity. ReAct and multi-step verification cut hallucinations but add several seconds of latency, which makes them impractical for real-time agent assist in this deployment.
Problem Statement
Contact-center LLMs often hallucinate or miss company policy details. The paper asks which embeddings, retrievers, thresholds, and prompting strategies make RAG reliably useful for live agent suggestions while staying fast enough for production.
Main Contribution
End-to-end RAG pipeline for agent-facing response suggestions using company KB and chat history.
Systematic comparison of embeddings (Vertex AI, SBERT, USE) and retrievers (ScaNN, HNSW) with Recall@k results.
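As a rough illustration of the Recall@k metric used in that comparison, the sketch below computes the fraction of queries whose correct KB article appears in the top k retrieved results. It assumes one relevant article per query. The function name `recall_at_k`, the query ids, and the article ids are hypothetical and not taken from the paper.

```python
from typing import Dict, Sequence


def recall_at_k(
    ranked: Dict[str, Sequence[str]],
    relevant: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries whose relevant doc id is in the top-k results.

    Assumes exactly one relevant document per query (a simplification).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query_id, gold_id in relevant.items()
        if gold_id in list(ranked.get(query_id, []))[:k]
    )
    return hits / len(relevant)


# Hypothetical toy data: retriever output (best match first) per query.
ranked = {
    "q1": ["kb_7", "kb_2", "kb_9"],
    "q2": ["kb_4", "kb_1", "kb_3"],
    "q3": ["kb_5", "kb_6", "kb_8"],
    "q4": ["kb_2", "kb_0", "kb_1"],
}
# Hypothetical gold labels: the single correct article for each query.
relevant = {"q1": "kb_7", "q2": "kb_3", "q3": "kb_0", "q4": "kb_0"}

r1 = recall_at_k(ranked, relevant, 1)  # only q1 hits at rank 1
r3 = recall_at_k(ranked, relevant, 3)  # q1, q2 and q4 hit within top 3
print(f"Recall@1={r1:.2f} Recall@3={r3:.2f}")
```

Running the same function over the output of each embedding and retriever pairing gives directly comparable numbers, provided the query set and gold labels are held fixed.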
Key Findings
RAG responses scored much higher than the existing BERT-based system on human-evaluated accuracy (+45.69% on 1,000 annotated chats, Table 4).
Automated measures show RAG improves semantic match and reduces AI-detection rates.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | +10.15% | Existing BERT-based system | +10.15% | Internal company data (automated eval) | Table 3 shows averaged automated metric improvements | Table 3 |
| Accuracy | +45.69% | Existing BERT-based system | +45.69% | Human annotation on 1,000 chats | Table 4 human evaluation comparison | Table 4 |
What To Try In 7 Days
Index a small KB with Vertex AI embeddings and ScaNN and measure Recall@1/3
Set a cosine threshold at 0.7 to skip retrieval for generic queries
A/B test RAG vs your current suggestions on a sample of real chats with human raters (accuracy, relevance, preference)
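The threshold step above can be sketched as a simple gate, under the assumption that a query whose best cosine similarity to the KB falls below 0.7 is treated as generic and answered without retrieved context. The helper names, the toy vectors, and the article ids are hypothetical. A real setup would use embeddings from the chosen model and an approximate nearest-neighbour index rather than a linear scan.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_if_relevant(
    query_vec: Sequence[float],
    kb: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.7,
) -> Optional[Tuple[str, float]]:
    """Return (doc_id, score) for the best KB match, or None to skip retrieval."""
    best: Optional[Tuple[str, float]] = None
    for doc_id, doc_vec in kb:
        score = cosine(query_vec, doc_vec)
        if best is None or score > best[1]:
            best = (doc_id, score)
    if best is None or best[1] < threshold:
        return None  # treated as a generic query: no KB context is added
    return best


# Hypothetical 3-dimensional "embeddings" for two KB articles.
kb = [
    ("refund_policy", [1.0, 0.0, 0.0]),
    ("shipping_times", [0.0, 1.0, 0.0]),
]

specific_hit = retrieve_if_relevant([0.9, 0.1, 0.0], kb)  # close to refund_policy
generic_hit = retrieve_if_relevant([0.2, 0.2, 1.0], kb)   # far from every article
print(specific_hit, generic_hit)
```

The 0.7 value is the starting point suggested above. It is worth sweeping the threshold on a labeled sample, since the right cut-off depends on the embedding model.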
Risks & Boundaries
Limitations
LLMs still produce inaccurate answers despite grounding; paper acknowledges residual hallucination risk
Does not address prompt injection, multilingual KBs, or KB quality impacts in depth
When Not To Use
Real-time scenarios that require sub-second tail latency but also need multi-step reasoning (ReAct or multi-step verification added several seconds in this deployment)
Domains without a reliable and up-to-date company KB
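One way to check the latency condition before committing to a prompting strategy is to compare a tail percentile of measured response times against a budget. The sketch below uses a nearest-rank percentile. The latency samples and the 1.0 second budget are hypothetical placeholders, not measurements from the paper.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds for ten suggestion requests.
latencies_s = [0.30, 0.40, 0.35, 0.50, 0.45, 0.60, 0.38, 0.42, 2.80, 0.41]
budget_s = 1.0  # assumed sub-second budget for real-time agent assist

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
fits_budget = p95 <= budget_s
print(f"p50={p50:.2f}s p95={p95:.2f}s fits_budget={fits_budget}")
```

In this toy sample the median looks healthy while a single slow request pushes the tail over budget, which is why the check uses a tail percentile rather than the mean.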
Failure Modes
Wrong KB article retrieval leads to hallucinated but fluent answers
Retrieval misses (low Recall@k) produce missing or incorrect answers

