Overview
Production Readiness
0.7
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
RAG-grounded LLMs give agents more accurate, relevant replies than a BERT pair-matching system, cutting agent search time and likely reducing handling time.
Summary TLDR
The authors build and deploy a Retrieval-Augmented Generation (RAG) system that suggests customer responses to contact-center agents. They test embedding and retrieval choices, a retrieval threshold, and prompting strategies. On internal company chat data, their RAG+PaLM2 setup beats an existing BERT-based suggestion system on multiple automated and human metrics (large gains in accuracy, relevance, and specificity). ReAct and multi-step verification cut hallucinations but add several seconds of latency, making them impractical for real-time agent assist in this deployment.
Problem Statement
Contact-center LLMs often hallucinate or miss company policy details. The paper asks which embeddings, retrievers, thresholds, and prompting strategies make RAG reliably useful for live agent suggestions while staying fast enough for production.
Main Contribution
End-to-end RAG pipeline for agent-facing response suggestions using company KB and chat history.
Systematic comparison of embeddings (Vertex AI, SBERT, USE) and retrievers (ScaNN, HNSW) with Recall@k results.
Automated and human evaluations showing RAG outperforms an existing BERT-based system on accuracy and relevance.
Operational findings: a 0.7 cosine threshold to skip retrieval for out-of-domain queries and a latency vs accuracy trade-off for ReAct and multi-step prompts.
Key Findings
RAG responses scored much higher on human-evaluated accuracy than BERT.
Automated measures show RAG improves semantic match and reduces AI-detection rates.
Vertex AI embeddings + ScaNN produced the best retrieval recall on company data.
A cosine-similarity threshold of 0.7 separated relevant from out-of-domain retrievals.
ReAct reduced hallucinations but greatly increased tail latency.
CoVe and CoTP prompting did not improve practical accuracy for company data.
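The threshold finding above can be sketched as a simple gate. This is a minimal illustration, not the paper's implementation: the function names, the toy 3-dimensional vectors, and the choice to filter per document are assumptions; only the 0.7 cosine value comes from the summary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_if_relevant(query_vec, kb, threshold=0.7, k=3):
    """Return up to k (doc_id, score) pairs scoring at or above threshold.

    An empty list signals a likely out-of-domain query, so the caller
    can skip grounding for that turn.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in kb.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda p: p[1], reverse=True)[:k]

# Hypothetical toy embeddings; real ones would come from an embedding model.
kb = {"refund-policy": [0.9, 0.1, 0.0], "shipping-times": [0.1, 0.9, 0.2]}
in_domain = retrieve_if_relevant([0.8, 0.2, 0.0], kb)
out_of_domain = retrieve_if_relevant([0.0, 0.1, 1.0], kb)
```

With these toy vectors only the refund article clears the gate for the first query, and nothing does for the second. The threshold should be re-tuned on your own data, since similarity ranges differ across embedding models.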
Results
Accuracy
Human Preference (RAG preferred)
Embedding recall improvement (R@1)
Retrieval threshold separability
ReAct latency (95th / 99th pct)
Who Should Care
What To Try In 7 Days
Index a small KB with Vertex AI embeddings and ScaNN and measure Recall@1/3
Set a cosine threshold at 0.7 to skip retrieval for generic queries
A/B test RAG vs your current suggestions on a sample of real chats with human raters (accuracy, relevance, preference)
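For the Recall@1/3 measurement in the first step, a minimal sketch follows. It assumes each test query has exactly one gold KB article, which may not match the paper's labeling; the document IDs and helper names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the single gold document is in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(results, k):
    """Average Recall@k over (ranked_ids, relevant_id) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in results) / len(results)

# Hypothetical retriever output: ranked doc IDs plus the gold doc per query.
results = [
    (["kb-12", "kb-07", "kb-31"], "kb-12"),  # hit at rank 1
    (["kb-44", "kb-02", "kb-19"], "kb-19"),  # hit at rank 3
    (["kb-05", "kb-08", "kb-21"], "kb-30"),  # miss
]
r1 = mean_recall_at_k(results, 1)
r3 = mean_recall_at_k(results, 3)
```

Running the same labeled queries through each embedding and retriever pairing gives a like-for-like comparison before any LLM is involved.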
Agent Features
Memory
- retrieval memory (KB articles)
- short-term chat context passed to LLM
Tool Use
- ScaNN
- HNSW KNN
- Vertex AI embeddings
- PaLM2 generation
Frameworks
- Flask API
- Gunicorn
- Locust load testing
Architectures
- retriever+generator RAG
Optimization Features
Token Efficiency
- pass only top-k retrieved docs and recent chat context
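A rough sketch of this token-saving step: keep only the top-k retrieved passages and the most recent turns when assembling the prompt. The prompt wording, the `k` and `max_turns` defaults, and the function name are assumptions for illustration, not taken from the paper.

```python
def build_prompt(chat_turns, retrieved_docs, k=3, max_turns=4):
    """Assemble a grounded prompt from the top-k docs and the latest turns."""
    docs = retrieved_docs[:k]          # assumes docs are already ranked best-first
    turns = chat_turns[-max_turns:]    # drop older turns to save tokens
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs))
    history = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return (
        "Use only the knowledge-base excerpts below to suggest the agent's next reply.\n"
        f"Excerpts:\n{context}\n\n"
        f"Conversation:\n{history}\n"
        "Agent:"
    )

# Hypothetical KB passages and chat history.
docs = [
    "Refunds are issued within 5 days.",
    "Shipping takes 3 days.",
    "Returns need a receipt.",
    "Gift cards never expire.",
]
turns = [
    ("Customer", "Hi there"),
    ("Agent", "Hello, how can I help?"),
    ("Customer", "I returned an item last week."),
    ("Agent", "Thanks, let me check."),
    ("Customer", "When will I get my refund?"),
]
prompt = build_prompt(turns, docs)
```

Here the fourth passage and the opening greeting are dropped, so the prompt length stays bounded regardless of chat length.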
System Optimization
- deployed model as API endpoint; load tested with Locust
Inference Optimization
- use retrieval threshold to skip costly retrieval and LLM calls
- prefer ScaNN for faster large-scale nearest-neighbor search
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLMs still produce inaccurate answers despite grounding; the paper acknowledges residual hallucination risk
- Does not address prompt injection, multilingual KBs, or KB quality impacts in depth
- ReAct and multi-step prompts add unacceptable latency for live agent assist
When Not To Use
- Real-time scenarios that need both sub-second tail latency and multi-step reasoning
- Domains without a reliable and up-to-date company KB
- Multilingual deployments (not evaluated)
Failure Modes
- Wrong KB article retrieval leads to hallucinated but fluent answers
- Retrieval misses (low Recall) produce missing or incorrect answers
- ReAct or CoVe causes high latency spikes, harming agent UX
Core Entities
Models
- PaLM2 (text-bison, text-unicorn)
- SBERT-all-mpnet-base-v2
- Universal Sentence Encoder (USE)
- Vertex AI textembedding-gecko@001
- BERT-based production system
- ChatGPT-3.5-turbo (evaluator)
Metrics
- Accuracy
- Hallucination rate
- Missing rate
- Recall@k
- AlignScore
- Semantic similarity
- Latency (95th/99th pct)
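Tail latency at the 95th and 99th percentiles can be computed from raw request timings. The sketch below uses the nearest-rank definition, which is one of several percentile conventions and not necessarily what the authors' load-testing tool reports; the latency samples are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request latencies in milliseconds, with one slow outlier.
latencies_ms = [420, 455, 470, 510, 530, 560, 610, 700, 950, 3200]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With only ten samples the p95 and p99 both land on the single outlier, which illustrates why tail metrics need a large number of requests to be meaningful.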
Datasets
- Internal company KB (1,205 docs)
- Internal contact-center chat transcripts (1,000 chats)
- MS-MARCO
- SQuAD
- TriviaQA
Benchmarks
- Recall@k
- AlignScore
- Semantic similarity (LongFormer embeddings)
- Human preference A/B

