RAG-grounded LLMs improve agent reply suggestions vs BERT and expose a retrieval/latency trade-off

September 5, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

5

Authors

Sriram Veturi, Saurabh Vaichal, Reshma Lal Jagadheesh, Nafis Irtiza Tripto, Nian Yan

Links

Abstract / PDF

Why It Matters For Business

RAG-grounded LLMs give agents more accurate, relevant replies than a BERT pair-matching system, cutting agent search time and likely reducing handling time.

Summary TLDR

The authors build and deploy a Retrieval-Augmented Generation (RAG) system that suggests customer responses to contact-center agents. They test embedding and retrieval choices, a retrieval threshold, and prompting strategies. On internal company chat data, their RAG+PaLM2 setup beats an existing BERT-based suggestion system on multiple automated and human metrics, with large gains in accuracy, relevance, and specificity. ReAct and multi-step verification cut hallucinations but add several seconds of latency, making them impractical for real-time agent assist in this deployment.

Problem Statement

Contact-center LLMs often hallucinate or miss company policy details. The paper asks which embeddings, retrievers, thresholds, and prompting strategies make RAG reliably useful for live agent suggestions while staying fast enough for production.

Main Contribution

End-to-end RAG pipeline for agent-facing response suggestions using company KB and chat history.

Systematic comparison of embeddings (Vertex AI, SBERT, USE) and retrievers (ScaNN, HNSW) with Recall@k results.

Automated and human evaluations showing RAG outperforms an existing BERT-based system on accuracy and relevance.

Operational findings: a 0.7 cosine threshold to skip retrieval for out-of-domain queries and a latency vs accuracy trade-off for ReAct and multi-step prompts.

Key Findings

RAG responses scored much higher on human-evaluated accuracy than BERT.

Numbers: Accuracy +45.69% (human eval, Table 4)

Automated measures show RAG improves semantic match and reduces AI-detection rates.

Numbers: Semantic similarity +20.01%; AI-detected -40.17% (Table 3)

Vertex AI embeddings + ScaNN produced the best retrieval recall on company data.

Numbers: R@1 improvement vs USE: +21.55% (Table 2)

A cosine-similarity threshold of 0.7 separated relevant from out-of-domain retrievals.

Numbers: 98.59% of out-of-domain retrievals scored below 0.7; 88.96% of relevant retrievals scored above 0.7 (Fig. 4)
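The threshold check can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `cosine_similarity` and `select_grounding_docs`, the toy vectors, and the choice to return an empty list when nothing clears 0.7 are all assumptions. In a real deployment the vectors would come from the embedding model, and the search would go through an approximate nearest-neighbor index such as ScaNN rather than a brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_grounding_docs(query_vec, doc_vecs, threshold=0.7, top_k=3):
    """Return up to top_k (doc_id, score) pairs whose score clears the threshold.

    An empty result is the caller's signal to skip grounded generation
    for this query (assumed behavior, chosen for illustration).
    """
    scored = sorted(
        ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [(doc_id, score) for score, doc_id in scored[:top_k] if score >= threshold]


# Toy 3-dimensional "embeddings" (hypothetical; real ones have hundreds of dimensions).
kb = {"refund_policy": [1.0, 0.0, 0.0], "shipping_times": [0.0, 1.0, 0.0]}

in_domain = select_grounding_docs([0.9, 0.1, 0.0], kb)      # close to refund_policy
out_of_domain = select_grounding_docs([0.0, 0.0, 1.0], kb)  # orthogonal to every doc
```

Filtering after the top-k cut keeps the check cheap: only the best few candidates are compared against the threshold, and a query with no candidate at or above 0.7 never reaches the LLM.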

ReAct reduced hallucinations but greatly increased tail latency.

Numbers: Accuracy +7.08%; hallucination -13.48% vs non-ReAct, but 95th-percentile latency 4.09 s vs 0.885 s (Tables 5, 6)

CoVe and CoTP prompting did not improve practical accuracy for company data.

Numbers: CoVe -43.65% accuracy; CoTP -3.45% accuracy (Table 5)

Results

Accuracy

Value: +10.15%

Baseline: Existing BERT-based system

Accuracy (human eval)

Value: +45.69%

Baseline: Existing BERT-based system

Human Preference (RAG preferred)

Value: RAG preferred 75% of the time

Baseline: Existing BERT-based system

Embedding recall improvement (R@1)

Value: Vertex AI better than USE by +21.55%

Baseline: USE

Retrieval threshold separability

Value: 0.7 cosine threshold

ReAct latency (95th / 99th pct)

Value: 95th: 4.0942 s; 99th: 6.2084 s

Baseline: non-ReAct 95th: 0.885 s; 99th: 1.1678 s
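Tail-latency figures like the 95th and 99th percentiles above can be computed from raw request timings with a few lines of code. The sketch below uses the nearest-rank definition, which is an assumption: the summary does not say which percentile method the load-testing tool applied. The helper names `percentile` and `within_budget` and the sample latencies are hypothetical and synthetic, not the paper's measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def within_budget(samples, budget_s, pct=95):
    """True if the pct-th percentile latency is at or under budget_s seconds."""
    return percentile(samples, pct) <= budget_s


# Synthetic latencies in seconds (illustrative only).
latencies = [0.4] * 90 + [0.8] * 8 + [3.5, 6.0]
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
```

With these synthetic samples the 95th percentile stays under one second while the 99th does not, which is the kind of gap that makes a prompting strategy look acceptable on averages yet fail a real-time budget.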

Who Should Care

What To Try In 7 Days

Index a small KB with Vertex AI embeddings and ScaNN and measure Recall@1/3

Set a cosine threshold at 0.7 to skip retrieval for generic queries

A/B test RAG vs your current suggestions on a sample of real chats with human raters (accuracy, relevance, preference)
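For the Recall@1/3 measurement in the first step, a small scoring helper is enough once a labeled sample exists. The sketch below assumes one correct KB article per query, which may differ from the paper's exact protocol; `recall_at_k` and the sample ids are hypothetical.

```python
def recall_at_k(retrieved, gold, k):
    """Share of queries whose gold document appears among the first k retrieved ids."""
    if not gold:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1
        for query_id, doc_id in gold.items()
        if doc_id in retrieved.get(query_id, [])[:k]
    )
    return hits / len(gold)


# Hypothetical labeled sample: ranked doc ids per query, and the correct doc id per query.
retrieved = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["x", "y", "z"],
}
gold = {"q1": "a", "q2": "e", "q3": "i", "q4": "w"}

r_at_1 = recall_at_k(retrieved, gold, 1)  # only q1 has its gold doc ranked first
r_at_3 = recall_at_k(retrieved, gold, 3)  # q1, q2 and q3 hit; q4 misses
```

Running the same labeled sample through each embedding and retriever combination gives directly comparable numbers, as long as the query set and gold labels stay fixed.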

Agent Features

Memory

  • retrieval memory (KB articles)
  • short-term chat context passed to LLM

Tool Use

  • ScaNN
  • HNSW KNN
  • Vertex AI embeddings
  • PaLM2 generation

Frameworks

  • Flask API
  • Gunicorn
  • Locust load testing

Architectures

  • retriever+generator RAG

Optimization Features

Token Efficiency

  • pass only top-k retrieved docs and recent chat context
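One way to keep the prompt small is to truncate both inputs before formatting. The sketch below is an assumption-heavy illustration: the `build_prompt` name, the default limits, and the prompt wording are hypothetical and are not taken from the paper.

```python
def build_prompt(chat_turns, ranked_docs, top_k=3, max_turns=4):
    """Assemble a prompt from the top_k retrieved docs and the last max_turns chat turns."""
    docs = ranked_docs[:top_k]
    # Guard against max_turns == 0, since chat_turns[-0:] would return every turn.
    turns = chat_turns[-max_turns:] if max_turns > 0 else []
    doc_block = "\n".join(f"[{i}] {text}" for i, text in enumerate(docs, start=1))
    chat_block = "\n".join(f"{speaker}: {text}" for speaker, text in turns)
    return (
        "Use only the knowledge-base excerpts below to draft the agent's next reply.\n\n"
        f"Knowledge base:\n{doc_block}\n\n"
        f"Recent conversation:\n{chat_block}\n\n"
        "Suggested reply:"
    )


# Hypothetical inputs: docs already ranked by the retriever, turns in chronological order.
ranked_docs = [
    "Refunds post within 5 business days.",
    "Orders ship in 2 days.",
    "Gift cards never expire.",
    "Stores close at 9pm.",
]
chat_turns = [
    ("customer", "Hi"),
    ("agent", "Hello, how can I help?"),
    ("customer", "I returned an item."),
    ("agent", "Thanks, let me check."),
    ("customer", "When will I get my refund?"),
]

prompt = build_prompt(chat_turns, ranked_docs, top_k=2, max_turns=3)
```

Both limits trade context for cost: a lower `top_k` or `max_turns` shortens the prompt but raises the chance that the needed article or an earlier customer detail is cut.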

System Optimization

  • deployed model as API endpoint; load tested with Locust

Inference Optimization

  • use retrieval threshold to skip costly retrieval and LLM calls
  • prefer ScaNN for faster large-scale nearest-neighbor search

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLMs still produce inaccurate answers despite grounding; paper acknowledges residual hallucination risk
  • Does not address prompt injection, multilingual KBs, or KB quality impacts in depth
  • ReAct and multi-step prompts add unacceptable latency for live agent assist

When Not To Use

  • Real-time scenarios that need both sub-second tail latency and multi-step reasoning (ReAct, CoVe)
  • Domains without a reliable and up-to-date company KB
  • Multilingual deployments (not evaluated)

Failure Modes

  • Wrong KB article retrieval leads to hallucinated but fluent answers
  • Retrieval misses (low Recall) produce missing or incorrect answers
  • ReAct or CoVe causes high latency spikes, harming agent UX

Core Entities

Models

  • PaLM2 (text-bison, text-unicorn)
  • SBERT-all-mpnet-base-v2
  • Universal Sentence Encoder (USE)
  • Vertex AI textembedding-gecko@001
  • BERT-based production system
  • ChatGPT-3.5-turbo (evaluator)

Metrics

  • Accuracy
  • Hallucination rate
  • Missing rate
  • Recall@k
  • AlignScore
  • Semantic similarity
  • Latency (95th/99th pct)

Datasets

  • Internal company KB (1,205 docs)
  • Internal contact-center chat transcripts (1,000 chats)
  • MS-MARCO
  • SQuAD
  • TriviaQA

Benchmarks

  • Recall@K
  • AlignScore
  • Semantic similarity (LongFormer embeddings)
  • Human preference A/B