Probabilistic federated RAG that routes across product domains to boost multi-product QA

Overview

Decision Snapshot: Needs Validation

The paper provides a concrete method, datasets, and Azure-based evaluations. Results are consistent across uni- and cross-domain tests, but code and public data release are pending, and exact numeric improvements are shown only in figures.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 2/3

Findings with evidence refs: 3/3

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 60%

Authors

Parshin Shojaee, Sai Sree Harsha, Dan Luo, Akash Maharaj, Tong Yu, Yunyao Li

Why It Matters For Business

If product support queries span multiple products, probabilistic federated retrieval increases correct-document retrieval and improves answer quality without per-product LLM finetuning.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

The paper introduces MKP-QA, a multi-product RAG system that combines a learned domain router, stochastic gating, and a dense bi-encoder retriever to federate search across product domains. The authors also build Adobe-focused uni- and cross-product datasets (AEP, Target, CJA). MKP-QA consistently outperforms baselines in top-1 retrieval accuracy and in LLM-judged relevancy and faithfulness on these datasets, with larger gains for cross-domain queries. Datasets and deployment notes are provided; code and public data release are pending Adobe approval.

Problem Statement

Enterprise product questions often span multiple products and require cross-product knowledge. Existing RAG pipelines either search every domain (slower and more prone to hallucination) or pick a single domain (which can miss cross-product information). There is also no suitable public benchmark for multi-product QA.

Main Contribution

MKP-QA: a probabilistic federated RAG pipeline that combines a learned query-domain router, stochastic gating for exploration-exploitation, and a dense bi-encoder retriever to rank documents across product domains.

A stochastic gating mechanism that samples domains based on router likelihoods and adaptive entropy-based thresholds to reduce selection errors and enable exploration.
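The gating idea above can be illustrated with a minimal sketch. It is not the paper's formula: the threshold rule (`tau_max` scaled by one minus the normalized entropy), the Bernoulli exploration step, and the names `gate_domains` and `tau_max` are all assumptions chosen for illustration. It only shows how router likelihoods and an entropy-dependent threshold could jointly decide which domains to search.

```python
import math
import random


def normalized_entropy(probs):
    """Shannon entropy of `probs`, scaled to [0, 1] by log(number of domains)."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def gate_domains(router_probs, tau_max=0.5, rng=None):
    """Return the set of domains to search for one query.

    router_probs: dict of domain name -> router probability (sums to 1).
    tau_max: hypothetical ceiling for the adaptive threshold.
    """
    rng = rng or random.Random()
    names = list(router_probs)
    probs = [router_probs[n] for n in names]

    # Assumed rule: an uncertain router (high entropy) lowers the threshold,
    # so more domains pass; a confident router keeps the threshold high.
    tau = max(0.0, tau_max * (1.0 - normalized_entropy(probs)))

    active = set()
    for name, p in zip(names, probs):
        if p >= tau:
            active.add(name)   # exploitation: clearly likely domain
        elif rng.random() < p:
            active.add(name)   # exploration: sampled in proportion to p

    # Always keep the router's top choice so the set is never empty.
    active.add(max(names, key=router_probs.get))
    return active


# Example router output for one query (made-up probabilities).
print(gate_domains({"AEP": 0.55, "CJA": 0.40, "Target": 0.05},
                   rng=random.Random(0)))
```

With a confident router most queries search a single domain, while a near-uniform router output activates every domain, which is the exploration-exploitation trade-off the gating is meant to control.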

Key Findings

MKP-QA outperforms baselines on retrieval and response quality.

Practical Use: Use probabilistic federated routing plus dense retrieval when queries may require cross-product knowledge; it raises correct-document retrieval and improves generated answers versus single-index or hard-router methods.

Evidence Ref: Fig. 2, Fig. 3

A large synthetic dataset was created for each product with GPT-4 assistance.

Numbers: SLA pairs: AEP 28,860; CJA 27,820; Target 29,610

Practical Use: You can train and evaluate multi-domain retrievers on tens of thousands of synthetic, SME-vetted query-doc pairs per product.

Evidence Ref: Table 1 (Section 4.4)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
SLA dataset size per product	AEP 28,860; CJA 27,820; Target 29,610 query-doc pairs	—	—	SLA uni-domain	Table 1 (Section 4.4)	Table 1
% positive pairs (SLA)	AEP 17.53%; CJA 18.28%; Target 20.26%	—	—	SLA uni-domain	Table 1 (Section 4.4)	Table 1
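To make the two rows above easier to combine, the short calculation below derives the approximate number of positive pairs per product from the reported totals and percentages. The counts are implied values only: the percentages are rounded in the source, so the results are estimates and not figures taken from the paper.

```python
# (total query-doc pairs, % positive pairs) per product, from the table above.
sla = {
    "AEP": (28_860, 17.53),
    "CJA": (27_820, 18.28),
    "Target": (29_610, 20.26),
}


def approx_positive_pairs(total, pct_positive):
    """Approximate count of positive pairs implied by a rounded percentage."""
    return round(total * pct_positive / 100)


approx_counts = {name: approx_positive_pairs(t, p) for name, (t, p) in sla.items()}
total_pairs = sum(t for t, _ in sla.values())

for name, count in approx_counts.items():
    print(f"{name}: about {count} positive pairs")
print(f"All products: {total_pairs} query-doc pairs")
```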

What To Try In 7 Days

Run a small federated retrieval prototype: train a domain router and a Sentence-BERT retriever on existing product docs, then compare top-1 retrieval accuracy against unified search.

Implement entropy-based adaptive gating to allow low-confidence domains to be sampled and measure cross-product recall lift.

Use GPT-4 (or an internal judge) to cheaply evaluate relevancy and faithfulness on a held-out sample before full deployment.
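The first comparison in the steps above can be set up with a small evaluation harness like the one below. This is a sketch under assumptions: the two-dimensional embeddings, document ids, and the `active_domains` argument (standing in for whatever the router and gating select) are hypothetical, and a real prototype would use Sentence-BERT embeddings and a vector store in place of the in-memory dictionaries.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1(query_vec, docs):
    """docs: dict of doc_id -> embedding. Returns the best doc_id, or None."""
    if not docs:
        return None
    return max(docs, key=lambda doc_id: cosine(query_vec, docs[doc_id]))


def unified_top1(query_vec, domain_indexes):
    """Baseline: search every domain's documents as one merged index."""
    merged = {d: v for docs in domain_indexes.values() for d, v in docs.items()}
    return top1(query_vec, merged)


def federated_top1(query_vec, domain_indexes, active_domains):
    """Search only the domains selected for this query."""
    merged = {d: v for name in active_domains
              for d, v in domain_indexes[name].items()}
    return top1(query_vec, merged)


def top1_accuracy(predictions, gold):
    """Fraction of queries whose top-ranked document is the gold document."""
    hits = sum(1 for p, g in zip(predictions, gold) if p == g)
    return hits / len(gold)


# Toy data: hypothetical embeddings for two product domains.
domain_indexes = {
    "AEP": {"aep-1": [1.0, 0.0], "aep-2": [0.8, 0.6]},
    "Target": {"tgt-1": [0.0, 1.0]},
}
query = [0.9, 0.1]
print(unified_top1(query, domain_indexes))
print(federated_top1(query, domain_indexes, {"Target"}))
```

Running both functions over the same held-out queries and passing the predictions to `top1_accuracy` gives the two numbers to compare. The second call also shows the routing risk: if the relevant domain is not active, the correct document cannot be retrieved.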

Agent Features

Tool Use

Uses GPT-4/GPT-3.5 for query generation and evaluation

Optimization Features

Infra Optimization

LoRA

System Optimization

Federated domain selection reduces the number of domains searched per query

Training Optimization

Contrastive fine-tuning of bi-encoder with symmetric InfoNCE
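For readers unfamiliar with the objective, the sketch below computes a symmetric InfoNCE loss in its commonly used form: a cross-entropy over query-to-document similarities averaged with the same cross-entropy over document-to-query similarities, with query i paired to document i in the batch. The temperature value, the assumption of L2-normalized embeddings, and the function names are illustrative choices; the paper's exact settings are not stated in this summary.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def _mean_nll_diag(scores):
    """Mean negative log-softmax of the diagonal entry of each row."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def symmetric_info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Symmetric InfoNCE over a batch where query i matches document i.

    Assumes embeddings are already L2-normalized; `temperature` is a
    hypothetical default, not a value taken from the paper.
    """
    sims = [[dot(q, d) / temperature for d in doc_vecs] for q in query_vecs]
    sims_t = [list(col) for col in zip(*sims)]  # document-to-query view
    return 0.5 * (_mean_nll_diag(sims) + _mean_nll_diag(sims_t))


# Toy batch: matched pairs give a much lower loss than mismatched pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0]]
print(symmetric_info_nce(queries, docs))
print(symmetric_info_nce(queries, docs[::-1]))
```

In training, the other documents in the batch act as negatives for each query, and the symmetric term does the same for each document against the other queries.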

Inference Optimization

Offline document embedding and vector DB for fast retrieval

Planned: parallel domain routing and caching (deployment)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Dataset and code release are pending Adobe approval, so exact replication is currently limited.

Performance depends on the quality of the domain router; misclassification can still remove needed domains despite stochastic gating.

When Not To Use

If you cannot afford a vector DB or offline embedding infrastructure for retrieval at scale.

If queries are strictly single-domain and a simple index yields sufficient accuracy.

Failure Modes

The router assigns near-zero probability to a relevant domain and gating fails to sample it, causing missed evidence.

Too many active domains (low threshold) increase latency and may introduce irrelevant context that hurts LLM faithfulness.

Core Entities

Models

BERT variant (domain router)

Sentence-BERT bi-encoder (retriever)

GPT-3.5-turbo-1106 (generation/eval)

GPT-4-0314 (generation/eval)

GPT-4 (query generation and annotation assistance)

Metrics

Accuracy

Relevancy (LLM judged)

Faithfulness (RAGAS2 + GPT-4 judged)

Datasets

Adobe Experience Platform (AEP) multi-product dataset

Adobe Target multi-product dataset

Adobe Customer Journey Analytics (CJA) multi-product dataset

Cross-domain combinations: AEP+CJA, AEP+Target, CJA+Target

Benchmarks

Adobe multi-product uni-domain and cross-domain RAG datasets (new, pending release)
