Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
A weighted RAG pipeline with a self-evaluation step can reduce misdiagnoses and speed up resolution on large enterprise knowledge bases, which improves service SLAs and cuts human time-to-fix.
Summary TLDR
This paper presents a practical Retrieval-Augmented Generation (RAG) system that assigns context-dependent weights to multiple enterprise data sources (product manuals, FAQs, guides, internal KBs), uses FAISS + all-MiniLM-L6-v2 for dense search, and validates outputs with a LLaMA-based self-evaluator. On the authors' enterprise dataset the full pipeline reaches 90.8% accuracy and 0.89 relevance versus 85.2%/0.75 for a standard (equal-weight) RAG and 76.1%/0.61 for keyword search. The design focuses on modular source weighting, threshold filtering to reduce hallucination, and a final self-check step; it is intended for single-agent troubleshooting services rather than multi-agent workflows.
Problem Statement
Enterprise troubleshooting needs fast, accurate answers drawn from many scattered sources. Keyword search misses context and can overlook relevant manuals; static RAG treats all sources equally. As a result, fixes are slower and less precise. The paper proposes a dynamically weighted RAG that prioritizes sources by query context and validates outputs to reduce hallucinations.
Main Contribution
A dynamic weighting mechanism that adjusts retrieval importance per data source based on query context (e.g., boost manuals for SKU queries).
A threshold-based filtering and multi-index aggregation pipeline over FAISS indices to reduce weak matches before generation.
Integration of a LLaMA-3.1(70B) self-evaluator to score and suppress low-confidence generated responses.
An end-to-end system design (preprocessing, weighted retrieval, generation, validation) and experiments on a large internal troubleshooting corpus.
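The weighting and filtering contributions above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Hit` record, the `source_weights` rule, the 1.5 boost, and the source names are all hypothetical, and the similarity scores are assumed to come from a per-source nearest-neighbor search normalized so that higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str   # name of the index the chunk came from
    doc_id: str   # identifier of the retrieved chunk
    score: float  # similarity score, higher is better (assumed normalized)


def source_weights(query):
    """Hypothetical rule: boost manuals when the query mentions a SKU."""
    weights = {"manuals": 1.0, "faqs": 1.0, "guides": 1.0, "kb": 1.0}
    if "sku" in query.lower():
        weights["manuals"] = 1.5  # illustrative value, not from the paper
    return weights


def aggregate(query, hits_by_source, thresholds, top_k=3):
    """Drop weak matches per source, weight the rest, and keep the top_k."""
    weights = source_weights(query)
    merged = []
    for source, hits in hits_by_source.items():
        cutoff = thresholds.get(source, 0.0)
        for hit in hits:
            if hit.score < cutoff:
                continue  # threshold filter runs before weighting
            merged.append((hit.score * weights.get(source, 1.0), hit))
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return merged[:top_k]
```

Filtering on the raw score before applying the weight keeps a boosted source from pushing a weak match past its cutoff; whether the paper orders the two steps this way is not stated in this summary.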
Key Findings
Weighted RAG plus self-evaluation achieves higher troubleshooting accuracy than baselines
LLaMA-based self-evaluator improves correctness over standard RAG
Approach works at enterprise scale across multiple source types
Results
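Accuracy: 90.8% (weighted RAG with self-evaluation) vs. 85.2% (standard equal-weight RAG) vs. 76.1% (keyword search)
Relevance Score: 0.89 (weighted RAG with self-evaluation) vs. 0.75 (standard equal-weight RAG) vs. 0.61 (keyword search)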
Who Should Care
What To Try In 7 Days
Index your manuals, FAQs, and KBs into separate FAISS indices.
Prototype rule-based source weights (e.g., boost manuals for SKU queries).
Add per-index threshold filtering to drop weak matches before generation. Tune the thresholds empirically on a labeled sample.
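One way to tune a per-index threshold on a labeled sample is a small grid search. The sketch below assumes you have (score, is_relevant) pairs for one index and picks the cutoff with the best F1; the F1 objective, the function name `tune_threshold`, and the candidate grid are illustrative choices, not something the paper prescribes.

```python
def tune_threshold(labeled, candidates):
    """Return (cutoff, f1) for the candidate with the best F1.

    labeled: list of (score, is_relevant) pairs from one index.
    candidates: non-empty list of cutoffs to try.
    """
    best_cutoff, best_f1 = candidates[0], -1.0
    for cutoff in candidates:
        # A hit is "kept" when its score reaches the cutoff.
        tp = sum(1 for s, rel in labeled if s >= cutoff and rel)
        fp = sum(1 for s, rel in labeled if s >= cutoff and not rel)
        fn = sum(1 for s, rel in labeled if s < cutoff and rel)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # ties keep the earlier candidate
            best_cutoff, best_f1 = cutoff, f1
    return best_cutoff, best_f1
```

Run it once per index, since manuals and FAQs will likely have different score distributions.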
Agent Features
Memory
- retrieval memory via indexed embeddings
Planning
- iterative retrieval and validation loop
Tool Use
- FAISS for nearest-neighbor search
- LLaMA for generation and self-eval
Frameworks
- Weighted RAG
- Facade pattern for data sources
Is Agentic
true
Architectures
- single-agent retrieval-generation-evaluation pipeline
Optimization Features
Token Efficiency
- chunking and top-K filtering to reduce generator input
Infra Optimization
- GPU-based FAISS and large-model inference (A100 GPUs used)
System Optimization
- index-per-source design for selective thresholds
Training Optimization
- RL
Inference Optimization
- parallel FAISS index searches
- top-K selection to limit generator context
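A minimal sketch of the two inference optimizations above, assuming each index is exposed as a callable `search(query, k)` that returns `(score, doc_id)` pairs. The `make_index` helper is a hypothetical in-memory stand-in for a real FAISS index so the example runs without extra dependencies.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def make_index(rows):
    """Stand-in for a real index: returns the k best (score, doc_id) rows."""
    return lambda query, k: sorted(rows, reverse=True)[:k]


def search_all(indices, query, per_index_k=5, top_k=3):
    """Query every index in parallel, then keep the global top_k hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        futures = {
            name: pool.submit(search, query, per_index_k)
            for name, search in indices.items()
        }
        results = [
            (score, name, doc_id)
            for name, future in futures.items()
            for score, doc_id in future.result()
        ]
    # Limiting to top_k bounds the context passed to the generator.
    return heapq.nlargest(top_k, results)
```

Threads are a reasonable fit here only if the underlying search releases the GIL or is I/O-bound; that is an assumption to verify against your index backend.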
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Dataset appears proprietary; results may not generalize to other enterprises.
- Self-evaluation and generation use a 70B LLaMA model; cost and latency are high for many deployments.
- Weighting strategy is rule-based in experiments rather than learned from feedback.
- Paper focuses on single-turn queries; multi-turn conversational troubleshooting is future work.
When Not To Use
- If you lack GPU capacity for large LLaMA inference and FAISS at scale.
- When strict data locality or privacy rules forbid moving sensitive KBs into shared embeddings.
- If you need a certified deterministic decision process rather than a validated natural-language answer.
Failure Modes
- Over-weighting one source can bias answers toward that source even if it's outdated.
- Poor threshold settings can filter out the correct document or allow weak matches, harming accuracy.
- The pipeline can fail silently if the LLaMA model hallucinates and the self-eval threshold is set too low to catch it.
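The self-evaluation failure mode comes down to where the confidence cutoff sits. The sketch below shows a gate of that shape; `generate` and `evaluate` are hypothetical callables standing in for the generator and the LLaMA-based evaluator, and the 0.7 default is an illustrative value rather than one reported by the authors.

```python
def answer_with_self_check(query, context, generate, evaluate, min_confidence=0.7):
    """Return (answer, confidence); answer is None when the draft is suppressed.

    generate(query, context) -> str
    evaluate(query, context, draft) -> float in [0, 1]
    Both callables are placeholders for model calls.
    """
    draft = generate(query, context)
    confidence = evaluate(query, context, draft)
    if confidence < min_confidence:
        return None, confidence  # suppress low-confidence drafts
    return draft, confidence
```

Returning the confidence alongside a suppressed answer lets callers log near-misses, which makes a too-low cutoff easier to spot.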
Core Entities
Models
- all-MiniLM-L6-v2
- LLaMA-3.1(70B)
Metrics
- Accuracy
- Relevance Score
Datasets
- Product manuals (1,200)
- FAQs (40,000)
- Troubleshooting guides
- Internal knowledge bases
Context Entities
Models
- Sentence embedding model (all-MiniLM-L6-v2)
- Generative LLaMA for response
- LLaMA self-evaluator
Metrics
- Top-K retrieval
- Threshold-based filtering
Datasets
- Enterprise troubleshooting corpus built by authors

