Weighted RAG plus LLaMA self-evaluation to speed and improve enterprise troubleshooting

December 16, 2024, 7 min

Overview

Decision Snapshot: Needs Validation

Scores reflect a practical, engineered system with promising internal results but limited public code or dataset release to fully validate across enterprises.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Rajat Khanda

Why It Matters For Business

A weighted RAG pipeline with a self-evaluation step can cut misdiagnoses and speed up resolution on large enterprise knowledge bases, which improves service SLAs and reduces human time-to-fix.

Summary TLDR

This paper presents a practical Retrieval-Augmented Generation (RAG) system that assigns context-dependent weights to multiple enterprise data sources (product manuals, FAQs, guides, internal KBs), uses FAISS + all-MiniLM-L6-v2 for dense search, and validates outputs with a LLaMA-based self-evaluator. On the authors' enterprise dataset the full pipeline reaches 90.8% accuracy and 0.89 relevance versus 85.2%/0.75 for a standard (equal-weight) RAG and 76.1%/0.61 for keyword search. The design focuses on modular source weighting, threshold filtering to reduce hallucination, and a final self-check step; it is intended for single-agent troubleshooting services rather than multi-agent workflows.

Problem Statement

Enterprise troubleshooting needs fast, accurate answers from many scattered sources. Keyword search misses context and manuals; static RAG treats all sources equally. The result is slower, less precise fixes. The paper proposes a dynamically weighted RAG that prioritizes sources by query context and validates outputs to reduce hallucinations.

Main Contribution

A dynamic weighting mechanism that adjusts retrieval importance per data source based on query context (e.g., boost manuals for SKU queries).

A threshold-based filtering and multi-index aggregation pipeline over FAISS indices to reduce weak matches before generation.
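
The two contributions above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the SKU pattern, the source names, the 1.5 boost, and the 0.5 thresholds are all hypothetical values, and the per-source hits are hard-coded where a real system would take them from FAISS searches.

```python
import re
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, similarity score in [0, 1])

# Hypothetical pattern for SKU-like tokens such as "AB-1234"; adjust to your catalog.
SKU_PATTERN = re.compile(r"\b[A-Z]{2,}-\d{3,}\b")


def source_weights(query: str) -> Dict[str, float]:
    """Return per-source weights for a query using simple hand-written rules."""
    weights = {"manuals": 1.0, "faqs": 1.0, "guides": 1.0, "kb": 1.0}
    if SKU_PATTERN.search(query):
        weights["manuals"] = 1.5  # illustrative boost for SKU queries
    return weights


def aggregate(
    hits_by_source: Dict[str, List[Hit]],
    weights: Dict[str, float],
    thresholds: Dict[str, float],
    top_k: int = 3,
) -> List[Tuple[str, str, float]]:
    """Drop weak matches per source, weight the rest, and return the top_k overall."""
    merged = []
    for source, hits in hits_by_source.items():
        for doc_id, score in hits:
            if score < thresholds.get(source, 0.0):
                continue  # below this source's threshold: never reaches the generator
            merged.append((source, doc_id, score * weights.get(source, 1.0)))
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged[:top_k]


# Hard-coded hits standing in for the results of per-source index searches.
hits = {
    "manuals": [("manual-17", 0.62), ("manual-03", 0.41)],
    "faqs": [("faq-220", 0.70), ("faq-981", 0.35)],
}
thresholds = {"manuals": 0.5, "faqs": 0.5}

ranked = aggregate(hits, source_weights("Error E12 on AB-1234"), thresholds)
print([doc_id for _, doc_id, _ in ranked])
```

With the SKU in the query, the manual hit outranks the FAQ hit even though its raw similarity is lower; without the SKU, the order flips. Filtering on the raw score before weighting keeps the threshold meaning independent of the boost, which is one possible design choice among several.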

Key Findings

Weighted RAG plus self-evaluation achieves higher troubleshooting accuracy than baselines

Numbers: Accuracy 90.8% (proposed) vs 85.2% (standard RAG) vs 76.1% (BM25)

Practical Use: If you add source weighting and a self-check, expect materially fewer incorrect answers on similar enterprise corpora.

Evidence Ref: Table 1, Sec 5.1

LLaMA-based self-evaluator improves correctness over standard RAG

Numbers: Accuracy +5.6 percentage points versus standard RAG (90.8% vs 85.2%)

Practical Use: Add a lightweight generative self-eval step to your pipeline to cut error rate without changing retrieval models.

Evidence Ref: Sec 5.3
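
A self-check step of this kind can be wired in as a gate around generation. The sketch below is an assumption-heavy illustration: this summary does not describe the paper's evaluator prompt or scoring, so `generate` and `evaluate` are hypothetical callables, and the toy versions here only stand in for LLaMA calls so the example runs on its own.

```python
from typing import Callable, List, Optional


def answer_with_self_check(
    query: str,
    contexts: List[str],
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, List[str], str], float],
    min_score: float = 0.7,   # hypothetical acceptance cutoff
    max_attempts: int = 2,    # hypothetical retry budget
) -> Optional[str]:
    """Generate an answer and keep it only if the evaluator's score clears min_score."""
    for _ in range(max_attempts):
        draft = generate(query, contexts)
        if evaluate(query, contexts, draft) >= min_score:
            return draft
    return None  # caller can escalate to a human or fall back to raw search results


def toy_generate(query: str, contexts: List[str]) -> str:
    """Stand-in for an LLM call: echo the best retrieved passage."""
    return contexts[0] if contexts else ""


def toy_evaluate(query: str, contexts: List[str], draft: str) -> float:
    """Stand-in for an LLM judge: fraction of query words that appear in the draft."""
    words = set(query.lower().split())
    if not words:
        return 0.0
    return len(words & set(draft.lower().split())) / len(words)


good = answer_with_self_check(
    "reset router password",
    ["To reset the router password hold the button for ten seconds"],
    toy_generate,
    toy_evaluate,
)
bad = answer_with_self_check(
    "reset router password",
    ["Unrelated billing policy"],
    toy_generate,
    toy_evaluate,
)
print(good)
print(bad)
```

Returning `None` instead of a low-scoring draft makes the failure explicit, so the calling service decides whether to retry with different retrieval or hand off to a person.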

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 90.8% | 85.2% (Standard RAG) | +5.6 pp | Author enterprise troubleshooting dataset | Table 1; Sec 5.1 | Table 1 |
| Relevance Score | 0.89 | 0.75 (Standard RAG) | +0.14 | Author enterprise troubleshooting dataset | Table 1; Sec 5.1 | Table 1 |
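
The deltas in the results are absolute differences, so the accuracy gain is in percentage points rather than a relative percentage. The short calculation below reproduces them from the reported values and also derives the relative reduction in error rate, which is not reported in the source and is computed here only as an illustration.

```python
# Reported values (Table 1 of the paper, as quoted in this summary).
proposed_acc, baseline_acc = 90.8, 85.2   # accuracy, percent
proposed_rel, baseline_rel = 0.89, 0.75   # relevance score

# Absolute deltas, as listed in the results.
acc_delta_pp = round(proposed_acc - baseline_acc, 1)
rel_delta = round(proposed_rel - baseline_rel, 2)

# Derived figure: error rate is 100 - accuracy, so compare the two error rates.
baseline_err = 100 - baseline_acc
proposed_err = 100 - proposed_acc
relative_error_reduction = round((baseline_err - proposed_err) / baseline_err * 100, 1)

print(f"Accuracy delta: {acc_delta_pp} percentage points")
print(f"Relevance delta: {rel_delta}")
print(f"Relative error reduction: {relative_error_reduction}%")
```

This gives 5.6 percentage points, 0.14, and a 37.8% relative reduction in errors (from 14.8% wrong to 9.2% wrong) on the authors' dataset.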

What To Try In 7 Days

Index your manuals, FAQs, and KBs into separate FAISS indices.

Prototype rule-based source weights (e.g., boost manuals for SKU queries).

Add per-index threshold filtering to drop weak matches before generation. Fine-tune thresholds empirically on a labeled sample.
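
For the last step, one simple way to pick a per-index threshold from a labeled sample is a small grid search. This is a sketch under assumptions: the labeled pairs, the candidate grid, and the choice of F1 as the selection metric are all illustrative and do not come from the paper.

```python
from typing import List, Tuple

Labeled = List[Tuple[float, bool]]  # (similarity score, is the document actually relevant?)


def f1_at(threshold: float, labeled: Labeled) -> float:
    """F1 of the rule 'keep a match if score >= threshold' on a labeled sample."""
    tp = sum(1 for score, relevant in labeled if score >= threshold and relevant)
    fp = sum(1 for score, relevant in labeled if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in labeled if score < threshold and relevant)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labeled: Labeled, candidates: List[float]) -> float:
    """Return the candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at(t, labeled))


# Hypothetical labeled sample for a single index, e.g. the manuals index.
manual_sample = [
    (0.82, True), (0.74, True), (0.61, True), (0.58, False),
    (0.44, False), (0.39, True), (0.30, False),
]
chosen = best_threshold(manual_sample, [0.3, 0.4, 0.5, 0.6, 0.7])
print(chosen)
```

On this toy sample the search picks 0.6, which keeps the three strongest relevant matches and drops every irrelevant one. Running the same search separately per index is what makes an index-per-source layout useful, since each source can end up with a different cutoff.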

Agent Features

Memory: retrieval memory via indexed embeddings
Planning: iterative retrieval and validation loop
Tool Use: FAISS for nearest-neighbor search; LLaMA for generation and self-eval
Frameworks: Weighted RAG; Facade pattern for data sources
Is Agentic: Yes
Architectures: single-agent retrieval-generation-evaluation pipeline

Optimization Features

Token Efficiency: chunking and top-K filtering to reduce generator input
Infra Optimization: GPU-based FAISS and large-model inference (A100 GPUs used)
System Optimization: index-per-source design for selective thresholds
Training Optimization: RL
Inference Optimization: parallel FAISS index searches; top-K selection to limit generator context
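
The parallel per-source search with top-K selection can be illustrated without FAISS installed. In the sketch below, `TinyIndex` is a hypothetical brute-force stand-in for one FAISS index per source, and the two-dimensional vectors are made up; only the fan-out pattern is the point.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyIndex:
    """Brute-force similarity index; a stand-in for one FAISS index per source."""

    def __init__(self, items: Dict[str, Vector]):
        self.items = items

    def search(self, query: Vector, k: int) -> List[Tuple[str, float]]:
        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self.items.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]  # top-K keeps the generator context small


def search_all(
    indices: Dict[str, TinyIndex], query: Vector, k: int = 2
) -> Dict[str, List[Tuple[str, float]]]:
    """Query every source index concurrently and collect results by source name."""
    with ThreadPoolExecutor(max_workers=max(1, len(indices))) as pool:
        futures = {
            name: pool.submit(index.search, query, k) for name, index in indices.items()
        }
        return {name: future.result() for name, future in futures.items()}


indices = {
    "manuals": TinyIndex({"manual-17": [1.0, 0.0], "manual-03": [0.0, 1.0]}),
    "faqs": TinyIndex({"faq-220": [0.7, 0.7]}),
}
results = search_all(indices, [1.0, 0.1], k=1)
print(results)
```

Threads help mainly when each search releases the interpreter lock or waits on I/O, as a native index library or a remote search service would; for the pure-Python stand-in here the benefit is only structural.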

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Dataset appears proprietary; results may not generalize to other enterprises.

Self-evaluation and generation use a 70B LLaMA model; cost and latency are high for many deployments.

When Not To Use

If you lack GPU capacity for large LLaMA inference and FAISS at scale.

When strict data locality or privacy rules forbid moving sensitive KBs into shared embeddings.

Failure Modes

Over-weighting one source can bias answers toward that source even if it's outdated.

Poor threshold settings can filter out the correct document or allow weak matches, harming accuracy.

Core Entities

Models

all-MiniLM-L6-v2, LLaMA-3.1 (70B)

Metrics

Accuracy, Relevance Score

Datasets

Product manuals (1,200), FAQs (40,000), Troubleshooting guides, Internal knowledge bases

Context Entities

Models

Sentence embedding model (all-MiniLM-L6-v2), Generative LLaMA for response, LLaMA self-evaluator

Metrics

Top-K retrieval, Threshold-based filtering

Datasets

Enterprise troubleshooting corpus built by authors