Weighted RAG plus LLaMA self-evaluation to speed and improve enterprise troubleshooting

Overview

Decision Snapshot: Needs Validation

Scores reflect a practical, engineered system with promising internal results but limited public code or dataset release to fully validate across enterprises.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Rajat Khanda

Links

Abstract / PDF

Why It Matters For Business

A weighted RAG plus self-evaluation can cut misdiagnoses and speed resolution on large enterprise knowledge bases, improving service SLAs and reducing human time-to-fix.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper presents a practical Retrieval-Augmented Generation (RAG) system that assigns context-dependent weights to multiple enterprise data sources (product manuals, FAQs, guides, internal KBs), uses FAISS + all-MiniLM-L6-v2 for dense search, and validates outputs with a LLaMA-based self-evaluator. On the authors' enterprise dataset, the full pipeline reaches 90.8% accuracy and a 0.89 relevance score, versus 85.2% and 0.75 for a standard (equal-weight) RAG and 76.1% and 0.61 for BM25 keyword search. The design focuses on modular source weighting, threshold filtering to reduce hallucination, and a final self-check step; it is intended for single-agent troubleshooting services rather than multi-agent workflows.

Problem Statement

Enterprise troubleshooting needs fast, accurate answers from many scattered sources. Keyword search misses context and manuals; static RAG treats all sources equally. The result is slower, less precise fixes. The paper proposes a dynamically weighted RAG that prioritizes sources by query context and validates outputs to reduce hallucinations.

Main Contribution

A dynamic weighting mechanism that adjusts retrieval importance per data source based on query context (e.g., boost manuals for SKU queries).

A threshold-based filtering and multi-index aggregation pipeline over FAISS indices to reduce weak matches before generation.
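The two contributions above can be sketched together as follows. This is a minimal illustration, assuming rule-based weights and similarity scores in [0, 1]; the source names, the SKU regex, the weight of 1.5, and the thresholds are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical SKU pattern and weights, chosen only for illustration.
SKU_PATTERN = re.compile(r"\b[A-Z]{2,4}-\d{3,6}\b")
BASE_WEIGHTS = {"manuals": 1.0, "faqs": 1.0, "guides": 1.0, "internal_kb": 1.0}


@dataclass
class Hit:
    source: str
    doc_id: str
    score: float  # similarity in [0, 1], higher is better


def source_weights(query: str) -> dict[str, float]:
    """Return per-source weights; boost manuals when the query names a SKU."""
    weights = dict(BASE_WEIGHTS)
    if SKU_PATTERN.search(query):
        weights["manuals"] = 1.5
    return weights


def aggregate(query, hits_by_source, thresholds, top_k=3):
    """Drop weak matches per source, apply weights, and keep the top_k hits."""
    weights = source_weights(query)
    merged = []
    for source, hits in hits_by_source.items():
        for hit in hits:
            if hit.score < thresholds[source]:
                continue  # filtered out before it can reach the generator
            merged.append((hit.score * weights[source], hit))
    merged.sort(key=lambda pair: pair[0], reverse=True)
    return merged[:top_k]
```

In this sketch a query that mentions a SKU ranks a manual hit at 0.70 above an FAQ hit at 0.80, while the same hits are ordered by raw similarity for a query with no SKU.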

Key Findings

Weighted RAG plus self-evaluation achieves higher troubleshooting accuracy than baselines

Numbers: Accuracy 90.8% (proposed) vs 85.2% (standard RAG) vs 76.1% (BM25)

Practical Use: If you add source weighting and a self-check, expect materially fewer incorrect answers on similar enterprise corpora.

Evidence Ref: Table 1, Sec 5.1

LLaMA-based self-evaluator improves correctness over standard RAG

Numbers: Accuracy +5.6 percentage points versus standard RAG

Practical Use: Add a lightweight generative self-eval step to your pipeline to cut error rate without changing retrieval models.

Evidence Ref: Sec 5.3
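A self-evaluation step of this kind could be wired in roughly as below. This is a sketch under stated assumptions: `generate` and `evaluate` are hypothetical callables standing in for the LLaMA generation and self-evaluation calls, and the acceptance score of 0.7 and the retry budget of 2 are illustrative, not values reported in the paper.

```python
from typing import Callable, Optional


def answer_with_self_check(
    query: str,
    context: list[str],
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str, list[str], str], float],
    min_score: float = 0.7,  # assumed acceptance threshold
    max_attempts: int = 2,   # assumed retry budget
) -> tuple[Optional[str], float]:
    """Generate an answer, score it with the evaluator, retry if it scores low.

    Returns (answer, score). The answer is None when no attempt reaches
    min_score, so the caller can escalate instead of returning a weak answer.
    """
    best_score = 0.0
    for _ in range(max_attempts):
        draft = generate(query, context)
        score = evaluate(query, context, draft)
        if score >= min_score:
            return draft, score
        best_score = max(best_score, score)
    return None, best_score
```

Returning `None` rather than the best weak draft is a design choice in this sketch; it keeps low-confidence answers away from users at the cost of more escalations.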

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	90.8%	85.2% (Standard RAG)	+5.6 pp	Author enterprise troubleshooting dataset	Table 1; Sec 5.1	Table 1
Relevance Score	0.89	0.75 (Standard RAG)	+0.14	Author enterprise troubleshooting dataset	Table 1; Sec 5.1	Table 1
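The deltas in the table are absolute differences. A short calculation, using only the figures reported above, shows what the accuracy gain implies for the error rate:

```python
proposed_acc, baseline_acc = 90.8, 85.2  # percent, from the table
proposed_rel, baseline_rel = 0.89, 0.75  # relevance scores, from the table

acc_delta_pp = round(proposed_acc - baseline_acc, 1)  # percentage points
rel_delta = round(proposed_rel - baseline_rel, 2)

# Error rate = 100 - accuracy; the relative reduction compares the two error rates.
baseline_err = 100 - baseline_acc
proposed_err = 100 - proposed_acc
relative_error_reduction = round(
    100 * (baseline_err - proposed_err) / baseline_err, 1
)

print(acc_delta_pp, rel_delta, relative_error_reduction)
```

The error rate falls from 14.8% to 9.2%, a relative reduction of roughly 38%, which is a more useful figure than the raw 5.6-point gain when estimating how many incorrect answers are avoided.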

What To Try In 7 Days

Index your manuals, FAQs, and KBs into separate FAISS indices.

Prototype rule-based source weights (e.g., boost manuals for SKU queries).

Add per-index threshold filtering to drop weak matches before generation. Tune the thresholds empirically on a labeled sample.
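Threshold tuning on a labeled sample can be as simple as a sweep over candidate values. The sketch below assumes you have (similarity score, is_relevant) pairs for one index and picks the threshold with the best F1; using F1 as the objective is an assumption here, and a deployment that fears hallucination more than missed answers might weight precision more heavily.

```python
def tune_threshold(labeled, candidates):
    """Pick the threshold with the best F1 on (score, is_relevant) pairs."""
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        tp = sum(1 for score, relevant in labeled if score >= threshold and relevant)
        fp = sum(1 for score, relevant in labeled if score >= threshold and not relevant)
        fn = sum(1 for score, relevant in labeled if score < threshold and relevant)
        if tp == 0:
            continue  # nothing relevant is kept at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

Run the sweep once per index, since similarity scores for short FAQ entries and long manual chunks are unlikely to be distributed the same way.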

Agent Features

Memory

retrieval memory via indexed embeddings

Planning

iterative retrieval and validation loop

Tool Use

FAISS for nearest-neighbor search

LLaMA for generation and self-eval

Frameworks

Weighted RAG

Facade pattern for data sources
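A facade over the data sources might look like the sketch below. All class and method names are hypothetical, and `StaticSource` is a toy stand-in that returns canned results where a real system would query a FAISS index.

```python
from typing import Protocol


class Source(Protocol):
    name: str

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        """Return (doc_id, similarity) pairs for the query."""
        ...


class StaticSource:
    """Toy stand-in for an index-backed source; returns canned results."""

    def __init__(self, name: str, results: list[tuple[str, float]]):
        self.name = name
        self._results = results

    def search(self, query: str, k: int) -> list[tuple[str, float]]:
        return self._results[:k]


class KnowledgeFacade:
    """Single entry point that hides how many sources sit behind it."""

    def __init__(self, sources: list[Source]):
        self._sources = {source.name: source for source in sources}

    def search_all(self, query: str, k: int = 3) -> dict[str, list[tuple[str, float]]]:
        return {name: src.search(query, k) for name, src in self._sources.items()}
```

The benefit is that adding or removing a source changes only the list passed to the facade, not the weighting or generation code that consumes its output.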

Is Agentic

Yes

Architectures

single-agent retrieval-generation-evaluation pipeline

Optimization Features

Token Efficiency

chunking and top-K filtering to reduce generator input

Infra Optimization

GPU-based FAISS and large-model inference (A100 GPUs used)

System Optimization

index-per-source design for selective thresholds

Training Optimization

Inference Optimization

parallel FAISS index searches

top-K selection to limit generator context
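Parallel search with a global top-K cut can be sketched as below. The `search_fns` mapping is a hypothetical interface in which each callable wraps one index; threads are used on the assumption that the underlying search call releases the GIL or is I/O bound.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def parallel_top_k(search_fns, query, k=3):
    """Query every source concurrently and keep the k best hits overall.

    search_fns maps a source name to a callable(query) -> [(doc_id, score)].
    Returns (score, source_name, doc_id) tuples, best first.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(search_fns))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in search_fns.items()}
        results = {name: future.result() for name, future in futures.items()}

    all_hits = [
        (score, name, doc_id)
        for name, hits in results.items()
        for doc_id, score in hits
    ]
    return heapq.nlargest(k, all_hits)
```

Capping the merged list at `k` bounds the context passed to the generator regardless of how many indices are queried.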

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Dataset appears proprietary; results may not generalize to other enterprises.

Self-evaluation and generation use a 70B LLaMA model; cost and latency are high for many deployments.

When Not To Use

If you lack GPU capacity for large LLaMA inference and FAISS at scale.

When strict data locality or privacy rules forbid moving sensitive KBs into shared embeddings.

Failure Modes

Over-weighting one source can bias answers toward that source even if it's outdated.

Poor threshold settings can filter out the correct document or allow weak matches, harming accuracy.

Core Entities

Models

all-MiniLM-L6-v2

LLaMA-3.1 (70B)

Metrics

Accuracy

Relevance Score

Datasets

Product manuals (1,200)

FAQs (40,000)

Troubleshooting guides

Internal knowledge bases

Context Entities

Models

Sentence embedding model (all-MiniLM-L6-v2)

Generative LLaMA for response

LLaMA self-evaluator

Metrics

Top-K retrieval

Threshold-based filtering

Datasets

Enterprise troubleshooting corpus built by authors

