Weighted RAG plus LLaMA self-evaluation to speed and improve enterprise troubleshooting

December 16, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

1

Authors

Rajat Khanda

Why It Matters For Business

A weighted RAG pipeline with a self-evaluation step can cut misdiagnoses and speed resolution on large enterprise knowledge bases, which helps meet service SLAs and reduces human time-to-fix.

Summary TLDR

This paper presents a practical Retrieval-Augmented Generation (RAG) system that assigns context-dependent weights to multiple enterprise data sources (product manuals, FAQs, guides, internal KBs), uses FAISS + all-MiniLM-L6-v2 for dense search, and validates outputs with a LLaMA-based self-evaluator. On the authors' enterprise dataset the full pipeline reaches 90.8% accuracy and 0.89 relevance versus 85.2%/0.75 for a standard (equal-weight) RAG and 76.1%/0.61 for keyword search. The design focuses on modular source weighting, threshold filtering to reduce hallucination, and a final self-check step; it is intended for single-agent troubleshooting services rather than multi-agent workflows.

Problem Statement

Enterprise troubleshooting needs fast, accurate answers from many scattered sources. Keyword search misses context and manuals; static RAG treats all sources equally. The result is slower, less precise fixes. The paper proposes a dynamically weighted RAG that prioritizes sources by query context and validates outputs to reduce hallucinations.

Main Contribution

A dynamic weighting mechanism that adjusts retrieval importance per data source based on query context (e.g., boost manuals for SKU queries).

A threshold-based filtering and multi-index aggregation pipeline over FAISS indices to reduce weak matches before generation.

Integration of a LLaMA-3.1(70B) self-evaluator to score and suppress low-confidence generated responses.

An end-to-end system design (preprocessing, weighted retrieval, generation, validation) and experiments on a large internal troubleshooting corpus.
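The weighting, filtering, and aggregation steps listed above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Hit` class, the `aggregate_hits` function, and the example weights and thresholds are all hypothetical, and the scores stand in for similarity values that would come from the per-source FAISS searches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    source: str   # which index the hit came from, e.g. "manuals"
    doc_id: str
    score: float  # similarity score; higher is assumed to mean closer


def aggregate_hits(hits_by_source, weights, thresholds, top_k=3):
    """Drop weak hits per source, scale the rest by source weight, keep top_k."""
    weighted = []
    for source, hits in hits_by_source.items():
        weight = weights.get(source, 1.0)
        threshold = thresholds.get(source, 0.0)
        for hit in hits:
            if hit.score < threshold:
                continue  # weak match: never reaches the generator
            weighted.append((hit.score * weight, hit))
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return weighted[:top_k]


# Illustrative values only (assumptions, not figures from the paper).
hits_by_source = {
    "manuals": [Hit("manuals", "m-17", 0.62), Hit("manuals", "m-03", 0.41)],
    "faqs": [Hit("faqs", "f-88", 0.70), Hit("faqs", "f-12", 0.35)],
}
weights = {"manuals": 1.5, "faqs": 1.0}      # e.g. a SKU query boosts manuals
thresholds = {"manuals": 0.40, "faqs": 0.50}

ranked = aggregate_hits(hits_by_source, weights, thresholds)
ranked_ids = [hit.doc_id for _, hit in ranked]
print(ranked_ids)
```

With equal weights the FAQ hit `f-88` would rank first; with the manual boost, `m-17` overtakes it, and `f-12` is removed by the FAQ threshold before ranking.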

Key Findings

Weighted RAG plus self-evaluation achieves higher troubleshooting accuracy than baselines

Numbers: Accuracy 90.8% (proposed) vs 85.2% (standard RAG) vs 76.1% (BM25)

LLaMA-based self-evaluator improves correctness over standard RAG

Numbers: Accuracy +5.6 percentage points versus standard RAG (90.8% vs 85.2%)

Approach works at enterprise scale across multiple source types

Numbers: Dataset includes 1,200 manuals and 40,000 FAQs indexed into separate FAISS indices

Results

Accuracy (proposed system)

Value: 90.8%

Baseline: 85.2% (Standard RAG)

Relevance Score (proposed system)

Value: 0.89

Baseline: 0.75 (Standard RAG)

Accuracy (keyword search, BM25)

Value: 76.1%

What To Try In 7 Days

Index your manuals, FAQs, and KBs into separate FAISS indices.

Prototype rule-based source weights (e.g., boost manuals for SKU queries).

Add per-index threshold filtering to drop weak matches before generation. Tune the thresholds empirically on a labeled sample.
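A first prototype of the last two steps might look like the sketch below. Everything in it is an assumption made for illustration: the SKU format in `SKU_PATTERN`, the weight values, the source names, and the choice of F1 as the tuning criterion (the paper reports accuracy and relevance, not F1). The labeled sample is invented.

```python
import re

# Hypothetical SKU format such as "AB-1234"; replace with your catalog's pattern.
SKU_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def source_weights(query):
    """Rule-based weights: boost manuals when the query names a SKU."""
    if SKU_PATTERN.search(query):
        return {"manuals": 1.5, "faqs": 1.0, "kb": 1.0}
    return {"manuals": 1.0, "faqs": 1.0, "kb": 1.0}


def pick_threshold(labeled_scores, candidates):
    """Return (threshold, f1) for the candidate with the best F1.

    labeled_scores: list of (score, is_relevant) pairs for ONE index,
    taken from a small hand-labeled sample of retrieval results.
    """
    total_relevant = sum(rel for _, rel in labeled_scores)
    best_threshold, best_f1 = candidates[0], -1.0
    for threshold in candidates:
        kept = [rel for score, rel in labeled_scores if score >= threshold]
        true_pos = sum(kept)
        precision = true_pos / len(kept) if kept else 0.0
        recall = true_pos / total_relevant if total_relevant else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Invented labeled sample for one index: (similarity score, was it relevant?)
sample = [(0.82, True), (0.74, True), (0.55, False), (0.48, True), (0.30, False)]
threshold, f1 = pick_threshold(sample, [0.3, 0.5, 0.7])
print(source_weights("Error E42 on AB-1234 after update"), threshold, round(f1, 2))
```

Running `pick_threshold` once per index gives each source its own cutoff, which matches the index-per-source layout suggested in the first step.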

Agent Features

Memory

  • retrieval memory via indexed embeddings

Planning

  • iterative retrieval and validation loop
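One way to picture such a loop is sketched below. It is not the paper's code: `answer_with_validation`, the three callables, the confidence cutoff, and the "double the retrieval width on retry" policy are all assumptions. In a real system the callables would wrap the FAISS search and the LLaMA generation and self-evaluation calls; here they are stubs so the control flow can run on its own.

```python
def answer_with_validation(query, retrieve, generate, evaluate,
                           min_confidence=0.7, max_rounds=2, top_k=3):
    """Retrieve, generate, self-evaluate; widen retrieval and retry on a low score.

    retrieve(query, top_k) -> list of context strings
    generate(query, contexts) -> answer string
    evaluate(query, contexts, answer) -> confidence in [0, 1]
    """
    confidence = 0.0
    for _ in range(max_rounds):
        contexts = retrieve(query, top_k)
        answer = generate(query, contexts)
        confidence = evaluate(query, contexts, answer)
        if confidence >= min_confidence:
            return {"answer": answer, "confidence": confidence, "suppressed": False}
        top_k *= 2  # assumption: retry with more context
    # Every round scored too low: suppress the answer instead of returning it.
    return {"answer": None, "confidence": confidence, "suppressed": True}


# Stub components (hypothetical) standing in for FAISS and LLaMA wrappers.
DOCS = [
    "Check that the power cable is seated.",
    "Restart the device from the settings menu.",
    "Clear the cache under Storage.",
    "Firmware 2.1 fixes the Wi-Fi dropout issue.",
]


def fake_retrieve(query, top_k):
    return DOCS[:top_k]


def fake_generate(query, contexts):
    return " ".join(contexts)


def fake_evaluate(query, contexts, answer):
    return 0.9 if "Firmware 2.1" in answer else 0.2


result = answer_with_validation(
    "Wi-Fi keeps dropping", fake_retrieve, fake_generate, fake_evaluate
)
print(result["suppressed"], result["confidence"])
```

Returning a suppressed result rather than a low-confidence answer lets the caller route the ticket to a human.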

Tool Use

  • FAISS for nearest-neighbor search
  • LLaMA for generation and self-eval

Frameworks

  • Weighted RAG
  • Facade pattern for data sources

Is Agentic

true

Architectures

  • single-agent retrieval-generation-evaluation pipeline

Optimization Features

Token Efficiency

  • chunking and top-K filtering to reduce generator input
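As a rough illustration of the chunking half of this item, the sketch below splits text into overlapping word windows before embedding. The function name, the word-based splitting, and the default sizes are assumptions; the summary does not state the paper's chunking parameters.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of chunk_size that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Tiny example with small sizes so the windows are easy to inspect.
demo = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo, chunk_size=4, overlap=1)
print(demo_chunks)
```

Smaller chunks keep the generator's input short once top-K filtering is applied, at the cost of splitting long procedures across several chunks; the overlap is a common way to soften that.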

Infra Optimization

  • GPU-based FAISS and large-model inference (A100 GPUs used)

System Optimization

  • index-per-source design for selective thresholds

Training Optimization

  • RL

Inference Optimization

  • parallel FAISS index searches
  • top-K selection to limit generator context
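The two items above can be combined as in the following sketch, which fans a query out to one search function per index and then keeps only the best few results. The `search_all` and `top_k_context` helpers and the stub searchers are hypothetical; real searchers would call each FAISS index, and threads help only to the extent that the search call releases the GIL or waits on I/O.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_all(query, searchers, per_index_k=5):
    """Run one search per index concurrently; return (score, source, doc_id) tuples."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        futures = {
            source: pool.submit(search, query, per_index_k)
            for source, search in searchers.items()
        }
        for source, future in futures.items():
            for doc_id, score in future.result():
                results.append((score, source, doc_id))
    return results


def top_k_context(results, k=3):
    """Keep the k highest-scoring results to limit the generator's context."""
    return heapq.nlargest(k, results)


# Stub searchers (assumptions) returning (doc_id, score) pairs per index.
def search_manuals(query, k):
    return [("m-17", 0.62), ("m-03", 0.41)][:k]


def search_faqs(query, k):
    return [("f-88", 0.70), ("f-12", 0.35)][:k]


searchers = {"manuals": search_manuals, "faqs": search_faqs}
best = top_k_context(search_all("reset AB-1234", searchers), k=3)
print(best)
```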

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Dataset appears proprietary; results may not generalize to other enterprises.
  • Self-evaluation and generation use a 70B LLaMA model; cost and latency are high for many deployments.
  • Weighting strategy is rule-based in experiments rather than learned from feedback.
  • Paper focuses on single-turn queries; multi-turn conversational troubleshooting is future work.

When Not To Use

  • If you lack GPU capacity for large LLaMA inference and FAISS at scale.
  • When strict data locality or privacy rules forbid moving sensitive KBs into shared embeddings.
  • If you need a certified deterministic decision process rather than a validated natural-language answer.

Failure Modes

  • Over-weighting one source can bias answers toward that source even if it's outdated.
  • Poor threshold settings can filter out the correct document or allow weak matches, harming accuracy.
  • Because generation and validation both depend on a large LLaMA model, the pipeline can fail silently when the model hallucinates and the self-eval threshold is set too low to catch it.

Core Entities

Models

  • all-MiniLM-L6-v2
  • LLaMA-3.1(70B)

Metrics

  • Accuracy
  • Relevance Score

Datasets

  • Product manuals (1,200)
  • FAQs (40,000)
  • Troubleshooting guides
  • Internal knowledge bases

Context Entities

Models

  • Sentence embedding model (all-MiniLM-L6-v2)
  • Generative LLaMA for response
  • LLaMA self-evaluator

Metrics

  • Top-K retrieval
  • Threshold-based filtering

Datasets

  • Enterprise troubleshooting corpus built by authors