Use multiple LLM agents to filter noisy retrieved documents and improve RAG accuracy without any training

December 31, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.65

Cost Impact Score

0.55

Citation Count

1

Authors

Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, Na Zou

Links

Abstract / PDF

Why It Matters For Business

MAIN-RAG adds a low-cost layer to existing RAG systems that reduces noisy context and often raises answer accuracy without model retraining, lowering compute waste and speeding deployment.

Summary TLDR

MAIN-RAG is a training-free Retrieval-Augmented Generation (RAG) pipeline that uses three LLM agents (Predictor, Judge, Final-Predictor) to filter and rank retrieved documents before answering. The Judge scores each Doc–Query–Answer triplet by the log-probability difference of “Yes” vs “No”, and an adaptive judge bar (per-query average ± n·std of those scores) decides which documents to keep. Across four QA benchmarks, MAIN-RAG improved answer accuracy by about 2–11% (reported gains reach 6.1% with Mistral7B and 12.0% with Llama3-8B) while reducing the number of irrelevant documents, all without fine-tuning.

Problem Statement

Retrieved documents often contain irrelevant or noisy content. That noise lowers RAG answer accuracy, raises compute cost, and undermines reliability. We need a simple, training-free way to filter and order retrieved passages so LLMs get cleaner context.

Main Contribution

Training-free multi-agent filtering: a three-agent RAG pipeline (Predictor, Judge, Final-Predictor) that filters and ranks retrieved docs without fine-tuning.

Adaptive judge bar: a per-query threshold based on the mean and standard deviation of Judge scores (τ_q = mean ± n·std) to keep recall while removing noise.
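As a rough illustration of the adaptive judge bar, the sketch below computes τ_q from one query's Judge scores. The function name `adaptive_judge_bar`, the signed `n` argument, the use of population standard deviation, and the `>=` keep rule are assumptions made for this example, not details confirmed by the paper.

```python
from statistics import fmean, pstdev

def adaptive_judge_bar(scores, n=0.0):
    """Return tau_q = mean(scores) + n * std(scores) for one query.

    A negative n lowers the bar (keeps more documents); a positive n raises it.
    Population standard deviation is an assumption made for this sketch.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return fmean(scores) + n * pstdev(scores)

scores = [2.0, 1.0, -1.0, -2.0]                   # hypothetical Judge scores
tau_default = adaptive_judge_bar(scores)          # mean only
tau_relaxed = adaptive_judge_bar(scores, n=-0.5)  # mean - 0.5 * std
kept = [s for s in scores if s >= tau_default]    # documents at or above the bar
```

Because the bar is recomputed per query, a query whose documents all score low still keeps its relatively best ones, which a fixed global threshold would not do.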

Empirical validation: experiments on four QA benchmarks show consistent accuracy gains and higher response consistency versus standard training-free baselines.

Key Findings

MAIN-RAG improves QA accuracy over training-free baselines on evaluated datasets.

Numbers: 2–11% overall improvement; up to +6.1% (Mistral7B) and +12.0% (Llama3-8B) reported

A simple per-query average score as the judge bar (τ_q) is effective.

Numbers: the default τ_q ranks at least 2nd across ablations on three benchmarks

Sorting kept documents in descending Judge score helps final answers.

Numbers: for example, on TriviaQA (accuracy) Mistral7B MAIN-RAG scores 71.0 with descending order vs 70.2 with ascending order

Judge uses log-probability difference between “Yes” and “No” to create a continuous relevance score.
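A minimal sketch of that score, assuming your LLM API can return log-probabilities for candidate next tokens. The `token_logprobs` dict is a hypothetical stand-in for that API response, and `yes_probability` is not part of the described method; it is only a convenience for reading scores on a 0–1 scale.

```python
import math

def judge_score(token_logprobs):
    """Relevance score = logprob("Yes") - logprob("No").

    token_logprobs is a hypothetical dict mapping candidate next tokens to
    log-probabilities, as returned by whatever LLM API you use.
    """
    return token_logprobs["Yes"] - token_logprobs["No"]

def yes_probability(score):
    """Probability of "Yes" renormalized over {"Yes", "No"} (sigmoid of score)."""
    return 1.0 / (1.0 + math.exp(-score))

example = {"Yes": math.log(0.6), "No": math.log(0.2)}  # made-up values
score = judge_score(example)      # log(0.6 / 0.2) = log(3)
p_yes = yes_probability(score)    # 0.6 / (0.6 + 0.2) = 0.75
```

A positive score means the Judge leans toward “Yes”, a negative score toward “No”, and the magnitude reflects how strongly.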

Results

Accuracy

Value: 71.0% (MAIN-RAG, Mistral7B)

Baseline: 69.4% (Mistral7B with docs, training-free)

Accuracy

Value: 58.9% (MAIN-RAG, Mistral7B)

Baseline: 55.5% (Mistral7B with docs, training-free)

Accuracy

Value: 58.9% (MAIN-RAG, Mistral7B)

Baseline: 57.1% (Mistral7B with docs, training-free)

Who Should Care

What To Try In 7 Days

Run your retriever to return top-N (e.g., 20) docs and instantiate three LLM calls: Predictor, Judge, Final-Predictor.

Implement Judge as a Yes/No prompt and compute score = logprob(Yes) - logprob(No).

Set τ_q = mean(scores) as the default; lower it to mean(scores) - 0.5·σ if recall is critical. Then sort the kept docs by descending Judge score and pass them to the Final-Predictor.
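The three steps above could be wired together roughly as follows. This is a sketch under stated assumptions: `predict` and `judge` are hypothetical callables wrapping your own LLM calls, the Predictor is assumed to draft one answer per document, and the lambdas at the bottom are toy stand-ins so the example runs without a model.

```python
from statistics import fmean, pstdev
from typing import Callable, List, Tuple

def filter_and_rank(
    query: str,
    docs: List[str],
    predict: Callable[[str, str], str],
    judge: Callable[[str, str, str], float],
    n: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score each doc, keep those at or above tau_q, sort by descending score."""
    if not docs:
        return []
    scored = []
    for doc in docs:
        answer = predict(query, doc)                     # Predictor: draft answer
        scored.append((doc, judge(doc, query, answer)))  # Judge: Yes/No log-prob gap
    values = [s for _, s in scored]
    tau_q = fmean(values) + n * pstdev(values)  # n=-0.5 gives mean - 0.5 * std
    kept = [(d, s) for d, s in scored if s >= tau_q]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins (hypothetical scores) so the sketch runs without an LLM.
fake_scores = {"doc A": 1.5, "doc B": -2.0, "doc C": 3.0, "doc D": -0.5}
ranked = filter_and_rank(
    "who wrote X?",
    list(fake_scores),
    predict=lambda q, d: "draft answer",
    judge=lambda d, q, a: fake_scores[d],
)
context = "\n\n".join(doc for doc, _ in ranked)  # goes into the Final-Predictor prompt
```

Passing the agents in as callables keeps the filtering logic independent of any particular model or serving stack, so the same function can be tested offline with cached scores.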

Agent Features

Memory

  • retrieval memory (external documents)

Tool Use

  • external retriever (Contriever-MS MARCO)
  • LLM-based Yes/No judge scoring

Frameworks

  • RAG (Retrieval-Augmented Generation)
  • multi-agent filtering pipeline

Is Agentic

true

Architectures

  • pretrained LLMs (decoder-only: Mistral7B, Llama3-8B)

Collaboration

  • agent consensus via Judge scoring and Final-Predictor consumption

Optimization Features

Token Efficiency

  • fewer irrelevant tokens are passed to Final-Predictor after filtering

Inference Optimization

  • reduces number of irrelevant docs fed to final model, lowering inference cost

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluated with a limited set of pretrained LLMs (Mistral7B, Llama3-8B) and four QA datasets.
  • Does not explore different retrievers or rerankers; retriever choice is left orthogonal.
  • Judge misjudgments on low-confidence queries can filter out useful passages, causing wrong answers.
  • Increased inference calls (three agents) raise carbon footprint and latency compared to single-pass RAG.

When Not To Use

  • When you already have a high-quality, task-specific retriever or trained reranker.
  • When ultra-low latency is required and additional LLM calls are not acceptable.
  • When Judge LLM is known to be unreliable for your domain (low confidence scores).

Failure Modes

  • Judge assigns low or noisy scores and removes supportive documents, yielding incorrect final answers (case studies show this).
  • Adaptive τ_q set too high can drop needed context; set n carefully to preserve recall.
  • Judge prompt sensitivity or tokenization differences can alter log-prob scores and sorting.
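One way to gauge the τ_q failure mode before deployment is to sweep n over cached Judge scores and watch how many documents survive. The sketch below uses made-up scores and a hypothetical helper `kept_count`; population standard deviation and the `>=` keep rule are assumptions.

```python
from statistics import fmean, pstdev

def kept_count(scores, n):
    """Number of documents with score >= mean + n * std."""
    tau_q = fmean(scores) + n * pstdev(scores)
    return sum(1 for s in scores if s >= tau_q)

scores = [4.0, 2.0, 0.0, -2.0, -4.0]  # hypothetical Judge scores for one query
counts = {n: kept_count(scores, n) for n in (-1.0, -0.5, 0.0, 0.5, 1.0)}
# counts == {-1.0: 4, -0.5: 3, 0.0: 3, 0.5: 2, 1.0: 1}
```

Running the same sweep on queries with known supporting passages shows at which n those passages start to be dropped.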

Core Entities

Models

  • Mistral7B
  • Llama3-8B
  • Llama2-chat-13B
  • Llama2-7B
  • Alpaca-7B

Metrics

  • Accuracy
  • exact match (em)
  • ROUGE (rg)
  • MAUVE (mau)

Datasets

  • TriviaQA-unfiltered
  • PopQA (long-tail subset)
  • ARC-Challenge (ARC-C)
  • ASQA / ALCE-ASQA
  • RGB (document ordering experiments)

Benchmarks

  • TriviaQA
  • PopQA
  • ARC-Challenge
  • ASQA/ALCE-ASQA
  • RGB (ordering)