Use multiple LLM agents to filter noisy retrieved documents and improve RAG accuracy without any training

December 31, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The method is straightforward and plug-and-play: it needs only existing LLMs as agents, a retriever, and a single hyperparameter n. Evidence comes from zero-shot experiments on multiple benchmarks; large-scale deployment studies are missing.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 65%

Authors

Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, Na Zou

Links

Abstract / PDF

Why It Matters For Business

MAIN-RAG adds a low-cost layer to existing RAG systems that reduces noisy context and often raises answer accuracy without model retraining, lowering compute waste and speeding deployment.

Summary TLDR

MAIN-RAG is a training-free Retrieval-Augmented Generation (RAG) pipeline that uses three LLM agents (Predictor, Judge, Final-Predictor) to filter and rank retrieved documents before answering. The Judge scores each Doc–Query–Answer triplet by the log-probability difference of “Yes” vs “No”, and an adaptive judge bar (per-query average ± n·std) selects documents to keep. Across four QA benchmarks MAIN-RAG improved answer accuracy by about 2–11% on evaluated datasets (up to 6.1% with Mistral7B and 12.0% with Llama3-8B in comparisons) while reducing irrelevant documents, all without fine-tuning.

Problem Statement

Retrieved documents often contain irrelevant or noisy content. That noise lowers RAG answer accuracy, raises compute cost, and undermines reliability. We need a simple, training-free way to filter and order retrieved passages so LLMs get cleaner context.

Main Contribution

Training-free multi-agent filtering: a three-agent RAG pipeline (Predictor, Judge, Final-Predictor) that filters and ranks retrieved docs without fine-tuning.

Adaptive judge bar: a per-query threshold based on the mean and standard deviation of Judge scores (τ_q = mean ± n·std) to keep recall while removing noise.
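The adaptive judge bar can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the bar is written as mean - n·std (so n = 0 gives the plain mean and a negative n raises the bar) and that the population standard deviation is used, neither of which the summary above pins down. The function names `adaptive_judge_bar` and `filter_and_rank` are hypothetical.

```python
from statistics import mean, pstdev


def adaptive_judge_bar(scores, n=0.0):
    """Per-query bar: mean(scores) - n * std(scores).

    Assumption: population std (pstdev); the source only says "std".
    """
    return mean(scores) - n * pstdev(scores)


def filter_and_rank(docs, scores, n=0.0):
    """Keep docs whose Judge score clears the bar, highest score first."""
    bar = adaptive_judge_bar(scores, n)
    kept = [(doc, score) for doc, score in zip(docs, scores) if score >= bar]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]


docs = ["d1", "d2", "d3", "d4"]
scores = [2.0, -1.0, 0.5, -3.5]  # made-up Judge scores for one query

print(filter_and_rank(docs, scores))         # ['d1', 'd3']
print(filter_and_rank(docs, scores, n=0.5))  # ['d1', 'd3', 'd2']
```

With these made-up scores the mean is -0.5, so the default bar keeps two documents; lowering the bar by half a standard deviation lets a third, borderline document through.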

Key Findings

MAIN-RAG improves QA accuracy over training-free baselines on evaluated datasets.

Numbers: 2–11% overall improvement; up to +6.1% (Mistral7B) and +12.0% (Llama3-8B) reported

Practical Use: If you add MAIN-RAG to an existing RAG pipeline you can often get a few percent absolute accuracy gain without re-training models.

Evidence Ref: Abstract, Sec. 4.4, Table 1

A simple per-query average score as the judge bar (τ_q) is effective.

Numbers: Default τ_q ranks at least 2nd across ablations on three benchmarks

Practical Use: Set τ_q = mean(score) as a strong default; tune the single hyperparameter n for recall-sensitive tasks.

Evidence Ref: Sec. 3.3, Sec. 4.5, Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 71.0% (MAIN-RAG, Mistral7B) | 69.4% (Mistral7B with docs, training-free) | +1.6% | TriviaQA (test/val split used by prior work) | Table 1, Sec. 4.4 | Table 1
Accuracy | 58.9% (MAIN-RAG, Mistral7B) | 55.5% (Mistral7B with docs, training-free) | +3.4% | PopQA (long-tail subset) | Table 1, Sec. 4.4 | Table 1
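The Delta column reads as an absolute difference in accuracy. The short check below recomputes it from the Value and Baseline columns. The relative gain it also prints is an extra figure calculated here for comparison, not a number taken from the source, and the helper name `gains` is hypothetical.

```python
def gains(main_rag_acc, baseline_acc):
    """Return (absolute gain in points, relative gain in %), rounded to 0.1."""
    absolute = main_rag_acc - baseline_acc
    relative = 100.0 * absolute / baseline_acc
    return round(absolute, 1), round(relative, 1)


# (MAIN-RAG accuracy, baseline accuracy) for Mistral7B, from the rows above
results = {
    "TriviaQA": (71.0, 69.4),
    "PopQA": (58.9, 55.5),
}

for name, (main_rag_acc, baseline_acc) in results.items():
    absolute, relative = gains(main_rag_acc, baseline_acc)
    print(f"{name}: +{absolute} points absolute, +{relative}% relative")
# TriviaQA: +1.6 points absolute, +2.3% relative
# PopQA: +3.4 points absolute, +6.1% relative
```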

What To Try In 7 Days

Run your retriever to return top-N (e.g., 20) docs and instantiate three LLM calls: Predictor, Judge, Final-Predictor.

Implement Judge as a Yes/No prompt and compute score = logprob(Yes) - logprob(No).

Set τ_q = mean(scores) as the default, or try τ_q - 0.5·σ if recall is critical; then sort the kept docs by Judge score in descending order and pass them to Final-Predictor.
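The three steps above can be wired together roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: the three agents are passed in as plain callables (`predict`, `judge_logprobs`, `final_predict` are hypothetical names), the Predictor is assumed to draft one answer per retrieved document, and the Judge is assumed to return log-probabilities for the tokens "Yes" and "No". The stub agents at the bottom stand in for real LLM calls so the example runs on its own.

```python
from statistics import mean, pstdev


def judge_score(logprobs):
    """score = logprob(Yes) - logprob(No); higher means more supportive."""
    return logprobs["Yes"] - logprobs["No"]


def main_rag_answer(query, docs, predict, judge_logprobs, final_predict, n=0.0):
    # Step 1: Predictor drafts an answer for each (query, doc) pair.
    answers = [predict(query, doc) for doc in docs]
    # Step 2: Judge scores each Doc-Query-Answer triplet.
    scores = [
        judge_score(judge_logprobs(query, doc, ans))
        for doc, ans in zip(docs, answers)
    ]
    # Step 3: per-query bar (assumed form: mean - n * population std).
    bar = mean(scores) - n * pstdev(scores)
    kept = [(s, d) for s, d in zip(scores, docs) if s >= bar]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    # Final-Predictor sees only the kept docs, best-scored first.
    return final_predict(query, [d for _, d in kept])


# --- Stub agents (placeholders for real LLM calls) ---
def stub_predict(query, doc):
    return doc.split()[0]


def stub_judge_logprobs(query, doc, answer):
    if "Paris" in doc:
        return {"Yes": -0.1, "No": -3.0}
    return {"Yes": -2.5, "No": -0.2}


def stub_final_predict(query, kept_docs):
    return kept_docs  # a real agent would generate an answer from these


docs = [
    "Paris is the capital of France.",
    "Bananas are yellow.",
    "France's capital city is Paris.",
]
print(main_rag_answer("What is the capital of France?", docs,
                      stub_predict, stub_judge_logprobs, stub_final_predict))
```

Passing the agents in as callables keeps the filtering logic independent of any particular LLM client, so the same function can be exercised with stubs before real model calls are attached.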

Agent Features

Memory: retrieval memory (external documents)
Tool Use: external retriever (Contriever-MS MARCO), LLM-based Yes/No judge scoring
Frameworks: RAG (Retrieval-Augmented Generation), multi-agent filtering pipeline
Is Agentic: Yes
Architectures: pretrained LLMs (decoder-only: Mistral7B, Llama3-8B)
Collaboration: agent consensus via Judge scoring and Final-Predictor consumption

Optimization Features

Token Efficiency: fewer irrelevant tokens are passed to Final-Predictor after filtering
Inference Optimization: reduces number of irrelevant docs fed to final model, lowering inference cost

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluated with a limited set of pretrained LLMs (Mistral7B, Llama3-8B) and four QA datasets.

Does not explore different retrievers or rerankers; retriever choice is left orthogonal.

When Not To Use

When you already have a high-quality, task-specific retriever or trained reranker.

When ultra-low latency is required and additional LLM calls are not acceptable.

Failure Modes

Judge assigns low or noisy scores and removes supportive documents, yielding incorrect final answers (case studies show this).

Adaptive τ_q set too high can drop needed context; set n carefully to preserve recall.
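One way to check whether the bar is dropping needed context is to sweep n on a few queries where the supportive document is known. The sketch below is illustrative only: the scores are made up, the bar is assumed to be mean - n·std with population std, and `kept_indices` and `sweep_n` are hypothetical helpers.

```python
from statistics import mean, pstdev


def kept_indices(scores, n):
    """Indices of docs whose score clears mean - n * std (assumed form)."""
    bar = mean(scores) - n * pstdev(scores)
    return [i for i, s in enumerate(scores) if s >= bar]


def sweep_n(scores, supportive_idx, candidates=(-0.5, 0.0, 0.5, 1.0)):
    """For each n, report (n, number of docs kept, supportive doc kept?)."""
    report = []
    for n in candidates:
        kept = kept_indices(scores, n)
        report.append((n, len(kept), supportive_idx in kept))
    return report


# Made-up Judge scores; doc 2 is supportive but received a noisy, low score.
scores = [3.1, 0.2, -0.4, -2.0, -2.5]

for n, num_kept, has_support in sweep_n(scores, supportive_idx=2):
    print(f"n={n:+.1f}: kept {num_kept} docs, supportive doc kept: {has_support}")
```

In this toy case the supportive document is filtered out at n = 0 and survives once the bar is lowered by half a standard deviation, which is the kind of trade-off the sweep is meant to surface.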

Core Entities

Models

Mistral7B, Llama3-8B, Llama2-chat-13B, Llama2-7B, Alpaca-7B

Metrics

Accuracy, exact match (em), rouge (rg), MAUVE (mau)

Datasets

TriviaQA-unfiltered, PopQA (long-tail subset), ARC-Challenge (ARC-C), ASQA / ALCE-ASQA, RGB (document ordering experiments)

Benchmarks

TriviaQA, PopQA, ARC-Challenge, ASQA/ALCE-ASQA, RGB (ordering)