Use multiple LLM agents to filter noisy retrieved documents and improve RAG accuracy without any training

Overview

Decision Snapshot: Needs Validation

The method is plug-and-play: it needs only existing LLMs as agents, a retriever, and a single hyperparameter n. Evidence comes from multi-benchmark zero-shot experiments, but large-scale deployment studies are lacking.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 65%

Authors

Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, Na Zou

Why It Matters For Business

MAIN-RAG adds a low-cost layer to existing RAG systems that reduces noisy context and often raises answer accuracy without model retraining, lowering compute waste and speeding deployment.

Who Should Care

ML Engineer, Data Scientist, Engineering Lead, Product Manager

Summary TLDR

MAIN-RAG is a training-free Retrieval-Augmented Generation (RAG) pipeline that uses three LLM agents (Predictor, Judge, Final-Predictor) to filter and rank retrieved documents before answering. The Judge scores each Doc–Query–Answer triplet by the difference between the log-probabilities of “Yes” and “No”, and an adaptive judge bar (the per-query mean of those scores, shifted by n standard deviations) selects which documents to keep. Across four QA benchmarks, MAIN-RAG improved answer accuracy by a reported 2–11% (with comparison figures of up to 6.1% for Mistral7B and 12.0% for Llama3-8B) while reducing the number of irrelevant documents, all without fine-tuning.
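The scoring and filtering rule summarized above can be written out as follows. This is a notational sketch only: the symbols d_i (the i-th retrieved document), a_i (the Predictor's answer for that document), N (the number of retrieved documents), and the use of the population standard deviation are assumptions introduced here, not notation confirmed by the paper. A single signed n is used so that n > 0 lowers the bar and n < 0 raises it.

```latex
\begin{align*}
s_i &= \log p(\text{Yes} \mid d_i, q, a_i) - \log p(\text{No} \mid d_i, q, a_i), \quad i = 1, \dots, N \\
\tau_q &= \frac{1}{N} \sum_{i=1}^{N} s_i, \qquad
\sigma_q = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (s_i - \tau_q)^2} \\
\text{keep } d_i &\iff s_i \ge \tau_q - n\,\sigma_q
\end{align*}
```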

Problem Statement

Retrieved documents often contain irrelevant or noisy content. That noise lowers RAG answer accuracy, raises compute cost, and undermines reliability. We need a simple, training-free way to filter and order retrieved passages so LLMs get cleaner context.

Main Contribution

Training-free multi-agent filtering: a three-agent RAG pipeline (Predictor, Judge, Final-Predictor) that filters and ranks retrieved docs without fine-tuning.

Adaptive judge bar: a per-query threshold based on the mean and standard deviation of Judge scores (τ_q = mean ± n·std) to keep recall while removing noise.
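A minimal sketch of the adaptive judge bar, assuming the Judge scores for one query are already available as a list of floats. The function names `adaptive_judge_bar` and `keep_indices`, the signed-n convention (n > 0 lowers the bar), and the choice of population standard deviation are assumptions of this sketch rather than details confirmed by the source.

```python
from statistics import mean, pstdev


def adaptive_judge_bar(scores, n=0.0):
    """Return the per-query threshold mean(scores) - n * std(scores).

    n = 0 gives the plain per-query mean; n > 0 lowers the bar (keeps
    more documents); n < 0 raises it. Population std is an assumption.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return mean(scores) - n * pstdev(scores)


def keep_indices(scores, n=0.0):
    """Indices of documents whose Judge score meets or exceeds the bar."""
    bar = adaptive_judge_bar(scores, n)
    return [i for i, s in enumerate(scores) if s >= bar]


# Illustrative scores only; not taken from the paper.
scores = [2.0, 1.0, -1.0, -2.0]
print(keep_indices(scores))         # bar is the mean (0.0): keeps [0, 1]
print(keep_indices(scores, n=1.0))  # bar drops by one std: keeps [0, 1, 2]
```

Because the bar is computed per query, a query whose documents all score low still keeps its relatively best candidates, which is the recall-preserving behavior the text describes.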

Key Findings

MAIN-RAG improves QA accuracy over training-free baselines on evaluated datasets.

Numbers: 2–11% overall improvement; up to +6.1% (Mistral7B) and +12.0% (Llama3-8B) reported

Practical Use: If you add MAIN-RAG to an existing RAG pipeline, you can often get a few percent absolute accuracy gain without retraining models.

Evidence Ref: Abstract, Sec.4.4, Table 1

A simple per-query average score as the judge bar (τ_q) is effective.

Numbers: Default τ_q ranks at least 2nd across ablations on three benchmarks

Practical Use: Set τ_q = mean(score) as a strong default; tune the single hyperparameter n for recall-sensitive tasks.

Evidence Ref: Sec.3.3, Sec.4.5, Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	71.0% (MAIN-RAG, Mistral7B)	69.4% (Mistral7B with docs, training-free)	+1.6%	TriviaQA (test/val split used by prior work)	Table 1, Sec.4.4	Table 1
Accuracy	58.9% (MAIN-RAG, Mistral7B)	55.5% (Mistral7B with docs, training-free)	+3.4%	PopQA (long-tail subset)	Table 1, Sec.4.4	Table 1
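The Delta column above is a difference in accuracy percentage points. As a small worked check, the sketch below recomputes it from the Value and Baseline columns and also derives the relative gain over the baseline. The helper name `deltas` and the rounding to one decimal place are choices made for this illustration.

```python
# (dataset, MAIN-RAG accuracy %, baseline accuracy %) copied from the table.
rows = [
    ("TriviaQA", 71.0, 69.4),
    ("PopQA", 58.9, 55.5),
]


def deltas(value, baseline):
    """Return (absolute gain in points, relative gain in % of baseline)."""
    absolute = round(value - baseline, 1)
    relative = round(100.0 * (value - baseline) / baseline, 1)
    return absolute, relative


for name, value, baseline in rows:
    absolute, relative = deltas(value, baseline)
    print(f"{name}: +{absolute} points, +{relative}% relative")
```

Under this reading the PopQA row works out to roughly a 6.1% relative gain. Whether the headline "up to" figures are relative or absolute is not stated in this summary.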

What To Try In 7 Days

Run your retriever to return top-N (e.g., 20) docs and instantiate three LLM calls: Predictor, Judge, Final-Predictor.

Implement Judge as a Yes/No prompt and compute score = logprob(Yes) - logprob(No).

Set τ_q = mean(scores) as the default, or try τ_q - 0.5·σ if recall is critical. Then sort the kept docs by score in descending order and pass them to the Final-Predictor.
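The three steps above can be sketched end to end as follows. This is an illustrative outline, not the authors' implementation: the three agents are passed in as plain callables with hypothetical signatures, the Predictor is assumed to be called once per document (matching the Doc–Query–Answer triplet described in the summary), and the toy agents at the bottom are stand-ins with made-up log-probabilities so the sketch runs without a model.

```python
from statistics import mean, pstdev


def main_rag_answer(query, docs, predictor, judge, final_predictor, n=0.0):
    """Filter and order docs with Judge scores, then ask the Final-Predictor.

    Hypothetical agent signatures (each would wrap an LLM call):
      predictor(query, doc) -> candidate answer
      judge(doc, query, answer) -> (logprob_yes, logprob_no)
      final_predictor(query, kept_docs) -> final answer
    """
    if not docs:
        return final_predictor(query, [])
    scored = []
    for doc in docs:
        answer = predictor(query, doc)
        logprob_yes, logprob_no = judge(doc, query, answer)
        scored.append((logprob_yes - logprob_no, doc))
    scores = [score for score, _ in scored]
    # Adaptive bar: per-query mean, lowered by n population std deviations.
    bar = mean(scores) - n * pstdev(scores)
    kept = sorted(
        (pair for pair in scored if pair[0] >= bar),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return final_predictor(query, [doc for _, doc in kept])


# Toy stand-ins with invented log-probabilities, for illustration only.
TOY_LOGPROBS = {
    "Paris is in France": (-0.2, -2.2),
    "Bananas are yellow": (-2.5, -0.5),
    "The capital of France is Paris": (-0.1, -3.1),
}


def toy_predictor(query, doc):
    return doc.split()[-1]


def toy_judge(doc, query, answer):
    return TOY_LOGPROBS[doc]


def toy_final(query, kept_docs):
    return " | ".join(kept_docs)


result = main_rag_answer(
    "What is the capital of France?",
    list(TOY_LOGPROBS),
    toy_predictor,
    toy_judge,
    toy_final,
)
print(result)  # The capital of France is Paris | Paris is in France
```

In a real system the `judge` callable would need an LLM API that exposes token log-probabilities for the Yes/No answer. How to obtain those depends on the serving stack and is not covered here.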

Agent Features

Memory

retrieval memory (external documents)

Tool Use

external retriever (Contriever-MS MARCO), LLM-based Yes/No judge scoring

Frameworks

RAG (Retrieval-Augmented Generation), multi-agent filtering pipeline

Is Agentic

Yes

Architectures

pretrained LLMs (decoder-only: Mistral7B, Llama3-8B)

Collaboration

agent consensus via Judge scoring and Final-Predictor consumption

Optimization Features

Token Efficiency

fewer irrelevant tokens are passed to Final-Predictor after filtering

Inference Optimization

reduces number of irrelevant docs fed to final model, lowering inference cost

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluated with a limited set of pretrained LLMs (Mistral7B, Llama3-8B) and four QA datasets.

Does not explore different retrievers or rerankers; retriever choice is left orthogonal.

When Not To Use

When you already have a high-quality, task-specific retriever or trained reranker.

When ultra-low latency is required and additional LLM calls are not acceptable.

Failure Modes

Judge assigns low or noisy scores and removes supportive documents, yielding incorrect final answers (case studies show this).

Adaptive τ_q set too high can drop needed context; set n carefully to preserve recall.
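One way to probe these failure modes on a labeled development set is to check which documents the bar removes and how far n must be relaxed before a known supportive document survives. The sketch below assumes Judge scores are already computed. The helper names, the candidate grid of n values, and the example scores are all hypothetical.

```python
from statistics import mean, pstdev


def dropped_at(scores, n):
    """Indices of documents falling below the bar mean - n * std."""
    bar = mean(scores) - n * pstdev(scores)
    return [i for i, s in enumerate(scores) if s < bar]


def smallest_n_keeping(scores, gold_index, candidates=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Smallest candidate n at which the supportive document is retained.

    Returns None if no candidate keeps it, which suggests the Judge
    score itself is the problem rather than the threshold.
    """
    for n in candidates:
        if gold_index not in dropped_at(scores, n):
            return n
    return None


# Invented scores: document 2 is supportive but scored just under the mean.
scores = [3.0, 2.5, 0.5, -1.0, -2.0]
print(dropped_at(scores, 0.0))        # [2, 3, 4]
print(smallest_n_keeping(scores, 2))  # 0.5
```

A result of `None` points to the first failure mode (a noisy Judge score), while a small positive n points to the second (a bar set too high for that query).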

Core Entities

Models

Mistral7B, Llama3-8B, Llama2-chat-13B, Llama2-7B, Alpaca-7B

Metrics

Accuracy, exact match (em), ROUGE (rg), MAUVE (mau)

Datasets

TriviaQA-unfiltered, PopQA (long-tail subset), ARC-Challenge (ARC-C), ASQA / ALCE-ASQA, RGB (document ordering experiments)

Benchmarks

TriviaQA, PopQA, ARC-Challenge, ASQA/ALCE-ASQA, RGB (ordering)
