Overview
Production Readiness
0.6
Novelty Score
0.65
Cost Impact Score
0.55
Citation Count
1
Why It Matters For Business
MAIN-RAG adds a low-cost layer to existing RAG systems that reduces noisy context and often raises answer accuracy without model retraining, lowering compute waste and speeding deployment.
Summary TLDR
MAIN-RAG is a training-free Retrieval-Augmented Generation (RAG) pipeline that uses three LLM agents (Predictor, Judge, Final-Predictor) to filter and rank retrieved documents before answering. The Judge scores each Doc–Query–Answer triplet by the log-probability difference of “Yes” vs “No”, and an adaptive judge bar (per-query average ± n·std) selects documents to keep. Across four QA benchmarks MAIN-RAG improved answer accuracy by about 2–11% on evaluated datasets (up to 6.1% with Mistral7B and 12.0% with Llama3-8B in comparisons) while reducing irrelevant documents, all without fine-tuning.
Problem Statement
Retrieved documents often contain irrelevant or noisy content. That noise lowers RAG answer accuracy, raises compute cost, and undermines reliability. We need a simple, training-free way to filter and order retrieved passages so LLMs get cleaner context.
Main Contribution
Training-free multi-agent filtering: a three-agent RAG pipeline (Predictor, Judge, Final-Predictor) that filters and ranks retrieved docs without fine-tuning.
Adaptive judge bar: a per-query threshold based on the mean and standard deviation of Judge scores (τ_q = mean ± n·std) to keep recall while removing noise.
Empirical validation: experiments on four QA benchmarks show consistent accuracy gains and higher response consistency versus standard training-free baselines.
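The scoring and thresholding rule described above can be written out as a short worked formula. This is a sketch under assumptions: the symbols s_i, a_i, N and σ_q are notation introduced here for illustration, the use of the population standard deviation and of "≥" in the keep rule are assumed, and the sign convention (subtracting n·σ_q so that a positive n lowers the bar) is one reading of "mean ± n·std".

```latex
% Judge score for document d_i, with a_i the Predictor's answer for (q, d_i)
s_i = \log p(\text{Yes} \mid q, d_i, a_i) - \log p(\text{No} \mid q, d_i, a_i),
\qquad i = 1, \dots, N

% Per-query judge bar (mean) and spread of the N scores
\tau_q = \frac{1}{N} \sum_{i=1}^{N} s_i,
\qquad
\sigma_q = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( s_i - \tau_q \right)^2}

% Adjusted bar: n > 0 relaxes the filter, n < 0 tightens it, n = 0 is the plain mean
\text{keep } d_i \iff s_i \ge \tau_q - n \, \sigma_q
```

With n = 0 this reduces to the plain per-query average listed under Key Findings.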
Key Findings
MAIN-RAG improves QA accuracy over training-free baselines on evaluated datasets.
A simple per-query average score as the judge bar (τ_q) is effective.
Sorting the kept documents in descending order of Judge score improves final answers.
The Judge uses the log-probability difference between “Yes” and “No” to produce a continuous relevance score.
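A minimal sketch of that score, assuming the serving stack exposes log-probabilities for the first generated token as a plain dictionary. The function names, the dictionary shape, and the `floor` fallback for a missing token are illustrative assumptions, not part of the paper.

```python
import math


def judge_score(token_logprobs: dict[str, float], floor: float = -20.0) -> float:
    """Return logprob("Yes") - logprob("No") for one Doc-Query-Answer triplet.

    `token_logprobs` maps candidate first tokens to their log-probabilities.
    `floor` is an assumed fallback used when a token is absent from the map.
    """
    yes = token_logprobs.get("Yes", floor)
    no = token_logprobs.get("No", floor)
    return yes - no


def yes_probability(score: float) -> float:
    """Map the score to P(Yes) renormalised over the two options {Yes, No}."""
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    score = judge_score({"Yes": -0.1, "No": -2.3})
    print(score, yes_probability(score))
```

A positive score means the Judge leans towards "Yes", and a score of zero means it is undecided. Because the score is continuous, it can be both thresholded and used as a sort key.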
Results
Accuracy
Who Should Care
What To Try In 7 Days
Run your retriever to return top-N (e.g., 20) docs and instantiate three LLM calls: Predictor, Judge, Final-Predictor.
Implement Judge as a Yes/No prompt and compute score = logprob(Yes) - logprob(No).
Set τ_q = mean(scores) as the default; if recall is critical, lower the bar to τ_q - 0.5·σ. Then sort the kept docs by descending Judge score and pass them to the Final-Predictor.
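The three steps above can be sketched end to end as follows. This is an illustration under assumptions, not the authors' implementation: `predict`, `judge` and `final_predict` are hypothetical callables that wrap your own LLM calls, running the Predictor once per document is inferred from the "Doc–Query–Answer triplet" wording, and the population standard deviation is an assumed choice for σ.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def main_rag_answer(
    query: str,
    docs: Sequence[str],
    predict: Callable[[str, str], str],              # hypothetical: (query, doc) -> answer
    judge: Callable[[str, str, str], float],         # hypothetical: (query, doc, answer) -> score
    final_predict: Callable[[str, list[str]], str],  # hypothetical: (query, docs) -> answer
    n: float = 0.0,
) -> tuple[str, list[str]]:
    """Filter and rank retrieved docs, then answer with the kept docs."""
    if not docs:
        return final_predict(query, []), []

    # Predictor: one candidate answer per retrieved document.
    answers = [predict(query, doc) for doc in docs]

    # Judge: one relevance score per (doc, query, answer) triplet.
    scores = [judge(query, doc, ans) for doc, ans in zip(docs, answers)]

    # Adaptive judge bar: per-query mean, lowered by n standard deviations.
    bar = mean(scores) - n * pstdev(scores)

    # Keep docs at or above the bar, highest Judge score first.
    kept = sorted(
        ((s, d) for s, d in zip(scores, docs) if s >= bar),
        key=lambda pair: pair[0],
        reverse=True,
    )
    ranked_docs = [d for _, d in kept]

    # Final-Predictor: answer using only the filtered, ordered context.
    return final_predict(query, ranked_docs), ranked_docs
```

Injecting the three agents as callables keeps the filtering logic testable with stubs before any model is wired in. Passing `n=0.5` reproduces the recall-friendly τ_q - 0.5·σ setting.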
Agent Features
Memory
- retrieval memory (external documents)
Tool Use
- external retriever (Contriever-MS MARCO)
- LLM-based Yes/No judge scoring
Frameworks
- RAG (Retrieval-Augmented Generation)
- multi-agent filtering pipeline
Is Agentic
true
Architectures
- pretrained LLMs (decoder-only: Mistral7B, Llama3-8B)
Collaboration
- agents coordinate through Judge scores, which determine the documents the Final-Predictor consumes
Optimization Features
Token Efficiency
- fewer irrelevant tokens are passed to Final-Predictor after filtering
Inference Optimization
- reduces number of irrelevant docs fed to final model, lowering inference cost
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluated with a limited set of pretrained LLMs (Mistral7B, Llama3-8B) and four QA datasets.
- Does not explore different retrievers or rerankers; retriever choice is left orthogonal.
- Judge misjudgments on low-confidence queries can filter out useful passages, causing wrong answers.
- Increased inference calls (three agents) raise carbon footprint and latency compared to single-pass RAG.
When Not To Use
- When you already have a high-quality, task-specific retriever or trained reranker.
- When ultra-low latency is required and additional LLM calls are not acceptable.
- When Judge LLM is known to be unreliable for your domain (low confidence scores).
Failure Modes
- The Judge assigns low or noisy scores and removes supportive documents, yielding incorrect final answers (the paper's case studies show this).
- Adaptive τ_q set too high can drop needed context; set n carefully to preserve recall.
- Judge prompt sensitivity or tokenization differences can alter log-prob scores and sorting.
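One possible mitigation for the tokenization issue is to pool probability mass over surface variants of each answer token before taking the difference. This pooling is not described in the source; it is a sketch under assumptions, and the variant lists, the `floor` value, and the function names are illustrative.

```python
import math

# Assumed surface forms; adjust to what your tokenizer actually emits.
YES_VARIANTS = ("Yes", " Yes", "yes", " yes")
NO_VARIANTS = ("No", " No", "no", " no")


def pooled_logprob(top_logprobs: dict[str, float], variants, floor: float = -20.0) -> float:
    """Log of the summed probability of all variants present (log-sum-exp)."""
    found = [top_logprobs[t] for t in variants if t in top_logprobs]
    if not found:
        return floor  # assumed fallback when no variant appears among the top tokens
    peak = max(found)
    return peak + math.log(sum(math.exp(lp - peak) for lp in found))


def robust_judge_score(top_logprobs: dict[str, float]) -> float:
    """Judge score computed on pooled Yes and No probability mass."""
    return pooled_logprob(top_logprobs, YES_VARIANTS) - pooled_logprob(top_logprobs, NO_VARIANTS)
```

Pooling changes the absolute scores, so a judge bar tuned on unpooled scores should be re-checked after switching.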
Core Entities
Models
- Mistral7B
- Llama3-8B
- Llama2-chat-13B
- Llama2-7B
- Alpaca-7B
Metrics
- Accuracy
- Exact match (em)
- ROUGE (rg)
- MAUVE (mau)
Datasets
- TriviaQA-unfiltered
- PopQA (long-tail subset)
- ARC-Challenge (ARC-C)
- ASQA / ALCE-ASQA
- RGB (document ordering experiments)
Benchmarks
- TriviaQA
- PopQA
- ARC-Challenge
- ASQA/ALCE-ASQA
- RGB (ordering)

