Hybrid Retrieval Papers — Parsed & Scored for Practitioners

RAGElo: use synthetic queries + LLM-as-judge + Elo tournaments to compare RAG vs RAG-Fusion on company docs

0.60

0.50

0.60

3

RAGElo cuts expert labeling cost by using synthetic queries and LLM judges to rank retrieval-augmented systems, so teams can iterate and pick retrieval or fusion strategies faster while keeping a small human calibration step.

Key finding

LLM-as-a-judge assessments correlate moderately with human expert judgments.

Numbers: Kendall τ ≈ 0.56, p < 0.01; Spearman ρ ≈ 0.59
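To make the Elo-tournament idea concrete, here is a minimal sketch of how pairwise judge verdicts could be turned into a ranking. It uses the textbook Elo update, which is an assumption: the summary above does not say which variant RAGElo uses. The system names, K-factor, starting rating, and verdicts are all hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(matches, k=32.0, initial=1000.0):
    """Update ratings from (system_a, system_b, score_a) tuples.

    score_a is 1.0 if the judge preferred A, 0.0 if it preferred B,
    and 0.5 for a tie. k and initial are illustrative defaults.
    """
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ea = expected_score(ra, rb)
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb - k * (score_a - ea)
    return ratings


# Hypothetical LLM-judge verdicts on three synthetic queries.
matches = [
    ("rag_fusion", "rag", 1.0),
    ("rag", "rag_fusion", 0.5),
    ("rag_fusion", "rag", 1.0),
]
ratings = run_tournament(matches)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because the update is zero-sum, the total rating mass stays constant, which makes it easy to sanity-check a tournament run.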

VideoRAG: index and search unlimited‑length videos with graph grounding plus multi‑modal retrieval

0.60

0.70

0.40

2

VideoRAG enables searchable QA and summarization across many long videos, unlocking education, media-archive search, and customer-support video analytics without retraining large models.

Key finding

VideoRAG wins more LLM head-to-head judgments than standard RAG baselines.

Numbers: VideoRAG chosen 53.26% vs baselines' 46.74% (Overall Winner, Table 2)

Hybrid RAG that cuts hallucinations and improves multi‑step reasoning via chunking, tables, tool math, and KG

0.60

0.40

2

A practical pipeline that trades some answer coverage for far fewer incorrect claims, which is often preferable in QA products where wrong answers are costly.

Key finding

Full system flipped the Task 1 public score from strongly negative to positive.

Numbers: Score: 15.8% (ours) vs -46.6% (Official RAG) (Table 1)

Ask-EDA: a Slack-ready design chatbot that combines hybrid retrieval and an abbreviation lookup to reduce hallucinations

0.70

0.50

0.60

1

A hybrid RAG layer plus a small abbreviation lookup can cut wrong answers and boost recall on internal technical queries, speeding engineering work and reducing time spent hunting docs.

Key finding

Hybrid RAG substantially increases answer recall versus no retrieval.

Numbers: q2a-100: >40% recall improvement vs no-RAG; cmds-100: >60% recall improvement vs no-RAG

Customize RAG for EDA docs: domain-tuned retriever, reranker, generator + ORD-QA benchmark

0.70

0.60

1

Specialized RAG reduces wrong answers on complex EDA docs, improving self-serve support and lowering costly human support for tooling documentation.

Key finding

Domain-finetuned embedding improves dense retrieval recall.

Numbers: recall@20: 0.733 (ours) vs 0.66 (bge-large) and 0.634 (text-embedding-ada-002)
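For readers comparing retrievers on their own data, recall@k can be computed as sketched below. This assumes the common definition (the fraction of relevant documents that appear in the top k results); the summary above does not state which definition the paper uses. The document IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen here for queries with no labels
    hits = relevant.intersection(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical retrieval runs for two queries.
runs = [
    (["d1", "d7", "d3"], {"d3", "d9"}),  # one of two relevant docs found
    (["d4", "d2", "d8"], {"d4"}),        # the single relevant doc found
]
score = mean_recall_at_k(runs, k=3)
```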

Add a semantic timeline and durative summaries so agents recall events at the right time

0.70

0.65

0.40

0

TSM helps assistants tie recalled events to the time they actually occurred, improving time-sensitive answers and multi-session personalization. This can reduce wrong or stale recommendations in customer support and personal assistants.

Key finding

TSM raises overall QA accuracy on LONGMEMEVAL_S to 74.80%

Numbers: TSM 74.80% vs A-MEM 62.60% (+12.20 pp)

An open-source agent that switches between graph and vector search to improve literature review accuracy

0.70

0.60

0.50

0

Automating literature review with a system that picks the right retrieval mode reduces manual search time and improves the relevance of extracted evidence. This matters for teams that need fast, evidence-grounded summaries across many papers (R&D, clinical review, IP) and want an auditable pipeline.

Key finding

Agentic system with DPO substantially increases vector-store context recall.

Numbers: VS Context Recall +0.63 vs baseline

MemWeaver: tri-layer, temporally grounded memory that boosts long-horizon agent reasoning

0.70

0.60

0.70

0

MemWeaver cuts inference token cost by >95% while improving time-sensitive and multi-hop accuracy, so you can support long-running personalized agents without huge prompt costs or loss of traceability.

Key finding

MemWeaver reduces inference input length by over 95% compared to long-context prompting.

Numbers: >95% token reduction (22k → ~1k tokens per query)

BMAM: brain-inspired multi-agent memory that improves long-horizon agent consistency

0.60

0.70

0.60

0

If you build agents that must remember users and multi-session facts, a structured, timeline-aware memory reduces identity and temporal drift and improves preference stability across sessions.

Key finding

BMAM achieves strong long-horizon dialogue accuracy on LoCoMo.

Numbers: 78.45% (1558/1986)

Open-source Punjabi LLM suite + a quantum‑inspired hybrid retriever that improves retrieval and generation.

0.70

0.72

0.60

0

Tools that speak a local language well unlock use cases (education, news, local QA, civic services). A dedicated model + hybrid retrieval gives measurably better accuracy and cultural fit than off-the-shelf multilingual models.

Key finding

A roughly 35GB, 4.8M-document Punjabi corpus was assembled and used for training.

Numbers: 35.5GB corpus, ~4,800,000 documents, 32GB train / 2GB val / 1GB test

CarbonChat: LLM system for corporate carbon-emissions analysis using hybrid RAG and Text2SQL

0.60

0.50

0.60

0

CarbonChat automates extraction and structured analysis of long sustainability reports and policy texts, cutting manual effort and providing traceable, SQL-queryable answers for decision-makers.

Key finding

Self-Prompting RAG improves text-generation metrics vs standard RAG.

Numbers: Qwen-Max ROUGE-1 0.592 vs 0.529; BERTScore F1 0.906 vs 0.831

AF-Retriever: a hybrid, LLM-driven pipeline that combines graph constraints and vector search to improve multi-hop QA over semi-structured K

0.70

0.60

0

If your app uses mixed graph and text data, AF-Retriever gives large zero-shot retrieval gains and traceable answers without domain fine-tuning, saving dataset curation time while improving top-1 accuracy.

Key finding

AF-Retriever substantially improves first-hit rates over previous zero-/one-shot methods on STaRK.

Numbers: Avg hit@1 increase vs second-best = 32.1% (abstract); AF-Retriever hit@1: 62.0% (Table 2 avg synthetic).

RAG + a 10M‑token Vedanta corpus cuts hallucinations for niche long‑form QA

0.60

0.50

0

If you build knowledge products for niche domains, pair RAG with keyword-aware retrieval to raise factuality and credibility without expensive model retraining.

Key finding

RAG responses were strongly preferred over non-RAG responses by experts.

Numbers: 81% preference rate (reported by domain experts)

SKETCH: combine semantic chunking with knowledge graphs to improve RAG retrieval for complex, multi-context queries

0.60

0.50

0.40

0

SKETCH gives more accurate, context-preserving retrieval for complex, multi-part queries, which improves downstream answers and traceability at the cost of higher KG construction and LLM use.

Key finding

On the small Italian Cuisine test, SKETCH reached very high relevancy and precision.

Numbers: answer_relevancy=0.94; context_precision=0.99

XRAG: open-source toolkit and benchmark that tests pre‑retrieval, retrieval, post‑retrieval, and generation modules in RAG

0.70

0.40

0.60

0

XRAG helps teams identify which retrieval or reranking change actually improves end-to-end QA accuracy, reducing guesswork and wasted engineering time when deploying RAG-powered search or assistant features.

Key finding

Combining hybrid retrieval (BM25 + vector) with re-ranking yields large retrieval gains.

Numbers: F1: 0.975 vs baseline 0.740 (Table 12)
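One common way to combine a BM25 ranking with a vector ranking is reciprocal rank fusion (RRF), sketched below. This is an illustrative stand-in: the summary above does not say how XRAG fuses the two result lists, and the re-ranking stage is omitted. The document IDs are hypothetical, and k=60 is simply the value conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically by ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a keyword retriever and a dense retriever.
bm25_ranking = ["d2", "d1", "d4"]
dense_ranking = ["d1", "d3", "d2"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers return rise to the top, which is the intuition behind hybrid retrieval: lexical and semantic matches cover each other's misses. A cross-encoder or similar re-ranker would then reorder the fused list.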

ARCoT: hybrid retrieval + step-back + chain-of-thought boosts LLMs on a medical physics exam to 90%

0.60

0.50

0

You can raise domain accuracy of off-the-shelf LLMs without costly fine-tuning by adding a small retrieval corpus, re-ranking, and stepwise prompting, lowering risk and time-to-value for domain AI features.

Key finding

GPT-4 score rose from 67% (base) to 90% with ARCoT.

Numbers: 67% → 90% (+23 percentage points)

M2A: editable dual-layer multimodal memory for evolving personalization

0.60

0.70

0.60

0

Editable multimodal memory lets assistants evolve with users (names, images, preferences) and yields measurable accuracy gains on long conversations—helpful for retention and personalized UX.

Key finding

M2A improves average correctness on the enhanced LoCoMo benchmark versus a single-pass RAG baseline.

Numbers: GPT-4o-mini Avg: M2A 44.64% vs RAG 33.27% (≈+11.4 pp)

Compile NL queries into DAG plans to orchestrate parallel, auditable QA across SQL and vector stores

0.70

0.66

0.60

0

A.DOT reduces over-retrieval and exposes verifiable evidence for each step, cutting unnecessary data exposure and improving multi-step QA accuracy—helpful for compliance-heavy enterprise queries.

Key finding

A.DOT improves answer correctness and completeness on HybridQA dev vs Standard RAG.

Numbers: Correctness +14.8 pp, Completeness +10.7 pp (Table 1)
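The DAG-plan idea can be illustrated with a small executor that runs steps in dependency order and records which steps were ready at the same time (and so could run in parallel). This is a sketch under stated assumptions, not A.DOT's implementation: the plan schema, step names, tool names, and stub handlers are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_plan(plan, handlers):
    """Execute a plan of the form {step_id: {"tool": str, "deps": [step_id]}}.

    Returns (results, trace). trace lists each batch of steps that became
    ready together, which doubles as an audit log of execution order.
    """
    sorter = TopologicalSorter({s: spec["deps"] for s, spec in plan.items()})
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    results, trace = {}, []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # independent steps
        trace.append(ready)
        for step in ready:  # run sequentially here; a pool could parallelize
            spec = plan[step]
            inputs = {dep: results[dep] for dep in spec["deps"]}
            results[step] = handlers[spec["tool"]](step, inputs)
            sorter.done(step)
    return results, trace


# Hypothetical plan: one SQL lookup and one vector search feed a final answer.
plan = {
    "sql_lookup": {"tool": "sql", "deps": []},
    "doc_search": {"tool": "vector", "deps": []},
    "answer": {"tool": "compose", "deps": ["sql_lookup", "doc_search"]},
}
# Stub handlers standing in for real SQL, vector-store, and LLM calls.
handlers = {
    "sql": lambda step, inputs: "rows",
    "vector": lambda step, inputs: "passages",
    "compose": lambda step, inputs: f"answer from {sorted(inputs)}",
}
results, trace = run_plan(plan, handlers)
```

Each step only receives the outputs of its declared dependencies, which is one way to limit over-retrieval and keep the evidence for every step inspectable.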

Use multi-agent RAG plus a hybrid vector-graph memory to auto-generate traceable test plans and cases, cutting test-document work by ~85% in

0.70

0.60

0.80

0

Automates time-consuming test-document work, preserves traceability for regulated enterprise projects, and can shrink timelines and costs if you can manage integration and KB upkeep.

Key finding

Agentic multi-agent RAG improves test artifact accuracy compared to Basic RAG.

Numbers: Basic RAG 65.2% → Agentic RAG 94.8%

ONCOTIMIA: a RAG-powered tool that auto-fills lung cancer tumour-board forms with ~80% field accuracy

0.60

0.40

0.60

0

Automating tumour-board form filling can cut clinician paperwork and speed case preparation while keeping traceable links to source notes, but expect human review for ~20% of fields.

Key finding

Best model achieved 80% correct field completion on evaluated cases.

Numbers: Pixtral-large-2502-v1 mean accuracy = 80%

HybridRAG-Bench: contamination-aware tests that force retrieval + multi-hop reasoning over text + knowledge graphs

0.70

0.60

0

If your product needs up-to-date or relational knowledge, rely on retrieval and structured graphs; LLMs prompted without retrieval lean on their pretraining data and miss recent multi-hop facts.

Key finding

LLM-only prompting performs poorly on up-to-date, multi-hop questions.

Numbers: LLM-only accuracy ≈ 23–40% (Table 3)

Grounding LMs with OpenMath improves math reasoning when retrieval is good

0.40

0.60

0.50

0

Formal ontologies can make smaller models more dependable in specialist tasks, but only when retrieval reliably finds relevant definitions; otherwise, augmentation can reduce trust and accuracy.

Key finding

OpenMath coverage is limited: only a minority of problems have high-quality matches.

Numbers: 24.2% problems with max reranker score ≥ 0.5; mean max score 0.2715

Retrieve both claim and its negation from multiple sources, aggregate evidence, and use LLM log-probs to expose cross-source disagreement.

0.50

0.40

0

Aggregating evidence from multiple sources and retrieving negated queries expands coverage and surfaces disagreements, improving zero-shot claim checks and making automated decisions more transparent.

Key finding

Retrieving both the claim and its negation (dual-perspective) improves zero-shot verification.

Numbers: +2–10% accuracy; +2–8% macro-F1 (typical gains across datasets)
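One way to picture the aggregation step: for each source, compare the LLM's log-probability that the retrieved evidence supports the claim with its log-probability that the evidence supports the negation, then look at how the per-source margins line up. The scoring rule, source names, and log-prob values below are assumptions made for illustration, not the paper's formulation.

```python
def aggregate_evidence(per_source):
    """Aggregate per-source log-probs into a verdict and a disagreement flag.

    per_source maps a source name to (logp_claim, logp_negation): the
    log-probabilities an LLM assigns to the claim and to its negation
    being supported by that source's retrieved evidence.
    """
    margins = {s: c - n for s, (c, n) in per_source.items()}
    mean_margin = sum(margins.values()) / len(margins)
    supporters = sorted(s for s, m in margins.items() if m > 0)
    refuters = sorted(s for s, m in margins.items() if m < 0)
    return {
        "verdict": "supported" if mean_margin > 0 else "refuted",
        "mean_margin": mean_margin,
        "supporters": supporters,
        "refuters": refuters,
        # Sources pulling in opposite directions signal cross-source conflict.
        "disagreement": bool(supporters and refuters),
    }


# Hypothetical log-probs from three sources.
report = aggregate_evidence({
    "source_a": (-0.2, -1.9),
    "source_b": (-0.4, -1.2),
    "source_c": (-2.1, -0.3),
})
```

Surfacing the dissenting sources alongside the verdict is what makes the decision easier to audit than a single aggregated score.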

HELP: HyperNode Expansion + Logical Path-Guided Localization for faster, more accurate GraphRAG

0.70

0.60

0.80

0

HELP preserves graph-style multi-hop accuracy while cutting retrieval latency up to ~28.8× on tested QA tasks, letting teams deploy knowledge-grounded LLMs at much lower cost and with faster response times.

Key finding

HELP matches or slightly improves top GraphRAG accuracy while being much faster.

Numbers: Avg F1 55.3 vs HippoRAG2 54.6 on multiple QA datasets (Table 1)