Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
2
Why It Matters For Business
You can replace multiple modality-specific retrievers with one LLM-based generative retriever that scales to millions of documents, improves precision, and needs only light fine-tuning, lowering engineering and data costs.
Summary TLDR
GeMKR is an end-to-end generative retriever for multi-modal queries. It keeps the LLaMA weights frozen, adapts them with LoRA, and adds a visual front-end (CLIP features plus object-aware prefix tuning) so the model generates short, document-unique text snippets called "knowledge clues." Decoding is constrained by an FM-Index so that each clue maps to exactly one KB document. On three benchmarks (KB sizes 112K–21M) GeMKR improves retrieval metrics by 3.0–14.6% over strong baselines (e.g., P@5 49.1 vs 41.7). The system is trainable with ~20K instruction samples, updates ~14M parameters, and trains in ~3 hours on a single A6000 48GB GPU.
Problem Statement
Multi-modal retrieval usually stitches together separate text and image retrievers. That approach is data-hungry and captures cross-modal interactions poorly. We need a single, efficient retriever that handles text+image queries, generalizes to large KBs, and keeps training costs low.
Main Contribution
A generative retrieval pipeline (GeMKR) that produces short text "knowledge clues" which are then looked up in an indexed KB.
Object-aware prefix-tuning and a dual-flow attention mechanism to align multi-grained image features with an LLM.
Instruction tuning of a frozen LLaMA using LoRA to leverage LLM knowledge while updating only ~14M params.
Knowledge-guided constrained decoding using an FM-Index so each generated clue maps uniquely to one document.
Empirical gains across three multimodal retrieval benchmarks, including large-scale KBs (21M).
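To make the FM-Index lookup concrete, here is a minimal pure-Python sketch of backward search, the operation that counts how often a substring occurs in indexed text. It is a teaching sketch only: GeMKR is reported to use sdsl-lite, and the function names `build_fm_index` and `count_occurrences` are hypothetical. The naive suffix-array construction would not scale to a real KB.

```python
def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence counts over the BWT)."""
    text += "\0"  # sentinel that sorts before every other character
    # Naive suffix array: fine for a demo, far too slow for a real corpus.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (index -1 wraps to the sentinel).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch] = number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i] = occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of `pattern` by backward search over the index."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("banana")
ana_count = count_occurrences(index, "ana")
```

The search touches the index once per pattern character, independent of corpus size, which is why this structure suits per-step checks during decoding.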
Key Findings
GeMKR raises P@5 on OKVQA-GS112K to 49.1, beating ReViz-ICT (41.7).
On a 21M-document KB, GeMKR improves P@5 by 14.6% and R@5 by 8.9% over baselines.
Constrained generation of short knowledge clues yields much higher recall than alternatives.
Training is efficient: ~20K instruction examples, LoRA updates ~14M params, and training takes under 3 hours on one A6000 48GB GPU.
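For a rough sense of why LoRA keeps the trainable parameter count small, the arithmetic can be sketched as below. A rank-r adapter on a d_out × d_in weight adds r · (d_in + d_out) parameters. The rank and the choice of two adapted matrices per layer are assumptions for illustration, not the paper's configuration, so the result is not expected to reproduce the ~14M figure (which may also include non-LoRA trainable parts). The hidden size 4096 and 32 layers are the public LLaMA-7B dimensions.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def total_lora_params(hidden_size, num_layers, matrices_per_layer, rank):
    """Total LoRA parameters when adapting square projections in every layer."""
    per_matrix = lora_param_count(hidden_size, hidden_size, rank)
    return num_layers * matrices_per_layer * per_matrix


# Assumed setup: rank 8 on two attention projections per layer (hypothetical).
example_total = total_lora_params(hidden_size=4096, num_layers=32,
                                  matrices_per_layer=2, rank=8)
```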
Results
P@5 (OKVQA-GS112K)
P@5 (OKVQA-WK21M, 21M KB)
P@1 (ReMuQ)
R@5 (Knowledge clue constrained decoding)
Who Should Care
What To Try In 7 Days
Index a small document corpus with an FM-Index and try constrained decoding to map generated substrings to docs.
Fine-tune a frozen LLaMA-7B with LoRA and feed CLIP image embeddings via a projection layer to generate short knowledge clues.
A/B test clue-based retrieval vs your existing pipeline on a subset of multimodal queries to measure precision and recall.
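For the A/B test, a small evaluation harness might look like the sketch below. It uses the textbook definitions of precision@k and recall@k; the benchmarks named in this summary may define these metrics differently (for example via pseudo-relevance), so treat the numbers as comparable only within your own experiment. All function and variable names are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def mean_metrics(runs, qrels, k=5):
    """Average P@k and R@k over queries.

    runs:  {query_id: ranked list of doc ids} from one system
    qrels: {query_id: set of relevant doc ids}
    """
    p = [precision_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items()]
    r = [recall_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items()]
    return {"P@k": sum(p) / len(p), "R@k": sum(r) / len(r)}


# Toy data: the same queries scored for the existing and the clue-based pipeline.
qrels = {"q1": {"d1", "d3"}, "q2": {"d7"}}
run_existing = {"q1": ["d2", "d4", "d1", "d5", "d6"], "q2": ["d8", "d9", "d1", "d2", "d3"]}
run_clues = {"q1": ["d1", "d2", "d3", "d4", "d5"], "q2": ["d8", "d7", "d9", "d1", "d2"]}
report = {"existing": mean_metrics(run_existing, qrels),
          "clues": mean_metrics(run_clues, qrels)}
```

Running both systems over the same query set and the same relevance judgments keeps the comparison fair.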
Agent Features
Memory
- indexed KB via FM-Index (external database)
Tool Use
- FM-Index
- sdsl-lite
- YOLOv7
- CLIP
- PyTorch
Architectures
- single LLM with visual projection
Optimization Features
Token Efficiency
- generate short discriminative clues instead of long passages
Infra Optimization
- single A6000 48GB GPU training within ~3 hours
Model Optimization
- LoRA
- object-aware prefix-tuning
- dual-flow attention
System Optimization
- FM-Index + sdsl-lite for millisecond-level lookup
Training Optimization
- instruction tuning with ~20K examples
- LoRA
Inference Optimization
- constrained beam search
- FM-Index lookups per decoding step
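The interaction between beam search and per-step index lookups can be sketched as follows. This is a toy under stated assumptions: tokens are whole words, the index lookup is replaced by a linear scan (`allowed_next_tokens`), and `score_fn` is a hypothetical stand-in for the LLM's log-probabilities. It is not the paper's implementation.

```python
def allowed_next_tokens(docs, prefix):
    """Tokens that extend `prefix` to a sequence still present in some document.

    A real system would answer this with an FM-Index query, not a scan.
    """
    n = len(prefix)
    allowed = set()
    for tokens in docs:
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == prefix:
                allowed.add(tokens[i + n])
    return allowed


def constrained_beam_search(docs, score_fn, beam_size=2, max_len=2):
    """Beam search that only ever expands to corpus-supported continuations."""
    beams = [([], 0.0)]  # (token sequence, cumulative score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok in allowed_next_tokens(docs, prefix):
                candidates.append((prefix + [tok], score + score_fn(prefix, tok)))
        if not candidates:
            break  # no beam can be extended inside the corpus
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


docs = [
    "the eiffel tower is in paris".split(),
    "the statue of liberty is in new york".split(),
]


def toy_score(prefix, token):
    """Hypothetical scorer: pretend the model favours 'eiffel' and 'tower'."""
    return 0.0 if token in {"eiffel", "tower"} else -1.0


best_tokens, best_score = constrained_beam_search(docs, toy_score)[0]
```

Because every expansion is filtered by the corpus, each finished beam is guaranteed to be a token sequence that occurs in at least one document, at the cost of one lookup per beam per step.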
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Requires the KB to be indexed (FM-Index) so substring lookups work.
- Clues that appear in multiple documents are discarded, which may lower recall on noisy corpora.
- Relies on LLM parameter scale; small LLMs underperform unless fully tuned with more data.
- Constrained decoding and beam search can add inference complexity compared to simple nearest-neighbor retrieval.
When Not To Use
- You lack a document corpus that can be indexed by substring (e.g., there is no text KB, or the KB is non-textual).
- You cannot run or fine-tune an LLM (resource or policy constraints).
- Your system needs extremely low-latency retrieval without beam search overhead.
Failure Modes
- Generated clue maps to several docs, so the model drops the result and the retrieval is missed.
- Insufficient visual features or bad object detection degrades cross-modal clues.
- Overly strict constraints in decoding can prevent useful but longer evidence from being generated.
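The first failure mode above is easy to reproduce on a toy KB. The sketch below resolves a clue to a document only when exactly one document contains it, and lists the word n-grams of a document that would be safe (unique) clues. It uses plain substring scans and hypothetical helper names; it illustrates the uniqueness requirement and is not the paper's clue-selection procedure.

```python
def resolve_clue(clue, kb):
    """Return the id of the only document containing `clue`, else None."""
    matches = [doc_id for doc_id, text in kb.items() if clue in text]
    return matches[0] if len(matches) == 1 else None


def unique_clues(doc_id, kb, n_words):
    """Word n-grams of one document that occur in no other document."""
    words = kb[doc_id].split()
    clues = []
    for i in range(len(words) - n_words + 1):
        clue = " ".join(words[i:i + n_words])
        if resolve_clue(clue, kb) == doc_id and clue not in clues:
            clues.append(clue)
    return clues


kb = {
    "d1": "the eiffel tower is in paris",
    "d2": "the louvre is in paris",
}
ambiguous = resolve_clue("is in paris", kb)  # found in both documents
safe_clues = unique_clues("d2", kb, n_words=2)
```

On noisy or highly redundant corpora the set of unique clues per document shrinks, which is one way to estimate in advance how much recall the uniqueness constraint may cost.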
Core Entities
Models
- LLaMA-7B
- LLaMA-13B
- CLIP ViT-L/14
- YOLOv7
- LoRA
Metrics
- P@5
- R@5
- R@10
- P@1
Datasets
- OKVQA-GS112K
- OKVQA-WK21M
- ReMuQ
Benchmarks
- OKVQA-GS112K
- OKVQA-WK21M
- ReMuQ

