Generate short, unique text 'knowledge clues' with an LLM and use them to look up documents for multi-modal queries.

January 16, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

2

Authors

Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou

Why It Matters For Business

You can replace multiple modality-specific retrievers with one LLM-based generative retriever that scales to millions of documents, improves precision, and needs only light fine-tuning, lowering engineering and data costs.

Summary TLDR

GeMKR is an end-to-end generative retriever for multi-modal queries. It keeps LLaMA frozen and adapts it with LoRA, adding a visual front-end (CLIP plus object-aware prefix tuning) so the model generates short, document-unique text snippets called "knowledge clues." Decoding is constrained by an FM-Index so that each clue maps to exactly one KB document. On three benchmarks (KB sizes 112K–21M), GeMKR improves retrieval metrics by 3.0–14.6 points over strong baselines (e.g., P@5 of 49.1 vs 41.7 on OKVQA-GS112K). Training uses ~20K instruction samples, updates ~14M parameters, and takes about 3 hours on a single A6000 48GB GPU.

Problem Statement

Multi-modal retrieval usually stitches together separate text and image retrievers. That approach is data-hungry and handles cross-modal interactions poorly. The goal is a single, efficient retriever that handles text+image queries, generalizes to large KBs, and keeps training costs low.

Main Contribution

A generative retrieval pipeline (GeMKR) that produces short text "knowledge clues" which are then looked up in an indexed KB.

Object-aware prefix-tuning and a dual-flow attention trick to align multi-grained image features with an LLM.

Instruction tuning of a frozen LLaMA using LoRA to leverage LLM knowledge while updating only ~14M params.

Knowledge-guided constrained decoding using an FM-Index so each generated clue maps uniquely to one document.

Empirical gains across three multimodal retrieval benchmarks, including large-scale KBs (21M).
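
The constrained-decoding idea in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: a naive scan over whitespace-tokenized documents stands in for the FM-Index, whitespace tokens stand in for LLM subword tokens, and the toy KB, `occurrences`, `allowed_next_tokens`, `docs_containing`, and `constrained_greedy` are hypothetical names introduced here. Greedy search is used instead of beam search to keep the sketch short.

```python
from typing import Callable, Dict, List, Set, Tuple

# Toy knowledge base (hypothetical data): doc id -> text.
KB: Dict[str, str] = {
    "d1": "the eiffel tower is in paris",
    "d2": "the statue of liberty is in new york",
    "d3": "the louvre museum is in paris",
}

# Whitespace tokens stand in for LLM subword tokens (an assumption).
TOKENS: Dict[str, List[str]] = {d: text.split() for d, text in KB.items()}


def occurrences(prefix: List[str]) -> List[Tuple[str, int]]:
    """Return (doc_id, end_position) for every occurrence of `prefix` in the KB.

    A real system would answer this with an FM-Index query instead of a scan.
    """
    n = len(prefix)
    hits = []
    for doc_id, toks in TOKENS.items():
        for start in range(len(toks) - n + 1):
            if toks[start:start + n] == prefix:
                hits.append((doc_id, start + n))
    return hits


def allowed_next_tokens(prefix: List[str]) -> Set[str]:
    """Tokens that keep `prefix` a substring of at least one document."""
    return {TOKENS[d][end] for d, end in occurrences(prefix) if end < len(TOKENS[d])}


def docs_containing(prefix: List[str]) -> Set[str]:
    """Documents in which `prefix` occurs."""
    return {d for d, _ in occurrences(prefix)}


def constrained_greedy(score: Callable[[List[str], str], float], max_len: int = 8) -> List[str]:
    """Greedily extend a clue, choosing only among KB-valid continuations,
    and stop as soon as the clue identifies exactly one document."""
    prefix: List[str] = []
    while len(prefix) < max_len:
        candidates = allowed_next_tokens(prefix)
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda tok: score(prefix, tok))
        prefix.append(best)
        if len(docs_containing(prefix)) == 1:
            break
    return prefix
```

In a real pipeline the `score` callable would come from the LLM's next-token distribution, and the set returned by `allowed_next_tokens` would be used to mask the logits at each decoding step.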

Key Findings

GeMKR raises P@5 on OKVQA-GS112K to 49.1, beating ReViz-ICT (41.7).

Numbers: P@5: 49.1 vs 41.7 (Table 1)

On a 21M-document KB, GeMKR improves P@5 by 14.6 points (46.0 vs 31.4) and R@5 by 8.9 points over baselines.

Numbers: ΔP@5 = +14.6 points, ΔR@5 = +8.9 points (OKVQA-WK21M, Table 1)

Constrained generation of short knowledge clues yields markedly higher recall than generating a document's first sentence.

Numbers: R@5: 78.6 (Knowledge Clue w/ Constraints) vs 62.4 (First Sentence) (Table 4)

Training is efficient: ~20K instruction examples, LoRA updates to ~14M params, and about 3 hours of training on one A6000 48GB GPU.

Numbers: 20K instruction data; ~14M LoRA params; training time ≈3 hours (Implementation Details)

Results

P@5 (OKVQA-GS112K)

Value: 49.1

Baseline: ReViz-ICT 41.7

P@5 (OKVQA-WK21M, 21M KB)

Value: 46.0

Baseline: ReViz-ICT 31.4

P@1 (ReMuQ)

Value: 75.2

Baseline: ReViz-ICT 62.1

R@5 (Knowledge clue constrained decoding)

Value: 78.6

Baseline: First sentence w/ constraints 62.4
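
When reading the figures above, it helps to separate absolute gains (percentage points) from relative gains (percent of the baseline). The short sketch below recomputes both from the reported values; the `gains` helper and the dictionary layout are illustrative names introduced here, not something from the paper.

```python
# Reported (GeMKR value, ReViz-ICT baseline) pairs from the results above.
results = {
    "P@5 OKVQA-GS112K": (49.1, 41.7),
    "P@5 OKVQA-WK21M": (46.0, 31.4),
    "P@1 ReMuQ": (75.2, 62.1),
}


def gains(value: float, baseline: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(value - baseline, 1)
    relative = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative


for name, (value, baseline) in results.items():
    absolute, relative = gains(value, baseline)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

For OKVQA-WK21M this gives +14.6 points, which is a 46.5% relative improvement over the baseline.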

Who Should Care

What To Try In 7 Days

Index a small document corpus with an FM-Index and try constrained decoding to map generated substrings to docs.

Adapt a frozen LLaMA-7B with LoRA and feed CLIP image embeddings via a projection layer to generate short knowledge clues.

A/B test clue-based retrieval vs your existing pipeline on a subset of multimodal queries to measure precision and recall.
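
For the A/B comparison, a small metric harness is enough to get started. The sketch below uses the standard set-based definitions of precision@k and recall@k; this is an assumption, since some retrieval benchmarks define R@k as a hit indicator (at least one relevant document in the top k), so check the definition your benchmark uses. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_metric(
    metric: Callable[[Sequence[str], Set[str], int], float],
    runs: List[Tuple[Sequence[str], Set[str]]],
    k: int,
) -> float:
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical outputs of two pipelines on the same two queries.
pipeline_a = [(["a", "b", "c", "d", "e"], {"a", "c"}), (["x", "y", "z", "p", "q"], {"q"})]
pipeline_b = [(["b", "d", "e", "f", "g"], {"a", "c"}), (["q", "y", "z", "p", "x"], {"q"})]

for name, runs in (("A", pipeline_a), ("B", pipeline_b)):
    print(name, mean_metric(precision_at_k, runs, 5), mean_metric(recall_at_k, runs, 5))
```

Running both pipelines on the same query subset and the same relevance labels keeps the comparison fair.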

Agent Features

Memory

  • indexed KB via FM-Index (external database)

Tool Use

  • FM-Index
  • sdsl-lite
  • YOLOv7
  • CLIP
  • PyTorch

Architectures

  • single LLM with visual projection

Optimization Features

Token Efficiency

  • generate short discriminative clues instead of long passages

Infra Optimization

  • single A6000 48GB GPU training within ~3 hours

Model Optimization

  • LoRA
  • object-aware prefix-tuning
  • dual-flow attention
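
For a sense of scale, LoRA adds `rank * (d_in + d_out)` trainable parameters per adapted weight matrix. The sketch below applies that formula to LLaMA-7B's dimensions (hidden size 4096, 32 layers). The rank and the set of adapted matrices are assumptions chosen for illustration; the source does not state them, so the result is not expected to reproduce the ~14M figure exactly.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    a (rank x d_in) down-projection plus a (d_out x rank) up-projection."""
    return rank * (d_in + d_out)


def total_lora_params(hidden: int, layers: int, matrices_per_layer: int, rank: int) -> int:
    """Total LoRA parameters when adapting square hidden x hidden matrices
    (e.g. attention projections) in every layer."""
    return layers * matrices_per_layer * lora_params(hidden, hidden, rank)


# Assumed configuration: rank 8 on four attention projections per layer.
print(total_lora_params(hidden=4096, layers=32, matrices_per_layer=4, rank=8))
```

Doubling the rank or the number of adapted matrices doubles the count, which is why LoRA budgets stay in the millions even for a 7B-parameter model.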

System Optimization

  • FM-Index + sdsl-lite for millisecond-level lookup

Training Optimization

  • instruction tuning with ~20K examples
  • LoRA

Inference Optimization

  • constrained beam search
  • FM-Index lookups per decoding step

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Requires the KB to be indexed (FM-Index) so substring lookups work.
  • Clues that appear in multiple documents are discarded, which may lower recall on noisy corpora.
  • Relies on LLM parameter scale; small LLMs underperform unless fully tuned with more data.
  • Constrained decoding and beam search can add inference complexity compared to simple nearest-neighbor retrieval.
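
The uniqueness limitation can be explored offline before committing to the approach. The sketch below is a minimal illustration, assuming a small in-memory KB and plain substring matching in place of an FM-Index; `matching_docs` and `shortest_unique_clue` are hypothetical helpers. It shows how some spans of a document never become unique, which is the case where a clue would be discarded.

```python
from typing import Dict, List, Optional

# Toy knowledge base (hypothetical data).
kb: Dict[str, str] = {
    "d1": "the eiffel tower is in paris",
    "d2": "the statue of liberty is in new york",
    "d3": "the louvre museum is in paris",
}


def matching_docs(clue: str, kb: Dict[str, str]) -> List[str]:
    """Documents whose text contains `clue` as a substring."""
    return [doc_id for doc_id, text in kb.items() if clue in text]


def shortest_unique_clue(
    doc_id: str, kb: Dict[str, str], start_word: int, max_words: int
) -> Optional[str]:
    """Grow a clue word by word from `start_word` until it matches only `doc_id`.

    Returns None when no clue of up to `max_words` words is unique.
    """
    words = kb[doc_id].split()
    for n in range(1, max_words + 1):
        clue = " ".join(words[start_word:start_word + n])
        if matching_docs(clue, kb) == [doc_id]:
            return clue
    return None


print(shortest_unique_clue("d1", kb, start_word=0, max_words=4))  # a unique clue exists
print(shortest_unique_clue("d1", kb, start_word=3, max_words=3))  # "is in paris" is shared
```

Measuring how often this returns `None` on a sample of your corpus gives a rough estimate of how much recall the uniqueness requirement may cost.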

When Not To Use

  • You lack a document corpus that can be indexed by substring (for example, the KB has no text content).
  • You cannot run or fine-tune an LLM (resource or policy constraints).
  • Your system needs extremely low-latency retrieval without beam search overhead.

Failure Modes

  • Generated clue maps to several docs, so the model drops the result and the retrieval is missed.
  • Insufficient visual features or bad object detection degrades cross-modal clues.
  • Overly strict constraints in decoding can prevent useful but longer evidence from being generated.

Core Entities

Models

  • LLaMA-7B
  • LLaMA-13B
  • CLIP ViT-L/14
  • YOLOv7
  • LoRA

Metrics

  • P@5
  • R@5
  • R@10
  • P@1

Datasets

  • OKVQA-GS112K
  • OKVQA-WK21M
  • ReMuQ

Benchmarks

  • OKVQA-GS112K
  • OKVQA-WK21M
  • ReMuQ