Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
4
Why It Matters For Business
You can update or patch a deployed black-box LLM by adding small domain models instead of retraining a giant model, which cuts the cost and turnaround time of knowledge updates.
Summary TLDR
The paper introduces Knowledge Card: small, specialized language models trained on domain corpora that are invoked at inference time to supply background knowledge to a larger black-box LLM. Three selectors (relevance, pruning/summary, factuality) clean the generated content. Two integration modes are offered: bottom-up (activate many cards, then filter) and top-down (the LLM decides whether and which cards are needed). On benchmarks the method improves Codex by +6.6% accuracy on MMLU, by at least +31.7% balanced accuracy on misinformation detection, and by up to +57.3% exact match on a curated 2022 midterm QA dataset, demonstrating modular, low-cost knowledge updates.
Problem Statement
Large general-purpose LLMs are costly to retrain and hold static knowledge. They hallucinate, miss long-tail facts, and cannot be quickly updated. The paper asks: how can we cheaply and modularly add correct, up-to-date domain knowledge to black-box LLMs without retraining them?
Main Contribution
Knowledge cards: small specialized LMs trained on domain corpora to serve as plug-in knowledge sources.
Three content selectors: relevance (embedding similarity), pruning (summarization), and factuality (summary factuality + retrieval-based fact check with top-k sampling).
Two integration modes: bottom-up (activate many cards, then filter) and top-down (LLM decides iteratively whether and which card to call).
Empirical study on 6 datasets showing consistent gains vs. vanilla, retrieval-augmented, and generated-knowledge baselines.
MIDTERMQA: a curated dataset that tests temporal knowledge updates and shows that a 1.3B-parameter card can patch a 175B-parameter LLM on recent events.
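The bottom-up mode and the first two selectors described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words cosine similarity stands in for a real embedding model, plain truncation stands in for the summarization-based pruning selector, and every function name, the prompt layout, and the role given to `n2` (number of documents kept after the relevance step) are assumptions made for this example.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def bow(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would use an encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def relevance_select(query: str, docs: List[str], n2: int) -> List[str]:
    """Keep the n2 documents most similar to the query."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)[:n2]


def prune(doc: str, max_words: int = 30) -> str:
    """Stand-in for the pruning selector: truncate instead of summarizing."""
    return " ".join(doc.split()[:max_words])


def bottom_up(
    query: str,
    cards: Dict[str, Callable[[str], List[str]]],
    n2: int,
    max_words: int = 30,
) -> str:
    """Activate every card, filter the outputs, and build the final prompt."""
    candidates = [doc for generate in cards.values() for doc in generate(query)]
    kept = [prune(d, max_words) for d in relevance_select(query, candidates, n2)]
    knowledge = "\n".join(f"Knowledge: {d}" for d in kept)
    return f"{knowledge}\nQuestion: {query}\nAnswer:"
```

Each card is modeled here as a callable that returns a list of generated documents for a query, so a real small LM can be swapped in without changing the pipeline. A factuality step would sit between the relevance filter and prompt construction.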
Key Findings
Knowledge Card improves a black-box LLM (Codex) on general knowledge QA (MMLU).
Multi-card synthesis helps multi-domain tasks like misinformation detection.
A single small knowledge card can update temporal facts for a large LLM.
Factuality filtering is the most impactful selector.
Results
Accuracy
Exact Match (MIDTERMQA open-book)
Selector ablation
Who Should Care
What To Try In 7 Days
Train one small (≈1.3B-parameter) knowledge card on a recent domain corpus and plug it into your LLM via the top-down mode to fix outdated answers.
Add a factuality filter (retrieval + fact-check scoring) to any generated external content before prepending it to prompts.
Run bottom-up multi-card tests for multi-domain tasks and compare to single retrieval corpora.
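For the factuality filter suggested above, one plausible shape is to score each generated document and then sample a few documents from the highest-scoring group. The sketch below assumes a caller-supplied `score_fn` (hypothetical; in practice it would combine summary factuality with a retrieval-based fact check) and treats `k` and `n3` as illustrative parameter names. Sampling in proportion to the score is an assumption of this example, not a confirmed detail of the paper.

```python
import random
from typing import Callable, List, Optional


def factuality_select(
    docs: List[str],
    score_fn: Callable[[str], float],
    k: int,
    n3: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Sample up to n3 documents, without replacement, from the k best-scored ones."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    pool = sorted(docs, key=score_fn, reverse=True)[:k]
    chosen: List[str] = []
    while pool and len(chosen) < n3:
        # Weight by score so better-supported documents are picked more often;
        # the small floor avoids an all-zero weight vector.
        weights = [max(score_fn(d), 1e-9) for d in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)
        chosen.append(pick)
    return chosen
```

Documents outside the top `k` are never selected, so clearly unsupported text is dropped, while sampling inside the top `k` leaves some room for facts that a retrieval corpus covers poorly.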
Agent Features
Memory
- parametric knowledge stored in small cards
Planning
- iterative ask-and-retrieve (top-down loop)
Tool Use
- invoke external specialized LMs as plugins
Frameworks
- bottom-up
- top-down
Is Agentic
true
Architectures
- black-box LLM + small specialized LMs (knowledge cards)
Collaboration
- community-contributed knowledge cards
Optimization Features
Token Efficiency
- pruning selector summarizes cards to fit context length
Training Optimization
- branch-train-merge style independent card training (cited)
Inference Optimization
- selective activation (top-down) to limit calls
- n1/n2/n3 hyperparameters to control prompt size
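Selective activation in the top-down mode can be pictured as a short loop in which the LLM is asked whether it needs more information and, if so, which card to call. The sketch below is a minimal illustration under stated assumptions: `llm` is any callable that maps a prompt string to a response string, each card is a callable that returns one knowledge string, and the prompt wording and the `max_rounds` cap are invented for this example.

```python
from typing import Callable, Dict, List


def top_down(
    query: str,
    llm: Callable[[str], str],
    cards: Dict[str, Callable[[str], str]],
    max_rounds: int = 3,
) -> str:
    """Call cards only when the LLM asks for them, then answer the query."""
    knowledge: List[str] = []
    available = dict(cards)  # copy so the caller's dict is not modified
    for _ in range(max_rounds):
        if not available:
            break
        context = "\n".join(knowledge)
        need = llm(
            f"{context}\nQuestion: {query}\n"
            "Do you need more information? (Yes or No)"
        )
        if not need.strip().lower().startswith("yes"):
            break  # the LLM is confident enough: no further card calls
        choice = llm(
            f"Question: {query}\nChoose one source: {', '.join(available)}"
        ).strip()
        if choice not in available:
            break  # unrecognized card name: stop instead of guessing
        knowledge.append(f"Knowledge: {available.pop(choice)(query)}")
    context = "\n".join(knowledge)
    return llm(f"{context}\nQuestion: {query}\nAnswer:")
```

Because a card is invoked only after a Yes answer, the number of extra model calls grows with how often the LLM asks for help, which is the cost-limiting behavior this mode is meant to provide. The same Yes/No gate is also where the overconfidence failure mode appears.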
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Knowledge cards (OPT-1.3B) can produce low-quality or off-topic text; the selectors mitigate but do not fully remove these errors.
- Factuality selector favors domains with good retrieval coverage (Wikipedia) and can underrate new/emerging facts.
- The top-down yes/no decision is imperfect; LLMs are sometimes overconfident and skip needed external knowledge.
- Risk of malicious or low-quality community-contributed cards changing system behavior.
When Not To Use
- When tasks do not need external factual knowledge and extra calls risk adding noise.
- When you cannot curate or vet contributed cards (governance risk).
- When real-time or high-throughput constraints rule out extra model calls.
Failure Modes
- Injected cards supply false or biased facts and mislead the LLM.
- Selectors prune away crucial context or keep hallucinated content, causing wrong outputs.
- The LLM declines external knowledge (answers No) even though it lacks the needed facts (overconfidence).
- Heterogeneous card quality leads to unstable system performance.
Core Entities
Models
- CODEX (code-davinci-002)
- OPT-1.3B (knowledge cards base)
- PaLM
- Flan-PaLM
- TEXT-DAVINCI-003
- GPT-3.5-TURBO
- REPLUG
- REPLUG LSR
- ATLAS
- GKP
- RECITATION
- GRTR
Metrics
- Accuracy
- macro F1
- exact match
- F1
Datasets
- MMLU
- LUN misinformation
- MIDTERMQA (curated by authors)
Benchmarks
- MMLU
- LUN misinformation detection
- MIDTERMQA
Context Entities
Models
- OPT (other sizes)
- MoE
- Adapters/parameter averaging literature
Datasets
- The Pile
- Wikipedia (used in cards and comparisons)
- news corpora (used for midterm card)

