Plug small, specialized LMs ('knowledge cards') into black‑box LLMs to add updatable domain knowledge

Overview

Decision Snapshot: Needs Validation

The idea is practical and tested on multiple datasets; empirical gains are strong on targeted tasks, but real-world deployment needs robust factuality checks and governance for contributed cards.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Balachandran, Tianxing He, Yulia Tsvetkov

Why It Matters For Business

You can update or patch a deployed black-box LLM by adding small domain models instead of retraining a giant model, cutting cost and latency of knowledge updates.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The paper introduces Knowledge Card: small, specialized language models trained on domain corpora that are invoked at inference time to supply background knowledge to a larger black‑box LLM. Three selectors (relevance, pruning/summary, factuality) clean the generated content. Two integration modes are offered: bottom-up (activate many cards, then filter) and top-down (the LLM decides whether a card is needed and which one to call). On benchmarks the method improves Codex by +6.6% accuracy on MMLU, by at least +31.7% balanced accuracy on misinformation detection, and by up to +57.3% exact match on a curated 2022 midterm QA dataset, which supports the case for modular, low-cost knowledge updates.

Problem Statement

Large general-purpose LLMs are costly to retrain and hold static knowledge. They hallucinate, miss long-tail facts, and cannot be quickly updated. The paper asks: how can we cheaply and modularly add correct, up-to-date domain knowledge to black-box LLMs without retraining them?

Main Contribution

Knowledge cards: small specialized LMs trained on domain corpora to serve as plug-in knowledge sources.

Three content selectors: relevance (embedding similarity), pruning (summarization), and factuality (summary factuality + retrieval-based fact check with top-k sampling).
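As a rough illustration of the relevance selector, the sketch below ranks card-generated documents by cosine similarity to the query and keeps the top k. The summary only says the selector uses embedding similarity, so the encoder is an assumption: the bag-of-words `embed` function is a dependency-free stand-in for a neural encoder, and the names `embed`, `cosine`, and `select_relevant` are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (assumption: a real system would use a neural encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def select_relevant(query, documents, k):
    """Keep the k documents most similar to the query, most similar first."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# Made-up documents standing in for text generated by different cards.
docs = [
    "the senate election results were announced in november",
    "protein folding depends on amino acid sequence",
    "voters chose a new senate majority in the election",
]
top = select_relevant("who won the senate election", docs, k=2)
```

In this toy run the biomedical sentence shares no tokens with the query, so it is the one dropped.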

Key Findings

Knowledge Card improves a black-box LLM (Codex) on general knowledge QA (MMLU).

Numbers: MMLU overall accuracy: Codex -> KNOWLEDGE CARD (top-down exp) +6.6%

Practical Use: Plugging domain cards and using the top-down selector can raise a deployed LLM's general QA accuracy without retraining the big model.

Evidence Ref: Table 1; Section 4

Multi-card synthesis helps multi-domain tasks like misinformation detection.

Numbers: Balanced accuracy gains >=31.7% over Codex on LUN misinformation (2-way/4-way settings)

Practical Use: If your task needs facts from multiple domains, activate several small in-domain cards (bottom-up) instead of relying on a single retrieval corpus.

Evidence Ref: Table 2; Section 4
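The misinformation result is reported as balanced accuracy, which is the mean of per-class recall and is therefore not inflated by a majority class. The sketch below computes it on made-up labels (illustrative only, not data from the paper); the function name is hypothetical.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that appear in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Illustrative labels: four "real" items and two "fake" items.
y_true = ["real", "real", "real", "real", "fake", "fake"]
y_pred = ["real", "real", "real", "real", "real", "fake"]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here recall is 1.0 on "real" and 0.5 on "fake", so balanced accuracy is 0.75, while plain accuracy is 5/6 because the majority class dominates.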

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	top-down (explicit) = +6.6% vs. Codex	Codex (vanilla)	+6.6%	MMLU (5-shot, all tasks)	Table 1; Section 4	Table 1
Balanced accuracy	bottom-up best: ≥ +31.7% vs. Codex	Codex (vanilla)	≥ +31.7%	LUN misinformation (2-way and 4-way, 16-shot)	Table 2; Section 4	Table 2

What To Try In 7 Days

Train one small (≈1.3B) knowledge card on a recent domain corpus and plug it into your LLM via the top-down mode to fix temporal errors.

Add a factuality filter (retrieval + fact-check scoring) to any generated external content before prepending it to prompts.

Run bottom-up multi-card tests for multi-domain tasks and compare to single retrieval corpora.
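For the top-down experiment, the control flow can be sketched as a short loop in which the LLM is asked whether it needs more information and, if so, which card to call. This is a minimal sketch under stated assumptions: `top_down_answer`, the prompt wording, the `max_rounds` cap, and the stub LLM and cards are all hypothetical and are not the authors' prompts or code.

```python
def top_down_answer(question, llm, cards, max_rounds=3):
    """Let the LLM decide whether it needs a card before answering.

    llm:   callable(prompt) -> str (hypothetical interface)
    cards: dict mapping a lowercase card name to callable(question) -> str
    """
    context = []
    for _ in range(max_rounds):
        prompt = "\n".join(context + [
            f"Question: {question}",
            "Do you need more information? Reply 'no' or a card name from: "
            + ", ".join(sorted(cards)),
        ])
        choice = llm(prompt).strip().lower()
        if choice not in cards:
            break  # the LLM declined, or named a card that does not exist
        context.append(f"Knowledge ({choice}): {cards[choice](question)}")
    final_prompt = "\n".join(context + [f"Question: {question}", "Answer:"])
    return llm(final_prompt)

def stub_llm(prompt):
    """Stand-in for a black-box LLM: asks for the news card once, then answers."""
    has_news = "Knowledge (news)" in prompt
    if prompt.endswith("Answer:"):
        return "known" if has_news else "unknown"
    return "no" if has_news else "news"

cards = {
    "news": lambda q: "stub document about the 2022 midterms",
    "biomed": lambda q: "stub biomedical document",
}
answer = top_down_answer("Who won the 2022 race?", stub_llm, cards)
```

In a real test, a selector pass over the card output (relevance, pruning, factuality) would sit between the card call and the `context.append` line.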

Agent Features

Memory

parametric knowledge stored in small cards

Planning

iterative ask-and-retrieve (top-down loop)

Tool Use

invoke external specialized LMs as plugins

Frameworks

bottom-up, top-down

Is Agentic

Yes

Architectures

black-box LLM + small specialized LMs (knowledge cards)

Collaboration

community-contributed knowledge cards

Optimization Features

Token Efficiency

pruning selector summarizes cards to fit context length

Training Optimization

branch-train-merge style independent card training (cited)

Inference Optimization

selective activation (top-down) to limit calls

n1/n2/n3 hyperparameters to control prompt size
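To make the role of n1/n2/n3 concrete, here is a hedged sketch of a bottom-up pass. The summary does not define the three hyperparameters, so their meaning here is an assumption: n1 documents are sampled per card, n2 are kept after the relevance selector, and n3 are kept after the factuality selector. The function name and the stub cards and scorers are hypothetical.

```python
import itertools

def bottom_up_context(question, cards, relevance, summarize, factuality, n1, n2, n3):
    """Generate with every card, then filter down to at most n3 documents.

    Assumed meaning (not stated in the summary): n1 = samples per card,
    n2 = kept after relevance, n3 = kept after factuality.
    """
    docs = [card(question) for card in cards for _ in range(n1)]
    docs = sorted(docs, key=lambda d: relevance(question, d), reverse=True)[:n2]
    docs = [summarize(d) for d in docs]  # pruning step shortens each document
    docs = sorted(docs, key=factuality, reverse=True)[:n3]
    return docs

def make_card(name):
    """Stub card that returns a numbered placeholder document on each call."""
    counter = itertools.count(1)
    return lambda question: f"{name} doc {next(counter)} about {question}"

def stub_relevance(question, doc):
    return 1.0 if doc.startswith("politics") else 0.0

def stub_summarize(doc):
    return doc[:40]

def stub_factuality(doc):
    return len(doc)  # placeholder score, not a real fact check

cards = [make_card("sports"), make_card("politics")]
docs = bottom_up_context("midterm elections", cards, stub_relevance,
                         stub_summarize, stub_factuality, n1=3, n2=4, n3=2)
prompt = "\n".join(f"Knowledge: {d}" for d in docs) + "\nQuestion: midterm elections"
```

Whatever the exact definitions, the final prompt carries at most n3 documents, which is what bounds prompt size.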

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/BunsenFeng/Knowledge-Card

Data URLs

https://github.com/BunsenFeng/Knowledge-Card (authors report MIDTERMQA and resources available)

Risks & Boundaries

Limitations

Knowledge cards (OPT-1.3B) can produce low-quality or off-topic text; the selectors mitigate these errors but do not fully remove them.

Factuality selector favors domains with good retrieval coverage (Wikipedia) and can underrate new/emerging facts.

When Not To Use

When tasks do not need external factual knowledge and extra calls risk adding noise.

When you cannot curate or vet contributed cards (governance risk).

Failure Modes

Injected cards supply false or biased facts and mislead the LLM.

Selectors prune away crucial context or keep hallucinated content, causing wrong outputs.

Core Entities

Models

CODEX (code-davinci-002), OPT-1.3B (knowledge cards base), PaLM, Flan-PaLM, TEXT-DAVINCI-003, GPT-3.5-TURBO, REPLUG, REPLUG LSR, ATLAS, GKP, RECITATION, GRTR

Metrics

Accuracy, macro F1, exact match, F1

Datasets

MMLU, LUN misinformation, MIDTERMQA (curated by authors)

Benchmarks

MMLU, LUN misinformation detection, MidtermQA (MIDTERMQA)

Context Entities

Models

OPT (other sizes), MoE, Adapters/parameter averaging literature

Datasets

The Pile, Wikipedia (used in cards and comparisons), news corpora (used for midterm card)
