Plug small, specialized LMs ('knowledge cards') into black‑box LLMs to add updatable domain knowledge

May 17, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The idea is practical and tested on multiple datasets; empirical gains are strong on targeted tasks, but real-world deployment needs robust factuality checks and governance for contributed cards.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.90

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Balachandran, Tianxing He, Yulia Tsvetkov

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can update or patch a deployed black-box LLM by adding small domain models instead of retraining a giant model, cutting cost and latency of knowledge updates.

Summary TLDR

The paper introduces Knowledge Card: small, specialized language models trained on domain corpora that are invoked at inference time to supply background knowledge to a larger black‑box LLM. Three selectors (relevance, pruning/summarization, factuality) clean the generated content before it reaches the LLM. Two integration modes are offered: bottom-up (activate many cards, then filter) and top-down (the LLM decides whether a card is needed and which one to call). On benchmarks the method improves over Codex by +6.6% accuracy on MMLU, by at least +31.7% balanced accuracy on misinformation detection, and by up to +57.3% exact match on a curated 2022 midterm QA dataset, which suggests that modular, low-cost knowledge updates are feasible.

Problem Statement

Large general-purpose LLMs are costly to retrain and hold static knowledge. They hallucinate, miss long-tail facts, and cannot be quickly updated. The paper asks: how can we cheaply and modularly add correct, up-to-date domain knowledge to black-box LLMs without retraining them?

Main Contribution

Knowledge cards: small specialized LMs trained on domain corpora to serve as plug-in knowledge sources.

Three content selectors: relevance (embedding similarity), pruning (summarization), and factuality (summary factuality + retrieval-based fact check with top-k sampling).
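
The relevance selector described above can be illustrated with a minimal sketch. It assumes a toy bag-of-words embedding in place of a real sentence encoder; the names `embed`, `cosine`, and `select_relevant`, the sample documents, and the way `n2` is used here are illustrative and not taken from the paper's code.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_relevant(query, documents, n2):
    """Keep the n2 documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:n2]


# Made-up documents standing in for text generated by knowledge cards.
docs = [
    "the senate election results were certified in november",
    "protein folding depends on amino acid sequence",
    "voters in the midterm election chose a new senate",
]
top = select_relevant("who won the senate election", docs, n2=2)
```

With these toy inputs the off-topic biology sentence is dropped. A production system would likely swap `embed` for a learned encoder, which is outside the scope of this sketch.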

Key Findings

Knowledge Card improves a black-box LLM (Codex) on general knowledge QA (MMLU).

Numbers: MMLU overall accuracy, Codex -> KNOWLEDGE CARD (top-down explicit): +6.6%

Practical Use: Plugging in domain cards and using the top-down selector can raise a deployed LLM's general QA accuracy without retraining the big model.

Evidence Ref: Table 1; Section 4
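
A rough sketch of what a top-down loop could look like, under stated assumptions: `ask_llm` is a hypothetical stand-in for a black-box LLM API, each card is a plain callable, and the prompt wording is invented for illustration. It is not the authors' implementation.

```python
def top_down_answer(question, ask_llm, cards, max_rounds=3):
    """Let the LLM decide whether to consult a card before answering.

    ask_llm: callable(prompt) -> str, a hypothetical black-box LLM wrapper.
    cards:   dict mapping a domain name to callable(question) -> str.
    """
    context = []
    for _ in range(max_rounds):
        prompt = "\n".join(context + [question, "Do you need more information? (yes/no)"])
        if ask_llm(prompt).strip().lower() != "yes":
            break
        choice_prompt = "\n".join(
            context + [question, "Which domain? Options: " + ", ".join(sorted(cards))]
        )
        domain = ask_llm(choice_prompt).strip().lower()
        if domain not in cards:
            break  # the LLM named an unknown card; stop asking
        context.append("Knowledge: " + cards[domain](question))
    return ask_llm("\n".join(context + [question, "Answer:"]))


def fake_llm(prompt):
    """Scripted stub so the sketch runs without any API access."""
    if prompt.endswith("Answer:"):
        return "Candidate A" if "Knowledge:" in prompt else "unknown"
    if "Which domain?" in prompt:
        return "news"
    # Yes/no question: ask for help only until some knowledge is present.
    return "no" if "Knowledge:" in prompt else "yes"


# Toy cards with made-up content.
cards = {
    "news": lambda q: "Candidate A won the 2022 race.",
    "biology": lambda q: "Cells divide.",
}
answer = top_down_answer("Who won the 2022 race?", fake_llm, cards)
```

The `max_rounds` cap is a design choice of this sketch: it bounds the number of card calls even if the LLM keeps asking for more information.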

Multi-card synthesis helps multi-domain tasks like misinformation detection.

Numbers: Balanced accuracy gains >=31.7% over Codex on LUN misinformation (2-way/4-way settings)

Practical Use: If your task needs facts from multiple domains, activate several small in-domain cards (bottom-up) instead of relying on a single retrieval corpus.

Evidence Ref: Table 2; Section 4
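
The bottom-up flow (activate every card, then filter) might be sketched as follows. All callables are hypothetical stubs, and the reading of `n1`, `n2`, and `n3` as "documents generated per card", "documents kept after relevance", and "documents kept after factuality" is an assumption of this sketch.

```python
def bottom_up_prompt(question, cards, relevance, prune, factuality, n1=2, n2=3, n3=2):
    """Query every card, then filter the generated text with three selectors.

    cards:      list of callables(question, k) -> list of k generated documents
    relevance:  callable(question, doc) -> float
    prune:      callable(doc) -> shorter doc
    factuality: callable(doc) -> float
    """
    # 1. Every card generates n1 candidate documents.
    candidates = [doc for card in cards for doc in card(question, n1)]
    # 2. Relevance selector: keep the n2 documents closest to the question.
    relevant = sorted(candidates, key=lambda d: relevance(question, d), reverse=True)[:n2]
    # 3. Pruning selector: shorten each surviving document.
    pruned = [prune(d) for d in relevant]
    # 4. Factuality selector: keep the n3 highest-scoring documents.
    factual = sorted(pruned, key=factuality, reverse=True)[:n3]
    # 5. Prepend the surviving knowledge to the question.
    return "\n".join(["Knowledge: " + d for d in factual] + ["Question: " + question])


def news_card(q, k):
    """Stub card that returns k placeholder news documents."""
    return ["election news item %d" % i for i in range(k)]


def bio_card(q, k):
    """Stub card that returns k placeholder biology documents."""
    return ["biology fact %d" % i for i in range(k)]


prompt = bottom_up_prompt(
    "election outcome?",
    cards=[news_card, bio_card],
    relevance=lambda q, d: float("election" in d),
    prune=lambda d: d[:40],
    factuality=lambda d: 1.0,
    n1=2, n2=2, n3=1,
)
```

In this toy run only one news document survives, so the biology card adds cost but no prompt text. That is the trade-off the bottom-up mode accepts in exchange for multi-domain coverage.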

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | top-down (explicit): +6.6% vs. Codex | Codex (vanilla) | +6.6% | MMLU (5-shot, all tasks) | Table 1; Section 4 | Table 1
Accuracy | bottom-up (best): >= +31.7% vs. Codex | Codex (vanilla) | >=31.7% BAcc improvement | LUN misinformation (2-way and 4-way, 16-shot) | Table 2; Section 4 | Table 2
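
The second row reports balanced accuracy (BAcc). As a reminder of why it differs from plain accuracy, here is a small sketch using the standard definition (mean of per-class recall). It assumes the paper follows that definition; the labels are made-up toy data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# A classifier that always predicts "real" on an imbalanced toy set.
y_true = ["real", "real", "real", "real", "fake"]
y_pred = ["real", "real", "real", "real", "real"]
plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
```

Here plain accuracy is 0.8 while balanced accuracy is 0.5, because recall on the minority class is zero.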

What To Try In 7 Days

Train one small (≈1.3B) knowledge card on a recent domain corpus and plug it into your LLM via top-down to fix temporal errors.

Add a factuality filter (retrieval + fact-check scoring) to any generated external content before prepending it to prompts.

Run bottom-up multi-card tests for multi-domain tasks and compare to single retrieval corpora.
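
For the factuality filter step, one possible shape is sketched below. It assumes two hypothetical scorers that return values in [0, 1], averages them (an assumption made here, not a detail from the paper), and samples from the top-k documents instead of always taking the single best one.

```python
import random


def factuality_filter(docs, summary_score, factcheck_score, k=3, n3=2, seed=0):
    """Score docs, restrict to the top-k, then sample n3 of them by score weight."""
    # Average the two scorers; both are hypothetical callables returning [0, 1].
    scored = [(0.5 * (summary_score(d) + factcheck_score(d)), d) for d in docs]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    rng = random.Random(seed)
    pool = list(top_k)
    chosen = []
    # Weighted sampling without replacement from the top-k pool.
    while pool and len(chosen) < n3:
        weights = [score + 1e-9 for score, _ in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index)[1])
    return chosen


# Toy scores keyed by document id; a real scorer would inspect the text.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
kept = factuality_filter(list(scores), scores.get, scores.get, k=3, n3=2)
```

Sampling within the top-k rather than taking a strict argmax leaves some room for documents that a retrieval-based checker scores slightly lower, while the lowest-scoring document ("d" here) can never be selected.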

Agent Features

Memory
parametric knowledge stored in small cards
Planning
iterative ask-and-retrieve (top-down loop)
Tool Use
invoke external specialized LMs as plugins
Frameworks
bottom-up, top-down
Is Agentic

Yes

Architectures
black-box LLM + small specialized LMs (knowledge cards)
Collaboration
community-contributed knowledge cards

Optimization Features

Token Efficiency
pruning selector summarizes cards to fit context length
Training Optimization
branch-train-merge style independent card training (cited)
Inference Optimization
selective activation (top-down) to limit calls; n1/n2/n3 hyperparameters to control prompt size
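
To make the token-efficiency point concrete, here is a hedged sketch of packing summarized card output under a fixed budget. Word counts stand in for tokens, and `pack_knowledge` and the truncating `summarize` stub are hypothetical helpers, not part of the paper.

```python
def pack_knowledge(docs, summarize, max_words):
    """Summarize each document, then add summaries until the word budget is spent."""
    packed, used = [], 0
    for doc in docs:
        summary = summarize(doc)
        cost = len(summary.split())
        if used + cost > max_words:
            continue  # skip summaries that would overflow the budget
        packed.append(summary)
        used += cost
    return packed


def summarize(doc):
    """Stub summarizer: keep the first five words."""
    return " ".join(doc.split()[:5])


docs = [
    "one two three four five six seven eight",
    "alpha beta gamma delta epsilon zeta eta theta",
    "red orange yellow green blue indigo violet black",
]
packed = pack_knowledge(docs, summarize, max_words=10)
```

With a ten-word budget only the first two five-word summaries fit, so the third document is left out of the prompt.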

Risks & Boundaries

Limitations

Knowledge cards (OPT-1.3B) can produce low-quality or off-topic text; the selectors mitigate these errors but do not fully remove them.

Factuality selector favors domains with good retrieval coverage (Wikipedia) and can underrate new/emerging facts.

When Not To Use

When tasks do not need external factual knowledge and extra calls risk adding noise.

When you cannot curate or vet contributed cards (governance risk).

Failure Modes

Injected cards supply false or biased facts and mislead the LLM.

Selectors prune away crucial context or keep hallucinated content, causing wrong outputs.

Core Entities

Models

CODEX (code-davinci-002), OPT-1.3B (knowledge cards base), PaLM, Flan-PaLM, TEXT-DAVINCI-003, GPT-3.5-TURBO, REPLUG, REPLUG LSR, ATLAS, GKP, RECITATION, GRTR

Metrics

Accuracy, macro F1, exact match, F1
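
For readers unfamiliar with the QA metrics listed here, the sketch below shows exact match and token-level F1 under a commonly used normalization (lowercasing, stripping punctuation and English articles). Whether the paper applies exactly this normalization is an assumption, and the strings are toy examples.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Democratic Party.", "democratic party")
f1 = token_f1("democratic party candidate", "democratic party")
```

Exact match rewards only fully correct answers, while token F1 gives partial credit: the second prediction contains one extra token and scores 0.8.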

Datasets

MMLU, LUN misinformation, MIDTERMQA (curated by authors)

Benchmarks

MMLU, LUN misinformation detection, MidtermQA (MIDTERMQA)

Context Entities

Models

OPT (other sizes), MoE, Adapters/parameter averaging literature

Datasets

The Pile, Wikipedia (used in cards and comparisons), news corpora (used for midterm card)