Plug small, specialized LMs ('knowledge cards') into black‑box LLMs to add updatable domain knowledge

May 17, 2023 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

4

Authors

Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Balachandran, Tianxing He, Yulia Tsvetkov

Links

Abstract / PDF

Why It Matters For Business

You can update or patch a deployed black-box LLM by adding small domain models instead of retraining the large model, which cuts the cost and latency of knowledge updates.

Summary TLDR

The paper introduces Knowledge Card: small, specialized language models trained on domain corpora that are invoked at inference time to supply background knowledge to a larger black‑box LLM. Three selectors (relevance, pruning/summarization, factuality) filter the generated content. Two integration modes are offered: bottom-up (activate many cards, then filter) and top-down (the LLM decides whether a card is needed and which one to call). On benchmarks the method improves over Codex: +6.6% accuracy on MMLU, at least +31.7% balanced accuracy on misinformation detection, and up to +57.3% exact match on a curated 2022 midterm QA dataset, demonstrating modular, low-cost knowledge updates.

Problem Statement

Large general-purpose LLMs are costly to retrain and hold static knowledge. They hallucinate, miss long-tail facts, and cannot be quickly updated. The paper asks: how can we cheaply and modularly add correct, up-to-date domain knowledge to black-box LLMs without retraining them?

Main Contribution

Knowledge cards: small specialized LMs trained on domain corpora to serve as plug-in knowledge sources.

Three content selectors: relevance (embedding similarity), pruning (summarization), and factuality (summary factuality + retrieval-based fact check with top-k sampling).

Two integration modes: bottom-up (activate many cards, then filter) and top-down (LLM decides iteratively whether and which card to call).

Empirical study on 6 datasets showing consistent gains vs. vanilla, retrieval-augmented, and generated-knowledge baselines.

MIDTERMQA: a curated dataset that tests temporal knowledge updates and shows that a 1.3B-parameter card can patch a 175B LLM on recent events.
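
As a rough illustration of the bottom-up mode with the relevance selector, the sketch below ranks candidate card outputs by similarity to the query and prepends the best ones to the prompt. It is a minimal sketch under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real sentence encoder, and `select_relevant`, `build_prompt`, and the prompt layout are hypothetical names chosen for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_relevant(query: str, documents: list[str], keep: int) -> list[str]:
    """Relevance selector: keep the `keep` documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:keep]


def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the surviving documents to the question (hypothetical layout)."""
    knowledge = "\n".join(f"Knowledge: {d}" for d in documents)
    return f"{knowledge}\nQuestion: {query}\nAnswer:"


# Pretend each string was generated by a different knowledge card.
card_outputs = [
    "the senate race in georgia went to a runoff",
    "protein folding depends on amino acid sequence",
    "georgia held a senate runoff election in december",
]
query = "who won the georgia senate runoff"
kept = select_relevant(query, card_outputs, keep=2)
prompt = build_prompt(query, kept)
```

In a fuller pipeline the pruning and factuality selectors would run on `kept` before the prompt is built; they are omitted here to keep the example short.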

Key Findings

Knowledge Card improves a black-box LLM (Codex) on general knowledge QA (MMLU).

Numbers: MMLU overall accuracy, Codex -> KNOWLEDGE CARD (top-down, explicit): +6.6%

Multi-card synthesis helps multi-domain tasks like misinformation detection.

Numbers: Balanced accuracy gains of at least 31.7% over Codex on LUN misinformation detection (2-way and 4-way settings)

A single small knowledge card can update temporal facts for a large LLM.

Numbers: MIDTERMQA open-book exact match, Codex -> KNOWLEDGE CARD: up to +57.3% EM

Factuality filtering is the most impactful selector.

Numbers: In the ablation study, removing the factuality selector causes the largest performance drop on misinformation detection
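
The factuality selector is described above as using top-k sampling. One plausible reading, sketched below, is to discard everything outside the k highest factuality scores and then sample from the survivors with probability proportional to exp(score), so clearly wrong documents are excluded while borderline ones still get a chance. This is an assumption about the mechanism, not a confirmed description of the paper's procedure; the function name `top_k_factuality_sample` and the example scores are hypothetical, and the scores themselves are assumed to come from an upstream fact-checking step.

```python
import math
import random


def top_k_factuality_sample(scored_docs, k, n, seed=0):
    """Keep the k highest-scoring documents, then sample n of them without
    replacement, weighting each by exp(score).

    scored_docs: list of (document, factuality_score) pairs.
    """
    rng = random.Random(seed)  # seeded for reproducible examples
    pool = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k]
    chosen = []
    while pool and len(chosen) < n:
        weights = [math.exp(score) for _, score in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx)[0])
    return chosen


scored = [
    ("doc about a verified result", 0.9),
    ("doc with a plausible claim", 0.6),
    ("doc contradicted by retrieval", 0.1),
]
sampled = top_k_factuality_sample(scored, k=2, n=2)
```

With `k=2` the lowest-scoring document can never be selected, whichever order the other two are drawn in.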

Results

Accuracy (MMLU)

Value: top-down (explicit), +6.6% vs. Codex

Baseline: Codex (vanilla)

Balanced accuracy (LUN misinformation detection)

Value: best bottom-up setting, at least +31.7% vs. Codex

Baseline: Codex (vanilla)

Exact Match (MIDTERMQA open-book)

Value: bottom-up/top-down, up to +57.3% EM vs. Codex

Baseline: Codex (vanilla)

Selector ablation

Value: removing the factuality selector yields the largest drop

Baseline: bottom-up with all selectors
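
For readers less familiar with the metrics reported here, the sketch below shows the standard definitions of exact match and balanced accuracy (the mean of per-class recall). The answer normalization in `exact_match` (lowercasing and whitespace collapsing) is an assumption made for illustration; this summary does not state which normalization the paper uses.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """True when the answers match after simple normalization (assumed here)."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return normalize(prediction) == normalize(gold)


def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class looks fine on plain
# accuracy (3 of 4 correct) but scores only 0.5 on balanced accuracy.
y_true = ["real", "real", "real", "fake"]
y_pred = ["real", "real", "real", "real"]
score = balanced_accuracy(y_true, y_pred)
```

Balanced accuracy is a natural fit for misinformation detection, where class sizes are often uneven and a majority-class guesser would otherwise look deceptively strong.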

Who Should Care

What To Try In 7 Days

Train one small (≈1.3B) knowledge card on a recent domain corpus and plug it into your LLM via top-down to fix temporal errors.

Add a factuality filter (retrieval + fact-check scoring) to any generated external content before prepending it to prompts.

Run bottom-up multi-card tests for multi-domain tasks and compare to single retrieval corpora.
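
To make the top-down experiment concrete, here is a minimal sketch of the ask-then-call loop: the LLM is first asked whether it needs more information, then which card to use, and the card's output is added to the context before the final answer. Everything here is hypothetical scaffolding: `top_down_answer`, the prompt wording, the `max_rounds` cap, and the `fake_llm` stub (which stands in for a real black-box LLM call) are assumptions for illustration, not the paper's exact prompts or code.

```python
from typing import Callable, Dict


def top_down_answer(
    question: str,
    llm: Callable[[str], str],
    cards: Dict[str, Callable[[str], str]],
    max_rounds: int = 3,
) -> str:
    """Let the LLM decide whether, and from which card, to pull knowledge."""
    context = []
    for _ in range(max_rounds):
        prompt = "\n".join(context + [f"Question: {question}"])
        need = llm(prompt + "\nDo you need more information? (Yes or No)")
        if need.strip().lower().startswith("no"):
            break
        options = ", ".join(sorted(cards))
        choice = llm(prompt + f"\nWhich source would help? Options: {options}").strip()
        if choice not in cards:
            break  # unrecognized card name: answer with what we have
        context.append(f"Knowledge: {cards[choice](question)}")
    return llm("\n".join(context + [f"Question: {question}", "Answer:"]))


def fake_llm(prompt: str) -> str:
    """Deterministic stub standing in for a black-box LLM API call."""
    if prompt.endswith("(Yes or No)"):
        return "No" if "Knowledge:" in prompt else "Yes"
    if "Which source" in prompt:
        return "news"
    return "answered with context" if "Knowledge:" in prompt else "answered without context"


cards = {"news": lambda q: "stub document about the 2022 midterm elections"}
result = top_down_answer("Who won the Georgia Senate runoff?", fake_llm, cards)
```

Capping the loop with `max_rounds` bounds the number of extra model calls, which matters for the latency concerns discussed under Risks & Boundaries.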

Agent Features

Memory

  • parametric knowledge stored in small cards

Planning

  • iterative ask-and-retrieve (top-down loop)

Tool Use

  • invoke external specialized LMs as plugins

Frameworks

  • bottom-up
  • top-down

Is Agentic

true

Architectures

  • black-box LLM + small specialized LMs (knowledge cards)

Collaboration

  • community-contributed knowledge cards

Optimization Features

Token Efficiency

  • pruning selector summarizes cards to fit context length

Training Optimization

  • branch-train-merge style independent card training (cited)

Inference Optimization

  • selective activation (top-down) to limit calls
  • n1/n2/n3 hyperparameters to control prompt size

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Knowledge cards (1.3B OPT) can produce low-quality or off-topic text; selector system mitigates but does not fully remove errors.
  • Factuality selector favors domains with good retrieval coverage (Wikipedia) and can underrate new/emerging facts.
  • The top-down yes/no decision is imperfect; LLMs are sometimes overconfident and skip external knowledge they need.
  • Risk of malicious or low-quality community-contributed cards changing system behavior.

When Not To Use

  • When tasks do not need external factual knowledge and extra calls risk adding noise.
  • When you cannot curate or vet contributed cards (governance risk).
  • When real-time, high-throughput latency constraints forbid extra model calls.

Failure Modes

  • Injected cards supply false or biased facts and mislead the LLM.
  • Selectors prune away crucial context or keep hallucinated content, causing wrong outputs.
  • The LLM declines external knowledge (answers No) even though it lacks the needed facts (overconfidence).
  • Heterogeneous card quality leads to unstable system performance.

Core Entities

Models

  • CODEX (code-davinci-002)
  • OPT-1.3B (knowledge cards base)
  • PaLM
  • Flan-PaLM
  • TEXT-DAVINCI-003
  • GPT-3.5-TURBO
  • REPLUG
  • REPLUG LSR
  • ATLAS
  • GKP
  • RECITATION
  • GRTR

Metrics

  • Accuracy
  • macro F1
  • exact match
  • F1

Datasets

  • MMLU
  • LUN misinformation
  • MIDTERMQA (curated by authors)

Benchmarks

  • MMLU
  • LUN misinformation detection
  • MIDTERMQA

Context Entities

Models

  • OPT (other sizes)
  • MoE
  • Adapters/parameter averaging literature

Datasets

  • The Pile
  • Wikipedia (used in cards and comparisons)
  • news corpora (used for midterm card)