Steer a frozen language model toward more commonsense using a small auxiliary head and a reference-free scorer

October 25, 2023 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Yufei Tian, Felix Zhang, Nanyun Peng

Why It Matters For Business

BOOST improves generation commonsense without fine-tuning large models, so teams can upgrade deployed LMs cheaply by adding a small controller and a scorer.

Summary TLDR

BOOST is a plug-and-play system that improves how frozen language models (GPT-2, Flan-T5, Alpaca) generate commonsensical sentences from a list of concepts. It builds a reference-free O-Scorer that extracts relation tuples from candidate sentences and checks them against COMET (a dynamic commonsense knowledge base). The scorer labels samples generated by the base LM, and those labeled samples train a small auxiliary NADO head that biases generation at the token level during decoding. On CommonGen and CSK-PN, BOOST raises automatic O-scores and human commonsense ratings while avoiding full model finetuning.
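The scoring idea can be illustrated with a minimal sketch. Everything here is a stand-in: `TOY_KB` replaces COMET, the bag-of-words `embed` replaces a real sentence encoder, tuples are assumed to be already extracted, and averaging the per-tuple similarities is an assumption rather than the paper's stated aggregation.

```python
import math
from collections import Counter

# Hypothetical stand-in for COMET: (head, relation) -> a plausible tail.
TOY_KB = {
    ("knife", "UsedFor"): "cut food",
    ("dog", "CapableOf"): "catch a frisbee",
}

def embed(text):
    """Toy bag-of-words vector; a real scorer would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def o_score(tuples):
    """Average similarity between each extracted tail and the KB's expected tail.

    Tuples with no KB entry are skipped (an assumption made for this sketch).
    """
    sims = []
    for head, rel, tail in tuples:
        expected = TOY_KB.get((head, rel))
        if expected is not None:
            sims.append(cosine(embed(tail), embed(expected)))
    return sum(sims) / len(sims) if sims else 0.0

# Tuples as an extractor might return them for two candidate sentences.
plausible = [("knife", "UsedFor", "cut food")]
implausible = [("knife", "UsedFor", "brush teeth")]
```

A tail that matches what the knowledge base expects scores near 1, while an unrelated tail scores 0. No reference sentence is needed, which is what makes the scorer reference-free.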

Problem Statement

Given a list of concepts, pre-trained LMs often output sentences that break commonsense. Fine-tuning large models is costly. The problem: how to steer a frozen LM to produce more commonsensical, constraint-satisfying sentences without changing its weights.

Main Contribution

A reference-free commonsense scorer (O-Scorer) that extracts relation tuples from a sentence and scores them by grounding to COMET.

A black-box controllable generation pipeline that trains a small auxiliary NADO head on self-sampled outputs to steer a frozen PTLM toward higher O-scores.

Empirical validation on CommonGen and CSK-PN showing consistent gains in automatic O-score and human commonsense judgments across GPT-2, Flan-T5, and Alpaca variants.
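To show how an auxiliary head can steer a frozen model using only its output probabilities, here is a minimal sketch of NADO-style token reweighting. It assumes the head returns, for each candidate next token, an estimate of how likely that continuation is to end in a high-scoring sentence. The function name and the toy numbers are hypothetical, not taken from the paper's code.

```python
def reweight(base_probs, aux_scores):
    """Multiply base next-token probabilities by auxiliary scores and renormalize.

    base_probs: token -> p(token | prefix) from the frozen LM.
    aux_scores: token -> estimated chance that continuing with this token
                leads to a high-scoring sentence (hypothetical head output).
    """
    unnorm = {tok: p * aux_scores[tok] for tok, p in base_probs.items()}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("auxiliary scores rule out every candidate token")
    return {tok: v / total for tok, v in unnorm.items()}

# Toy step: the base LM is indifferent, the auxiliary head prefers "food".
base = {"food": 0.5, "teeth": 0.5}
aux = {"food": 0.9, "teeth": 0.1}
steered = reweight(base, aux)
```

The base model's weights are never touched. Only the small head is trained, and it acts as a multiplicative filter on the distribution the frozen LM already produces.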

Key Findings

Reference-free O-Scorer correlates with human commonsense ratings and matches top reference-based metrics.

Numbers: O-Score (mean) T5: 0.284, GPT-3.5: 0.299, Gold: 0.365; BERTScore-all: 0.302

BOOST raises human commonsense ratings across base models.

Numbers: gpt2 CS: 2.31 → 2.64 (+0.33); Alpaca warm-up CS: 3.02 → 3.36 (+0.34)

Combining lexical checking with O-Scorer (BOOST Joint) preserves constraint coverage better than O-Scorer alone.

Numbers: gpt2 coverage: BOOST CS 90.9% vs BOOST Joint 96.1% (base 90.7%)

Results

O-Score (automatic)

Value: 0.615 (BOOST CS)

Baseline: 0.514 (Base gpt2)

Human commonsense (Likert 1-4)

Value: 2.64 (BOOST CS)

Baseline: 2.31 (Base gpt2)

Keyword coverage

Value: 96.1% (BOOST Joint)

Baseline: 90.7% (Base gpt2)
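To put the three results on a common footing, the sketch below computes absolute and relative gains from the figures reported above. The helper name `gains` is illustrative. Note that the coverage figure is for BOOST Joint, while the other two rows are for BOOST CS.

```python
# (baseline, value) pairs copied from the Results section, all on gpt2.
results = {
    "O-Score": (0.514, 0.615),
    "Human commonsense": (2.31, 2.64),
    "Keyword coverage (%)": (90.7, 96.1),
}

def gains(baseline, value):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    delta = value - baseline
    return round(delta, 3), round(100 * delta / baseline, 1)

for name, (baseline, value) in results.items():
    absolute, relative = gains(baseline, value)
    print(f"{name}: +{absolute} absolute, +{relative}% relative")
```

Read this way, the O-Score gain is roughly 20% relative to the base model and the human rating gain roughly 14%, while coverage rises by 5.4 percentage points.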

Who Should Care

What To Try In 7 Days

Run the O-Scorer on your model's outputs to get a reference-free commonsense baseline.

Generate self-sampled outputs from your frozen LM and train a small NADO head using the O-Scorer as labels.

Use BOOST Joint (lexical check × O-Score) to keep required keywords while improving commonsense.
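The joint signal in the last step can be sketched as a product of a lexical check and a commonsense score. This is an illustration under assumptions: whether the lexical check is all-or-nothing or fractional is not stated here, so both are offered via a hypothetical `require_all` flag, and exact word matching ignores inflected forms such as "dogs" for "dog".

```python
import re

def keyword_coverage(sentence, concepts):
    """Fraction of required concepts that appear as whole words in the sentence."""
    if not concepts:
        return 1.0
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(c.lower() in words for c in concepts) / len(concepts)

def joint_score(sentence, concepts, o_score, require_all=True):
    """Lexical check multiplied by a commonsense score supplied by the caller.

    require_all=True treats the lexical check as binary (assumption);
    require_all=False uses fractional coverage instead.
    """
    coverage = keyword_coverage(sentence, concepts)
    lexical = float(coverage == 1.0) if require_all else coverage
    return lexical * o_score

concepts = ["dog", "frisbee"]
full = joint_score("The dog catches the frisbee.", concepts, o_score=0.8)
missing = joint_score("The dog runs.", concepts, o_score=0.8)
partial = joint_score("The dog runs.", concepts, o_score=0.8, require_all=False)
```

With the binary check, a sentence that drops a keyword scores zero no matter how plausible it is, which is the pressure that keeps coverage high.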

Agent Features

Frameworks

  • NADO (neurally-decomposed oracle)

Architectures

  • transformer decoder (aux head)

Optimization Features

Infra Optimization

  • avoids expensive full-model finetuning; fits on a single 80GB GPU

Model Optimization

  • train only small auxiliary head, not base LM

System Optimization

  • black-box control that only needs access to output probabilities

Training Optimization

  • train on self-sampled model outputs (no external labels needed)
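A minimal sketch of building such a self-labeled training set follows. The sampler and scorer are toy stand-ins for the frozen LM and the O-Scorer, and binarizing scores with a `threshold` is an assumption made for illustration.

```python
import random

def build_training_set(sample_fn, score_fn, concept_sets, n_samples=4, threshold=0.5, seed=0):
    """Sample sentences per concept set and label each one with the scorer.

    sample_fn(concepts, rng) -> sentence   (stand-in for the frozen LM)
    score_fn(sentence) -> float in [0, 1]  (stand-in for the O-Scorer)
    """
    rng = random.Random(seed)
    data = []
    for concepts in concept_sets:
        for _ in range(n_samples):
            sentence = sample_fn(concepts, rng)
            label = int(score_fn(sentence) >= threshold)
            data.append((tuple(concepts), sentence, label))
    return data

# Toy stand-ins so the sketch runs end to end.
def toy_sampler(concepts, rng):
    templates = ["A knife is used to cut food.", "A knife is used to brush teeth."]
    return rng.choice(templates)

def toy_scorer(sentence):
    return 1.0 if "cut food" in sentence else 0.0

data = build_training_set(toy_sampler, toy_scorer, [["knife", "food"]])
```

The resulting (concepts, sentence, label) triples are what the auxiliary head would be trained on. No human annotation enters the loop, so label quality is bounded by the scorer.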

Inference Optimization

  • two forward passes, but the auxiliary head is small, so latency is roughly unchanged

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Tuple extractor covers only four relation types (UsedFor, CapableOf, AtLocation, PartOf).
  • Cosine similarity of sentence embeddings sometimes misaligns with human judgement, which can mislabel training data.
  • Training the auxiliary head needs many self-sampled examples, which raises compute cost even if smaller than finetuning the full LM.

When Not To Use

  • When you can afford to finetune the full model and prefer end-to-end training.
  • When your constraints need relations outside the four covered types (causal, temporal, etc.).
  • When you cannot generate many self-sampled outputs due to API or compute limits.

Failure Modes

  • Scorer errors propagate into the auxiliary model and produce plausible-looking but incorrect outputs.
  • BOOST CS can improve commonsense but drop required keywords (low coverage) unless combined with lexical checks.
  • Sentence embedding similarity mismatches lead to wrong tuple compatibility scores.

Core Entities

Models

  • GPT-2
  • Alpaca-7b
  • Flan-T5-large
  • GPT-3.5-Turbo
  • T5-large
  • COMET
  • NADO (auxiliary head)
  • BOOST (system)

Metrics

  • O-Score
  • BERTScore
  • METEOR
  • BLEU-4
  • Keyword Coverage

Datasets

  • CommonGen
  • CSK-PN