Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
BOOST improves the commonsense quality of generated text without fine-tuning large models, so teams can upgrade deployed LMs cheaply by adding a small controller and a scorer.
Summary TLDR
BOOST is a plug-and-play system that improves how frozen language models (GPT-2, Flan-T5, Alpaca) generate commonsensical sentences from a list of concepts. It builds a reference-free O-Scorer that extracts relation tuples from candidate sentences and checks them against COMET (a dynamic commonsense knowledge base). The scorer labels samples generated by the base LM, and those labeled samples train a small auxiliary NADO head that biases generation at each decoding step. On CommonGen and CSK-PN, BOOST raises automatic O-scores and human commonsense ratings while avoiding full-model fine-tuning.
Problem Statement
Given a list of concepts, pre-trained LMs often output sentences that break commonsense. Fine-tuning large models is costly. The problem: how to steer a frozen LM to produce more commonsensical, constraint-satisfying sentences without changing its weights.
Main Contribution
A reference-free commonsense scorer (O-Scorer) that extracts relation tuples from a sentence and scores them by grounding to COMET.
A black-box controllable generation pipeline that trains a small auxiliary NADO head on self-sampled outputs to steer a frozen PTLM toward higher O-scores.
Empirical validation on CommonGen and CSK-PN showing consistent gains in automatic O-score and human commonsense judgments across GPT-2, Flan-T5, and Alpaca variants.
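The tuple-grounding idea behind the O-Scorer can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `TOY_KB` is a hypothetical stand-in for COMET, `embed` is a toy bag-of-words embedding rather than a real sentence encoder, and taking the best match per tuple and then the mean across tuples is one plausible aggregation, not necessarily the one BOOST uses.

```python
import math
from collections import Counter

# Hypothetical stand-in for COMET: maps (head, relation) to plausible tails.
TOY_KB = {
    ("knife", "UsedFor"): ["cutting food", "slicing bread"],
    ("dog", "CapableOf"): ["catch a ball", "bark loudly"],
}

def embed(text):
    """Toy bag-of-words embedding; a real scorer would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tuple_score(head, relation, tail, kb=TOY_KB):
    """Best similarity between the extracted tail and any knowledge-base tail."""
    candidates = kb.get((head, relation), [])
    return max((cosine(embed(tail), embed(c)) for c in candidates), default=0.0)

def o_score(tuples, kb=TOY_KB):
    """Mean tuple score for one sentence (assumed aggregation); 0.0 if no tuples."""
    if not tuples:
        return 0.0
    return sum(tuple_score(h, r, t, kb) for h, r, t in tuples) / len(tuples)
```

A tuple such as `("knife", "UsedFor", "cutting food")` scores close to 1.0 against this toy knowledge base, while `("knife", "UsedFor", "painting walls")` scores 0.0 because no words overlap.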
Key Findings
Reference-free O-Scorer correlates with human commonsense ratings and matches top reference-based metrics.
BOOST raises human commonsense ratings across base models.
Combining lexical checking with O-Scorer (BOOST Joint) preserves constraint coverage better than O-Scorer alone.
Results
O-Score (automatic)
Human commonsense (Likert 1-4)
Keyword coverage
Who Should Care
What To Try In 7 Days
Run the O-Scorer on your model's outputs to get a reference-free commonsense baseline.
Generate self-sampled outputs from your frozen LM and train a small NADO head using the O-Scorer as labels.
Use BOOST Joint (lexical check × O-Score) to keep required keywords while improving commonsense.
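The joint labeling in the last step can be sketched as a product of a lexical check and a commonsense score. This is a hedged illustration: `lexical_check` and `joint_label` are hypothetical names, the check is assumed to be binary (all concepts present or not), and it uses exact whole-word matching, so inflected forms such as "dogs" for "dog" would need extra handling.

```python
import re

def lexical_check(sentence, concepts):
    """Return 1.0 if every concept appears as a whole word, else 0.0 (assumed binary check)."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return 1.0 if all(c.lower() in words for c in concepts) else 0.0

def joint_label(sentence, concepts, commonsense_score):
    """Multiply the lexical check by the commonsense score to get a training label."""
    return lexical_check(sentence, concepts) * commonsense_score
```

With this product, a sentence that drops a required keyword receives a label of 0.0 no matter how plausible it is, which is why the joint variant keeps coverage from degrading.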
Agent Features
Frameworks
- NADO (neurally-decomposed oracle)
Architectures
- transformer decoder (aux head)
Optimization Features
Infra Optimization
- avoids expensive full-model fine-tuning; fits on a single 80 GB GPU
Model Optimization
- train only small auxiliary head, not base LM
System Optimization
- black-box control that only needs access to output probabilities
Training Optimization
- train on self-sampled model outputs (no external labels needed)
Inference Optimization
- two forward passes, but the auxiliary head is small, so latency is roughly unchanged
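How an auxiliary head can steer a frozen LM using only its output probabilities can be sketched as a per-token reweighting. This follows the general NADO idea of scaling each base-model token probability by the ratio of the head's success estimate after and before appending that token; the exact formulation used in BOOST may differ, and `reweight`, `r_next`, and `r_prefix` are hypothetical names.

```python
def reweight(base_probs, r_next, r_prefix):
    """Combine base LM probabilities with auxiliary-head estimates.

    base_probs: dict token -> probability from the frozen LM
    r_next:     dict token -> head's estimate after appending that token
    r_prefix:   head's estimate for the current prefix
    """
    if r_prefix <= 0:
        return dict(base_probs)  # no usable signal; fall back to the base LM
    unnorm = {tok: p * r_next[tok] / r_prefix for tok, p in base_probs.items()}
    total = sum(unnorm.values())
    if total == 0:
        return dict(base_probs)
    # Renormalize so the adjusted values form a distribution.
    return {tok: v / total for tok, v in unnorm.items()}
```

For example, if the base LM splits probability evenly between "eats" and "flies" but the head estimates 0.9 for "eats" and 0.1 for "flies", the adjusted distribution shifts to roughly 0.9 and 0.1.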
Reproducibility
Data Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Tuple extractor covers only four relation types (UsedFor, CapableOf, AtLocation, PartOf).
- Cosine similarity of sentence embeddings sometimes misaligns with human judgement, which can mislabel training data.
- Training the auxiliary head needs many self-sampled examples, which adds compute cost, though less than fine-tuning the full LM.
When Not To Use
- When you can afford to finetune the full model and prefer end-to-end training.
- When your constraints need relations outside the four covered types (causal, temporal, etc.).
- When you cannot generate many self-sampled outputs due to API or compute limits.
Failure Modes
- Scorer errors propagate into the auxiliary model and produce plausible-looking but incorrect outputs.
- BOOST CS can improve commonsense but drop required keywords (low coverage) unless combined with lexical checks.
- Sentence embedding similarity mismatches lead to wrong tuple compatibility scores.
Core Entities
Models
- GPT-2
- Alpaca-7b
- Flan-T5-large
- GPT-3.5-Turbo
- T5-large
- COMET
- NADO (auxiliary head)
- BOOST (system)
Metrics
- O-Score
- BERTScore
- METEOR
- BLEU-4
- Keyword Coverage
Datasets
- CommonGen
- CSK-PN

