Overview
Production Readiness
0.4
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
6
Why It Matters For Business
Instructions let you get useful tabular predictions with few or no labels, reducing costly data collection in privacy-sensitive domains.
Summary TLDR
The authors release TABLET, a benchmark of 20 tabular prediction tasks (10 UCI, 10 clinical DDX) annotated with natural and generated instructions. They test LLMs (Flan-T5 11b, Tk-Instruct 11b, GPT-J 6b, ChatGPT) and find instructions reliably raise performance versus prompts without instructions. Zero-shot instructions give moderate gains; combining instructions with a few in-context examples gives larger gains. However, models often ignore flipped or modified instructions and remain biased on specific instances, so instruction learning is promising but not yet a safe replacement for full supervised training in high-stakes domains.
Problem Statement
Can large language models solve tabular prediction tasks by following natural-language instructions alone or with a few examples, reducing the need for costly labeled tabular data?
Main Contribution
TABLET benchmark: 20 tabular tasks (10 UCI + 10 differential-diagnosis clinical tasks) annotated with diverse instructions.
Instruction taxonomy and generation: naturally occurring instructions (from consumer and professional sources) and generated instructions derived from simple rule or prototype models and then edited by GPT-3.
Prompting schema for serializing tabular rows and instructions into LLM prompts and an evaluation suite including flipped-instruction tests for faithfulness.
Empirical evaluation across several LLMs showing instruction-led gains and exposing failure modes like instruction unfaithfulness and instance biases.
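The prompting schema mentioned above can be pictured with a small sketch. The layout below (an instruction, optional few-shot examples, then the query row as "feature: value" lines) is an assumption for illustration, not TABLET's actual template; `serialize_row` and `build_prompt` are hypothetical helper names.

```python
def serialize_row(row):
    """Render one tabular row as 'feature: value' lines (assumed format)."""
    return "\n".join(f"{name}: {value}" for name, value in row.items())


def build_prompt(instruction, row, label_options, examples=()):
    """Assemble instruction, optional few-shot examples, and the query row.

    `examples` is a sequence of (row_dict, label) pairs; leave it empty
    for a zero-shot prompt.
    """
    parts = [instruction.strip(), ""]
    for example_row, example_label in examples:
        parts += [serialize_row(example_row), f"Answer: {example_label}", ""]
    parts += [
        serialize_row(row),
        "Answer with one of: " + ", ".join(label_options),
        "Answer:",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Predict whether the person earns more than 50K per year.",
        {"age": 39, "hours_per_week": 40},
        [">50K", "<=50K"],
        examples=[({"age": 52, "hours_per_week": 60}, ">50K")],
    ))
```

Passing an empty `examples` gives the zero-shot setting; passing a few labeled rows gives the few-shot setting that the summary reports as more effective.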
Key Findings
Instructions improve zero-shot LLM performance over prompts without instructions.
Few-shot examples amplify instruction benefits.
LLMs can ignore or contradict instruction logic.
Generated prototype-style instructions outperform rule-style ones.
LLMs still lag fully supervised models on DDX tasks.
Results
Zero-shot F1 improvement (instructions vs no instructions)
Few-shot (4-shot) F1 improvement (instructions vs no instructions)
Upper-bound supervised vs instruction-led LLMs
Instruction faithfulness (identical predictions after flipping instruction logic)
Generated instruction signal
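As a rough illustration of the faithfulness measure named above, the sketch below computes the fraction of instances whose prediction stays the same after the instruction logic is flipped. The function name `flip_agreement` is hypothetical, and the exact protocol used by the authors may differ.

```python
def flip_agreement(preds_original, preds_flipped):
    """Fraction of instances with identical predictions before and after the flip.

    A value near 1.0 suggests the model is largely ignoring the instruction;
    a faithful model should change many of its predictions.
    """
    if len(preds_original) != len(preds_flipped):
        raise ValueError("prediction lists must have the same length")
    if not preds_original:
        raise ValueError("prediction lists must not be empty")
    unchanged = sum(a == b for a, b in zip(preds_original, preds_flipped))
    return unchanged / len(preds_original)
```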
Who Should Care
What To Try In 7 Days
Run the TABLET demo with one of your small tabular tasks to see instruction gains.
Write a short natural instruction and test zero-shot vs few-shot (2–4 examples) with Flan-T5 or ChatGPT.
Generate prototype-style instructions from a simple centroid or rule model and polish with GPT-3 or templates.
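For the last step, a minimal sketch of a template-based prototype instruction might look like the following. It assumes all features are numeric, uses per-class means as the "centroids", and renders them with a fixed sentence template; the helper names and the wording of the template are illustrative assumptions, not the authors' generation pipeline.

```python
from collections import defaultdict


def class_centroids(rows, labels):
    """Mean of each numeric feature per class (assumes every row has the same keys)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for name, value in row.items():
            sums[label][name] += value
    return {
        label: {name: total / counts[label] for name, total in features.items()}
        for label, features in sums.items()
    }


def prototype_instruction(centroids, target_name):
    """Turn class centroids into a short natural-language instruction."""
    lines = [f"Predict {target_name} by comparing the row to these typical cases."]
    for label, features in centroids.items():
        description = ", ".join(
            f"{name} around {value:.1f}" for name, value in features.items()
        )
        lines.append(f"A typical '{label}' case has {description}.")
    return "\n".join(lines)
```

The resulting text can then be polished by hand or by an LLM before being used as the instruction in a prompt.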
Reproducibility
Data Urls
- TABLET benchmark bundled with paper (see demo site)
- UCI ML Repository
- DDXPlus (Tchango et al., 2022)
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- LLMs sometimes ignore instructions or follow pretraining biases instead of the prompt logic
- Models remain biased on particular instances and can consistently misclassify some data points
- ChatGPT API costs limited the breadth of evaluation; ChatGPT experiments were run on only a subset of seeds
- Instruction generation depends on simple template models and human review, which may not scale perfectly across domains
When Not To Use
- When you have abundant labeled data and need top performance; fully supervised models outperform instruction-only LLMs
- When the model must adhere faithfully to a precise rule and any deviation is unacceptable
- When cost or latency of API LLMs is prohibitive
Failure Modes
- Over-reliance on pretraining leading to identical predictions after instruction flips
- Consistent misclassification of particular instances despite few-shot examples
- Token collision in open-ended ChatGPT outputs causing label extraction errors
- Domain gaps when LLM pretraining lacks domain-specific references
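One plausible reading of the label extraction failure above is that a label string appears inside another word, or that several labels appear in the same free-form answer. The sketch below is a defensive parser under that assumption: it matches labels only as whole words and returns `None` when zero or multiple labels are found. `extract_label` is a hypothetical helper, not part of the TABLET code.

```python
import re


def extract_label(output, label_options):
    """Return the single label mentioned in `output`, or None if absent or ambiguous."""
    found = set()
    for label in label_options:
        # Require that the label is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
        if re.search(pattern, output, flags=re.IGNORECASE):
            found.add(label)
    return found.pop() if len(found) == 1 else None
```

Returning `None` for ambiguous outputs lets the caller count them separately instead of silently assigning a wrong label.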
Core Entities
Models
- Flan-T5 11b
- Tk-Instruct 11b
- GPT-J 6b
- ChatGPT
- GPT-3 (used for instruction rewriting)
- XGBoost
Metrics
- macro F1
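For reference, macro F1 is the unweighted mean of per-class F1 scores. The sketch below is a plain-Python version that averages over every label seen in either the true or predicted lists and scores a class as 0 when it has no true positives, false positives, or false negatives; whether the authors' evaluation handles such edge cases the same way is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("inputs must not be empty")
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally, a model that does well only on the majority class gets a low macro F1 on imbalanced tasks.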
Datasets
- TABLET (20 tasks)
- DDXPlus (differential diagnosis subset)
- UCI repository datasets (e.g., Adult, Credit, Churn, Breast Cancer, Wine)
Benchmarks
- TABLET

