TABLET: a 20-task benchmark testing whether LLMs can learn tabular prediction from natural-language instructions

April 25, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

6

Authors

Dylan Slack, Sameer Singh

Links

Abstract / PDF

Why It Matters For Business

Instructions let you get useful tabular predictions with few or no labels, reducing costly data collection in privacy-sensitive domains.

Summary TLDR

The authors release TABLET, a benchmark of 20 tabular prediction tasks (10 UCI, 10 clinical DDX) annotated with natural and generated instructions. They test LLMs (Flan-T5 11b, Tk-Instruct 11b, GPT-J 6b, ChatGPT) and find instructions reliably raise performance versus prompts without instructions. Zero-shot instructions give moderate gains; combining instructions with a few in-context examples gives larger gains. However, models often ignore flipped or modified instructions and remain biased on specific instances, so instruction learning is promising but not yet a safe replacement for full supervised training in high-stakes domains.

Problem Statement

Can large language models solve tabular prediction tasks by following natural-language instructions alone or with a few examples, reducing the need for costly labeled tabular data?

Main Contribution

TABLET benchmark: 20 tabular tasks (10 UCI + 10 differential-diagnosis clinical tasks) annotated with diverse instructions.

Instruction taxonomy and generation: naturally occurring instructions (from consumer and professional sources) and generated instructions, produced from simple rule or prototype models and then edited by GPT-3.

Prompting schema for serializing tabular rows and instructions into LLM prompts, plus an evaluation suite that includes flipped-instruction tests for faithfulness.

Empirical evaluation across several LLMs showing instruction-led gains and exposing failure modes like instruction unfaithfulness and instance biases.
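
To make the prompting idea concrete, here is a minimal sketch of how a tabular row, an optional instruction, and a few in-context examples might be combined into one prompt. The "feature: value" layout, the function names, and the sample features are assumptions for illustration; they are not the exact TABLET schema.

```python
from typing import Mapping, Optional, Sequence, Tuple

Row = Mapping[str, object]


def serialize_row(row: Row) -> str:
    """Render one tabular row as 'feature: value' lines (assumed layout)."""
    return "\n".join(f"{name}: {value}" for name, value in row.items())


def build_prompt(
    row: Row,
    labels: Sequence[str],
    instruction: Optional[str] = None,
    examples: Sequence[Tuple[Row, str]] = (),
) -> str:
    """Assemble instruction, labeled examples, and the query row into a prompt.

    With no instruction and no examples this reduces to a plain
    serialized-row prompt, i.e. a no-instruction baseline.
    """
    parts = []
    if instruction:
        parts.append(instruction.strip())
    for example_row, example_label in examples:
        parts.append(f"{serialize_row(example_row)}\nAnswer: {example_label}")
    options = ", ".join(labels)
    parts.append(f"{serialize_row(row)}\nAnswer with one of: {options}.\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical task: feature names and values are made up for illustration.
prompt = build_prompt(
    row={"age": 40, "hours_per_week": 50},
    labels=["high income", "low income"],
    instruction="Predict whether the person has a high or low income.",
    examples=[({"age": 23, "hours_per_week": 20}, "low income")],
)
print(prompt)
```

Passing `examples=()` gives the zero-shot variant, and dropping `instruction` gives the no-instruction comparison, so the same function covers the settings compared in the evaluation.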

Key Findings

Instructions improve zero-shot LLM performance over prompts without instructions.

Numbers: Flan-T5 zero-shot F1 +20% avg; ChatGPT zero-shot F1 +10% avg (vs LIFT)

Few-shot examples amplify instruction benefits.

Numbers: With 4 in-context examples, Flan-T5 F1 gain +44% vs no-instruction baseline; ChatGPT +13%

LLMs can ignore or contradict instruction logic.

Numbers: Median identical predictions with flipped instructions: Flan-T5 52%, Tk-Instruct 49%

Generated prototype-style instructions outperform rule-style ones.

Numbers: Prototype vs rulesets: p = 0.03 favoring prototypes

LLMs still lag fully supervised models on DDX tasks.

Numbers: XGBoost (full train) avg F1 0.94 vs ChatGPT 4-shot 0.68 and Flan-T5 4-shot 0.66

Results

Zero-shot F1 improvement (instructions vs no instructions)

Value: Flan-T5 +20% avg; ChatGPT +10% avg

Baseline: LIFT/no-instruction prompt

Few-shot (4-shot) F1 improvement (instructions vs no instructions)

Value: Flan-T5 +44% avg; ChatGPT +13% avg

Baseline: LIFT/no-instruction prompt

Upper-bound supervised vs instruction-led LLMs

Value: XGBoost full-train avg F1 0.94; ChatGPT 4-shot 0.68; Flan-T5 4-shot 0.66

Baseline: Fully supervised XGBoost

Instruction faithfulness (identical predictions after flipping instruction logic)

Value: Flan-T5 median 52% identical; Tk-Instruct median 49% identical

Baseline: Predictions following original instructions
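
The faithfulness number can be read as the share of test instances whose prediction does not change when the instruction logic is flipped. The sketch below computes that share per run and takes the median across runs; the function names and the toy predictions are hypothetical, and the paper's exact aggregation may differ.

```python
from statistics import median
from typing import Iterable, Sequence, Tuple


def identical_rate(original: Sequence[str], flipped: Sequence[str]) -> float:
    """Fraction of instances with the same prediction under both instructions."""
    if len(original) != len(flipped):
        raise ValueError("prediction lists must have the same length")
    if not original:
        raise ValueError("prediction lists must not be empty")
    same = sum(a == b for a, b in zip(original, flipped))
    return same / len(original)


def median_identical_rate(
    runs: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> float:
    """Median of the per-run identical-prediction rates."""
    return median(identical_rate(orig, flip) for orig, flip in runs)


# Toy predictions, made up for illustration only.
runs = [
    (["yes", "no", "yes", "no"], ["yes", "yes", "no", "no"]),
    (["yes", "yes"], ["no", "no"]),
    (["no", "no"], ["no", "no"]),
]
print(median_identical_rate(runs))
```

Under this reading, a lower rate means the model changed its predictions more often after the flip, so a rate near 50% suggests the instruction was ignored on roughly half of the instances.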

Generated instruction signal

Value: Generated instructions significantly improve GPT-J and other LLMs (p < 1e-4)

Baseline: No instructions

Who Should Care

What To Try In 7 Days

Run the TABLET demo with one of your small tabular tasks to see instruction gains.

Write a short natural instruction and test zero-shot vs few-shot (2–4 examples) with Flan-T5 or ChatGPT.

Generate prototype-style instructions from a simple centroid or rule model and polish with GPT-3 or templates.
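
As a starting point for the last step, the sketch below derives one "typical case" per class from per-class feature means and fills a fixed text template. Class means stand in for whatever prototype model you prefer, and the template wording, function names, and sample rows are all assumptions; the result is a draft that would still need polishing and human review.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Mapping, Sequence

NumericRow = Mapping[str, float]


def class_centroids(
    rows: Sequence[NumericRow], labels: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Average each numeric feature within each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {feature: mean(r[feature] for r in members) for feature in members[0]}
        for label, members in grouped.items()
    }


def prototype_instruction(
    centroids: Mapping[str, Mapping[str, float]], target_name: str
) -> str:
    """Turn class centroids into a prototype-style instruction draft."""
    lines = [f"Predict {target_name} by comparing each example with these typical cases:"]
    for label, centroid in sorted(centroids.items()):
        description = ", ".join(
            f"{feature} around {value:.1f}" for feature, value in centroid.items()
        )
        lines.append(f"- A typical '{label}' case has {description}.")
    return "\n".join(lines)


# Hypothetical labeled rows, made up for illustration only.
rows = [
    {"age": 30.0, "hours_per_week": 40.0},
    {"age": 50.0, "hours_per_week": 60.0},
    {"age": 20.0, "hours_per_week": 20.0},
]
labels = ["high income", "high income", "low income"]
print(prototype_instruction(class_centroids(rows, labels), "income level"))
```

This handles numeric features only; categorical features would need a different summary, such as the most frequent value per class.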

Reproducibility

Data Urls

  • TABLET benchmark bundled with paper (see demo site)
  • UCI ML Repository
  • DDXPlus (Tchango et al., 2022)

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • LLMs sometimes ignore instructions or follow pretraining biases instead of the prompt logic
  • Models remain biased on particular instances and can consistently misclassify some data points
  • ChatGPT API costs limited broader evaluation, so the ChatGPT experiments were run on only a subset of seeds
  • Instruction generation depends on simple template models and human review, which may not scale perfectly across domains

When Not To Use

  • When you have abundant labeled data and need top performance, since fully supervised models outperform instruction-only LLMs
  • When the model must follow a precise rule faithfully and any deviation is unacceptable
  • When cost or latency of API LLMs is prohibitive

Failure Modes

  • Over-reliance on pretraining leading to identical predictions after instruction flips
  • Consistent misclassification of particular instances despite few-shot examples
  • Token collision in open-ended ChatGPT outputs causing label extraction errors
  • Domain gaps when LLM pretraining lacks domain-specific references
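
One plausible reading of the label-extraction failure mode is that a free-form answer mentions more than one label, or a short label appears inside a longer one. The sketch below is a cautious extractor that returns a label only when exactly one remains after whole-word matching; the function name and the example labels are hypothetical, and this is not the paper's extraction procedure.

```python
import re
from typing import Optional, Sequence


def extract_label(output: str, labels: Sequence[str]) -> Optional[str]:
    """Return the single label mentioned in the output, else None.

    Matching is case-insensitive and requires word boundaries, so "no"
    does not match inside "unknown" or "cannot".
    """
    matched = [
        label
        for label in labels
        if re.search(rf"(?<!\w){re.escape(label)}(?!\w)", output, re.IGNORECASE)
    ]
    # Drop a matched label that is contained in a longer matched label,
    # e.g. "bronchitis" when "acute bronchitis" also matched.
    specific = [
        a
        for a in matched
        if not any(a != b and a.lower() in b.lower() for b in matched)
    ]
    # Zero or several remaining labels are treated as ambiguous.
    return specific[0] if len(specific) == 1 else None


print(extract_label("The answer is Yes.", ["yes", "no"]))
print(extract_label("I cannot say yes or no", ["yes", "no"]))
```

Returning `None` for ambiguous outputs lets you count and inspect them separately instead of silently scoring them as a wrong class.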

Core Entities

Models

  • Flan-T5 11b
  • Tk-Instruct 11b
  • GPT-J 6b
  • ChatGPT
  • GPT-3 (used for instruction rewriting)
  • XGBoost

Metrics

  • macro F1
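
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below is a dependency-free version with toy labels that are made up for illustration; the paper's evaluation code may handle edge cases differently.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("inputs must not be empty")
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels, made up for illustration only.
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```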

Datasets

  • TABLET (20 tasks)
  • DDXPlus (differential diagnosis subset)
  • UCI repository datasets (e.g., Adult, Credit, Churn, Breast Cancer, Wine)

Benchmarks

  • TABLET