Use ChatGPT to teach a smaller model to score answers and explain why

May 22, 2023 | 7 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it reuses ChatGPT as a data generator, shows consistent gains on selected ASAP-SAS subsets, and includes human evaluation; limitations exist around prompt tuning and dataset-specific rubrics.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Jiazheng Li, Lin Gui, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AERA turns expensive LLM reasoning into a deployable, smaller model that scores answers and explains decisions, lowering inference costs and improving explainability for education products.

Who Should Care

Summary TLDR

The authors present AERA: a three-step pipeline that (1) prompts ChatGPT to produce score+explanation pairs for short-answer grading, (2) refines those rationales and corrects obvious label noise, and (3) fine-tunes a compact LongT5 model to output both scores and human-readable rationales. On subsets of the ASAP-SAS short-answer dataset, the distilled LongT5 model improves scoring agreement (QWK) over ChatGPT and yields rationales that human annotators prefer more often. The method also helps surface mislabeled training examples without extra human rationale annotation.

Problem Statement

Automated short-answer scoring is fast but opaque. High-quality natural-language rationales are rare and expensive to collect. LLMs can produce explanations, but they are costly to run and closed-source. The problem: get reliable, explainable grading at lower cost by distilling LLM reasoning into a smaller model.

Main Contribution

AERA, a 3-step pipeline: prompt ChatGPT for score+rationale pairs, refine the outputs and fix label noise, then fine-tune a smaller LongT5 model to produce scores and rationales.

Two refinement strategies: (1) detect and fix likely mislabelled examples using ChatGPT’s semantic confidence, and (2) a prompt-based rationale refinement (XY→R) that makes rationales align with the supplied scores.
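A minimal sketch of the label-noise check, for illustration only. This summary does not say how ChatGPT's semantic confidence is computed, so the sketch assumes it can be approximated by agreement among several sampled teacher predictions for the same answer. The function name `flag_label_noise`, the `min_agreement` threshold, and the sample data are all hypothetical.

```python
from collections import Counter

def flag_label_noise(gold_score, teacher_scores, min_agreement=0.8):
    """Return (is_suspect, majority_score, agreement) for one example.

    Assumption: agreement among repeated teacher samples stands in for
    the teacher's confidence. This is an illustrative proxy, not the
    paper's definition.
    """
    if not teacher_scores:
        return False, None, 0.0
    majority_score, count = Counter(teacher_scores).most_common(1)[0]
    agreement = count / len(teacher_scores)
    # Suspect only when the teacher is consistent AND disagrees with the label.
    is_suspect = agreement >= min_agreement and majority_score != gold_score
    return is_suspect, majority_score, agreement

# Hypothetical data: gold label plus five sampled teacher scores per answer.
examples = [
    {"id": "a1", "gold": 2, "teacher": [2, 2, 2, 2, 2]},
    {"id": "a2", "gold": 0, "teacher": [2, 2, 2, 2, 1]},
    {"id": "a3", "gold": 1, "teacher": [0, 1, 2, 1, 0]},
]
suspects = [ex["id"] for ex in examples
            if flag_label_noise(ex["gold"], ex["teacher"])[0]]
print(suspects)
```

Flagged examples are candidates for human review or relabelling; whether to overwrite the gold label automatically is a separate policy decision.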

Key Findings

Distilled LongT5 (AERA) improves scoring agreement over ChatGPT on evaluated subsets.

Numbers: Overall QWK +11% vs ChatGPT (paper abstract; Table 1)

Practical Use: Fine-tuning a smaller model on ChatGPT-generated rationales can give better grading agreement than calling ChatGPT at inference, and the smaller model is cheaper to run and can be deployed locally.

Evidence Ref: Abstract; Table 1

Human evaluators preferred AERA rationales more often than ChatGPT's.

Numbers: Annotator preference: AERA 54% vs ChatGPT 23% (Table A8)

Practical Use: Annotators preferred the distilled model's explanations more than twice as often as ChatGPT's, which suggests students and teachers may find them more useful and be more willing to act on them.

Evidence Ref: Human evaluation results (Table A8)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Overall QWK (AERA vs ChatGPT) | AERA QWK +11% over ChatGPT (reported overall improvement) | ChatGPT (Example Instruction) | +11% | Aggregated over evaluated ASAP-SAS subsets | Abstract; Table 1 | Abstract; Table 1 |
| Human preference of rationales | AERA preferred 54% / ChatGPT 23% / No preference 23% | Pairwise human comparison | AERA +31pp over ChatGPT | Sampled 10% from top run (Tables A6, A8) | Table A8 | Table A8 |
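For readers unfamiliar with the agreement metric, here is a small pure-Python sketch of the standard Quadratic Weighted Kappa definition. It is not the paper's evaluation script, which may differ in details; scikit-learn's `cohen_kappa_score(y1, y2, weights="quadratic")` is a common alternative. The function and argument names are illustrative.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """QWK between two lists of integer scores in [min_score, max_score]."""
    n = max_score - min_score + 1
    total = len(rater_a)
    if n < 2 or total == 0 or total != len(rater_b):
        raise ValueError("need >= 2 score levels and equal-length, non-empty inputs")

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        return 1.0  # both raters gave one identical score to every item
    return 1.0 - weighted_observed / weighted_expected

# Hypothetical scores on a 0-3 rubric.
human = [0, 1, 2, 3, 2, 1]
model = [0, 1, 2, 3, 1, 1]
print(round(quadratic_weighted_kappa(human, model, 0, 3), 3))
```

Because the penalty grows with the squared distance between scores, a model that is off by one point is penalised far less than one that is off by three.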

What To Try In 7 Days

Prompt ChatGPT few-shot (Example Instruction) to generate score+rationale pairs on a small sample of your task.

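Run the XY→R refinement: give ChatGPT each answer together with its supplied score so the rationale aligns with that score, and flag examples where ChatGPT's own prediction confidently disagrees with the label as likely label errors.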

Fine-tune an open LongT5 checkpoint on the refined pairs and run spot human checks to compare explanations.
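The first two steps above can be sketched as two prompt builders. This is only an illustration under stated assumptions: the prompt wording, field names, and both function names are hypothetical, not the paper's templates, and the actual API call to the teacher model is left out.

```python
def build_scoring_prompt(question, rubric, demos, student_answer):
    """Few-shot prompt that asks the teacher for a score and a rationale."""
    parts = [f"Question: {question}", f"Rubric: {rubric}", ""]
    for demo in demos:  # each demo is a fully worked example
        parts += [
            f"Student answer: {demo['answer']}",
            f"Score: {demo['score']}",
            f"Rationale: {demo['rationale']}",
            "",
        ]
    # Leave the score open so the teacher completes score, then rationale.
    parts += [f"Student answer: {student_answer}", "Score:"]
    return "\n".join(parts)

def build_refinement_prompt(question, rubric, student_answer, score):
    """XY->R style prompt: the score is supplied, only the rationale is requested."""
    return "\n".join([
        f"Question: {question}",
        f"Rubric: {rubric}",
        f"Student answer: {student_answer}",
        f"Score: {score}",
        "Explain why this answer deserves this score.",
        "Rationale:",
    ])

# Hypothetical task content.
demos = [{"answer": "Plants make food from sunlight.", "score": 1,
          "rationale": "Names sunlight but omits water and carbon dioxide."}]
scoring = build_scoring_prompt(
    "What do plants need for photosynthesis?",
    "1 point per correct element, max 3.",
    demos,
    "Sunlight, water and carbon dioxide.",
)
refine = build_refinement_prompt(
    "What do plants need for photosynthesis?",
    "1 point per correct element, max 3.",
    "Sunlight, water and carbon dioxide.",
    3,
)
print(scoring)
print(refine)
```

Keeping the two builders separate makes it easy to compare teacher rationales generated with and without the supplied score on the same sample.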

Agent Features

Tool Use: OpenAI API used as data-generation teacher

Frameworks: AERA (prompt → refine → distill pipeline)

Optimization Features

Model Optimization: Knowledge distillation (teacher ChatGPT → student LongT5)

Training Optimization: Filter and refine teacher outputs to reduce noisy labels before fine-tuning

Inference Optimization: Switch from ChatGPT API to local LongT5 to cut inference cost
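Before fine-tuning the student, the refined teacher outputs have to be turned into input/target text pairs. The sketch below shows one possible layout, assuming a plain `score: ... rationale: ...` target string and JSONL output; the field names, text layout, and function names are hypothetical and may not match what the authors used.

```python
import json

def to_seq2seq_example(question, answer, score, rationale):
    """Build one text-to-text training pair for a seq2seq student model."""
    source = f"question: {question} answer: {answer}"
    # The student learns to emit the score first, then its explanation.
    target = f"score: {score} rationale: {rationale}"
    return {"input_text": source, "target_text": target}

def to_jsonl(records):
    """Serialise refined teacher records as one JSON object per line."""
    return "\n".join(
        json.dumps(to_seq2seq_example(**record), ensure_ascii=False)
        for record in records
    )

# Hypothetical refined teacher outputs.
records = [
    {"question": "What do plants need for photosynthesis?",
     "answer": "Sunlight and water.",
     "score": 2,
     "rationale": "Two of the three required elements are named."},
]
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

A file in this shape can be loaded by most seq2seq fine-tuning setups with little extra glue code.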

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Prompt templates and demonstration choices affect results and need tuning per dataset.

Human evaluators lacked domain assessment background; human eval quality could vary.

When Not To Use

When you have very little task data: the refinement strategies may not compensate for data scarcity.

For high-stakes grading without rigorous human oversight.

Failure Modes

Teacher hallucinations produce wrong rationales or wrongly formatted scores, which harms the student model.

A noisy or mislabelled original dataset causes the distilled model to learn incorrect mappings.
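One way to limit the first failure mode is to validate every teacher output before it enters the training set. The sketch below assumes the teacher was asked to reply as `Score: <int>` on the first line followed by `Rationale: <text>`. That output format, the regular expression, and the function name are assumptions for illustration, and the check catches only format errors, not wrong reasoning.

```python
import re

# Assumed layout: "Score: <int>" on the first line, then "Rationale: <text>".
_OUTPUT_PATTERN = re.compile(
    r"^\s*Score:\s*(\d+)\s*\n\s*Rationale:\s*(\S.*)$", re.DOTALL
)

def parse_teacher_output(text, min_score, max_score):
    """Return (score, rationale), or None if the output is malformed."""
    match = _OUTPUT_PATTERN.match(text)
    if match is None:
        return None  # missing field, non-numeric score, or empty rationale
    score = int(match.group(1))
    if not min_score <= score <= max_score:
        return None  # score outside the rubric range
    return score, match.group(2).strip()

# Hypothetical teacher outputs for a 0-3 rubric.
outputs = [
    "Score: 2\nRationale: Mentions two of the three key elements.",
    "Score: two\nRationale: Mentions two elements.",
    "Score: 7\nRationale: Excellent answer.",
    "Score: 1\nRationale:   ",
]
parsed = [parse_teacher_output(text, 0, 3) for text in outputs]
kept = [item for item in parsed if item is not None]
print(len(kept), "of", len(outputs), "outputs kept")
```

Rejected outputs can be regenerated or dropped; tracking the rejection rate per prompt is a cheap signal that a prompt template needs tuning.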

Core Entities

Models

ChatGPT (gpt-3.5-turbo), LongT5 (long-t5-tglobal-large), BERT (bert-base-uncased), Longformer (longformer-base-4096), LLaMA-2 70B, FlanT5, Bard

Metrics

Quadratic Weighted Kappa (QWK), Accuracy, Macro F1, sacreBLEU

Datasets

ASAP-SAS (Hewlett Foundation Short Answer Scoring)