Use ChatGPT to teach a smaller model to score answers and explain why

Overview

Decision Snapshot: Needs Validation

The method is practical: it reuses ChatGPT as a data generator, shows consistent gains on selected ASAP-SAS subsets, and includes human evaluation; limitations exist around prompt tuning and dataset-specific rubrics.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.86

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Jiazheng Li, Lin Gui, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He

Links

Abstract / PDF / Code / Data

Why It Matters For Business

AERA turns expensive LLM reasoning into a deployable, smaller model that scores answers and explains decisions, lowering inference costs and improving explainability for education products.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

The authors present AERA: a three-step pipeline that (1) prompts ChatGPT to produce score+explanation pairs for short-answer grading, (2) refines those rationales and corrects obvious label noise, and (3) fine-tunes a compact LongT5 model to output both scores and human-readable rationales. On subsets of the ASAP-SAS short-answer dataset, the distilled LongT5 model improves scoring agreement (QWK) over ChatGPT and yields rationales that human annotators prefer more often. The method also helps surface mislabeled training examples without extra human rationale annotation.

Problem Statement

Automated short-answer scoring is fast but opaque. High-quality natural-language rationales are rare and expensive to collect. Large language models can produce explanations, but they are costly to run and closed-source. The problem: get reliable, explainable grading at lower cost by distilling LLM reasoning into a smaller model.

Main Contribution

AERA: a 3-step pipeline — prompt ChatGPT for score+rationale, refine outputs and fix label noise, then fine-tune a smaller LongT5 model to produce scores and rationales.

Two refinement strategies: (1) detect and fix likely mislabelled examples using ChatGPT's semantic confidence, and (2) a prompt-based rationale refinement (XY→R) that rewrites rationales so they align with the supplied scores.
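The two refinement strategies can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that "semantic confidence" can be approximated by how often repeated teacher samples agree on one score, and the names `flag_label_noise`, `build_xy_to_r_prompt`, and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def flag_label_noise(gold_score, sampled_scores, min_agreement=0.8):
    """Flag a training label as suspect when the teacher confidently disagrees.

    ASSUMPTION: agreement across repeated teacher samples stands in for the
    paper's "semantic confidence"; the 0.8 threshold is illustrative only.
    Returns (suspect, majority_score, agreement).
    """
    if not sampled_scores:
        raise ValueError("need at least one sampled teacher score")
    majority, count = Counter(sampled_scores).most_common(1)[0]
    agreement = count / len(sampled_scores)
    suspect = majority != gold_score and agreement >= min_agreement
    return suspect, majority, agreement


def build_xy_to_r_prompt(question, rubric, answer, score):
    """Build an XY->R style prompt: the score is supplied, only the rationale is requested.

    ASSUMPTION: this template is hypothetical, not the paper's prompt.
    """
    return (
        f"Question: {question}\n"
        f"Rubric: {rubric}\n"
        f"Student answer: {answer}\n"
        f"Score: {score}\n"
        "Explain in two or three sentences why this answer deserves this score."
    )


suspect, majority, agreement = flag_label_noise(1, [3, 3, 3, 3, 2])
print(suspect, majority, agreement)
```

Flagged examples are candidates for review or relabelling rather than automatic correction; how aggressively to trust the teacher is a judgment call that depends on the dataset.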

Key Findings

Distilled LongT5 (AERA) improves scoring agreement over ChatGPT on evaluated subsets.

Numbers: Overall QWK +11% vs ChatGPT (paper abstract; Table 1)

Practical Use: You can get better automatic grading agreement by fine-tuning a smaller model on ChatGPT-generated rationales instead of calling ChatGPT at inference time; the smaller model is also cheaper to run and can be deployed locally.

Evidence Ref: Abstract; Table 1
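For readers unfamiliar with the agreement metric, the sketch below computes Quadratic Weighted Kappa from two lists of integer scores using the standard definition (disagreements are penalized by the squared score distance, normalized by the disagreement expected by chance). The function name and the toy scores are illustrative and are not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """QWK between two equal-length lists of integer scores in [min_score, max_score]."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty lists of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score levels")
    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_h = Counter(human)                # marginal counts for the human rater
    hist_m = Counter(model)                # marginal counts for the model
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[(i, j)]
            den += weight * hist_h[i] * hist_m[j] / n
    if den == 0:
        # Both raters gave the same single score to every item; treated as
        # perfect agreement here by convention.
        return 1.0
    return 1.0 - num / den


print(quadratic_weighted_kappa([0, 1, 2], [0, 1, 1], 0, 2))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.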

Human evaluators preferred AERA rationales more often than ChatGPT's.

Numbers: Annotator preference: AERA 54% vs ChatGPT 23% (Table A8)

Practical Use: Human annotators preferred the distilled model's explanations, so students and teachers may be more likely to trust and act on them.

Evidence Ref: Human evaluation results (Table A8)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall QWK (AERA vs ChatGPT)	AERA QWK +11% over ChatGPT (reported overall improvement)	ChatGPT (Example Instruction)	+11%	Aggregated over evaluated ASAP-SAS subsets	Abstract; Table 1	Abstract; Table 1
Human preference of rationales	AERA preferred 54% / ChatGPT 23% / No preference 23%	Pairwise human comparison	AERA +31pp over ChatGPT	Sampled 10% from top run (Table A6,A8)	Table A8	Table A8

What To Try In 7 Days

Prompt ChatGPT few-shot (Example Instruction) to generate score+rationale pairs on a small sample of your task.

Run the XY→R refinement: give ChatGPT each answer together with its supplied score so it rewrites the rationale to match, and use ChatGPT's confident disagreements with the labels to spot likely label errors.

Fine-tune an open LongT5 checkpoint on the refined pairs and run spot human checks to compare explanations.
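The first and last steps above mostly come down to string formatting, which can be sketched as below. The prompt layout, the `Score:` / `Rationale:` output format, and the helper names `build_few_shot_prompt` and `to_student_example` are all assumptions made for illustration; the paper's actual templates may differ.

```python
def build_few_shot_prompt(question, rubric, demos, answer):
    """Assemble a few-shot grading prompt for the teacher model.

    ASSUMPTION: `demos` is a list of dicts with 'answer', 'score', and
    'rationale' keys; this layout is hypothetical.
    """
    parts = [f"Question: {question}", f"Rubric: {rubric}", ""]
    for demo in demos:
        parts += [
            f"Student answer: {demo['answer']}",
            f"Score: {demo['score']}",
            f"Rationale: {demo['rationale']}",
            "",
        ]
    # The prompt ends on "Score:" so the teacher continues with a score and rationale.
    parts += [f"Student answer: {answer}", "Score:"]
    return "\n".join(parts)


def to_student_example(question, answer, score, rationale):
    """Turn one refined (answer, score, rationale) triple into a text-to-text pair.

    The student is trained to emit the score first and the rationale second.
    """
    return {
        "input": f"Question: {question}\nStudent answer: {answer}",
        "target": f"Score: {score}\nRationale: {rationale}",
    }


demos = [{"answer": "Add more salt.", "score": 1, "rationale": "Names one variable only."}]
print(build_few_shot_prompt("What would you change?", "1 point per variable", demos, "Use a timer."))
```

The resulting `input` / `target` pairs can then be tokenized and passed to whatever sequence-to-sequence training loop you use for the LongT5 checkpoint.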

Agent Features

Tool Use

OpenAI API used as data-generation teacher

Frameworks

AERA (prompt → refine → distill pipeline)

Optimization Features

Model Optimization

Knowledge distillation (teacher ChatGPT → student LongT5)

Training Optimization

Filter and refine teacher outputs to reduce noisy labels before fine-tuning
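One simple form of this filtering is to drop teacher outputs that cannot be parsed or that carry an out-of-range score. The sketch below assumes the teacher was asked to answer in a `Score: <int>` then `Rationale: <text>` layout; that layout and the function names are hypothetical, not the paper's format.

```python
import re

# ASSUMPTION: the teacher replies as "Score: <int>" followed by "Rationale: <text>".
_OUTPUT_PATTERN = re.compile(r"Score:\s*(\d+)\s*Rationale:\s*(.+)", re.DOTALL)


def parse_teacher_output(text, max_score):
    """Return (score, rationale), or None if the output is malformed."""
    match = _OUTPUT_PATTERN.search(text)
    if match is None:
        return None
    score = int(match.group(1))
    rationale = match.group(2).strip()
    if score > max_score or not rationale:
        return None
    return score, rationale


def filter_teacher_outputs(raw_outputs, max_score):
    """Keep only outputs with a valid score and a non-empty rationale."""
    kept = []
    for text in raw_outputs:
        parsed = parse_teacher_output(text, max_score)
        if parsed is not None:
            kept.append(parsed)
    return kept


raw = [
    "Score: 2\nRationale: Mentions two steps.",
    "Score: 7\nRationale: Out of range for a 0-3 rubric.",
    "I think this answer is quite good.",
    "Score: 1\nRationale:   ",
]
print(filter_teacher_outputs(raw, max_score=3))
```

Format checks like this only catch structural problems; a well-formed but wrong rationale still needs the refinement steps or a human spot check.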

Inference Optimization

Switch from ChatGPT API to local LongT5 to cut inference cost

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/lijiazheng99/aera

Data URLs

https://kaggle.com/competitions/asap-sas

Risks & Boundaries

Limitations

Prompt templates and demonstration choices affect results and need tuning per dataset.

Human evaluators lacked domain assessment background; human eval quality could vary.

When Not To Use

When you have very little task data — refinement strategies may not rescue data scarcity.

For high-stakes grading without rigorous human oversight.

Failure Modes

Teacher hallucinations produce wrong rationales or wrongly formatted scores, which harm the student model.

A noisy or mislabelled original dataset causes the distilled model to learn incorrect mappings.

Core Entities

Models

ChatGPT (gpt-3.5-turbo), LongT5 (Long-t5-tglobal-large), BERT (bert-base-uncased), Longformer (longformer-base-4096), LLaMA-2 70B, FlanT5, Bard

Metrics

Quadratic Weighted Kappa (QWK), Accuracy, Macro F1, sacreBLEU

Datasets

ASAP-SAS (Hewlett Foundation Short Answer Scoring)

