Use ChatGPT to teach a smaller model to score answers and explain why

May 22, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

5

Authors

Jiazheng Li, Lin Gui, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He

Why It Matters For Business

AERA turns expensive LLM reasoning into a deployable, smaller model that scores answers and explains decisions, lowering inference costs and improving explainability for education products.

Summary TLDR

The authors present AERA: a three-step pipeline that (1) prompts ChatGPT to produce score+explanation pairs for short-answer grading, (2) refines those rationales and corrects obvious label noise, and (3) fine-tunes a compact LongT5 model to output both scores and human-readable rationales. On subsets of the ASAP-SAS short-answer dataset, the distilled LongT5 model improves scoring agreement (QWK) over ChatGPT and yields rationales that human annotators prefer more often. The method also helps surface mislabeled training examples without extra human rationale annotation.

Problem Statement

Automated short-answer scoring is fast but opaque. High-quality natural-language rationales are rare and expensive to collect. LLMs such as ChatGPT can produce explanations, but they are costly to run and closed-source. The problem: get reliable, explainable grading at lower cost by distilling LLM reasoning into a smaller model.

Main Contribution

AERA: a 3-step pipeline that prompts ChatGPT for score+rationale pairs, refines the outputs and fixes label noise, then fine-tunes a smaller LongT5 model to produce scores and rationales.

Two refinement strategies: (1) detect and fix likely mislabelled examples using ChatGPT's semantic confidence, and (2) prompt-based rationale refinement (XY→R), which makes rationales align with the supplied scores.

Comprehensive evaluation showing the distilled model matches or exceeds ChatGPT on scoring (QWK) on evaluated subsets and produces rationales human annotators prefer.
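The label-noise strategy can be illustrated with a small sketch. The summary only says that ChatGPT's "semantic confidence" is used. The code below swaps in a simpler stand-in, which is an assumption: sample the teacher several times per answer, treat the vote share of the most common score as its confidence, and flag examples where a confident teacher disagrees with the dataset label. The field names (`id`, `gold`, `teacher_scores`) and the threshold are hypothetical.

```python
from collections import Counter


def vote_confidence(sampled_scores):
    """Return (majority_score, fraction of samples agreeing with it)."""
    counts = Counter(sampled_scores)
    majority, n_votes = counts.most_common(1)[0]
    return majority, n_votes / len(sampled_scores)


def flag_suspect_labels(examples, threshold=0.8):
    """List examples whose gold label contradicts a confident teacher majority.

    Vote share is an assumed proxy for confidence, not the paper's measure.
    """
    suspects = []
    for ex in examples:
        majority, conf = vote_confidence(ex["teacher_scores"])
        if conf >= threshold and majority != ex["gold"]:
            suspects.append((ex["id"], ex["gold"], majority, conf))
    return suspects


# Toy data: five teacher samples per answer (values are made up).
examples = [
    {"id": "a1", "gold": 2, "teacher_scores": [2, 2, 2, 1, 2]},
    {"id": "a2", "gold": 0, "teacher_scores": [2, 2, 2, 2, 2]},
    {"id": "a3", "gold": 1, "teacher_scores": [0, 1, 2, 1, 0]},
]
print(flag_suspect_labels(examples))  # [('a2', 0, 2, 1.0)]
```

Flagged examples are candidates for human review or relabelling. The unanimous disagreement on `a2` is the kind of case the refinement step is meant to surface, while the split votes on `a3` are left alone.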

Key Findings

Distilled LongT5 (AERA) improves scoring agreement over ChatGPT on evaluated subsets.

Numbers: Overall QWK +11% vs ChatGPT (paper abstract; Table 1)

Human evaluators preferred AERA rationales more often than ChatGPT's.

Numbers: Annotator preference: AERA 54% vs ChatGPT 23% (Table A8)

ChatGPT-produced rationales strongly predict ChatGPT's own scores.

Numbers: Score-from-rationale QWK ≈ 90–99 across subsets (Table A2)
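Quadratic Weighted Kappa (QWK), the agreement metric behind these findings, can be computed in a few lines. The sketch below is a plain-Python version of the standard definition, not code from the paper. The function name and the score-range arguments are illustrative. It returns a value in [-1, 1]; the 90–99 range quoted above suggests the reported figures are scaled by 100, which is an inference from the numbers rather than something stated here.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Standard QWK between two lists of integer scores."""
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("QWK needs at least two score categories")

    # Observed confusion matrix: rows are true scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic penalty
            expected = hist_true[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical category; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3))  # 1.0
```

Because the penalty grows with the square of the distance between scores, a prediction that is off by two points costs four times as much as one that is off by one, which is why QWK is preferred over plain accuracy for ordinal grading.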

Results

Overall QWK (AERA vs ChatGPT)

Value: AERA QWK +11% over ChatGPT (reported overall improvement)

Baseline: ChatGPT (Example Instruction)

Human preference of rationales

Value: AERA preferred 54% / ChatGPT 23% / No preference 23%

Baseline: Pairwise human comparison

Faithfulness (rationale→score predictability)

Value: QWK 90–99 when predicting ChatGPT scores from its rationales

Baseline: ChatGPT outputs

Who Should Care

What To Try In 7 Days

Prompt ChatGPT few-shot (Example Instruction) to generate score+rationale pairs on a small sample of your task.

Run the XY→R refinement: feed predicted scores back to ChatGPT to tighten rationales and spot likely label errors.

Fine-tune an open LongT5 checkpoint on the refined pairs and run spot human checks to compare explanations.
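The three steps above can be prototyped as plain string builders before any API call or training run. The sketch below is illustrative only: the prompt wording, the `Score:` / `Rationale:` layout, and all function and field names are assumptions, not the templates used in the paper. It shows a few-shot scoring prompt, an XY→R prompt where a score is supplied and only the rationale is requested, and a source/target pair for fine-tuning a sequence-to-sequence student.

```python
def build_scoring_prompt(question, rubric, demos, answer):
    """Few-shot prompt asking the teacher for a score plus a rationale."""
    parts = [f"Question: {question}", f"Rubric: {rubric}", ""]
    for demo in demos:
        parts += [
            f"Student answer: {demo['answer']}",
            f"Score: {demo['score']}",
            f"Rationale: {demo['rationale']}",
            "",
        ]
    # Leave the final score open so the teacher completes score and rationale.
    parts += [f"Student answer: {answer}", "Score:"]
    return "\n".join(parts)


def build_refinement_prompt(question, rubric, answer, score):
    """XY->R prompt: the score is given, only the rationale is requested."""
    return "\n".join([
        f"Question: {question}",
        f"Rubric: {rubric}",
        f"Student answer: {answer}",
        f"Score: {score}",
        "Rationale:",
    ])


def to_training_pair(question, rubric, answer, score, rationale):
    """One fine-tuning example: input text and the score+rationale target."""
    source = f"Question: {question}\nRubric: {rubric}\nStudent answer: {answer}"
    target = f"Score: {score}\nRationale: {rationale}"
    return {"source": source, "target": target}


pair = to_training_pair(
    "Why does ice float?", "1 point for density.", "It is lighter.",
    1, "Mentions lower density informally.",
)
print(pair["target"])
```

Keeping the score ahead of the rationale in the target is a design choice made here for easy parsing; the opposite order is equally possible and may behave differently.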

Agent Features

Tool Use

  • OpenAI API used as data-generation teacher

Frameworks

  • AERA (prompt → refine → distill pipeline)

Optimization Features

Model Optimization

  • Knowledge distillation (teacher ChatGPT → student LongT5)

Training Optimization

  • Filter and refine teacher outputs to reduce noisy labels before fine-tuning

Inference Optimization

  • Switch from ChatGPT API to local LongT5 to cut inference cost
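Whether the switch pays off depends on volume. A back-of-the-envelope sketch, with every number a placeholder rather than a measured price: the one-off cost of distillation (teacher API calls plus fine-tuning) is recovered once the per-request saving has accumulated past it. The function and parameter names are hypothetical.

```python
import math


def break_even_requests(fixed_cost, api_cost_per_request, local_cost_per_request):
    """Requests needed before a local student is cheaper than the API.

    Returns None when the local model is not cheaper per request.
    """
    saving = api_cost_per_request - local_cost_per_request
    if saving <= 0:
        return None
    return math.ceil(fixed_cost / saving)


# Placeholder figures in arbitrary currency units, for illustration only.
print(break_even_requests(100.0, 0.5, 0.25))  # 400
```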

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Prompt templates and demonstration choices affect results and need tuning per dataset.
  • Human evaluators lacked domain assessment background; human eval quality could vary.
  • Trade-off observed between interpretability (generation) and top classification performance.
  • Method depends on the teacher (ChatGPT) quality; teacher hallucinations can propagate if not filtered.

When Not To Use

  • When you have very little task data: the refinement strategies may not make up for data scarcity.
  • For high-stakes grading without rigorous human oversight.
  • When rubrics rely heavily on domain-specific background not covered in prompts.

Failure Modes

  • Teacher hallucinations produce wrong rationales or wrong score formats, harming the student model.
  • A noisy or mislabelled original dataset causes the distilled model to learn incorrect mappings.
  • Ambiguous rubric entries like “other acceptable responses” lead to inconsistent scoring.
  • Overfitting to teacher-style explanations rather than true rubric reasoning.
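The wrong-score-format failure mode is the easiest one to guard against mechanically. The sketch below assumes the teacher was asked to reply with a `Score:` line followed by a `Rationale:` line; that layout, the function name, and the rejection rules are illustrative assumptions. Anything that does not parse cleanly is rejected instead of being passed on to fine-tuning.

```python
import re

# A line containing only "Score: <integer>" (case-insensitive).
SCORE_RE = re.compile(r"^[ \t]*score[ \t]*:[ \t]*(\d+)[ \t]*$",
                      re.IGNORECASE | re.MULTILINE)
# Everything after "Rationale:" up to the end of the reply.
RATIONALE_RE = re.compile(r"rationale\s*:\s*(.+)", re.IGNORECASE | re.DOTALL)


def parse_teacher_output(text, max_score):
    """Return (score, rationale), or None if the reply is malformed."""
    score_match = SCORE_RE.search(text)
    rationale_match = RATIONALE_RE.search(text)
    if score_match is None or rationale_match is None:
        return None
    score = int(score_match.group(1))
    rationale = rationale_match.group(1).strip()
    if score > max_score or not rationale:
        return None  # out-of-range score or empty explanation
    return score, rationale


print(parse_teacher_output("Score: 2\nRationale: Names both key elements.", 3))
# (2, 'Names both key elements.')
print(parse_teacher_output("Score: 2/3\nRationale: Partly correct.", 3))
# None
```

A format check like this does not detect hallucinated reasoning; it only keeps unparseable or out-of-range outputs away from the student.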

Core Entities

Models

  • ChatGPT (gpt-3.5-turbo)
  • LongT5 (long-t5-tglobal-large)
  • BERT (bert-base-uncased)
  • Longformer (longformer-base-4096)
  • LLaMA-2 70B
  • FlanT5
  • Bard

Metrics

  • Quadratic Weighted Kappa (QWK)
  • Accuracy
  • Macro F1
  • sacreBLEU
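For the two classification metrics in this list, a minimal plain-Python sketch of the standard definitions follows. It is not the paper's evaluation code, and the function names are illustrative. Macro F1 averages per-class F1 over every score value that appears in either the labels or the predictions, so rare score levels count as much as common ones.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```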

Datasets

  • ASAP-SAS (Hewlett Foundation Short Answer Scoring)