Practical review: how large language models can help — and where they fall short — in language teaching and automated assessment

July 17, 2023 | 8 min

Overview

Decision Snapshot: Needs Validation

The paper is a focused review with practical examples and early empirical citations; evidence is preliminary and mixed, so deployment should be conservative with human oversight.

Citations: 31

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 35%

Authors

Andrew Caines, Luca Benedetto, Shiva Taslimipoor, Christopher Davis, Yuan Gao, Oeistein Andersen, Zheng Yuan, Mark Elliott, Russell Moore, Christopher Bryant, Marek Rei, Helen Yannakoudakis, Andrew Mullooly, Diane Nicholls, Paula Buttery

Why It Matters For Business

LLMs let EdTech scale content creation and interactive features quickly, but they add compute cost and require human oversight to avoid quality, bias and calibration problems.

Summary TLDR

This short review surveys uses of large language models (LLMs) for language teaching and assessment. LLMs improve open-ended text generation and enable new content and chat features, but they do not yet beat specialist systems on standard automated grading and grammatical-error benchmarks. Best current practice is human-in-the-loop use: prompt and post-edit generated content, combine LLM outputs with traditional linguistic features for scoring, and validate any learner-facing feedback with targeted human evaluation and safeguards.

Problem Statement

EdTech needs scalable ways to create calibrated content, grade writing, and provide feedback. Recent LLMs generate fluent text, but their accuracy, calibration, and suitability for assessment tasks remain unclear. The paper asks: where can LLMs help in language learning, what do they not solve, and what risks must be managed?

Main Contribution

Survey of LLM applications across content creation, calibration, assessment and feedback for language learning

Practical account of early experiments and industry examples (e.g., Duolingo Max) and human-in-the-loop pipelines

Key Findings

LLMs generate better open-ended text than earlier, smaller models, which makes it feasible to produce plausible content for reading and chat practice.

Practical Use: Use LLMs to draft learner texts and chat scenarios, but plan for prompt engineering and human edit/selection before publishing content.

Evidence Ref: Section 3, human-in-the-loop generation experiments with GPT-3; Duolingo Max

On TOEFL11, a GPT-only automated essay scorer showed weak agreement with human reference scores (quadratic weighted kappa ≈ 0.388); combining GPT outputs with established linguistic features raised agreement to ≈ 0.605.

Numbers: QWK: GPT-only ≈ 0.388; GPT + linguistic features ≈ 0.605

Practical Use: Don't rely on LLMs alone for high-stakes scoring; combine LLM signals with engineered linguistic features and validate against benchmarks.

Evidence Ref: Section 5, Mizumoto & Eguchi (GPT-3.5 on TOEFL11)
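
For readers unfamiliar with the metric, here is a minimal sketch of how quadratic weighted kappa can be computed from two lists of integer scores. It is not the evaluation code used by Mizumoto & Eguchi or by the paper; the function name, the integer score scale, and the toy scores are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Agreement between two integer score lists; larger disagreements cost more."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts per human score
    model_hist = Counter(model)            # marginal counts per model score

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at maximum distance.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters were independent.
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters gave one identical score to every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy scores on a hypothetical 1-3 scale (not data from the paper).
human_scores = [1, 2, 3, 2, 3, 1]
model_scores = [1, 2, 2, 2, 3, 2]
print(round(quadratic_weighted_kappa(human_scores, model_scores, 1, 3), 3))
```

QWK is 1.0 for perfect agreement, close to 0 when agreement is no better than chance, and negative when the two raters disagree systematically.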

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Automated essay scoring agreement (QWK) | GPT-only ≈ 0.388 | Human reference scores | - | TOEFL11 (ETS corpus) | Mizumoto & Eguchi experiments reported in paper | Section 5
Automated essay scoring agreement (QWK) | GPT + linguistic features ≈ 0.605 | Human reference scores | ≈ +0.217 vs GPT-only | TOEFL11 | Combining GPT signals with engineered linguistic features improved agreement | Section 5

What To Try In 7 Days

Run a pilot: generate low-stakes reading texts with an LLM and have editors accept/reject outputs

A/B test LLM-written feedback versus template-based feedback on user engagement

Combine LLM scoring signals with existing linguistic feature models and compare to current scorer on a held-out sample
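
For the A/B test in the list above, one simple way to compare the two feedback variants is a two-proportion z-test. The sketch below assumes engagement is recorded as a yes/no outcome per user (for example, whether the user completed the next exercise), which is an assumption not taken from the paper. The function name and the counts are hypothetical, and the normal approximation is only reasonable with fairly large samples.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both groups need at least one user")
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0.0:
        # Every user (or no user) engaged in both groups: no detectable difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: A = template-based feedback, B = LLM-written feedback.
z, p = two_proportion_z_test(success_a=400, total_a=1000,
                             success_b=440, total_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A significant engagement difference says nothing about feedback accuracy, so this check complements rather than replaces human review of the LLM-written feedback.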

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Mostly a survey and position paper; contains limited new empirical results

Empirical citations are preliminary and use small human-evaluation samples in places

When Not To Use

For high-stakes automated scoring without human review

As the sole source of grammatical error correction when learners need minimal edits

Failure Modes

Over-fluent edits that change learner intent and confuse learning feedback

Inconsistent difficulty estimates depending on prompt and assumed student population
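
The over-fluent-edit failure mode can be screened with a crude heuristic: measure how much of the learner's sentence a correction rewrites, and route heavy rewrites to a human reviewer. The sketch below uses Python's standard `difflib` on whitespace tokens. The function names and the 0.3 threshold are hypothetical choices, not values from the paper, and token overlap is only a rough proxy for whether learner intent was preserved.

```python
from difflib import SequenceMatcher


def changed_token_ratio(original, corrected):
    """Share of tokens that differ between a learner sentence and its correction."""
    source_tokens = original.split()
    target_tokens = corrected.split()
    if not source_tokens and not target_tokens:
        return 0.0
    matcher = SequenceMatcher(None, source_tokens, target_tokens, autojunk=False)
    # ratio() is 2 * matched_tokens / total_tokens, so 1 - ratio() is the changed share.
    return 1.0 - matcher.ratio()


def needs_human_review(original, corrected, max_ratio=0.3):
    """Flag corrections that rewrite more of the sentence than the threshold allows."""
    return changed_token_ratio(original, corrected) > max_ratio


# Invented examples: a minimal edit versus a fluent rewrite.
learner = "I has a cat"
print(needs_human_review(learner, "I have a cat"))
print(needs_human_review(learner, "There is a cat that belongs to me"))
```

A suitable threshold would need tuning on real learner data, since short sentences cross any fixed ratio after a single edit.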

Core Entities

Models

GPT-3, GPT-3.5, GPT-4, ChatGPT, BERT, BART, T5, GECToR, Grammarly

Metrics

Quadratic weighted kappa (QWK), BLEU, ROUGE, BERTScore, Human preference

Datasets

TOEFL11, CoNLL-2014, JFLEG, BEA-2019, RACE, CEPOC

Benchmarks

GEC benchmarks (CoNLL-2014, JFLEG, BEA-2019); Automated essay scoring (TOEFL11 comparisons); TSAR lexical simplification shared task

Context Entities

Models

PaLM, LaMDA, LLaMA, OPT, Gopher, GPT-Neo, BLOOM

Metrics

Automatic text-similarity scores, User engagement signals

Datasets

The Pile, RACE, CLOTH

Benchmarks

HELM (Holistic Evaluation of Language Models)