Practical review: how large language models can help — and where they fall short — in language teaching and automated assessment

Overview

Decision Snapshot: Needs Validation

The paper is a focused review with practical examples and early empirical citations; evidence is preliminary and mixed, so deployment should be conservative with human oversight.

Citations: 31

Evidence Strength: 0.60

Confidence: 0.82

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 2/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 40%

Novelty: 35%

Authors

Andrew Caines, Luca Benedetto, Shiva Taslimipoor, Christopher Davis, Yuan Gao, Oeistein Andersen, Zheng Yuan, Mark Elliott, Russell Moore, Christopher Bryant, Marek Rei, Helen Yannakoudakis, Andrew Mullooly, Diane Nicholls, Paula Buttery

Why It Matters For Business

LLMs let EdTech scale content creation and interactive features quickly, but they add compute cost and require human oversight to avoid quality, bias and calibration problems.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, CEO

Summary TLDR

This short review surveys uses of large language models (LLMs) for language teaching and assessment. LLMs improve open-ended text generation and enable new content and chat features, but they do not yet beat specialist systems on standard automated grading and grammatical-error benchmarks. Best current practice is human-in-the-loop use: prompt and post-edit generated content, combine LLM outputs with traditional linguistic features for scoring, and validate any learner-facing feedback with targeted human evaluation and safeguards.

Problem Statement

EdTech needs scalable ways to create calibrated content, grade writing, and provide feedback. Recent LLMs are powerful at generating fluent text but their accuracy, calibration and suitability for assessment tasks are unclear. The paper asks: where can LLMs help in language learning, what do they not solve, and what risks must be managed?

Main Contribution

Survey of LLM applications across content creation, calibration, assessment and feedback for language learning

Practical account of early experiments and industry examples (e.g., Duolingo Max) and human-in-the-loop pipelines

Key Findings

LLMs produce better open-ended text generation than prior small models, enabling plausible content generation for reading and chat practice.

Practical Use: Use LLMs to draft learner texts and chat scenarios, but plan for prompt engineering and human editing or selection before publishing content.

Evidence Ref: Sections 3, human-in-the-loop generation experiments with GPT-3; Duolingo Max

A GPT-only automated essay scorer showed weak agreement with human reference scores (quadratic weighted kappa ≈ 0.388) while combining GPT outputs with established linguistic features improved agreement (≈ 0.605 on TOEFL11).

Numbers: QWK: GPT-only ≈ 0.388; GPT + linguistic features ≈ 0.605

Practical Use: Don't rely on LLMs alone for high-stakes scoring; combine LLM signals with engineered linguistic features and validate against benchmarks.

Evidence Ref: Section 5, Mizumoto & Eguchi (GPT-3.5 on TOEFL11)
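One simple way to try the "combine LLM signals with linguistic features" advice is a weighted blend of two scores, with the weight tuned against human scores on a development set. The Python sketch below is an illustration under that assumption only; it is not the method used in the cited study, and the function names and toy numbers are hypothetical.

```python
def mean_squared_error(predicted, gold):
    """Average squared difference between predicted and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


def fit_blend_weight(llm_scores, feature_scores, human_scores, steps=100):
    """Grid-search w in [0, 1] so that w*llm + (1-w)*feature best matches human scores.

    llm_scores and feature_scores are hypothetical inputs: one score per essay
    from an LLM prompt and from a feature-based scorer, on the same scale.
    """
    if not (len(llm_scores) == len(feature_scores) == len(human_scores)):
        raise ValueError("all score lists must have the same length")
    if not human_scores:
        raise ValueError("score lists must not be empty")
    best_weight, best_error = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blended = [w * l + (1 - w) * f for l, f in zip(llm_scores, feature_scores)]
        error = mean_squared_error(blended, human_scores)
        if error < best_error:
            best_weight, best_error = w, error
    return best_weight, best_error


# Toy development set (made-up numbers, not data from the paper).
weight, error = fit_blend_weight(
    llm_scores=[2, 4, 6],
    feature_scores=[4, 4, 4],
    human_scores=[3, 4, 5],
)
print(f"best LLM weight: {weight:.2f}, dev MSE: {error:.4f}")
```

A weight near 0 or 1 on real data would suggest that one of the two signals adds little; the chosen weight should then be checked on a separate held-out sample.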

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Automated essay scoring agreement (QWK)	GPT-only ≈ 0.388	Human reference scores	—	TOEFL11 (ETS corpus)	Mizumoto & Eguchi experiments reported in paper	Section 5
Automated essay scoring agreement (QWK)	GPT + linguistic features ≈ 0.605	Human reference scores	≈ +0.217 vs GPT-only	TOEFL11	Combining GPT signals with engineered linguistic features improved agreement	Section 5
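The agreement values in the table are quadratic weighted kappa (QWK) scores. As a rough illustration of what the metric measures, here is a minimal Python sketch of the standard definition: one minus the ratio of observed to chance-expected disagreement, where each disagreement is weighted by the squared distance between the two ratings. The function name and the example ratings are illustrative and do not come from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa between two equal-length lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and equal in length")
    n_categories = max_rating - min_rating + 1
    if n_categories < 2:
        raise ValueError("the rating scale needs at least two categories")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0.0:
        # Both raters used one and the same category for every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up ratings on a 1-5 scale, for illustration only.
human = [1, 2, 3, 4, 5, 3, 2]
model = [1, 3, 3, 4, 4, 2, 2]
print(round(quadratic_weighted_kappa(human, model, 1, 5), 3))
```

Values near 1 indicate close agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.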

What To Try In 7 Days

Run a pilot: generate low-stakes reading texts with an LLM and have editors accept/reject outputs

A/B test LLM-written feedback versus template-based feedback on user engagement

Combine LLM scoring signals with existing linguistic feature models and compare to current scorer on a held-out sample
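For the A/B test in the list above, one common way to check whether an engagement difference is more than noise is a two-proportion z-test. The Python sketch below assumes engagement is recorded as a yes/no outcome per user (for example, "returned the next day"); the function name and the counts are hypothetical and only illustrate the calculation.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), where z is positive when group B's rate is higher.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one user")
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0.0:
        # Every user had the same outcome, so there is nothing to compare.
        return 0.0, 1.0
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: A = template-based feedback, B = LLM-written feedback.
z, p = two_proportion_z_test(success_a=40, n_a=100, success_b=60, n_b=100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A one-week pilot with few users will often be underpowered, so a non-significant result here is weak evidence either way.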

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Mostly a survey and position paper; contains limited new empirical results

Empirical citations are preliminary and use small human-evaluation samples in places

When Not To Use

For high-stakes automated scoring without human review

As sole source of grammatical correction for learning when minimal edits are required

Failure Modes

Over-fluent edits that change learner intent and confuse learning feedback

Inconsistent difficulty estimates depending on prompt and assumed student population

Core Entities

Models

GPT-3, GPT-3.5, GPT-4, ChatGPT, BERT, BART, T5, GECToR, Grammarly

Metrics

Quadratic weighted kappa (QWK), BLEU, ROUGE, BERTScore, Human preference

Datasets

TOEFL11, CoNLL-2014, JFLEG, BEA-2019, RACE, CEPOC

Benchmarks

GEC benchmarks (CoNLL-2014, JFLEG, BEA-2019); Automated essay scoring (TOEFL11 comparisons); TSAR lexical simplification shared task

Context Entities

Models

PaLM, LaMDA, LLaMA, OPT, Gopher, GPT-Neo, BLOOM

Metrics

Automatic text-similarity scores, User engagement signals

Datasets

The Pile, RACE, CLOTH

Benchmarks

HELM (Holistic Evaluation of Language Models)

