LACA: use GPT-3.5 to speed deductive qualitative coding while checking reliability

June 23, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

61

Authors

Robert Chew, John Bollenbacher, Michael Wenger, Jessica Speer, Annice Kim

Why It Matters For Business

LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.

Summary TLDR

The authors present LLM-Assisted Content Analysis (LACA): a practical workflow that uses GPT-3.5 (gpt-3.5-turbo) to co-develop codebooks, run short validity checks, and either assist or replace human coders for deductive coding. Tested on four public datasets (Trump tweets, Contrarian Claims, BBC news, Ukraine water reports), GPT-3.5 often reaches human-level agreement on many codes and is faster (examples: Contrarian Claims 144s→4s per doc). Simple hypothesis tests flag codes the model struggles with (e.g., character-formatting codes). Model-produced reasons help debug mistakes but should not be taken as faithful explanations. The method reduces time but needs codebook refinement, manual QA

Problem Statement

Deductive coding requires humans to read and consistently label many texts. This is slow and costly. The paper asks: can a current LLM (GPT-3.5) speed coding while keeping reliability, and how should researchers incorporate LLMs into a standard coding workflow?

Main Contribution

Propose LACA, a stepwise workflow to integrate LLMs into deductive coding (codebook co-development, reliability checks, final coding).

Formalize hypothesis tests of randomness to flag codes the model may be guessing.

Show a benchmark on four public datasets comparing GPT-3.5 vs human coders on agreement and time.

Demonstrate value of model-generated reasons to debug prompts and codebooks.
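The randomness check named in the second contribution can be sketched as an exact one-sided binomial test on human-model agreements. This is a minimal illustration, not the authors' code: the function names and the null agreement rate of 0.5 (a binary code guessed at random) are assumptions made for the example; the α=0.05 threshold is the one reported in the results.

```python
from math import comb


def binomial_tail_p(agreements: int, trials: int, p_null: float = 0.5) -> float:
    """Exact one-sided p-value P(X >= agreements) for X ~ Binomial(trials, p_null).

    p_null=0.5 is an assumption: it models random guessing on a binary code.
    """
    if not 0 <= agreements <= trials:
        raise ValueError("agreements must be between 0 and trials")
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(agreements, trials + 1)
    )


def may_be_guessing(agreements: int, trials: int, alpha: float = 0.05) -> bool:
    """True when the test fails to reject the random-guessing null at level alpha."""
    return binomial_tail_p(agreements, trials) > alpha
```

A code for which `may_be_guessing` returns True is a candidate for codebook revision or for human-only coding, since the model's agreement cannot be distinguished from chance on that sample.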

Key Findings

GPT-3.5 often matches human agreement on many coding tasks.

Numbers: Human-model Gwet's AC1 frequently ≥0.76; examples: MAGA 0.98, MEDI 0.96

LLM coding is much faster than human coding on long or complex texts.

Numbers: Contrarian Claims: human 144s/doc → LLM 4s/doc (≈36× faster)

Simple hypothesis tests detect many model failures early.

Numbers: Several Trump tweet codes failed to reject randomness in binomial tests (e.g., HSTG p=0.19, CAPT p=0.76).

Character/formatting codes and some 'group vs individual' rules caused most LLM errors.

Numbers: Formatting codes HSTG, ATSN, CAPT had low human-model AC1 (HSTG 0.18, CAPT 0.36)

Model-generated reasons help surface hallucinations and guide codebook edits.

Numbers: Qualitative examples show valid and hallucinated reasons; used to revise prompts and codes
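The agreement figures above are Gwet's AC1 values. As a rough sketch of how that statistic is computed for two raters (for example, one human and the model), the function below applies the standard two-rater AC1 formula. The function name and argument layout are assumptions for illustration and are not taken from the paper's code.

```python
from collections import Counter
from typing import Hashable, Sequence


def gwet_ac1(
    rater_a: Sequence[Hashable],
    rater_b: Sequence[Hashable],
    categories: Sequence[Hashable],
) -> float:
    """Gwet's AC1 for two raters labeling the same items.

    AC1 = (p_a - p_e) / (1 - p_e), where p_a is observed agreement and
    p_e = sum_q pi_q * (1 - pi_q) / (Q - 1), with pi_q the average share
    of labels that the two raters assigned to category q.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    if len(categories) < 2:
        raise ValueError("AC1 needs at least two categories")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts = Counter(rater_a) + Counter(rater_b)  # pooled label counts
    chance = sum(
        (counts[c] / (2 * n)) * (1 - counts[c] / (2 * n)) for c in categories
    ) / (len(categories) - 1)
    return (observed - chance) / (1 - chance)
```

Passing `categories` explicitly keeps the chance term well defined even when one category never appears in the sample, which is common for rare codes.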

Results

Inter-rater agreement (Gwet's AC1)

Value: Many human-model AC1 values ≥0.76 on multiple codes; failures down to 0.18

Baseline: human-human AC1

Randomness tests (detect guessing)

Value: Some codes did not reject randomness (p>0.05), indicating potential guessing

Baseline: binomial/chi-sq test with α=0.05

Coding time per document

Value: Humans: 72–144 s/doc; GPT-3.5: 4–52 s/doc

Baseline: human coder times in experiment

What To Try In 7 Days

Pick one coding task and sample 100 documents.

Run LACA: co-develop prompt, get model reasons, run randomness tests on model outputs.

Compute human-model IRR (Gwet's AC1) on a 100-doc calibration set and review disagreements with model reasons.
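The first two steps above might look like the sketch below: draw a reproducible 100-document sample, build a per-code prompt that asks for a label plus a reason, and parse the reply. Everything here is hypothetical scaffolding for illustration (the function names, the prompt wording, and the `LABEL:`/`REASON:` reply format are assumptions, not the paper's prompts), and the actual model call is deliberately left out.

```python
import random
import re
from typing import Optional, Sequence, Tuple


def sample_documents(docs: Sequence[str], k: int = 100, seed: int = 0) -> list:
    """Draw a reproducible calibration sample of up to k documents."""
    rng = random.Random(seed)
    return rng.sample(list(docs), min(k, len(docs)))


def build_prompt(code_name: str, definition: str, document: str) -> str:
    """Assemble a prompt for one binary code (hypothetical wording)."""
    return (
        f"Code: {code_name}\n"
        f"Definition: {definition}\n"
        f"Document: {document}\n"
        "Answer on two lines:\n"
        "LABEL: 0 or 1\n"
        "REASON: one sentence citing the text that supports the label"
    )


# Assumed reply format; a real reply may deviate, so parsing can fail.
_RESPONSE = re.compile(r"LABEL:\s*([01])\s*\nREASON:\s*(.+)", re.DOTALL)


def parse_response(text: str) -> Optional[Tuple[int, str]]:
    """Return (label, reason), or None when the reply is malformed."""
    match = _RESPONSE.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()
```

Replies that parse to `None` are worth counting separately: a high malformed rate usually points to a prompt problem and should not be treated as a coding disagreement.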

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Single LLM evaluated (gpt-3.5-turbo); results may differ for other models.
  • Formatting and character-level codes cause errors due to tokenizer limits.
  • Authors did limited prompt engineering; better prompts might improve results.
  • Using LLMs reduces direct human reading, which can limit discovery of new themes.

When Not To Use

  • When your codes rely on character-level features (hashtags, capitalization)—use regex instead.
  • When you need exactly reproducible model outputs but cannot log prompts and model metadata.
  • When code definitions are vague and cannot be clarified with examples.
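For the character-level case in the first bullet, a deterministic rule is usually simpler than a model call. The sketch below assumes HSTG means "contains a hashtag", ATSN means "contains an @-mention", and CAPT means "contains a word written entirely in capitals"; the codebook definitions are not given in this summary, so treat these patterns as placeholders to adapt.

```python
import re

# Assumed definitions; adjust the patterns to the actual codebook.
HASHTAG = re.compile(r"(?<!\w)#\w+")          # '#' not preceded by a word character
MENTION = re.compile(r"(?<!\w)@\w+")          # '@' not preceded by a word character (skips emails)
ALLCAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")   # two or more consecutive capitals as a whole word


def code_formatting(text: str) -> dict:
    """Apply the three character-level codes with regular expressions."""
    return {
        "HSTG": HASHTAG.search(text) is not None,
        "ATSN": MENTION.search(text) is not None,
        "CAPT": ALLCAPS_WORD.search(text) is not None,
    }
```

For example, `code_formatting("Thank you @Team, see you soon #rally")` flags HSTG and ATSN but not CAPT, and the result is identical on every run.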

Failure Modes

  • Model randomly guesses on some codes (detected by randomness tests).
  • Hallucinated or incorrect reasons that look plausible but are unsupported by text.
  • Tokenization causes missed character-level signals (hashtags, ALLCAPS).
  • Model conflates group references with mentions of individuals without codebook fixes.

Core Entities

Models

  • gpt-3.5-turbo

Metrics

  • Gwet's AC1
  • binomial test
  • chi-squared test
  • coding time (sec/doc)

Datasets

  • Trump Tweets
  • Contrarian Claims
  • BBC News
  • Ukraine Water Problems