LACA: use GPT-3.5 to speed deductive qualitative coding while checking reliability

June 23, 2023 | 7 min

Overview

Decision Snapshot: Needs Validation

Method is practical and tested on four public datasets; results are promising but limited to one commercial LLM and modest prompt engineering.

Citations: 61

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Robert Chew, John Bollenbacher, Michael Wenger, Jessica Speer, Annice Kim

Why It Matters For Business

LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.

Summary TLDR

The authors present LLM-Assisted Content Analysis (LACA), a practical workflow that uses GPT-3.5 (gpt-3.5-turbo) to co-develop codebooks, run short validity checks, and either assist or replace human coders in deductive coding. Across four public datasets (Trump tweets, Contrarian Claims, BBC news, Ukraine water reports), GPT-3.5 reaches human-level agreement on many codes and is faster (for example, Contrarian Claims drops from 144s to 4s per document). Simple hypothesis tests flag codes the model struggles with (e.g., character-formatting codes). Model-produced reasons help debug mistakes but should not be taken as faithful explanations. The method reduces time but needs codebook refinement, manual QA

Problem Statement

Deductive coding requires humans to read and consistently label many texts. This is slow and costly. The paper asks: can a current LLM (GPT-3.5) speed coding while keeping reliability, and how should researchers incorporate LLMs into a standard coding workflow?

Main Contribution

Propose LACA, a stepwise workflow to integrate LLMs into deductive coding (codebook co-development, reliability checks, final coding).

Formalize hypothesis tests of randomness to flag codes the model may be guessing.
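One way to make the randomness test concrete is an exact one-sided binomial test on a single binary code. The sketch below is an illustration, not the authors' implementation: it assumes the null hypothesis is that the model matches the human label at a chance rate `p0` (0.5 here), and `binomial_p_value` and `looks_like_guessing` are hypothetical helper names.

```python
from math import comb


def binomial_p_value(agreements: int, n: int, p0: float = 0.5) -> float:
    """One-sided exact binomial test: P(X >= agreements) under chance rate p0."""
    if not 0 <= agreements <= n:
        raise ValueError("agreements must be between 0 and n")
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(agreements, n + 1)
    )


def looks_like_guessing(agreements: int, n: int, alpha: float = 0.05) -> bool:
    """Flag a code when the test fails to reject randomness at level alpha."""
    return binomial_p_value(agreements, n) >= alpha


# Hypothetical counts: 52 matches out of 100 is consistent with chance,
# 80 out of 100 is not.
print(looks_like_guessing(52, 100))  # True
print(looks_like_guessing(80, 100))  # False
```

A flagged code is a prompt to revisit the codebook definition or fall back to human coding, not proof that the model is guessing.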

Key Findings

GPT-3.5 matches human-level agreement on many coding tasks.

Numbers: Human-model Gwet's AC1 frequently ≥0.76; examples: MAGA 0.98, MEDI 0.96

Practical Use: After calibration, you can rely on GPT-3.5 for many deductive codes; validate with IRR first

Evidence Ref: Tables 4, 7, 13

LLM coding is much faster than human coding on long or complex texts.

Numbers: Contrarian Claims: human 144s/doc → LLM 4s/doc (≈36× faster)

Practical Use: Use LLMs to reduce time and cost for large-scale coding, especially for long texts or many categories

Evidence Ref: Table 8
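The speedup figure follows directly from the two per-document times. The short sketch below reproduces the arithmetic and scales it to a corpus; the corpus size of 10,000 documents is a hypothetical value chosen for illustration, and the estimate ignores time spent on codebook refinement and manual QA.

```python
HUMAN_SEC_PER_DOC = 144  # Contrarian Claims, human coder (from the summary)
LLM_SEC_PER_DOC = 4      # Contrarian Claims, GPT-3.5 (from the summary)


def speedup(human_sec: float, llm_sec: float) -> float:
    """Ratio of human to model coding time per document."""
    return human_sec / llm_sec


def coding_hours(sec_per_doc: float, n_docs: int) -> float:
    """Total coding time in hours for n_docs documents."""
    return sec_per_doc * n_docs / 3600


n_docs = 10_000  # hypothetical corpus size, for illustration only
print(speedup(HUMAN_SEC_PER_DOC, LLM_SEC_PER_DOC))      # 36.0
print(coding_hours(HUMAN_SEC_PER_DOC, n_docs))          # 400.0
print(round(coding_hours(LLM_SEC_PER_DOC, n_docs), 1))  # 11.1
```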

Results

Metric: Inter-rater agreement (Gwet's AC1)
Value: Many human-model AC1 values ≥0.76 on multiple codes; failures down to 0.18
Baseline: human-human AC1
Delta: model comparable to human on many codes; large gaps on a few
Split / Dataset: See per-code tables (Trump Tweets, Ukraine, BBC, Contrarian)
Evidence Ref: Tables 4, 7, 13

Metric: Randomness tests (detect guessing)
Value: Some codes did not reject randomness (p>0.05), indicating potential guessing
Baseline: binomial/chi-sq test with α=0.05
Delta: (not given)
Split / Dataset: Trump Tweets (HSTG p=0.19, CAPT p=0.76); Ukraine env_problems p=0.62
Evidence Ref: Table 2, Table 7
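For readers who want to compute the agreement metric themselves, here is a minimal sketch of Gwet's AC1 for two raters using the standard two-rater formula, AC1 = (p_a - p_e) / (1 - p_e). It is not taken from the paper's code; the function name `gwet_ac1` and the example labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def gwet_ac1(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable],
             categories: Sequence[Hashable]) -> float:
    """Gwet's AC1 for two raters over a fixed set of categories."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Average marginal proportion of each category across the two raters.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pi = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]
    # Chance agreement as defined for AC1.
    p_e = sum(p * (1 - p) for p in pi) / (len(categories) - 1)
    return (p_a - p_e) / (1 - p_e)


# Hypothetical labels for one binary code on five documents.
human = [1, 1, 1, 1, 0]
model = [1, 1, 1, 0, 0]
print(round(gwet_ac1(human, model, categories=[0, 1]), 3))  # 0.655
```

Passing `categories` explicitly keeps the chance term well defined even when one label never appears in the sample.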

What To Try In 7 Days

Pick one coding task and sample 100 documents.

Run LACA: co-develop prompt, get model reasons, run randomness tests on model outputs.

Compute human-model IRR (Gwet's AC1) on a 100-doc calibration set and review disagreements with model reasons.
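The sampling and review steps above can be sketched as follows. This is an illustrative outline under stated assumptions: the function names, the `(label, reason)` shape of `model_outputs`, and the fixed seed are all hypothetical and not part of the paper's released code.

```python
import random


def draw_calibration_sample(doc_ids, k=100, seed=0):
    """Seeded sample so the same calibration set can be redrawn later."""
    ids = list(doc_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def disagreements(sample_ids, human_labels, model_outputs):
    """Return (doc_id, human_label, model_label, model_reason) for mismatches.

    human_labels maps doc_id -> label; model_outputs maps doc_id ->
    (label, reason). Both shapes are assumptions for this sketch.
    """
    rows = []
    for doc_id in sample_ids:
        model_label, reason = model_outputs[doc_id]
        if model_label != human_labels[doc_id]:
            rows.append((doc_id, human_labels[doc_id], model_label, reason))
    return rows


# Tiny made-up example with three documents.
human_labels = {"a": 1, "b": 0, "c": 1}
model_outputs = {"a": (1, "mentions slogan"), "b": (1, "all caps"), "c": (1, "hashtag")}
print(disagreements(["a", "b", "c"], human_labels, model_outputs))
# [('b', 0, 1, 'all caps')]
```

Reading the model's reason next to each mismatch helps decide whether the codebook wording or the model is at fault, keeping in mind that the reasons are not guaranteed to be faithful.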

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Single LLM evaluated (gpt-3.5-turbo); results may differ for other models.

Formatting and character-level codes cause errors due to tokenizer limits.
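Codes that depend on character-level features can be handled with deterministic rules in place of the model. The sketch below shows what that could look like in Python; the two patterns are assumed stand-ins for hashtag and capitalization codes, and the actual codebook definitions may differ.

```python
import re

# Assumed rule: "#" followed by at least one word character.
HASHTAG_RE = re.compile(r"#\w+")
# Assumed rule: a whole word made of two or more uppercase ASCII letters.
ALL_CAPS_WORD_RE = re.compile(r"\b[A-Z]{2,}\b")


def has_hashtag(text: str) -> bool:
    """True if the text contains at least one hashtag."""
    return HASHTAG_RE.search(text) is not None


def has_all_caps_word(text: str) -> bool:
    """True if the text contains at least one fully capitalized word."""
    return ALL_CAPS_WORD_RE.search(text) is not None


print(has_hashtag("Vote today #MAGA"))    # True
print(has_all_caps_word("This is HUGE"))  # True
print(has_all_caps_word("I agree"))       # False
```

Rules like these are exact and cost nothing per document, so the model can be reserved for codes that need interpretation.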

When Not To Use

When your codes rely on character-level features (hashtags, capitalization)—use regex instead.

When you need exactly reproducible model outputs but cannot log prompts and model metadata.
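Logging each model call is one way to keep runs auditable. The following is a minimal sketch of such a log, written as one JSON line per call; the field names, the `temperature` field, and the file layout are assumptions for illustration, not a format prescribed by the paper.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_log_record(doc_id, prompt, model, temperature, response_text):
    """Build one JSON-serializable record per model call."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "doc_id": doc_id,
        "model": model,  # e.g. "gpt-3.5-turbo"
        "temperature": temperature,
        "prompt": prompt,
        # Hash makes it easy to spot prompt changes between runs.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response": response_text,
    }


def append_jsonl(path, record):
    """Append one record as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A log like this does not make a hosted model deterministic, but it records what was asked and what came back, so coded labels can be traced to a specific prompt and model version.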

Failure Modes

Model randomly guesses on some codes (detected by randomness tests).

Hallucinated or incorrect reasons that look plausible but are unsupported by text.

Core Entities

Models

gpt-3.5-turbo

Metrics

Gwet's AC1, binomial test, chi-squared test, coding time (sec/doc)

Datasets

Trump Tweets, Contrarian Claims, BBC News, Ukraine Water Problems