LACA: use GPT-3.5 to speed deductive qualitative coding while checking reliability

Overview

Decision Snapshot: Needs Validation

Method is practical and tested on four public datasets; results are promising but limited to one commercial LLM and modest prompt engineering.

Citations: 61

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 50%

Authors

Robert Chew, John Bollenbacher, Michael Wenger, Jessica Speer, Annice Kim

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead

Summary TLDR

The authors present LLM-Assisted Content Analysis (LACA), a practical workflow that uses GPT-3.5 (gpt-3.5-turbo) to co-develop codebooks, run short validity checks, and either assist or replace human coders in deductive coding. Tested on four public datasets (Trump tweets, Contrarian Claims, BBC news, Ukraine water reports), GPT-3.5 reaches human-level agreement on many codes and is faster (for example, Contrarian Claims drops from 144 s to 4 s per document). Simple hypothesis tests flag codes the model struggles with (e.g., character-formatting codes). Model-produced reasons help debug mistakes but should not be taken as faithful explanations. The method reduces time but needs codebook refinement, manual QA

Problem Statement

Deductive coding requires humans to read and consistently label many texts. This is slow and costly. The paper asks: can a current LLM (GPT-3.5) speed coding while keeping reliability, and how should researchers incorporate LLMs into a standard coding workflow?

Main Contribution

Propose LACA, a stepwise workflow to integrate LLMs into deductive coding (codebook co-development, reliability checks, final coding).

Formalize hypothesis tests of randomness to flag codes the model may be guessing.
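A minimal sketch of one such randomness test for a binary code. The summary does not spell out the exact null hypothesis, so this assumes a one-sided exact binomial test in which a guessing model agrees with the human label on each document with probability 0.5. The function names and the 0.5 null are illustrative assumptions, not the authors' implementation.

```python
from math import comb


def binomial_tail_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= successes) under the null."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def looks_like_guessing(human, model, alpha=0.05, p_null=0.5):
    """Return (flag, p_value); flag is True when randomness is NOT rejected.

    Assumption: under random guessing, the model matches the human label
    on each document independently with probability p_null.
    """
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(h == m for h, m in zip(human, model))
    p_value = binomial_tail_p(matches, len(human), p_null)
    return p_value > alpha, p_value
```

A code that is flagged (p above α) is a candidate for codebook revision or for handling outside the LLM. For codes with more than two categories, a chi-squared goodness-of-fit test would play the same role.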

Key Findings

GPT-3.5 often matches human agreement on many coding tasks.

Numbers: Human-model Gwet's AC1 frequently ≥0.76; examples: MAGA 0.98, MEDI 0.96

Practical Use: You can often rely on GPT-3.5 for many deductive codes after calibration; validate with IRR first

Evidence Ref: Tables 4, 7, 13
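For readers who want to compute the agreement statistic themselves, here is a minimal sketch of Gwet's AC1 for two raters (for example, one human coder and the model) using the standard two-rater formula: AC1 = (pa − pe) / (1 − pe), where pa is observed agreement and pe = (1 / (Q − 1)) Σ πq(1 − πq), with πq the mean of the two raters' marginal proportions for category q. The function name and the optional `categories` argument are illustrative; by default the sketch takes the category set from the observed labels, which is an assumption when a codebook category never appears in the sample.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 chance-corrected agreement for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    if categories is None:
        # Assumption: the codebook's categories are those seen in the data.
        categories = set(rater_a) | set(rater_b)
    q = len(categories)
    if q < 2:
        return 1.0  # only one possible category, so agreement is trivial

    # Observed agreement: share of documents with identical labels.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from the averaged marginal proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    pe = sum(
        pi * (1 - pi)
        for pi in ((counts_a[c] + counts_b[c]) / (2 * n) for c in categories)
    ) / (q - 1)

    return (pa - pe) / (1 - pe)
```

Pass the full category list from the codebook via `categories` when the calibration sample is small, so that unobserved categories still count toward Q.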

LLM coding is much faster than human coding on long or complex texts.

Numbers: Contrarian Claims: human 144 s/doc → LLM 4 s/doc (≈36× faster)

Practical Use: Use LLMs to reduce time and cost for large-scale coding, especially for long texts or many categories

Evidence Ref: Table 8
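To see what the reported per-document times imply at scale, the arithmetic can be sketched as follows. The 144 s and 4 s figures are the ones reported for Contrarian Claims; the corpus size of 10,000 documents is a hypothetical value chosen only for illustration, and the estimate ignores codebook refinement, manual QA, and API cost.

```python
def coding_hours(n_docs: int, seconds_per_doc: float) -> float:
    """Total coding time in hours for a corpus."""
    return n_docs * seconds_per_doc / 3600


HUMAN_SEC_PER_DOC = 144  # reported for Contrarian Claims
LLM_SEC_PER_DOC = 4      # reported for Contrarian Claims
N_DOCS = 10_000          # hypothetical corpus size, not from the paper

human_hours = coding_hours(N_DOCS, HUMAN_SEC_PER_DOC)
llm_hours = coding_hours(N_DOCS, LLM_SEC_PER_DOC)
speedup = HUMAN_SEC_PER_DOC / LLM_SEC_PER_DOC

print(f"human: {human_hours:.1f} h, LLM: {llm_hours:.1f} h, speedup: {speedup:.0f}x")
```

Under these assumptions the human pass takes 400 hours and the LLM pass about 11 hours, which matches the ≈36× ratio above.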

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Inter-rater agreement (Gwet's AC1)	Many human-model AC1 values ≥0.76 on multiple codes; failures down to 0.18	human-human AC1	model comparable to human on many codes; large gaps on a few	See per-code tables (Trump Tweets, Ukraine, BBC, Contrarian)	Tables 4,7,13	Tables 4,7,13
Randomness tests (detect guessing)	Some codes did not reject randomness (p>0.05) indicating potential guessing	binomial/chi-sq test with α=0.05	—	Trump Tweets (HSTG p=0.19, CAPT p=0.76); Ukraine env_problems p=0.62	Table 2, Table 7	Table 2, Table 7

What To Try In 7 Days

Pick one coding task and sample 100 documents.

Run LACA: co-develop prompt, get model reasons, run randomness tests on model outputs.

Compute human-model IRR (Gwet's AC1) on a 100-doc calibration set and review disagreements with model reasons.
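The sampling and disagreement-review steps above can be sketched as follows. This is an illustrative helper, not part of the authors' released code: the function names, the dictionary layout (document ID mapped to code or reason), and the fixed seed are all assumptions.

```python
import random


def draw_calibration_sample(doc_ids, size=100, seed=0):
    """Reproducibly sample document IDs for the calibration set."""
    ids = list(doc_ids)
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(ids, min(size, len(ids)))


def disagreements(human_codes, model_codes, model_reasons):
    """Return (doc_id, human, model, reason) rows where the two coders differ.

    All three arguments are dicts keyed by document ID (assumed layout).
    """
    rows = []
    for doc_id, human in human_codes.items():
        model = model_codes.get(doc_id)
        if model != human:
            rows.append((doc_id, human, model, model_reasons.get(doc_id, "")))
    return rows
```

Reading the model's stated reason next to each disagreement helps separate codebook ambiguity from model error, bearing in mind that the reasons are a debugging aid rather than a faithful explanation.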

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://doi.org/10.6084/m9.figshare.23291147

Data URLs

https://doi.org/10.6084/m9.figshare.23291147

Risks & Boundaries

Limitations

Single LLM evaluated (gpt-3.5-turbo); results may differ for other models.

Formatting and character-level codes cause errors due to tokenizer limits.

When Not To Use

When your codes rely on character-level features (hashtags, capitalization)—use regex instead.

When you need exact model outputs to be fully reproducible but cannot log prompts and model metadata.
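For the character-level case in the first item, a rule-based coder is a simple alternative. The sketch below is illustrative only: the summary does not give the codebook definitions for the hashtag and capitalization codes, so the rules here (any `#word` token; any fully capitalized word of two or more letters) are assumptions to adapt to the actual codebook.

```python
import re

# Assumed rule: '#' followed by word characters, not glued to a preceding word.
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
# Assumed rule: a standalone word of two or more uppercase ASCII letters.
ALL_CAPS_WORD_RE = re.compile(r"\b[A-Z]{2,}\b")


def has_hashtag(text: str) -> bool:
    """True if the text contains at least one hashtag token."""
    return bool(HASHTAG_RE.search(text))


def has_all_caps_word(text: str) -> bool:
    """True if the text contains at least one fully capitalized word."""
    return bool(ALL_CAPS_WORD_RE.search(text))
```

Rules like these are deterministic, so they avoid both the tokenizer issue and the reproducibility concern for these particular codes.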

Failure Modes

Model randomly guesses on some codes (detected by randomness tests).

Hallucinated or incorrect reasons that look plausible but are unsupported by text.

Core Entities

Models

gpt-3.5-turbo

Metrics

Gwet's AC1, binomial test, chi-squared test, coding time (sec/doc)

Datasets

Trump Tweets, Contrarian Claims, BBC News, Ukraine Water Problems

