Overview
Production Readiness
0.6
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
61
Why It Matters For Business
LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.
Summary TLDR
The authors present LLM-Assisted Content Analysis (LACA), a practical workflow that uses GPT-3.5 (gpt-3.5-turbo) to co-develop codebooks, run short validity checks, and either assist or replace human coders in deductive coding. Tested on four public datasets (Trump tweets, Contrarian Claims, BBC news, Ukraine water reports), GPT-3.5 reaches human-level agreement on many codes and is faster (for example, Contrarian Claims drops from 144 s to 4 s per document). Simple hypothesis tests flag codes the model struggles with (e.g., character-formatting codes). Model-produced reasons help debug mistakes but should not be taken as faithful explanations. The method reduces time but needs codebook refinement, manual QA
Problem Statement
Deductive coding requires humans to read and consistently label many texts. This is slow and costly. The paper asks: can a current LLM (GPT-3.5) speed coding while keeping reliability, and how should researchers incorporate LLMs into a standard coding workflow?
Main Contribution
Propose LACA, a stepwise workflow to integrate LLMs into deductive coding (codebook co-development, reliability checks, final coding).
Formalize hypothesis tests of randomness to flag codes the model may be guessing.
Show a benchmark on four public datasets comparing GPT-3.5 vs human coders on agreement and time.
Demonstrate value of model-generated reasons to debug prompts and codebooks.
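The randomness tests in the contributions above can be illustrated with an exact binomial test. This is a minimal sketch and not the authors' implementation: it assumes a binary code, a reference set of human labels, and a null hypothesis that the model agrees with the reference by coin flip (chance rate 0.5). The function names and the 0.05 threshold are hypothetical choices for illustration.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial test: P(X >= successes) for X ~ Binomial(trials, chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def looks_like_guessing(model_labels, reference_labels, chance=0.5, alpha=0.05):
    """Return (flag, p_value); flag is True when chance-level agreement cannot be rejected."""
    if len(model_labels) != len(reference_labels) or not model_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(m == r for m, r in zip(model_labels, reference_labels))
    p_value = binomial_p_value(matches, len(model_labels), chance)
    # Assumption: alpha = 0.05 is an illustrative threshold, not taken from the paper.
    return p_value >= alpha, p_value
```

A code that is flagged here is a candidate for codebook revision or human coding; an unflagged code still needs an agreement check, since rejecting chance-level agreement does not imply human-level reliability.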
Key Findings
GPT-3.5 matches human agreement levels on many coding tasks.
LLM coding is much faster than human coding on long or complex texts.
Simple hypothesis tests detect many model failures early.
Character/formatting codes and some 'group vs individual' rules caused most LLM errors.
Model-generated reasons help surface hallucinations and guide codebook edits.
Results
Inter-rater agreement (Gwet's AC1)
Randomness tests (detect guessing)
Coding time per document
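To put the per-document coding times in context, the following sketch converts them into a speedup factor and total hours for a corpus. The 144 s and 4 s figures are the Contrarian Claims example from the summary; the corpus size of 1,000 documents and the function name are hypothetical. The estimate covers coding time only and leaves out codebook refinement and manual QA.

```python
def coding_time_savings(n_docs: int, human_sec_per_doc: float, llm_sec_per_doc: float) -> dict:
    """Compare total coding time for human and LLM coders over n_docs documents."""
    if llm_sec_per_doc <= 0 or human_sec_per_doc <= 0:
        raise ValueError("per-document times must be positive")
    human_hours = n_docs * human_sec_per_doc / 3600
    llm_hours = n_docs * llm_sec_per_doc / 3600
    return {
        "speedup": human_sec_per_doc / llm_sec_per_doc,
        "human_hours": human_hours,
        "llm_hours": llm_hours,
        "hours_saved": human_hours - llm_hours,
    }


# Assumption: a hypothetical corpus of 1,000 documents.
estimate = coding_time_savings(1000, human_sec_per_doc=144, llm_sec_per_doc=4)
```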
Who Should Care
What To Try In 7 Days
Pick one coding task and sample 100 documents.
Run LACA: co-develop the prompt, collect model reasons, and run randomness tests on the model outputs.
Compute human-model IRR (Gwet's AC1) on a 100-doc calibration set and review disagreements with model reasons.
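The human-model agreement step can be sketched as follows for two raters (one human, one model) on a single code. It uses the standard two-rater form of Gwet's AC1: observed agreement p_a, chance agreement p_e = (1 / (q - 1)) * sum over categories of pi_k * (1 - pi_k), where pi_k is the average share of category k across both raters, and AC1 = (p_a - p_e) / (1 - p_e). The function name is hypothetical, and this is an illustration rather than the authors' code.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters labelling the same items with nominal categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(rater_a)
    # Pass `categories` explicitly when a category may be absent from the sample.
    cats = list(categories) if categories is not None else list(set(rater_a) | set(rater_b))
    q = len(cats)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of items where both raters chose the same label.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from the average prevalence of each category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    prevalence = [(counts_a[c] + counts_b[c]) / (2 * n) for c in cats]
    p_e = sum(pi * (1 - pi) for pi in prevalence) / (q - 1)

    return (p_a - p_e) / (1 - p_e)
```

For a binary code, passing categories=[0, 1] keeps the category count at two even if the calibration sample happens to contain only one label.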
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Single LLM evaluated (gpt-3.5-turbo); results may differ for other models.
- Formatting and character-level codes cause errors due to tokenizer limits.
- Authors did limited prompt engineering; better prompts might improve results.
- Using LLMs reduces direct human reading, which can limit discovery of new themes.
When Not To Use
- When your codes rely on character-level features (hashtags, capitalization); use regex instead.
- When complete reproducibility of exact model outputs is required without logging prompts and model metadata.
- When code definitions are vague and cannot be clarified with examples.
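For the character-level case in the list above, a rule-based check is straightforward. The sketch below uses Python's re module to detect hashtags and all-caps words; the code names (has_hashtag, has_all_caps_word) and the two-letter minimum for an all-caps word are hypothetical and would need to follow the actual codebook definitions.

```python
import re

# A hashtag is '#' followed by word characters, not glued to a preceding word character.
HASHTAG = re.compile(r"(?<!\w)#\w+")
# An all-caps word is two or more ASCII capitals standing alone as a word.
ALL_CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def character_level_codes(text: str) -> dict:
    """Apply deterministic character-level codes to one document."""
    return {
        "has_hashtag": bool(HASHTAG.search(text)),
        "has_all_caps_word": bool(ALL_CAPS_WORD.search(text)),
    }
```

Because these rules are deterministic, they are also exactly reproducible, which makes them a reasonable complement to LLM coding for the remaining, more interpretive codes.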
Failure Modes
- Model randomly guesses on some codes (detected by randomness tests).
- Hallucinated or incorrect reasons that look plausible but are unsupported by text.
- Tokenization causes missed character-level signals (hashtags, ALLCAPS).
- Model conflates group references with mentions of individuals without codebook fixes.
Core Entities
Models
- gpt-3.5-turbo
Metrics
- Gwet's AC1
- binomial test
- chi-squared test
- coding time (sec/doc)
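As an illustration of how the chi-squared test listed above might be applied to a multi-category code, the sketch below runs a goodness-of-fit test of model labels against a uniform distribution. The uniform null is an assumption made here for illustration; the paper's exact formulation may differ. The critical values are the standard upper 5% points of the chi-squared distribution, and the function names are hypothetical.

```python
from collections import Counter

# Upper 5% critical values of the chi-squared distribution, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_squared_uniform(labels, categories) -> float:
    """Chi-squared statistic of observed label counts against equal expected counts."""
    if not labels or len(categories) < 2:
        raise ValueError("need at least one label and two categories")
    expected = len(labels) / len(categories)
    counts = Counter(labels)
    # Labels outside `categories` are ignored by this sketch.
    return sum((counts[c] - expected) ** 2 / expected for c in categories)


def consistent_with_uniform_guessing(labels, categories):
    """Return (flag, statistic); flag is True when uniform guessing cannot be rejected at 5%."""
    df = len(categories) - 1
    if df not in CRITICAL_05:
        raise ValueError("critical value table covers 1 to 5 degrees of freedom only")
    stat = chi_squared_uniform(labels, categories)
    return stat < CRITICAL_05[df], stat
```

Rejecting the uniform null only shows that the label distribution is not flat; it says nothing about whether the labels are correct, so it works as an early screen before the agreement check rather than a replacement for it.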
Datasets
- Trump Tweets
- Contrarian Claims
- BBC News
- Ukraine Water Problems

