Overview
Method is practical and tested on four public datasets; results are promising but limited to one commercial LLM and modest prompt engineering.
Citations: 61
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 5/5
Findings with evidence refs: 5/5
Results with explicit delta: 2/3
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
LLMs can cut the time and cost of large-scale manual coding while keeping results comparable to humans for many categories; validate on a small sample before scaling.
Summary TLDR
The authors present LLM-Assisted Content Analysis (LACA): a practical workflow that uses GPT-3.5 (gpt-3.5-turbo) to co-develop codebooks, run short validity checks, and either assist or replace human coders for deductive coding. Tested on four public datasets (Trump tweets, Contrarian Claims, BBC news, Ukraine water reports), GPT-3.5 often reaches human-level agreement on many codes and is faster (examples: Contrarian Claims 144s→4s per doc). Simple hypothesis tests flag codes the model struggles with (e.g., character-formatting codes). Model-produced reasons help debug mistakes but should not be taken as faithful explanations. The method reduces time but needs codebook refinement, manual QA
Problem Statement
Deductive coding requires humans to read and consistently label many texts. This is slow and costly. The paper asks: can a current LLM (GPT-3.5) speed coding while keeping reliability, and how should researchers incorporate LLMs into a standard coding workflow?
Main Contribution
Propose LACA, a stepwise workflow to integrate LLMs into deductive coding (codebook co-development, reliability checks, final coding).
Formalize hypothesis tests of randomness to flag codes the model may be guessing.
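A randomness check of this kind can be sketched with an exact one-sided binomial test. This is a minimal illustration, not the authors' procedure: the null hypothesis (the model matches the human label with chance probability `p0`), the function names, and the default `p0=0.5` for a binary code are all assumptions. Only the α=0.05 threshold comes from the results reported here.

```python
from math import comb


def binomial_tail_p(successes, trials, p0):
    """Exact P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def looks_like_guessing(model_codes, human_codes, p0=0.5, alpha=0.05):
    """Return (p_value, flagged).

    flagged is True when the test fails to reject the (assumed) null
    hypothesis that the model agrees with the human only at chance rate p0.
    """
    if len(model_codes) != len(human_codes) or not model_codes:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(m == h for m, h in zip(model_codes, human_codes))
    p_value = binomial_tail_p(matches, len(model_codes), p0)
    return p_value, p_value > alpha


# Hypothetical labels for one binary code on ten documents.
human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
model = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 9 of 10 match
p_value, flagged = looks_like_guessing(model, human)
```

A flagged code is a candidate for codebook refinement or manual coding rather than proof that the model is guessing; with small samples the test has little power.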
Key Findings
GPT-3.5 often matches human agreement on many coding tasks.
LLM coding is much faster than human coding on long or complex texts.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Inter-rater agreement (Gwet's AC1) | Many human-model AC1 values ≥0.76 on multiple codes; failures down to 0.18 | human-human AC1 | model comparable to human on many codes; large gaps on a few | See per-code tables (Trump Tweets, Ukraine, BBC, Contrarian) | Tables 4,7,13 | Tables 4,7,13 |
| Randomness tests (detect guessing) | Some codes did not reject randomness (p>0.05) indicating potential guessing | binomial/chi-sq test with α=0.05 | — | Trump Tweets (HSTG p=0.19, CAPT p=0.76); Ukraine env_problems p=0.62 | Table 2, Table 7 | Table 2, Table 7 |
What To Try In 7 Days
Pick one coding task and sample 100 documents.
Run LACA: co-develop prompt, get model reasons, run randomness tests on model outputs.
Compute human-model IRR (Gwet's AC1) on a 100-doc calibration set and review disagreements with model reasons.
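For the IRR step, Gwet's AC1 for two raters can be computed directly from the two label lists. The sketch below uses the standard two-rater formula, AC1 = (pa - pe) / (1 - pe) with pe = (1/(q-1)) Σ π_k (1 - π_k), where π_k is the average share of category k across both raters. The function name, the `categories` parameter, and the sample labels are illustrative assumptions, not taken from the paper's code.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 agreement coefficient for two raters on nominal codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Pass the codebook's categories explicitly if some never occur in the sample.
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of documents where both raters gave the same code.
    pa = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from the average prevalence of each category.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    pe = 0.0
    for c in categories:
        pi_c = (count_a[c] + count_b[c]) / (2 * n)
        pe += pi_c * (1 - pi_c)
    pe /= q - 1

    return (pa - pe) / (1 - pe)


# Hypothetical calibration labels for one binary code.
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
model = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
ac1 = gwet_ac1(human, model, categories=[0, 1])
```

Computing the same coefficient for a human-human pair on the same documents gives the baseline against which the human-model value is compared.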
Risks & Boundaries
Limitations
Single LLM evaluated (gpt-3.5-turbo); results may differ for other models.
Codes that depend on formatting or character-level features are error-prone because of tokenizer limits.
When Not To Use
When your codes rely on character-level features (hashtags, capitalization); use regular expressions instead.
When you need exactly reproducible model outputs but cannot log prompts and model metadata.
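For character-level codes, a deterministic rule is usually simpler than a prompt. The sketch below shows one way to flag hashtags and all-caps words with Python's `re` module. The patterns and function names are illustrative assumptions; the paper's exact code definitions may differ (for example, on how many capital letters count as "capitalized").

```python
import re

# A '#' followed by word characters, not glued to a preceding word character.
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
# A standalone word of two or more ASCII capital letters (assumed definition).
ALL_CAPS_WORD_RE = re.compile(r"\b[A-Z]{2,}\b")


def has_hashtag(text):
    """True if the text contains at least one hashtag."""
    return HASHTAG_RE.search(text) is not None


def has_all_caps_word(text):
    """True if the text contains at least one fully capitalized word."""
    return ALL_CAPS_WORD_RE.search(text) is not None


# Hypothetical example document.
example = "Great rally tonight #Ohio, this is HUGE"
flags = {
    "hashtag": has_hashtag(example),
    "all_caps": has_all_caps_word(example),
}
```

Rules like these are exactly reproducible and cost nothing per document, so the LLM can be reserved for codes that need interpretation.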
Failure Modes
Model randomly guesses on some codes (detected by randomness tests).
Hallucinated or incorrect reasons that look plausible but are unsupported by text.

