Agentable: a static analyzer that finds eight common defects in LLM-based agents and flags 889 issues in 84 projects

December 24, 2024 | 8 min

Overview

Decision Snapshot: Needs Validation

Agentable shows practical utility with high recall and precision on curated tests, but it currently only supports Python standalone agents and relies on an external LLM for reasoning.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Kaiwen Ning, Jiachi Chen, Jingwen Zhang, Wei Li, Zexu Wang, Yuming Feng, Weizhe Zhang, Zibin Zheng

Why It Matters For Business

LLM-based agents mix code with unpredictable text outputs. Static checks catch most integration and dependency issues before deployment, which reduces service outages and wrong tool calls.

Summary TLDR

This paper defines eight common code- and integration-level defects in single LLM-based agents (from 6,854 StackOverflow posts), then builds Agentable: a static-analysis system that combines Code Property Graphs (CPGs) and an LLM to find those defects. The authors release two datasets (AgentSet: 84 real agents; AgentTest: 78 synthetic agents). On these sets Agentable reports 889 issues in the wild, achieves 88.79% precision and 91.03% recall on annotated tests, costs about $24.2 for full runs, and averages ~0.81 hours per project. The tool flags defects like wrong LLM choice, missing tool metadata, output parsing gaps, tool return bugs, trigger-word conflicts, missing fault tolerance, bad LM

Problem Statement

LLM-based agents mix developer code and natural-language outputs. This loose coupling creates repeatable code defects (tool invocation failures, parsing errors, API mistakes, dependency conflicts) that cause outages and wrong outputs. There is no prior systematic taxonomy or practical static tool that finds these agent-specific defects.

Main Contribution

A taxonomy of eight agent defect types derived from manual analysis of 6,854 StackOverflow posts.

Agentable: a static-analysis tool that combines Code Property Graphs (CPGs) with LLM reasoning to detect those eight defects.
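
As a rough illustration of the kind of static check described here, the sketch below flags LLM-style method calls that are not wrapped in a `try` block. It uses Python's standard `ast` module as a stand-in; it is not the Code Property Graph pipeline that Agentable uses, and it involves no LLM reasoning. The method names in `LLM_CALL_NAMES` and the sample source are hypothetical.

```python
import ast

# Hypothetical method names treated as "LLM calls" for this sketch.
LLM_CALL_NAMES = {"invoke", "predict", "complete"}


class UnprotectedCallFinder(ast.NodeVisitor):
    """Collect line numbers of LLM-style calls that sit outside any try body."""

    def __init__(self):
        self.try_depth = 0
        self.findings = []

    def visit_Try(self, node):
        # Only the try body counts as protected; handlers, else and
        # finally blocks are visited at the outer depth.
        self.try_depth += 1
        for child in node.body:
            self.visit(child)
        self.try_depth -= 1
        for part in (node.handlers, node.orelse, node.finalbody):
            for child in part:
                self.visit(child)

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in LLM_CALL_NAMES:
            if self.try_depth == 0:
                self.findings.append(node.lineno)
        self.generic_visit(node)


def find_unprotected_llm_calls(source: str) -> list[int]:
    finder = UnprotectedCallFinder()
    finder.visit(ast.parse(source))
    return finder.findings


SAMPLE = '''
def risky(llm, prompt):
    return llm.invoke(prompt)

def guarded(llm, prompt):
    try:
        return llm.invoke(prompt)
    except Exception:
        return None
'''

print(find_unprotected_llm_calls(SAMPLE))  # [3]
```

A purely syntactic check like this cannot tell whether a call really targets an LLM, which is one reason a tool might pair graph queries with LLM reasoning.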

Key Findings

Large-scale empirical source for defects.

Numbers: 6,854 StackOverflow posts collected; 331 posts analyzed

Practical Use: Use community Q&A as a representative source to prioritize real-world agent defects when building checks.

Evidence Ref: Section III (Data Collection & Preprocessing)

Eight repeatable defect categories defined for LLM agents.

Numbers: 8 defect types (ADAL, IETI, LOPE, TRE, ALS, MNFT, LARD, EPDD)

Practical Use: Use these eight categories as a checklist during code review and CI for agent projects.

Evidence Ref: Table II; Section III-D
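
One way to turn the eight categories into a review checklist is a small enumeration, sketched below. The codes come from the taxonomy. The one-line glosses are paraphrased from this summary, and pairing each gloss with a code is an assumption of the sketch, not something the paper is quoted on here. The `unchecked` helper is hypothetical.

```python
from enum import Enum


class Defect(Enum):
    # Codes are from the taxonomy; glosses and their pairing with the
    # codes are assumptions made for illustration.
    ADAL = "LLM choice does not fit the agent"
    IETI = "tool metadata missing or incomplete"
    LOPE = "LLM output parsing gaps"
    TRE = "tool return value bugs"
    ALS = "trigger-word conflicts"
    MNFT = "missing fault tolerance"
    LARD = "LLM API usage problems"
    EPDD = "external dependency conflicts"


def unchecked(reviewed: set[Defect]) -> list[str]:
    """Return the codes a review has not covered yet, in taxonomy order."""
    return [d.name for d in Defect if d not in reviewed]


print(unchecked({Defect.ADAL, Defect.LOPE}))
```

A CI job could fail a pull request while `unchecked(...)` is non-empty, though how each category is actually verified is left open here.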

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
overall_precision | 88.79% | - | - | AgentSet sampled reports (n=339) | Table III: weighted precision across defect types | Table III
recall | 91.03% | - | - | AgentTest (78 labeled agents) | Table IV: detected 71 of 78 defect agents | Table IV
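
The recall figure can be checked directly from the counts in the recall row (71 detected out of 78 labeled agents). The variable names below are illustrative only.

```python
detected = 71  # defect agents flagged, per the recall row
labeled = 78   # labeled agents in AgentTest

recall_pct = round(100 * detected / labeled, 2)
print(recall_pct)  # 91.03
```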

What To Try In 7 Days

Run Agentable on one production or staging agent to get a prioritized defect report.

Verify the agent's LLM choice and API parameters (api_key, stop tokens) and fix obvious LARD/ADAL issues.

Add simple input/output fault tolerance (type and format checks) around all LLM and tool calls to cover LOPE/MNFT cases; these are quick wins.
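
The type and format checks in the last step could look roughly like the sketch below. It assumes the agent expects the LLM to reply with a JSON object of the form `{"tool": ..., "args": {...}}`; that reply shape and the function name `parse_tool_request` are hypothetical and not taken from the paper or from any framework.

```python
import json


def parse_tool_request(raw: str) -> dict:
    """Validate an LLM reply that is expected to be a JSON tool request.

    Returns {"ok": True, ...} on success, or {"ok": False, "error": ...}
    so the caller can retry or fall back instead of crashing.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"not valid JSON: {exc.msg}"}

    # Format checks: the reply must be an object with the expected fields.
    if not isinstance(data, dict):
        return {"ok": False, "error": "expected a JSON object"}
    tool = data.get("tool")
    args = data.get("args")
    if not isinstance(tool, str) or not tool:
        return {"ok": False, "error": "missing or non-string 'tool'"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "missing or non-object 'args'"}
    return {"ok": True, "tool": tool, "args": args}


print(parse_tool_request('{"tool": "search", "args": {"q": "agents"}}'))
print(parse_tool_request("Sure! I will call the search tool."))
```

Returning a structured error instead of raising lets the agent loop decide whether to re-prompt the model or stop, which is the behavior a missing-fault-tolerance check would look for.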

Agent Features

Memory
short-term memory (context/observation injection)
long-term memory referenced in taxonomy
Planning
prompt-guided planning (ReAct, Chain-of-Thought)
Tool Use
external tool registry and invocation
tool metadata (name/description) drive selection
Frameworks
LangChain
AutoGen
LlamaIndex
Flowise
Is Agentic

Yes

Architectures
LLM-driven single-agent (prompt-based planning)
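
Because tool metadata (name and description) drives tool selection, as noted under Tool Use above, a registry can be sanity-checked before the agent runs. The sketch below is a minimal example under stated assumptions: the `Tool` record and `metadata_problems` function are hypothetical, and real frameworks define their own tool classes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    # Hypothetical minimal tool record for this sketch.
    name: str
    description: str
    func: Callable[..., object]


def metadata_problems(tools: list[Tool]) -> list[str]:
    """Report tools whose name or description is empty or duplicated."""
    problems = []
    seen = set()
    for tool in tools:
        label = tool.name.strip() or "<unnamed>"
        if not tool.name.strip():
            problems.append(f"{label}: empty name")
        elif tool.name in seen:
            problems.append(f"{label}: duplicate name")
        seen.add(tool.name)
        if not tool.description.strip():
            problems.append(f"{label}: empty description")
    return problems


registry = [
    Tool("search", "Look up web pages for a query.", len),
    Tool("calc", "", len),
    Tool("search", "A second tool reusing the same name.", len),
]
print(metadata_problems(registry))
```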

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Agentable supports only Python standalone agents (no multi-agent systems).

EPDD (dependency) checks produce warnings only; version conflicts may require runtime/packaging checks.
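
A runtime check of the kind the dependency limitation points to could be sketched as follows. It compares exact version pins with what is installed, using the standard `importlib.metadata` module. The function name `check_pins`, the injectable `lookup` parameter, and the package names are hypothetical; only exact `==` style pins are handled, and this is not how Agentable performs its EPDD check.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins: dict[str, str], lookup=version) -> list[str]:
    """Compare exact version pins with installed versions; return warnings."""
    warnings = []
    for package, wanted in pins.items():
        try:
            found = lookup(package)
        except PackageNotFoundError:
            warnings.append(f"{package}: not installed (wanted {wanted})")
            continue
        if found != wanted:
            warnings.append(f"{package}: installed {found}, wanted {wanted}")
    return warnings


# Hypothetical package name; on most machines this reports "not installed".
print(check_pins({"some-agent-framework": "0.1.0"}))
```

Passing a custom `lookup` makes the function easy to test without touching the real environment.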

When Not To Use

Multi-agent systems with inter-agent message passing.

Non-Python agent implementations where Joern/AST pipelines are unavailable.

Failure Modes

False positives when code uses heavy abstraction or many BaseModel subclasses.

Missed defects for complex, nonstandard fault-tolerance logic.

Core Entities

Models

gpt-4o-mini

Metrics

precision
recall
analysis_time_per_project
detected_defects_count

Datasets

AgentSet (84 projects)
AgentTest (78 labeled agents)
StackOverflow posts (6,854 collected; 331 filtered)