Agentable: a static analyzer that finds eight common defects in LLM-based agents and flags 889 issues in 84 projects

Overview

Decision Snapshot: Needs Validation

Agentable shows practical utility with high recall and precision on curated tests, but it currently only supports Python standalone agents and relies on an external LLM for reasoning.

Citations: 0

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/5

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 60%

Authors

Kaiwen Ning, Jiachi Chen, Jingwen Zhang, Wei Li, Zexu Wang, Yuming Feng, Weizhe Zhang, Zibin Zheng

Links

Abstract / PDF

Why It Matters For Business

LLM-based agents mix code with unpredictable text outputs. Static checks can catch most integration and dependency issues before deployment, which reduces service outages and wrong tool calls.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO, Founder

Summary TLDR

This paper defines eight common code- and integration-level defects in single LLM-based agents (from 6,854 StackOverflow posts), then builds Agentable: a static-analysis system that combines Code Property Graphs (CPGs) and an LLM to find those defects. The authors release two datasets (AgentSet: 84 real agents; AgentTest: 78 synthetic agents). On these sets Agentable reports 889 issues in the wild, achieves 88.79% precision and 91.03% recall on annotated tests, costs about $24.2 for full runs, and averages ~0.81 hours per project. The tool flags defects like wrong LLM choice, missing tool metadata, output parsing gaps, tool return bugs, trigger-word conflicts, missing fault tolerance, bad LM

Problem Statement

LLM-based agents mix developer code and natural-language outputs. This loose coupling creates repeatable code defects (tool invocation failures, parsing errors, API mistakes, dependency conflicts) that cause outages and wrong outputs. There is no prior systematic taxonomy or practical static tool that finds these agent-specific defects.

Main Contribution

A taxonomy of eight agent defect types derived from manual analysis of 6,854 StackOverflow posts.

Agentable: a static-analysis tool that combines Code Property Graphs (CPGs) with LLM reasoning to detect those eight defects.

Key Findings

Large-scale empirical source for defects.

Numbers: 6,854 StackOverflow posts collected; 331 posts analyzed

Practical Use: Use community Q&A as a representative source to prioritize real-world agent defects when building checks.

Evidence Ref: Section III (Data Collection & Preprocessing)

Eight repeatable defect categories defined for LLM agents.

Numbers: 8 defect types (ADAL, IETI, LOPE, TRE, ALS, MNFT, LARD, EPDD)

Practical Use: Use these eight categories as a checklist during code review and CI for agent projects.

Evidence Ref: Table II; Section III-D

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
overall_precision	88.79%	—	—	AgentSet sampled reports (n=339)	Table III: weighted precision across defect types	Table III
recall	91.03%	—	—	AgentTest (78 labeled agents)	Table IV: detected 71 of 78 defect agents	Table IV
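
The recall figure can be checked directly from the counts quoted in the table. The short sketch below only redoes that arithmetic (71 detected out of 78 labeled defect agents); the variable names are illustrative and not taken from the paper.

```python
# Recall on AgentTest, recomputed from the counts quoted in the Results table.
detected = 71  # defect agents flagged by Agentable (Table IV, as quoted above)
total = 78     # labeled defect agents in AgentTest

recall = detected / total
recall_pct = round(recall * 100, 2)

print(f"recall = {detected}/{total} = {recall_pct}%")  # recall = 71/78 = 91.03%
```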

What To Try In 7 Days

Run Agentable on one production or staging agent to get a prioritized defect report.

Verify the agent's LLM choice and API parameters (api_key, stop tokens) and fix obvious LARD/ADAL issues.

Add simple input/output fault tolerance (type and format checks) around all LLM and tool calls as a quick win that covers LOPE/MNFT cases.
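
As a rough illustration of the third step, the sketch below wraps an LLM call with a format check and a bounded retry. It is a minimal example under stated assumptions, not Agentable's method: `parse_llm_json`, `call_with_retry`, and the `tool`/`query` keys are hypothetical names, and the LLM is replaced by a stub so the block runs on its own.

```python
import json
from typing import Any, Callable


def parse_llm_json(raw: str, required: dict[str, type]) -> dict[str, Any]:
    """Parse raw LLM text as a JSON object and check required keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("LLM output must be a JSON object")
    for key, expected in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"key {key!r} should be {expected.__name__}")
    return data


def call_with_retry(
    call_llm: Callable[[], str],
    required: dict[str, type],
    attempts: int = 3,
) -> dict[str, Any]:
    """Call the LLM up to `attempts` times until its output passes the checks."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_llm_json(call_llm(), required)
        except ValueError as exc:
            last_error = exc  # remember the failure and try again
    raise RuntimeError(f"no valid LLM output after {attempts} attempts") from last_error


# Stub standing in for a real LLM client (assumption): first reply is free
# text, second reply is well-formed JSON.
replies = iter(["Sure! Here you go", '{"tool": "search", "query": "weather"}'])
result = call_with_retry(lambda: next(replies), {"tool": str, "query": str})
print(result)
```

The same pattern applies to tool return values: validate the type and shape before handing the result back to the agent loop, and fail loudly rather than passing malformed data along.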

Agent Features

Memory

short-term memory (context/observation injection)

long-term memory referenced in taxonomy

Planning

prompt-guided planning (ReAct, Chain-of-Thought)

Tool Use

external tool registry and invocation

tool metadata (name/description) drive selection
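
Because the agent picks tools from their name and description, a missing or duplicated entry can misroute calls. The sketch below shows one way such a registry might be sanity-checked. It is an assumption-laden illustration: the `Tool` dataclass and `find_metadata_issues` are hypothetical and do not correspond to any framework's API or to Agentable's own checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """Hypothetical tool record: the metadata an LLM would see plus the callable."""
    name: str
    description: str
    func: Callable[..., object]


def find_metadata_issues(tools: list[Tool]) -> list[str]:
    """Report tools with an empty name, a duplicate name, or an empty description."""
    issues = []
    seen = set()
    for index, tool in enumerate(tools):
        label = tool.name.strip() or f"<tool #{index}>"
        if not tool.name.strip():
            issues.append(f"{label}: empty name")
        elif tool.name in seen:
            issues.append(f"{label}: duplicate name")
        seen.add(tool.name)
        if not tool.description.strip():
            issues.append(f"{label}: empty description")
    return issues


tools = [
    Tool("search", "Look up a query on the web.", lambda q: q),
    Tool("calc", "", lambda expr: expr),
    Tool("search", "Search internal docs.", lambda q: q),
]
issues = find_metadata_issues(tools)
print(issues)  # ['calc: empty description', 'search: duplicate name']
```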

Frameworks

LangChain, AutoGen, LlamaIndex, Flowise

Is Agentic

Yes

Architectures

LLM-driven single-agent (prompt-based planning)

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Agentable supports only Python standalone agents (no multi-agent systems).

EPDD (dependency) checks produce warnings only; version conflicts may require runtime/packaging checks.

When Not To Use

Multi-agent systems with inter-agent message passing.

Non-Python agent implementations where Joern/AST pipelines are unavailable.

Failure Modes

False positives when code uses heavy abstraction or many BaseModel subclasses.

Missed defects for complex, nonstandard fault-tolerance logic.

Core Entities

Models

gpt-4o-mini

Metrics

precision, recall, analysis_time_per_project, detected_defects_count

Datasets

AgentSet (84 projects)

AgentTest (78 labeled agents)

StackOverflow posts (6,854 collected; 331 filtered)

