Overview
Agentable shows practical utility with high recall and precision on curated tests, but it currently only supports Python standalone agents and relies on an external LLM for reasoning.
Citations: 0
Evidence Strength: 0.80
Confidence: 0.80
Risk Signals: 11
Trust Signals
Findings with numeric evidence: 6/6
Findings with evidence refs: 6/6
Results with explicit delta: 0/5
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 60%
Why It Matters For Business
LLM-based agents mix code with unpredictable text outputs. Static checks can reduce service outages and wrong tool calls by catching most integration and dependency issues before deployment.
Who Should Care
Summary TLDR
This paper defines eight common code- and integration-level defects in single LLM-based agents (from 6,854 StackOverflow posts), then builds Agentable: a static-analysis system that combines Code Property Graphs (CPGs) and an LLM to find those defects. The authors release two datasets (AgentSet: 84 real agents; AgentTest: 78 synthetic agents). On these sets Agentable reports 889 issues in the wild, achieves 88.79% precision and 91.03% recall on annotated tests, costs about $24.2 for full runs, and averages ~0.81 hours per project. The tool flags defects like wrong LLM choice, missing tool metadata, output parsing gaps, tool return bugs, trigger-word conflicts, missing fault tolerance, bad LM
Problem Statement
LLM-based agents mix developer code and natural-language outputs. This loose coupling creates repeatable code defects (tool invocation failures, parsing errors, API mistakes, dependency conflicts) that cause outages and wrong outputs. There is no prior systematic taxonomy or practical static tool that finds these agent-specific defects.
Main Contribution
A taxonomy of eight agent defect types derived from manual analysis of 6,854 StackOverflow posts.
Agentable: a static-analysis tool that combines Code Property Graphs (CPGs) with LLM reasoning to detect those eight defects.
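To make the idea of a static check for an agent defect concrete, here is a minimal sketch of one rule: flagging LLM or tool calls that are not wrapped in a `try` block (the "missing fault tolerance" category). It is not Agentable's implementation. It uses Python's standard `ast` module rather than a Code Property Graph, involves no LLM reasoning, and the call names in `RISKY_CALLS` are assumptions chosen for illustration.

```python
import ast

# Hypothetical method names treated as "LLM or tool calls" in this sketch.
RISKY_CALLS = {"invoke", "run"}


def unguarded_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for risky calls that are not inside a try body."""
    tree = ast.parse(source)
    findings: list[tuple[int, str]] = []

    def visit(node: ast.AST, guarded: bool) -> None:
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both obj.method(...) and plain function(...) calls.
            if isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = getattr(func, "id", None)
            if name in RISKY_CALLS and not guarded:
                findings.append((node.lineno, name))
        if isinstance(node, ast.Try):
            # Only the try body is protected; handlers, else, and finally
            # keep whatever protection the enclosing scope provides.
            for child in node.body:
                visit(child, True)
            for part in (node.handlers, node.orelse, node.finalbody):
                for child in part:
                    visit(child, guarded)
            return
        for child in ast.iter_child_nodes(node):
            visit(child, guarded)

    visit(tree, False)
    return sorted(findings)


SAMPLE = '''
def answer(llm, question):
    return llm.invoke(question)

def safe_answer(llm, question):
    try:
        return llm.invoke(question)
    except Exception:
        return None
'''

print(unguarded_calls(SAMPLE))  # [(3, 'invoke')]
```

A purely syntactic rule like this treats any enclosing `try` as sufficient and cannot follow wrappers or custom retry helpers. That is consistent with the failure modes listed for heavy abstraction and nonstandard fault-tolerance logic.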
Key Findings
The defect taxonomy rests on a large-scale empirical source: 6,854 StackOverflow posts.
Eight repeatable defect categories defined for LLM agents.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| overall_precision | 88.79% | — | — | AgentSet sampled reports (n=339) | Table III: weighted precision across defect types | Table III |
| recall | 91.03% | — | — | AgentTest (78 labeled agents) | Table IV: detected 71 of 78 defect agents | Table IV |
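The recall figure follows directly from the counts in the table: 71 of the 78 defective AgentTest agents were detected. The sketch below reproduces that arithmetic. It does not cover the precision value, which is described as a weighted figure across defect types and cannot be recomputed from the numbers given here.

```python
def recall(detected: int, total_defective: int) -> float:
    """Fraction of known-defective agents that the tool flagged."""
    return detected / total_defective


detected, total = 71, 78  # counts reported for AgentTest
print(f"recall = {recall(detected, total):.2%}")  # recall = 91.03%
print(f"missed = {total - detected}")             # missed = 7
```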
What To Try In 7 Days
Run Agentable on one production or staging agent to get a prioritized defect report.
Verify the agent's LLM choice and API parameters (api_key, stop tokens) and fix obvious LARD/ADAL issues.
As a quick win, add simple input/output fault tolerance (type and format checks) around all LLM and tool calls to cover LOPE/MNFT cases.
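The last item can be sketched as a small wrapper that checks the type and format of an LLM response before the agent acts on it. This is a minimal illustration under stated assumptions, not code from the paper. The function name `call_with_checks`, the JSON output format, and the retry policy are all hypothetical, and `call` stands in for whatever client function the agent uses.

```python
import json
from typing import Any, Callable


def call_with_checks(
    call: Callable[[str], str],
    prompt: str,
    required_keys: set[str],
    retries: int = 2,
) -> dict[str, Any]:
    """Call an LLM-like function and accept only a JSON object with the required keys."""
    last_error: Exception | None = None
    for _ in range(retries + 1):
        try:
            raw = call(prompt)
            if not isinstance(raw, str):  # type check
                raise TypeError(f"expected str, got {type(raw).__name__}")
            data = json.loads(raw)  # format check; JSONDecodeError is a ValueError
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(data)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            return data
        except (TypeError, ValueError) as exc:
            last_error = exc  # remember the failure and retry
    raise RuntimeError(f"no valid output after {retries + 1} attempts") from last_error


# Usage with a stub in place of a real model: the first reply is malformed.
replies = iter(["Sure! Here is the answer...", '{"tool": "search", "args": {"q": "x"}}'])
result = call_with_checks(lambda prompt: next(replies), "pick a tool", {"tool", "args"})
print(result["tool"])  # search
```

Raising after the retry budget is spent, instead of returning a default, keeps a malformed response from silently triggering the wrong tool call.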
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Risks & Boundaries
Limitations
Agentable supports only Python standalone agents (no multi-agent systems).
EPDD (dependency) checks produce warnings only; version conflicts may require runtime/packaging checks.
When Not To Use
Multi-agent systems with inter-agent message passing.
Non-Python agent implementations where Joern/AST pipelines are unavailable.
Failure Modes
False positives when code uses heavy abstraction or many BaseModel subclasses.
Missed defects for complex, nonstandard fault-tolerance logic.

