Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
LLM-based agents mix developer code with unpredictable text outputs. Static checks catch most integration and dependency issues before deployment, which reduces service outages and wrong tool calls.
Summary TLDR
This paper defines eight common code- and integration-level defects in single LLM-based agents (from 6,854 StackOverflow posts), then builds Agentable: a static-analysis system that combines Code Property Graphs (CPGs) and an LLM to find those defects. The authors release two datasets (AgentSet: 84 real agents; AgentTest: 78 synthetic agents). On these sets Agentable reports 889 issues in the wild, achieves 88.79% precision and 91.03% recall on annotated tests, costs about $24.2 for full runs, and averages ~0.81 hours per project. The tool flags defects like wrong LLM choice, missing tool metadata, output parsing gaps, tool return bugs, trigger-word conflicts, missing fault tolerance, bad LM
Problem Statement
LLM-based agents mix developer code and natural-language outputs. This loose coupling creates repeatable code defects (tool invocation failures, parsing errors, API mistakes, dependency conflicts) that cause outages and wrong outputs. There is no prior systematic taxonomy or practical static tool that finds these agent-specific defects.
Main Contribution
A taxonomy of eight agent defect types derived from manual analysis of 6,854 StackOverflow posts.
Agentable: a static-analysis tool that combines Code Property Graphs (CPGs) with LLM reasoning to detect those eight defects.
Two datasets: AgentSet (84 real-world agents) and AgentTest (78 labeled agents), plus an evaluation showing 88.79% precision and 91.03% recall.
An empirical study reporting 889 detected defects across 84 real projects and guidance for practical mitigations.
Key Findings
Large-scale empirical source for defects.
Eight repeatable defect categories defined for LLM agents.
Agentable precision on real projects.
Agentable recall on held-out annotated tests.
Prevalence of defects in the wild.
Cost and time of running Agentable.
Results
overall_precision: 88.79%
recall: 91.03%
defects_detected: 889 (across 84 real projects)
avg_analysis_time: ~0.81 hours per project
llm_cost: about $24.2 for full runs
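The precision and recall figures follow the standard definitions: precision is the share of reported defects that are real, and recall is the share of real defects that are reported. A minimal sketch, assuming hypothetical counts of true positives, false positives, and false negatives (the counts below are illustrative and are not taken from the paper):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion counts.

    tp: reported defects that are real
    fp: reported defects that are not real
    fn: real defects that were not reported
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(tp=8, fp=2, fn=0)
```

Returning 0.0 when a denominator is zero is a convenience choice here; some evaluations treat that case as undefined instead.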
Who Should Care
What To Try In 7 Days
Run Agentable on one production or staging agent to get a prioritized defect report.
Verify the agent's LLM choice and API parameters (api_key, stop tokens) and fix obvious LARD/ADAL issues.
As a quick win, add simple input/output fault tolerance (type and format checks) around all LLM and tool calls to cover LOPE/MNFT cases.
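The type and format checks suggested in the last step can be sketched as follows. This is a minimal sketch, assuming the agent expects the LLM to reply with a JSON object containing a `tool` string and an `args` object; that schema, and the names `parse_tool_call` and `call_with_retry`, are hypothetical and not part of Agentable or any framework.

```python
import json
from typing import Any, Callable


def parse_tool_call(raw: str) -> dict[str, Any]:
    """Validate an LLM reply that should be a JSON tool call.

    Raises ValueError with a readable message instead of letting a
    decode or key error escape into the agent loop.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("LLM output must be a JSON object")
    if not isinstance(data.get("tool"), str):
        raise ValueError("missing or non-string 'tool' field")
    if not isinstance(data.get("args"), dict):
        raise ValueError("missing or non-object 'args' field")
    return data


def call_with_retry(llm: Callable[[str], str], prompt: str,
                    attempts: int = 3) -> dict[str, Any]:
    """Call the LLM and re-ask when the reply fails validation."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_tool_call(llm(prompt))
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid tool call after {attempts} attempts") from last_error
```

Here `llm` is any callable that maps a prompt string to a reply string, so the same wrapper can sit in front of whichever client the agent already uses.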
Agent Features
Memory
- short-term memory (context/observation injection)
- long-term memory referenced in taxonomy
Planning
- prompt-guided planning (ReAct, Chain-of-Thought)
Tool Use
- external tool registry and invocation
- tool metadata (name/description) drives selection
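Because the agent selects tools from their name and description, a registry can be checked for missing or duplicate metadata before the agent runs. A minimal sketch, assuming a hypothetical `Tool` dataclass rather than any specific framework's tool class:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    """Hypothetical tool record: the metadata plus the callable it wraps."""
    name: str
    description: str
    func: Callable[[str], str]


def metadata_problems(tools: list[Tool]) -> list[str]:
    """Report tools with an empty name, a duplicate name, or an empty description."""
    problems: list[str] = []
    seen: set[str] = set()
    for index, tool in enumerate(tools):
        label = tool.name.strip() or f"<tool #{index}>"
        if not tool.name.strip():
            problems.append(f"{label}: empty name")
        elif tool.name in seen:
            problems.append(f"{label}: duplicate name")
        seen.add(tool.name)
        if not tool.description.strip():
            problems.append(f"{label}: empty description")
    return problems
```

Running this check at startup turns a silent tool-selection failure into an explicit error list.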
Frameworks
- LangChain
- AutoGen
- LlamaIndex
- Flowise
Is Agentic
true
Architectures
- LLM-driven single-agent (prompt-based planning)
Reproducibility
Open Source Status
- partial
Risks & Boundaries
Limitations
- Agentable supports only standalone Python agents (no multi-agent systems).
- EPDD (dependency) checks produce warnings only; version conflicts may require runtime/packaging checks.
- Some complex fault-tolerance patterns (e.g., nonstandard if-based checks) are missed.
- Detection quality depends on LLM prompts and model; LLM judgement can cause over-detection.
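To illustrate why nonstandard fault-tolerance patterns are hard for a static check, here is a deliberately simple sketch that flags risky calls made outside a `try` body. It uses Python's standard `ast` module as a stand-in; it is not Agentable's CPG-based analysis, and the function name and the `risky_names` parameter are assumptions for illustration.

```python
import ast


def unguarded_calls(source: str, risky_names: set[str]) -> list[int]:
    """Return line numbers of calls to risky_names that are not inside a try body."""
    tree = ast.parse(source)
    hits: list[int] = []

    def visit(node: ast.AST, guarded: bool) -> None:
        # A function defined inside a try block is not protected when called later.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda)):
            guarded = False
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                name = func.attr
            else:
                name = getattr(func, "id", None)
            if name in risky_names and not guarded:
                hits.append(node.lineno)
        if isinstance(node, ast.Try):
            # Only the try body is protected; handlers, else, and finally are not.
            for child in node.body:
                visit(child, True)
            for part in (node.handlers, node.orelse, node.finalbody):
                for child in part:
                    visit(child, guarded)
            return
        for child in ast.iter_child_nodes(node):
            visit(child, guarded)

    visit(tree, False)
    return sorted(hits)
```

A check like this only recognizes `try`/`except`. Code that validates the result with an `if` statement after the call would still be flagged, which is the same kind of miss the limitation above describes.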
When Not To Use
- Multi-agent systems with inter-agent message passing.
- Non-Python agent implementations where Joern/AST pipelines are unavailable.
- When you need runtime fuzzing of all possible LLM outputs rather than static checks.
Failure Modes
- False positives when code uses heavy abstraction or many BaseModel subclasses.
- Missed defects for complex, nonstandard fault-tolerance logic.
- Over-detection when the LLM makes unnecessary judgments about code intent.
- EPDD reports that cannot confirm actual version conflicts from static code alone.
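Since static code alone cannot confirm a version conflict, a runtime check against the installed environment can complement the EPDD warnings. A minimal sketch using the standard library's `importlib.metadata`; the function name `pin_mismatches` and the exact-match comparison are assumptions, and real projects may need full version-specifier handling instead.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def pin_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Compare expected exact versions against what is installed.

    Returns {package: (expected, installed)} for every package that is
    missing (installed is None) or installed at a different version.
    """
    mismatches: dict[str, tuple[str, Optional[str]]] = {}
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches
```

Calling this at agent startup with the project's pinned versions gives a concrete list of conflicts in the running environment, which a static warning cannot provide.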
Core Entities
Models
- gpt-4o-mini
Metrics
- precision
- recall
- analysis_time_per_project
- detected_defects_count
Datasets
- AgentSet (84 projects)
- AgentTest (78 labeled agents)
- StackOverflow posts (6,854 collected; 331 filtered)

