Agentable: a static analyzer that finds eight common defects in LLM-based agents and flags 889 issues in 84 projects

December 24, 2024 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Kaiwen Ning, Jiachi Chen, Jingwen Zhang, Wei Li, Zexu Wang, Yuming Feng, Weizhe Zhang, Zibin Zheng

Why It Matters For Business

LLM-based agents mix code with unpredictable text outputs. Static checks can catch many integration and dependency issues before deployment, which reduces service outages and wrong tool calls.

Summary TLDR

This paper defines eight common code- and integration-level defects in single LLM-based agents (from 6,854 StackOverflow posts), then builds Agentable: a static-analysis system that combines Code Property Graphs (CPGs) and an LLM to find those defects. The authors release two datasets (AgentSet: 84 real agents; AgentTest: 78 synthetic agents). Agentable reports 889 issues across the real-world projects, reaches 88.79% precision on 339 sampled reports and 91.03% recall on AgentTest, costs about $24.20 in LLM usage for the full runs, and averages about 0.81 hours per project. The tool flags defects like wrong LLM choice, missing tool metadata, output parsing gaps, tool return bugs, trigger-word conflicts, missing fault tolerance, bad LM

Problem Statement

LLM-based agents mix developer code and natural-language outputs. This loose coupling creates repeatable code defects (tool invocation failures, parsing errors, API mistakes, dependency conflicts) that cause outages and wrong outputs. There is no prior systematic taxonomy or practical static tool that finds these agent-specific defects.

Main Contribution

A taxonomy of eight agent defect types derived from 6,854 collected StackOverflow posts, of which 331 were analyzed manually.

Agentable: a static-analysis tool that combines Code Property Graphs (CPGs) with LLM reasoning to detect those eight defects.

Two datasets: AgentSet (84 real-world agents) and AgentTest (78 labeled agents), plus an evaluation showing 88.79% precision and 91.03% recall.

An empirical study reporting 889 detected defects across 84 real projects and guidance for practical mitigations.

Key Findings

Large-scale empirical source for defects.

Numbers: 6,854 StackOverflow posts collected; 331 posts analyzed

Eight repeatable defect categories defined for LLM agents.

Numbers: 8 defect types (ADAL, IETI, LOPE, TRE, ALS, MNFT, LARD, EPDD)

Agentable precision on real projects.

Numbers: overall precision 88.79% (339 sampled reports)

Agentable recall on held-out annotated tests.

Numbers: recall 91.03% on AgentTest (71/78)

Prevalence of defects in the wild.

Numbers: 889 defects detected across 84 AgentSet projects

Cost and time of running Agentable.

Numbers: $24.20 total LLM cost; avg 0.81 hours per project

Results

overall_precision

Value: 88.79%

recall

Value: 91.03%

defects_detected

Value: 889

avg_analysis_time

Value: 0.81 hours / project

llm_cost

Value: $24.20 total
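The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below recomputes recall from the 71/78 count and derives per-project averages. The per-project cost is an assumption: it attributes the whole $24.20 to the 84 AgentSet projects, which the summary does not state explicitly.

```python
# Counts and totals quoted in this summary.
labeled_found, labeled_total = 71, 78      # AgentTest: detected / labeled
defects, projects = 889, 84                # AgentSet
llm_cost_usd, hours_per_project = 24.20, 0.81

recall = labeled_found / labeled_total
defects_per_project = defects / projects
# Assumption: the full LLM cost is spread over the 84 AgentSet projects.
cost_per_project = llm_cost_usd / projects
total_hours = hours_per_project * projects

print(f"recall: {recall:.2%}")                                  # recall: 91.03%
print(f"defects per project: {defects_per_project:.1f}")        # 10.6
print(f"LLM cost per project: ${cost_per_project:.2f}")         # $0.29
print(f"total analysis time: {total_hours:.1f} hours")          # 68.0 hours
```

Under that assumption, the LLM cost is well under a dollar per project, while wall-clock time (roughly 68 hours in total) is the larger expense.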

Who Should Care

What To Try In 7 Days

Run Agentable on one production or staging agent to get a prioritized defect report.

Verify the agent's LLM choice and API parameters (api_key, stop tokens) and fix obvious LARD/ADAL issues.

Add simple input/output fault tolerance (type and format checks) around all LLM and tool calls to cover LOPE/MNFT cases. These are quick wins.
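As an illustration of the type and format checks suggested above, here is a minimal sketch that validates an LLM reply before it is used as a tool call and re-asks the model when validation fails. The JSON shape (`tool`, `args`), the `ask_llm` callable, and all function names are hypothetical choices for this example, not part of Agentable or of any framework.

```python
from __future__ import annotations

import json
from typing import Any, Callable


class AgentOutputError(ValueError):
    """Raised when an LLM reply cannot be turned into a valid tool call."""


def parse_tool_call(raw: str, allowed_tools: set[str]) -> dict[str, Any]:
    """Check that `raw` is JSON shaped like {"tool": str, "args": dict}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise AgentOutputError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise AgentOutputError("reply must be a JSON object")
    tool, args = data.get("tool"), data.get("args", {})
    if not isinstance(tool, str) or tool not in allowed_tools:
        raise AgentOutputError(f"unknown or missing tool: {tool!r}")
    if not isinstance(args, dict):
        raise AgentOutputError("'args' must be a JSON object")
    return {"tool": tool, "args": args}


def call_with_retries(ask_llm: Callable[[str], str], prompt: str,
                      allowed_tools: set[str], max_attempts: int = 3) -> dict[str, Any]:
    """Re-ask the model when its reply fails validation, then fail loudly."""
    last_error: AgentOutputError | None = None
    for _ in range(max_attempts):
        try:
            return parse_tool_call(ask_llm(prompt), allowed_tools)
        except AgentOutputError as exc:
            last_error = exc
            prompt = f"{prompt}\n\nYour previous reply was rejected: {exc}. Reply with JSON only."
    raise AgentOutputError(f"no valid reply after {max_attempts} attempts") from last_error


# Stand-in for a real model: the first reply is malformed, the second is valid.
replies = iter(["not json", '{"tool": "search", "args": {"q": "weather"}}'])
result = call_with_retries(lambda _prompt: next(replies), "Find the weather", {"search"})
print(result)  # {'tool': 'search', 'args': {'q': 'weather'}}
```

Keeping validation in one function gives every LLM and tool call the same checks, and the bounded retry loop prevents a misbehaving model from stalling the agent indefinitely.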

Agent Features

Memory

  • short-term memory (context/observation injection)
  • long-term memory referenced in taxonomy

Planning

  • prompt-guided planning (ReAct, Chain-of-Thought)

Tool Use

  • external tool registry and invocation
  • tool metadata (name/description) drives selection
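Because the agent selects tools from their name and description, a tool registered without that metadata is a defect that can be spotted statically. The sketch below uses Python's standard `ast` module rather than the Code Property Graph pipeline Agentable is built on, so it only illustrates the idea. The `Tool` constructor and `some_framework` import are hypothetical stand-ins for a framework's tool class.

```python
import ast

# Hypothetical agent code under analysis; it is parsed, never executed.
SOURCE = '''
from some_framework import Tool

search = Tool(name="search", func=run_search, description="Look up web pages.")
calc = Tool(name="calc", func=run_calc)
'''

REQUIRED = {"name", "description"}


def find_tools_missing_metadata(source, constructor="Tool"):
    """Return (line, missing keywords) for each `constructor(...)` call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handle both `Tool(...)` and `module.Tool(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != constructor:
            continue
        # A `**kwargs` argument hides the keywords, so skip rather than guess.
        if any(kw.arg is None for kw in node.keywords):
            continue
        missing = sorted(REQUIRED - {kw.arg for kw in node.keywords})
        if missing:
            findings.append((node.lineno, missing))
    return findings


print(find_tools_missing_metadata(SOURCE))  # [(5, ['description'])]
```

This simple version only inspects keyword arguments, so a tool whose name is passed positionally would be reported incorrectly. Resolving such cases is one reason a richer code representation helps.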

Frameworks

  • LangChain
  • AutoGen
  • LlamaIndex
  • Flowise

Is Agentic

true

Architectures

  • LLM-driven single-agent (prompt-based planning)

Reproducibility

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Agentable supports only standalone Python agents; multi-agent systems are out of scope.
  • EPDD (dependency) checks produce warnings only; version conflicts may require runtime/packaging checks.
  • Some complex fault-tolerance patterns (e.g., nonstandard if-based checks) are missed.
  • Detection quality depends on LLM prompts and model; LLM judgement can cause over-detection.
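Since the dependency (EPDD) checks only produce warnings, a runtime check in the target environment can complement them. The following is a minimal sketch that uses the standard `importlib.metadata` module to compare exact version pins with what is installed. The `check_pins` helper and the package name are hypothetical, and the sketch handles exact pins only, not version ranges.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def check_pins(pins: dict[str, str]) -> dict[str, str]:
    """Compare exact pins (name -> expected version) with the installed environment."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"installed {installed}, expected {expected}"
    return problems


# Hypothetical pin; in practice these would be read from a requirements file.
print(check_pins({"a-package-that-should-not-exist-xyz": "1.0.0"}))
```

Running such a check at startup or in CI turns a static warning into a confirmed conflict, or clears it, before the agent serves traffic.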

When Not To Use

  • Multi-agent systems with inter-agent message passing.
  • Non-Python agent implementations where Joern/AST pipelines are unavailable.
  • When you need runtime fuzzing of all possible LLM outputs rather than static checks.

Failure Modes

  • False positives when code uses heavy abstraction or many BaseModel subclasses.
  • Missed defects for complex, nonstandard fault-tolerance logic.
  • Over-detection when the LLM makes unnecessary judgments about code intent.
  • EPDD reports that cannot confirm actual version conflicts from static code alone.

Core Entities

Models

  • gpt-4o-mini

Metrics

  • precision
  • recall
  • analysis_time_per_project
  • detected_defects_count

Datasets

  • AgentSet (84 projects)
  • AgentTest (78 labeled agents)
  • StackOverflow posts (6,854 collected; 331 filtered)