How LLM-based coding agents must earn developer trust to be useful

Overview

Decision Snapshot: Needs Validation

The paper is a conceptual, opinion-oriented roadmap with examples and proposals but no controlled experiments or quantitative evaluations.

Citations: 1

Evidence Strength: 0.35

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 0/6

Findings with evidence refs: 6/6

Results with explicit delta: 0/0

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, Baishakhi Ray

Links

Abstract / PDF

Why It Matters For Business

AI coding agents can cut developer time, but only if they earn developer trust through verifiable outputs, provenance, and integrated review processes.

Who Should Care

CTO, Engineering Lead, Product Manager, ML Engineer

Summary TLDR

This short opinion paper argues that deploying AI "software engineers"—LLM-based agents that write, edit, and test code—depends less on raw capability and more on trust. The authors outline technical (testing, static analysis, formal proofs, guardrails) and human (explainability, provenance, review parity) mechanisms for trust, survey early agent systems, and call for unified, explainable agents that integrate coding, testing, and review into developer workflows.

Problem Statement

LLMs can generate and edit code, but industry adoption of fully autonomous AI software engineers is held back by a lack of developer trust. The paper asks how LLM agents can be designed to earn the same practical, reviewable trust that human contributors have.

Main Contribution

Framing trust as the central barrier to adopting AI software engineers and separating technical vs human trust.

Describing what software-engineering LLM agents are: LLM back-ends + tool interaction + autonomy.

Key Findings

Developer trust, not raw generation skill, is the main barrier to widespread adoption of AI software engineers.

Practical Use: Prioritize systems that make AI outputs verifiable and auditable before trying to fully automate coding tasks.

Evidence Ref: Section 1; cites Forbes blog [3]

Integrating standard engineering tools (tests, linters, program analysis) into LLM agents increases technical trust.

Practical Use: Build agents to produce tests and run linters automatically and fail fast when checks fail.

Evidence Ref: Section 4 'Testing and Lightweight Static Analysis'; Table 1
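
The fail-fast idea above can be sketched as a small check runner. This is a minimal illustration and not something taken from the paper: the names `CheckResult`, `run_checks_fail_fast`, `all_passed`, and `EXAMPLE_CHECKS` are hypothetical, and the example commands assume `pyflakes` and `pytest` are installed in the project environment.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks_fail_fast(checks):
    """Run (name, argv) checks in order and stop at the first failure."""
    results = []
    for name, argv in checks:
        proc = subprocess.run(argv, capture_output=True, text=True)
        passed = proc.returncode == 0
        results.append(CheckResult(name, passed, proc.stdout + proc.stderr))
        if not passed:
            break  # fail fast: remaining checks are skipped
    return results


def all_passed(results, expected_count):
    """True only if every configured check ran and succeeded."""
    return len(results) == expected_count and all(r.passed for r in results)


# Hypothetical configuration: lint first (cheap), then tests (slower).
# Assumes pyflakes and pytest are available; substitute your own tools.
EXAMPLE_CHECKS = [
    ("lint", [sys.executable, "-m", "pyflakes", "."]),
    ("tests", [sys.executable, "-m", "pytest", "-q"]),
]
```

Ordering the cheapest check first keeps the feedback loop short: an agent patch that fails linting is rejected before the test suite is ever started, and the captured output can be handed back to the agent or to a human reviewer.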

What To Try In 7 Days

Run an LLM agent to generate a small feature and require it to produce tests; run those tests and linters automatically.

Add provenance tags and a short rationale to AI-generated pull requests before human review.

Implement basic guardrails: input sanitizers and output validators around any code-writing agent.
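
As a rough illustration of the guardrail step in the list above, the sketch below pairs an input sanitizer with an output validator. It is an assumption-laden example, not the paper's design: the patterns in `SUSPICIOUS_PATTERNS` and the names in `DENIED_CALLS` are hypothetical placeholders, the validator only handles Python output, and a regex or denylist of this kind is easy to evade, so it complements human review and does not replace it.

```python
import ast
import re

# Hypothetical heuristics for prompt-injection-like task text.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard .* system prompt", re.IGNORECASE),
]

# Hypothetical denylist of calls the agent's code may not contain.
DENIED_CALLS = {"eval", "exec", "os.system"}


def sanitize_input(task_text):
    """Return (ok, reason) for text about to be sent to the agent."""
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(task_text):
            return False, f"matched suspicious pattern: {pattern.pattern}"
    return True, "ok"


def _call_name(node):
    """Best-effort dotted name for a call such as f() or mod.f()."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return None


def validate_output(code):
    """Return (ok, reason) for Python source produced by the agent."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and _call_name(node) in DENIED_CALLS:
            return False, f"denied call: {_call_name(node)}"
    return True, "ok"
```

Both functions return a reason string along with the verdict so that a rejected request or patch can be logged and shown to the reviewer.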

Agent Features

Memory

code search / retrieval for intent inference

Planning

autonomous nondeterministic work-plans, tool invocation planning

Tool Use

file navigation, code editing, test execution, static analysis, shell commands, web browsing

Is Agentic

Yes

Architectures

LLM back-end

Collaboration

AI-human feedback loops, review parity, confidence and provenance reporting
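
Confidence and provenance reporting could, for example, take the form of a structured block in the description of each AI-generated pull request. The sketch below is one possible shape under stated assumptions: the `Provenance` fields and the rendered labels (`AI-Generated`, `Checks-Run`, and so on) are hypothetical and are not a format defined by the paper or by any code-hosting platform.

```python
from dataclasses import dataclass, field


@dataclass
class Provenance:
    agent: str          # hypothetical agent identifier
    model: str          # hypothetical model identifier
    confidence: float   # self-reported by the agent, in [0, 1]
    rationale: str      # short explanation of the change
    checks_run: list = field(default_factory=list)


def render_pr_body(summary, prov):
    """Build a plain-text PR description with a provenance block."""
    if not 0.0 <= prov.confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    lines = [
        summary,
        "",
        "AI-Generated: yes",
        f"Agent: {prov.agent}",
        f"Model: {prov.model}",
        f"Confidence: {prov.confidence:.2f}",
        f"Checks-Run: {', '.join(prov.checks_run) or 'none'}",
        "",
        f"Rationale: {prov.rationale}",
    ]
    return "\n".join(lines)
```

Keeping the block as simple `Label: value` lines makes it readable for human reviewers while remaining easy to parse if review tooling later needs to filter or audit AI-authored changes.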

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Opinion piece with no empirical user studies or measurements.

No quantitative evaluation of trust interventions or agent designs.

When Not To Use

As sole decision-maker for safety-critical code without formal verification.

As a replacement for human review in regulated environments.

Failure Modes

Agent hallucinations leading to incorrect code.

Prompt-injection or malicious inputs causing unsafe outputs.

Core Entities

Models

LLMs (general), Codex

Metrics

correctness, security, performance, maintainability, explainability, confidence
