Overview
The benchmark exposes long-horizon planning failures and regression risks in CLI agents. Results cover only 20 manually curated tasks, but they consistently show large gaps in agent capabilities.
Citations: 0
Evidence Strength: 0.70
Confidence: 0.85
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/5
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 20%
Novelty: 60%
Why It Matters For Business
Current agents often fail early on long, realistic engineering tasks; adding human-in-the-loop planning and stepwise checks can raise end-to-end success and reduce regression risk.
Summary TLDR
LongCLI-Bench is a curated benchmark of 20 long-horizon CLI software-engineering tasks drawn from 958 CS assignments and 50 real workflows. It uses dual tests—Fail→Pass (implement new requirements) and Pass→Pass (avoid regressions)—plus step-level scoring to pinpoint where multi-step runs break. State-of-the-art commercial and open-source agents score under 20% overall pass rate; most runs stall early (<30% completion). Human plan injection and interactive guidance raise pass rates up to ~62%, showing human-in-the-loop workflows materially help current agents.
Problem Statement
Existing coding/CLI benchmarks focus on short, isolated tasks, are often contaminated by GitHub scraping, and give only binary pass/fail signals. That leaves open whether agents can plan and execute long, interdependent workflows in realistic, isolated environments.
Main Contribution
LongCLI-Bench: 20 manually curated, long-horizon CLI tasks sourced from 958 CS assignments and 50 real workflows to reduce GitHub contamination.
Dual-set evaluation (F2P and P2P) and step-level scoring to measure requirement completion and regression avoidance.
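The dual-set idea can be illustrated with a small scoring sketch. The summary does not give the benchmark's exact formulas, so the definitions here are assumptions: a step counts as completed when all of its tests of a given kind pass, the step score is the fraction of completed steps, and a task passes only when every F2P and P2P test passes. The `StepResult` structure and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    """Test outcomes for one workflow step (hypothetical structure)."""
    name: str
    f2p: list  # Fail->Pass outcomes: tests for newly required behaviour
    p2p: list = field(default_factory=list)  # Pass->Pass outcomes: tests that passed before the run


def step_score(steps, kind):
    """Fraction of steps whose tests of `kind` ("f2p" or "p2p") all pass.

    Assumed definition, not necessarily the benchmark's own formula.
    Steps with no tests of that kind are left out of the average.
    """
    scored = [s for s in steps if getattr(s, kind)]
    if not scored:
        return 1.0  # nothing to check for this kind
    return sum(all(getattr(s, kind)) for s in scored) / len(scored)


def task_passed(steps):
    """A task passes only if every F2P and every P2P test passes (assumption)."""
    return all(all(s.f2p) and all(s.p2p) for s in steps)


# Illustrative, made-up outcomes for a four-step task.
steps = [
    StepResult("setup", f2p=[True, True], p2p=[True]),
    StepResult("parser", f2p=[True, False], p2p=[True]),
    StepResult("cli", f2p=[False], p2p=[False]),
    StepResult("docs", f2p=[True]),
]

f2p_score = step_score(steps, "f2p")  # 2 of 4 steps complete
p2p_score = step_score(steps, "p2p")  # 2 of 3 steps with P2P tests stay green
passed = task_passed(steps)
```

Reporting the two scores separately shows whether a run failed because requirements were left unimplemented (low F2P) or because existing behaviour broke (low P2P), which a single binary pass/fail signal hides.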
Key Findings
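Overall pass rates are very low: every evaluated agent scores under 20%, and the best reaches 16.7% (Claude-Opus-4.6, Table 2).
Most runs stall early in the workflow, at under 30% completion.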
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Best overall Pass Rate | 16.7% (Claude-Opus-4.6) | — | — | LongCLI-Bench (20 tasks) | Table 2 shows top Pass = 16.7% | Table 2 |
| Best F2P Step Score (avg) | 50.7% (Claude-Opus-4.6 F2P Step Score) | — | — | LongCLI-Bench | Table 2 F2P Step Score | Table 2 |
What To Try In 7 Days
Run a small subset of LongCLI-Bench tasks on your agent to measure step-level scores.
Add explicit initial plan injection for multi-step jobs before agent execution.
Instrument CI with P2P-style regression checks whenever agents modify codebases.
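A P2P-style regression check in CI can be as simple as comparing test outcomes from before and after the agent's change. The sketch below is a minimal illustration, assuming test results are already available as dictionaries that map a test ID to pass/fail; the function names and sample test IDs are hypothetical and not part of LongCLI-Bench.

```python
def find_regressions(before, after):
    """Tests that passed before the change but do not pass after it.

    `before` and `after` map test IDs to True (passed) or False (failed).
    A test missing from `after` is treated as a regression, because it
    can no longer be verified.
    """
    return sorted(t for t, ok in before.items() if ok and not after.get(t, False))


def newly_passing(before, after):
    """Tests that pass after the change but did not pass (or exist) before."""
    return sorted(t for t, ok in after.items() if ok and not before.get(t, False))


# Illustrative, made-up results from two test runs.
before = {"test_login": True, "test_export": True, "test_import": False}
after = {"test_login": True, "test_export": False, "test_import": True}

regressions = find_regressions(before, after)  # P2P violations
fixed = newly_passing(before, after)           # F2P-style progress
ci_should_fail = bool(regressions)             # gate the merge on regressions only
```

Gating on regressions separately from new passes keeps an agent from trading one broken feature for another while the overall pass count stays flat.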
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Small benchmark scale: 20 tasks after heavy manual curation (each ≈40 hours).
Creation is labor intensive, limiting coverage across domains and languages.
When Not To Use
For short-function or single-file code generation benchmarking.
To measure code style, performance optimization, or microbenchmarks.
Failure Modes
Repetitive loops: agents repeat superficial fixes without root-cause changes.
Environment grounding gaps: misattributing environment failures to code logic.
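One way to catch the repetitive-loop failure mode at run time is to watch for the same action producing the same outcome several times in a short span. The sketch below is only an illustration under stated assumptions: the agent's trajectory is available as a list of `(command, outcome)` pairs, and the `window` and `threshold` values are arbitrary choices, not settings from the benchmark.

```python
from collections import Counter


def repeated_actions(trajectory, window=6, threshold=3):
    """Return (command, outcome) pairs seen at least `threshold` times
    within the last `window` entries of the trajectory.

    The same command yielding the same outcome repeatedly suggests that
    the agent is retrying without addressing the root cause.
    """
    counts = Counter(trajectory[-window:])
    return sorted(pair for pair, n in counts.items() if n >= threshold)


# Illustrative, made-up trajectory: the same test failure recurs after each "fix".
trajectory = [
    ("pytest", "ImportError: foo"),
    ("pip install foo", "ok"),
    ("pytest", "ImportError: foo"),
    ("edit main.py", "ok"),
    ("pytest", "ImportError: foo"),
]

loops = repeated_actions(trajectory)
```

When `loops` is non-empty, a harness could pause the run and ask for human guidance or a revised plan instead of letting the agent spend more steps on the same fix.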

