LongCLI-Bench: a 20-task CLI benchmark showing state-of-the-art agents pass <20% on long-horizon engineering tasks

February 15, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

This benchmark convincingly exposes long-horizon planning and regression risks; results are limited to 20 tasks and manually curated settings but consistently show large gaps in agent capabilities.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 20%

Novelty: 60%

Authors

Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang


Why It Matters For Business

Current agents often fail early on long, realistic engineering tasks; adding human-in-the-loop planning and stepwise checks can raise end-to-end success and reduce regression risk.

Who Should Care

Summary TLDR

LongCLI-Bench is a curated benchmark of 20 long-horizon CLI software-engineering tasks drawn from 958 CS assignments and 50 real workflows. It uses two test sets, Fail→Pass (implement new requirements) and Pass→Pass (avoid regressions), plus step-level scoring to pinpoint where multi-step runs break. State-of-the-art commercial and open-source agents all score below a 20% overall pass rate, and most runs stall early (step score below 30%). Human plan injection and interactive guidance raise pass rates to as much as ~62%, showing that human-in-the-loop workflows materially help current agents.

Problem Statement

Existing coding/CLI benchmarks focus on short, isolated tasks, are often contaminated by GitHub scraping, and give only binary pass/fail signals. That leaves open whether agents can plan and execute long, interdependent workflows in realistic, isolated environments.

Main Contribution

LongCLI-Bench: 20 manually curated, long-horizon CLI tasks sourced from 958 CS assignments and 50 real workflows to reduce GitHub contamination.

Dual-set evaluation (F2P and P2P) and step-level scoring to measure requirement completion and regression avoidance.
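The dual-set idea can be illustrated with a minimal sketch. The names `TaskResult`, `fraction_passed`, and `task_passed` are hypothetical, and the scoring rules here (fraction of tests passed as the score, all tests in both sets required for a pass) are assumptions for illustration, not the benchmark's published formulas.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes after the agent's run (hypothetical structure)."""
    f2p: dict[str, bool]  # Fail->Pass: tests for newly required behavior
    p2p: dict[str, bool]  # Pass->Pass: tests that passed before the run


def fraction_passed(results: dict[str, bool]) -> float:
    """Share of tests that pass; 0.0 when the set is empty."""
    return sum(results.values()) / len(results) if results else 0.0


def task_passed(result: TaskResult) -> bool:
    """Assumed strict rule: every F2P and every P2P test must pass."""
    return (
        bool(result.f2p)
        and all(result.f2p.values())
        and all(result.p2p.values())
    )


if __name__ == "__main__":
    run = TaskResult(
        f2p={"adds_cli_flag": True, "writes_report": False},
        p2p={"existing_parser": True},
    )
    print(fraction_passed(run.f2p), fraction_passed(run.p2p), task_passed(run))
```

Keeping the two sets separate is what lets a partial score distinguish "new feature unfinished" from "existing behavior broken".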

Key Findings

Overall pass rates are very low across agents.

Numbers: Best Pass Rate: 16.7% (Claude-Opus-4.6; Table 2)

Practical Use: Do not assume current agents can finish long, multi-step engineering tasks autonomously; plan for human oversight or hybrid workflows.

Evidence Ref: Table 2

Most runs stall early in the workflow.

Numbers: For most agents, the majority of tasks have an F2P step score below 30% (e.g., DeepSeek-V3.1: 65% of tasks in [0,30); Table 3)

Practical Use: Focus debugging and augmentation on initial planning and early verification steps to gain the biggest returns.

Evidence Ref: Table 3
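To apply the same early-stall measure to your own runs, a minimal sketch follows. The function name `early_stall_share` is hypothetical and the sample scores are made up for illustration; only the 30% threshold comes from the finding above.

```python
def early_stall_share(step_scores: list[float], threshold: float = 30.0) -> float:
    """Fraction of tasks whose F2P step score (0-100) falls in [0, threshold)."""
    if not step_scores:
        return 0.0
    stalled = sum(1 for score in step_scores if 0.0 <= score < threshold)
    return stalled / len(step_scores)


if __name__ == "__main__":
    # Illustrative per-task step scores, not values from the paper.
    scores = [10.0, 25.0, 50.0, 100.0]
    print(f"{early_stall_share(scores):.0%} of tasks stalled early")
```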

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Best overall Pass Rate | 16.7% (Claude-Opus-4.6) | - | - | LongCLI-Bench (20 tasks) | Table 2 shows top Pass = 16.7% | Table 2
Best F2P Step Score (avg) | 50.7% (Claude-Opus-4.6) | - | - | LongCLI-Bench | Table 2 F2P Step Score | Table 2

What To Try In 7 Days

Run a small subset of LongCLI-Bench tasks on your agent to measure step-level scores.

Add explicit initial plan injection for multi-step jobs before agent execution.

Instrument CI with P2P-style regression checks whenever agents modify codebases.
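The plan-injection and regression-check items above can be sketched in a few lines. Both helpers (`inject_plan`, `find_regressions`) and the prompt wording are hypothetical; the sketch assumes your test runner can already produce a mapping from test name to pass/fail before and after the agent's change.

```python
def inject_plan(task: str, plan_steps: list[str]) -> str:
    """Append a human-written, numbered plan to the task prompt."""
    numbered = "\n".join(
        f"{index}. {step}" for index, step in enumerate(plan_steps, start=1)
    )
    return f"{task}\n\nFollow this plan in order:\n{numbered}"


def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Tests that passed before the change but now fail or are missing."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


if __name__ == "__main__":
    prompt = inject_plan(
        "Add CSV export to the CLI.",
        ["Read the existing exporters", "Add the csv subcommand", "Run the tests"],
    )
    print(prompt)

    before = {"test_json_export": True, "test_help": True, "test_flaky": False}
    after = {"test_json_export": False, "test_help": True}
    regressions = find_regressions(before, after)
    # A CI job could exit non-zero here to block the merge.
    print("regressions:", regressions)
```

Treating a missing test as a regression is a deliberate choice in this sketch: it catches the case where an agent deletes or renames a failing test instead of fixing the code.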

Agent Features

Memory
Long-horizon context management
Planning
Static plan injection (human-provided roadmap), Interactive guidance (dynamic human interventions), Self-correction (multi-turn feedback reuse)
Tool Use
Command-line / terminal, Docker environments
Frameworks
OpenHands, Codex, Claude Code
Is Agentic

Yes

Architectures
LLM-driven agents, OpenHands framework, Commercial CLI assistants
Collaboration
Human-in-the-loop, Plan injection, Interactive guidance

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Small benchmark scale: 20 tasks after heavy manual curation (each ≈40 hours).

Creation is labor intensive, limiting coverage across domains and languages.

When Not To Use

For short-function or single-file code generation benchmarking.

To measure code style, performance optimization, or microbenchmarks.

Failure Modes

Repetitive loops: agents repeat superficial fixes without root-cause changes.

Environment grounding gaps: misattributing environment failures to code logic.
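The repetitive-loop failure mode can be flagged with a simple heuristic over an agent's command history. This is an illustrative sketch, not the paper's detection method: the function names and the default of three consecutive repeats are assumptions, and it only catches exact repeats after whitespace normalization.

```python
def longest_repeat_run(commands: list[str]) -> int:
    """Length of the longest run of identical consecutive commands."""
    longest = 0
    current = 0
    previous = None
    for command in commands:
        normalized = " ".join(command.split())  # collapse whitespace
        current = current + 1 if normalized == previous else 1
        longest = max(longest, current)
        previous = normalized
    return longest


def looks_stuck(commands: list[str], max_repeats: int = 3) -> bool:
    """Flag a run that repeats the same command max_repeats times in a row."""
    return longest_repeat_run(commands) >= max_repeats


if __name__ == "__main__":
    history = ["pytest -q", "pytest  -q", "pytest -q", "ls"]
    print(longest_repeat_run(history), looks_stuck(history))
```

A flag like this is a natural trigger for the interactive guidance described earlier: pause the agent and ask a human to check whether the failure is in the code or in the environment.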

Core Entities

Models

GPT-5.1-Codex-Max, GPT-5.2-Codex, GPT-5.3-Codex, Claude-Sonnet-4.5, Claude-Opus-4.5, Claude-Opus-4.6, DeepSeek-V3.1, GLM-4.6, Qwen3-235B-A22B

Metrics

F2P (Fail→Pass), P2P (Pass→Pass), Step-level score, Pass Rate, Pass@3, Execution time (min)

Datasets

CS assignments (958), Real-world workflows (50), LongCLI-Bench (20 tasks)

Benchmarks

LongCLI-Bench, Terminal-Bench, SWE-bench