LongCLI-Bench: a 20-task CLI benchmark showing state-of-the-art agents pass <20% on long-horizon engineering tasks

Overview

Decision Snapshot: Needs Validation

The benchmark exposes weaknesses in long-horizon planning and regression avoidance. Results cover only 20 manually curated tasks, but they consistently show large gaps in agent capabilities.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/5

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 20%

Novelty: 60%

Authors

Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang


Why It Matters For Business

Current agents often fail early on long, realistic engineering tasks; adding human-in-the-loop planning and stepwise checks can raise end-to-end success and reduce regression risk.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO

Summary TLDR

LongCLI-Bench is a curated benchmark of 20 long-horizon CLI software-engineering tasks drawn from 958 CS assignments and 50 real workflows. It uses two test sets, Fail→Pass (F2P: implement new requirements) and Pass→Pass (P2P: avoid regressions), plus step-level scoring to pinpoint where multi-step runs break. State-of-the-art commercial and open-source agents score under 20% overall pass rate, and most runs stall early (<30% completion). Human plan injection and interactive guidance raise pass rates up to ~62%, showing that human-in-the-loop workflows materially help current agents.

Problem Statement

Existing coding/CLI benchmarks focus on short, isolated tasks, are often contaminated by GitHub scraping, and give only binary pass/fail signals. That leaves open whether agents can plan and execute long, interdependent workflows in realistic, isolated environments.

Main Contribution

LongCLI-Bench: 20 manually curated, long-horizon CLI tasks sourced from 958 CS assignments and 50 real workflows to reduce GitHub contamination.

Dual-set evaluation (F2P and P2P) and step-level scoring to measure requirement completion and regression avoidance.
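The dual-set idea can be illustrated with a minimal Python sketch. It is not the benchmark's actual harness: the `CheckResult` record, its `step` field, and both scoring rules (a step counts as complete only when all of its tests pass; a task passes only when every F2P and P2P test passes) are assumptions made for illustration, and the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    name: str      # test identifier
    step: int      # workflow step the test belongs to (hypothetical field)
    passed: bool   # outcome after the agent's run


def step_score(results: List[CheckResult]) -> float:
    """Percentage of steps whose tests all pass (assumed scoring rule)."""
    by_step: Dict[int, List[bool]] = {}
    for r in results:
        by_step.setdefault(r.step, []).append(r.passed)
    if not by_step:
        return 0.0
    completed = sum(1 for outcomes in by_step.values() if all(outcomes))
    return 100.0 * completed / len(by_step)


def task_passed(f2p: List[CheckResult], p2p: List[CheckResult]) -> bool:
    """A task passes only if every F2P and every P2P test passes (assumption)."""
    return all(r.passed for r in f2p) and all(r.passed for r in p2p)


# Made-up results: four F2P steps (two completed) and two P2P steps (both intact).
f2p = [
    CheckResult("new_cli_flag", 1, True),
    CheckResult("config_parser", 2, False),
    CheckResult("export_command", 3, True),
    CheckResult("end_to_end", 4, False),
]
p2p = [
    CheckResult("existing_help_text", 1, True),
    CheckResult("existing_io", 2, True),
]
print(step_score(f2p), step_score(p2p), task_passed(f2p, p2p))
```

Reporting the two step scores separately shows whether a failed run fell short on new requirements, broke existing behavior, or both, which a single binary pass/fail hides.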

Key Findings

Overall pass rates are very low across agents.

Numbers: Best pass rate is 16.7% (Claude-Opus-4.6; Table 2)

Practical Use: Do not assume current agents can finish long, multi-step engineering tasks autonomously; plan for human oversight or hybrid workflows.

Evidence Ref: Table 2

Most runs stall early in the workflow.

Numbers: Most tasks end with an F2P step score below 30% (e.g., DeepSeek-V3.1: 65% of tasks in the [0, 30) bin; Table 3)

Practical Use: Focus debugging and augmentation on initial planning and early verification steps to gain the biggest returns.

Evidence Ref: Table 3
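To check whether your own agent shows the same early-stall pattern, per-task step scores can be grouped into completion bins. The sketch below is an illustration only: the [0, 30) bin comes from the finding above, while the other bin edges and the sample scores are made up.

```python
from typing import Dict, Sequence


def bin_step_scores(
    scores: Sequence[float],
    edges: Sequence[float] = (0, 30, 60, 100),  # only [0, 30) is from the source
) -> Dict[str, float]:
    """Return the percentage of tasks in each bin.

    Bins are half-open [lo, hi), except the last one, which also
    includes its upper edge so a score of 100 is counted.
    """
    shares: Dict[str, float] = {}
    total = len(scores)
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        closed = i == last
        right = "]" if closed else ")"
        count = sum(1 for s in scores if lo <= s < hi or (closed and s == hi))
        shares[f"[{lo},{hi}{right}"] = 100.0 * count / total if total else 0.0
    return shares


# Made-up F2P step scores for ten tasks.
example_scores = [0, 10, 29.9, 30, 75, 100, 5, 20, 15, 45]
print(bin_step_scores(example_scores))
```

A large share in the lowest bin suggests that effort is best spent on planning and early verification rather than on late-stage polish.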

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Best overall Pass Rate	16.7% (Claude-Opus-4.6)	—	—	LongCLI-Bench (20 tasks)	Table 2 shows top Pass = 16.7%	Table 2
Best F2P Step Score (avg)	50.7% (Claude-Opus-4.6 F2P Step Score)	—	—	LongCLI-Bench	Table 2 F2P Step Score	Table 2

What To Try In 7 Days

Run a small subset of LongCLI-Bench tasks on your agent to measure step-level scores.

Add explicit initial plan injection for multi-step jobs before agent execution.

Instrument CI with P2P-style regression checks whenever agents modify codebases.
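A P2P-style regression check in CI can be as simple as comparing test outcomes before and after the agent's change. The following is a minimal sketch under stated assumptions: the function name, the dictionary format (test name mapped to a pass flag), and the rule that a test missing after the change counts as failing are all hypothetical choices, not part of LongCLI-Bench.

```python
from typing import Dict, List, Union


def regression_report(
    before: Dict[str, bool], after: Dict[str, bool]
) -> Dict[str, Union[List[str], bool]]:
    """Compare test outcomes recorded before and after an agent edit.

    Tests that passed before must still pass (P2P). Tests that failed
    before and pass now are reported as newly passing (F2P). A test
    missing from `after` is treated as failing (assumption).
    """
    regressions = sorted(
        name for name, ok in before.items() if ok and not after.get(name, False)
    )
    newly_passing = sorted(
        name for name, ok in before.items() if not ok and after.get(name, False)
    )
    return {
        "regressions": regressions,
        "newly_passing": newly_passing,
        "ok": not regressions,
    }


# Made-up outcomes: one target test fixed, one existing test broken.
before = {"test_parse": True, "test_export": True, "test_new_flag": False}
after = {"test_parse": True, "test_export": False, "test_new_flag": True}
print(regression_report(before, after))
```

A CI job could fail the build whenever `ok` is false, so an agent cannot trade an existing behavior for a new feature unnoticed.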

Agent Features

Memory

Long-horizon context management

Planning

Static plan injection (human-provided roadmap), Interactive guidance (dynamic human interventions), Self-correction (multi-turn feedback reuse)
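Static plan injection can be approximated by prepending a human-written roadmap to the task prompt before the agent starts. The sketch below only shows the idea: the `inject_plan` helper, the prompt wording, and the sample task are hypothetical, and the format the authors used is not described here.

```python
from typing import Sequence


def inject_plan(task: str, plan_steps: Sequence[str]) -> str:
    """Append a numbered, human-provided plan to a task description.

    Returns the task unchanged when no plan is supplied.
    """
    if not plan_steps:
        return task
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(plan_steps, 1))
    return (
        f"{task}\n\n"
        "Follow this plan in order, verifying each step before moving on:\n"
        f"{numbered}"
    )


# Hypothetical task and roadmap.
prompt = inject_plan(
    "Add CSV export to the report CLI.",
    ["Read the existing export code", "Implement the CSV writer", "Run the test suite"],
)
print(prompt)
```

Keeping the plan as a separate list of steps also makes it easy to log which step the agent was on when a run stalled.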

Tool Use

Command-line / terminal, Docker environments

Frameworks

OpenHands, Codex, Claude Code

Is Agentic

Yes

Architectures

LLM-driven agents, OpenHands framework, Commercial CLI assistants

Collaboration

Human-in-the-loop, Plan injection, Interactive guidance

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/finyorko/longcli-bench

Data URLs

https://github.com/finyorko/longcli-bench

Risks & Boundaries

Limitations

Small benchmark scale: 20 tasks after heavy manual curation (each ≈40 hours).

Creation is labor intensive, limiting coverage across domains and languages.

When Not To Use

For short-function or single-file code generation benchmarking.

To measure code style, performance optimization, or microbenchmarks.

Failure Modes

Repetitive loops: agents repeat superficial fixes without root-cause changes.

Environment grounding gaps: misattributing environment failures to code logic.
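The repetitive-loop failure mode can be flagged with a simple heuristic over the agent's command history. This is a sketch of one possible detector, not a method from the paper: the function name, the sliding-window approach, and the default `window` and `threshold` values are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def find_repeated_commands(
    commands: Sequence[str], window: int = 6, threshold: int = 3
) -> List[str]:
    """Return commands that occur at least `threshold` times within any
    run of `window` consecutive commands.

    Whitespace is normalized so trivially different spellings match.
    """
    normalized = [" ".join(c.split()) for c in commands]
    flagged = set()
    # Slide a fixed-size window over the history; short histories get one window.
    for start in range(max(1, len(normalized) - window + 1)):
        counts = Counter(normalized[start:start + window])
        flagged.update(cmd for cmd, n in counts.items() if n >= threshold)
    return sorted(flagged)


# Made-up trajectory: the agent keeps re-running tests after the same edit.
history = [
    "make test",
    "sed -i s/a/b/ x.py",
    "make  test",
    "sed -i s/a/b/ x.py",
    "make test",
    "ls",
]
print(find_repeated_commands(history))
```

A flagged command is only a hint that the agent may be cycling; re-running a test suite several times can also be legitimate, so a human or a follow-up check should confirm before interrupting the run.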

Core Entities

Models

GPT-5.1-Codex-Max, GPT-5.2-Codex, GPT-5.3-Codex, Claude-Sonnet-4.5, Claude-Opus-4.5, Claude-Opus-4.6, DeepSeek-V3.1, GLM-4.6, Qwen3-235B-A22B

Metrics

F2P (Fail→Pass), P2P (Pass→Pass), Step-level score, Pass Rate, Pass@3, Execution time (min)

Datasets

CS assignments (958), Real-world workflows (50), LongCLI-Bench (20 tasks)

Benchmarks

LongCLI-Bench, Terminal-Bench, SWE-bench

