OpenHands: an open, sandboxed platform that lets LLM-based agents write, run, and browse code like software developers

Overview

Decision Snapshot: Ready For Pilot

OpenHands is a practical engineering platform ready for prototyping and evaluation; it reduces integration cost but agents still require model improvements for complex production tasks.

Citations: 7

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Yes

License: MIT

At A Glance

Cost impact: 70%

Production readiness: 65%

Novelty: 60%

Authors

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig

Links

Abstract / PDF / Code

Why It Matters For Business

OpenHands reduces the engineering work to run and compare LLM-driven developer agents by providing a sandboxed runtime, shared skills, and benchmark harness under an MIT license, so teams can prototype agent integrations faster and safely.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Founder

Summary TLDR

OpenHands is an open-source platform and community for building, running, and evaluating LLM-driven agents that interact with the world via code, a shell, and a browser. It provides a Docker sandbox runtime, a small set of executable actions (run Python, run bash, drive a browser), an extensible tool library (AgentSkills), multi-agent delegation, integration tests, and an evaluation harness covering 15 public benchmarks. The repo is MIT-licensed and already hosts many agents and community contributions.

Problem Statement

Developing and evaluating agents that act like software developers is hard: you need safe code execution, browser control, shareable tools, multi-agent coordination, and reproducible benchmarks. OpenHands aims to provide a single, runnable platform that solves these engineering gaps so researchers and engineers can build, test, and compare generalist agents reliably.

Main Contribution

An event-stream agent interface and simple agent abstraction where agents produce actions (python, shell, browser) against a sandbox.

A runtime built on Docker sandboxes with a REST API exposing a bash shell, an IPython server, and a Chromium browser (Playwright + BrowserGym).
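The event-stream interface described above can be pictured with a small sketch. This is not the OpenHands API: the class names `CmdRunAction` and `IPythonRunCellAction` are borrowed from the action names listed on this page, but their fields, the `FinishAction`, `Observation`, `State`, `ScriptedAgent`, and `fake_runtime` definitions are all assumptions made for illustration. The runtime here only echoes what it would have executed; it does not touch a real sandbox.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CmdRunAction:
    """Hypothetical stand-in: ask the sandbox to run a bash command."""
    command: str


@dataclass
class IPythonRunCellAction:
    """Hypothetical stand-in: ask the sandbox to run a Python cell."""
    code: str


@dataclass
class FinishAction:
    """Hypothetical stand-in: the agent declares the task done."""
    message: str


@dataclass
class Observation:
    """Result the runtime reports back after executing an action."""
    content: str


Action = Union[CmdRunAction, IPythonRunCellAction, FinishAction]
Event = Union[Action, Observation]


@dataclass
class State:
    """The event stream: every past action and observation, in order."""
    history: List[Event] = field(default_factory=list)


class ScriptedAgent:
    """Toy agent that replays a fixed plan instead of calling an LLM."""

    def __init__(self, plan: List[Action]):
        self._plan = list(plan)

    def step(self, state: State) -> Action:
        # A real agent would prompt an LLM with state.history here.
        taken = sum(1 for e in state.history if not isinstance(e, Observation))
        if taken < len(self._plan):
            return self._plan[taken]
        return FinishAction(message="done")


def fake_runtime(action: Action) -> Observation:
    """Pretend sandbox: echoes what it would have executed."""
    if isinstance(action, CmdRunAction):
        return Observation(content=f"[bash] {action.command}")
    if isinstance(action, IPythonRunCellAction):
        return Observation(content=f"[ipython] {action.code}")
    raise TypeError(f"unsupported action: {action!r}")


def run(agent: ScriptedAgent, max_steps: int = 10) -> State:
    """Drive the step(state) -> action loop until the agent finishes."""
    state = State()
    for _ in range(max_steps):
        action = agent.step(state)
        state.history.append(action)
        if isinstance(action, FinishAction):
            break
        state.history.append(fake_runtime(action))
    return state
```

Calling `run(ScriptedAgent([CmdRunAction("ls")]))` yields a history of action, observation, then a finish action. The point of the design is that the agent only ever sees and extends one ordered stream, which is what makes swapping agents or runtimes cheap.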

Key Findings

A single generalist agent (CodeAct) performs competitively across software, web, and miscellaneous tasks without benchmark-specific prompt tuning.

Numbers: HumanEvalFix: 79.3% (CodeAct v1.5, gpt-4o); SWE-Bench Lite: 22–26% (CodeAct v1.8)

Practical Use: Use a single CodeAct-style agent to prototype across tasks rather than building distinct task-specific agents; expect good but not SOTA results.

Evidence Ref: Tables 3–4

OpenHands integrates 15 established benchmarks into one evaluation harness.

Numbers: Benchmarks: 15 (software, web, misc.) listed in Table 2

Practical Use: You can run cross-domain comparisons in one pipeline instead of wiring separate benchmarks yourself.

Evidence Ref: Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HumanEvalFix success rate	79.3%	StarCoder2-15B: 48.6%	≈+30.7 pp vs StarCoder2 (on Python subset)	HumanEvalFix (164 instances), 0-shot	Tab. 4 reports CodeActAgent v1.5 gpt-4o 79.3%	Table 4
SWE-Bench Lite resolve rate	22.0–26.0%	Aider: 26.3%	Comparable to open-source specialists	SWE-Bench Lite (300 instances, no hints)	Tab. 4 shows CodeActAgent v1.8 22.0% (gpt-4o) and 26.0% (claude-3-5-sonnet)	Table 4
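The deltas in the table are plain differences in percentage points. As a quick check, the sketch below recomputes them from the values and baselines quoted in the rows above; the helper name `delta_pp` is made up for this example, and the SWE-Bench Lite row is computed for both ends of the reported 22.0–26.0% range.

```python
def delta_pp(value_pct: float, baseline_pct: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


# HumanEvalFix: CodeActAgent v1.5 (gpt-4o) vs StarCoder2-15B
humanevalfix_delta = delta_pp(79.3, 48.6)

# SWE-Bench Lite: CodeActAgent v1.8 vs Aider, low and high end of the range
swe_low_delta = delta_pp(22.0, 26.3)
swe_high_delta = delta_pp(26.0, 26.3)

print(f"HumanEvalFix delta: {humanevalfix_delta:+.1f} pp")
print(f"SWE-Bench Lite delta: {swe_low_delta:+.1f} to {swe_high_delta:+.1f} pp")
```

The first figure matches the table's ≈+30.7 pp. The second row works out to between 4.3 pp and 0.3 pp below the Aider baseline, which is the gap the table summarizes as "comparable".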

What To Try In 7 Days

Clone OpenHands and run the included CodeAct agent against one repository in a Docker image to see end-to-end editing and tests.

Use AgentSkills to wrap one internal utility (e.g., repo search) so the agent can call it safely from the sandbox.

Run an integration test with LLM mocking to validate prompt changes before spending money on full LLM evaluations.
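For the second item, a repo-search utility might look like the sketch below. It assumes that a skill is, at its simplest, a plain Python function the agent can call from its Python action; how such a function is actually registered with AgentSkills is not covered on this page and is not shown here. The function name `search_repo` and its parameters are hypothetical.

```python
import os
from typing import List, Tuple


def search_repo(
    root: str,
    needle: str,
    suffixes: Tuple[str, ...] = (".py",),
    max_hits: int = 20,
) -> List[str]:
    """Return 'relative/path:line: text' for lines that contain `needle`.

    Hypothetical skill: read-only, bounded output, skips hidden directories.
    """
    hits: List[str] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories (e.g. .git) in place and keep order stable.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            rel = os.path.relpath(path, root)
                            hits.append(f"{rel}:{lineno}: {line.strip()}")
                            if len(hits) >= max_hits:
                                return hits
            except OSError:
                # Unreadable file: skip it rather than fail the whole search.
                continue
    return hits
```

Capping the number of hits and returning short strings keeps the result small enough to fit in an agent's context, and keeping the function read-only limits what a mistaken call can do inside the sandbox.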

Agent Features

Memory

Event stream with past actions/observations

Metadata tracking (costs, delegation)

Configurable workspace mounted into sandbox

Planning

Event stream state for multi-turn planning

Step-based step(state) -> action loop

Tool Use

IPythonRunCellAction (run Python)

CmdRunAction (run bash)

BrowserInteractiveAction (browser primitives)

AgentDelegateAction (delegate subtasks)

Frameworks

BrowserGym, Playwright, Docker runtime, Jupyter/IPython, AgentSkills

Is Agentic

Yes

Architectures

CodeAct, GPTSwarm, micro agents, event-stream agent loop

Collaboration

AgentHub for sharing agents

Multi-agent delegation and micro agents

Optimization Features

Infra Optimization

Reuses existing runtime images to reduce build time

System Optimization

Dual-tagged Docker images for reproducibility and caching

LLM mocking in integration tests to save evaluation cost
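The idea behind LLM mocking can be sketched in a few lines. This is a generic replay-style mock under stated assumptions, not the mechanism OpenHands ships: `MockLLM`, `build_prompt`, and `next_command` are hypothetical names, and the prompt template is invented for the example.

```python
from typing import Dict, List


class MockLLM:
    """Hypothetical mock: replays recorded responses keyed by exact prompt."""

    def __init__(self, canned: Dict[str, str]):
        self._canned = dict(canned)
        self.calls: List[str] = []

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        if prompt not in self._canned:
            # An unseen prompt means the prompt text drifted from the recording.
            raise KeyError(f"no recorded response for prompt: {prompt!r}")
        return self._canned[prompt]


def build_prompt(task: str) -> str:
    """Assumed prompt template, for illustration only."""
    return f"You are a coding agent. Task: {task}"


def next_command(llm: MockLLM, task: str) -> str:
    """Ask the (mock) model for the next shell command for a task."""
    return llm.complete(build_prompt(task)).strip()


if __name__ == "__main__":
    llm = MockLLM({build_prompt("list files"): " ls -la \n"})
    print(next_command(llm, "list files"))
```

Because responses are keyed by the exact prompt, any edit to the template makes the lookup fail loudly. That gives a free signal that a prompt changed, without paying for a model call.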

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Yes

License: MIT

Code URLs

https://github.com/All-Hands-AI/OpenHands

Risks & Boundaries

Limitations

Agents still fall short on complex, long-horizon tasks and specialized stateful editing.

File editing on long files is fragile and needs research improvements.

When Not To Use

When you need hardened, real-world autonomous agents without human oversight.

When regulatory constraints forbid running code in containers without formal audits.

Failure Modes

Agents may perform incorrect code edits that pass superficial checks but break behavior.

Browser-driven tasks can fail when pages require complex visual reasoning or authentication.

Core Entities

Models

gpt-4o, gpt-4o-mini, gpt-4o-2024-05-13, gpt-4-1106-preview, gpt-4-turbo, gpt-3.5-turbo, claude-3-5-sonnet

Metrics

Success Rate (%), pass@k, Average Cost ($ per instance)

Datasets

SWE-Bench, HumanEvalFix, BIRD, BioCoder, ML-Bench, Gorilla APIBench, ToolQA, WebArena, MiniWoB++, GAIA, GPQA, AgentBench, MINT, ProofWriter, Entity Deduction Arena

Benchmarks

SWE-Bench, HumanEvalFix, WebArena, MiniWoB++, GAIA, GPQA, AgentBench, MINT, ProofWriter, BioCoder, BIRD, ML-Bench, Gorilla APIBench, ToolQA, Entity Deduction Arena
