OpenHands: an open, sandboxed platform that lets LLM-based agents write, run, and browse code like software developers

July 23, 2024 · 8 min

Overview

Decision Snapshot: Ready For Pilot

OpenHands is a practical engineering platform ready for prototyping and evaluation; it reduces integration cost but agents still require model improvements for complex production tasks.

Citations: 7

Evidence Strength: 0.85

Confidence: 0.90

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/5

Reproducibility

Status: Partial assets available

Open source: Yes

License: MIT

At A Glance

Cost impact: 70%

Production readiness: 65%

Novelty: 60%

Authors

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig


Why It Matters For Business

OpenHands reduces the engineering work needed to run and compare LLM-driven developer agents. It provides a sandboxed runtime, shared skills, and a benchmark harness under an MIT license, so teams can prototype agent integrations faster and more safely.

Who Should Care

Summary TLDR

OpenHands is an open-source platform and community for building, running, and evaluating LLM-driven agents that interact with the world via code, a shell, and a browser. It provides a Docker sandbox runtime, a small set of executable actions (run Python, run bash, drive a browser), an extensible tool library (AgentSkills), multi-agent delegation, integration tests, and an evaluation harness covering 15 public benchmarks. The repo is MIT-licensed and already hosts many agents and community contributions.

Problem Statement

Developing and evaluating agents that act like software developers is hard: you need safe code execution, browser control, shareable tools, multi-agent coordination, and reproducible benchmarks. OpenHands aims to provide a single, runnable platform that solves these engineering gaps so researchers and engineers can build, test, and compare generalist agents reliably.

Main Contribution

An event-stream agent interface and simple agent abstraction where agents produce actions (python, shell, browser) against a sandbox.

A runtime built on Docker sandboxes with a REST API exposing a bash shell, an IPython server, and a Chromium browser (Playwright + BrowserGym).
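
The event-stream interface described above can be illustrated with a minimal sketch. This is not OpenHands' actual code: the dataclasses below are simplified stand-ins that borrow the action names listed later under Tool Use, and `FinishAction`, `ScriptedAgent`, and `fake_runtime` are hypothetical helpers introduced only for this example. The sandbox is replaced by a function that echoes each action instead of executing it.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CmdRunAction:
    """Simplified stand-in for a 'run bash' action."""
    command: str


@dataclass
class IPythonRunCellAction:
    """Simplified stand-in for a 'run Python' action."""
    code: str


@dataclass
class FinishAction:
    """Hypothetical action signalling that the agent is done."""
    message: str = ""


@dataclass
class Observation:
    """Result the runtime reports back after handling an action."""
    content: str


Action = Union[CmdRunAction, IPythonRunCellAction, FinishAction]
Event = Union[CmdRunAction, IPythonRunCellAction, FinishAction, Observation]


@dataclass
class State:
    """Event stream: the ordered history of actions and observations."""
    history: List[Event] = field(default_factory=list)


class ScriptedAgent:
    """Toy agent that replays a fixed plan; a real agent would query an LLM."""

    def __init__(self, plan: List[Action]):
        self._plan = list(plan)

    def step(self, state: State) -> Action:
        # Count the actions already in the stream to pick the next planned one.
        taken = sum(not isinstance(e, Observation) for e in state.history)
        if taken < len(self._plan):
            return self._plan[taken]
        return FinishAction("done")


def fake_runtime(action: Action) -> Observation:
    """Placeholder for the sandbox: echoes the action instead of running it."""
    if isinstance(action, CmdRunAction):
        return Observation(f"[bash] {action.command}")
    if isinstance(action, IPythonRunCellAction):
        return Observation(f"[ipython] {action.code}")
    return Observation("")


def run(agent: ScriptedAgent, max_steps: int = 10) -> State:
    """Drive the step(state) -> action loop until the agent finishes."""
    state = State()
    for _ in range(max_steps):
        action = agent.step(state)
        state.history.append(action)
        if isinstance(action, FinishAction):
            break
        state.history.append(fake_runtime(action))
    return state
```

Calling `run(ScriptedAgent([CmdRunAction("ls")]))` yields a stream of one action, one observation, and a final `FinishAction`. Swapping `fake_runtime` for a real sandbox client would leave the loop itself unchanged, which is the main appeal of keeping the agent abstraction this small.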

Key Findings

A single generalist agent (CodeAct) performs competitively across software, web, and miscellaneous tasks without benchmark-specific prompt tuning.

Numbers: HumanEvalFix: 79.3% (CodeAct v1.5, gpt-4o); SWE-Bench Lite: 22–26% (CodeAct v1.8)

Practical Use: Use a single CodeAct-style agent to prototype across tasks rather than building distinct task-specific agents; expect good but not SOTA results.

Evidence Ref: Tables 3–4

OpenHands integrates 15 established benchmarks into one evaluation harness.

Numbers: Benchmarks: 15 (software, web, misc.) listed in Table 2

Practical Use: You can run cross-domain comparisons in one pipeline instead of wiring separate benchmarks yourself.

Evidence Ref: Table 2

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| HumanEvalFix success rate | 79.3% | StarCoder2-15B: 48.6% | ≈+30.7 pp vs StarCoder2 (on Python subset) | HumanEvalFix (164 instances), 0-shot | Tab. 4 reports CodeActAgent v1.5 gpt-4o 79.3% | Table 4 |
| SWE-Bench Lite resolve rate | 22.0–26.0% | Aider: 26.3% | Comparable to open-source specialists | SWE-Bench Lite (300 instances, no hints) | Tab. 4 shows CodeActAgent v1.8 22.0% (gpt-4o) and 26.0% (claude-3-5-sonnet) | Table 4 |
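
The deltas in the results above are simple percentage-point differences between the reported value and the baseline. A small sketch that reproduces the arithmetic from the reported figures (the helper name `delta_pp` is introduced here for illustration only):

```python
def delta_pp(value: float, baseline: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(value - baseline, 1)


# HumanEvalFix: CodeActAgent v1.5 (gpt-4o) vs. StarCoder2-15B
humanevalfix = delta_pp(79.3, 48.6)

# SWE-Bench Lite: CodeActAgent v1.8 vs. Aider, for each reported model
swe_gpt4o = delta_pp(22.0, 26.3)
swe_claude = delta_pp(26.0, 26.3)

print(humanevalfix, swe_gpt4o, swe_claude)  # prints: 30.7 -4.3 -0.3
```

The SWE-Bench Lite gap to Aider therefore ranges from 0.3 to 4.3 percentage points depending on the model, which is consistent with the "comparable" label.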

What To Try In 7 Days

Clone OpenHands and run the included CodeAct agent against one repository in a Docker image to see end-to-end editing and test runs.

Use AgentSkills to wrap one internal utility (e.g., repo search) so the agent can call it safely from the sandbox.

Run an integration test with LLM mocking to validate prompt changes before spending money on full LLM evaluations.
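
For the repo-search suggestion above, the utility itself can be an ordinary Python function. The sketch below is an assumption about what such a helper might look like; `search_repo` and its parameters are hypothetical, and it does not show how AgentSkills registers or exposes functions, which should be checked against the repository.

```python
from pathlib import Path
from typing import List, Tuple


def search_repo(
    root: str,
    query: str,
    suffixes: Tuple[str, ...] = (".py",),
    max_hits: int = 20,
) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for each match.

    Hypothetical helper: a plain substring search, capped at max_hits so
    the result stays small enough to place in an agent's context.
    """
    hits: List[Tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if query in line:
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Capping the number of hits and returning compact tuples is a deliberate choice here: the output of a tool call ends up in the agent's event stream, so unbounded results would inflate prompt size and cost.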

Agent Features

Memory
Event stream with past actions/observations; Metadata tracking (costs, delegation); Configurable workspace mounted into sandbox
Planning
Event stream state for multi-turn planning; Step-based step(state) -> action loop
Tool Use
IPythonRunCellAction (run Python); CmdRunAction (run bash); BrowserInteractiveAction (browser primitives); AgentDelegateAction (delegate subtasks)
Frameworks
BrowserGym; Playwright; Docker runtime; Jupyter/IPython; AgentSkills
Is Agentic

Yes

Architectures
CodeAct; GPTSwarm; micro agents; event-stream agent loop
Collaboration
AgentHub for sharing agents; Multi-agent delegation and micro agents

Optimization Features

Infra Optimization
Reuses existing runtime images to reduce build time
System Optimization
Dual-tagged Docker images for reproducibility and caching; LLM mocking in integration tests to save evaluation cost
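
The idea behind LLM mocking is to replace the model with canned responses so that prompt-building and response-parsing code can be tested without paid API calls. The sketch below illustrates the general technique only; `MockLLM`, `build_prompt`, and `next_command` are hypothetical names and do not reflect how the OpenHands integration tests are implemented.

```python
from typing import List


class MockLLM:
    """Hypothetical stand-in for an LLM client that replays canned replies."""

    def __init__(self, canned: List[str]):
        self._canned = list(canned)
        self.prompts: List[str] = []  # recorded so tests can inspect them

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if not self._canned:
            raise RuntimeError("MockLLM ran out of canned responses")
        return self._canned.pop(0)


def build_prompt(task: str) -> str:
    """Toy prompt template standing in for the code under test."""
    return f"You are a coding agent. Task: {task}\nReply with one shell command."


def next_command(llm: MockLLM, task: str) -> str:
    """Ask the (mock) model for a command and normalise the reply."""
    return llm.complete(build_prompt(task)).strip()


llm = MockLLM(["  ls -la\n"])
command = next_command(llm, "list files")
print(command)  # prints: ls -la
```

Because the mock records every prompt it receives, a test can also assert on `llm.prompts` to catch unintended changes to the prompt template before running a full evaluation.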

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Yes
License: MIT

Risks & Boundaries

Limitations

Agents still fall short on complex, long-horizon tasks and specialized stateful editing.

File editing on long files is fragile and needs research improvements.

When Not To Use

When you need hardened, real-world autonomous agents without human oversight.

When regulatory constraints forbid running code in containers without formal audits.

Failure Modes

Agents may perform incorrect code edits that pass superficial checks but break behavior.

Browser-driven tasks can fail when pages require complex visual reasoning or authentication.

Core Entities

Models

gpt-4o, gpt-4o-mini, gpt-4o-2024-05-13, gpt-4-1106-preview, gpt-4-turbo, gpt-3.5-turbo, claude-3-5-sonnet

Metrics

Success Rate (%), pass@k, Average Cost ($ per instance)

Datasets

SWE-Bench, HumanEvalFix, BIRD, BioCoder, ML-Bench, Gorilla APIBench, ToolQA, WebArena, MiniWoB++, GAIA, GPQA, AgentBench, MINT, ProofWriter, Entity Deduction Arena

Benchmarks

SWE-Bench, HumanEvalFix, WebArena, MiniWoB++, GAIA, GPQA, AgentBench, MINT, ProofWriter, BioCoder, BIRD, ML-Bench, Gorilla APIBench, ToolQA, Entity Deduction Arena