OpenHands: an open, sandboxed platform that lets LLM-based agents write, run, and browse code like software developers

July 23, 2024 · 8 min

Overview

Production Readiness

0.65

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

7

Authors

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig

Why It Matters For Business

OpenHands reduces the engineering work needed to run and compare LLM-driven developer agents by providing a sandboxed runtime, shared skills, and a benchmark harness under an MIT license, so teams can prototype agent integrations faster and more safely.

Summary TLDR

OpenHands is an open-source platform and community for building, running, and evaluating LLM-driven agents that interact with the world via code, a shell, and a browser. It provides a Docker sandbox runtime, a small set of executable actions (run Python, run bash, drive a browser), an extensible tool library (AgentSkills), multi-agent delegation, integration tests, and an evaluation harness covering 15 public benchmarks. The repo is MIT-licensed and already hosts many agents and community contributions.

Problem Statement

Developing and evaluating agents that act like software developers is hard: you need safe code execution, browser control, shareable tools, multi-agent coordination, and reproducible benchmarks. OpenHands aims to provide a single, runnable platform that closes these engineering gaps so researchers and engineers can build, test, and compare generalist agents reliably.

Main Contribution

An event-stream agent interface and simple agent abstraction where agents produce actions (python, shell, browser) against a sandbox.

A runtime built on Docker sandboxes, with a REST API exposing a bash shell, an IPython server, and a Chromium browser (Playwright + BrowserGym).

AgentSkills: an extensible, sharable Python tool library for common developer actions (file edits, parsing PDFs/images, etc.).

Multi-agent delegation via AgentDelegateAction and an AgentHub of community-contributed agents (CodeAct, BrowsingAgent, GPTSwarm, micro agents).

A built-in evaluation suite integrating 15 public benchmarks and an integration-test framework that mocks LLMs for deterministic tests.

A permissive, production-ready codebase (MIT) with an active community and documented reproducibility workflows.

Key Findings

A single generalist agent (CodeAct) performs competitively across software, web, and miscellaneous tasks without benchmark-specific prompt tuning.

Numbers: HumanEvalFix: 79.3% (CodeAct v1.5, gpt-4o); SWE-Bench Lite: 22–26% (CodeAct v1.8)

OpenHands integrates 15 established benchmarks into one evaluation harness.

Numbers: 15 benchmarks (software, web, misc.), listed in Table 2

Sandboxed execution and a REST runtime enable safe, repeatable agent runs on arbitrary Docker images.

Numbers: Supports arbitrary Docker images with a dual-tag build and reuse policy

Community traction and open governance: fast feature growth and many contributors.

Numbers: ≈32K GitHub stars; >2.1K contributions from 188+ contributors

Integration tests reduce nondeterminism and CI cost by mocking LLM outputs for regression checks.

Numbers: LLM mocking for deterministic test responses; prompt-response regeneration supported
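The LLM-mocking idea in the last finding can be sketched as a lookup from recorded prompts to recorded responses. This is a minimal illustration, not the OpenHands test framework: the `MockLLM` class, its `complete` method, and the fixture format are hypothetical names chosen for this example.

```python
import hashlib


class MockLLM:
    """Hypothetical stand-in for an LLM client that replays recorded responses."""

    def __init__(self, recorded):
        # recorded: dict mapping a full prompt string to the response to replay.
        self._recorded = {self._key(p): r for p, r in recorded.items()}
        self.calls = 0

    @staticmethod
    def _key(prompt):
        # Hash the prompt so fixtures can be stored under short, stable keys.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def complete(self, prompt):
        self.calls += 1
        key = self._key(prompt)
        if key not in self._recorded:
            # A changed prompt has no fixture: fail loudly so the
            # prompt-response pairs get regenerated on purpose.
            raise KeyError("no recorded response for this prompt; regenerate fixtures")
        return self._recorded[key]
```

Because every response is fixed, a regression test built this way is deterministic and costs nothing per run; the trade-off is that any prompt change requires regenerating the recorded pairs.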

Results

HumanEvalFix success rate

Value: 79.3%

Baseline: StarCoder2-15B: 48.6%

SWE-Bench Lite resolve rate

Value: 22.0–26.0%

Baseline: Aider: 26.3%

WebArena success rate (browser tasks)

Value: 14–15.5%

Baseline: WebArena agent (gpt-4-turbo): 14.4%

Accuracy

Value: 53.1%

Baseline: GPT-4: 38.8%

Community contributions

Value: 2.1K+ PRs, 188+ contributors

Who Should Care

What To Try In 7 Days

Clone OpenHands and run the included CodeAct agent against one repository in a Docker image to see end-to-end editing and tests.

Use AgentSkills to wrap one internal utility (e.g., repo search) so the agent can call it safely from the sandbox.

Run an integration test with LLM mocking to validate prompt changes before spending money on full LLM evaluations.
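For the AgentSkills suggestion above, a skill can be thought of as an ordinary Python function with a clear docstring and a text result the agent can read. The sketch below assumes that shape; `search_repo` and its parameters are hypothetical, and how a function is actually registered with AgentSkills is not covered by this summary.

```python
from pathlib import Path


def search_repo(root: str, needle: str, max_hits: int = 20) -> str:
    """Search Python files under `root` for lines containing `needle`.

    Returns one "relative/path:line: text" entry per hit, newline-separated,
    so the result can be shown to an agent as plain text.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        except OSError:
            continue  # unreadable entry: skip rather than abort the search
        for number, line in enumerate(lines, start=1):
            if needle in line:
                hits.append(f"{path.relative_to(root)}:{number}: {line.strip()}")
                if len(hits) >= max_hits:
                    # Cap the output so a broad query cannot flood the context.
                    return "\n".join(hits)
    return "\n".join(hits) if hits else f"No matches for {needle!r}"
```

Capping the number of hits and returning a short message on no match keeps the observation small and unambiguous, which matters more for an LLM caller than for a human one.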

Agent Features

Memory

  • Event stream with past actions/observations
  • Metadata tracking (costs, delegation)
  • Configurable workspace mounted into sandbox

Planning

  • Event stream state for multi-turn planning
  • Agent loop built on step(state) -> action
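The step(state) -> action loop over an event stream can be sketched as follows. This is a toy under stated assumptions: the dataclasses reuse action names from this summary but are stand-ins, not the real OpenHands classes, and `ScriptedAgent`, `fake_runtime`, `FinishAction`, and `run` are hypothetical helpers that replace the LLM and the sandbox.

```python
from dataclasses import dataclass, field
from typing import List


# Stand-in action and observation types (illustrative only).
@dataclass
class CmdRunAction:
    command: str


@dataclass
class IPythonRunCellAction:
    code: str


@dataclass
class FinishAction:
    message: str


@dataclass
class Observation:
    content: str


@dataclass
class State:
    # Event stream: actions and observations in the order they occurred.
    history: List[object] = field(default_factory=list)


class ScriptedAgent:
    """Toy agent that replays a fixed plan instead of querying an LLM."""

    def __init__(self, plan):
        self._plan = list(plan)

    def step(self, state):
        # Decide the next action from the event stream alone.
        taken = sum(1 for e in state.history if not isinstance(e, Observation))
        if taken < len(self._plan):
            return self._plan[taken]
        return FinishAction("done")


def fake_runtime(action):
    """Stand-in for the sandbox: describes the action instead of executing it."""
    if isinstance(action, CmdRunAction):
        return Observation(f"[bash] {action.command}")
    return Observation(f"[ipython] {action.code}")


def run(agent, max_steps=10):
    state = State()
    for _ in range(max_steps):
        action = agent.step(state)
        state.history.append(action)
        if isinstance(action, FinishAction):
            break
        state.history.append(fake_runtime(action))
    return state
```

Keeping all context in the event stream means the agent itself stays stateless between steps, which is what makes runs easy to replay and inspect.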

Tool Use

  • IPythonRunCellAction (run Python)
  • CmdRunAction (run bash)
  • BrowserInteractiveAction (browser primitives)
  • AgentDelegateAction (delegate subtasks)

Frameworks

  • BrowserGym
  • Playwright
  • Docker runtime
  • Jupyter/IPython
  • AgentSkills

Is Agentic

true

Architectures

  • CodeAct
  • GPTSwarm
  • micro agents
  • event-stream agent loop

Collaboration

  • AgentHub for sharing agents
  • Multi-agent delegation and micro agents
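Delegation can be pictured as one agent emitting an action that names another agent and a subtask, with a controller routing the request and returning the result as an observation. The sketch below is an assumption about that flow, not the platform's implementation: the dataclass fields, the `REGISTRY` dict, and `handle_delegation` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AgentDelegateAction:
    # Stand-in for the real action: which agent to call and what to ask it.
    agent: str
    task: str


# Hypothetical registry: agent name -> callable that takes a task and returns text.
REGISTRY: Dict[str, Callable[[str], str]] = {
    "BrowsingAgent": lambda task: f"browsed: {task}",
    "CodeAct": lambda task: f"coded: {task}",
}


def handle_delegation(action, registry=REGISTRY):
    """Route a delegated subtask and return the sub-agent's result as text."""
    if action.agent not in registry:
        return f"error: unknown agent {action.agent!r}"
    return registry[action.agent](action.task)
```

For example, `handle_delegation(AgentDelegateAction("BrowsingAgent", "find the docs"))` hands the subtask to the browsing stand-in, and the parent agent would see the returned string as its next observation.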

Optimization Features

Infra Optimization

  • Reuses existing runtime images to reduce build time

System Optimization

  • Dual-tagged Docker images for reproducibility and caching
  • LLM mocking in integration tests to save evaluation cost
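One way to read the dual-tag policy: each runtime image carries a content-hash tag (exact match means safe reuse) and a generic tag (close enough to rebuild from cheaply). The sketch below assumes that interpretation; the tag formats, `image_tags`, and `plan_build` are hypothetical and no Docker calls are made.

```python
import hashlib


def image_tags(base_image, source_files):
    """Return (hash_tag, generic_tag) for a runtime image.

    source_files maps a file path to its bytes. The hash tag changes whenever
    the base image name or any file content changes; the generic tag does not.
    """
    digest = hashlib.sha256()
    digest.update(base_image.encode("utf-8"))
    for path in sorted(source_files):  # sorted for a stable hash
        digest.update(path.encode("utf-8"))
        digest.update(source_files[path])
    safe = base_image.replace("/", "_").replace(":", "_")
    return f"{safe}_{digest.hexdigest()[:12]}", f"{safe}_latest"


def plan_build(hash_tag, generic_tag, existing_tags):
    """Pick the cheapest build strategy given the tags already available."""
    if hash_tag in existing_tags:
        return "reuse"  # identical inputs: no build needed
    if generic_tag in existing_tags:
        return "rebuild-from-generic"  # start from the cached image
    return "build-from-scratch"
```

The hash tag gives reproducibility (same inputs, same image), while the generic tag gives caching (a changed input still starts from a warm image).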

Reproducibility

License

  • MIT

Code Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Agents still fall short on complex, long-horizon tasks and specialized stateful editing.
  • File editing on long files is fragile and needs research improvements.
  • Multimodal support is currently limited; extensions are planned.

When Not To Use

  • When you need hardened, real-world autonomous agents without human oversight.
  • When regulatory constraints forbid running code in containers without formal audits.
  • When you require specialized, heavily-trained agents that exceed zero-shot generalists.

Failure Modes

  • Agents may perform incorrect code edits that pass superficial checks but break behavior.
  • Browser-driven tasks can fail when pages require complex visual reasoning or authentication.
  • LLM nondeterminism can cause flaky end-to-end behavior without careful testing/mocking.

Core Entities

Models

  • gpt-4o
  • gpt-4o-mini
  • gpt-4o-2024-05-13
  • gpt-4-1106-preview
  • gpt-4-turbo
  • gpt-3.5-turbo
  • claude-3-5-sonnet

Metrics

  • Success Rate (%)
  • pass@k
  • Average Cost ($ per instance)
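For readers unfamiliar with pass@k, the commonly used unbiased estimator draws n samples per problem, counts the c that pass, and computes 1 - C(n-c, k) / C(n, k). The sketch below implements that standard formula; this summary does not say which estimator the evaluation harness itself uses, and the function name is chosen for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a pass.
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to the plain success rate c / n, which is why single-attempt benchmarks simply report a success percentage.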

Datasets

  • SWE-Bench
  • HumanEvalFix
  • BIRD
  • BioCoder
  • ML-Bench
  • Gorilla APIBench
  • ToolQA
  • WebArena
  • MiniWoB++
  • GAIA
  • GPQA
  • AgentBench
  • MINT
  • ProofWriter
  • Entity Deduction Arena

Benchmarks

  • SWE-Bench
  • HumanEvalFix
  • WebArena
  • MiniWoB++
  • GAIA
  • GPQA
  • AgentBench
  • MINT
  • ProofWriter
  • BioCoder
  • BIRD
  • ML-Bench
  • Gorilla APIBench
  • ToolQA
  • Entity Deduction Arena