Overview
Production Readiness
0.65
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
7
Why It Matters For Business
OpenHands reduces the engineering work needed to run and compare LLM-driven developer agents. It provides a sandboxed runtime, shared skills, and a benchmark harness under an MIT license, so teams can prototype agent integrations faster and more safely.
Summary TLDR
OpenHands is an open-source platform and community for building, running, and evaluating LLM-driven agents that interact with the world via code, a shell, and a browser. It provides a Docker sandbox runtime, a small set of executable actions (run Python, run bash, drive a browser), an extensible tool library (AgentSkills), multi-agent delegation, integration tests, and an evaluation harness covering 15 public benchmarks. The repo is MIT-licensed and already hosts many agents and community contributions.
Problem Statement
Developing and evaluating agents that act like software developers is hard: you need safe code execution, browser control, shareable tools, multi-agent coordination, and reproducible benchmarks. OpenHands aims to provide a single, runnable platform that solves these engineering gaps so researchers and engineers can build, test, and compare generalist agents reliably.
Main Contribution
An event-stream agent interface and a simple agent abstraction in which agents produce actions (Python, shell, browser) that run against a sandbox.
A runtime built on Docker sandboxes with a REST API exposing a bash shell, an IPython server, and a Chromium browser (Playwright + BrowserGym).
AgentSkills: an extensible, sharable Python tool library for common developer actions (file edits, parsing PDFs/images, etc.).
Multi-agent delegation via AgentDelegateAction and an AgentHub of community-contributed agents (CodeAct, BrowsingAgent, GPTSwarm, micro agents).
A built-in evaluation suite integrating 15 public benchmarks and an integration-test framework that mocks LLMs for deterministic tests.
A permissively licensed (MIT), production-ready codebase with an active community and documented reproducibility workflows.
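The event-stream loop described above can be sketched in a few lines. This is a minimal illustration, not the OpenHands implementation: `State`, `Observation`, `EchoAgent`, `fake_runtime`, and `run` are hypothetical names, the `CmdRunAction` and `FinishAction` classes here are local stand-ins rather than imports from the project, and the runtime only simulates `echo` instead of executing anything in a sandbox.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class CmdRunAction:
    """Local stand-in for a 'run bash' action (not the real OpenHands class)."""
    command: str


@dataclass
class FinishAction:
    """Local stand-in for an action that ends the episode."""
    message: str


@dataclass
class Observation:
    """What the runtime reports back after executing an action."""
    content: str


Event = Union[CmdRunAction, FinishAction, Observation]


@dataclass
class State:
    """The event stream: every past action and observation, in order."""
    history: List[Event] = field(default_factory=list)


class EchoAgent:
    """Toy agent: issues one command, then finishes once it sees the output."""

    def step(self, state: State) -> Union[CmdRunAction, FinishAction]:
        observations = [e for e in state.history if isinstance(e, Observation)]
        if not observations:
            return CmdRunAction(command="echo hello")
        return FinishAction(message=f"done: {observations[-1].content}")


def fake_runtime(action: CmdRunAction) -> Observation:
    # Assumption: simulates the sandbox. Nothing is actually executed.
    if action.command.startswith("echo "):
        return Observation(content=action.command[len("echo "):])
    return Observation(content="unsupported command")


def run(agent: EchoAgent, max_steps: int = 10) -> State:
    """Drive the step(state) -> action loop until the agent finishes."""
    state = State()
    for _ in range(max_steps):
        action = agent.step(state)
        state.history.append(action)
        if isinstance(action, FinishAction):
            break
        state.history.append(fake_runtime(action))
    return state
```

Calling `run(EchoAgent())` yields a three-event history: the command, its observation, and the finishing action. Keeping the whole history in `State` is what lets an agent plan over multiple turns without holding hidden state of its own.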
Key Findings
A single generalist agent (CodeAct) performs competitively across software, web, and miscellaneous tasks without benchmark-specific prompt tuning.
OpenHands integrates 15 established benchmarks into one evaluation harness.
Sandboxed execution and a REST runtime enable safe, repeatable agent runs on arbitrary Docker images.
Community traction and open governance: fast feature growth and many contributors.
Integration tests reduce nondeterminism and CI cost by mocking LLM outputs for regression checks.
Results
HumanEvalFix success rate
SWE-Bench Lite resolve rate
WebArena success rate (browser tasks)
Accuracy
Community contributions
Who Should Care
What To Try In 7 Days
Clone OpenHands and run the included CodeAct agent against one repository in a Docker image to see end-to-end editing and tests.
Use AgentSkills to wrap one internal utility (e.g., repo search) so the agent can call it safely from the sandbox.
Run an integration test with LLM mocking to validate prompt changes before spending money on full LLM evaluations.
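The LLM-mocking step can be prototyped before touching the project's own test framework. The sketch below is an assumption-laden illustration: `MockLLM`, `build_prompt`, and `propose_command` are hypothetical helpers, not OpenHands APIs, and the prompt template is invented for the example. The idea it shows is replaying canned completions so a prompt change can be checked deterministically and at no model cost.

```python
from typing import Callable, List


class MockLLM:
    """Replays canned completions in order instead of calling a real model."""

    def __init__(self, canned: List[str]):
        self._canned = list(canned)
        self.prompts: List[str] = []  # every prompt the code under test sent

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if not self._canned:
            raise RuntimeError("MockLLM ran out of canned responses")
        return self._canned.pop(0)


def build_prompt(task: str) -> str:
    # Hypothetical prompt template under test.
    return f"You are a coding agent. Task: {task}\nReply with one shell command."


def propose_command(llm: Callable[[str], str], task: str) -> str:
    """Ask the (real or mocked) LLM for a command and normalize whitespace."""
    return llm(build_prompt(task)).strip()


def test_prompt_regression() -> None:
    llm = MockLLM(["ls -la\n"])
    cmd = propose_command(llm, "list files")
    # The parsed command is deterministic because the completion is canned.
    assert cmd == "ls -la"
    # The recorded prompt lets the test catch accidental template changes.
    assert "Task: list files" in llm.prompts[0]
```

Recording the prompts as well as replaying responses means the same test guards both directions: what the agent sends and how it parses what comes back.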
Agent Features
Memory
- Event stream with past actions/observations
- Metadata tracking (costs, delegation)
- Configurable workspace mounted into sandbox
Planning
- Event stream state for multi-turn planning
- Step-based loop: step(state) -> action
Tool Use
- IPythonRunCellAction (run Python)
- CmdRunAction (run bash)
- BrowserInteractiveAction (browser primitives)
- AgentDelegateAction (delegate subtasks)
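Delegation can be pictured as one agent returning an action that names another agent. The following is a simplified sketch, not the project's code: the `AgentDelegateAction` class here is a local stand-in whose `agent` and `task` fields are assumptions, and `ManagerAgent`, `UpperCaseAgent`, `run_task`, and the registry are hypothetical. It also returns the delegate's result directly, whereas a real system would more likely hand it back to the parent as an observation.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class AgentDelegateAction:
    """Local stand-in: ask another registered agent to handle a subtask."""
    agent: str  # assumed field: name of the agent to delegate to
    task: str   # assumed field: description of the subtask


@dataclass
class FinishAction:
    """Local stand-in: the agent is done and reports a result."""
    result: str


Action = Union[AgentDelegateAction, FinishAction]


class UpperCaseAgent:
    """Toy specialist: 'solves' a task by upper-casing it."""

    def act(self, task: str) -> Action:
        return FinishAction(result=task.upper())


class ManagerAgent:
    """Toy generalist: hands every task to the specialist."""

    def act(self, task: str) -> Action:
        return AgentDelegateAction(agent="upper", task=task)


def run_task(name: str, task: str, registry: Dict[str, object],
             max_depth: int = 3) -> str:
    """Run one agent; follow delegate actions up to max_depth hops."""
    if max_depth < 0:
        raise RecursionError("delegation chain too deep")
    action = registry[name].act(task)
    if isinstance(action, AgentDelegateAction):
        return run_task(action.agent, action.task, registry, max_depth - 1)
    return action.result


registry = {"manager": ManagerAgent(), "upper": UpperCaseAgent()}
answer = run_task("manager", "fix bug", registry)
```

The depth limit is a design choice of this sketch: without it, two agents that delegate to each other would loop forever.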
Frameworks
- BrowserGym
- Playwright
- Docker runtime
- Jupyter/IPython
- AgentSkills
Is Agentic
true
Architectures
- CodeAct
- GPTSwarm
- micro agents
- event-stream agent loop
Collaboration
- AgentHub for sharing agents
- Multi-agent delegation and micro agents
Optimization Features
Infra Optimization
- Reuses existing runtime images to reduce build time
System Optimization
- Dual-tagged Docker images for reproducibility and caching
- LLM mocking in integration tests to save evaluation cost
Reproducibility
License
- MIT
Code Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Agents still fall short on complex, long-horizon tasks and specialized stateful editing.
- File editing on long files is fragile and needs research improvements.
- Multimodal support is currently limited; extensions are planned.
When Not To Use
- When you need hardened, real-world autonomous agents without human oversight.
- When regulatory constraints forbid running code in containers without formal audits.
- When you require specialized, heavily-trained agents that exceed zero-shot generalists.
Failure Modes
- Agents may perform incorrect code edits that pass superficial checks but break behavior.
- Browser-driven tasks can fail when pages require complex visual reasoning or authentication.
- LLM nondeterminism can cause flaky end-to-end behavior without careful testing/mocking.
Core Entities
Models
- gpt-4o
- gpt-4o-mini
- gpt-4o-2024-05-13
- gpt-4-1106-preview
- gpt-4-turbo
- gpt-3.5-turbo
- claude-3-5-sonnet
Metrics
- Success Rate (%)
- pass@k
- Average Cost ($ per instance)
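The three metrics above are simple to compute from per-instance results. The sketch below is illustrative only: the function names are hypothetical, and the pass@k formula is the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples with c correct, which is an assumption since the summary does not say which estimator the harness uses.

```python
from math import comb
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of instances the agent solved."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples is correct,
    given n samples of which c were correct (assumed estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_cost(costs_usd: Sequence[float]) -> float:
    """Mean dollar cost per evaluated instance."""
    if not costs_usd:
        raise ValueError("no costs to average")
    return sum(costs_usd) / len(costs_usd)
```

For example, with 10 samples of which 3 are correct, `pass_at_k(10, 3, 1)` reduces to the plain success fraction 3/10, which is a quick sanity check on the estimator.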
Datasets
- SWE-Bench
- HumanEvalFix
- BIRD
- BioCoder
- ML-Bench
- Gorilla APIBench
- ToolQA
- WebArena
- MiniWoB++
- GAIA
- GPQA
- AgentBench
- MINT
- ProofWriter
- Entity Deduction Arena
Benchmarks
- SWE-Bench
- HumanEvalFix
- WebArena
- MiniWoB++
- GAIA
- GPQA
- AgentBench
- MINT
- ProofWriter
- BioCoder
- BIRD
- ML-Bench
- Gorilla APIBench
- ToolQA
- Entity Deduction Arena

