AgentLite: tiny open-source toolkit to rapidly prototype task-oriented and multi-agent LLM systems

February 23, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.5

Citation Count

10

Authors

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, Silvio Savarese

Why It Matters For Business

AgentLite reduces code overhead for prototyping LLM agents so engineering teams can test agent ideas quickly without a heavy framework or large code refactor.

Summary TLDR

AgentLite is a compact, open-source Python library (under 1,000 core lines) for building task-oriented LLM agents and hierarchical multi-agent systems. It provides four modular components (PromptGen, Actions, LLM wrapper, Memory), a ManagerAgent for task decomposition and orchestration, and hooks for adding new reasoning actions (Think, Plan, Reflect) or swapping LLM backends. The authors run agent-style benchmarks (HotPotQA, WebShop) to show that AgentLite supports standard agent experiments, and they ship demo apps (image Q&A, painter, chess, philosopher chat). AgentLite is a tooling contribution: it speeds up prototyping and experimentation, but it is not a new model or training method.

Problem Statement

Existing agent frameworks are large, rigid, or hard to refactor for new reasoning strategies and agent architectures. Researchers need a small, modular codebase to iterate on new agent designs, plug in custom reasoning actions, and assemble hierarchical multi-agent systems quickly.

Main Contribution

Released AgentLite: a compact, research-oriented agent library with 959 core lines of code.

Defined a task-oriented agent API with four modules: PromptGen, Actions, LLM wrapper, Memory.

Provided ManagerAgent for hierarchical task decomposition and multi-agent orchestration.

Made it easy to add new reasoning types (e.g., Think/Plan/Reflect) and multiple LLM backends per agent.

Demonstrated reproducible experiments on HotPotQA and WebShop and several demo apps.
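How the four modules fit together can be illustrated with a minimal sketch. It does not use AgentLite's actual classes or signatures: `Memory`, `prompt_gen`, `ScriptedLLM`, and `run_agent` are hypothetical names chosen for illustration, and the LLM wrapper is replaced by a stub that replays canned outputs so the loop runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Hypothetical memory module: stores the action-observation chain."""
    steps: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, action: str, observation: str) -> None:
        self.steps.append((action, observation))


def prompt_gen(task: str, memory: Memory, action_names: List[str]) -> str:
    """Hypothetical prompt generator: task + available actions + history."""
    history = "\n".join(
        f"Action: {a}\nObservation: {o}" for a, o in memory.steps
    )
    return f"Task: {task}\nActions: {', '.join(action_names)}\n{history}\nAction:"


class ScriptedLLM:
    """Stand-in for an LLM wrapper; replays canned outputs, calls no API."""

    def __init__(self, outputs: List[str]):
        self._outputs = iter(outputs)

    def run(self, prompt: str) -> str:
        return next(self._outputs)


def run_agent(
    task: str,
    llm: ScriptedLLM,
    actions: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Loop: build prompt, ask the LLM for 'Name: argument', execute, remember."""
    memory = Memory()
    for _ in range(max_steps):
        prompt = prompt_gen(task, memory, list(actions))
        name, _, arg = llm.run(prompt).partition(":")
        name, arg = name.strip(), arg.strip()
        if name == "Finish":  # assumed convention for ending the task
            return arg
        if name in actions:
            observation = actions[name](arg)
        else:
            observation = f"Unknown action {name}"
        memory.add(f"{name}: {arg}", observation)
    return "No answer within step budget"


if __name__ == "__main__":
    llm = ScriptedLLM(["Search: capital of France", "Finish: Paris"])
    tools = {"Search": lambda q: "Paris is the capital of France."}
    print(run_agent("What is the capital of France?", llm, tools))
```

Keeping each module behind a small interface is what makes it cheap to swap one piece, such as a different prompt format or LLM backend, without touching the rest of the loop.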

Key Findings

AgentLite is small and focused: core codebase is under 1,000 lines.

Numbers: AgentLite core lines = 959; LangChain = 248,650 (Table 1)

AgentLite runs agent-style QA experiments and reproduces the expected performance ordering across models.

Numbers: HotPotQA medium F1: GPT-4-32k 0.644, xLAM 0.547, GPT-3.5 0.330 (Table 2)

AgentLite supports web-interaction benchmarks and reproduces reward gaps between models.

Numbers: WebShop avg. reward (all): GPT-4-32k 0.681 vs GPT-3.5 0.522; xLAM 0.524 (Table 3)

Results

HotPotQA medium F1-Score

Value (GPT-4-32k): 0.644

Baseline (GPT-3.5-Turbo-16k): 0.330

HotPotQA medium F1-Score

Value (xLAM-v0.1): 0.547

Baseline (GPT-3.5-Turbo-16k): 0.330

WebShop avg. reward (all tasks)

Value (GPT-4-32k): 0.681

Baseline (GPT-3.5-Turbo-16k): 0.522

WebShop avg. reward (all tasks)

Value (xLAM-v0.1): 0.524

Baseline (GPT-3.5-Turbo-16k): 0.522
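To make the size of these gaps easier to compare, the short script below computes the absolute and relative difference of each reported value against its GPT-3.5-Turbo-16k baseline. It only does arithmetic on the numbers listed above; the `gap` helper is a name introduced here for illustration.

```python
# (metric, model, reported value, GPT-3.5-Turbo-16k baseline) from the results above
reported = [
    ("HotPotQA medium F1", "GPT-4-32k", 0.644, 0.330),
    ("HotPotQA medium F1", "xLAM-v0.1", 0.547, 0.330),
    ("WebShop avg. reward", "GPT-4-32k", 0.681, 0.522),
    ("WebShop avg. reward", "xLAM-v0.1", 0.524, 0.522),
]


def gap(value: float, baseline: float) -> tuple:
    """Return (absolute difference, difference relative to the baseline)."""
    absolute = value - baseline
    return absolute, absolute / baseline


for metric, model, value, baseline in reported:
    absolute, relative = gap(value, baseline)
    print(f"{metric:22s} {model:10s} {absolute:+.3f} ({relative:+.1%})")
```

On HotPotQA, GPT-4-32k is roughly 95% above the baseline and xLAM-v0.1 roughly 66% above it. On WebShop the GPT-4-32k gap is about 30%, while xLAM-v0.1 is within about 0.4% of the baseline.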

Who Should Care

What To Try In 7 Days

Clone the AgentLite GitHub and run the included HotPotQA or WebShop example to reproduce results.

Add a simple Think action to an agent and measure behavioral changes on one benchmark.

Build a ManagerAgent that delegates a two-step task to two specialized agents (search + action).

Agent Features

Memory

  • action-observation chain memory

Planning

  • ReAct-like (Think action)
  • Reflection (Reflect action)
  • Plan action
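One way to picture these reasoning actions is as "inner" actions: they call no external tool and only write the model's reasoning into the history so that the next prompt can see it. The sketch below assumes that design; `InnerAction`, `record_step`, and the fixed `"OK"` observation are illustrative choices, not AgentLite's confirmed API.

```python
from typing import Dict, List, Tuple


class InnerAction:
    """Hypothetical base class: a reasoning step that touches no external tool."""

    name = "Inner"
    description = ""

    def __call__(self, text: str) -> str:
        # Nothing to execute; acknowledge so the agent loop can continue.
        return "OK"


class Think(InnerAction):
    name = "Think"
    description = "Reason about the next step before acting (ReAct-like)."


class Plan(InnerAction):
    name = "Plan"
    description = "Lay out the steps needed to finish the task."


class Reflect(InnerAction):
    name = "Reflect"
    description = "Review earlier steps and note what to change."


def record_step(
    history: List[Tuple[str, str]],
    actions: Dict[str, InnerAction],
    name: str,
    text: str,
) -> str:
    """Run the named action and append (action, observation) to the history."""
    observation = actions[name](text)
    history.append((f"{name}: {text}", observation))
    return observation


if __name__ == "__main__":
    registry = {a.name: a for a in (Think(), Plan(), Reflect())}
    steps: List[Tuple[str, str]] = []
    record_step(steps, registry, "Think", "I should search before answering.")
    print(steps)
```

Because a new reasoning type is just another small class in the registry, adding one does not require changes to the agent loop itself.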

Tool Use

  • API-wrapped tools
  • web search (DuckDuckGo, Wikipedia)
  • WolframAlpha solver
  • image generation (DALL-E)
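An API-wrapped tool can be thought of as a callable with a name and a description that the prompt can advertise to the LLM. The following sketch assumes that shape; `ToolAction` is a hypothetical class, and a local dictionary stands in for a real search or solver API so the example runs without network access.

```python
from typing import Callable


class ToolAction:
    """Hypothetical wrapper that exposes any callable as an agent action."""

    def __init__(self, name: str, description: str, fn: Callable[[str], str]):
        self.name = name
        self.description = description
        self.fn = fn

    def __call__(self, query: str) -> str:
        # Return errors as observations so the agent can react instead of crashing.
        try:
            return str(self.fn(query))
        except Exception as exc:
            return f"{self.name} failed: {exc}"


# Stand-in for an external API: a tiny local lookup table.
FAKE_ENCYCLOPEDIA = {"HotPotQA": "A multi-hop question answering dataset."}

lookup = ToolAction(
    name="Lookup",
    description="Look up a short definition for a term.",
    fn=lambda term: FAKE_ENCYCLOPEDIA[term],
)

if __name__ == "__main__":
    print(lookup("HotPotQA"))
    print(lookup("UnknownTerm"))
```

Turning tool failures into observation strings keeps the agent loop alive and gives the LLM a chance to retry with a different query.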

Frameworks

  • PromptGen
  • Actions
  • LLM wrapper
  • Memory module

Is Agentic

true

Architectures

  • hierarchical_multi_agent
  • multi_llm_multi_agent
  • manager-team

Collaboration

  • manager-agent orchestration
  • sequential TaskPackage delegation
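Sequential delegation by a manager can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `TaskPackageSketch` and `ManagerSketch` are hypothetical names with assumed fields, team members are plain callables, and the plan is passed in directly, whereas a real manager would obtain it from its LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class TaskPackageSketch:
    """Hypothetical unit of delegated work (fields are assumptions)."""
    instruction: str
    executor: str
    result: str = ""


class ManagerSketch:
    """Hands sub-tasks to named team members one after another."""

    def __init__(self, team: Dict[str, Callable[[str], str]]):
        self.team = team

    def run(self, plan: List[Tuple[str, str]]) -> List[TaskPackageSketch]:
        completed: List[TaskPackageSketch] = []
        context = ""
        for executor, instruction in plan:
            # Pass the previous result along so later steps can build on it.
            package = TaskPackageSketch(
                instruction=f"{instruction} {context}".strip(),
                executor=executor,
            )
            package.result = self.team[executor](package.instruction)
            context = f"(previous result: {package.result})"
            completed.append(package)
        return completed


if __name__ == "__main__":
    team = {
        "search": lambda task: "item-42",
        "buy": lambda task: f"done[{task}]",
    }
    manager = ManagerSketch(team)
    plan = [("search", "Find red shoes"), ("buy", "Purchase the item")]
    for pkg in manager.run(plan):
        print(pkg.executor, "->", pkg.result)
```

Carrying the previous result forward as text is the simplest form of inter-agent communication, which matches the basic communication patterns noted under Limitations.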

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Not a new LLM or training method; improvements depend on chosen LLM backend.
  • Communication patterns among agents are basic; richer protocols are future work.
  • Relies on external LLM APIs for execution; costs and latency depend on those providers.
  • Benchmarks shown are limited to HotPotQA and WebShop-style tasks.

When Not To Use

  • If you need a full-featured industrial orchestration stack with heavy integrations.
  • If your team requires built-in advanced agent communication protocols not yet implemented.
  • If you need an end-to-end production system with SLAs and monitoring out of the box.

Failure Modes

  • Agent outputs limited by backend LLM quality (hallucinations or wrong tool calls).
  • ManagerAgent may create sub-tasks that subordinate agents misinterpret.
  • Memory growth can bloat prompts and exceed token limits on long tasks.
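One common mitigation for prompt growth is to keep only the most recent steps that fit a token budget. The sketch below is a generic technique, not something the source attributes to AgentLite; `trim_history` is a hypothetical helper, and whitespace splitting is a rough stand-in for a real tokenizer.

```python
from typing import Callable, List, Tuple


def trim_history(
    steps: List[Tuple[str, str]],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[Tuple[str, str]]:
    """Keep the newest (action, observation) pairs that fit within max_tokens."""
    kept: List[Tuple[str, str]] = []
    used = 0
    # Walk backwards so the most recent steps are kept first.
    for action, observation in reversed(steps):
        cost = count_tokens(action) + count_tokens(observation)
        if used + cost > max_tokens:
            break
        kept.append((action, observation))
        used += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == "__main__":
    history = [
        ("Search: red shoes", "Found 12 results"),
        ("Click: item 3", "Opened product page"),
        ("Think: price is fine", "OK"),
    ]
    print(trim_history(history, max_tokens=10))
```

Dropping the oldest steps is lossy; summarizing them into a single entry is an alternative when early observations still matter.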

Core Entities

Models

  • GPT-3.5-Turbo-16k-0613
  • GPT-4-0613
  • GPT-4-32k-0613
  • xLAM-v0.1
  • Mixtral 8x7B MoE

Metrics

  • F1-Score
  • Accuracy
  • avg. reward
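For the F1-Score, a reasonable reading is the token-overlap F1 commonly used for extractive QA benchmarks such as HotPotQA. The source does not spell out the scoring code, so the version below is a sketch under that assumption, with a simplified normalization step (lowercasing, punctuation and article removal).

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(f1_score("The Eiffel Tower", "Eiffel Tower"))
    print(f1_score("Paris, France", "Paris"))
```

Unlike exact-match accuracy, this metric gives partial credit when the agent's answer contains the gold answer plus extra tokens.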

Datasets

  • HotPotQA
  • WebShop (AgentBoard tasks)

Benchmarks

  • HotPotQA
  • WebShop