Let a CEO→Manager→Worker hierarchy auto-write better prompts and improve zero-shot LLM outputs

May 30, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

1

Authors

Yuchi Liu, Jaskirat Singh, Gaowen Liu, Ali Payani, Liang Zheng

Links

Abstract / PDF

Why It Matters For Business

HMAW automates prompt tuning without training and boosts response quality across varied tasks, letting teams improve outputs quickly while avoiding dataset-specific finetuning.

Summary TLDR

The authors propose HMAW, a zero-shot prompt optimizer that runs three LLM roles (CEO, Manager, Worker) to rewrite a user query into a refined prompt. HMAW is task-agnostic and needs no training. Across five datasets (education, dialog, math, code, general QA) it raises evaluator preference rates from ~38.5% to 69.2% on average (a 30.7-point absolute gain) and slightly improves GSM8K accuracy (+1.7 points). The method adds latency (roughly 4–10 extra seconds per sample) and works with different LLM backbones (Mixtral, GPT-3.5, GPT-4).

Problem Statement

Good prompts matter, but manual prompts require hand design, learned prompts require training, and both generalize poorly across tasks. The paper asks: can an LLM-based multi-agent hierarchy automatically produce query-specific, zero-shot prompts that generalize across tasks without training?

Main Contribution

HMAW: a 3-layer CEO→Manager→Worker workflow that rewrites queries into refined prompts without training.

Empirical evaluation on five datasets showing large average gains in evaluator preference scores.

Ablations showing skip connections and the three-layer design are important; three layers is empirically optimal.
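To make the workflow concrete, here is a minimal sketch of a three-layer chain with skip connections. It is an illustration under assumptions, not the authors' implementation: the function name `hmaw_respond`, the `LLM` callable type, and all prompt wording are hypothetical. The only ideas taken from the summary are the CEO→Manager→Worker order and passing the original user query to every layer.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def hmaw_respond(query: str, llm: LLM) -> str:
    """Run a CEO -> Manager -> Worker chain with skip connections (sketch)."""
    # CEO layer: sees only the user query and writes high-level guidance.
    ceo_out = llm(
        "You are the CEO. Write high-level guidance for a manager who must "
        f"instruct a worker to answer this user query.\n\nUser query: {query}"
    )
    # Manager layer: sees the CEO guidance plus the original query (skip connection).
    manager_out = llm(
        "You are the Manager. Turn the CEO guidance into detailed instructions "
        f"for the worker.\n\nCEO guidance: {ceo_out}\n\nUser query: {query}"
    )
    # Worker layer: answers the original query under the Manager's instructions.
    return llm(
        "Follow these instructions when answering.\n\n"
        f"Instructions: {manager_out}\n\nUser query: {query}"
    )


# Usage with a stub model that records every prompt it receives.
calls: list[str] = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<output {len(calls)}>"


answer = hmaw_respond("What is 12 * 7?", stub_llm)
```

Swapping `stub_llm` for a real client call is the only change needed to try this against a live model. Keeping the model behind a plain callable also makes it easy to count calls and tokens per layer.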

Key Findings

Average preference score across five tasks increases by 30.7 percentage points

Numbers: Avg pref 69.2% (HMAW) vs 38.5% (no prompt); +30.7 pts

GSM8K accuracy improves slightly (+1.7 points) under HMAW

Numbers: GSM8K acc 70.3% (HMAW) vs 68.6% (no prompt); +1.7 pts

Skip connections matter; removing them drops performance by up to ~20.7 points

Numbers: Example: removing the Manager skip connection costs 20.7 pts on CodeNet

Three-layer hierarchy is better than fewer or more layers

Numbers: Best results at 3 layers; >3 layers degrade performance

Method generalizes across LLM backbones

Numbers: HMAW preference >50% with Mixtral, GPT-3.5 and GPT-4 agents

Latency and token cost increase materially (per-sample time rises by 207%–734%)

Numbers: Extra time per sample: ~4.3–9.97s (varies by dataset); increases of 207%–734%
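A short worked calculation helps when reading the headline numbers. The sketch below assumes the 207%–734% figures are increases over the no-prompt inference time, so total time is (1 + p/100) times the baseline. The helper name `total_multiplier` is hypothetical.

```python
def total_multiplier(pct_increase: float) -> float:
    """Convert a percentage increase into a total-time multiplier."""
    return 1.0 + pct_increase / 100.0


# Latency: a 207% increase means about 3.07x the baseline time; 734% means about 8.34x.
low = round(total_multiplier(207.0), 2)
high = round(total_multiplier(734.0), 2)

# Preference: the absolute gain is a difference in percentage points.
gain_pts = round(69.2 - 38.5, 1)

# GSM8K accuracy: also a point difference, not a relative percentage.
gsm8k_pts = round(70.3 - 68.6, 1)
```

Under that reading, HMAW responses take roughly 3× to 8× as long as unprompted ones, which is the figure to weigh against the 30.7-point preference gain.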

Results

Preference score (GPT-3.5 evaluator), ATLAS

Value: 64.1%

Baseline (no prompting): 35.9%

Preference score (GPT-3.5 evaluator), FED

Value: 86.2%

Baseline (no prompting): 13.8%

Preference score (GPT-3.5 evaluator), CodeNet

Value: 70.3%

Baseline (no prompting): 35.6%

Preference score (GPT-3.5 evaluator), Education

Value: 64.4%

Baseline (no prompting): 38.8%

Accuracy, GSM8K

Value: 70.3%

Baseline (no prompting): 68.6%

Average preference across 5 tasks

Value: 69.2%

Baseline: 38.5%

Extra inference time per sample (avg)

Value: +9.97s (ATLAS), +5.14s (FED), +4.30s (GSM8K), +7.27s (CodeNet), +7.76s (Education)

Baseline: no-prompt inference times vary by dataset

Who Should Care

What To Try In 7 Days

Prototype HMAW with your current LLM: implement CEO→Manager→Worker prompt templates and compare outputs on 100 live queries.

Measure trade-offs: log evaluator preference, latency, and token cost to decide where quality gains justify extra runtime.

Enable skip connections: always include the original user query at intermediate layers to preserve details.
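The comparison step above can be sketched as a small harness that runs both pipelines on the same queries and records preference and latency. Everything here is a hypothetical illustration: `Record`, `compare`, `summarize`, and the `judge` signature (returns True when the candidate answer is preferred) are assumptions, and token accounting is left out because it depends on the usage data your LLM provider returns.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass
class Record:
    query: str
    candidate_preferred: bool  # True if the judge preferred the HMAW-style answer
    baseline_secs: float
    candidate_secs: float


def timed(fn: Callable[[str], str], query: str) -> tuple[str, float]:
    """Call fn(query) and return its output with wall-clock seconds."""
    start = time.perf_counter()
    out = fn(query)
    return out, time.perf_counter() - start


def compare(
    queries: Iterable[str],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> list[Record]:
    """Run both pipelines on each query and log the judge's verdict and timings."""
    records = []
    for q in queries:
        base_out, base_t = timed(baseline, q)
        cand_out, cand_t = timed(candidate, q)
        records.append(Record(q, judge(q, base_out, cand_out), base_t, cand_t))
    return records


def summarize(records: list[Record]) -> dict[str, float]:
    """Aggregate preference rate and mean extra latency over all records."""
    return {
        "preference_rate": mean(1.0 if r.candidate_preferred else 0.0 for r in records),
        "mean_extra_secs": mean(r.candidate_secs - r.baseline_secs for r in records),
    }


# Usage with stubs: the toy judge simply prefers the longer answer.
records = compare(
    ["q1", "q2"],
    baseline=lambda q: "short",
    candidate=lambda q: "a longer, more detailed answer",
    judge=lambda q, base, cand: len(cand) > len(base),
)
summary = summarize(records)
```

With real pipelines plugged in, `summary` gives the two numbers needed for the trade-off decision: how often the refined prompt wins and how many extra seconds it costs.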

Agent Features

Memory

  • skip connections to preserve query details

Planning

  • hierarchical instruction generation

Frameworks

  • HMAW

Is Agentic

true

Architectures

  • CEO-Manager-Worker hierarchy

Collaboration

  • multi-agent coordination
  • layered instruction passing

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Adds substantial extra latency and token cost per request
  • Relies on an LLM evaluator for subjective comparisons, which can carry bias
  • May not beat task-specific handcrafted prompts in specialized domains like math

When Not To Use

  • When strict low-latency constraints exist (real-time systems)
  • When you already have a tuned, high-performing task-specific prompt (e.g., math CoT)
  • When token budget or API costs prohibit multi-stage prompting
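These conditions can be turned into a simple pre-flight check. The sketch below is an assumption-laden illustration: the function name, parameters, and the 10-second default (chosen to sit near the +9.97s upper end of the reported extra time) are hypothetical and should be replaced with measurements from your own traffic.

```python
def should_use_hmaw(
    latency_budget_secs: float,
    baseline_secs: float,
    expected_extra_secs: float = 10.0,  # assumption: near the reported +9.97s worst case
    has_tuned_prompt: bool = False,
) -> bool:
    """Return True only if multi-stage prompting fits the budget and no tuned prompt exists."""
    if has_tuned_prompt:
        # An existing task-specific prompt (e.g. math CoT) may already do better.
        return False
    return baseline_secs + expected_extra_secs <= latency_budget_secs


fits_batch_job = should_use_hmaw(latency_budget_secs=30.0, baseline_secs=2.0)
fits_realtime = should_use_hmaw(latency_budget_secs=5.0, baseline_secs=2.0)
```

A token-budget check could be added in the same way once per-request token usage has been measured.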

Failure Modes

  • Layer-generated instructions drift from the original intent if skip connections are removed
  • Deeper hierarchies (>3) can overcomplicate prompting and reduce quality
  • Evaluator bias can overstate gains if not checked
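One common way to check for evaluator bias is to judge each pair twice with the answer order swapped and count only consistent wins in full. This is a general mitigation, not something this summary says the authors did. The `Judge` signature (returns "A" or "B" for the better of two answers) and the function name are hypothetical.

```python
from typing import Callable

# Hypothetical judge: given (query, answer_a, answer_b), returns "A" or "B".
Judge = Callable[[str, str, str], str]


def debiased_preference(query: str, baseline_out: str, hmaw_out: str, judge: Judge) -> float:
    """Score 1.0 if HMAW wins in both orders, 0.0 if it loses both, 0.5 if the verdicts disagree."""
    first = judge(query, baseline_out, hmaw_out)   # HMAW answer shown in position B
    second = judge(query, hmaw_out, baseline_out)  # HMAW answer shown in position A
    wins = int(first == "B") + int(second == "A")
    return wins / 2


# A judge with pure position bias (always picks "A") scores 0.5, exposing the bias.
biased_score = debiased_preference("q", "base", "hmaw", lambda q, a, b: "A")

# A judge that consistently prefers the longer answer gives a full win.
length_score = debiased_preference(
    "q", "short", "a much longer answer", lambda q, a, b: "A" if len(a) > len(b) else "B"
)
```

Scores that cluster around 0.5 suggest the evaluator is reacting to position rather than content, so reported gains should be treated with caution.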

Core Entities

Models

  • Mixtral-8x7B-v0.1
  • GPT-3.5
  • GPT-4

Metrics

  • preference score (%) from GPT-3.5 evaluator
  • Accuracy

Datasets

  • ATLAS
  • FED
  • GSM8K
  • CodeNet (Python subset)
  • Education (100 Qs, authors' new set)