APEMO: reallocate compute to negative peaks and endings to stabilize long-horizon agent workflows

February 20, 2026 · 6 min

Overview

Decision Snapshot: Needs Validation

Scores reflect solid controlled experiments on small LLMs, supported by agent-based modeling (ABM); effects are consistent for long trajectories, but human-subject validation and large-model scaling are missing.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 5/5

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 55%

Production readiness: 60%

Novelty: 65%

Authors

Hanjing Shi, Dominic DiFranzo

Why It Matters For Business

APEMO raises perceived reliability and reuse in multi-step AI workflows without retraining, giving product teams a runtime lever to improve user trust under fixed compute budgets.

Summary TLDR

The paper introduces APEMO, a runtime orchestration layer that detects negative "peaks" and weak endings in multi-step agent workflows and reallocates a fixed compute budget toward repairing them. APEMO does not change models or training. On small LLM families and multi-agent flows, it raises trajectory-level quality (e.g., +0.0791 mean quality vs. a peak-end baseline) and reuse probability, at the price of a modest increase in coordination cost (about +6% in long-horizon blocks). Benefits grow with trajectory depth and shrink when a strong temporal baseline is already in place.

Problem Statement

Human judgments of multi-step interactions weight intense moments and endings more than average step accuracy. Existing alignment and orchestration methods usually optimize step-level or structural properties and ignore this temporal asymmetry. The problem: how to improve perceived reliability and reuse of long-horizon agentic systems under a fixed compute budget by controlling when compute is applied over time.
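To make the temporal asymmetry concrete, here is a minimal Python sketch of a peak-end weighted score. The function name, the weights, and the two example trajectories are illustrative assumptions, not the paper's definition; the point is only that two trajectories with the same average step quality can score very differently once the worst moment and the ending are weighted in.

```python
from statistics import mean

def peak_end_quality(step_scores, w_peak=0.4, w_end=0.4):
    """Blend mean step quality with the worst step and the final step.

    step_scores: per-turn quality in [0, 1], higher is better.
    w_peak, w_end: illustrative weights (assumptions, not from the paper).
    """
    if not step_scores:
        raise ValueError("step_scores must be non-empty")
    w_mean = 1.0 - w_peak - w_end
    negative_peak = min(step_scores)  # the single worst moment
    ending = step_scores[-1]          # the last turn the user sees
    return w_mean * mean(step_scores) + w_peak * negative_peak + w_end * ending

# Two hypothetical 8-turn trajectories with the same mean step quality.
steady = [0.7] * 8
spiky = [0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.1]

print(peak_end_quality(steady), peak_end_quality(spiky))
```

Under these assumed weights the spiky trajectory scores well below the steady one, even though a plain average cannot tell them apart.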

Main Contribution

APEMO: a runtime temporal-affective orchestration layer that reallocates fixed compute toward detected negative peaks and endings.

A constrained multi-objective formulation balancing peak-end weighted quality, reuse robustness, frustration proxies, and coordination cost.
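One way such a formulation could be written, as a hedged sketch only: the scalarized form, the symbols, and the trade-off weights below are assumptions for illustration, since this summary does not reproduce the paper's exact objective. Here c_t is the compute assigned to turn t, B is the fixed budget, Q_PE is peak-end weighted quality, R is reuse robustness, F is a frustration proxy, and C is coordination cost.

```latex
\max_{c_1,\dots,c_T}\;
  Q_{\mathrm{PE}}(c) + \lambda_R\, R(c) - \lambda_F\, F(c) - \lambda_C\, C(c)
\quad \text{s.t.} \quad
  \sum_{t=1}^{T} c_t \le B, \qquad c_t \ge 0 \;\; \text{for all } t .
```

The budget constraint is what distinguishes this from simply spending more compute: any extra precision on a flagged turn has to be taken from another turn.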

Key Findings

APEMO improves mean trajectory quality vs a peak-end baseline.

Numbers: +0.0791 mean quality (95% CI [0.0525, 0.1055])

Practical Use: Under the same compute cap, move inference precision toward detected negative peaks and endings to raise overall perceived quality.

Evidence Ref: Table 2; Section 5.1

APEMO increases reuse probability in long-horizon runs.

Numbers: +0.0609 reuse probability (95% CI [0.0383, 0.0826])

Practical Use: Improved endpoints and fewer negative peaks lead users to reuse or trust outputs more; test reuse as a primary metric.

Evidence Ref: Table 2; Section 5.1
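The practical advice in these findings, moving precision toward negative peaks and endings under the same compute cap, can be sketched as a budget-preserving reallocation. This is a minimal illustration under stated assumptions: `reallocate_budget`, the token-based notion of "precision", and the `boost` and `end_boost` multipliers are hypothetical and are not taken from the paper.

```python
def reallocate_budget(base_tokens, flagged, boost=2.0, end_boost=2.0):
    """Shift a fixed token budget toward flagged turns and the final turn.

    base_tokens: per-turn token allowances under the uniform plan.
    flagged: set of turn indices where a negative peak was detected.
    boost, end_boost: hypothetical priority multipliers (assumptions).
    Returns integer allowances whose sum equals sum(base_tokens).
    """
    if not base_tokens:
        return []
    total = sum(base_tokens)
    last = len(base_tokens) - 1
    weights = []
    for i, tokens in enumerate(base_tokens):
        w = float(tokens)
        if i in flagged:
            w *= boost      # prioritize repairing a detected negative peak
        if i == last:
            w *= end_boost  # prioritize the ending
        weights.append(w)
    scale = total / sum(weights)
    plan = [int(w * scale) for w in weights]
    # Give any rounding leftover to the final turn so the cap is met exactly.
    plan[last] += total - sum(plan)
    return plan

# Hypothetical 8-turn workflow with a negative peak flagged at turn index 3.
plan = reallocate_budget([100] * 8, flagged={3})
print(plan, sum(plan))
```

Unflagged middle turns pay for the extra precision, which is why a mis-detected peak costs step-level accuracy elsewhere.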

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Mean trajectory quality (APEMO - task_peak_end) | +0.0791 | task_peak_end | +0.0791 (95% CI [0.0525, 0.1055]) | Long-horizon T=8, n=20 | Table 2 long-horizon block | Section 5.1
Reuse probability (APEMO - task_peak_end) | +0.0609 | task_peak_end | +0.0609 (95% CI [0.0383, 0.0826]) | Long-horizon T=8, n=20 | Table 2 long-horizon block | Section 5.1
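When replicating deltas like these, a confidence interval on the paired difference matters more than the point estimate. This summary does not say how the paper's intervals were computed; the sketch below uses a percentile bootstrap, one common choice, and the `deltas` list is synthetic placeholder data rather than results from the paper.

```python
import random

def bootstrap_mean_ci(deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of paired differences.

    deltas: per-run differences (treatment minus baseline).
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # seeded for repeatable resampling
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(deltas) / n, lower, upper

# Synthetic placeholder differences, NOT data from the paper.
deltas = [0.05, 0.10, 0.08, 0.02, 0.12, 0.07, 0.09, 0.06]
point, lower, upper = bootstrap_mean_ci(deltas)
print(f"delta={point:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
```

An interval that excludes zero, as both reported intervals do, is the minimum bar before treating a delta as real.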

What To Try In 7 Days

Instrument simple frustration proxies (repetition, drift) across a sample 8-turn workflow.

Implement a runtime monitor that flags negative peaks and reassigns precision to flagged turns.

Run A/B tests on endpoint quality and reuse probability under fixed compute caps.
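The first two steps above can be prototyped in a few lines. The sketch below is an assumption-heavy starting point: the word-overlap definitions of repetition and drift, the `threshold` value, and all function names are hypothetical stand-ins, not the proxies used in the paper.

```python
def _tokens(text):
    return text.lower().split()

def repetition(prev_output, output):
    """Jaccard word overlap between consecutive outputs (1.0 = same wording)."""
    a, b = set(_tokens(prev_output)), set(_tokens(output))
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def drift(task, output):
    """Share of task words missing from the output (1.0 = fully off-task)."""
    goal = set(_tokens(task))
    if not goal:
        return 0.0
    return 1.0 - len(goal & set(_tokens(output))) / len(goal)

def flag_negative_peaks(task, outputs, threshold=0.6):
    """Return turn indices whose worst proxy exceeds the (assumed) threshold."""
    flagged = []
    for i, out in enumerate(outputs):
        rep = repetition(outputs[i - 1], out) if i > 0 else 0.0
        if max(rep, drift(task, out)) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical workflow: turn 2 repeats turn 1, turn 3 wanders off-task.
task = "summarize quarterly sales report"
outputs = [
    "plan: summarize the quarterly sales report",
    "quarterly sales report totals, summarize by region",
    "quarterly sales report totals, summarize by region",
    "here is a poem about autumn",
]
print(flag_negative_peaks(task, outputs))
```

Word overlap is crude; embedding similarity or a critic model would be natural replacements once the instrumentation is in place. The flagged indices are what a reallocation step would then prioritize.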

Agent Features

Memory: short-term trajectory signals (frustration proxies)
Planning: LLM planner-executor flows; temporal scheduling of reasoning precision
Tool Use: runtime orchestration overlay; precision repair operations
Frameworks: APEMO temporal-affective orchestration; Agent-Based Modeling (ABM) for stress tests
Is Agentic: Yes
Architectures: Planner-Executor-Critic topology; role-based multi-agent flows
Collaboration: applies to multi-agent coordination as an overlay

Optimization Features

Token Efficiency: fixed total compute budget; reassign tokens/precision
Infra Optimization: local Ollama-based runtime experiments
System Optimization: trade-off analysis between robustness gain and coordination cost
Training Optimization: none (no fine-tuning or reward modification)
Inference Optimization: reallocate inference precision across turns; precision repair triggered by peak detection

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

No human-subject studies to confirm subjective trust or reuse intent.

Comparisons limited to plan-execute style orchestrators, not full industrial stacks.

When Not To Use

Shallow workflows (T≈2) where coordination overhead does not amortize.

Systems that already apply strong temporal scheduling.

Failure Modes

Mis-detection of peaks causes wasted compute and reduced step-level accuracy.

Coordination overhead can erase gains in short trajectories.

Core Entities

Models

llama3.2:1b, qwen2.5:1.5b, gemma2:2b

Metrics

peak-end weighted quality, reuse probability, average frustration, coordination cost