ExLLM: evolving compact experience + k-offspring LLM optimizer that sets new PMO SOTA

February 18, 2025 · 9 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Nian Ran, Yue Wang, Xiaoyuan Zhang, Zhongzheng Li, Qingsong Ran, Wenhao Li, Richard Allmendinger

Why It Matters For Business

ExLLM turns LLMs into a sample-efficient, no-training optimizer that cuts API cost and runtime and generalizes across chemistry, engineering and code tasks, lowering the barrier to rapid design under limited evaluation budgets.

Summary TLDR

ExLLM is an LLM-as-optimizer framework for large discrete search spaces (molecules, geometry, engineering). It combines (1) a single compact evolving experience (a memo of a few hundred words), (2) k-offspring sampling to generate multiple candidates per LLM call (default k=2), and (3) a feedback adapter that normalizes objectives and formats constraints. It requires no model training. On PMO it reaches an aggregate score of 19.165 (max 23), +7.3% vs prior SOTA, and it transfers beyond chemistry, setting records in circle packing, stellarator design and several engineering tasks while cutting API cost and runtime.

Problem Statement

Black-box optimization in large, discrete spaces (e.g., molecules) is costly: it takes many iterations, the feedback is heterogeneous, and the search is hard to steer with textual priors. Existing LLM approaches either keep appending to memory (causing prompt bloat and exploration collapse) or need further training. Practitioners need a sample-efficient, transferable optimizer that uses LLM reasoning without exploding prompt cost.

Main Contribution

A compact, evolving experience snippet that distills non-redundant good and bad examples into a single memo to avoid prompt bloat and preserve exploration.

A k-offspring sampling scheme that produces multiple diverse candidates per LLM call to widen exploration while lowering query overhead (default k=2).

A feedback adapter that normalizes objectives to [0,1], formats constraints/expert hints, and can promote critical constraints into explicit objectives.

A practical, no-training optimizer that sets new SOTA on PMO (19.165 total), generalizes to circle packing and stellarator tasks, and ships code.
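
The feedback adapter described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Objective` class, the min-max scaling, the linear penalty used to turn a constraint into an objective, and the text format are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    """Hypothetical objective description; bounds are assumed to be known."""
    name: str
    lo: float              # smallest raw value expected
    hi: float              # largest raw value expected
    maximize: bool = True  # False means smaller raw values are better


def normalize(obj: Objective, raw: float) -> float:
    """Min-max scale a raw value to [0, 1] so that 1 is always best."""
    if obj.hi <= obj.lo:
        raise ValueError("hi must be greater than lo")
    score = (raw - obj.lo) / (obj.hi - obj.lo)
    score = min(1.0, max(0.0, score))  # clamp out-of-range values
    return score if obj.maximize else 1.0 - score


def constraint_as_objective(violation: float, scale: float) -> float:
    """Promote a constraint to an objective (assumed linear penalty).

    violation <= 0 means the constraint is satisfied and scores 1.0;
    the score falls linearly to 0.0 as the violation reaches `scale`.
    """
    if violation <= 0:
        return 1.0
    return max(0.0, 1.0 - violation / scale)


def format_feedback(scores: dict[str, float]) -> str:
    """Render normalized scores as one line of text for the prompt."""
    return ", ".join(f"{name}={value:.3f}" for name, value in sorted(scores.items()))


qed = Objective("qed", lo=0.0, hi=1.0)
sa = Objective("sa", lo=1.0, hi=10.0, maximize=False)
feedback = format_feedback({"qed": normalize(qed, 0.8), "sa": normalize(sa, 1.0)})
```

Putting every objective on the same [0, 1] scale with "higher is better" lets a single text template serve tasks whose raw metrics differ in units and direction.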

Key Findings

ExLLM achieves the top aggregate PMO score reported in the paper.

Numbers: PMO aggregate 19.165 (max 23) vs prior SOTA 17.862

A compact evolving experience improves multi-objective coverage and reduces query cost compared to retrieval-style memory.

Numbers: Hypervolume 0.750 vs retrieval-style 0.427; LLM queries 3312 vs 18055

k-offspring sampling increases exploration and is most stable at small k.

Numbers: k=2 gives best, consistent gains across ablations; k>3 often degrades

ExLLM transfers beyond chemistry and finds strong engineering/geometry solutions.

Numbers: Offshore jacket weight 13.6 tons (vs human 218 tons, −93%); Stellarator P2 0.505 vs 0.431 (+17%)
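
Hypervolume, the coverage metric quoted in the findings above, measures how much of objective space a solution set dominates relative to a reference point. The sketch below is a two-objective simplification for intuition only; the task reported here has five objectives, and the function name, the reference point at the origin, and the sample points are assumptions.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref` (both objectives maximized)."""
    # Ignore points that do not improve on the reference in both objectives.
    useful = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective downwards.
    useful.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in useful:
        if y > best_y:
            # Only the strip above the previous best y is newly covered.
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Two trade-off points plus one dominated point that adds no area.
front = [(0.8, 0.2), (0.4, 0.6), (0.3, 0.3)]
hv = hypervolume_2d(front)
```

A larger value means the set reaches further along the objectives, spreads better across trade-offs, or both, which is why it is used to compare memory designs on multi-objective tasks.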

Results

PMO aggregate score

Value: 19.165 (max 23)

Baseline: 17.862 (MOLLEO)

Offshore jacket final weight

Value: 13.6 tons (feasible)

Baseline: 218 tons (human baseline)

Stellarator P2 score

Value: 0.505

Baseline: 0.431 (ALM-NGOpt)

Evolving experience vs retrieval memory (5-objective task)

Value: Hypervolume 0.750; LLM queries 3312

Baseline: Retrieval-style Hypervolume 0.427; LLM queries 18055

Runtime (five-objective molecular task)

Value: 0.393 ± 0.114 h (ExLLM GPT-4o)

Baseline: 6.029 ± 1.281 h (MOLLEO GPT-4o)
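
The relative gains implied by the value and baseline pairs above can be checked with a few lines of arithmetic. This is only a convenience sketch over the rounded numbers listed in this section; the helper name `relative_change` is an assumption.

```python
def relative_change(value: float, baseline: float) -> float:
    """Signed fractional change of `value` relative to `baseline`."""
    return (value - baseline) / baseline


# (value, baseline) pairs copied from the results listed above.
reported = {
    "PMO aggregate score": (19.165, 17.862),
    "Offshore jacket weight (tons)": (13.6, 218.0),
    "Stellarator P2 score": (0.505, 0.431),
    "Hypervolume": (0.750, 0.427),
    "LLM queries": (3312, 18055),
    "Mean runtime (h)": (0.393, 6.029),
}

changes = {name: relative_change(v, b) for name, (v, b) in reported.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")

# Ratio of mean runtimes; the reported standard deviations are ignored here.
speedup = 6.029 / 0.393
```

With these rounded inputs the PMO gain comes out at about 7.3%, the query count drops by roughly 82%, and the mean runtime is shorter by a factor of about 15.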

Who Should Care

What To Try In 7 Days

Run ExLLM on your task: supply a short task template and an evaluation function; start with k=2 and p_exp=0.5.

Promote any critical or variable constraint to an explicit objective in the feedback adapter to stabilize feasibility.

Compare wall-clock time and API cost vs your current optimizer; measure Top-10 AUC and validity to assess practical gains.
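
For the Top-10 AUC comparison suggested above, a simplified sketch of the metric is shown here. It assumes the curve is the mean of the best k scores seen after each evaluation, averaged over the evaluation budget, with the final value carried forward if the run stops early. The PMO benchmark's own implementation may differ in details such as sampling frequency, so treat `top_k_auc` as a hypothetical helper.

```python
import heapq


def top_k_auc(scores, k=10, budget=None):
    """Average over the budget of the running mean of the top-k scores."""
    if budget is None:
        budget = len(scores)
    if budget <= 0:
        raise ValueError("budget must be positive")
    top = []          # min-heap holding the k best scores seen so far
    area = 0.0
    mean_top = 0.0
    for score in scores[:budget]:
        if len(top) < k:
            heapq.heappush(top, score)
        elif score > top[0]:
            heapq.heapreplace(top, score)
        mean_top = sum(top) / len(top)
        area += mean_top
    # An early stop keeps its final top-k mean for the unused evaluations.
    unused = budget - min(len(scores), budget)
    area += mean_top * unused
    return area / budget


auc_full = top_k_auc([0.2, 0.8, 0.5], k=2)
auc_padded = top_k_auc([0.2, 0.8, 0.5], k=2, budget=6)
```

Because the metric integrates over the whole run, an optimizer that finds good candidates early scores higher than one that reaches the same final quality late, which is the property that matters under a tight evaluation budget.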

Agent Features

Memory

  • Evolving experience (single distilled memo updated per generation)

Planning

  • Evolutionary loop (selection, mutation/crossover)
  • k-offspring multi-sampling per call
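
The planning loop listed above can be sketched on a toy problem. This is an illustration under stated assumptions, not ExLLM's code: `propose_offspring` stands in for a single LLM call and uses crossover plus one bit flip instead of a model, candidates are bit strings, and selection is plain elitist truncation.

```python
import random


def propose_offspring(parents, k, rng):
    """Stand-in for one LLM call that returns k candidates from two parents."""
    a, b = parents
    children = []
    for _ in range(k):
        cut = rng.randrange(1, len(a))
        child = list(a[:cut] + b[cut:])             # one-point crossover
        i = rng.randrange(len(child))
        child[i] = "1" if child[i] == "0" else "0"  # single-bit mutation
        children.append("".join(child))
    return children


def evolve(fitness, population, generations, calls_per_gen, k, seed=0):
    """Evolutionary loop that asks for k offspring per (mock) LLM call."""
    rng = random.Random(seed)
    pop = list(population)
    size = len(pop)
    calls = 0
    for _ in range(generations):
        offspring = []
        for _ in range(calls_per_gen):
            parents = rng.sample(pop, 2)
            offspring.extend(propose_offspring(parents, k, rng))
            calls += 1
        # Deduplicate in insertion order, then keep the best `size` candidates.
        pool = list(dict.fromkeys(pop + offspring))
        pop = sorted(pool, key=fitness, reverse=True)[:size]
    return pop, calls


def count_ones(bits):
    return bits.count("1")


start = ["0000000000", "1000000000", "0100000000", "0010000000"]
final_pop, llm_calls = evolve(count_ones, start, generations=20, calls_per_gen=4, k=2)
proposals = llm_calls * 2  # k candidates per call
```

With k=2 the run produces 160 proposals from 80 calls, which is the sense in which k-offspring sampling widens exploration without adding queries.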

Tool Use

  • AlphaFold3
  • SACS
  • SciPy SLSQP
  • TopsCC (GCU compile/run diagnostics)

Frameworks

  • Prompt templates + feedback adapter

Is Agentic

true

Architectures

  • LLM-as-optimizer (autoregressive LLMs)

Collaboration

  • LLM proposals + external solver postprocessing (hybrid LLM+solver)

Optimization Features

Token Efficiency

  • Single evolving memo reduces prompt token growth versus retrieval-style memory
  • Fewer LLM queries lower API cost and runtime

Infra Optimization

  • Lower total API tokens and shorter wall-clock times vs prior LLM-based baselines

System Optimization

  • Hybrid selection (half fitness-ranked, half Pareto-front) balances exploitation and diversity
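
One way to read the hybrid selection above is sketched below. The source states only the half-and-half split, so the details are assumptions: fitness is taken as the sum of normalized objectives, Pareto-front members fill the second half in fitness order, and any remaining slots fall back to the fitness ranking.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def hybrid_select(candidates, n):
    """Pick n names: half by scalar fitness, the rest from the Pareto front.

    `candidates` maps a name to a tuple of normalized objectives,
    where higher is better for every objective.
    """
    ranked = sorted(candidates, key=lambda name: sum(candidates[name]), reverse=True)
    chosen = ranked[: n // 2]  # exploitation half
    front = [
        name for name in ranked
        if not any(dominates(candidates[other], candidates[name]) for other in candidates)
    ]
    # Diversity half: front members first, then fitness ranking as a fallback.
    for name in front + ranked:
        if len(chosen) >= n:
            break
        if name not in chosen:
            chosen.append(name)
    return chosen


pool = {
    "a": (0.90, 0.10),
    "b": (0.60, 0.60),
    "c": (0.10, 0.95),
    "d": (0.50, 0.50),
    "e": (0.55, 0.58),
}
survivors = hybrid_select(pool, 4)
```

In this example the balanced candidate `e` survives on fitness even though it is dominated, while the extreme trade-offs `a` and `c` survive because they sit on the Pareto front.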

Training Optimization

  • No additional model training required; method works with in-context LLM calls

Inference Optimization

  • k-offspring sampling: generate multiple candidates per LLM call to increase diversity
  • Probabilistic experience injection to limit average prompt size
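
Probabilistic experience injection can be sketched as a prompt builder that includes the memo only with probability p_exp. The function name, the prompt wording, and the section labels are assumptions made for illustration; only the idea of injecting the experience with some probability comes from the source.

```python
import random


def build_prompt(task, parents, experience, p_exp, k, rng):
    """Assemble one prompt; the memo is included with probability p_exp."""
    parts = [task]
    if experience and rng.random() < p_exp:
        parts.append("Experience so far:\n" + experience)
    parts.append("Parents:\n" + "\n".join(parents))
    parts.append(f"Propose {k} new candidates, one per line.")
    return "\n\n".join(parts)


rng = random.Random(0)
memo = "Aromatic rings helped; long alkyl chains hurt."
prompts = [
    build_prompt("Maximize QED.", ["CCO", "c1ccccc1"], memo, p_exp=0.5, k=2, rng=rng)
    for _ in range(1000)
]
with_memo = sum("Experience so far" in p for p in prompts)
```

Under this scheme the expected prompt length is the base length plus p_exp times the memo length, and the calls that omit the memo leave the model free to explore without being conditioned on past experience.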

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Relies heavily on access to capable LLM backbones; best results use proprietary models.
  • Generates SMILES strings and discards invalid outputs rather than repairing them, so validity can vary by domain and LLM.
  • High experience-injection rates can over-condition search and reduce diversity.
  • Some domain transfers require light postprocessing (e.g., feasibility solvers) to guarantee constraints.

When Not To Use

  • You lack reliable LLM API access or budget for many inference tokens.
  • You have real-time or ultra-low-latency constraints that make multi-step LLM calls infeasible.
  • You need strict end-to-end proofs of safety or determinism without human-in-the-loop checks.

Failure Modes

  • Over-conditioning: too-frequent experience injection collapses exploration.
  • Local over-exploration when k is set too large, reducing global search.
  • Many invalid proposals (e.g., malformed SMILES or compilation failures) can waste scarce evaluations.
  • Dependence on initial population quality can affect convergence speed in some settings.

Core Entities

Models

  • GPT-4o-2024-05-13 (GPT-4o)
  • Gemini-2.5-flash
  • DeepSeek-V3.1
  • Qwen3-Max

Metrics

  • PMO aggregate score
  • Top-1/Top-10 Fitness (F)
  • AUC-Top10
  • Hypervolume
  • Uniqueness
  • Validity
  • API cost (USD)
  • Running time (hours)

Datasets

  • PMO benchmark
  • ZINC250K
  • ConStellaration
  • MOCPOP (MOTSP/MOCVRP)
  • SACS offshore jacket dataset
  • Circle packing reference (Erich Friedman)

Benchmarks

  • PMO
  • MOCPOP (MOTSP/MOCVRP)
  • ConStellaration (stellarator)
  • SACS (offshore jacket)
  • Circle packing
  • GCU kernel competition