Use a pre-trained LLM (GPT-3.5) as a zero-shot search operator and distill it into a white-box linear operator for MOEA/D

October 19, 2023 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

21

Authors

Fei Liu, Xi Lin, Zhenkun Wang, Shunyu Yao, Xialiang Tong, Mingxuan Yuan, Qingfu Zhang

Why It Matters For Business

You can prototype new evolutionary operators with natural-language prompts and then distill them into cheap, explainable operators. This reduces expert design time and removes per-call API cost once distillation is done.

Summary TLDR

The authors show you can prompt a large language model (GPT-3.5) to act as a black-box search operator inside a decomposition-based multiobjective evolutionary algorithm (MOEA/D). They collect the LLM input→output pairs, fit a weighted linear operator with randomness (LO) that approximates the LLM, and build MOEA/D-LO, a variant whose white-box operator removes repeated LLM calls. On the standard ZDT and UF suites and five real engineering RE instances, MOEA/D-LO is competitive with common MOEAs on the HV and IGD metrics. Code is on GitHub. Caveats: results are limited to benchmark suites, online LLM calls are expensive, and LO captures average behavior rather than per-case nuance.

Problem Statement

Designing good search operators for multiobjective evolutionary algorithms needs expert time and often fails to generalize. Training neural operators is slow and brittle. This paper asks: can a pre-trained large language model be used zero-shot as a search operator inside MOEA/D, and can we distill that behavior into an explicit, cheaper operator?

Main Contribution

A decomposition-based MOEA/D framework that uses a pre-trained LLM (GPT-3.5) as a zero-shot black-box search operator via prompt engineering.

A white-box weighted linear operator with added randomness (LO) learned from LLM input–output pairs, and MOEA/D-LO that replaces costly LLM calls.

Empirical evaluation on ZDT, UF and real engineering RE instances showing MOEA/D-LO is competitive on HV/IGD and robust across varied problems.
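The weighted linear operator with randomness (LO) named above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' operator: the function name `linear_operator`, the Gaussian noise model, the weight normalization, and the clipping to `[lower, upper]` are all hypothetical choices made here for clarity.

```python
import random
from typing import List, Optional, Sequence


def linear_operator(parents: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    sigma: float = 0.05,
                    lower: float = 0.0,
                    upper: float = 1.0,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Offspring = normalized weighted sum of parents + Gaussian noise, clipped.

    Assumption: noise model, normalization and clipping are illustrative only.
    """
    if len(parents) != len(weights):
        raise ValueError("need one weight per parent")
    rng = rng or random.Random()
    total = sum(weights)
    norm = [w / total for w in weights]  # make the weights sum to 1
    child = []
    for j in range(len(parents[0])):
        # Weighted combination of the parents in dimension j.
        value = sum(w * p[j] for w, p in zip(norm, parents))
        # The "randomness" part: a small perturbation (assumed Gaussian).
        value += rng.gauss(0.0, sigma)
        child.append(min(upper, max(lower, value)))
    return child


if __name__ == "__main__":
    parents = [[0.2, 0.8], [0.6, 0.4]]
    print(linear_operator(parents, weights=[3.0, 1.0], rng=random.Random(1)))
```

Because every step is an explicit arithmetic operation, the operator can be inspected and tuned directly, which is the sense in which a distilled operator is "white-box" compared with an LLM call.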

Key Findings

MOEA/D-LLM (GPT-3.5) produces competitive hypervolume (HV) on five real engineering RE instances.

Numbers: RE21 HV: 0.7936 vs MOEA/D 0.781 (Table I)

An explicit linear operator (LO) was learned from LLM behavior using 14,000 per-dimension samples.

Numbers: 14,000 sample–response pairs collected (Sec. V-B)

MOEA/D-LO (the distilled operator) matched or beat common MOEAs on several benchmarks by HV/IGD.

Numbers: MOEA/D-LO yields a superior average on 3 instances and favorable sum-rank test results (Tables II–III)

Results

Hypervolume (HV)

Value: MOEA/D-LLM on RE21: 0.7936 vs MOEA/D 0.781; the Pareto fronts (PFs) of the two methods overlap similarly (Fig. 2)

Baseline: MOEA/D (GA)

Aggregate performance across ZDT/UF

Value: MOEA/D-LO often comparable or better on HV/IGD across many instances; wins on 3 instances by average

Baseline: NSGA-II, MOEA/D, MOEA/D-DE

Training data for LO

Value: 14,000 per-dimension samples collected from LLM interactions

What To Try In 7 Days

Run the authors' demo with GPT-3.5 on one RE or ZDT instance using the GitHub repo.

Record prompt→offspring pairs until you have a few thousand per-dimension samples (parent values in one decision variable paired with the offspring value the LLM returned).

Fit a simple weighted linear model and replace LLM calls to compare HV/IGD and API cost.
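The fitting step in the list above can be tried with plain least squares. The sketch below is one possible approach, not the authors' fitting procedure: it assumes each recorded sample is a row of parent values in a single dimension plus the offspring value the LLM returned, and the function name `fit_linear_weights` and the small `ridge` term are hypothetical. It uses only the standard library so it runs without extra dependencies.

```python
from typing import List, Sequence


def fit_linear_weights(inputs: Sequence[Sequence[float]],
                       outputs: Sequence[float],
                       ridge: float = 1e-8) -> List[float]:
    """Least-squares fit of w such that outputs ~= inputs . w (normal equations)."""
    k = len(inputs[0])
    # Build A = X^T X + ridge * I and b = X^T y.
    a = [[sum(row[i] * row[j] for row in inputs) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(inputs, outputs)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        s = sum(a[i][j] * w[j] for j in range(i + 1, k))
        w[i] = (b[i] - s) / a[i][i]
    return w


if __name__ == "__main__":
    # Toy data (made up): offspring = 0.7 * best parent + 0.3 * second parent.
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    ys = [0.7, 0.3, 1.0, 1.7]
    print(fit_linear_weights(xs, ys))
```

Comparing HV/IGD and API cost with and without the fitted weights then shows how much of the LLM's behavior the linear model captures on your instance.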

Agent Features

Memory

  • clears LLM conversation history each call (no retained memory)

Planning

  • in-context learning (uses examples in prompt)
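A rough sketch of how parent solutions might be packed into a prompt as in-context examples is shown here. The prompt wording, the `<start>`/`<end>` markers, and the function name `build_operator_prompt` are assumptions made for illustration and are not taken from the paper; the actual API call is deliberately left out.

```python
from typing import List, Sequence


def build_operator_prompt(parents: Sequence[Sequence[float]],
                          scores: Sequence[float],
                          n_offspring: int = 1) -> str:
    """Format parents and their scalar scores as in-context examples.

    Assumption: the wording and markers are hypothetical, not the paper's prompt.
    """
    if len(parents) != len(scores):
        raise ValueError("parents and scores must have the same length")
    lines: List[str] = [
        "You are a search operator for a minimization problem.",
        "Every entry below is listed with its function value (lower is better).",
    ]
    for x, f in zip(parents, scores):
        point = ",".join(f"{v:.4f}" for v in x)
        lines.append(f"point: <start>{point}<end> value: {f:.4f}")
    # A strict output format keeps responses short and easy to parse.
    lines.append(
        f"Give {n_offspring} new candidate(s) with a lower value than all of the above. "
        "Answer only in the form <start>v1,v2,...<end> with no explanation."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_operator_prompt([[0.1, 0.2], [0.3, 0.4]], [1.0, 0.5]))
```

Sending a freshly built prompt on every call, with no prior conversation attached, is one simple way to get the "no retained memory" behavior listed under Memory.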

Tool Use

  • calls GPT-3.5 Turbo via API as a black-box operator

Frameworks

  • MOEA/D
  • MOEA/D-LO

Architectures

  • decomposition-based MOEA (MOEA/D)
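For readers new to decomposition, the sketch below shows the general idea: a multiobjective problem is split into scalar subproblems, one per weight vector. It uses the Tchebycheff aggregation, a common choice in MOEA/D; this summary does not state which aggregation the paper uses, so treat that choice, and the helper names, as assumptions.

```python
from typing import List, Sequence


def uniform_weight_vectors(n: int) -> List[List[float]]:
    """Evenly spaced weight vectors for a two-objective problem."""
    if n < 2:
        raise ValueError("need at least two weight vectors")
    return [[i / (n - 1), 1.0 - i / (n - 1)] for i in range(n)]


def tchebycheff(objectives: Sequence[float],
                weight: Sequence[float],
                ideal: Sequence[float]) -> float:
    """Scalar value of one subproblem: max_i w_i * |f_i - z_i| (lower is better)."""
    return max(w * abs(f - z) for f, w, z in zip(objectives, weight, ideal))


if __name__ == "__main__":
    ideal = [0.0, 0.0]
    candidate = [2.0, 1.0]  # objective values of one solution (toy numbers)
    for w in uniform_weight_vectors(3):
        print(w, tchebycheff(candidate, w, ideal))
```

Each subproblem's scalar value is the kind of single score that can be shown to a search operator next to each parent solution.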

Optimization Features

Token Efficiency

  • prompts include concise samples and strict output format to limit verbosity

System Optimization

  • clear LLM cache between calls to avoid context leakage

Training Optimization

  • distill LLM behavior offline to avoid repeated API inference

Inference Optimization

  • replace costly LLM calls with learned linear operator to cut latency and cost

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • Evaluation is limited to ZDT, UF and five RE instances; real-world constrained/high-dimensional problems are untested.
  • Using the LLM as a black box is expensive and slow; online interaction is resource intensive.
  • The linear operator models average LLM behavior and may miss case-specific patterns and richer, non-linear mappings.
  • The LLM can return unparseable or repetitive responses, which require verification and retries.

When Not To Use

  • When API latency or per-call cost is prohibitive and no distilled LO is available.
  • When the problem has complex constraints, categorical variables, or domain rules that need specialized operators.
  • When interpretability requires exact, case-level LLM reasoning beyond average behavior.

Failure Modes

  • Unrecognized or malformed textual outputs from LLM require retrying prompts (IV-B).
  • LO can be overly greedy in high-dimensional problems; authors apply per-dimension updates with 10% probability to mitigate this.
  • Learned weights may not transfer to problems with very different structure than the training interactions.
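Two of the failure modes above lend themselves to small guards, sketched here under stated assumptions. The `<start>`/`<end>` response format, the `ask` callable standing in for the LLM call, and the names `parse_point`, `query_with_retry`, and `masked_update` are hypothetical. `masked_update` is one reading of the "per-dimension updates with 10% probability" remark, not a confirmed reproduction of the authors' code.

```python
import random
import re
from typing import Callable, List, Optional, Sequence

_POINT = re.compile(r"<start>(.*?)<end>", re.DOTALL)


def parse_point(text: str, dim: int) -> Optional[List[float]]:
    """Return the first well-formed point in `text`, or None if malformed."""
    match = _POINT.search(text)
    if match is None:
        return None
    try:
        values = [float(tok) for tok in match.group(1).split(",")]
    except ValueError:
        return None
    return values if len(values) == dim else None


def query_with_retry(ask: Callable[[str], str], prompt: str, dim: int,
                     max_tries: int = 3) -> Optional[List[float]]:
    """Call `ask` until a parseable point comes back or tries run out."""
    for _ in range(max_tries):
        point = parse_point(ask(prompt), dim)
        if point is not None:
            return point
    return None  # caller can fall back to a conventional operator


def masked_update(parent: Sequence[float], candidate: Sequence[float],
                  rate: float = 0.1,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Take each dimension from `candidate` with probability `rate`, else keep parent."""
    rng = rng or random.Random()
    return [c if rng.random() < rate else p for p, c in zip(parent, candidate)]


if __name__ == "__main__":
    replies = iter(["I cannot answer that.", "<start>0.1, 0.2<end>"])
    print(query_with_retry(lambda _prompt: next(replies), "prompt text", dim=2))
    print(masked_update([0.0] * 10, [1.0] * 10, rng=random.Random(0)))
```

Returning `None` after the last try leaves the fallback decision to the caller, which keeps the retry logic independent of any particular evolutionary loop.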

Core Entities

Models

  • GPT-3.5 Turbo
  • MOEA/D-LO (linear operator distilled from LLM)

Metrics

  • Hypervolume (HV)
  • Inverted Generational Distance (IGD)
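Both metrics can be computed in a few lines for the two-objective minimization case. The sketch below is a minimal illustration, not the evaluation code behind the reported numbers: the reference point, the toy fronts, and the function names are assumptions, and no normalization of objective values is applied.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def hypervolume_2d(front: Sequence[Point], ref: Point) -> float:
    """Area dominated by `front` and bounded by `ref` (higher is better)."""
    # Keep only points that are strictly better than the reference point.
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in pts:  # ascending in the first objective
        if f2 < best_f2:  # dominated points add no area and are skipped
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def igd(reference_front: Sequence[Point], approx_front: Sequence[Point]) -> float:
    """Mean distance from each reference point to its nearest approximation point
    (lower is better)."""
    total = sum(min(math.dist(r, a) for a in approx_front) for r in reference_front)
    return total / len(reference_front)


if __name__ == "__main__":
    front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
    print(hypervolume_2d(front, ref=(4.0, 4.0)))
    print(igd(reference_front=front, approx_front=[(1.0, 3.0), (3.0, 1.0)]))
```

HV needs only a reference point, while IGD needs a reference front that approximates the true Pareto front, which is why benchmark suites with known fronts are convenient for reporting both.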

Datasets

  • ZDT (standard MOP suite)
  • UF (standard MOP suite)
  • RE21–RE25 (real engineering instances)

Benchmarks

  • ZDT
  • UF
  • RE21–RE25