Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
1
Why It Matters For Business
LLM4AD reduces engineering friction for using LLMs to generate and test algorithms by packaging search loops, safe execution, and task templates into a single Python toolkit and GUI.
Summary TLDR
LLM4AD is an open Python framework that plugs LLMs into iterative search pipelines to generate, evaluate, and refine algorithmic code. It bundles search methods (sampling, evolutionary search, and neighborhood search), an LLM interface (local or remote), and a secure evaluation sandbox. The repo includes 20+ ready tasks (160+ planned), a GUI, profilers (TensorBoard, wandb), and benchmarks covering 9 tasks and 8 LLMs with 3 independent runs each. In the reported experiments, pairing LLMs with search beats naive sampling on most tasks, but no single LLM dominates, and results depend on prompt templates and evaluation settings.
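The generate, evaluate, refine loop described above can be illustrated with a minimal sketch. This is not LLM4AD's API: `stub_llm`, `evaluate`, `one_plus_one_search`, and the toy target function are all hypothetical names chosen for illustration, and the "LLM" is a random picker over a fixed pool of candidate programs. The loop keeps a single incumbent and accepts a child only if it scores at least as well, which is the general shape of a (1+1)-style search.

```python
import random

# Hypothetical pool of programs a model might return; a real LLM would
# generate new source text from a prompt that includes the parent.
CANDIDATES = [
    "def f(x):\n    return x",
    "def f(x):\n    return 2 * x",
    "def f(x):\n    return x * x",
    "def f(x):\n    return x * x + 1",
]


def stub_llm(parent_source, rng):
    """Stand-in for an LLM call (assumption): ignores the parent, picks at random."""
    return rng.choice(CANDIDATES)


def evaluate(source):
    """Score a candidate on a toy task: match x*x + 1 on x = 0..4.

    Higher is better; 0.0 is a perfect score. exec() here is NOT sandboxed
    and is only acceptable because the candidate pool is fixed and trusted.
    """
    namespace = {}
    exec(source, namespace)
    f = namespace["f"]
    error = sum(abs(f(x) - (x * x + 1)) for x in range(5))
    return -float(error)


def one_plus_one_search(seed_source, budget, seed=0):
    """Keep one incumbent; replace it when a sampled child is at least as good."""
    rng = random.Random(seed)
    best, best_score = seed_source, evaluate(seed_source)
    history = [best_score]  # best-so-far score after each evaluation
    for _ in range(budget):
        child = stub_llm(best, rng)
        score = evaluate(child)
        if score >= best_score:
            best, best_score = child, score
        history.append(best_score)
    return best, best_score, history


best, best_score, history = one_plus_one_search(CANDIDATES[0], budget=20)
print(best_score, len(history))
```

In a real setup the stub would be replaced by a model call and `evaluate` by a sandboxed task evaluator; only the accept-if-not-worse structure is the point here.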
Problem Statement
There is no unified, easy-to-use platform to run, compare, and extend LLM-assisted algorithm design. Prior work uses heterogeneous code, different prompts, and ad-hoc tasks, which makes fair comparison and rapid development hard. LLM4AD targets this gap by providing a modular pipeline, task suite, secure sandbox, and evaluation tools for LLM-guided algorithm design.
Main Contribution
An open Python platform (LLM4AD) that integrates LLM samplers, iterative search methods, and a secure evaluation sandbox.
A task suite of 20+ ready algorithm-design tasks (160+ planned) across optimization, ML, and scientific discovery.
A unified benchmarking setup and examples that evaluate 8 LLMs on 9 tasks with consistent hyperparameters and profilers.
Developer-focused APIs, documentation, Jupyter demos, and a GUI for non-coders to run experiments.
Key Findings
Pairing LLMs with search methods (EoH, FunSearch, (1+1)-EPS) outperforms random sampling on most tasks.
Platform experiments used 8 LLMs; model choice causes noticeable performance variance on several tasks but no single model dominates.
LLM coding capability (HumanEval) is not a reliable predictor of algorithm-design performance.
The platform enforces safe automated evaluation with timeouts and sandboxing to avoid runaway or harmful code.
LLM4AD is extensible: users can add methods, tasks, and custom samplers via base interfaces.
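The timeout-guarded evaluation mentioned in the findings can be sketched as follows. This is an assumption-laden illustration, not the platform's sandbox: `run_with_timeout` is a hypothetical helper that runs generated source in a fresh Python interpreter via the standard `subprocess` module and gives up after a wall-clock limit. Process isolation with a timeout stops runaway loops, but on its own it is not a full security boundary.

```python
import subprocess
import sys


def run_with_timeout(source, call, timeout_s=2.0):
    """Run `source` in a child interpreter and return the float printed by `call`.

    Returns None if the child times out, exits with an error, or prints
    something that is not a number. (Hypothetical helper for illustration.)
    """
    script = source + "\nprint(" + call + ")\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    try:
        return float(proc.stdout.strip().splitlines()[-1])
    except (ValueError, IndexError):
        return None


good = "def f(x):\n    return x + 1"
endless = "def f(x):\n    while True:\n        pass"
broken = "def f(x):\n    return 1 / 0"

print(run_with_timeout(good, "f(2)", timeout_s=10.0))
print(run_with_timeout(endless, "f(2)", timeout_s=1.0))
print(run_with_timeout(broken, "f(2)", timeout_s=10.0))
```

Treating every failure mode as "no score" keeps the search loop simple: an invalid candidate is just discarded rather than crashing the run.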
Results
Number of LLMs evaluated: 8
Number of tasks in reported benchmark: 9
Independent runs per experiment: 3
Max function evaluations (#FE)
Per-algorithm maximum evaluation time
Who Should Care
What To Try In 7 Days
Run the included 9-task benchmark with one open LLM to see baseline performance.
Wrap an internal or open LLM with the provided sampler and test EoH vs sampling.
Add a small custom task via the Evaluation interface and run a secure evaluation.
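To give a feel for what a small custom task might look like, here is a self-contained sketch of an online bin packing evaluator. The class names `BaseTask` and `OnlineBinPackingTask` and the `evaluate` signature are hypothetical stand-ins, not LLM4AD's actual Evaluation interface; consult the project documentation for the real base class. The candidate heuristic scores each feasible bin, and the task returns the negated bin count so that higher is better.

```python
from abc import ABC, abstractmethod


class BaseTask(ABC):
    """Hypothetical stand-in for a task-evaluation base class."""

    @abstractmethod
    def evaluate(self, heuristic):
        """Return a score for `heuristic` (higher is better)."""


class OnlineBinPackingTask(BaseTask):
    def __init__(self, items, capacity):
        self.items = items
        self.capacity = capacity

    def evaluate(self, heuristic):
        remaining = []  # remaining capacity of each open bin
        for item in self.items:
            feasible = [i for i, r in enumerate(remaining) if r >= item]
            if feasible:
                # Place the item in the feasible bin the heuristic ranks highest.
                chosen = max(feasible, key=lambda i: heuristic(item, remaining[i]))
                remaining[chosen] -= item
            else:
                # No open bin fits: open a new one.
                remaining.append(self.capacity - item)
        return -float(len(remaining))


def best_fit(item, remaining):
    """Prefer the bin that leaves the least space after placing the item."""
    return -(remaining - item)


def worst_fit(item, remaining):
    """Prefer the bin that leaves the most space after placing the item."""
    return remaining - item


task = OnlineBinPackingTask(items=[6, 5, 4, 5], capacity=10)
print(task.evaluate(best_fit))   # best fit packs this instance into 2 bins
print(task.evaluate(worst_fit))  # worst fit needs 3 bins
```

In an LLM-driven run, `best_fit` and `worst_fit` would be replaced by functions the model generates, and the evaluator would be called inside a sandbox rather than directly.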
Agent Features
Tool Use
- Remote LLM APIs (OpenAI-style)
- Local inference via transformers/vLLM
Optimization Features
Inference Optimization
- Parallel sampling for LLM queries
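Parallel sampling can be sketched with the standard library alone. The example below assumes LLM queries are I/O-bound, so a thread pool is a reasonable fit; `fake_llm_query` and `sample_in_parallel` are hypothetical names, and the sleep stands in for network latency to a remote model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_query(prompt):
    """Stand-in for a remote LLM call (assumption): sleeps, then echoes."""
    time.sleep(0.05)
    return "# candidate for: " + prompt


def sample_in_parallel(query, prompts, max_workers=4):
    """Issue all prompts concurrently; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query, prompts))


prompts = ["prompt %d" % i for i in range(8)]
responses = sample_in_parallel(fake_llm_query, prompts)
print(len(responses))
```

`Executor.map` preserves input order, which makes it easy to pair each response with the prompt that produced it.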
Reproducibility
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- The reported benchmark covers only 9 tasks; the platform lists 20+ tasks available and 160+ planned, so most tasks have no reported results.
- GUI currently supports only one method and one LLM configuration per run.
- Benchmark results rely on automated objective scores; no human evaluation of algorithm usefulness is reported.
- LLM performance and results are sensitive to prompt templates and search hyperparameters.
When Not To Use
- If you need provably-correct or formally-verified algorithms.
- When you need ultra-low-latency, on-device operation with no LLM calls.
- If you must compare methods across non-Python runtimes not supported by the platform.
Failure Modes
- The LLM returns invalid or non-executable code, leading to failed evaluations or timeouts.
- Search may converge to poor heuristics if diversity control is weak (greedy methods fail on some tasks).
- Benchmarks may favor certain prompt templates or LLM behaviors, creating evaluation bias.
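One inexpensive guard against the first failure mode is a static pre-check before any code is executed. The sketch below uses Python's standard `ast` module to confirm that a response parses and defines a top-level function with the expected name; `defines_function` is a hypothetical helper, and a check like this catches syntax errors only, not infinite loops or runtime exceptions.

```python
import ast


def defines_function(source, name):
    """Return True if `source` parses and defines a top-level function `name`."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes.
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in tree.body
    )


valid = "def priority(item, bins):\n    return -bins"
malformed = "def priority(item, bins) return -bins"
wrong_name = "def score(item, bins):\n    return -bins"

print(defines_function(valid, "priority"))
print(defines_function(malformed, "priority"))
print(defines_function(wrong_name, "priority"))
```

Rejecting a candidate at this stage avoids spending an evaluation slot, and a timeout budget, on code that could never run.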
Core Entities
Models
- Llama-3.1-8B
- Yi-34b-Chat
- GLM-3-Turbo
- Claude-3-Haiku
- Doubao-pro-4k
- GPT-3.5-Turbo
- GPT-4o-Mini
- Qwen-Turbo
Metrics
- Fitness / objective score
- HumanEval
- MMLU
- Convergence over function evaluations
- Run-to-run standard deviation
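Two of the metrics above are simple to compute from raw scores. The sketch below assumes a maximization objective and uses the sample standard deviation across runs; both are assumptions, since the summary does not state the exact conventions. `best_so_far` and `summarize_runs` are hypothetical helper names.

```python
import statistics
from itertools import accumulate


def best_so_far(scores):
    """Convergence curve: best score seen after each function evaluation."""
    return list(accumulate(scores, max))


def summarize_runs(final_scores):
    """Mean and sample standard deviation of final scores across runs."""
    return statistics.mean(final_scores), statistics.stdev(final_scores)


# Illustrative numbers only, not results from the paper.
curve = best_so_far([1, 3, 2, 5, 4])
mean, std = summarize_runs([1.0, 2.0, 3.0])
print(curve)      # [1, 3, 3, 5, 5]
print(mean, std)  # 2.0 1.0
```

For a minimization task, `max` would be swapped for `min` in `best_so_far`.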
Datasets
- CVRP
- OVRP
- OBP
- TSP
- VRPTW
- SET
- FSSP
- EA
- MEA
- MCP
- MKP
- Surrogate-based optimization
- ACRO
- CAR
- ML (Moon Lander)
- CARP
- CARC
- PEN
- BACT
- OSC
- MSB
- ODE
- SRSD-Feynman sets
Benchmarks
- LLM4AD task suite (9-task benchmark reported; 20+ tasks available, 160+ planned)
- HumanEval (for model capability proxy)
- MMLU (for model capability proxy)

