OptiGuide: use LLMs to translate plain-English what‑if questions into solver code and human explanations without sending private data

July 8, 2023 · 7 min

Overview

Decision Snapshot: Needs Validation

The system is production‑ready for explainability tasks (it is deployed in Azure), but it needs careful prompt design, helper functions, and safeguards to avoid silent errors.

Citations: 32

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Beibin Li, Konstantina Mellou, Bo Zhang, Jeevan Pathuri, Ishai Menache

Why It Matters For Business

OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.

Summary TLDR

The authors present OptiGuide, a modular system that uses a large language model (LLM) to convert plain‑English supply‑chain questions into optimization code, runs a solver (Gurobi) on in‑house data, and returns human‑readable explanations and visualizations. Key design points: in‑context learning (no model fine‑tuning), a coder/safeguard/interpreter agent loop, and privacy by keeping data inside the solver. On a benchmark spanning five supply‑chain scenarios, GPT‑4 reaches ~93% in‑distribution accuracy and ~59% zero‑shot accuracy. An early deployment in Azure reports >90% in‑distribution accuracy.

Problem Statement

Supply‑chain planners need fast, understandable answers to what‑if and diagnostic questions about optimizer outputs. LLMs can speak plain English but cannot reliably solve large combinatorial optimization problems. The gap is to translate natural‑language questions into correct solver code and readable explanations, preserve data privacy, and detect LLM mistakes without costly model retraining.

Main Contribution

OptiGuide: an LLM‑centric pipeline (coder, safeguard, interpreter) that generates solver code, runs an optimizer, and returns explanations and visualizations.

A benchmark and evaluation methodology for supply‑chain explainability across five scenario types (facility location, network flow, workforce assignment, TSP, coffee example).
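The coder, safeguard, and interpreter roles can be illustrated with a minimal single-pass sketch. Everything named here is hypothetical: the three stub functions stand in for LLM calls, `toy_solver` stands in for Gurobi, and the cost table is made-up data. The point it shows is the division of labor: the private data is read only inside the solver step, while the coder sees just the question and the interpreter sees just the solver results.

```python
from typing import Dict, Tuple

# Made-up private data: only the solver step ever reads it.
SHIPPING_COST = {"supplier_a": 40, "supplier_b": 25, "supplier_c": 30}


def stub_coder(question: str) -> str:
    """Stand-in for the coder LLM: returns a code snippet for the question."""
    # Hypothetical fixed output; a real coder would be prompted with the question.
    return "banned.add('supplier_b')"


def stub_safeguard(code: str) -> Tuple[bool, str]:
    """Reject snippets that do not compile or that mention imports (crude check)."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError as err:
        return False, f"syntax error: {err}"
    if "import" in code:
        return False, "imports are not allowed in generated code"
    return True, ""


def toy_solver(code: str) -> Dict[str, object]:
    """Stand-in for the optimizer: apply the snippet, then pick the cheapest supplier."""
    banned: set = set()
    exec(code, {"__builtins__": {}}, {"banned": banned})
    allowed = {s: c for s, c in SHIPPING_COST.items() if s not in banned}
    best = min(allowed, key=allowed.get)
    return {"supplier": best, "cost": allowed[best]}


def stub_interpreter(baseline: Dict[str, object], result: Dict[str, object]) -> str:
    """Stand-in for the interpreter LLM: phrase the solver outputs in plain English."""
    delta = result["cost"] - baseline["cost"]
    return (f"Using {result['supplier']} instead of {baseline['supplier']} "
            f"changes cost by {delta:+d}.")


def answer_question(question: str) -> str:
    code = stub_coder(question)
    ok, reason = stub_safeguard(code)
    if not ok:
        return f"Could not answer: {reason}"
    baseline = toy_solver("pass")  # unmodified problem
    result = toy_solver(code)      # problem with the generated change applied
    return stub_interpreter(baseline, result)


print(answer_question("What if we cannot use supplier_b?"))
```

Comparing the modified run against a baseline run is one simple way to give the interpreter something concrete to explain; the substring check in the safeguard is deliberately naive and would need a stricter policy in practice.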

Key Findings

GPT‑4 achieves high accuracy answering quantitative supply‑chain questions when given examples in the prompt.

Numbers: ≈93% average accuracy (GPT‑4, in‑distribution)

Practical Use: Use GPT‑4 with a few in‑prompt examples to get reliable automated what‑if answers for common question types.

Evidence Ref: Section 4.3; abstract

Zero‑shot GPT‑4 still provides nontrivial performance.

Numbers: 59% accuracy (GPT‑4, zero examples)

Practical Use: You can get useful answers without curated examples, but expect many failures; add examples to improve reliability.

Evidence Ref: Table 1, Section 4.3

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 0.93 | - | - | Average across five scenarios | Reported average in abstract and Section 4.3 | Abstract; Section 4.3; Table 1
Accuracy | 0.59 | - | - | Table 1 (all scenarios) | Table 1 shows 0.59 zero‑shot | Table 1
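Accuracy figures like these are fractions of benchmark questions answered correctly. The sketch below shows one way to compute such a fraction per split. It assumes an answer counts as correct when its numeric value matches the ground truth within a small tolerance; that scoring rule, the split names, and the sample records are illustrative assumptions, not the authors' evaluation code or data.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, Optional[float], float]  # (split, predicted, expected)


def accuracy_by_split(records: Iterable[Record], rel_tol: float = 1e-6) -> Dict[str, float]:
    """Return the fraction of correct answers for each split."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for split, predicted, expected in records:
        totals[split] += 1
        # A missing prediction (e.g. generated code crashed) counts as wrong.
        if predicted is not None and math.isclose(predicted, expected, rel_tol=rel_tol):
            hits[split] += 1
    return {split: hits[split] / totals[split] for split in totals}


# Made-up records for illustration only.
records = [
    ("in_distribution", 120.0, 120.0),
    ("in_distribution", 75.5, 75.5),
    ("in_distribution", 42.0, 42.0),
    ("in_distribution", None, 8.0),
    ("out_of_distribution", 33.0, 33.0),
    ("out_of_distribution", 5.0, 9.0),
]
print(accuracy_by_split(records))
```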

What To Try In 7 Days

Prototype: wrap your solver (Gurobi/Python) with an LLM endpoint to translate 10 common queries into code and run results.

Create 20 question–ground‑truth pairs and test in‑distribution vs out‑of‑distribution accuracy.

Add a simple safeguard that validates generated code runs and re‑tries up to 3 times before alerting an engineer.
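The retry safeguard in the last step can be sketched as follows. This is a minimal illustration, assuming "up to 3 times" means three attempts in total and that the error message from a failed run is fed back to the code generator. `generate_code` and `alert` are hypothetical callables you would supply (an LLM call and a paging hook, for example).

```python
from typing import Any, Callable, Dict, Optional

MAX_ATTEMPTS = 3  # assumption: three attempts in total, then escalate


def run_with_retries(
    generate_code: Callable[[str, Optional[str]], str],
    question: str,
    namespace_factory: Callable[[], Dict[str, Any]],
    alert: Callable[[str, Optional[str]], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[Dict[str, Any]]:
    """Run generated code, retrying with error feedback; alert if all attempts fail."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate_code(question, feedback)
        namespace = namespace_factory()  # fresh state for every attempt
        try:
            exec(compile(code, "<generated>", "exec"), namespace)
            return namespace
        except Exception as err:  # includes SyntaxError raised by compile()
            feedback = f"{type(err).__name__}: {err}"
    alert(question, feedback)
    return None


# Usage with a fake generator that fails twice and then succeeds.
calls = []


def flaky(question: str, feedback: Optional[str]) -> str:
    calls.append(feedback)
    return "result = 1 / 0" if len(calls) < 3 else "result = 42"


alerts = []
ns = run_with_retries(flaky, "demo question", dict, lambda q, e: alerts.append((q, e)))
print(ns["result"], alerts)
```

This only confirms that the code runs. It does not confirm that the code answers the question, and `exec` on generated code should run in a sandboxed process in any real deployment.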

Agent Features

Memory
in-context learning (prompt examples as short-term memory)
Planning
Tool Planning
Tool Use
generates code that calls optimization solver (Gurobi)
logs solver output for interpreter analysis
Frameworks
OptiGuide coder/safeguard/interpreter agent pattern
Is Agentic

Yes

Architectures
LLM-to-code-to-solver pipeline
Collaboration
human-in-the-loop planners and engineers
agent retries and safeguard-driven debugging

Optimization Features

Token Efficiency
prompt size constrained; example selection matters
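One way to act on a constrained prompt is to rank candidate examples by similarity to the incoming question and add them greedily until a budget is reached. The sketch below is an illustration under stated assumptions, not the authors' selection method: similarity is plain word overlap (Jaccard), token cost is approximated by whitespace-separated words, and the example questions and helper names inside the code strings are made up.

```python
from typing import Dict, List


def _words(text: str) -> List[str]:
    return text.lower().split()


def select_examples(question: str, examples: List[Dict[str, str]], budget_tokens: int) -> List[Dict[str, str]]:
    """Pick the most similar examples that fit within an approximate token budget."""
    q = set(_words(question))

    def score(example: Dict[str, str]) -> float:
        e = set(_words(example["question"]))
        union = q | e
        return len(q & e) / len(union) if union else 0.0

    chosen: List[Dict[str, str]] = []
    used = 0
    for example in sorted(examples, key=score, reverse=True):
        cost = len(_words(example["question"])) + len(_words(example["code"]))
        if used + cost <= budget_tokens:  # skip examples that would overflow
            chosen.append(example)
            used += cost
    return chosen


# Hypothetical example pool; the code strings are placeholders, not a real API.
pool = [
    {"question": "why is supplier one not used", "code": "require_use('supplier1')"},
    {"question": "what if we close the plant in ohio", "code": "set_open('ohio', False)"},
    {"question": "what if demand at the cafe doubles", "code": "scale_demand('cafe', 2)"},
]
picked = select_examples("what if we close the plant in texas", pool, budget_tokens=20)
print([p["question"] for p in picked])
```

A production system would more likely count tokens with the model's tokenizer and could swap the overlap score for embedding similarity; both are left out here to keep the sketch dependency-free.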

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Users must ask precise, unambiguous questions; ambiguity causes wrong code.

Relies on well‑designed application components (database schema, helper functions) that need engineering effort.

When Not To Use

For ambiguous or loosely specified queries without follow‑up clarification.

When you lack engineers to write helpers and validate generated code.

Failure Modes

LLM generates code that executes but implements the wrong constraint (silent semantic error).

Ambiguous user phrasing leads to incorrect interpretation (e.g., 'earlier' undefined).
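Silent semantic errors cannot be caught by checking that the code runs, so one possible mitigation is to check the solver's solution against independently written invariants for the question type. The sketch below is a hypothetical illustration: the invariant names, the solution format (units shipped per supplier), and the numbers are assumptions, and writing such checks requires knowing what the question implies.

```python
from typing import Callable, Dict, List

Solution = Dict[str, float]


def violated_invariants(solution: Solution, invariants: Dict[str, Callable[[Solution], bool]]) -> List[str]:
    """Return the names of invariants that the solution fails to satisfy."""
    return [name for name, holds in invariants.items() if not holds(solution)]


# Question (hypothetical): "What if we stop using supplier_b?" with total demand 50.
invariants = {
    "supplier_b unused": lambda s: s["supplier_b"] == 0,
    "demand met": lambda s: sum(s.values()) == 50,
}

# A solution produced by code that ran fine but ignored the requested change.
suspicious = {"supplier_a": 0, "supplier_b": 20, "supplier_c": 30}
print(violated_invariants(suspicious, invariants))
```

A non-empty result is a signal to retry or escalate to an engineer rather than to show the explanation to the planner.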

Core Entities

Models

GPT-4, text-davinci-003, text-embedding-ada-002

Metrics

Accuracy

Datasets

facility_location_scenarios, multi_commodity_network_flow, workforce_assignment, traveling_salesman_scenarios, coffee_distribution_example, Azure IFS input data (proprietary)

Benchmarks

OptiGuide supply‑chain explainability benchmark (authors' multi‑scenario suite)