Overview
The system is reported as production‑ready for explainability tasks (it is deployed in Azure), but it needs careful prompt design, helper functions, and safeguards to avoid silent errors.
Citations: 32
Evidence Strength: 0.80
Confidence: 0.85
Risk Signals: 12
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 1/4
Reproducibility
Status: No open assets linked
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 70%
Novelty: 50%
Why It Matters For Business
OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.
Summary TLDR
The authors present OptiGuide, a modular system that uses a large language model (LLM) to convert plain‑English supply‑chain questions into optimization code, runs a solver (Gurobi) on in‑house data, and returns human‑readable explanations and visualizations. Key design points: in‑context learning (no model fine‑tuning), a coder/safeguard/interpreter agent loop, and privacy preserved by keeping data in‑house with the solver. On a benchmark spanning five supply‑chain scenarios, GPT‑4 reaches ~93% in‑distribution accuracy and ~59% zero‑shot. An early deployment in Azure reports >90% in‑distribution accuracy.
Problem Statement
Supply‑chain planners need fast, understandable answers to what‑if and diagnostic questions about optimizer outputs. LLMs can speak plain English but cannot reliably solve large combinatorial optimizations. The gap is translating natural‑language questions into correct solver code and readable explanations while preserving data privacy and detecting LLM mistakes without costly model retraining.
Main Contribution
OptiGuide: an LLM‑centric pipeline (coder, safeguard, interpreter) that generates solver code, runs an optimizer, and returns explanations and visualizations.
A benchmark and evaluation methodology for supply‑chain explainability across five scenario types (facility location, network flow, workforce assignment, traveling salesman problem (TSP), coffee example).
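The coder, safeguard, and interpreter roles described above can be sketched as a single pass through four callables. This is a minimal illustration and not the authors' implementation: `run_pipeline`, `Answer`, and the stub lambdas are hypothetical names, the LLM and the optimizer are replaced by placeholders, and the retry behaviour of the full agent loop is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    code: str                 # solver code proposed by the coder
    result: Optional[dict]    # solver output, or None if the safeguard rejected the code
    explanation: str          # plain-English text for the planner


def run_pipeline(
    question: str,
    coder: Callable[[str], str],
    safeguard: Callable[[str], Optional[str]],
    run_solver: Callable[[str], dict],
    interpreter: Callable[[str, dict], str],
) -> Answer:
    """One pass: coder -> safeguard -> solver -> interpreter (hypothetical sketch)."""
    code = coder(question)
    problem = safeguard(code)          # None means the safeguard found nothing wrong
    if problem is not None:
        return Answer(code, None, f"Rejected by safeguard: {problem}")
    result = run_solver(code)          # the data stays with the solver
    return Answer(code, result, interpreter(question, result))


if __name__ == "__main__":
    # Stubs stand in for the LLM calls and the optimizer (assumptions, not real APIs).
    answer = run_pipeline(
        "What if supplier 1 cannot ship?",
        coder=lambda q: "# placeholder for generated solver code",
        safeguard=lambda code: None if code.strip() else "empty code",
        run_solver=lambda code: {"objective": 1234.0},
        interpreter=lambda q, r: f"The new optimal cost is {r['objective']}.",
    )
    print(answer.explanation)
```

Passing the roles in as callables keeps the sketch testable without an LLM endpoint or a solver license; in a real setup each callable would wrap the corresponding service.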
Key Findings
GPT‑4 reaches about 93% average accuracy on quantitative supply‑chain questions when given examples in the prompt.
Zero‑shot GPT‑4 still reaches about 59% accuracy, well below the result with in‑prompt examples.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy (GPT‑4, in‑distribution) | 0.93 | — | — | Average across five scenarios | Reported average in abstract and Section 4.3 | Abstract; Section 4.3; Table 1 |
| Accuracy (GPT‑4, zero‑shot) | 0.59 | — | — | All scenarios | Table 1 reports 0.59 zero‑shot | Table 1 |
What To Try In 7 Days
Prototype: wrap your solver (Gurobi/Python) with an LLM endpoint that translates 10 common queries into code, then run the generated code and inspect the results.
Create 20 question–ground‑truth pairs and test in‑distribution vs out‑of‑distribution accuracy.
Add a simple safeguard that checks that the generated code runs and retries up to 3 times before alerting an engineer.
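The last two steps above could start from something like the following sketch. It assumes "up to 3 times" means three attempts in total, that answers are single numbers, and that any exception during execution counts as a failed attempt; `answer_with_retries`, `accuracy`, `generate_code`, and `execute` are hypothetical names, not part of OptiGuide or Gurobi.

```python
from typing import Callable, Iterable, Optional, Tuple

MAX_ATTEMPTS = 3  # assumption: "up to 3 times" read as three attempts in total


def answer_with_retries(
    question: str,
    generate_code: Callable[[str, str], str],   # (question, error feedback) -> code
    execute: Callable[[str], float],            # runs the code, returns a numeric answer
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[float]:
    """Return the first answer whose code runs; None means 'alert an engineer'."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_code(question, feedback)
        try:
            return execute(code)
        except Exception as exc:  # any runtime failure counts as a failed attempt
            feedback = f"{type(exc).__name__}: {exc}"
    return None


def accuracy(
    pairs: Iterable[Tuple[str, float]],
    answer_fn: Callable[[str], Optional[float]],
    tol: float = 1e-6,
) -> float:
    """Fraction of (question, ground truth) pairs answered within tol."""
    pairs = list(pairs)
    correct = 0
    for question, truth in pairs:
        got = answer_fn(question)
        if got is not None and abs(got - truth) <= tol:
            correct += 1
    return correct / len(pairs) if pairs else 0.0
```

Running `accuracy` once on questions that resemble the prompt examples and once on questions that do not gives the in‑distribution versus out‑of‑distribution comparison. Note that this safeguard only catches code that fails to run; code that runs but answers the wrong question passes through it.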
Agent Features
Is Agentic: Yes
Risks & Boundaries
Limitations
Users must ask precise, unambiguous questions; ambiguous questions can lead to wrong code.
Relies on well‑designed application components (database schema, helper functions) that need engineering effort.
When Not To Use
For ambiguous or loosely specified queries without follow‑up clarification.
When you lack engineers to write helpers and validate generated code.
Failure Modes
LLM generates code that executes but implements the wrong constraint (silent semantic error).
Ambiguous user phrasing leads to incorrect interpretation (e.g., 'earlier' is left undefined).
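Some silent semantic errors can be flagged with a cheap sanity check on the solver output. The sketch below relies on one general fact: in a minimization model solved to optimality, adding a constraint cannot lower the optimal cost. It assumes the what‑if only adds constraints to a cost‑minimizing model; `violates_monotonicity` and the tolerance are hypothetical, and the check catches only this one class of mistake.

```python
def violates_monotonicity(
    baseline_cost: float,
    whatif_cost: float,
    rel_tol: float = 1e-6,
) -> bool:
    """True if a constraint-adding what-if reports a lower minimum cost than the baseline.

    Assumes a minimization model; rel_tol absorbs solver tolerances.
    """
    slack = rel_tol * max(1.0, abs(baseline_cost))
    return whatif_cost < baseline_cost - slack
```

A `True` result suggests the generated code relaxed or replaced a constraint instead of adding one, which is worth routing to an engineer. A `False` result does not prove the constraint is the intended one.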

