OptiGuide: use LLMs to translate plain-English what‑if questions into solver code and human explanations without sending private data

Overview

Decision Snapshot: Needs Validation

The system is production‑ready for explainability tasks (it is deployed in Azure), but it needs careful prompt design, helper functions, and safeguards to avoid silent errors.

Citations: 32

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 1/4

Reproducibility

Status: No open assets linked

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 50%

Authors

Beibin Li, Konstantina Mellou, Bo Zhang, Jeevan Pathuri, Ishai Menache

Links

Abstract / PDF

Why It Matters For Business

OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.

Who Should Care

Product Manager, Engineering Lead, Data Scientist, ML Engineer, CTO

Summary TLDR

The authors present OptiGuide, a modular system that uses a large language model (LLM) to convert plain‑English supply‑chain questions into optimization code, runs a solver (Gurobi) on in‑house data, and returns human‑readable explanations and visualizations. The key design points are in‑context learning (no model fine‑tuning), a coder/safeguard/interpreter agent loop, and privacy: the data stays in‑house with the solver and is not sent to the LLM. On a benchmark spanning five supply‑chain scenarios, GPT‑4 reaches ~93% average accuracy on in‑distribution questions when given in‑prompt examples, versus ~59% zero‑shot. An early deployment in Azure reports >90% in‑distribution accuracy.

Problem Statement

Supply‑chain planners need fast, understandable answers to what‑if and diagnostic questions about optimizer outputs. LLMs can speak plain English but cannot reliably solve large combinatorial optimizations. The gap: translate natural questions into correct solver code and readable explanations, preserve data privacy, and detect LLM mistakes without costly model retraining.

Main Contribution

OptiGuide: an LLM‑centric pipeline (coder, safeguard, interpreter) that generates solver code, runs an optimizer, and returns explanations and visualizations.

A benchmark and evaluation methodology for supply‑chain explainability across five scenario types (facility location, network flow, workforce assignment, TSP, coffee example).
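The coder/safeguard/interpreter flow described above can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the prompt wording, the SAFE/DANGER verdict format, and the names `answer_question`, `llm`, and `run_solver` are all assumptions. The point it shows is that the solver runs locally, so the LLM sees the question and the solver's result summary rather than the raw data.

```python
from typing import Callable

REJECTED = "Request rejected by safeguard."

def answer_question(
    question: str,
    llm: Callable[[str], str],          # hypothetical: prompt in, completion out
    run_solver: Callable[[str], dict],  # hypothetical: runs code locally, returns a result summary
) -> str:
    # Coder: the LLM writes solver code from the question alone (no raw data in the prompt).
    code = llm(f"CODER\nQuestion: {question}\nWrite solver code.")

    # Safeguard: a second call reviews the code before anything is executed.
    verdict = llm(f"SAFEGUARD\nAnswer SAFE or DANGER for this code:\n{code}")
    if verdict.strip().upper() != "SAFE":
        return REJECTED

    # The solver runs in-house; only its result summary goes back to the LLM.
    result = run_solver(code)

    # Interpreter: the LLM turns the solver result into a plain-English answer.
    return llm(f"INTERPRETER\nQuestion: {question}\nSolver result: {result}\nExplain.")
```

Passing `llm` and `run_solver` in as callables keeps the sketch independent of any particular model endpoint or solver, and makes each role easy to stub in tests.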

Key Findings

GPT‑4 achieves high accuracy answering quantitative supply‑chain questions when given examples in the prompt.

Numbers: ≈93% average accuracy (GPT‑4, in‑distribution)

Practical Use: Use GPT‑4 with a few in‑prompt examples to get reliable automated what‑if answers for common question types.

Evidence Ref: Section 4.3; abstract

Zero‑shot GPT‑4 still provides nontrivial performance.

Numbers: 59% accuracy (GPT‑4, zero examples)

Practical Use: You can get useful answers without curated examples, but expect many failures; add examples to improve reliability.

Evidence Ref: Table 1, Section 4.3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy (GPT‑4, in‑distribution)	0.93	—	—	Average across five scenarios	Reported average in abstract and Section 4.3	Abstract; Section 4.3; Table 1
Accuracy (GPT‑4, zero‑shot)	0.59	—	—	Table 1 (all scenarios)	Table 1 shows 0.59 zero‑shot	Table 1
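To reproduce accuracy figures like these on your own question set, a small harness is enough. The sketch below assumes that an answer counts as correct when its numeric result matches the ground truth within a relative tolerance, and that a crash counts as a wrong answer; the source does not spell out the scoring rule, so both are assumptions, as are the names `accuracy` and `answer_fn`.

```python
import math
from typing import Callable, Sequence, Tuple

def accuracy(
    cases: Sequence[Tuple[str, float]],   # (question, ground-truth value) pairs
    answer_fn: Callable[[str], float],    # hypothetical: question in, numeric answer out
    rel_tol: float = 1e-6,
) -> float:
    """Fraction of questions whose answer matches ground truth within rel_tol."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = 0
    for question, truth in cases:
        try:
            predicted = answer_fn(question)
        except Exception:
            continue  # a crash is scored as a wrong answer
        if math.isclose(predicted, truth, rel_tol=rel_tol):
            correct += 1
    return correct / len(cases)
```

Running it once on questions that resemble the prompt examples and once on questions that do not gives the in‑distribution versus out‑of‑distribution comparison.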

What To Try In 7 Days

Prototype: wrap your solver (Gurobi/Python) with an LLM endpoint to translate 10 common queries into code and run results.

Create 20 question–ground‑truth pairs and test in‑distribution vs out‑of‑distribution accuracy.

Add a simple safeguard that validates generated code runs and re‑tries up to 3 times before alerting an engineer.
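The retry safeguard from the last step could look like the sketch below. The names `run_with_retries`, `generate_code`, `execute`, and `alert` are hypothetical, and feeding the previous error back into the next generation attempt is an assumption about how a retry would be made useful. Note that this only catches code that fails to run; code that runs but encodes the wrong constraint passes straight through.

```python
from typing import Any, Callable, Optional

def run_with_retries(
    generate_code: Callable[[Optional[str]], str],  # hypothetical: error feedback in, code out
    execute: Callable[[str], Any],                  # hypothetical: runs the code against the solver
    alert: Callable[[str], None],                   # hypothetical: notifies an engineer
    max_attempts: int = 3,
) -> Any:
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Ask for code, passing along the previous error (None on the first try).
        code = generate_code(feedback)
        try:
            return execute(code)
        except Exception as exc:
            feedback = f"Attempt {attempt} failed: {exc!r}"
    # Every attempt failed: escalate with the last error and return nothing.
    alert(feedback or "no attempts were made")
    return None
```

In a real deployment `execute` should run the generated code in a sandboxed process rather than in the calling interpreter.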

Agent Features

Memory

in-context learning (prompt examples as short-term memory)

Planning

Tool Planning

Tool Use

generates code that calls optimization solver (Gurobi)

logs solver output for interpreter analysis

Frameworks

OptiGuide coder/safeguard/interpreter agent pattern

Is Agentic

Yes

Architectures

LLM-to-code-to-solver pipeline

Collaboration

human-in-the-loop planners and engineers

agent retries and safeguard-driven debugging

Optimization Features

Token Efficiency

prompt size constrained; example selection matters
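One way to act on this constraint is to rank stored examples by similarity to the incoming question and add them greedily until a size budget is reached. The sketch below is an illustration under stated assumptions, not the paper's method: it uses word-overlap (Jaccard) similarity as a stand-in for an embedding model and counts whitespace-separated words as a rough proxy for tokens. The names `jaccard` and `select_examples` and the `question`/`code` keys are hypothetical.

```python
from typing import Dict, List

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a toy stand-in for embedding similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def select_examples(
    question: str,
    examples: List[Dict[str, str]],  # each has "question" and "code" keys
    budget_words: int,               # rough proxy for the prompt's token budget
) -> List[Dict[str, str]]:
    # Most similar examples first.
    ranked = sorted(
        examples,
        key=lambda ex: jaccard(question, ex["question"]),
        reverse=True,
    )
    chosen: List[Dict[str, str]] = []
    used = 0
    for ex in ranked:
        cost = len(ex["question"].split()) + len(ex["code"].split())
        # Greedy fill: skip anything that would exceed the budget.
        if used + cost <= budget_words:
            chosen.append(ex)
            used += cost
    return chosen
```

Because the fill is greedy, a short but weakly related example can still be included after a more relevant one is skipped for size; adding a minimum-similarity cutoff would prevent that.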

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Users must ask precise, unambiguous questions; ambiguity causes wrong code.

Relies on well‑designed application components (database schema, helper functions) that need engineering effort.

When Not To Use

For ambiguous or loosely specified queries without follow‑up clarification.

When you lack engineers to write helpers and validate generated code.

Failure Modes

LLM generates code that executes but implements the wrong constraint (silent semantic error).

Ambiguous user phrasing leads to incorrect interpretation (e.g., 'earlier' undefined).

Core Entities

Models

GPT-4

text-davinci-003

text-embedding-ada-002

Metrics

Accuracy

Datasets

facility_location_scenarios

multi_commodity_network_flow

workforce_assignment

traveling_salesman_scenarios

coffee_distribution_example

Azure IFS input data (proprietary)

Benchmarks

OptiGuide supply‑chain explainability benchmark (authors' multi‑scenario suite)

