POLARIS: typed, policy-aware plan synthesis and guarded execution for auditable back-office automation

January 16, 2026 · 7 min

Overview

Decision Snapshot: Needs Validation

POLARIS is a practical blueprint demonstrating governance primitives for enterprise automation, validated on a small synthetic suite and SROIE; production adoption depends on a well-maintained policy database and on access to LLMs.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, Ramesh Radhakrishnan

Why It Matters For Business

POLARIS adds auditable, policy-gated automation to invoice-like workflows so companies can reduce manual checks while retaining legal and auditability guarantees.

Summary TLDR

POLARIS is a modular orchestration system that turns back-office automation into typed plan synthesis plus guarded execution. A planner generates several type-checked DAGs, a rubric-based ReasoningAgent picks a policy-compliant plan, and execution runs under validator-gated checks, a bounded repair loop, and compiled policy guardrails. On document tasks it produces auditable JSON decisions and traces. Reported results: SROIE F1 ≈ 0.81, synthetic extraction F1 ≈ 0.9565, policy violation F1 ≈ 0.8182, anomaly detection precision 1.0 (recall 0.90).

Problem Statement

Enterprise back-office automation needs auditable, policy-aligned, and predictable agent behavior. Existing multi-agent LLM stacks often rely on untyped I/O, best-of-N sampling, and open-ended retries, which undermine audit trails, policy gating, and operational guarantees. POLARIS recasts orchestration as typed planning plus governed execution to meet enterprise requirements.

Main Contribution

POLARIS framework: typed-DAG plan synthesis, rubric selection, validator-gated execution and policy guardrails

CoAPlanner: enforces type-safe, diversity-constrained generation of candidate workflows
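
To make "type-safe" concrete, here is a minimal sketch of how a candidate workflow could be type-checked before it is accepted. It is not the CoAPlanner implementation: the `Step`, `Edge`, and `check_plan` names, the slot names, and the type names (`Document`, `Invoice`, `PolicyVerdict`) are all hypothetical, and the check assumes types must match exactly by name.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Step:
    name: str
    inputs: dict[str, str]   # input slot -> expected type name (hypothetical)
    outputs: dict[str, str]  # output slot -> produced type name (hypothetical)


@dataclass(frozen=True)
class Edge:
    src: str        # producing step
    src_slot: str   # output slot on the producer
    dst: str        # consuming step
    dst_slot: str   # input slot on the consumer


def check_plan(steps: list[Step], edges: list[Edge]) -> list[str]:
    """Return a list of errors; an empty list means the plan type-checks."""
    by_name = {s.name: s for s in steps}
    deps: dict[str, set[str]] = {s.name: set() for s in steps}
    errors: list[str] = []
    for e in edges:
        src, dst = by_name.get(e.src), by_name.get(e.dst)
        if src is None or dst is None:
            errors.append(f"unknown step on edge {e.src}->{e.dst}")
            continue
        out_t = src.outputs.get(e.src_slot)
        in_t = dst.inputs.get(e.dst_slot)
        if out_t is None or in_t is None:
            errors.append(f"unknown slot on edge {e.src}.{e.src_slot}->{e.dst}.{e.dst_slot}")
        elif out_t != in_t:
            errors.append(f"type mismatch: {e.src}.{e.src_slot} is {out_t}, "
                          f"{e.dst}.{e.dst_slot} expects {in_t}")
        deps[e.dst].add(e.src)
    try:
        # static_order() raises CycleError if the graph is not a DAG.
        tuple(TopologicalSorter(deps).static_order())
    except CycleError:
        errors.append("plan contains a cycle")
    return errors


steps = [
    Step("parse", {"doc": "Document"}, {"invoice": "Invoice"}),
    Step("policy_check", {"invoice": "Invoice"}, {"verdict": "PolicyVerdict"}),
    Step("pay", {"verdict": "PolicyVerdict"}, {"receipt": "Receipt"}),
]
good_edges = [
    Edge("parse", "invoice", "policy_check", "invoice"),
    Edge("policy_check", "verdict", "pay", "verdict"),
]
bad_edges = [Edge("parse", "invoice", "pay", "verdict")]  # Invoice fed into a PolicyVerdict slot

print(check_plan(steps, good_edges))
print(check_plan(steps, bad_edges))
```

Rejecting a plan at this stage, before any agent runs, is what keeps malformed workflows out of the audit trail in this sketch.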

Key Findings

High extraction on synthetic invoices

Numbers: F1 = 0.9565 on synthetic suite (160 fields)

Practical Use: Typed plans + validator repair deliver near-decision-grade extraction under controlled scenarios; try the repair loop to boost coverage on noisy scans

Evidence Ref: Table 1 (TOTAL row)

Strong SROIE extraction baseline

Numbers: F1 = 0.8116 on SROIE (4-field extraction)

Practical Use: The framework matches solid 4-field extraction performance while adding audit traces and policy checks; useful when traceability matters

Evidence Ref: Table 5

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Synthetic extraction (overall) | precision=0.9453, recall=0.9680, f1=0.9565 | | | synthetic invoice suite (160 fields) | Table 1 totals | Table 1
SROIE 4-field extraction | precision=0.8189, recall=0.8045, f1=0.8116 | | | SROIE | Table 5 overall | Table 5
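
As a quick consistency check on the rows above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the reported pairs. The snippet below is only an arithmetic check; the `f1_score` helper and the `reported` dictionary are illustrative names, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the results table.
reported = {
    "synthetic": (0.9453, 0.9680, 0.9565),
    "SROIE": (0.8189, 0.8045, 0.8116),
}

for name, (p, r, f1) in reported.items():
    print(name, "recomputed:", round(f1_score(p, r), 4), "reported:", f1)
```

Both recomputed values agree with the reported F1 at four decimal places.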

What To Try In 7 Days

Define typed I/O contracts for one invoice workflow and instrument a JSON normalizer

Run a planner that emits 3–5 type-checked DAG proposals and log them for review

Implement a validator-gated repair loop limited to 2–3 iterations for parser outputs to cut false negatives
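
The third step can be prototyped in a few lines. The following is a minimal sketch of a validator-gated repair loop with a hard iteration cap, not the POLARIS implementation: `run_with_repair`, `toy_parse`, `toy_validate`, and the `missing:<field>` feedback format are hypothetical stand-ins for a real parser agent and validator.

```python
def run_with_repair(parse, validate, document, max_repairs=2):
    """Parse once, then re-parse with validator feedback at most max_repairs times.

    Returns the final result, a status, and a per-attempt trace for auditing.
    """
    trace = []
    feedback = []
    result = None
    for attempt in range(max_repairs + 1):
        result = parse(document, feedback)
        feedback = validate(result)
        trace.append({"attempt": attempt, "errors": list(feedback)})
        if not feedback:
            return {"status": "accepted", "result": result, "trace": trace}
    # Budget exhausted: stop retrying and hand off (for example, to human review).
    return {"status": "escalated", "result": result, "trace": trace}


def toy_parse(document, feedback):
    """Stand-in parser: only extracts 'total' after being told it is missing."""
    fields = {"vendor": "ACME"}
    if "missing:total" in feedback:
        fields["total"] = "42.00"
    return fields


def toy_validate(fields):
    """Stand-in validator: reports required fields that are absent."""
    return [f"missing:{k}" for k in ("vendor", "total") if k not in fields]


outcome = run_with_repair(toy_parse, toy_validate, "raw invoice text")
print(outcome["status"], len(outcome["trace"]))
```

Capping the loop keeps cost and latency predictable, and the trace records what each attempt got wrong, which is the part an auditor would want to see.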

Agent Features

Memory
few-shot exemplar bank (DSPy) for planning priors
Planning
typed plan synthesis (CoAPlanner); diversity-constrained generation; few-shot planning priors via DSPy
Tool Use
APIAccess agent for ERP/Bank integrations; PolicyRetrieval and RiskControl guardrails
Frameworks
DSPy; AutoGen ConversableAgent pattern
Is Agentic

Yes

Architectures
multi-agent LLM pipeline; typed DAG orchestration
Collaboration
plan-select-act orchestration across specialized agents
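
The "select" step of plan-select-act can be illustrated with a small rubric scorer. This is a sketch under stated assumptions: the source does not list the rubric criteria or weights used by the ReasoningAgent, so `coverage`, `cost`, `auditability`, the weights, and the `Candidate` and `select_plan` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    plan_id: str
    policy_compliant: bool
    scores: dict[str, float]  # rubric criterion -> score in [0, 1]


# Hypothetical rubric weights; not taken from the paper.
WEIGHTS = {"coverage": 0.5, "cost": 0.2, "auditability": 0.3}


def select_plan(candidates, weights=WEIGHTS):
    """Return the best-scoring policy-compliant candidate, or None if none qualifies."""
    eligible = [c for c in candidates if c.policy_compliant]
    if not eligible:
        return None

    def total(c):
        return sum(w * c.scores.get(k, 0.0) for k, w in weights.items())

    return max(eligible, key=total)


candidates = [
    Candidate("A", False, {"coverage": 1.0, "cost": 1.0, "auditability": 1.0}),
    Candidate("B", True, {"coverage": 0.8, "cost": 0.5, "auditability": 0.9}),
    Candidate("C", True, {"coverage": 0.9, "cost": 0.9, "auditability": 0.4}),
]
chosen = select_plan(candidates)
print(chosen.plan_id if chosen else "no compliant plan")
```

Treating policy compliance as a hard filter rather than one more weighted score means a non-compliant plan can never win on the other criteria, which matches the "policy-compliant plan" wording in the summary.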

Optimization Features

System Optimization
dependency-aware scheduler enabling parallel middle-stage checks; completion-based semantics to maximize safe parallelism
Inference Optimization
bounded repair loop (limits repeated parse calls); K=5 plan budget to bound planning cost
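
The dependency-aware scheduling listed under System Optimization can be sketched with the standard-library `graphlib` module: a step becomes ready only once every step it depends on has completed, so independent checks land in the same wave and could run in parallel. The step names and the `parallel_waves` helper are hypothetical, and the sketch only computes the waves rather than executing anything.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps):
    """Group steps into waves; each step's dependencies finish in earlier waves.

    deps maps a step name to the set of step names it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)                 # mark the wave complete to unlock successors
    return waves


# Hypothetical invoice workflow: three independent middle-stage checks.
deps = {
    "policy_check": {"parse"},
    "anomaly_check": {"parse"},
    "vendor_check": {"parse"},
    "decide": {"policy_check", "anomaly_check", "vendor_check"},
}
print(parallel_waves(deps))
```

In a real executor each wave would be dispatched to a worker pool, and `done` would be called per step as it completes rather than per wave.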

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluations use a small synthetic suite plus SROIE; cross-domain validation is limited

Design relies on proprietary models (GPT-4o, GPT-5) in examples

When Not To Use

Very low-latency real-time systems where LLM latency is unacceptable

Environments without a maintained policy store or vendor baselines

Failure Modes

Parser failures under extreme layout drift can still produce false positives even after repairs

Incorrect or missing policy records can cause wrong blocking or missed violations

Core Entities

Models

GPT-4o (CoAPlanner); GPT-5 reasoning model (ReasoningAgent)

Metrics

precision; recall; F1; policy detection (TPV/FPV/FNV/TNV); anomaly MAD z-score
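
For readers unfamiliar with the MAD z-score, here is a minimal sketch of the common "modified z-score" based on the median absolute deviation. The summary does not give the exact formula or threshold used in POLARIS, so the 0.6745 scale factor and the 3.5 cutoff are the conventional textbook choices and should be read as assumptions, as should the function names and sample amounts.

```python
from statistics import median


def mad_z_scores(values, scale=0.6745):
    """Modified z-scores: scale * (x - median) / MAD, robust to outliers."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: at least half the values equal the median.
        # This sketch returns zeros; a real system would need a fallback rule.
        return [0.0 for _ in values]
    return [scale * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the indices whose absolute modified z-score exceeds the threshold."""
    return [i for i, z in enumerate(mad_z_scores(values)) if abs(z) > threshold]


# Hypothetical invoice totals for one vendor; the last one is far from the rest.
amounts = [100, 102, 98, 101, 99, 500]
print(flag_anomalies(amounts))
```

Because the median and MAD barely move when a single extreme value is added, the outlier does not mask itself the way it can with a mean and standard deviation.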

Datasets

SROIE; controlled synthetic invoice suite

Benchmarks

SROIE; synthetic governance suite (introduced here)