POLARIS: typed, policy-aware plan synthesis and guarded execution for auditable back-office automation

Overview

Decision Snapshot: Needs Validation

POLARIS is a practical blueprint that demonstrates governance primitives for enterprise automation. It is validated on a small synthetic suite and on SROIE; production adoption depends on a high-quality policy database and on access to LLMs.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/4

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, Ramesh Radhakrishnan

Why It Matters For Business

POLARIS adds auditable, policy-gated automation to invoice-like workflows so companies can reduce manual checks while retaining legal and auditability guarantees.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead

Summary TLDR

POLARIS is a modular orchestration system that turns back-office automation into typed plan synthesis plus guarded execution. A planner generates several type-checked DAGs, a rubric-based ReasoningAgent picks a policy-compliant plan, and execution runs under validator-gated checks, a bounded repair loop, and compiled policy guardrails. On document tasks it produces auditable JSON decisions and traces. Reported results: SROIE F1 ≈ 0.81, synthetic extraction F1 ≈ 0.9565, policy violation F1 ≈ 0.8182, anomaly detection precision 1.0 (recall 0.90).

Problem Statement

Enterprise back-office automation needs auditable, policy-aligned, and predictable agent behavior. Existing multi-agent LLM stacks often use untyped I/O, best-of-N sampling, and open-ended retries, which break audit trails, policy gating, and operational guarantees. POLARIS re-casts orchestration as typed planning plus governed execution to meet enterprise requirements.

Main Contribution

POLARIS framework: typed-DAG plan synthesis, rubric selection, validator-gated execution and policy guardrails

CoAPlanner: enforces type-safe, diversity-constrained generation of candidate workflows
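
As a rough illustration of what "type-safe" could mean for a candidate workflow, the sketch below checks that each edge in a small plan connects a step's output type to the next step's input type. The `Step` schema, the step names, and the type names are hypothetical; the source does not describe CoAPlanner's actual plan format. The check covers edge types only and leaves out cycle detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One node in a candidate plan (hypothetical schema, not CoAPlanner's)."""
    name: str
    input_type: str
    output_type: str


def type_check(steps, edges):
    """Return a list of readable errors; an empty list means the plan type-checks."""
    by_name = {s.name: s for s in steps}
    errors = []
    for src, dst in edges:
        if src not in by_name or dst not in by_name:
            errors.append(f"unknown step in edge {src}->{dst}")
            continue
        produced = by_name[src].output_type
        expected = by_name[dst].input_type
        if produced != expected:
            errors.append(f"{src}->{dst}: produces {produced}, expects {expected}")
    return errors


# Illustrative steps and types (assumptions, not taken from the source).
steps = [
    Step("parse", "RawDocument", "InvoiceFields"),
    Step("policy_check", "InvoiceFields", "PolicyVerdict"),
    Step("decide", "PolicyVerdict", "Decision"),
]
good_edges = [("parse", "policy_check"), ("policy_check", "decide")]
bad_edges = [("parse", "decide")]  # skips the policy check, so the types do not line up

print(type_check(steps, good_edges))
print(type_check(steps, bad_edges))
```

A plan that fails this kind of check can be rejected before any step runs, which is one plausible reason to type-check candidates up front.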

Key Findings

High extraction on synthetic invoices

Numbers: F1 = 0.9565 on synthetic suite (160 fields)

Practical Use: Typed plans plus validator repair deliver near-decision-grade extraction under controlled scenarios; try the repair loop to boost coverage on noisy scans

Evidence Ref: Table 1 (TOTAL row)

Strong SROIE extraction baseline

Numbers: F1 = 0.8116 on SROIE (4-field extraction)

Practical Use: The framework matches solid 4-field extraction performance while adding audit traces and policy checks; useful when traceability matters

Evidence Ref: Table 5

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Synthetic extraction (overall)	precision=0.9453, recall=0.9680, f1=0.9565	—	—	synthetic invoice suite (160 fields)	Table 1 totals	Table 1
SROIE 4-field extraction	precision=0.8189, recall=0.8045, f1=0.8116	—	—	SROIE	Table 5 overall	Table 5
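
The F1 values in the table can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the helper name `f1_score` is only illustrative, and the numbers are copied from the two rows above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the Results table.
reported = {
    "synthetic": (0.9453, 0.9680, 0.9565),
    "sroie": (0.8189, 0.8045, 0.8116),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{name}: computed F1 = {f1:.4f}, reported F1 = {f1_reported:.4f}")
```

Both computed values round to the reported F1 figures at four decimal places.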

What To Try In 7 Days

Define typed I/O contracts for one invoice workflow and instrument a JSON normalizer

Run a planner that emits 3–5 type-checked DAG proposals and log them for review

Implement a validator-gated repair loop limited to 2–3 iterations for parser outputs to cut false negatives
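
A validator-gated repair loop with a hard iteration cap might look like the sketch below. Everything in it is an assumption for illustration: the required field names, the `parse(document, feedback)` callable signature, and the toy parser are hypothetical and are not taken from POLARIS.

```python
MAX_REPAIRS = 2  # the suggestion above is a cap of 2-3 iterations


def validate(fields):
    """Return a list of problems; an empty list means the output passes."""
    problems = []
    for key in ("vendor", "date", "total"):  # hypothetical required fields
        if not fields.get(key):
            problems.append(f"missing {key}")
    return problems


def run_with_repair(parse, document, max_repairs=MAX_REPAIRS):
    """Parse once, then re-parse with validator feedback at most max_repairs times."""
    trace = []
    fields = parse(document, feedback=[])
    for attempt in range(max_repairs + 1):
        problems = validate(fields)
        trace.append({"attempt": attempt, "problems": problems})
        if not problems:
            return {"status": "ok", "fields": fields, "trace": trace}
        if attempt < max_repairs:
            fields = parse(document, feedback=problems)
    # Budget exhausted: hand off to a human instead of retrying forever.
    return {"status": "needs_review", "fields": fields, "trace": trace}


def toy_parse(document, feedback):
    """Stand-in parser: only finds the total after being told it is missing."""
    fields = {"vendor": "ACME", "date": "2024-01-31", "total": None}
    if "missing total" in feedback:
        fields["total"] = "118.00"
    return fields


result = run_with_repair(toy_parse, "scanned invoice text")
print(result["status"], len(result["trace"]))
```

The returned trace records every validation attempt, so the same structure can feed an audit log; the cap keeps the number of parser calls bounded at `max_repairs + 1`.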

Agent Features

Memory

few-shot exemplar bank (DSPy) for planning priors

Planning

typed plan synthesis (CoAPlanner)

diversity-constrained generation

few-shot planning priors via DSPy

Tool Use

APIAccess agent for ERP/Bank integrations

PolicyRetrieval and RiskControl guardrails

Frameworks

DSPy

AutoGen ConversableAgent pattern

Is Agentic

Yes

Architectures

multi-agent LLM pipeline

typed DAG orchestration

Collaboration

plan-select-act orchestration across specialized agents

Optimization Features

System Optimization

dependency-aware scheduler enabling parallel middle-stage checks

completion-based semantics to maximize safe parallelism
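
One way to picture dependency-aware scheduling with completion-based semantics is to release a step only once all of its dependencies are marked done, so independent middle-stage checks fall into the same batch. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the workflow and step names are hypothetical and this is not the POLARIS scheduler itself.

```python
from graphlib import TopologicalSorter

# Map each step to the steps it depends on (hypothetical workflow).
dependencies = {
    "parse": set(),
    "policy_check": {"parse"},
    "anomaly_check": {"parse"},
    "vendor_check": {"parse"},
    "decide": {"policy_check", "anomaly_check", "vendor_check"},
}


def parallel_batches(deps):
    """Group steps into batches whose members have all dependencies completed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)  # marking completion is what unlocks dependents
    return batches


for batch in parallel_batches(dependencies):
    print(batch)
```

Here the three checks land in one batch between `parse` and `decide`, so they could run concurrently; in a real executor each step would be marked done as it finishes rather than batch by batch.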

Inference Optimization

bounded repair loop (limits repeated parse calls)

K=5 plan budget to bound planning cost

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluations use a small synthetic suite plus SROIE; cross-domain validation is limited

Design relies on proprietary models (GPT-4o, GPT-5) in examples

When Not To Use

Very low-latency real-time systems where LLM latency is unacceptable

Environments without a maintained policy store or vendor baselines

Failure Modes

Parser failures under extreme layout drift can still produce false positives even after repairs

Incorrect or missing policy records can cause wrong blocking or missed violations

Core Entities

Models

GPT-4o (CoAPlanner)

GPT-5 reasoning model (ReasoningAgent)

Metrics

precision

recall

F1

policy detection (TPV/FPV/FNV/TNV)

anomaly MAD z-score
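
For readers unfamiliar with the MAD z-score, the sketch below computes the widely used modified z-score, 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation. The source does not state which variant, scaling constant, or threshold POLARIS uses, so the constant, the 3.5 cutoff, and the sample amounts are all assumptions.

```python
from statistics import median


def mad_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Simplification: with no robust spread to divide by, report no anomalies.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


THRESHOLD = 3.5  # a commonly used cutoff, not a value taken from the source

# Hypothetical invoice amounts for one vendor.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 500.0]
scores = mad_z_scores(history)
flagged = [v for v, z in zip(history, scores) if abs(z) > THRESHOLD]
print(flagged)
```

Because the median and MAD are barely moved by a single extreme value, the 500.0 amount stands out clearly, whereas a mean-and-standard-deviation z-score would be inflated by the outlier it is trying to detect.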

Datasets

SROIE

controlled synthetic invoice suite

Benchmarks

SROIE

synthetic governance suite (introduced here)
