POLARIS: typed, policy-aware plan synthesis and guarded execution for auditable back-office automation

January 16, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

0

Authors

Zahra Moslemi, Keerthi Koneru, Yen-Ting Lee, Sheethal Kumar, Ramesh Radhakrishnan

Why It Matters For Business

POLARIS adds auditable, policy-gated automation to invoice-like workflows, so companies can reduce manual checks while retaining legal-compliance and audit-trail guarantees.

Summary TLDR

POLARIS is a modular orchestration system that turns back-office automation into typed plan synthesis plus guarded execution. A planner generates several type-checked DAGs, a rubric-based ReasoningAgent picks a policy-compliant plan, and execution runs under validator-gated checks, a bounded repair loop, and compiled policy guardrails. On document tasks it produces auditable JSON decisions and traces. Reported results: SROIE F1 ≈ 0.81, synthetic extraction F1 ≈ 0.9565, policy violation F1 ≈ 0.8182, anomaly detection precision 1.0 (recall 0.90).

Problem Statement

Enterprise back-office automation needs auditable, policy-aligned, and predictable agent behavior. Existing multi-agent LLM stacks often use untyped I/O, best-of-N sampling, and open-ended retries, which break audit trails, policy gating, and operational guarantees. POLARIS re-casts orchestration as typed planning plus governed execution to meet enterprise requirements.

Main Contribution

POLARIS framework: typed-DAG plan synthesis, rubric selection, validator-gated execution and policy guardrails

CoAPlanner: enforces type-safe, diversity-constrained generation of candidate workflows

ReasoningAgent: rubric-based, policy-aware selector that emits auditable JSON decisions

Validator-gated bounded repair loop: targeted re-parsing for failing fields before side effects

Policy layer: compiled guardrails (thresholds, currency, segregation-of-duties) that block or route actions

Evaluation: synthetic invoice suite stressing governance and SROIE extraction; provides reusable benchmark primitives
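
The type-safety check behind typed-DAG plan synthesis can be illustrated with a minimal sketch. It assumes each step declares one input type and one output type as plain string labels; the `Step` class, the `type_check` function, and the type names are hypothetical and are not taken from the paper. The sketch only checks that every edge connects matching types; it does not check for cycles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical plan step with a declared input and output type label."""
    name: str
    input_type: str
    output_type: str


def type_check(steps: dict[str, Step], edges: list[tuple[str, str]]) -> list[str]:
    """Return a list of type errors; an empty list means every edge is well typed."""
    errors = []
    for src, dst in edges:
        produced = steps[src].output_type
        expected = steps[dst].input_type
        if produced != expected:
            errors.append(f"{src} -> {dst}: produces {produced}, expects {expected}")
    return errors


# Illustrative steps for an invoice workflow (names are assumptions).
steps = {
    "parse": Step("parse", "RawDocument", "InvoiceFields"),
    "validate": Step("validate", "InvoiceFields", "ValidatedInvoice"),
    "policy": Step("policy", "ValidatedInvoice", "PolicyDecision"),
}

good_plan = [("parse", "validate"), ("validate", "policy")]
bad_plan = [("parse", "policy")]  # skips validation, so the types do not line up

print(type_check(steps, good_plan))
print(type_check(steps, bad_plan))
```

A planner built along these lines would discard any candidate whose error list is non-empty before handing the remaining plans to a selector.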

Key Findings

High extraction on synthetic invoices

Numbers: F1 = 0.9565 on synthetic suite (160 fields)

Strong SROIE extraction baseline

Numbers: F1 = 0.8116 on SROIE (4-field extraction)

Policy violation detection is effective in many cases

Numbers: precision = recall = F1 = 0.8182 on cases where violations are present

Anomaly routing is very precise

Numbers: precision = 1.00, recall = 0.90, F1 = 0.9474 (TOTAL)

Results

Synthetic extraction (overall)

Value: precision=0.9453, recall=0.9680, f1=0.9565

SROIE 4-field extraction

Value: precision=0.8189, recall=0.8045, f1=0.8116

Policy violation detection (when positives exist)

Value: precision=0.8182, recall=0.8182, f1=0.8182

Anomaly detection (MAD with kmad=3.5)

Value: precision=1.0000, recall=0.9000, f1=0.9474
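
As a quick consistency check, the reported F1 values can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The function name `f1_score` and the dictionary layout are illustrative choices; the numbers are the ones listed in Results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the Results section.
reported = {
    "synthetic extraction": (0.9453, 0.9680, 0.9565),
    "SROIE 4-field extraction": (0.8189, 0.8045, 0.8116),
    "policy violation detection": (0.8182, 0.8182, 0.8182),
    "anomaly detection": (1.0000, 0.9000, 0.9474),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.4f}, reported F1 = {f1:.4f}")
```

All four recomputed values agree with the reported ones to four decimal places.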

What To Try In 7 Days

Define typed I/O contracts for one invoice workflow and instrument a JSON normalizer

Run a planner that emits 3–5 type-checked DAG proposals and log them for review

Implement a validator-gated repair loop limited to 2–3 iterations for parser outputs to cut false negatives
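
The third step can be sketched as follows. This is a minimal illustration of a validator-gated, bounded repair loop, not the paper's implementation: the required-field list, the `validate` rule (a field passes if it is non-empty), and the `parse` / `reparse_field` callables are all hypothetical stand-ins for real parser calls.

```python
from typing import Callable

REQUIRED_FIELDS = ("vendor", "date", "total")  # hypothetical contract


def validate(fields: dict) -> list[str]:
    """Return the names of required fields that are missing or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


def run_with_repair(
    parse: Callable[[str], dict],
    reparse_field: Callable[[str, str], str],
    document: str,
    max_repairs: int = 3,
) -> dict:
    """Parse, validate, and re-parse only failing fields, at most max_repairs times."""
    fields = parse(document)
    for attempt in range(max_repairs + 1):
        failing = validate(fields)
        if not failing:
            return {"status": "ok", "fields": fields, "repairs": attempt}
        if attempt == max_repairs:
            break  # budget exhausted: stop before any side effects
        for name in failing:
            fields[name] = reparse_field(document, name)
    return {"status": "escalate", "fields": fields, "failing": failing}


# Toy stand-ins for real parser calls (assumptions for the demo).
def toy_parse(document: str) -> dict:
    return {"vendor": "ACME", "date": "", "total": "12.50"}


def toy_reparse(document: str, name: str) -> str:
    return "2026-01-16" if name == "date" else ""


result = run_with_repair(toy_parse, toy_reparse, "raw invoice text")
print(result["status"], result["repairs"])
```

Returning an explicit "escalate" status, rather than retrying indefinitely, is what keeps the loop bounded and leaves a clear hand-off point for human review.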

Agent Features

Memory

  • few-shot exemplar bank (DSPy) for planning priors

Planning

  • typed plan synthesis (CoAPlanner)
  • diversity-constrained generation
  • few-shot planning priors via DSPy

Tool Use

  • APIAccess agent for ERP/Bank integrations
  • PolicyRetrieval and RiskControl guardrails
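
The kind of guardrail named in the policy layer (thresholds, currency, segregation of duties) can be illustrated with a small rule function that returns a block, route, or allow decision. The `Invoice` fields, the threshold value, the allowed-currency set, and the rule order are assumptions made for this sketch; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    """Hypothetical invoice record used only for this sketch."""
    amount: float
    currency: str
    created_by: str
    approved_by: str


APPROVAL_THRESHOLD = 10_000.0        # assumed value, not from the paper
ALLOWED_CURRENCIES = {"USD", "EUR"}  # assumed value, not from the paper


def evaluate(invoice: Invoice) -> dict:
    """Apply guardrails in a fixed order and return an auditable decision."""
    if invoice.currency not in ALLOWED_CURRENCIES:
        return {"action": "block", "rule": "currency"}
    if invoice.created_by == invoice.approved_by:
        return {"action": "block", "rule": "segregation_of_duties"}
    if invoice.amount > APPROVAL_THRESHOLD:
        return {"action": "route", "rule": "threshold"}
    return {"action": "allow", "rule": None}


print(evaluate(Invoice(500.0, "USD", "alice", "bob")))
print(evaluate(Invoice(25_000.0, "EUR", "alice", "bob")))
print(evaluate(Invoice(500.0, "USD", "alice", "alice")))
```

Recording the name of the rule that fired alongside the action gives each decision a trace that can be logged as JSON.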

Frameworks

  • DSPy
  • Autogen ConversableAgent pattern

Is Agentic

true

Architectures

  • multi-agent LLM pipeline
  • typed DAG orchestration

Collaboration

  • plan-select-act orchestration across specialized agents

Optimization Features

System Optimization

  • dependency-aware scheduler enabling parallel middle-stage checks
  • completion-based semantics to maximize safe parallelism
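
A dependency-aware scheduler of this kind can be approximated with Python's standard `graphlib.TopologicalSorter`. The sketch below groups steps into batches whose members have no unmet dependencies and could therefore run in parallel. The step names and the `parallel_batches` helper are hypothetical. For simplicity it marks a whole batch as done at once; a completion-based scheduler would instead call `done()` for each step as it finishes.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into batches; every step in a batch has all dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Each key maps to the set of steps it depends on (illustrative plan).
plan = {
    "parse": set(),
    "vendor_check": {"parse"},
    "policy_check": {"parse"},
    "anomaly_check": {"parse"},
    "decide": {"vendor_check", "policy_check", "anomaly_check"},
}

for batch in parallel_batches(plan):
    print(batch)
```

In this plan the three middle-stage checks land in the same batch, which is where parallel execution would pay off.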

Inference Optimization

  • bounded repair loop (limits repeated parse calls)
  • K=5 plan budget to bound planning cost

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluations use a small synthetic suite plus SROIE; cross-domain validation is limited
  • Design relies on proprietary models (GPT-4o, GPT-5) in examples
  • Policy and vendor DB quality strongly affect gating and routing accuracy
  • Bounded repair loop may still escalate to humans when the repair budget is exhausted

When Not To Use

  • Very low-latency real-time systems where LLM latency is unacceptable
  • Environments without a maintained policy store or vendor baselines
  • Tiny workflows where the governance overhead outweighs benefits
  • Settings with strict open-source model requirements and no access to high-capacity LLMs

Failure Modes

  • Parser failures under extreme layout drift can still produce false positives even after repairs
  • Incorrect or missing policy records can cause wrong blocking or missed violations
  • Insufficient or biased DSPy exemplars may lead the planner to omit valid plans
  • Dynamic edge injection could create unexpected dependency chains without careful testing

Core Entities

Models

  • GPT-4o (CoAPlanner)
  • GPT-5 reasoning model (ReasoningAgent)

Metrics

  • precision
  • recall
  • F1
  • policy detection (TPV/FPV/FNV/TNV)
  • anomaly MAD z-score
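
The MAD-based anomaly metric can be sketched with the commonly used modified z-score, flagging a value when its score exceeds a cutoff of 3.5 (the kmad value reported in Results). The source does not give the exact formula, so the 0.6745 scaling constant, the zero-MAD handling, and the function name `mad_outliers` are assumptions.

```python
from statistics import median


def mad_outliers(values: list[float], k_mad: float = 3.5) -> list[bool]:
    """Flag values whose modified z-score exceeds k_mad (assumed formula)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All deviations collapse to zero; this sketch flags nothing.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > k_mad for v in values]


# Illustrative invoice amounts for one vendor (made-up data for the demo).
amounts = [100, 102, 98, 101, 99, 103, 500]
print(mad_outliers(amounts))
```

Because the median and MAD are robust to extreme values, the single large amount does not distort the baseline that the other amounts are compared against.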

Datasets

  • SROIE
  • controlled synthetic invoice suite

Benchmarks

  • SROIE
  • synthetic governance suite (introduced here)