Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
POLARIS adds auditable, policy-gated automation to invoice-like workflows so companies can reduce manual checks while retaining legal and auditability guarantees.
Summary TLDR
POLARIS is a modular orchestration system that turns back-office automation into typed plan synthesis plus guarded execution. A planner generates several type-checked DAGs, a rubric-based ReasoningAgent picks a policy-compliant plan, and execution runs under validator-gated checks, a bounded repair loop, and compiled policy guardrails. On document tasks it produces auditable JSON decisions and traces. Reported results: SROIE F1 ≈ 0.81, synthetic extraction F1 ≈ 0.9565, policy violation F1 ≈ 0.8182, anomaly detection precision 1.0 (recall 0.90).
Problem Statement
Enterprise back-office automation needs auditable, policy-aligned, and predictable agent behavior. Existing multi-agent LLM stacks often use untyped I/O, best-of-N sampling, and open-ended retries, which undermine audit trails, policy gating, and operational guarantees. POLARIS recasts orchestration as typed planning plus governed execution to meet enterprise requirements.
Main Contribution
POLARIS framework: typed-DAG plan synthesis, rubric-based plan selection, validator-gated execution, and policy guardrails
CoAPlanner: enforces type-safe, diversity-constrained generation of candidate workflows
ReasoningAgent: rubric-based, policy-aware selector that emits auditable JSON decisions
Validator-gated bounded repair loop: targeted re-parsing for failing fields before side effects
Policy layer: compiled guardrails (thresholds, currency, segregation-of-duties) that block or route actions
Evaluation: synthetic invoice suite stressing governance and SROIE extraction; provides reusable benchmark primitives
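The policy layer described above (thresholds, currency, segregation-of-duties) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Invoice` and `Policy` shapes, the reason codes, and the choice to block on currency and segregation-of-duties violations while routing over-threshold amounts to review. It is not the POLARIS implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ROUTE_TO_REVIEW = "route_to_review"
    BLOCK = "block"


@dataclass(frozen=True)
class Invoice:  # hypothetical typed record
    amount: float
    currency: str
    created_by: str
    approved_by: str


@dataclass(frozen=True)
class Policy:  # hypothetical compiled policy
    review_threshold: float
    allowed_currencies: frozenset


def apply_guardrails(invoice, policy):
    """Return (action, reasons) so the decision can be logged for audit."""
    hard = []  # violations that block the action outright
    soft = []  # conditions that route the action to a human
    if invoice.currency not in policy.allowed_currencies:
        hard.append("currency_not_allowed")
    if invoice.created_by == invoice.approved_by:
        hard.append("segregation_of_duties")
    if invoice.amount > policy.review_threshold:
        soft.append("amount_over_threshold")

    if hard:
        action = Action.BLOCK
    elif soft:
        action = Action.ROUTE_TO_REVIEW
    else:
        action = Action.ALLOW
    return action, hard + soft


policy = Policy(review_threshold=1000.0, allowed_currencies=frozenset({"USD", "EUR"}))
print(apply_guardrails(Invoice(5000.0, "USD", "alice", "bob"), policy))
```

Returning the reason codes alongside the action keeps each gate decision explainable in the trace.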
Key Findings
High extraction F1 on synthetic invoices
Strong SROIE extraction baseline
Policy violation detection is effective in many cases
Anomaly routing is very precise
Results
Synthetic extraction (overall): F1 ≈ 0.9565
SROIE 4-field extraction: F1 ≈ 0.81
Policy violation detection (when positives exist): F1 ≈ 0.8182
Anomaly detection (MAD with kmad=3.5): precision 1.0, recall 0.90
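A MAD-based anomaly check with a 3.5 cutoff can be sketched as follows. The 0.6745 scaling constant is the common "modified z-score" convention and is an assumption here, as are the function name and the sample amounts; the source only states that MAD with kmad=3.5 is used.

```python
from statistics import median


def mad_anomalies(values, k_mad=3.5):
    """Flag values whose modified z-score exceeds k_mad."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: with zero spread, nothing is flagged.
        return [False] * len(values)
    # 0.6745 makes the score comparable to a standard z-score under normality.
    return [abs(0.6745 * (v - med) / mad) > k_mad for v in values]


# Hypothetical vendor history: five typical amounts and one outlier.
amounts = [100, 102, 98, 101, 99, 500]
print(mad_anomalies(amounts))
```

Because the median and MAD are robust to outliers, a single extreme invoice does not inflate the baseline the way a mean and standard deviation would.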
Who Should Care
What To Try In 7 Days
Define typed I/O contracts for one invoice workflow and instrument a JSON normalizer
Run a planner that emits 3–5 type-checked DAG proposals and log them for review
Implement a validator-gated repair loop limited to 2–3 iterations for parser outputs to cut false negatives
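As a starting point for the repair-loop step, here is a minimal sketch of a validator-gated loop with a fixed iteration budget. The required field names, the `parse` and `reparse_field` callables, and the status strings are all hypothetical placeholders; in practice they would wrap your own parser and typed contract.

```python
REQUIRED_FIELDS = ("vendor", "date", "total")  # hypothetical contract


def validate(record):
    """Return the names of required fields that are missing or empty."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]


def extract_with_repair(document, parse, reparse_field, max_repairs=3):
    """Parse once, then re-parse only failing fields, at most max_repairs times."""
    record = dict(parse(document))
    attempts = 0
    failing = validate(record)
    while failing and attempts < max_repairs:
        for field in failing:
            value = reparse_field(document, field)  # targeted re-parse
            if value:
                record[field] = value
        attempts += 1
        failing = validate(record)
    return {
        "record": record,
        "status": "ok" if not failing else "escalate_to_human",
        "repair_attempts": attempts,
        "failing_fields": failing,
    }


# Stub parsers standing in for real extraction calls.
result = extract_with_repair(
    "raw invoice text",
    parse=lambda doc: {"vendor": "ACME", "date": "", "total": "12.50"},
    reparse_field=lambda doc, field: "2024-01-31" if field == "date" else None,
    max_repairs=2,
)
print(result["status"], result["repair_attempts"])
```

The loop runs before any side effects, so a record that still fails validation after the budget is spent is escalated rather than acted on.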
Agent Features
Memory
- few-shot exemplar bank (DSPy) for planning priors
Planning
- typed plan synthesis (CoAPlanner)
- diversity-constrained generation
- few-shot planning priors via DSPy
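To make "type-checked" concrete, the sketch below checks that every edge in a candidate plan connects a producer's output type to a consumer's input type. The `Step` structure, step names, and type names are illustrative assumptions, and a single input type per step is a simplification; how CoAPlanner represents types is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:  # hypothetical plan node
    name: str
    input_type: str
    output_type: str


def type_errors(steps, edges):
    """Return a message for each edge whose output and input types disagree."""
    errors = []
    for src, dst in edges:
        produced = steps[src].output_type
        expected = steps[dst].input_type
        if produced != expected:
            errors.append(f"{src} -> {dst}: produces {produced}, expects {expected}")
    return errors


steps = {
    "parse": Step("parse", "RawDocument", "InvoiceRecord"),
    "policy_check": Step("policy_check", "InvoiceRecord", "PolicyVerdict"),
    "post_to_erp": Step("post_to_erp", "PolicyVerdict", "PostingReceipt"),
}
good_plan = [("parse", "policy_check"), ("policy_check", "post_to_erp")]
bad_plan = [("parse", "post_to_erp")]  # skips the policy check

print(type_errors(steps, good_plan))
print(type_errors(steps, bad_plan))
```

A plan that skips a required stage fails the check structurally, before any agent runs.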
Tool Use
- APIAccess agent for ERP/Bank integrations
- PolicyRetrieval and RiskControl guardrails
Frameworks
- DSPy
- AutoGen ConversableAgent pattern
Is Agentic
true
Architectures
- multi-agent LLM pipeline
- typed DAG orchestration
Collaboration
- plan-select-act orchestration across specialized agents
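The "select" step can be pictured as a policy filter followed by a weighted rubric, with the outcome serialized as JSON for audit. This is a sketch under assumptions: the rubric criteria, weights, candidate format, and decision fields are invented for illustration, and the actual ReasoningAgent is LLM-based rather than a fixed formula.

```python
import json


def select_plan(candidates, weights):
    """Drop non-compliant plans, score the rest, and emit a JSON decision."""
    eligible = [c for c in candidates if c["policy_compliant"]]
    rejected = sorted(c["id"] for c in candidates if not c["policy_compliant"])
    scores = {
        c["id"]: round(sum(weights[k] * c["scores"][k] for k in weights), 4)
        for c in eligible
    }
    # Highest score wins; ties break on plan id so the choice is deterministic.
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    decision = {
        "selected": ranked[0] if ranked else None,
        "rubric_scores": scores,
        "rejected_for_policy": rejected,
    }
    return json.dumps(decision, sort_keys=True)


candidates = [  # hypothetical planner output
    {"id": "A", "policy_compliant": True, "scores": {"coverage": 0.9, "cost": 0.5}},
    {"id": "B", "policy_compliant": False, "scores": {"coverage": 1.0, "cost": 1.0}},
    {"id": "C", "policy_compliant": True, "scores": {"coverage": 0.6, "cost": 0.9}},
]
print(select_plan(candidates, {"coverage": 0.7, "cost": 0.3}))
```

Filtering on compliance before scoring means a high-scoring but non-compliant plan can never be selected.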
Optimization Features
System Optimization
- dependency-aware scheduler enabling parallel middle-stage checks
- completion-based semantics to maximize safe parallelism
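One way to see how dependency-aware scheduling exposes parallelism is to group DAG nodes into batches whose members have no unmet dependencies. The sketch uses Python's standard `graphlib`; the step names and the batch-at-a-time simplification are assumptions, not the POLARIS scheduler.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group nodes into batches; each batch depends only on earlier batches.

    deps maps each node to the set of nodes it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # mark complete, which unlocks successors
    return batches


# Hypothetical invoice workflow: three middle-stage checks depend only on parse.
deps = {
    "parse": set(),
    "policy_check": {"parse"},
    "vendor_check": {"parse"},
    "anomaly_check": {"parse"},
    "decide": {"policy_check", "vendor_check", "anomaly_check"},
}
print(parallel_batches(deps))
```

A fully completion-based scheduler would call `done` per node as each finishes, so successors start as soon as their own dependencies complete rather than waiting for the whole batch.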
Inference Optimization
- bounded repair loop (limits repeated parse calls)
- K=5 plan budget to bound planning cost
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluations use a small synthetic suite plus SROIE; cross-domain validation is limited
- The examples rely on proprietary models (GPT-4o, GPT-5)
- Policy and vendor DB quality strongly affect gating and routing accuracy
- The bounded repair loop may still escalate to humans when the repair budget is exhausted
When Not To Use
- Very low-latency real-time systems where LLM latency is unacceptable
- Environments without a maintained policy store or vendor baselines
- Tiny workflows where the governance overhead outweighs benefits
- Settings with strict open-source model requirements and no access to high-capacity LLMs
Failure Modes
- Parser failures under extreme layout drift can still produce false positives even after repairs
- Incorrect or missing policy records can cause wrong blocking or missed violations
- Insufficient or biased DSPy exemplars may lead the planner to omit valid plans
- Dynamic edge injection could create unexpected dependency chains without careful testing
Core Entities
Models
- GPT-4o (CoAPlanner)
- GPT-5 reasoning model (ReasoningAgent)
Metrics
- precision
- recall
- F1
- policy detection (TPV/FPV/FNV/TNV)
- anomaly MAD z-score
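For reference, precision, recall, and F1 follow from confusion counts as sketched below, assuming TPV, FPV, and FNV denote true-positive, false-positive, and false-negative violation counts. The counts in the example are hypothetical and chosen only to show the arithmetic; they are not taken from the evaluation.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 9 true positives, 0 false positives, 1 false negative.
p, r, f1 = precision_recall_f1(tp=9, fp=0, fn=1)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.4f}")
```

True negatives (TNV) do not enter any of the three metrics, which is why they are reported separately.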
Datasets
- SROIE
- controlled synthetic invoice suite
Benchmarks
- SROIE
- synthetic governance suite (introduced here)

