TrustBench: a runtime safety gate for agents that cuts harmful actions and runs in under 200 ms

March 10, 2026 · 7 min read

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Tavishi Sharma, Vinayak Sharma, Pragya Sharma

Why It Matters For Business

Insert fast, actionable checks before agents act to avoid costly or dangerous mistakes. TrustBench-style verification cut harmful actions by a reported ~87% in evaluation while keeping median verification latency under 200 ms, which is fast enough for interactive use.

Summary TLDR

TrustBench is a modular system that intercepts agent actions just before execution and combines calibrated model confidence with fast runtime checks and domain plugins. It substantially reduces harmful agent actions (reported ~87% reduction) while keeping median verification latency under 200 ms.

Problem Statement

Current trust evaluations are mostly post-hoc and cannot stop harmful agent actions. Agents need a fast, pluggable verification step between action formulation and execution to prevent harm in high-risk domains.

Main Contribution

A dual-mode framework: (1) Benchmarking Mode learns mappings between agent confidence and actual correctness using LLM-as-a-Judge; (2) Verification Mode applies calibrated priors plus fast runtime checks to gate actions.

A domain plugin architecture that encodes domain-specific evidence policies (e.g., PubMed/WHO checks for healthcare, regulatory checks for finance) and lets domains override thresholds and weights.
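A minimal sketch of how such a plugin could be structured. Everything here is an assumption made for illustration: the `DomainPlugin` class, the check functions, the weights, and the thresholds are hypothetical and are not taken from the TrustBench implementation. The trusted-source check is a toy string match standing in for a real PubMed/WHO lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical type: a check maps an action's text to a score in [0, 1].
Check = Callable[[str], float]


@dataclass
class DomainPlugin:
    """Illustrative plugin: bundles evidence checks with per-domain overrides."""
    name: str
    checks: Dict[str, Check] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)  # per-check weights
    threshold: float = 0.5  # assumed generic default; a domain may override it

    def evidence_score(self, action_text: str) -> float:
        """Weighted mean of the plugin's checks (unlisted checks get weight 1)."""
        total = sum(self.weights.get(n, 1.0) for n in self.checks)
        if total <= 0:
            return 0.0
        weighted = sum(
            self.weights.get(n, 1.0) * check(action_text)
            for n, check in self.checks.items()
        )
        return weighted / total

    def allows(self, action_text: str) -> bool:
        return self.evidence_score(action_text) >= self.threshold


def cites_trusted_source(text: str) -> float:
    # Toy stand-in for a PubMed/WHO evidence lookup.
    lowered = text.lower()
    return 1.0 if ("pubmed" in lowered or "who.int" in lowered) else 0.0


def no_blocked_terms(text: str) -> float:
    # Toy policy check with a made-up blocked phrase.
    return 0.0 if "double the dose" in text.lower() else 1.0


generic = DomainPlugin(name="generic", checks={"policy": no_blocked_terms})
healthcare = DomainPlugin(
    name="healthcare",
    checks={"source": cites_trusted_source, "policy": no_blocked_terms},
    weights={"source": 3.0, "policy": 1.0},  # domain override of weights
    threshold=0.8,                           # domain override of threshold
)
```

In this sketch the healthcare plugin rejects an uncited recommendation that the generic plugin would let through, which is the kind of stricter, domain-aware behavior the plugin architecture is meant to allow.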

An implementation and empirical evaluation across healthcare (MedQA), finance (FinQA), and factual reasoning (TruthfulQA) showing large harm reduction with sub-200 ms verification latency.

Key Findings

TrustBench-equipped agents produced far fewer harmful actions in evaluation.

Numbers: harmful actions reduced by 87% (aggregate)

Domain-specific plugins are more effective than generic verification.

Numbers: domain plugins gave 35% greater harm reduction than generic verification

Runtime verification meets interactive latency requirements.

Numbers: median end-to-end verification latency <200 ms

Calibrated confidence alone provides limited mitigation.

Numbers: Confidence-Only yields only a marginal decrease in harmful actions; the full system reduces harmful actions to ~10–13% of baseline

Applying a plugin out-of-domain weakens verification.

Numbers: out-of-domain use increases harm rates by ~25–35%

Results

Harmful action reduction (aggregate)

Value: 87% reduction

Baseline: unconstrained execution

Harmful actions after TrustBench

Value: ~10–13% of baseline

Baseline: 100% of identified harmful actions under unconstrained execution

Domain-plugin benefit

Value: 35% additional harm reduction vs generic verification

Baseline: generic verification

Verification latency

Value: median <200 ms

Baseline: interactive application requirement

Calibration weighting used

Value: confidence prior : runtime checks = 0.3 : 0.7

Baseline: empirical choice in evaluation
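To make the 0.3 : 0.7 weighting concrete, here is a small sketch that assumes the two signals are combined as a weighted sum of scores in [0, 1]. The linear form, the function names, and the 0.6 decision threshold are assumptions for illustration; the summary reports only the weights.

```python
def trust_score(calibrated_confidence: float,
                runtime_score: float,
                w_prior: float = 0.3,
                w_runtime: float = 0.7) -> float:
    """Weighted sum of the calibrated prior and the runtime-check score.

    The 0.3 / 0.7 defaults are the weights reported in the evaluation;
    the linear combination itself is an assumption of this sketch.
    """
    for value in (calibrated_confidence, runtime_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    return w_prior * calibrated_confidence + w_runtime * runtime_score


def gate(score: float, threshold: float = 0.6) -> str:
    """Hypothetical threshold: execute at or above it, block below it."""
    return "execute" if score >= threshold else "block"


# A confident agent whose runtime checks mostly fail is still blocked,
# because runtime evidence carries most of the weight.
confident_but_unsupported = trust_score(0.9, 0.2)
modest_but_supported = trust_score(0.5, 1.0)
print(gate(confident_but_unsupported), gate(modest_but_supported))
```

Under this weighting, high self-reported confidence cannot outvote failing runtime checks, which is consistent with the finding that calibrated confidence alone gives limited mitigation.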

Who Should Care

What To Try In 7 Days

Log agent self-reported confidence and outcomes on a small domain dataset to measure miscalibration.
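One common way to summarize miscalibration from such a log is expected calibration error (ECE). The summary does not say which miscalibration metric TrustBench uses, so ECE is an assumption here, as are the function name and the choice of 10 equal-width bins.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: sample-weighted gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1] reported by the agent.
    outcomes:    1 if the answer was judged correct, else 0.
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An agent that reports 0.9 confidence but is right half the time.
logged_confidence = [0.9, 0.9, 0.9, 0.9]
logged_outcome = [1, 1, 0, 0]
ece = expected_calibration_error(logged_confidence, logged_outcome)
```

A value near 0 means confidence tracks correctness; a large value suggests the raw confidence should not be trusted without recalibration.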

Prototype an isotonic calibration mapping from confidence to observed correctness using an LLM-as-a-Judge.
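For a quick prototype, isotonic calibration can be fitted with the pool-adjacent-violators algorithm in a few lines. The sketch below is dependency-free and returns a step function; it is a simplified stand-in and not the authors' code. The function names are hypothetical, and the outcomes are assumed to be 0/1 labels produced by the judge.

```python
from bisect import bisect_left


def fit_isotonic(confidences, outcomes):
    """Pool-adjacent-violators fit of correctness as a function of confidence.

    Returns (upper_edges, values): values are non-decreasing, and
    upper_edges[i] is the largest confidence pooled into block i.
    """
    # Pool samples that share the same confidence value first.
    grouped = {}
    for x, y in zip(confidences, outcomes):
        total, count = grouped.get(x, (0.0, 0))
        grouped[x] = (total + y, count + 1)

    blocks = []  # each block is [sum_of_outcomes, sample_count, max_confidence]
    for x in sorted(grouped):
        total, count = grouped[x]
        blocks.append([total, count, x])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n, x_max = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = x_max

    upper_edges = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return upper_edges, values


def calibrate(confidence, upper_edges, values):
    """Step lookup: use the first block whose upper edge reaches `confidence`."""
    idx = bisect_left(upper_edges, confidence)
    return values[min(idx, len(values) - 1)]


# Toy log: the 0.6 and 0.7 samples violate monotonicity and get pooled.
edges, vals = fit_isotonic([0.6, 0.7, 0.8, 0.9], [1, 0, 1, 1])
```

Library implementations typically interpolate between blocks instead of using a hard step; the step lookup here just keeps the sketch short.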

Implement 2–3 cheap runtime checks (citation presence, timestamp recency, simple policy blacklist) and measure latency and false-block rates.
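The three checks named above can be prototyped with the standard library alone. This is a sketch under assumptions: the citation regex, the one-year recency window, and the blocklist phrases are placeholders chosen for illustration, and the helper names are hypothetical.

```python
import re
import time
from datetime import datetime, timedelta, timezone

# Hypothetical policy phrases; replace with your own policy list.
BLOCKLIST = ("wire all funds", "delete all records")


def has_citation(text):
    """True if the text contains a URL or a bracketed numeric reference."""
    return bool(re.search(r"https?://\S+|\[\d+\]", text))


def is_recent(timestamp, max_age_days=365, now=None):
    """True if the evidence timestamp is within the allowed age window."""
    now = now or datetime.now(timezone.utc)
    return now - timestamp <= timedelta(days=max_age_days)


def passes_blocklist(text):
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)


def run_checks(text, evidence_timestamp, now=None):
    """Run all checks and report the wall-clock latency in milliseconds."""
    start = time.perf_counter()
    results = {
        "citation": has_citation(text),
        "recency": is_recent(evidence_timestamp, now=now),
        "blocklist": passes_blocklist(text),
    }
    latency_ms = (time.perf_counter() - start) * 1000.0
    return results, latency_ms


def false_block_rate(blocked, harmful):
    """Share of non-harmful actions that the gate blocked anyway."""
    safe_decisions = [b for b, h in zip(blocked, harmful) if not h]
    if not safe_decisions:
        return 0.0
    return sum(safe_decisions) / len(safe_decisions)
```

Running `run_checks` over a labeled sample and feeding the resulting block decisions into `false_block_rate` gives both numbers the step asks for: per-action latency and how often safe actions are stopped.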

Agent Features

Memory

  • stores calibration profiles indexed by agent and domain (short-term operational state)

Planning

  • action interception before execution
  • trust-based action gating
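Interception and gating can be sketched as a wrapper that scores an action before letting it run. The names `intercept`, `ActionBlocked`, and `ask_human`, as well as the 0.6 threshold, are hypothetical; the optional human-review hook reflects the human-in-the-loop handling of low-trust actions listed on this page, not a documented TrustBench interface.

```python
import functools


class ActionBlocked(Exception):
    """Raised when an action fails verification and is not approved."""


def intercept(action, verify, threshold=0.6, ask_human=None):
    """Wrap `action` so it only runs if `verify` returns a high enough score.

    verify:    callable taking the action's arguments, returning a trust score.
    ask_human: optional callable (score, args, kwargs) -> bool that can
               approve a low-trust action; hypothetical review hook.
    """
    @functools.wraps(action)
    def wrapper(*args, **kwargs):
        score = verify(*args, **kwargs)
        if score >= threshold:
            return action(*args, **kwargs)
        if ask_human is not None and ask_human(score, args, kwargs):
            return action(*args, **kwargs)
        raise ActionBlocked(
            f"trust score {score:.2f} is below threshold {threshold:.2f}"
        )
    return wrapper
```

The key property is that the wrapped action never executes on the low-trust path unless a reviewer approves it, so verification sits strictly between action formulation and execution.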

Tool Use

  • invoked as pre-execution verification toolkit
  • integrates via model APIs and Ollama

Frameworks

  • LLM-as-a-Judge
  • isotonic regression calibration

Is Agentic

true

Architectures

  • dual-mode (benchmarking + runtime verification)
  • plugin-based domain extension

Collaboration

  • human-in-the-loop gating for low-trust actions

Optimization Features

Infra Optimization

  • plug-and-play model integration via Ollama and APIs

System Optimization

  • sub-200 ms median end-to-end verification pipeline
  • modular components for independent instantiation

Inference Optimization

  • select subset of runtime metrics to meet latency bounds
  • empirical 0.3:0.7 weighting to prioritize fast runtime signals
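Selecting a subset of runtime metrics under a latency bound can be approximated with a greedy pass over estimated costs. The summary does not describe how TrustBench makes this selection, so the heuristic below is an assumption, as are the metric names, latency estimates, and usefulness values. It also assumes checks run sequentially, so their latencies add.

```python
def select_checks(checks, budget_ms):
    """Greedy selection by usefulness per millisecond under a latency budget.

    checks: iterable of (name, estimated_latency_ms, usefulness) tuples.
    Returns (chosen_names, total_estimated_latency_ms).
    This is a heuristic, not an optimal knapsack solution.
    """
    checks = list(checks)
    if any(cost <= 0 for _, cost, _ in checks):
        raise ValueError("estimated latencies must be positive")

    chosen, spent = [], 0.0
    ranked = sorted(checks, key=lambda c: c[2] / c[1], reverse=True)
    for name, cost, _ in ranked:
        if spent + cost <= budget_ms:
            chosen.append(name)
            spent += cost
    return chosen, spent


# Made-up estimates for illustration only.
candidates = [
    ("blocklist", 5, 0.3),
    ("citation", 20, 0.4),
    ("evidence_lookup", 150, 0.9),
    ("llm_judge", 900, 1.0),
]
selected, estimated_ms = select_checks(candidates, budget_ms=200)
```

With these placeholder numbers the slow judge call is left out of the runtime path, which matches the split between an offline Benchmarking Mode and a fast Verification Mode.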

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Relies on LLM-as-a-Judge (LAJ) quality; judge bias or error can affect calibration.
  • Domain plugins require domain-specific design and maintenance.
  • Out-of-domain plugin use raises harm rates by ~25–35%, so cross-domain generalization is limited.
  • No public code or release details provided in the text to confirm reproducibility.

When Not To Use

  • When you cannot obtain reliable calibration data linking confidence to outcomes.
  • For ultra-low-latency paths where even ~200 ms is unacceptable.
  • If you cannot implement or maintain domain-specific evidence policies (plugins).

Failure Modes

  • False blocking of safe actions due to overstrict plugin rules.
  • Missed harmful actions if calibration or LAJ judgments are systematically biased.
  • Increased harm when applying misaligned domain plugins out-of-domain.
  • Overreliance on calibrated confidence when runtime checks are incomplete.

Core Entities

Models

  • Llama3.2:8B
  • Llama3:8B
  • GPT-OSS:20B

Metrics

  • harmful action rate (%)
  • LAJ correctness (LLM-as-a-Judge)
  • confidence calibration curves
  • verification latency (ms)

Datasets

  • MedQA
  • FinQA
  • TruthfulQA

Benchmarks

  • AgentBench
  • TrustLLM
  • HELM
  • SafeAgentBench