A locally hosted LLM agent (RCAgent) that uses tools, snapshot keys, and trajectory-level self-consistency to improve cloud root-cause triage

Overview

Decision Snapshot: Ready For Pilot

The paper provides offline and online results plus ablations and trajectory stats; methods are practical but tied to an internal dataset and a specific production stack.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, Qingsong Wen

Why It Matters For Business

RCAgent provides stronger automated RCA for private cloud data while reducing failed agent actions and surfacing platform-level issues to SREs faster.

Who Should Care

ML Engineer, Engineering Lead, Data Scientist, CTO, Product Manager

Summary TLDR

RCAgent is a tool-augmented LLM agent built on an internally deployed Vicuna-13B model. It adds three practical components for cloud Root Cause Analysis (RCA): OBSK snapshot keys to manage long context, LLM-based expert tools for code/log analysis, and trajectory-level Self-Consistency (TSC) to aggregate multiple action traces. On an Alibaba Flink dataset and live out-of-domain jobs, RCAgent beats a ReAct baseline across metrics, yields far fewer invalid actions, and is deployed as a feedback step for human SREs.

Problem Statement

Cloud RCA needs flexible decision-making over noisy, long logs and private production data. Existing LLM approaches either fine-tune large external models (privacy risk) or act only as analyzers, not autonomous agents. Challenges include data privacy (can't call external APIs), managing very long context, and preventing invalid tool actions from weaker local LLMs.

Main Contribution

RCAgent: a production-focused tool-augmented LLM agent framework using a locally hosted Vicuna-13B model for privacy.

OBSK (Observation Snapshot Key): a head+snapshot key and key-value store to keep prompts short while allowing full-observation retrieval.
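A minimal sketch of the OBSK idea, assuming a plain in-memory dictionary as the key-value store. The class name `SnapshotStore`, the hash-based key format, and the `head_chars` cutoff are illustrative assumptions, not details taken from the paper.

```python
import hashlib


class SnapshotStore:
    """In-memory key-value store for long observations (hypothetical helper)."""

    def __init__(self, head_chars: int = 200):
        self.head_chars = head_chars
        self._store: dict[str, str] = {}

    def put(self, observation: str) -> str:
        """Store the full observation and return a short stub for the prompt."""
        # Derive a short, stable key from the content (an assumed key scheme).
        digest = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:8]
        key = f"obs_{digest}"
        self._store[key] = observation
        # Only the head of the observation plus its key goes into the prompt.
        head = observation[: self.head_chars]
        return f"[snapshot_key={key}] {head}"

    def get(self, key: str) -> str:
        """Return the full observation for a previously issued key."""
        return self._store[key]
```

The controller prompt would carry only the stub returned by `put`, while a tool or expert agent that needs the whole log calls `get` with the key.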

Key Findings

RCAgent substantially improves root-cause text quality over ReAct on the Flink offline set.

Numbers: METEOR: RCAgent 15.15 vs ReAct 6.44 (+8.71)

Practical Use: Switching to RCAgent-style agents can roughly double or triple semantic match scores for root-cause explanations on evaluated jobs; expect clearer automated diagnoses to hand to SREs.

Evidence Ref: Table 1 (root cause METEOR)

RCAgent reduces invalid or malformed agent actions and increases successful runs.

Numbers: Pass Rate 99.38% vs ReAct 86.33%; Invalid Rate 7.93% vs 22.82%

Practical Use: Use OBSK, JsonRegen, and expert agents to cut agent failures and noisy tool calls, saving time and preventing wasted compute.

Evidence Ref: Table 4 (trajectory stats)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Root cause quality (METEOR)	15.15 (RCAgent)	6.44 (ReAct)	+8.71	Offline Flink labeled set (161 jobs)	Table 1 root cause METEOR	Table 1
Solution quality (METEOR)	16.45 (RCAgent w/ TSC LLM)	12.94 (RCAgent w/o TSC)	+3.51 vs RCAgent (no TSC)	Offline Flink labeled set	Table 2 solution METEOR; TSC gains	Table 2
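The reported deltas can be rechecked with a few lines of arithmetic. The sketch below uses only numbers quoted on this page; the helper names `delta` and `ratio` are illustrative.

```python
def delta(value: float, baseline: float) -> float:
    """Absolute improvement over the baseline, rounded to two decimals."""
    return round(value - baseline, 2)


def ratio(value: float, baseline: float) -> float:
    """How many times larger the value is than the baseline."""
    return value / baseline


# Root cause METEOR: RCAgent vs ReAct.
root_cause_delta = delta(15.15, 6.44)
root_cause_ratio = ratio(15.15, 6.44)

# Solution METEOR: 16.45 vs the 12.94 baseline row.
solution_delta = delta(16.45, 12.94)

# Invalid Rate: relative reduction from 22.82% to 7.93%.
invalid_rate_reduction = (22.82 - 7.93) / 22.82

print(f"root cause: +{root_cause_delta} ({root_cause_ratio:.2f}x)")
print(f"solution: +{solution_delta}")
print(f"invalid rate: {invalid_rate_reduction:.1%} relative reduction")
```

On these figures the root-cause METEOR gain is a little over a 2x ratio, and the invalid-action rate falls by roughly two thirds in relative terms.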

What To Try In 7 Days

Prototype OBSK: store large observations by snapshot key and pass only heads to the LLM prompt.

Replace raw query tools with semantically-minimal wrappers (ID-only inputs) to reduce invalid calls.

Add JsonRegen-like repair for structured outputs before tool execution to cut malformed interactions.
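The last two items can be prototyped together. This is a rough sketch under stated assumptions: `parse_action` and `make_id_only_tool` are hypothetical names, the repair steps (brace extraction, then an optional caller-supplied regeneration call) are a simplified stand-in for JsonRegen rather than the paper's procedure, and the `job_id` field and its allowed characters are invented for illustration.

```python
import json
import re
from typing import Callable, Optional


def parse_action(raw: str, regenerate: Optional[Callable[[str], str]] = None) -> dict:
    """Parse model output as a JSON object, trying simple repairs on failure."""

    def try_load(text: str) -> Optional[dict]:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, dict) else None

    parsed = try_load(raw)
    if parsed is None:
        # Repair 1: keep only the outermost {...} span, dropping extra text.
        match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
        if match:
            parsed = try_load(match.group(0))
    if parsed is None and regenerate is not None:
        # Repair 2: ask a caller-supplied function (e.g. an LLM call) to rewrite it.
        parsed = try_load(regenerate(raw))
    if parsed is None:
        raise ValueError("could not recover a JSON object from the model output")
    return parsed


def make_id_only_tool(fetch: Callable[[str], str]) -> Callable[[dict], str]:
    """Wrap a fetch function so the agent supplies nothing but a job ID."""

    def tool(action_input: dict) -> str:
        job_id = action_input.get("job_id")
        if not isinstance(job_id, str) or not re.fullmatch(r"[A-Za-z0-9_-]+", job_id):
            # Return a readable error so the agent can correct its next action.
            return 'Invalid input: expected {"job_id": "<id>"}'
        return fetch(job_id)

    return tool
```

Validating the ID before calling `fetch` keeps malformed inputs from ever reaching the underlying query system, which is one plausible way to lower the invalid-action rate.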

Agent Features

Memory

Key-value store for OBSK snapshots (long observation retrieval)

Planning

Thought-action-observation loop

Trajectory-level Self-Consistency (mid-way sampling)
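One way to picture trajectory-level self-consistency is to sample several continuations from a shared trajectory prefix and keep the answer that agrees most with the rest. The sketch below is an assumption-laden illustration: `sample_continuation` is a hypothetical callable standing in for the agent rollout, and `difflib` string similarity stands in for whatever embedding or LLM-based aggregation the paper actually uses.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a, b).ratio()


def aggregate_by_consensus(answers: List[str]) -> str:
    """Return the answer that is, on average, most similar to all the others."""
    if not answers:
        raise ValueError("need at least one answer")

    def mean_similarity(i: int) -> float:
        others = [similarity(answers[i], answers[j])
                  for j in range(len(answers)) if j != i]
        return sum(others) / len(others) if others else 1.0

    best = max(range(len(answers)), key=mean_similarity)
    return answers[best]


def trajectory_self_consistency(
    prefix: str,
    sample_continuation: Callable[[str, int], str],
    n_samples: int = 5,
) -> str:
    """Sample several final answers from one shared prefix, then aggregate."""
    # The prefix is the part of the trajectory produced before sampling starts.
    answers = [sample_continuation(prefix, seed) for seed in range(n_samples)]
    return aggregate_by_consensus(answers)
```

Sharing the prefix means the early steps are computed once, so only the sampled tail of each trajectory costs extra inference.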

Tool Use

Tool-augmented LLMs

Function-like information-gathering tools (ID-only inputs)

LLM expert agents for code/log analysis

Frameworks

ReAct-style prompting (modified)

Self-Consistency (SC) and Trajectory-level SC (TSC)

Is Agentic

Yes

Architectures

Controller agent (thought-action-observation loop)

Expert agents (LLM-based analytical tools)

Collaboration

Controller invokes expert agents and retrieves observations

Optimization Features

Token Efficiency

OBSK to reduce tokens in controller prompt

System Optimization

Semantically-minimal tools to avoid heavy tool invocation

Inference Optimization

vLLM backend for efficient serving

Greedy decoding by default for stability

Reproducibility

Code Available: No

Data Available: No

Open Source Status: No

License: Unknown

Risks & Boundaries

Limitations

Evaluations focus on Apache Flink jobs and internal datasets; cross-system generalization untested.

Relies on a locally hosted 13B model; performance will vary with different LLMs and infrastructure.

When Not To Use

When you can safely use stronger external LLM APIs and prefer simpler fine-tuning pipelines.

When latency requirements forbid multi-step tool invocation and sampling.

Failure Modes

Hallucinated evidence if log expert fuzzy matching is bypassed.

High invalid action rate with stochastic decoding (nucleus sampling) or poor tool wrappers.
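To guard against the first failure mode, evidence quoted by a log expert can be checked against the actual log before it is accepted. The following is a minimal sketch, assuming `difflib` similarity as the fuzzy matcher; the function name `find_supporting_line` and the 0.8 threshold are illustrative choices, not values from the paper.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def find_supporting_line(
    quote: str,
    log_lines: Iterable[str],
    threshold: float = 0.8,
) -> Optional[str]:
    """Return the log line that best matches the quoted evidence, or None."""
    best_line: Optional[str] = None
    best_score = 0.0
    for line in log_lines:
        # Case-insensitive similarity between the quote and this log line.
        score = SequenceMatcher(None, quote.lower(), line.lower()).ratio()
        if score > best_score:
            best_line, best_score = line, score
    # Reject the quote when nothing in the log is close enough to it.
    return best_line if best_score >= threshold else None
```

A caller would drop or flag any piece of evidence for which this returns `None`, and could substitute the matched log line for the model's paraphrase so that only text present in the log reaches the final report.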

Core Entities

Models

Vicuna-13B-V1.5-16K

GTE-LARGE (embedding)

gpt-4-0613 (frozen, used as judge)

Metrics

METEOR, BLEURT, BARTScore, EmbScore, NUBIA, G-Correctness, G-Helpfulness, H-Helpfulness, Pass Rate, Invalid Rate

Datasets

Collected 15,616 anomalous jobs (1 month); filtered ~5,000; labeled offline set 161 jobs (class-bala

Context Entities

Models

Vicuna used for inference; random sampling during SC uses Vicuna defaults

Datasets

Flink Advisor history subset (for retrieval examples, no overlap with labels)

Platform/runtime/infrastructure logs in Alibaba SLS
