A locally hosted LLM agent (RCAgent) that uses tools, snapshot keys, and trajectory-level self-consistency to improve cloud root-cause triage

October 25, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The paper provides offline and online results plus ablations and trajectory stats; methods are practical but tied to an internal dataset and a specific production stack.

Citations: 1

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: No

At A Glance

Cost impact: 60%

Production readiness: 70%

Novelty: 70%

Authors

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, Qingsong Wen

Why It Matters For Business

RCAgent provides stronger automated RCA for private cloud data while reducing failed agent actions and surfacing platform-level issues to SREs faster.

Summary TLDR

RCAgent is a tool-augmented LLM agent built on an internally deployed Vicuna-13B model. It adds three practical components for cloud Root Cause Analysis (RCA): Observation Snapshot Keys (OBSK) to manage long context, LLM-based expert tools for code and log analysis, and trajectory-level Self-Consistency (TSC) to aggregate multiple action traces. On an Alibaba Flink dataset and live out-of-domain jobs, RCAgent beats a ReAct baseline across metrics, produces far fewer invalid actions, and is deployed as a feedback step for human SREs.

Problem Statement

Cloud RCA needs flexible decision-making over noisy, long logs and private production data. Existing LLM approaches either fine-tune large external models (privacy risk) or act only as analyzers, not autonomous agents. Challenges include data privacy (can't call external APIs), managing very long context, and preventing invalid tool actions from weaker local LLMs.

Main Contribution

RCAgent: a production-focused tool-augmented LLM agent framework using a locally hosted Vicuna-13B model for privacy.

OBSK (Observation Snapshot Key): a head+snapshot key and key-value store to keep prompts short while allowing full-observation retrieval.
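
The OBSK idea above can be sketched as a small key-value store. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `SnapshotStore`, the hash-derived 8-character key, the `[snapshot:...]` stub format, and the default head length are all hypothetical choices.

```python
import hashlib


class SnapshotStore:
    """Keeps full observations out of the prompt; the prompt sees only a key and a head."""

    def __init__(self, head_chars: int = 200):
        self.head_chars = head_chars          # how much of each observation stays in the prompt
        self._store: dict[str, str] = {}      # snapshot key -> full observation

    def put(self, observation: str) -> tuple[str, str]:
        """Store the full observation; return (key, short stub for the prompt)."""
        # Hypothetical key scheme: first 8 hex chars of a SHA-256 digest.
        key = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:8]
        self._store[key] = observation
        stub = f"[snapshot:{key}] {observation[: self.head_chars]}"
        return key, stub

    def get(self, key: str) -> str:
        """Retrieve the full observation when the agent asks for it by key."""
        return self._store[key]


store = SnapshotStore(head_chars=40)
key, stub = store.put("ERROR checkpoint expired before completing. " * 50)
print(stub)                  # short text that goes into the controller prompt
print(len(store.get(key)))   # full observation remains retrievable by key
```

The controller prompt carries only `stub`; a tool that accepts the snapshot key can hand the full text to an expert agent when deeper analysis is needed.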

Key Findings

RCAgent substantially improves root-cause text quality over ReAct on the Flink offline set.

Numbers: METEOR: RCAgent 15.15 vs ReAct 6.44 (+8.71)

Practical Use: Switching to RCAgent-style agents can roughly double or triple semantic match scores for root-cause explanations on evaluated jobs; expect clearer automated diagnoses to hand to SREs.

Evidence Ref: Table 1 (root cause METEOR)

RCAgent reduces invalid or malformed agent actions and increases successful runs.

Numbers: Pass Rate 99.38% vs ReAct 86.33%; Invalid Rate 7.93% vs 22.82%

Practical Use: Use OBSK, JsonRegen, and expert agents to cut agent failures and noisy tool calls, saving time and preventing wasted compute.

Evidence Ref: Table 4 (trajectory stats)

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Root cause quality (METEOR) | 15.15 (RCAgent) | 6.44 (ReAct) | +8.71 | Offline Flink labeled set (161 jobs) | Table 1 root cause METEOR | Table 1
Solution quality (METEOR) | 16.45 (RCAgent w/ TSC LLM) | 12.94 (ReAct RCAgent w/o experts?) | +3.51 vs RCAgent (no TSC) | Offline Flink labeled set | Table 2 solution METEOR; TSC gains | Table 2
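
The deltas in the Results rows can be checked with simple arithmetic. The snippet below is only a sanity check on the reported numbers; the helper name `delta_and_ratio` is hypothetical. The baseline label in the second row is ambiguous in the source, so the check confirms only that +3.51 equals 16.45 minus 12.94.

```python
def delta_and_ratio(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute delta, value/baseline ratio), both rounded to 2 decimals."""
    return round(value - baseline, 2), round(value / baseline, 2)


# Root cause METEOR: RCAgent vs ReAct
root_cause = delta_and_ratio(15.15, 6.44)
# Solution METEOR: reported value vs reported baseline
solution = delta_and_ratio(16.45, 12.94)

print(root_cause)  # (8.71, 2.35)
print(solution)    # (3.51, 1.27)
```

The root-cause score is about 2.35 times the ReAct baseline, which matches the "roughly double" reading of the finding.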

What To Try In 7 Days

Prototype OBSK: store large observations by snapshot key and pass only heads to the LLM prompt.

Replace raw query tools with semantically-minimal wrappers (ID-only inputs) to reduce invalid calls.

Add JsonRegen-like repair for structured outputs before tool execution to cut malformed interactions.
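
The third step above can be prototyped with a rule-based repair pass. This is a stand-in sketch, not the paper's JsonRegen procedure: the function name `repair_json` and the specific repairs (extracting the outermost braces, dropping trailing commas, swapping single quotes) are assumptions chosen for illustration. When repair fails, the caller would re-prompt the model.

```python
import json
import re
from typing import Optional


def repair_json(raw: str) -> Optional[dict]:
    """Try progressively more aggressive repairs; return a dict or None."""
    candidates = [raw]
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        body = raw[start : end + 1]                 # strip chatter around the object
        candidates.append(body)
        # Drop trailing commas before a closing brace or bracket.
        no_trailing = re.sub(r",\s*([}\]])", r"\1", body)
        candidates.append(no_trailing)
        # Last resort: treat single quotes as double quotes.
        candidates.append(no_trailing.replace("'", '"'))
    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None  # caller should ask the model to regenerate


print(repair_json('Action: {"tool": "get_log", "job_id": "j-1",}'))
print(repair_json("no structured output here"))
```

Running the repair before tool execution means a malformed action costs one cheap parse attempt instead of a failed tool call.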

Agent Features

Memory
Key-value store for OBSK snapshots (long observation retrieval)
Planning
Thought-action-observation loop; Trajectory-level Self-Consistency (mid-way sampling)
Tool Use
Tool-augmented LLMs; Function-like information-gathering tools (ID-only inputs); LLM expert agents for code/log analysis
Frameworks
ReAct-style prompting (modified); Self-Consistency (SC) and Trajectory-level SC (TSC)
Is Agentic

Yes

Architectures
Controller agent (thought-action-observation loop); Expert agents (LLM-based analytical tools)
Collaboration
Controller invokes expert agents and retrieves observations
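
The aggregation step of trajectory-level Self-Consistency listed under Planning can be sketched as picking the sampled answer that agrees most with the others. This is a simplified illustration under assumptions: the summary does not specify the aggregation rule, so the sketch swaps in a token-overlap (Jaccard) vote, and the function names are hypothetical. Sampling the trajectories themselves is out of scope here.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def aggregate_answers(answers: list[str]) -> str:
    """Return the answer with the highest total similarity to the other answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")

    def support(i: int) -> float:
        return sum(jaccard(answers[i], other)
                   for j, other in enumerate(answers) if j != i)

    best = max(range(len(answers)), key=support)
    return answers[best]


# Final answers from three hypothetical sampled trajectories.
sampled = [
    "checkpoint timeout due to slow state backend",
    "checkpoint timeout caused by slow state backend",
    "network partition between task managers",
]
print(aggregate_answers(sampled))
```

An embedding model or an LLM judge could replace `jaccard` without changing the surrounding structure.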

Optimization Features

Token Efficiency
OBSK to reduce tokens in controller prompt
System Optimization
Semantically-minimal tools to avoid heavy tool invocation
Inference Optimization
vLLM backend for efficient serving; Greedy decoding by default for stability
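
The semantically-minimal tools listed under System Optimization can be sketched as wrappers that accept nothing but an ID. This is a minimal sketch under assumptions: the ID pattern, the `INVALID_INPUT` message, the helper `make_id_only_tool`, and the in-memory `FAKE_LOGS` backend are all hypothetical and stand in for whatever query layer a real deployment uses.

```python
import re
from typing import Callable

# Hypothetical ID format: letters, digits, underscore, hyphen, up to 64 chars.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def make_id_only_tool(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a backend lookup so the agent can pass only a job ID."""

    def tool(job_id: str) -> str:
        job_id = job_id.strip()
        if not JOB_ID_PATTERN.fullmatch(job_id):
            # Return a readable error instead of raising, so the agent can correct itself.
            return f"INVALID_INPUT: expected a job ID, got {job_id!r}"
        return fetch(job_id)

    return tool


# Stand-in backend for illustration only.
FAKE_LOGS = {"job-42": "TaskManager lost heartbeat"}
get_job_log = make_id_only_tool(lambda job_id: FAKE_LOGS.get(job_id, "NOT_FOUND"))

print(get_job_log("job-42"))
print(get_job_log("SELECT * FROM logs"))
```

Because the wrapper owns the query logic, a weaker local model has far less room to produce a malformed or expensive call.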

Reproducibility

Code Available: No
Data Available: No
Open Source Status: No
License: Unknown

Risks & Boundaries

Limitations

Evaluations focus on Apache Flink jobs and internal datasets; cross-system generalization untested.

Relies on a locally hosted 13B model; performance will vary with different LLMs and infrastructure.

When Not To Use

When you can safely use stronger external LLM APIs and prefer simpler fine-tuning pipelines.

When latency requirements forbid multi-step tool invocation and sampling.

Failure Modes

Hallucinated evidence if log expert fuzzy matching is bypassed.

High invalid action rate with stochastic decoding (nucleus sampling) or poor tool wrappers.

Core Entities

Models

Vicuna-13B-V1.5-16K; GTE-LARGE (embedding); gpt-4-0613 (frozen, used as judge)

Metrics

METEOR; BLEURT; BARTScore; EmbScore; NUBIA; G-Correctness; G-Helpfulness; H-Helpfulness; Pass Rate; Invalid Rate

Datasets

Collected 15,616 anomalous jobs (1 month); filtered ~5,000; labeled offline set 161 jobs (class-bala

Context Entities

Models

Vicuna used for inference; random sampling during SC uses Vicuna defaults

Datasets

Flink Advisor history subset (for retrieval examples, no overlap with labels); Platform/runtime/infrastructure logs in Alibaba SLS