A locally hosted LLM agent (RCAgent) that uses tools, snapshot keys, and trajectory-level self-consistency to improve cloud root-cause triage

October 25, 2023 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

1

Authors

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Jihong Wang, Fengbin Yin, Lunting Fan, Lingfei Wu, Qingsong Wen

Links

Abstract / PDF

Why It Matters For Business

RCAgent provides stronger automated RCA for private cloud data while reducing failed agent actions and surfacing platform-level issues to SREs faster.

Summary TLDR

RCAgent is a tool-augmented LLM agent built on an internally deployed Vicuna-13B model. It adds three practical components for cloud Root Cause Analysis (RCA): OBSK snapshot keys to manage long context, LLM-based expert tools for code/log analysis, and trajectory-level Self-Consistency (TSC) to aggregate multiple action traces. On an Alibaba Flink dataset and live out-of-domain jobs, RCAgent beats a ReAct baseline across metrics, yields far fewer invalid actions, and is deployed as a feedback step for human SREs.

Problem Statement

Cloud RCA needs flexible decision-making over noisy, long logs and private production data. Existing LLM approaches either fine-tune large external models (privacy risk) or act only as analyzers, not autonomous agents. Challenges include data privacy (can't call external APIs), managing very long context, and preventing invalid tool actions from weaker local LLMs.

Main Contribution

RCAgent: a production-focused tool-augmented LLM agent framework using a locally hosted Vicuna-13B model for privacy.

OBSK (Observation Snapshot Key): a head+snapshot key and key-value store to keep prompts short while allowing full-observation retrieval.

Expert agents: LLM-based code and log analysis tools that run recursive searches and RAG over clustered log chunks.

Stabilizations: JsonRegen for robust JSON outputs and error-handling rules to reduce invalid actions.

Trajectory-level Self-Consistency (TSC): a mid-way sampling+aggregation method to improve final answers without sampling whole trajectories.

Real deployment and evaluation on Alibaba Cloud Flink jobs, including online OoD evaluation and human ratings.
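
The OBSK idea (keep a short head in the prompt, park the full observation under a key) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `SnapshotStore`, the hash-based key, and the `head_chars` parameter are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Tuple


class SnapshotStore:
    """Hypothetical key-value store for long observations (OBSK-style sketch)."""

    def __init__(self, head_chars: int = 200) -> None:
        self.head_chars = head_chars
        self._store: Dict[str, str] = {}

    def put(self, observation: str) -> Tuple[str, str]:
        """Store the full text; return (snapshot_key, short prompt-ready view)."""
        # Assumption: a truncated content hash serves as the snapshot key.
        key = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:8]
        self._store[key] = observation
        head = observation[: self.head_chars]
        if len(observation) <= self.head_chars:
            return key, head
        return key, f"{head}... [truncated, snapshot_key={key}]"

    def get(self, key: str) -> str:
        """Retrieve the full observation for a snapshot key."""
        return self._store[key]
```

Only the short view would go into the controller prompt; a tool that needs the full text can call `get` with the key it finds in the view.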

Key Findings

RCAgent substantially improves root-cause text quality over ReAct on the Flink offline set.

Numbers: METEOR: RCAgent 15.15 vs ReAct 6.44 (+8.71)

RCAgent reduces invalid or malformed agent actions and increases successful runs.

Numbers: Pass Rate 99.38% vs ReAct 86.33%; Invalid Rate 7.93% vs 22.82%

Trajectory-level Self-Consistency (TSC) further boosts solution quality and human usefulness.

Numbers: Solution METEOR +3.51; G-Helpfulness +2.28% with TSC

Online deployment produced higher human helpfulness and responsibility precision than baselines.

Numbers: H-Helpfulness RCAgent 2.47 vs ReAct 1.36; with TSC 2.92. Responsibility precision 82.06% with TSC

Results

Root cause quality (METEOR)

Value: 15.15 (RCAgent)

Baseline: 6.44 (ReAct)

Solution quality (METEOR)

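Value: 16.45 (RCAgent w/ TSC LLM)

Baseline: 12.94 (likely RCAgent without TSC, since 16.45 − 12.94 matches the +3.51 TSC gain under Key Findings; the source label is ambiguous)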

Evidence quality (METEOR)

Value: 28.10 (RCAgent)

Baseline: 11.82 (ReAct)

Trajectory stability (Pass Rate)

Value: 99.38% (RCAgent)

Baseline: 86.33% (ReAct)

Invalid action rate

Value: 7.93% (RCAgent)

Baseline: 22.82% (ReAct)

Human helpfulness (H-Helpfulness)

Value: 2.47 ±0.17 (RCAgent online OoD)

Baseline: 1.36 ±0.03 (ReAct)
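
To make the gaps easier to compare, the reported value and baseline pairs can be turned into absolute differences. The numbers below are copied from the results listed here; the dictionary keys are labels invented for this sketch, and the solution-quality row is left out because its baseline label is ambiguous in the source.

```python
# (value, baseline) pairs as reported: RCAgent vs ReAct.
results = {
    "root_cause_meteor": (15.15, 6.44),
    "evidence_meteor": (28.10, 11.82),
    "pass_rate_pct": (99.38, 86.33),
    "invalid_rate_pct": (7.93, 22.82),  # lower is better
    "h_helpfulness": (2.47, 1.36),
}


def delta(value: float, baseline: float) -> float:
    """Absolute difference, rounded to the two decimals used in the source."""
    return round(value - baseline, 2)


deltas = {name: delta(v, b) for name, (v, b) in results.items()}

for name, d in deltas.items():
    print(f"{name}: {d:+.2f}")
```

A negative difference for the invalid rate is the desired direction, since fewer invalid actions is better.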

What To Try In 7 Days

Prototype OBSK: store large observations by snapshot key and pass only heads to the LLM prompt.

Replace raw query tools with semantically-minimal wrappers (ID-only inputs) to reduce invalid calls.

Add JsonRegen-like repair for structured outputs before tool execution to cut malformed interactions.
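
A JsonRegen-like repair step could look roughly like the sketch below. It is not the paper's algorithm: it assumes a cheap first pass that extracts the outermost `{...}` span, followed by a bounded number of regeneration calls. The `regenerate` callback stands in for a call back to the LLM and is hypothetical, as are the function names.

```python
import json
from typing import Callable, Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Try to parse the outermost {...} span in text as a JSON object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def parse_with_repair(
    raw: str,
    regenerate: Callable[[str], str],  # hypothetical: asks the LLM to rewrite the output
    max_retries: int = 2,
) -> Optional[dict]:
    """Return a parsed action dict, or None if every attempt fails."""
    candidate = raw
    for attempt in range(max_retries + 1):
        parsed = extract_json_object(candidate)
        if parsed is not None:
            return parsed
        if attempt < max_retries:
            candidate = regenerate(candidate)
    return None
```

Returning `None` instead of raising lets the caller decide whether to surface an error observation to the agent or abort the run.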

Agent Features

Memory

  • Key-value store for OBSK snapshots (long observation retrieval)

Planning

  • Thought-action-observation loop
  • Trajectory-level Self-Consistency (mid-way sampling)
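
Mid-way sampling can be pictured as: run one trajectory until the agent is ready to conclude, then sample only the final step several times from that shared prefix and aggregate. The sketch below assumes this reading. The aggregation shown, picking the candidate with the highest mean token overlap with the others, is a simple stand-in chosen for the example and should not be taken as the aggregation method used in the paper. `sample_final` is a hypothetical callback for a stochastic LLM call.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two answers (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def aggregate_by_medoid(candidates: List[str]) -> str:
    """Pick the answer most similar, on average, to all other answers."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def mean_similarity(i: int) -> float:
        others = [jaccard(candidates[i], c) for j, c in enumerate(candidates) if j != i]
        return sum(others) / len(others) if others else 0.0

    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]


def trajectory_self_consistency(
    prefix: str,
    sample_final: Callable[[str], str],  # hypothetical: samples one final answer
    n_samples: int = 5,
) -> str:
    """Sample only the final step n times from a shared prefix, then aggregate."""
    candidates = [sample_final(prefix) for _ in range(n_samples)]
    return aggregate_by_medoid(candidates)
```

Because the prefix is shared, the cost of the extra samples is limited to the final step rather than whole trajectories.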

Tool Use

  • Tool-augmented LLMs
  • Function-like information-gathering tools (ID-only inputs)
  • LLM expert agents for code/log analysis
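
An ID-only tool wrapper might be sketched as below: the agent passes nothing but a job ID, and the wrapper validates it before touching any backend. The ID pattern, the `make_id_only_tool` helper, and the choice to return an error string rather than raise are all assumptions made for illustration.

```python
import re
from typing import Callable

# Assumed ID format for this sketch; real job IDs may look different.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def make_id_only_tool(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a backend lookup so the agent can only pass a single job ID."""

    def tool(job_id: str) -> str:
        if not isinstance(job_id, str) or not JOB_ID_PATTERN.match(job_id):
            # Returned as an observation so the agent can correct itself.
            return "ERROR: expected a single job ID such as 'job-123'"
        return fetch(job_id)

    return tool
```

Keeping the input surface this small leaves a weaker local model few ways to produce an invalid call, which is the motivation the list above gives for ID-only inputs.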

Frameworks

  • ReAct-style prompting (modified)
  • Self-Consistency (SC) and Trajectory-level SC (TSC)

Is Agentic

true

Architectures

  • Controller agent (thought-action-observation loop)
  • Expert agents (LLM-based analytical tools)

Collaboration

  • Controller invokes expert agents and retrieves observations
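
The controller and expert split can be illustrated with a small thought-action-observation loop. This is a toy sketch under stated assumptions: `policy` stands in for the controller LLM, `tools` maps hypothetical action names to expert callables, and `"finish"` is an invented name for the terminating action.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (thought, action name, action input)


def run_controller(
    policy: Callable[[List[str]], Step],   # hypothetical stand-in for the controller LLM
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Run the loop until the policy finishes or the step budget runs out."""
    observations: List[str] = []
    for _ in range(max_steps):
        _thought, action, arg = policy(observations)
        if action == "finish":
            return arg, observations
        tool = tools.get(action)
        if tool is None:
            # Feed the mistake back instead of crashing the run.
            observations.append(f"ERROR: unknown action '{action}'")
            continue
        observations.append(tool(arg))
    return "no conclusion within step budget", observations
```

A scripted policy is enough to exercise the loop, for example one that calls a log expert once and then finishes with the last observation.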

Optimization Features

Token Efficiency

  • OBSK to reduce tokens in controller prompt

System Optimization

  • Semantically-minimal tools to avoid heavy tool invocation

Inference Optimization

  • vLLM backend for efficient serving
  • Greedy decoding by default for stability

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Evaluations focus on Apache Flink jobs and internal datasets; cross-system generalization untested.
  • Relies on a locally hosted 13B model — performance will vary with different LLMs and infra.
  • Offline labeled set reduced to 161 jobs for balance; sample size for rare causes is limited.

When Not To Use

  • When you can safely use stronger external LLM APIs and prefer simpler fine-tuning pipelines.
  • When latency requirements forbid multi-step tool invocation and sampling.

Failure Modes

  • Hallucinated evidence if log expert fuzzy matching is bypassed.
  • High invalid action rate with stochastic decoding (nucleus sampling) or poor tool wrappers.
  • Malformed structured outputs if JsonRegen is disabled.
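
The first failure mode suggests that quoted evidence should be matched back to real log lines before it is reported. A minimal sketch using the standard library's `difflib` follows; the `ground_evidence` name and the 0.8 cutoff are assumptions for the example, not values from the source.

```python
import difflib
from typing import List, Optional


def ground_evidence(quoted: str, log_lines: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the real log line closest to the quoted evidence, or None.

    None means the quote has no sufficiently similar line in the logs and
    should be treated as possibly hallucinated.
    """
    matches = difflib.get_close_matches(quoted, log_lines, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Replacing the model's quote with the matched line means the reported evidence is always text that actually occurs in the logs.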

Core Entities

Models

  • Vicuna-13B-V1.5-16K
  • GTE-LARGE (embedding)
  • gpt-4-0613 (frozen, used as judge)

Metrics

  • METEOR
  • BLEURT
  • BARTScore
  • EmbScore
  • NUBIA
  • G-Correctness
  • G-Helpfulness
  • H-Helpfulness
  • Pass Rate
  • Invalid Rate

Datasets

  • Collected 15,616 anomalous jobs over 1 month; filtered to ~5,000; labeled offline set of 161 jobs (class-balanced)

Context Entities

Models

  • Vicuna used for inference; random sampling during SC uses Vicuna defaults

Datasets

  • Flink Advisor history subset (for retrieval examples, no overlap with labels)
  • Platform/runtime/infrastructure logs in Alibaba SLS