Overview
The paper reports offline and online results, ablations, and trajectory statistics. Its methods are practical but tied to an internal dataset and a specific production stack.
Citations: 1
Evidence Strength: 0.80
Confidence: 0.82
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 6/6
Reproducibility
Status: No open assets linked
Open source: No
At A Glance
Cost impact: 60%
Production readiness: 70%
Novelty: 70%
Why It Matters For Business
RCAgent provides stronger automated RCA for private cloud data while reducing failed agent actions and surfacing platform-level issues to SREs faster.
Who Should Care
Summary TLDR
RCAgent is a tool-augmented LLM agent built on an internally deployed Vicuna-13B model. It adds three practical components for cloud Root Cause Analysis (RCA): OBSK snapshot keys to manage long context, LLM-based expert tools for code/log analysis, and trajectory-level Self-Consistency (TSC) to aggregate multiple action traces. On an Alibaba Flink dataset and live out-of-domain jobs, RCAgent beats a ReAct baseline across metrics, yields far fewer invalid actions, and is deployed as a feedback step for human SREs.
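To make the aggregation idea behind trajectory-level Self-Consistency concrete, here is a minimal sketch. It is not the paper's TSC procedure: it assumes each sampled trajectory has already produced a free-text final answer, and it swaps in a simple stand-in for aggregation, picking the answer most similar on average to the others using character-level similarity from Python's `difflib`. The function names `similarity` and `aggregate_trajectories` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def aggregate_trajectories(final_answers: list[str]) -> str:
    """Return the answer with the highest mean similarity to all other answers.

    Assumption: this "medoid" vote is a stand-in for the paper's aggregation
    step, which is not reproduced here.
    """
    if not final_answers:
        raise ValueError("need at least one trajectory answer")
    if len(final_answers) == 1:
        return final_answers[0]

    def mean_similarity(i: int) -> float:
        others = [
            similarity(final_answers[i], final_answers[j])
            for j in range(len(final_answers))
            if j != i
        ]
        return sum(others) / len(others)

    best_index = max(range(len(final_answers)), key=mean_similarity)
    return final_answers[best_index]


# Hypothetical final answers from three sampled trajectories.
answers = [
    "checkpoint timeout caused by slow state backend",
    "checkpoint timeout caused by a slow state backend",
    "out of memory in task manager",
]
chosen = aggregate_trajectories(answers)
```

With these inputs the outlier answer is not selected, because it agrees with neither of the other two trajectories.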
Problem Statement
Cloud RCA needs flexible decision-making over noisy, long logs and private production data. Existing LLM approaches either fine-tune large external models (privacy risk) or act only as analyzers, not autonomous agents. Challenges include data privacy (can't call external APIs), managing very long context, and preventing invalid tool actions from weaker local LLMs.
Main Contribution
RCAgent: a production-focused tool-augmented LLM agent framework using a locally hosted Vicuna-13B model for privacy.
OBSK (Observation Snapshot Key): a head+snapshot key and key-value store to keep prompts short while allowing full-observation retrieval.
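The OBSK idea of a head plus a snapshot key backed by a key-value store can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `SnapshotStore`, the SHA-256-derived key, the 200-character head, and the prompt format are all hypothetical choices.

```python
import hashlib


class SnapshotStore:
    """In-memory key-value store for full observations (hypothetical design)."""

    def __init__(self, head_chars: int = 200):
        self.head_chars = head_chars          # assumed head length, not from the paper
        self._store: dict[str, str] = {}

    def put(self, observation: str) -> str:
        """Store the full observation and return a short, deterministic key."""
        key = hashlib.sha256(observation.encode("utf-8")).hexdigest()[:12]
        self._store[key] = observation
        return key

    def to_prompt(self, observation: str) -> str:
        """Return only the key and the head of the observation for the LLM prompt."""
        key = self.put(observation)
        head = observation[: self.head_chars]
        suffix = "...[truncated]" if len(observation) > self.head_chars else ""
        return f"[snapshot_key={key}] {head}{suffix}"

    def get(self, key: str) -> str:
        """Retrieve the full observation, e.g. when a tool needs all of it."""
        return self._store[key]


store = SnapshotStore(head_chars=10)
long_log = "ERROR " * 100
prompt_fragment = store.to_prompt(long_log)
key = store.put(long_log)
```

The prompt carries only `prompt_fragment`, while a downstream tool can call `store.get(key)` to read the complete observation.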
Key Findings
RCAgent substantially improves root-cause text quality over ReAct on the Flink offline set (root-cause METEOR 15.15 vs. 6.44, Table 1).
RCAgent reduces invalid or malformed agent actions and increases successful runs.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Root cause quality (METEOR) | 15.15 (RCAgent) | 6.44 (ReAct) | +8.71 | Offline Flink labeled set (161 jobs) | Table 1 root cause METEOR | Table 1 |
| Solution quality (METEOR) | 16.45 (RCAgent w/ TSC LLM) | 12.94 (RCAgent w/o TSC) | +3.51 | Offline Flink labeled set | Table 2 solution METEOR; TSC gains | Table 2 |
What To Try In 7 Days
Prototype OBSK: store large observations by snapshot key and pass only heads to the LLM prompt.
Replace raw query tools with semantically-minimal wrappers (ID-only inputs) to reduce invalid calls.
Add JsonRegen-like repair for structured outputs before tool execution to cut malformed interactions.
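The last two items above can be prototyped together as a validation step that runs before any tool executes. The sketch below is not JsonRegen itself: it is a lightweight local repair (extract the outermost braces, drop trailing commas) that returns `None` when the output should be regenerated. The action schema (`tool`, `job_id`), the ID pattern, and the function names are assumptions for illustration.

```python
import json
import re

# Hypothetical ID format; adjust to whatever your job IDs look like.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")


def repair_json(raw: str):
    """Best-effort local repair; returns a dict, or None if regeneration is needed."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = raw[start : end + 1]
    # Drop trailing commas before a closing brace or bracket.
    # Caveat: this naive rule could also alter such sequences inside string values.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def validate_action(raw: str, allowed_tools: set[str]):
    """Accept only actions of the form {"tool": <known tool>, "job_id": <ID>}."""
    action = repair_json(raw)
    if action is None:
        return None
    tool = action.get("tool")
    if not isinstance(tool, str) or tool not in allowed_tools:
        return None
    job_id = action.get("job_id")
    if not isinstance(job_id, str) or not JOB_ID_PATTERN.match(job_id):
        return None
    # Return only the minimal, ID-only payload; extra fields are discarded.
    return {"tool": tool, "job_id": job_id}


raw_output = 'Sure! Here is the action: {"tool": "get_log", "job_id": "job-42",}'
action = validate_action(raw_output, allowed_tools={"get_log"})
```

A `None` result is the signal to re-prompt the model rather than execute a malformed call.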
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Collaboration
Optimization Features
Token Efficiency
System Optimization
Inference Optimization
Reproducibility
Risks & Boundaries
Limitations
Evaluations focus on Apache Flink jobs and internal datasets; cross-system generalization untested.
Relies on a locally hosted 13B model; performance will vary with different LLMs and infrastructure.
When Not To Use
When you can safely use stronger external LLM APIs and prefer simpler fine-tuning pipelines.
When latency requirements forbid multi-step tool invocation and sampling.
Failure Modes
Hallucinated evidence if the log expert's fuzzy matching is bypassed.
High invalid action rate with stochastic decoding (nucleus sampling) or poor tool wrappers.
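One way to guard against the hallucinated-evidence failure mode is to keep only evidence that approximately matches a real log line. The sketch below is an assumption about how such a check could look, not the paper's log expert: it uses `difflib.get_close_matches` from the Python standard library, and the function name `ground_evidence` and the 0.8 cutoff are hypothetical.

```python
from difflib import get_close_matches


def ground_evidence(claimed: list[str], log_lines: list[str], cutoff: float = 0.8):
    """Split claimed evidence into grounded log lines and rejected quotes.

    A quote is grounded if some real log line is at least `cutoff` similar to it;
    the real log line (not the model's paraphrase) is what gets kept.
    """
    grounded, rejected = [], []
    for quote in claimed:
        match = get_close_matches(quote, log_lines, n=1, cutoff=cutoff)
        if match:
            grounded.append(match[0])
        else:
            rejected.append(quote)
    return grounded, rejected


# Hypothetical log lines and model-cited evidence.
log_lines = [
    "java.lang.OutOfMemoryError: Java heap space",
    "Checkpoint 17 expired before completing",
]
claimed = [
    "java.lang.OutOfMemoryError: Java heap space.",
    "Disk quota exceeded on node 3",
]
grounded, rejected = ground_evidence(claimed, log_lines)
```

Replacing each accepted quote with the matched log line means the final report cites text that actually exists in the logs.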

