Overview
Production Readiness
0.7
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
1
Why It Matters For Business
RCAgent provides stronger automated RCA for private cloud data while reducing failed agent actions and surfacing platform-level issues to SREs faster.
Summary TLDR
RCAgent is a tool-augmented LLM agent built on an internally deployed Vicuna-13B model. It adds three practical components for cloud Root Cause Analysis (RCA): OBSK snapshot keys to manage long context, LLM-based expert tools for code/log analysis, and trajectory-level Self-Consistency (TSC) to aggregate multiple action traces. On an Alibaba Flink dataset and live out-of-domain jobs, RCAgent beats a ReAct baseline across metrics, yields far fewer invalid actions, and is deployed as a feedback step for human SREs.
Problem Statement
Cloud RCA needs flexible decision-making over noisy, long logs and private production data. Existing LLM approaches either fine-tune large external models (privacy risk) or act only as analyzers, not autonomous agents. Challenges include data privacy (can't call external APIs), managing very long context, and preventing invalid tool actions from weaker local LLMs.
Main Contribution
RCAgent: a production-focused tool-augmented LLM agent framework using a locally hosted Vicuna-13B model for privacy.
OBSK (Observation Snapshot Key): a head+snapshot key and key-value store to keep prompts short while allowing full-observation retrieval.
Expert agents: LLM-based code and log analysis tools that run recursive searches and RAG over clustered log chunks.
Stabilizations: JsonRegen for robust JSON outputs and error-handling rules to reduce invalid actions.
Trajectory-level Self-Consistency (TSC): a mid-way sampling+aggregation method to improve final answers without sampling whole trajectories.
Real deployment and evaluation on Alibaba Cloud Flink jobs, including online OoD evaluation and human ratings.
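As a rough illustration of the TSC idea in the list above, the sketch below samples several final answers from one shared trajectory prefix and keeps the answer that agrees most with the others. The aggregation rule (highest mean similarity, with token-overlap Jaccard standing in for an embedding model) and every name here (`trajectory_self_consistency`, `sample_final`, `most_consistent`) are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def most_consistent(candidates: Sequence[str],
                    sim: Callable[[str, str], float] = jaccard) -> str:
    """Return the candidate with the highest mean similarity to the rest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def score(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(sim(candidates[i], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=score)
    return candidates[best]


def trajectory_self_consistency(prefix: List[str],
                                sample_final: Callable[[List[str]], str],
                                n_samples: int = 5) -> str:
    """Sample only the final step n_samples times from a shared prefix.

    `prefix` holds the thought/action/observation steps produced once
    (for example with greedy decoding); `sample_final` is a hypothetical
    callable that asks the LLM for a final answer with sampling enabled.
    """
    candidates = [sample_final(prefix) for _ in range(n_samples)]
    return most_consistent(candidates)
```

Because the prefix is generated once and reused, the extra cost is limited to the sampled final steps rather than full trajectories.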
Key Findings
RCAgent substantially improves root-cause text quality over ReAct on the Flink offline set.
RCAgent reduces invalid or malformed agent actions and increases successful runs.
Trajectory-level Self-Consistency (TSC) further boosts solution quality and human usefulness.
Online deployment produced higher human helpfulness and responsibility precision than baselines.
Results
Metrics reported (values not shown): root cause quality (METEOR), solution quality (METEOR), evidence quality (METEOR), trajectory stability (Pass Rate), invalid action rate, human helpfulness (H-Helpfulness).
What To Try In 7 Days
Prototype OBSK: store large observations by snapshot key and pass only heads to the LLM prompt.
Replace raw query tools with semantically-minimal wrappers (ID-only inputs) to reduce invalid calls.
Add JsonRegen-like repair for structured outputs before tool execution to cut malformed interactions.
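A JsonRegen-like repair step from the last item could be prototyped roughly as follows. This is a minimal sketch under stated assumptions: the cheap textual fixes and the `regenerate` callback (a hypothetical hook that would ask the LLM to re-emit the structure) are illustrative and are not the paper's exact procedure.

```python
import json
import re
from typing import Callable, Optional


def _cheap_fix(text: str) -> str:
    """Keep the outermost {...} span and drop trailing commas.

    Caveat: the trailing-comma regex is naive and could alter string
    values that happen to contain ',}' or ',]'.
    """
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        text = text[start:end + 1]
    return re.sub(r",\s*([}\]])", r"\1", text)


def repair_json(raw: str,
                regenerate: Optional[Callable[[str], str]] = None,
                max_retries: int = 2) -> dict:
    """Parse an LLM action as a JSON object, repairing it if needed.

    Order of attempts: parse as-is, parse after cheap fixes, then ask
    `regenerate` (hypothetical LLM call) for a new string and repeat.
    """
    candidate = raw
    for attempt in range(max_retries + 1):
        for text in (candidate, _cheap_fix(candidate)):
            try:
                parsed = json.loads(text)
            except json.JSONDecodeError:
                continue
            if isinstance(parsed, dict):
                return parsed
        if regenerate is None or attempt == max_retries:
            break
        candidate = regenerate(candidate)
    raise ValueError("could not recover a JSON object")
```

Running the repair before tool execution means a malformed action surfaces as one recoverable error rather than a failed tool call.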
Agent Features
Memory
- Key-value store for OBSK snapshots (long observation retrieval)
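A minimal sketch of such a snapshot store, assuming a hash-derived key and a fixed-size head: long observations are stored in full, and only the head plus the key goes into the prompt. The class name `SnapshotStore`, the 8-character SHA-256 key, and the marker format are hypothetical choices, not details taken from the paper.

```python
import hashlib
from typing import Dict


class SnapshotStore:
    """Hypothetical key-value store for long tool observations."""

    def __init__(self, head_chars: int = 200) -> None:
        self.head_chars = head_chars
        self._store: Dict[str, str] = {}

    def put(self, observation: str) -> str:
        """Store the observation and return the text shown to the LLM.

        Short observations pass through unchanged. Long ones are cut to
        their head and tagged with a key for later retrieval.
        """
        if len(observation) <= self.head_chars:
            return observation
        digest = hashlib.sha256(observation.encode("utf-8")).hexdigest()
        key = digest[:8]  # assumed key length; collisions are possible
        self._store[key] = observation
        head = observation[: self.head_chars]
        return f"{head}... [snapshot_key={key}, total_chars={len(observation)}]"

    def get(self, key: str) -> str:
        """Return the full observation for a key (KeyError if unknown)."""
        return self._store[key]
```

In a controller loop, the key could be passed to an expert tool that reads the full observation via `get`, so the controller prompt only ever carries the head.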
Planning
- Thought-action-observation loop
- Trajectory-level Self-Consistency (mid-way sampling)
Tool Use
- Tool-augmented LLMs
- Function-like information-gathering tools (ID-only inputs)
- LLM expert agents for code/log analysis
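One way to read "function-like tools with ID-only inputs" is a wrapper that accepts a single bare ID, validates it, and returns errors as observations instead of raising. The sketch below assumes that reading; the ID pattern, the `id_only_tool` helper, and the in-memory `FAKE_LOGS` backend are all hypothetical.

```python
import re
from typing import Callable, Dict

# Assumed ID format: letters, digits, underscore, hyphen, up to 64 chars.
JOB_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def id_only_tool(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a backend lookup so the agent can only pass a bare job ID."""

    def tool(job_id: str) -> str:
        if not isinstance(job_id, str) or not JOB_ID_PATTERN.fullmatch(job_id):
            # Return the error as an observation the agent can react to.
            return f"ERROR: invalid job_id {job_id!r}; pass the bare job ID only"
        try:
            return fetch(job_id)
        except KeyError:
            return f"ERROR: no record for job_id {job_id!r}"

    return tool


# Stand-in for a real log query service.
FAKE_LOGS: Dict[str, str] = {
    "job-001": "java.lang.OutOfMemoryError: Java heap space",
}

get_job_log = id_only_tool(lambda job_id: FAKE_LOGS[job_id])
```

With this shape, a weaker model has no query language to get wrong: `get_job_log("job-001")` returns the log text, while a free-form query string is rejected with an explanatory message.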
Frameworks
- ReAct-style prompting (modified)
- Self-Consistency (SC) and Trajectory-level SC (TSC)
Is Agentic
true
Architectures
- Controller agent (thought-action-observation loop)
- Expert agents (LLM-based analytical tools)
Collaboration
- Controller invokes expert agents and retrieves observations
Optimization Features
Token Efficiency
- OBSK to reduce tokens in controller prompt
System Optimization
- Semantically-minimal tools to avoid heavy tool invocation
Inference Optimization
- vLLM backend for efficient serving
- Greedy decoding by default for stability
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Evaluations focus on Apache Flink jobs and internal datasets; cross-system generalization untested.
- Relies on a locally hosted 13B model; performance will vary with different LLMs and infrastructure.
- Offline labeled set reduced to 161 jobs for balance; sample size for rare causes is limited.
When Not To Use
- When you can safely use stronger external LLM APIs and prefer simpler fine-tuning pipelines.
- When latency requirements forbid multi-step tool invocation and sampling.
Failure Modes
- Hallucinated evidence if log expert fuzzy matching is bypassed.
- High invalid action rate with stochastic decoding (nucleus sampling) or poor tool wrappers.
- Malformed structured outputs if JsonRegen is disabled.
Core Entities
Models
- Vicuna-13B-V1.5-16K
- GTE-LARGE (embedding)
- gpt-4-0613 (frozen, used as judge)
Metrics
- METEOR
- BLEURT
- BARTScore
- EmbScore
- NUBIA
- G-Correctness
- G-Helpfulness
- H-Helpfulness
- Pass Rate
- Invalid Rate
Datasets
- Collected 15,616 anomalous jobs (1 month); filtered to ~5,000; labeled offline set of 161 jobs (class-balanced)
Context Entities
Models
- Vicuna used for inference; random sampling during SC uses Vicuna defaults
Datasets
- Flink Advisor history subset (for retrieval examples, no overlap with labels)
- Platform/runtime/infrastructure logs in Alibaba SLS

