Hierarchical Agentic RAG: small LMs + prompt pools to boost forecasting, anomaly detection, and imputation

August 18, 2024 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

4

Authors

Chidaksh Ravuru, Sagar Srinivas Sakhinana, Venkataramana Runkana

Why It Matters For Business

A modular Agentic-RAG can reduce forecasting errors and improve anomaly detection on operational time-series (traffic, industrial telemetry), enabling better planning and faster incident detection while allowing independent updates to sub-modules.

Summary TLDR

The paper introduces an agentic Retrieval-Augmented Generation (Agentic-RAG) system for time-series tasks. A master agent routes queries to task-specialized sub-agents; each sub-agent is a small pre-trained language model (Gemma or Llama variants) fine-tuned with instruction tuning and Direct Preference Optimization (DPO). Sub-agents retrieve relevant key-value "prompt pools" (historical pattern snippets) via cosine similarity and concatenate retrieved prompts with input before projection. Experiments on traffic and industrial benchmarks (PeMSD*, METR-LA, PEMS-BAY, SWaT, WADI, SMAP, MSL, TEP, HAI, ETT) show consistent gains: for example PEMS-BAY horizon@3 RMSE drops to 1.62 (Agentic-RAG w/

Problem Statement

Time-series models struggle with high dimensionality, non-stationarity and the fixed-window assumption. Small pretrained language models can be cheaply adapted but lack time-series knowledge. Existing methods either use task-specific architectures or fixed-length history windows that fail under distribution shifts.

Main Contribution

Agentic-RAG: hierarchical master + specialized sub-agents that route tasks (forecasting, anomaly detection, imputation, classification).

Differentiable dynamic prompt pools: key-value prompt repositories that store distilled historical patterns and are retrieved by similarity.

Practical fine-tuning recipe: instruction tuning + PEFT (QLoRA) + Direct Preference Optimization (DPO) on small LMs (Gemma, Llama) to adapt them to time-series tasks.

Extensive empirical evaluation showing consistent improvements over standard baselines across forecasting, anomaly detection, imputation, and classification benchmarks.
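The hierarchical routing in the first contribution can be pictured with a minimal sketch. The four task names come from the summary above; everything else is a hypothetical stand-in. In particular, `SubAgent`, `make_stub`, the `KEYWORDS` table, and the keyword-counting `route` function are illustrative assumptions: the paper's master agent is a language model, not a keyword matcher, and real sub-agents would call fine-tuned Gemma or Llama models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubAgent:
    """Hypothetical wrapper around one task-specialized small LM."""
    task: str
    handler: Callable[[List[float]], str]


def make_stub(task: str) -> SubAgent:
    # Stub handler: a real sub-agent would run a fine-tuned model here.
    return SubAgent(task, lambda series: f"{task} on {len(series)} points")


TASKS = ("forecasting", "anomaly detection", "imputation", "classification")
AGENTS: Dict[str, SubAgent] = {t: make_stub(t) for t in TASKS}

# Assumed keyword table; it stands in for the master agent's routing decision.
KEYWORDS = {
    "forecasting": ("forecast", "predict", "next"),
    "anomaly detection": ("anomal", "outlier", "fault"),
    "imputation": ("missing", "impute", "fill"),
    "classification": ("classify", "label", "category"),
}


def route(query: str) -> SubAgent:
    """Return the sub-agent whose keywords appear most often in the query."""
    q = query.lower()
    scores = {task: sum(k in q for k in kws) for task, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        raise ValueError(f"no sub-agent matches query: {query!r}")
    return AGENTS[best]
```

For example, `route("Forecast the next 3 steps").handler([1.0, 2.0, 3.0])` dispatches to the forecasting stub. Keeping each sub-agent behind a uniform handler is what allows one of them to be retrained or swapped without touching the others.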

Key Findings

Agentic-RAG reduces forecasting error on traffic benchmarks.

Numbers: PEMS-BAY Horizon@3 RMSE 1.62 vs DGCRN 2.69 (Table 4)

Agentic-RAG improves anomaly detection F1 across industrial benchmarks.

Numbers: SWaT F1 92.59% vs GRELEN 89.10% (Table 5)

Imputation degrades smoothly as missing rate increases.

Numbers: PeMSD3 RMSE rises from 19.48 (0%) to 24.14 (50% point missing) (Table 7)

Each proposed component contributes to gains; removing instruction tuning hurts most.

Numbers: Ablation w/o instruction-tuning MAE 21.62 vs full MAE 13.01 on PeMSD3 (Table 14)

Results

Forecasting RMSE (PEMS-BAY horizon@3)

Value: 1.62 (Agentic-RAG w/Llama-8B)

Baseline: DGCRN 2.69

Anomaly detection F1 (SWaT)

Value: 92.59%

Baseline: GRELEN 89.10%

Imputation RMSE (PeMSD3)

Value: 19.48 (0% missing) to 24.14 (50% point missing)

Ablation MAE (PeMSD3)

Value: Full 13.01 vs w/o instruction-tuning 21.62
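The imputation result (RMSE as a function of point-missing rate) suggests a simple harness for checking how a model degrades on your own data. The sketch below is an assumption-laden illustration, not the paper's evaluation protocol: "point missing" is taken to mean independently hidden individual entries, the intermediate rates are arbitrary, and `column_mean_fill` is only a placeholder for a real model.

```python
import numpy as np


def apply_point_missing(series, rate, seed=0):
    """Hide a random fraction `rate` of individual entries by setting them to NaN."""
    rng = np.random.default_rng(seed)
    mask = rng.random(series.shape) < rate   # True marks a hidden entry
    corrupted = series.astype(float)         # astype returns a copy
    corrupted[mask] = np.nan
    return corrupted, mask


def column_mean_fill(corrupted):
    """Placeholder model: replace NaNs with the per-column mean of observed values."""
    means = np.nanmean(corrupted, axis=0)
    return np.where(np.isnan(corrupted), means, corrupted)


def rmse_by_missing_rate(series, model_fn, rates=(0.0, 0.1, 0.3, 0.5)):
    """Corrupt `series` at each rate, run `model_fn`, and score RMSE on all entries."""
    results = {}
    for rate in rates:
        corrupted, _ = apply_point_missing(series, rate)
        estimate = model_fn(corrupted)
        results[rate] = float(np.sqrt(np.mean((series - estimate) ** 2)))
    return results
```

Because the same seed is reused for every rate, the hidden entries at a lower rate are a subset of those at a higher rate, which makes the resulting curve easier to compare across rates.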

Who Should Care

What To Try In 7 Days

Prototype a single sub-agent: fine-tune a small LM (Gemma or Llama-8B) on one time-series task using QLoRA and instruction tuning.

Build a small prompt pool of historical patterns (key vectors + value snippets) and implement top-K cosine retrieval to condition the model.

Run an ablation: compare the model with and without prompt retrieval, and with and without DPO, to measure the impact of each on your dataset.
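The prompt-pool step above (key vectors, value snippets, top-K cosine retrieval, then concatenation with the input) can be sketched in a few lines of NumPy. This is a minimal, non-differentiable illustration under assumed shapes; the function names `top_k_prompts` and `condition_input` and all dimensions are hypothetical, and the paper's pools are learned jointly with the model rather than fixed arrays.

```python
import numpy as np


def top_k_prompts(query, keys, values, k=2):
    """Return the k value prompts whose keys are most cosine-similar to the query.

    query:  (d,)      embedding of the input window
    keys:   (n, d)    one key vector per stored prompt
    values: (n, m, d) one prompt of m embedding rows per key
    """
    q = query / (np.linalg.norm(query) + 1e-8)
    kn = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sims = kn @ q                    # (n,) cosine similarities
    idx = np.argsort(-sims)[:k]      # indices of the k best matches
    return values[idx], sims[idx]


def condition_input(x_emb, prompts):
    """Prepend the retrieved prompts to the embedded input sequence."""
    flat = prompts.reshape(-1, prompts.shape[-1])   # (k*m, d)
    return np.concatenate([flat, x_emb], axis=0)    # (k*m + t, d)
```

Calling `top_k_prompts(..., k=0)` returns an empty prompt set, so `condition_input` passes the input through unchanged; that gives the "without retrieval" arm of the ablation using the same code path.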

Agent Features

Memory

  • differentiable prompt pools (retrieval memory: key-value prompts)

Planning

  • master agent orchestrates and routes tasks
  • supports chaining sub-agents for multi-step tasks (not exercised here)

Tool Use

  • retrieval from prompt pools
  • ReAct prompting for stepwise reasoning
  • external tools implemented as sub-agents

Frameworks

  • ReAct
  • Agentic-RAG

Is Agentic

true

Architectures

  • hierarchical master + specialized sub-agents

Collaboration

  • sub-agents specialize by task and are orchestrated by master agent

Optimization Features

Token Efficiency

  • use of grouped/neighbor attention (SelfExtend) to extend context without full finetuning

Infra Optimization

  • NVIDIA GPUs; reporting of GPU-hours and carbon estimates

Model Optimization

  • QLoRA (LoRA adapters on a quantized base model)

System Optimization

  • gradient accumulation and small batch sizes to fit GPUs

Training Optimization

  • instruction tuning with PEFT
  • Direct Preference Optimization (DPO) for preference alignment
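For readers unfamiliar with DPO, the standard objective is small enough to write out. The sketch below computes the usual DPO loss from sequence log-probabilities under the trained policy and a frozen reference model. How the paper builds chosen and rejected outputs for time-series tasks is not described in this summary, so the inputs here are assumed to be given, and `beta=0.1` is an arbitrary illustrative value.

```python
import numpy as np


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Mean DPO loss from per-example sequence log-probabilities.

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference model
    """
    pi_chosen, pi_rejected = np.asarray(pi_chosen), np.asarray(pi_rejected)
    ref_chosen, ref_rejected = np.asarray(ref_chosen), np.asarray(ref_rejected)
    # How much more the policy favors the chosen output than the reference does.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); logaddexp is stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy and the reference agree exactly, the loss is log 2; it falls as the policy shifts probability toward the chosen output relative to the reference.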

Inference Optimization

  • SelfExtend long-context technique to handle longer inputs

Reproducibility

Data Urls

  • PeMSD* (PeMS datasets)
  • METR-LA
  • PEMS-BAY
  • SWaT, WADI, SMAP, MSL, TEP, HAI
  • ETT (ETTh1/ETTh2/ETTm1/ETTm2)

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Needs substantial fine-tuning and prompt-pool construction effort per task/dataset.
  • Performance degrades as the missing rate increases; accuracy drops notably once roughly 30–50% of points are missing.
  • Environmental and compute costs climb with larger variants and long-context training.
  • Claims focus on traffic and certain industrial datasets; cross-domain generality not fully evaluated.

When Not To Use

  • When latency or model size is strictly limited (real-time edge with tiny compute).
  • When datasets are extremely sparse (>50% missing) without strong external context.
  • When you cannot invest in instruction tuning and preference data; gains shrink without them.

Failure Modes

  • Retrieval of wrong or irrelevant prompts, leading to biased or incorrect outputs.
  • Overfitting to prompt pool patterns and failing on unseen regime shifts.
  • Large drop in accuracy if instruction tuning or DPO steps are omitted.
  • Higher false positives/negatives when anomaly segments differ from stored prompt patterns.

Core Entities

Models

  • Gemma-2B
  • Gemma-7B
  • Llama-8B (Llama 3-8B / SelfExtend)
  • SelfExtend long-context technique

Metrics

  • MAE
  • RMSE
  • MAPE
  • Accuracy
  • Precision
  • Recall
  • F1-score
  • Fault Detection Rate (FDR)
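For reference, the regression metrics and F1 listed above follow their textbook definitions, sketched here in NumPy. The function names and the `eps` guard against division by zero in MAPE are illustrative choices; the paper may handle zero targets or point-adjusted anomaly scoring differently, and accuracy and FDR are omitted.

```python
import numpy as np


def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))


def rmse(y, yhat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))


def mape(y, yhat, eps=1e-8):
    """Mean absolute percentage error, in percent."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.mean(np.abs(y - yhat) / np.maximum(np.abs(y), eps)) * 100)


def precision_recall_f1(labels, preds):
    """Binary precision, recall, and F1 with 1 as the positive (anomaly) class."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    tp = int(np.sum((labels == 1) & (preds == 1)))
    fp = int(np.sum((labels == 0) & (preds == 1)))
    fn = int(np.sum((labels == 1) & (preds == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```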

Datasets

  • PeMSD3
  • PeMSD4
  • PeMSD7
  • PeMSD7(M)
  • PeMSD8
  • METR-LA
  • PEMS-BAY
  • SWaT
  • WADI
  • SMAP
  • MSL
  • TEP (Tennessee Eastman)
  • HAI
  • ETTh1/ETTh2/ETTm1/ETTm2

Benchmarks

  • traffic forecasting (PeMS*, METR-LA, PEMS-BAY)
  • multivariate anomaly detection (SWaT, WADI, SMAP, MSL, HAI, TEP)
  • missing data imputation
  • time-series classification