Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
4
Why It Matters For Business
A modular Agentic-RAG system can reduce forecasting errors and improve anomaly detection on operational time series (traffic, industrial telemetry). This enables better planning and faster incident detection, and sub-modules can be updated independently.
Summary TLDR
The paper introduces an agentic Retrieval-Augmented Generation (Agentic-RAG) system for time-series tasks. A master agent routes queries to task-specialized sub-agents; each sub-agent is a small pre-trained language model (Gemma or Llama variants) fine-tuned with instruction tuning and Direct Preference Optimization (DPO). Sub-agents retrieve relevant key-value "prompt pools" (historical pattern snippets) via cosine similarity and concatenate retrieved prompts with input before projection. Experiments on traffic and industrial benchmarks (PeMSD*, METR-LA, PEMS-BAY, SWaT, WADI, SMAP, MSL, TEP, HAI, ETT) show consistent gains: for example PEMS-BAY horizon@3 RMSE drops to 1.62 (Agentic-RAG w/
Problem Statement
Time-series models struggle with high dimensionality, non-stationarity and the fixed-window assumption. Small pretrained language models can be cheaply adapted but lack time-series knowledge. Existing methods either use task-specific architectures or fixed-length history windows that fail under distribution shifts.
Main Contribution
Agentic-RAG: hierarchical master + specialized sub-agents that route tasks (forecasting, anomaly detection, imputation, classification).
Differentiable dynamic prompt pools: key-value prompt repositories that store distilled historical patterns and are retrieved by similarity.
Practical fine-tuning recipe: instruction tuning + PEFT (QLoRA) + Direct Preference Optimization (DPO) on small LMs (Gemma, Llama) to adapt them to time-series tasks.
Extensive empirical evaluation showing consistent improvements over standard baselines across forecasting, anomaly detection, imputation, and classification benchmarks.
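The prompt-pool retrieval described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the pool entries, the function names (`cosine`, `retrieve_top_k`, `condition_input`), and the use of string tokens in place of learned embedding vectors are all assumptions made for readability.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query_key, pool, k=2):
    """Return the k (key, value) entries whose keys best match query_key."""
    ranked = sorted(pool, key=lambda kv: cosine(query_key, kv[0]), reverse=True)
    return ranked[:k]

def condition_input(query_key, input_tokens, pool, k=2):
    """Concatenate the retrieved value prompts in front of the input."""
    retrieved = retrieve_top_k(query_key, pool, k)
    prompt = [tok for _, value in retrieved for tok in value]
    return prompt + list(input_tokens)

# Hypothetical pool: (key vector, value snippet) pairs.
pool = [
    ([1.0, 0.0, 0.0], ["morning_peak"]),
    ([0.0, 1.0, 0.0], ["night_low"]),
    ([0.9, 0.1, 0.0], ["weekday_ramp"]),
]
conditioned = condition_input([1.0, 0.0, 0.0], ["x1", "x2"], pool, k=2)
print(conditioned)
```

In a trained system the keys and values would presumably be learnable tensors and the concatenation would happen in embedding space; the ranking and concatenation logic stays the same.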
Key Findings
Agentic-RAG reduces forecasting error on traffic benchmarks.
Agentic-RAG improves anomaly detection F1 across industrial benchmarks.
Imputation degrades smoothly as missing rate increases.
Each proposed component contributes to gains; removing instruction tuning hurts most.
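To check how imputation error behaves as the missing rate grows on your own data, one option is to hide a known fraction of points and score only the hidden positions. The sketch below assumes a synthetic sine series and uses linear interpolation as a placeholder imputer; neither comes from the paper, and the helper names are hypothetical.

```python
import math
import random

def mask_series(series, missing_rate, rng):
    """Hide a fraction of interior points; return the masked copy and hidden indices."""
    candidates = list(range(1, len(series) - 1))  # keep both endpoints observed
    hidden = sorted(rng.sample(candidates, int(missing_rate * len(candidates))))
    masked = list(series)
    for i in hidden:
        masked[i] = None
    return masked, hidden

def linear_impute(masked):
    """Placeholder imputer: interpolate linearly between observed neighbors."""
    out = list(masked)
    i = 0
    while i < len(out):
        if out[i] is None:
            left, right = i - 1, i
            while out[right] is None:
                right += 1
            for j in range(i, right):
                t = (j - left) / (right - left)
                out[j] = out[left] + t * (out[right] - out[left])
            i = right
        else:
            i += 1
    return out

def rmse_on_hidden(truth, imputed, hidden):
    """RMSE restricted to the positions that were masked out."""
    if not hidden:
        return 0.0
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden))

rng = random.Random(0)
series = [math.sin(2 * math.pi * t / 24) for t in range(240)]  # synthetic daily cycle
results = {}
for rate in (0.1, 0.3, 0.5):
    masked, hidden = mask_series(series, rate, rng)
    results[rate] = rmse_on_hidden(series, linear_impute(masked), hidden)
    print(f"missing rate {rate:.0%}: RMSE {results[rate]:.4f}")
```

Swapping `linear_impute` for a fine-tuned sub-agent gives the same missing-rate sweep with the model under test.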
Results
Forecasting RMSE (PEMS-BAY horizon@3)
Anomaly detection F1 (SWaT)
Imputation RMSE (PeMSD3)
Ablation MAE (PeMSD3)
Who Should Care
What To Try In 7 Days
Prototype a single sub-agent: fine-tune a small LM (Gemma or Llama-8B) on one time-series task using QLoRA and instruction tuning.
Build a small prompt pool of historical patterns (key vectors + value snippets) and implement top-K cosine retrieval to condition the model.
Run an ablation: compare model with and without prompt retrieval and with/without DPO to measure impact on your dataset.
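The ablation in the last step amounts to a small grid over two switches. A minimal sketch follows, assuming you supply your own `evaluate(config)` function that trains or loads the matching model variant and returns a validation error; the config keys are hypothetical names, not identifiers from the paper.

```python
from itertools import product

COMPONENTS = ("prompt_retrieval", "dpo")

def ablation_configs():
    """Enumerate every on/off combination of the two components (4 runs)."""
    return [dict(zip(COMPONENTS, flags))
            for flags in product((True, False), repeat=len(COMPONENTS))]

def run_ablation(evaluate):
    """Score each config with a user-supplied evaluate(config) -> error metric.

    Returns (config, score) pairs sorted so the lowest error comes first.
    """
    scored = [(cfg, evaluate(cfg)) for cfg in ablation_configs()]
    return sorted(scored, key=lambda pair: pair[1])

for cfg in ablation_configs():
    print(cfg)
```

Keeping the data split, seed, and training budget fixed across the four runs makes the differences attributable to the toggled component.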
Agent Features
Memory
- differentiable prompt pools (retrieval memory: key-value prompts)
Planning
- master agent orchestrates and routes tasks
- supports chaining sub-agents for multi-step tasks (not exercised here)
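The routing and chaining pattern can be illustrated with a plain dispatcher. This is a sketch under strong simplifying assumptions: the task is given as an explicit label rather than inferred by a language model, and the two sub-agents are trivial stand-ins (forward fill and a persistence forecast) rather than fine-tuned models. All class and function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

Series = List[Optional[float]]
SubAgent = Callable[[Series], Series]

class MasterAgent:
    """Dispatches a request to the sub-agent registered for its task label."""

    def __init__(self) -> None:
        self._sub_agents: Dict[str, SubAgent] = {}

    def register(self, task: str, agent: SubAgent) -> None:
        self._sub_agents[task] = agent

    def route(self, task: str, series: Series) -> Series:
        if task not in self._sub_agents:
            raise KeyError(f"no sub-agent registered for task {task!r}")
        return self._sub_agents[task](series)

    def chain(self, tasks: Sequence[str], series: Series) -> Series:
        """Feed each sub-agent's output into the next one."""
        for task in tasks:
            series = self.route(task, series)
        return series

def forward_fill(series: Series) -> Series:
    """Stand-in imputation agent: carry the last observed value forward."""
    filled, last = [], None
    for x in series:
        last = x if x is not None else last
        filled.append(last)
    return filled

def persistence_forecast(series: Series, horizon: int = 3) -> Series:
    """Stand-in forecasting agent: repeat the last value for `horizon` steps."""
    return [series[-1]] * horizon

master = MasterAgent()
master.register("imputation", forward_fill)
master.register("forecasting", persistence_forecast)
forecast = master.chain(["imputation", "forecasting"], [1.0, None, 3.0, None])
print(forecast)
```

Because each sub-agent sits behind the same callable interface, one can be retrained or replaced without touching the others.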
Tool Use
- retrieval from prompt pools
- ReAct prompting for stepwise reasoning
- external tools implemented as sub-agents
Frameworks
- ReAct
- Agentic-RAG
Is Agentic
true
Architectures
- hierarchical master + specialized sub-agents
Collaboration
- sub-agents specialize by task and are orchestrated by master agent
Optimization Features
Token Efficiency
- use of grouped/neighbor attention (SelfExtend) to extend context without full finetuning
Infra Optimization
- NVIDIA GPUs; reporting of GPU-hours and carbon estimates
Model Optimization
- LoRA adapters (applied as QLoRA, i.e. on a quantized base model)
System Optimization
- gradient accumulation and small batch sizes to fit GPUs
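Gradient accumulation trades memory for time: gradients from several small micro-batches are combined before a single parameter update, so the update matches what one large batch would have produced. The sketch below shows that equivalence on a one-parameter least-squares model; the data, learning rate, and function names are illustrative assumptions, not values from the paper.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y ~ w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Combine micro-batch gradients, weighting each by its share of the data."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        total += grad_mse(w, micro) * len(micro) / len(data)
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # toy (x, y) pairs
w = 0.5
full = grad_mse(w, data)                               # one large batch
accum = accumulated_grad(w, data, micro_batch_size=2)  # two micro-batches
w_next = w - 0.01 * accum                              # single optimizer step
print(full, accum, w_next)
```

In a deep-learning framework the same idea is usually expressed by scaling each micro-batch loss and calling the optimizer step only every N micro-batches.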
Training Optimization
- instruction tuning with PEFT
- Direct Preference Optimization (DPO) for preference alignment
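For reference, the standard DPO objective for one preference pair is the negative log-sigmoid of a scaled log-ratio margin between the chosen and rejected responses, measured against a frozen reference model. The sketch below computes that per-pair loss from sequence log-probabilities; the argument names and the default `beta=0.1` are assumptions, and the source does not state which value or library the authors used.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = [log pi(y_w|x) - log pi_ref(y_w|x)]
           - [log pi(y_l|x) - log pi_ref(y_l|x)]
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow for either sign.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: margin is 0, so the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raises the chosen response and lowers the rejected one: loss falls.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

For time-series sub-agents, a preference pair could be a more accurate versus a less accurate forecast for the same input, though how the authors built their pairs is not specified here.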
Inference Optimization
- SelfExtend long-context technique to handle longer inputs
Reproducibility
Data Urls
- PeMSD* (PeMS datasets)
- METR-LA
- PEMS-BAY
- SWaT, WADI, SMAP, MSL, TEP, HAI
- ETT (ETTh1/ETTh2/ETTm1/ETTm2)
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Needs substantial fine-tuning and prompt-pool construction effort per task/dataset.
- Performance degrades as the missing rate increases; missingness above roughly 30–50% reduces accuracy notably.
- Environmental and compute costs climb with larger variants and long-context training.
- Claims focus on traffic and certain industrial datasets; cross-domain generality not fully evaluated.
When Not To Use
- When latency or model size is strictly limited (real-time edge with tiny compute).
- When datasets are extremely sparse (>50% missing) without strong external context.
- When you cannot invest in instruction tuning and preference data (gains shrink without them).
Failure Modes
- Retrieval of wrong or irrelevant prompts, leading to biased or incorrect outputs.
- Overfitting to prompt pool patterns and failing on unseen regime shifts.
- Large drop in accuracy if instruction tuning or DPO steps are omitted.
- Higher false positives/negatives when anomaly segments differ from stored prompt patterns.
Core Entities
Models
- Gemma-2B
- Gemma-7B
- Llama-8B (Llama 3-8B / SelfExtend)
- SelfExtend long-context technique
Metrics
- MAE
- RMSE
- MAPE
- Accuracy
- Precision
- Recall
- F1-score
- Fault Detection Rate (FDR)
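Most of the metrics listed above have standard definitions, sketched here in plain Python for quick sanity checks. Two choices are assumptions of this sketch: MAPE skips targets equal to zero to avoid division by zero, and precision, recall, and F1 treat label 1 as the anomaly class. Fault Detection Rate is left out because its exact definition in the paper is not given here.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error (in %), skipping zero targets."""
    pairs = [(a, b) for a, b in zip(y_true, y_pred) if a != 0]
    return 100.0 * sum(abs((a - b) / a) for a, b in pairs) / len(pairs)

def precision_recall_f1(labels, preds):
    """Binary precision, recall, and F1 with 1 as the positive (anomaly) class."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(mae([1, 2, 3], [1, 2, 5]), rmse([1, 2, 3], [1, 2, 5]))
print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```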
Datasets
- PeMSD3
- PeMSD4
- PeMSD7
- PeMSD7(M)
- PeMSD8
- METR-LA
- PEMS-BAY
- SWaT
- WADI
- SMAP
- MSL
- TEP (Tennessee Eastman)
- HAI
- ETTh1/ETTh2/ETTm1/ETTm2
Benchmarks
- traffic forecasting (PeMS*, METR-LA, PEMS-BAY)
- multivariate anomaly detection (SWaT, WADI, SMAP, MSL, HAI, TEP)
- missing data imputation
- time-series classification

