Use multi-agent RAG plus a hybrid vector-graph memory to auto-generate traceable test plans and cases, cutting test-document work by ~85% in

October 12, 2025 · 8 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.8

Citation Count

0

Authors

Mohanakrishnan Hariharan, Satish Arvapalli, Seshu Barma, Evangeline Sheela

Links

Abstract / PDF

Why It Matters For Business

Automates time-consuming test-document work, preserves traceability for regulated enterprise projects, and can shrink timelines and costs if you can manage integration and KB upkeep.

Summary TLDR

This paper builds an enterprise software-testing system that combines retrieval-augmented generation (RAG) with a relationship-aware graph and multiple specialized LLM agents. In an SAP migration case study, the authors report accuracy rising from ~65% (basic RAG) to 94.8% (Agentic RAG), an 85% reduction in artifact creation time, 35% cost savings, and improved traceability. The system uses a dual store (vector database plus TigerGraph), multi-layer prompts, and dynamic model routing (Mistral 7B for cheap tasks, Gemini Pro for hard cases). The results are promising but come from internal deployments and proprietary integrations, so expect real engineering effort to adapt the system and to maintain the hybrid KB and integration plumbing.

Problem Statement

Quality engineers spend ~30–40% of their time writing test artifacts. Traditional RAG loses business relationships during retrieval. Manual methods don’t scale for enterprise systems (e.g., SAP) and lack traceability across requirements, tests, and results.

Main Contribution

Hybrid vector-graph knowledge system that combines semantic search (vectors) with relationship-aware graph traversal to preserve business context.

Multi-agent orchestration layer with specialized agents for legacy analysis, mapping changes, integration points, test case creation, and compliance checks.

Enhanced contextualization engine: multi-stage context assembly, conflict resolution, and a seven-layer validation pipeline.

Comprehensive bidirectional traceability framework linking requirements, test cases, execution results, and change impact.

Real-world enterprise deployment on SAP migration projects with reported accuracy, efficiency, and cost metrics.
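The hybrid retrieval idea in the first contribution can be illustrated with a toy, in-memory sketch. Everything in it is an assumption made for illustration: the item names, the hand-written 3-dimensional embeddings, the `EDGES` adjacency map, and the `hybrid_retrieve` function are hypothetical stand-ins for a real vector database and TigerGraph, and the paper does not publish its retrieval code.

```python
import math

# Hypothetical toy embeddings; a real system would use a sentence-embedding model.
EMBEDDINGS = {
    "REQ-1 create purchase order": [0.9, 0.1, 0.0],
    "REQ-2 approve invoice": [0.1, 0.9, 0.0],
    "TC-7 post goods receipt": [0.0, 0.2, 0.9],
}

# Hypothetical relationship graph: item -> business-related items.
EDGES = {
    "REQ-1 create purchase order": ["TC-7 post goods receipt"],
    "REQ-2 approve invoice": [],
    "TC-7 post goods receipt": [],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec, top_k=1, hops=1):
    """Pick seed items by vector similarity, then expand along graph edges."""
    ranked = sorted(
        EMBEDDINGS,
        key=lambda name: cosine(query_vec, EMBEDDINGS[name]),
        reverse=True,
    )
    results = ranked[:top_k]
    frontier = list(results)
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbor in EDGES.get(node, []):
                if neighbor not in results:
                    results.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return results


# A query vector close to "create purchase order".
context = hybrid_retrieve([1.0, 0.0, 0.0])
print(context)
```

In this toy data the goods-receipt test case has zero cosine similarity to the query, so a vector-only search with `top_k=1` would never return it; the graph edge is what pulls it into the context.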

Key Findings

Agentic multi-agent RAG improves test artifact accuracy compared to Basic RAG.

Numbers: Basic RAG 65.2% -> Agentic RAG 94.8%

Artifact creation time dropped dramatically in the reported deployments.

Numbers: Time reduced 85% (240h -> 36h per project phase)

System-level outcomes and cost impacts reported for enterprise projects.

Numbers: 35% cost savings; 25,000 test cases created with 98.7% functional coverage

Ablation shows that each component matters materially; contextualization has the largest single impact.

Numbers: Removing Enhanced Contextualization degrades accuracy by 18.2%

Paper contains conflicting go-live acceleration claims.

Numbers: Abstract: 2-month acceleration; Sec IV.C: 16-month acceleration

Results

Accuracy

Value: 94.8%

Baseline: Basic RAG 65.2%

Accuracy

Value: 94.8%

Baseline: Basic RAG 65%

Accuracy

Value: 92.3%

Traceability coverage (requirements->tests)

Value: 98.1%

Baseline: Manual 73.6%

Artifact creation time reduction

Value: 85% (240h -> 36h)

Baseline: manual process

Cost savings (reported)

Value: 35%

Baseline: project baseline
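As a quick sanity check, the reported percentages can be reproduced from the raw figures above. This is only arithmetic on numbers quoted in this summary; the helper name `percent_reduction` is made up for the example.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Artifact creation time: 240h -> 36h per project phase.
hours_reduction = round(percent_reduction(240, 36), 1)

# Absolute gains in percentage points, from the reported values.
accuracy_gain_points = round(94.8 - 65.2, 1)
traceability_gain_points = round(98.1 - 73.6, 1)

print(hours_reduction, accuracy_gain_points, traceability_gain_points)
```

The 240h to 36h figure is consistent with the stated 85% reduction, and the accuracy and traceability improvements work out to 29.6 and 24.5 percentage points respectively.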

Who Should Care

What To Try In 7 Days

Run a small pilot: index 100–500 legacy test items into a vector DB and TigerGraph and measure retrieval relevance.

Implement a single specialized agent (e.g., Modernized Test Case Agent) to generate and validate 50 test cases from real requirements.

Set up a basic traceability matrix for one module and compare coverage before/after automated generation.
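For the traceability step, a before/after coverage comparison can start as small as the sketch below. The requirement IDs, test-case IDs, and the `traceability_coverage` function are hypothetical; the paper's own framework is bidirectional and also tracks execution results and change impact, which this example does not attempt.

```python
def traceability_coverage(requirements, links):
    """Return (percent of requirements with at least one linked test, uncovered IDs).

    `links` is an iterable of (requirement_id, test_case_id) pairs.
    """
    known = set(requirements)
    covered = {req for req, _ in links if req in known}
    uncovered = sorted(known - covered)
    pct = 100 * len(covered) / len(known) if known else 0.0
    return pct, uncovered


# Hypothetical module with four requirements.
requirements = ["REQ-1", "REQ-2", "REQ-3", "REQ-4"]

# Links before automated generation (manual test cases only).
links_before = [("REQ-1", "TC-1")]

# Links after adding generated test cases.
links_after = links_before + [("REQ-2", "TC-2"), ("REQ-3", "TC-3")]

pct_before, missing_before = traceability_coverage(requirements, links_before)
pct_after, missing_after = traceability_coverage(requirements, links_after)
print(pct_before, pct_after, missing_after)
```

Listing the uncovered requirement IDs alongside the percentage makes the pilot actionable: the gap list is what a reviewer would hand back to the generation agent.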

Agent Features

Memory

  • Hybrid retrieval memory (vector store for semantics, graph for relationships)
  • Bidirectional traceability as persistent links

Planning

  • Task decomposition into specialized agents
  • Strategy synthesis from objectives and history

Tool Use

  • TigerGraph for relationship traversal
  • Vector DB (SingleStore) for semantic search
  • Sentence Transformer for embeddings
  • Kubernetes/Docker for orchestration

Frameworks

  • Multi-layer prompt engineering (context, spec, template, validation, enhancement)
  • Seven-layer context validation pipeline
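A minimal sketch of how the five named prompt layers could be assembled into one prompt. Only the layer names come from the summary above; the layer contents, the section-header format, and the `build_prompt` function are assumptions, and the seven-layer validation pipeline is not modeled here.

```python
# Layer names as listed above; the fixed ordering is an assumption.
LAYER_ORDER = ["context", "spec", "template", "validation", "enhancement"]


def build_prompt(layers):
    """Join the prompt layers in a fixed order, failing fast if one is missing."""
    missing = [name for name in LAYER_ORDER if not layers.get(name)]
    if missing:
        raise ValueError(f"missing prompt layers: {missing}")
    sections = [f"## {name.upper()}\n{layers[name].strip()}" for name in LAYER_ORDER]
    return "\n\n".join(sections)


# Hypothetical layer contents for a single test-case request.
prompt = build_prompt({
    "context": "Legacy purchase-order flow retrieved from the knowledge base.",
    "spec": "Generate a test case for purchase-order creation.",
    "template": "Fields: ID, Preconditions, Steps, Expected Result.",
    "validation": "Every step must map to a requirement ID.",
    "enhancement": "Add one negative scenario.",
})
print(prompt)
```

Keeping the layers as separate named inputs, rather than one hand-edited string, makes it easy to ablate a single layer and measure its effect.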

Is Agentic

true

Architectures

  • Multi-agent orchestration layer
  • Dual-database (vector + graph) hybrid architecture
  • Dynamic model routing across LLMs

Collaboration

  • Agent-to-agent orchestration and handoff
  • Conflict resolution with rule-based priority system
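One way a rule-based priority system for conflict resolution might look is sketched below. The agent labels, the numeric priorities, and the `resolve` function are all hypothetical; the summary only states that conflicts between agents are settled by rule-based priority, not what the rules are.

```python
# Hypothetical priorities: a higher number wins when agents disagree.
PRIORITY = {"compliance": 3, "integration": 2, "legacy_analysis": 1}


def resolve(proposals):
    """Resolve conflicting agent proposals field by field.

    `proposals` is a list of (agent, field, value) tuples. For each field the
    value from the highest-priority agent is kept; on a tie the earlier
    proposal stays.
    """
    resolved = {}
    for agent, field, value in proposals:
        current = resolved.get(field)
        if current is None or PRIORITY[agent] > PRIORITY[current[1]]:
            resolved[field] = (value, agent)
    return resolved


proposals = [
    ("legacy_analysis", "approval_required", False),
    ("compliance", "approval_required", True),
    ("integration", "target_interface", "IF-042"),
]
decision = resolve(proposals)
print(decision)
```

Storing the winning agent next to each value keeps the decision auditable, which fits the document's emphasis on traceability.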

Optimization Features

Infra Optimization

  • Containerized deployment (Docker + Kubernetes)
  • Distributed vector store and TigerGraph Cloud

System Optimization

  • Microservices architecture with health monitoring and auto-failover
  • Horizontal scaling for vector and graph stores

Inference Optimization

  • Dynamic model routing (use smaller LLMs for simple tasks, larger for complex reasoning)
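Dynamic model routing can be approximated with a simple complexity heuristic, as in the sketch below. The scoring rules, the threshold, the task fields, and the model labels (plain strings, not real API model identifiers) are assumptions for illustration; the summary only says that smaller models such as Mistral 7B handle simple tasks and larger ones such as Gemini Pro handle complex reasoning.

```python
def complexity_score(task):
    """Crude, hypothetical complexity estimate for a test-generation task."""
    score = 0
    if len(task["requirement_text"].split()) > 50:
        score += 1  # long requirements tend to need more reasoning
    score += len(task.get("linked_interfaces", []))  # one point per integration
    if task.get("needs_compliance_review"):
        score += 2
    return score


def route_model(task, threshold=3):
    """Return a label for the model tier that should handle the task."""
    return "gemini-pro" if complexity_score(task) >= threshold else "mistral-7b"


simple_task = {"requirement_text": "User can view an invoice."}
complex_task = {
    "requirement_text": "Post goods receipt and update downstream ledgers.",
    "linked_interfaces": ["IF-001", "IF-002"],
    "needs_compliance_review": True,
}
print(route_model(simple_task), route_model(complex_task))
```

In practice the threshold would be tuned against measured accuracy and cost per tier rather than fixed up front.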

Reproducibility

Open Source Status

  • no

Risks & Boundaries

Limitations

  • Domain specialization to SAP, employee, and finance systems; may need extra work to adapt to other domains.
  • Hybrid KB requires ongoing maintenance as business processes change.
  • Enterprise integration complexity: connecting 200+ interfaces and custom T-codes requires engineering effort.
  • Key timeline claims are inconsistent within the paper and need clarification.

When Not To Use

  • Small projects where integration overhead outweighs automation benefits.
  • Environments that cannot support a dual-database architecture or enforce strict data-redaction policies.
  • Use cases needing fully open-source toolchains or where code/data must be public.

Failure Modes

  • Context fragmentation if graph relationships are incomplete or poorly modeled.
  • Quality drop if enhanced contextualization or conflict-resolution rules are not tuned.
  • Drift or stale knowledge when the hybrid KB is not actively maintained.
  • Integration failures when enterprise connectors or PII redaction are misconfigured.

Core Entities

Models

  • Gemini Pro
  • Mistral 7B
  • GPT-4
  • Sentence Transformer

Metrics

  • Accuracy
  • Traceability coverage (%)
  • Time reduction (%)
  • Cost savings (%)
  • Functional coverage (%)
  • Defect detection change (%)

Datasets

  • Synthetic Test Dataset (5,000 scenarios)
  • Enterprise SAP S/4HANA dataset (1,000 cases requiring transformation)
  • 25,000 generated test cases (deployment output)

Context Entities

Models

  • Fusion-in-Decoder (related work)
  • Dense Passage Retrieval (related work)