Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
Automates time-consuming test-document work, preserves traceability for regulated enterprise projects, and can shrink timelines and costs if you can manage integration and KB upkeep.
Summary TLDR
This paper builds an enterprise software-testing system that combines retrieval-augmented generation (RAG) with a relationship-aware graph and multiple specialized LLM agents. In an SAP migration case study, the authors report accuracy rising from ~65% (basic RAG) to 94.8% (Agentic RAG), an 85% reduction in artifact creation time, 35% cost savings, and improved traceability. The system uses a dual-store design (vector store + TigerGraph), multi-layer prompts, and dynamic model routing (Mistral 7B for cheap tasks, Gemini Pro for hard cases). Results are promising but come from internal deployments and proprietary integrations, so expect extra work to adapt the approach and to maintain the hybrid KB and integration plumbing.
Problem Statement
Quality engineers spend ~30–40% of their time writing test artifacts. Traditional RAG loses business relationships during retrieval. Manual methods don’t scale for enterprise systems (e.g., SAP) and lack traceability across requirements, tests, and results.
Main Contribution
Hybrid vector-graph knowledge system that combines semantic search (vectors) with relationship-aware graph traversal to preserve business context.
Multi-agent orchestration layer with specialized agents for legacy analysis, mapping changes, integration points, test case creation, and compliance checks.
Enhanced contextualization engine: multi-stage context assembly, conflict resolution, and a seven-layer validation pipeline.
Comprehensive bidirectional traceability framework linking requirements, test cases, execution results, and change impact.
Real-world enterprise deployment on SAP migration projects with reported accuracy, efficiency, and cost metrics.
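The hybrid vector-graph idea can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory dictionaries stand in for the vector store and TigerGraph, and every identifier (REQ-1, TC-7, IF-3), embedding value, and function name below is a hypothetical placeholder. The sketch ranks items by cosine similarity, then walks graph edges from the top hits so that related items the semantic search missed are still returned.

```python
import math
from collections import deque

# Toy stand-in for a vector store: item id -> embedding (made-up values).
VECTORS = {
    "REQ-1": [0.9, 0.1, 0.0],
    "TC-7": [0.8, 0.2, 0.1],
    "TC-9": [0.0, 0.1, 0.9],
}

# Toy stand-in for a graph store: adjacency list of business relationships.
GRAPH = {
    "REQ-1": ["TC-7", "IF-3"],
    "TC-7": ["REQ-1"],
    "TC-9": ["REQ-4"],
    "IF-3": ["REQ-1"],
    "REQ-4": ["TC-9"],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, k):
    """Return the k item ids most similar to the query vector."""
    ranked = sorted(VECTORS, key=lambda d: cosine(query_vec, VECTORS[d]), reverse=True)
    return ranked[:k]


def expand(seeds, hops):
    """Breadth-first walk from the seed ids, up to `hops` edges away."""
    seen = set(seeds)
    order = list(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in GRAPH.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return order


def hybrid_retrieve(query_vec, k=2, hops=1):
    """Semantic hits first, then their graph neighbors."""
    return expand(vector_search(query_vec, k), hops)


print(hybrid_retrieve([1.0, 0.0, 0.0], k=1, hops=1))  # ['REQ-1', 'TC-7', 'IF-3']
```

In this toy data, IF-3 has no embedding at all, so it only appears in the result because of its graph edge to REQ-1. That is the kind of business relationship the summary says plain vector retrieval loses.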
Key Findings
Agentic multi-agent RAG improves test artifact accuracy compared to Basic RAG.
Artifact creation time dropped dramatically in the reported deployments.
System-level outcomes and cost impacts reported for enterprise projects.
Ablation shows each component materially matters; contextualization has the largest single impact.
Paper contains conflicting go-live acceleration claims.
Results
Accuracy: ~65% (Basic RAG) to 94.8% (Agentic RAG), as reported
Traceability coverage (requirements->tests)
Artifact creation time reduction: 85% (reported)
Cost savings (reported): 35%
Who Should Care
What To Try In 7 Days
Run a small pilot: index 100–500 legacy test items into a vector DB and TigerGraph and measure retrieval relevance.
Implement a single specialized agent (e.g., Modernized Test Case Agent) to generate and validate 50 test cases from real requirements.
Set up a basic traceability matrix for one module and compare coverage before/after automated generation.
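For the traceability step, a before/after comparison can be as simple as the sketch below. It assumes links are stored as (requirement id, test case id) pairs; the ids and the helper names `coverage` and `backward_index` are hypothetical, and the numbers are illustrative, not results from the paper.

```python
def coverage(requirements, links):
    """Fraction of requirements that have at least one linked test case."""
    covered = {req for req, _test in links if req in requirements}
    return len(covered) / len(requirements) if requirements else 0.0


def backward_index(links):
    """Map each test case back to the requirements it verifies."""
    index = {}
    for req, test in links:
        index.setdefault(test, set()).add(req)
    return index


requirements = {"R1", "R2", "R3", "R4"}
links_before = [("R1", "T1")]
links_after = [("R1", "T1"), ("R2", "T2"), ("R2", "T3"), ("R4", "T4")]

print(coverage(requirements, links_before))  # 0.25
print(coverage(requirements, links_after))   # 0.75
print(backward_index(links_after)["T2"])     # {'R2'}
```

Keeping both directions (requirement to tests, and test back to requirements) is what makes the matrix bidirectional: the forward view gives coverage, the backward view shows which requirements a failing test affects.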
Agent Features
Memory
- Hybrid retrieval memory (vector store for semantics, graph for relationships)
- Bidirectional traceability as persistent links
Planning
- Task decomposition into specialized agents
- Strategy synthesis from objectives and history
Tool Use
- TigerGraph for relationship traversal
- Vector DB (SingleStore) for semantic search
- Sentence Transformer for embeddings
- Kubernetes/Docker for orchestration
Frameworks
- Multi-layer prompt engineering (context, spec, template, validation, enhancement)
- Seven-layer context validation pipeline
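One way to picture the multi-layer prompt is as a fixed sequence of named sections. The sketch below is an assumption about how such layers might be assembled, not the paper's prompt format: the layer names come from the list above, while the bracketed section markers, the `build_prompt` helper, and the sample text are hypothetical.

```python
# Layer names taken from the summary; the ordering and markup are assumptions.
LAYERS = ("context", "spec", "template", "validation", "enhancement")


def build_prompt(parts):
    """Join the five layers in a fixed order, failing loudly if one is empty."""
    missing = [name for name in LAYERS if not parts.get(name)]
    if missing:
        raise ValueError(f"missing prompt layers: {missing}")
    return "\n\n".join(f"[{name.upper()}]\n{parts[name].strip()}" for name in LAYERS)


prompt = build_prompt({
    "context": "Legacy module: vendor invoicing. Related interfaces: IF-3.",
    "spec": "Generate one test case for invoice posting.",
    "template": "Fields: id, preconditions, steps, expected result.",
    "validation": "Every step must reference a requirement id.",
    "enhancement": "Add one negative scenario.",
})
print(prompt.splitlines()[0])  # [CONTEXT]
```

Rejecting a prompt with a missing layer, rather than silently skipping it, keeps an incomplete context from reaching the model unnoticed.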
Is Agentic
true
Architectures
- Multi-agent orchestration layer
- Dual-database (vector + graph) hybrid architecture
- Dynamic model routing across LLMs
Collaboration
- Agent-to-agent orchestration and handoff
- Conflict resolution with rule-based priority system
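A rule-based priority system for conflicting agent outputs might look like the sketch below. The summary does not describe the actual rules, so the priority table, the agent labels, and the `Proposal` and `resolve` names are all hypothetical: when two agents propose different values for the same field, the agent with the higher configured priority wins, and ties keep the earlier proposal.

```python
from dataclasses import dataclass

# Hypothetical priority table: higher number wins. Unknown agents default to 0.
PRIORITY = {"compliance": 3, "integration": 2, "legacy": 1}


@dataclass(frozen=True)
class Proposal:
    agent: str
    field: str
    value: str


def resolve(proposals):
    """Pick one value per field, preferring the higher-priority agent."""
    winners = {}
    for p in proposals:
        current = winners.get(p.field)
        if current is None or PRIORITY.get(p.agent, 0) > PRIORITY.get(current.agent, 0):
            winners[p.field] = p
    return {field: p.value for field, p in winners.items()}


proposals = [
    Proposal("legacy", "approval_step", "optional"),
    Proposal("compliance", "approval_step", "mandatory"),
    Proposal("integration", "target_system", "S/4HANA"),
]
print(resolve(proposals))
# {'approval_step': 'mandatory', 'target_system': 'S/4HANA'}
```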
Optimization Features
Infra Optimization
- Containerized deployment (Docker + Kubernetes)
- Distributed vector store and TigerGraph Cloud
System Optimization
- Microservices architecture with health monitoring and auto-failover
- Horizontal scaling for vector and graph stores
Inference Optimization
- Dynamic model routing (use smaller LLMs for simple tasks, larger for complex reasoning)
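Dynamic model routing can be sketched as a cheap complexity score followed by a threshold. The pairing of a small model for simple tasks and a large one for hard cases follows the summary; the scoring signals, the threshold, the task fields, and the model label strings below are assumptions for illustration and are not real API identifiers.

```python
def complexity_score(task):
    """Count simple signals that suggest a task needs deeper reasoning."""
    score = 0
    if task.get("linked_interfaces", 0) > 3:
        score += 1
    if task.get("needs_compliance_check"):
        score += 1
    if len(task.get("requirement_text", "")) > 500:
        score += 1
    return score


def route(task, threshold=2):
    """Return a model label: small model by default, large model for hard tasks."""
    return "gemini-pro" if complexity_score(task) >= threshold else "mistral-7b"


simple = {"requirement_text": "Post a vendor invoice."}
hard = {"linked_interfaces": 5, "needs_compliance_check": True}
print(route(simple))  # mistral-7b
print(route(hard))    # gemini-pro
```

Because the score uses only metadata already attached to the task, routing costs no model call; in practice the signals and threshold would need tuning against observed output quality.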
Reproducibility
Open Source Status
- no
Risks & Boundaries
Limitations
- Domain specialization to SAP, employee, and finance systems; may need extra work to adapt to other domains.
- Hybrid KB requires ongoing maintenance as business processes change.
- Enterprise integration complexity: connecting 200+ interfaces and custom T-codes requires engineering effort.
- Key timeline claims are inconsistent within the paper and need clarification.
When Not To Use
- Small projects where integration overhead outweighs automation benefits.
- Environments that cannot support a dual-database architecture or enforce strict data redaction policies.
- Use cases needing fully open-source toolchains or where code/data must be public.
Failure Modes
- Context fragmentation if graph relationships are incomplete or poorly modeled.
- Quality drop if enhanced contextualization or conflict-resolution rules are not tuned.
- Drift or stale knowledge when the hybrid KB is not actively maintained.
- Integration failures when enterprise connectors or PII redaction are misconfigured.
Core Entities
Models
- Gemini Pro
- Mistral 7B
- GPT-4
- Sentence Transformer
Metrics
- Accuracy
- Traceability coverage (%)
- Time reduction (%)
- Cost savings (%)
- Functional coverage (%)
- Defect detection change (%)
Datasets
- Synthetic Test Dataset (5,000 scenarios)
- Enterprise SAP S/4HANA dataset (1,000 cases requiring transformation)
- 25,000 generated test cases (deployment output)
Context Entities
Models
- Fusion-in-Decoder (related work)
- Dense Passage Retrieval (related work)

