Agentic AI pipelines that generate test scenarios and search software project documents

February 4, 2026 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.6

Citation Count

0

Authors

Marian Kica, Lukas Radosky, David Slivka, Karin Kubinova, Daniel Dovhun, Tomas Uhercik, Erik Bircak, Ivan Polasek


Why It Matters For Business

Agentic pipelines can automate repetitive SE tasks (test-scenario creation and document search), cut manual labor, and speed onboarding; the systems are deployed internally but lack formal benchmarks.

Summary TLDR

The authors present two working agent-based LLM systems for software engineering tasks: (1) a test-scenario generator built as a 6-agent star topology (a supervisor plus specialized workers) that preprocesses FSDs, writes scenarios, fact-checks them, translates them, and exports the result to Excel; (2) a document-processing pipeline with a Delegator agent and four dedicated LLM agents (Search, Q&A, Trace, Reading) backed by a Qdrant document DB. Both systems use LangChain/LangGraph, handle images with a vision model, and are deployed and used daily in a medium-sized SE company. No formal benchmark or quantitative evaluation is reported.

Problem Statement

Writing test scenarios from long natural-language requirements is slow and costly. Finding and tracking information across many evolving SDLC documents is hard for newcomers and teams. The paper aims to automate both tasks using agentic LLM pipelines to reduce manual effort and speed information discovery.

Main Contribution

A practical agentic architecture for automatic test scenario generation: 6 agents in a star topology with a supervisor coordinating specialized workers.

A document-processing agent pipeline for SDLC documents: a Delegator plus four LLM agents (Search, Q&A, Trace, Reading) using a shared Qdrant database.

Engineering-level design choices: per-agent context/history, external artifact storage, fact-checker to reduce hallucinations, VLM for images, Excel writer for export.

A live deployment in a medium-sized software company and a plan to collect usage data and run formal benchmark evaluations in future work.
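The star topology from the first contribution can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the worker names, their stub bodies, and the `Supervisor` class are assumptions, and real workers would wrap LLM calls (the paper uses LangChain/LangGraph).

```python
from typing import Callable, Dict, List

Worker = Callable[[str], str]

# Hypothetical stand-ins for LLM-backed workers.
def preprocess(text: str) -> str:
    return " ".join(text.split())  # collapse whitespace

def write_scenarios(text: str) -> str:
    return f"Scenario: verify that {text}"

def fact_check(text: str) -> str:
    return text + " [checked]"

class Supervisor:
    """Hub of the star: only the supervisor calls workers; workers never call each other."""

    def __init__(self, workers: Dict[str, Worker], order: List[str]) -> None:
        self.workers = workers
        self.order = order
        self.log: List[str] = []

    def run(self, fsd_text: str) -> str:
        result = fsd_text
        for name in self.order:  # fixed invocation order
            result = self.workers[name](result)
            self.log.append(name)
        return result

supervisor = Supervisor(
    workers={"preprocess": preprocess, "write": write_scenarios, "fact_check": fact_check},
    order=["preprocess", "write", "fact_check"],
)
output = supervisor.run("the  login form   rejects empty passwords")
print(output)
```

Because every hand-off passes through the supervisor, a worker can be swapped or reordered without touching the others.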

Key Findings

Test scenario generator implemented as a 6-agent star with a supervisor and specialized workers.

Numbers: 6 agents; star topology described in Sec. 3.1

Document-processing system supports four explicit use cases via dedicated agents and a shared vector DB.

Numbers: 4 use cases (Search, Q&A, Trace, Reading); Qdrant DB used
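A Delegator that routes each request to one of the four agents might look roughly like the sketch below. The keyword rules and the stub agents are assumptions made for illustration; the summary does not say how the Delegator decides, and in the paper's system that decision is presumably made by an LLM.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stub agents; real ones would query the shared document DB.
def search_agent(query: str) -> str:
    return f"[search] documents matching: {query}"

def qa_agent(query: str) -> str:
    return f"[qa] answer to: {query}"

def trace_agent(query: str) -> str:
    return f"[trace] occurrences across documents of: {query}"

def reading_agent(query: str) -> str:
    return f"[reading] digest of: {query}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "search": search_agent,
    "qa": qa_agent,
    "trace": trace_agent,
    "reading": reading_agent,
}

def choose_agent(query: str) -> str:
    """Placeholder routing rule; stands in for the Delegator's decision."""
    lowered = query.lower()
    if lowered.startswith(("find ", "search ")):
        return "search"
    if lowered.startswith(("trace ", "track ")):
        return "trace"
    if lowered.startswith(("read ", "summarize ")):
        return "reading"
    return "qa"  # default: treat the request as a question

def delegate(query: str) -> Tuple[str, str]:
    name = choose_agent(query)
    return name, AGENTS[name](query)

name, answer = delegate("Find the payment module specification")
print(name, "->", answer)
```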

The pipeline includes a fact-checker agent and translation/export steps to reduce hallucination and match client needs.

Numbers: Fact-checker used before translation and Excel export; Excel writer non-LLM (hard-coded)

Both systems are deployed and used daily in a medium-sized SE company, but no formal evaluation is provided.

Numbers: Deployed and utilized daily (paper conclusion)

What To Try In 7 Days

Index one project’s documents in Qdrant and run a simple Search agent to surface key specs.

Prototype the 6-agent star for a single FSD chapter: retriever → writer → fact-checker → translator → Excel export.

Add a fact-checker step to any LLM output and log mismatches to measure hallucination rates.
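For the last suggestion, one way to log fact-checker mismatches and turn them into a rate is sketched below. The substring check is a deliberately naive placeholder for an LLM-based fact-checker, and all names (`CheckResult`, `naive_fact_check`, `mismatch_rate`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CheckResult:
    claim: str
    supported: bool

def naive_fact_check(claims: List[str], source: str) -> List[CheckResult]:
    """Mark a claim as supported only if it appears verbatim (case-insensitive) in the source."""
    src = source.lower()
    return [CheckResult(claim, claim.lower() in src) for claim in claims]

def mismatch_rate(results: List[CheckResult]) -> float:
    """Fraction of claims the checker could not support."""
    if not results:
        return 0.0
    return sum(not r.supported for r in results) / len(results)

source = "Passwords must have at least 8 characters. Accounts lock after 5 failed logins."
claims = [
    "accounts lock after 5 failed logins",
    "passwords expire after 30 days",  # not in the source
]
results = naive_fact_check(claims, source)
rate = mismatch_rate(results)

for r in results:
    if not r.supported:
        print("MISMATCH:", r.claim)
print("mismatch rate:", rate)
```

Tracking this rate over time gives a rough signal, though an unsupported claim under such a simple check is not necessarily a hallucination.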

Agent Features

Memory

  • per-agent context and history
  • external artifact storage to keep supervisor context small
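The external artifact storage idea can be illustrated with a small sketch: workers read and write large content in a store, and the supervisor's history holds only short keys. The `ArtifactStore` class, the key format, and the stub worker are assumptions for illustration.

```python
from typing import Dict, List

class ArtifactStore:
    """In-memory stand-in for external artifact storage."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def put(self, kind: str, content: str) -> str:
        key = f"{kind}-{len(self._items) + 1}"
        self._items[key] = content
        return key

    def get(self, key: str) -> str:
        return self._items[key]

store = ArtifactStore()
supervisor_history: List[str] = []  # holds only short references

def writer_worker(chapter_key: str) -> str:
    """Hypothetical worker: loads the full chapter, stores its output, returns a key."""
    chapter = store.get(chapter_key)
    scenarios = f"Scenarios for: {chapter[:20]}..."
    return store.put("scenarios", scenarios)

chapter_key = store.put("chapter", "A very long FSD chapter " * 200)
supervisor_history.append(f"stored {chapter_key}")

scenarios_key = writer_worker(chapter_key)
supervisor_history.append(f"writer produced {scenarios_key}")

print(supervisor_history)
```

The chapter text is several thousand characters long, yet the supervisor's history stays at a few dozen.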

Planning

  • ordered worker invocation enforced by supervisor
  • worker input validation and feedback loops
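Worker input validation with a feedback loop could work along these lines: the worker rejects bad input with a reason, and the supervisor retries using that reason. The validation rules, the retry limit, and `fake_supervisor` are hypothetical; the summary does not describe the actual checks.

```python
from typing import Callable, List, Optional, Tuple

def validate_writer_input(text: str) -> Optional[str]:
    """Return an error message, or None if the input is acceptable."""
    if not text.strip():
        return "input is empty"
    if "requirement" not in text.lower():
        return "input does not mention a requirement"
    return None

def run_with_feedback(
    make_input: Callable[[Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Tuple[str, List[str]]:
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_attempts):
        candidate = make_input(feedback)  # supervisor sees the last feedback
        feedback = validate(candidate)    # worker checks before doing any work
        history.append(feedback or "ok")
        if feedback is None:
            return candidate, history
    raise ValueError(f"no valid input after {max_attempts} attempts: {history}")

def fake_supervisor(feedback: Optional[str]) -> str:
    """Stand-in for the supervisor: first attempt is bad, the retry is corrected."""
    if feedback is None:
        return ""
    return "Requirement R1: users can reset passwords"

text, history = run_with_feedback(fake_supervisor, validate_writer_input)
print(history)
```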

Tool Use

  • VLM for image processing
  • Qdrant vector DB
  • Excel writer (non-LLM)
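A hard-coded, non-LLM export step has roughly the shape sketched below. To stay within the standard library, this sketch writes CSV rather than a real Excel workbook, and the column names are assumptions rather than the authors' schema.

```python
import csv
import io
from typing import Dict, List

# Hypothetical fixed column layout; a hard-coded writer cannot adapt it per client.
COLUMNS = ["id", "title", "steps", "expected_result"]

def export_scenarios(scenarios: List[Dict[str, str]]) -> str:
    """Serialize scenarios deterministically; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for scenario in scenarios:
        writer.writerow({col: scenario.get(col, "") for col in COLUMNS})
    return buffer.getvalue()

sheet = export_scenarios([
    {
        "id": "TS-1",
        "title": "Empty password",
        "steps": "Submit form",
        "expected_result": "Error shown",
    },
    {"id": "TS-2", "title": "Locked account"},
])
print(sheet)
```

Keeping this step deterministic means the export itself cannot hallucinate, at the cost of the inflexibility noted under Limitations.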

Frameworks

  • LangChain
  • LangGraph

Is Agentic

true

Architectures

  • star topology (supervisor + workers)
  • delegator-based multi-agent pipeline

Collaboration

  • supervisor/Delegator mediates all communication
  • workers unaware of each other

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • No formal quantitative evaluation or benchmark results are reported.
  • Excel writer is hard-coded and not LLM-driven, limiting flexibility.
  • Models used are unnamed and may change, making reproducibility unclear.
  • Authors note hallucination risk and rely on human/agent fact-checking.

When Not To Use

  • For tasks needing provable correctness or regulatory guarantees.
  • When you cannot supply any project-specific documents for indexing.
  • If you require fully open-source, reproducible pipelines (code not provided).

Failure Modes

  • LLM hallucinations leading to incorrect scenarios despite fact-checker.
  • Supervisor misordering or incorrect prompts causing worker errors.
  • Poor retrieval quality from the DB yields irrelevant or missing facts.
  • Context window limits require careful block processing and note management.
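The block processing and note management mentioned in the last failure mode can be sketched as follows: split a long document into blocks under a size budget and carry a bounded list of notes between blocks. Character counts stand in for tokens, and the note format and function names are assumptions.

```python
from typing import List

def split_into_blocks(paragraphs: List[str], max_chars: int) -> List[str]:
    """Greedily pack paragraphs into blocks of at most max_chars.

    A single paragraph longer than max_chars becomes its own oversized block.
    """
    blocks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + "\n" + paragraph
        if len(candidate) > max_chars and current:
            blocks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        blocks.append(current)
    return blocks

def build_prompts(blocks: List[str], max_notes: int = 2) -> List[str]:
    """Build one prompt per block, prefixed with notes from earlier blocks."""
    notes: List[str] = []
    prompts: List[str] = []
    for block in blocks:
        prompts.append("NOTES: " + "; ".join(notes) + "\nBLOCK: " + block)
        # Placeholder note: first line of the block; a real system would summarize.
        notes.append(block.split("\n", 1)[0][:40])
        notes = notes[-max_notes:]  # keep the note list bounded
    return prompts

blocks = split_into_blocks(["aaaa", "bbbb", "cccc"], max_chars=9)
prompts = build_prompts(blocks)
print(blocks)
print(prompts[1])
```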

Core Entities

Models

  • on-premise and cloud LLMs (unnamed)

Context Entities

Models

  • GPT-3.5 (related work)
  • GPT-4 (related work)
  • LLaMA, Mistral (related literature)