Overview
Production Readiness
0.5
Novelty Score
0.6
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
RIRO improves LLM output quality when labeled data is scarce, which reduces the time spent writing and correcting test cases by hand. QLoRA keeps the compute cost of fine-tuning lower than full fine-tuning.
Summary TLDR
RIRO is a simple three-stage pipeline built around two refinement layers: (1) a prompt-based reformulation step that normalizes varied user stories, (2) QLoRA fine-tuning on a small domain set, and (3) an optional output-reshaping step. On a user-story → test-case task, the full RIRO pipeline improved BLEU from 0.55 to 0.72 and cosine similarity from 0.816 to 0.891 versus a Phi-2 baseline. The method trades extra model passes and tuning effort for more consistent outputs in data-scarce settings.
Problem Statement
Fine-tuning LLMs on small, domain-specific datasets often yields inconsistent, noisy outputs because inputs vary widely in phrasing. Existing fixes (augmentation or strict templates) either add noise or reduce flexibility. The problem: get reliable, structured outputs (test cases) from LLMs when labeled data is limited.
Main Contribution
RIRO pipeline: input reformulation + QLoRA fine-tuning + output reshaping.
Ablation study comparing Reformulation-only, Reshaping-only, and full RIRO pipeline.
Empirical gains on a user-story→test-case task using Phi-2 and other backbones.
Use of QLoRA to make domain fine-tuning feasible on small datasets.
Key Findings
Full RIRO pipeline increases BLEU from 0.55 (Phi-2 baseline) to 0.72
RIRO raises semantic similarity (cosine) from 0.816 to 0.891 on evaluated data
Character-edit distance to references improves: Levenshtein drops from 1157.62 to 1000.88
Ablation shows the full pipeline (Reformulation+Fine-tune+Reshape) outperforms the variants that drop reformulation or reshaping
QLoRA is used to enable fine-tuning on small datasets with lower compute
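To put the findings above on a common scale, the short sketch below converts each reported baseline/RIRO pair into a relative change. The numbers are the ones quoted in this summary; the helper name `relative_change` is introduced here for illustration only.

```python
# Baseline (Phi-2) and full-RIRO values as quoted in the findings above.
results = {
    "BLEU": (0.55, 0.72),
    "Cosine similarity": (0.816, 0.891),
    "Levenshtein distance": (1157.62, 1000.88),
}


def relative_change(baseline: float, new: float) -> float:
    """Return the change from baseline to new as a percentage of baseline."""
    return (new - baseline) / baseline * 100.0


changes = {name: relative_change(b, n) for name, (b, n) in results.items()}

for name, pct in changes.items():
    # Prints: BLEU +30.9%, Cosine similarity +9.2%, Levenshtein distance -13.5%
    print(f"{name}: {pct:+.1f}%")
```

For Levenshtein distance, lower is better, so the negative change is an improvement.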
Results
Metrics reported: BLEU, ROUGE-1 (F1), ROUGE-2 (F1), ROUGE-L (F1), Levenshtein Distance, Cosine Similarity.
Who Should Care
What To Try In 7 Days
Add a prompt-based input normalizer that rewrites user stories into a fixed template.
Fine-tune a backbone with QLoRA on a small domain set to adapt behavior.
Add a lightweight output-reshaper pass to clean and format generated test cases.
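The three steps above could be wired together roughly as follows. This is a minimal sketch, not the paper's implementation: the prompt templates, the `riro_pipeline` function, and the `TextModel` callable type are all hypothetical, and each model is passed in as a plain function from prompt text to generated text so that any backend can be plugged in.

```python
from typing import Callable, Optional

# Hypothetical prompt templates; the paper's actual prompts are not shown here.
REFORMULATE_PROMPT = (
    "Rewrite the user story in the form "
    "'As a <role>, I want <goal>, so that <benefit>.'\n"
    "User story: {story}\n"
    "Rewritten:"
)
RESHAPE_PROMPT = (
    "Reformat the draft test case into the sections "
    "Title, Preconditions, Steps, Expected Result.\n"
    "Draft: {draft}\n"
    "Formatted:"
)

# Any callable that maps a prompt string to generated text (assumed interface).
TextModel = Callable[[str], str]


def riro_pipeline(
    story: str,
    reformulator: TextModel,
    generator: TextModel,
    reshaper: Optional[TextModel] = None,
) -> str:
    """Run reformulation, generation, and optional reshaping in sequence."""
    # Step 1: normalize the raw user story into a fixed template.
    normalized = reformulator(REFORMULATE_PROMPT.format(story=story)).strip()
    # Step 2: generate a test case with the (QLoRA fine-tuned) backbone.
    draft = generator(normalized).strip()
    # Step 3: optionally clean and format the generated test case.
    if reshaper is None:
        return draft
    return reshaper(RESHAPE_PROMPT.format(draft=draft)).strip()
```

Keeping the reshaper optional makes it easy to compare the full pipeline against a reformulation-only run on the same inputs, in the spirit of the ablation described earlier.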
Optimization Features
Model Optimization
- LoRA
Training Optimization
- LoRA
- Train low-rank adapter matrices while the base weights stay frozen (rank reduction)
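As a rough illustration of why low-rank adapters are cheap to train: for a weight matrix of shape d × k, LoRA learns two factors of shape d × r and r × k, so the trainable count is r(d + k) instead of d·k. The dimensions in the sketch below are illustrative assumptions, not values taken from the paper.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two low-rank factors, B (d x r) and A (r x k)."""
    return r * (d + k)


def full_params(d: int, k: int) -> int:
    """Parameters in the full d x k weight matrix."""
    return d * k


# Illustrative dimensions only (assumed, not reported in the paper).
d, k, r = 2560, 2560, 8

lora = lora_trainable_params(d, k, r)
full = full_params(d, k)
fraction = lora / full

print(f"LoRA factors: {lora:,} trainable parameters")
print(f"Full matrix:  {full:,} parameters")
print(f"Fraction trained: {fraction:.4f}")
```

QLoRA applies the same adapters on top of a base model whose frozen weights are stored in 4-bit precision, which lowers memory use further.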
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Evaluation limited to a subset of a user-story dataset; generalization untested.
- No public code or reproducibility artifacts provided.
- Extra model passes increase inference cost and latency.
- Risk of overfitting small datasets despite QLoRA.
When Not To Use
- In high-stakes domains without rigorous human validation
- When you already have large labeled datasets for direct fine-tuning
- When strict low-latency inference is required
Failure Modes
- Reformulation changes user intent and causes incorrect test cases
- Overfitting to small domain leads to brittle behavior on new inputs
- Output reshaper may mask semantic errors while improving surface metrics
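One cheap guard against the first failure mode is to check how much of the original story's vocabulary survives reformulation and route low-overlap cases to a human. The sketch below is a heuristic of our own, not something the paper describes: the stopword list, the function names, and the 0.6 threshold are all assumptions, and word overlap is only a rough proxy for preserved intent.

```python
import re

# Small, hand-picked stopword list (assumed; tune for your own stories).
STOPWORDS = {
    "a", "an", "the", "as", "i", "to", "so", "that", "want",
    "can", "of", "and", "in", "on", "for", "my",
}


def content_words(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def intent_retention(original: str, reformulated: str) -> float:
    """Fraction of the original's content words that survive reformulation."""
    source = content_words(original)
    if not source:
        return 1.0
    return len(source & content_words(reformulated)) / len(source)


def needs_review(original: str, reformulated: str, threshold: float = 0.6) -> bool:
    """Flag a reformulation whose overlap with the original is below threshold."""
    return intent_retention(original, reformulated) < threshold
```

A check like this runs before generation, so a flagged story never reaches the fine-tuned model without review. It does not address the third failure mode, where the reshaper hides semantic errors; that still needs human or test-based validation of the final output.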
Core Entities
Models
- Phi-2
- Falcon 7B
- Falcon 1B
- LoRA (adapter method, applied via QLoRA)
Metrics
- BLEU
- ROUGE-1
- ROUGE-2
- ROUGE-L
- Levenshtein Distance
- Cosine Similarity
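For readers less familiar with the last two metrics, here is a minimal, dependency-free sketch of each. Levenshtein distance is the standard character-level edit distance. The cosine similarity shown here runs over bag-of-words counts, which is an assumption made to keep the example self-contained; the summary does not say which text representation the paper used, and an embedding-based variant would give different values.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between whitespace-token count vectors (assumed representation)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because Levenshtein distance is unnormalized, its magnitude grows with output length, which is why the reported values are in the hundreds or thousands while BLEU and cosine similarity stay between 0 and 1.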
Datasets
- subset of user story neodataset

