RIRO: reshape inputs then refine outputs to boost LLMs on tiny domain datasets

December 15, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

The approach is practical and improves automatic metrics on a single user-story dataset, but it lacks broad validation, public code, and compute-cost numbers.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.72

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Ali Hamdi, Hozaifa Kassab, Mohamed Bahaa, Marwa Mohamed

Why It Matters For Business

RIRO improves LLM output quality when labeled data is scarce, reducing manual test-writing and correction time while keeping compute costs lower via QLoRA.

Summary TLDR

RIRO wraps fine-tuning in two reshaping layers, giving a three-step pipeline: (1) a prompt-based reformulation step that normalizes varied user stories, (2) QLoRA fine-tuning on a small domain set, and (3) an optional output-reshaping step. On a user-story → test-case task, the full RIRO pipeline improved BLEU from 0.55 to 0.72 and cosine similarity from 0.816 to 0.891 versus a Phi-2 baseline. The method trades extra model passes and tuning effort for more consistent outputs in data-scarce settings.

Problem Statement

Fine-tuning LLMs on small, domain-specific datasets often yields inconsistent, noisy outputs because inputs vary widely in phrasing. Existing fixes (augmentation or strict templates) either add noise or reduce flexibility. The problem: get reliable, structured outputs (test cases) from LLMs when labeled data is limited.

Main Contribution

RIRO pipeline: input reformulation + QLoRA fine-tuning + output reshaping.

Ablation study comparing Reformulation-only, Reshaping-only, and full RIRO pipeline.
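
The three ablation variants differ only in which reshaping layers are switched on. A minimal sketch of that structure, assuming each stage is a plain text-to-text function. The names `reformulate`, `generate`, and `reshape` are hypothetical stand-ins for the prompt-based reformulator, the QLoRA-tuned model, and the output reshaper; they are not the authors' code.

```python
from typing import Callable, Optional

TextFn = Callable[[str], str]


def run_pipeline(
    story: str,
    generate: TextFn,
    reformulate: Optional[TextFn] = None,
    reshape: Optional[TextFn] = None,
) -> str:
    """Run the generator, optionally wrapped by the input and output layers."""
    text = reformulate(story) if reformulate else story
    output = generate(text)
    return reshape(output) if reshape else output


# Hypothetical stand-ins so the sketch runs without a model.
def reformulate(story: str) -> str:
    # Collapse irregular whitespace as a toy form of input normalization.
    return " ".join(story.split())


def generate(story: str) -> str:
    # Placeholder for the fine-tuned model's test-case generation.
    return f"test: verify that {story}"


def reshape(output: str) -> str:
    # Capitalize the first letter and end with exactly one period.
    return output[0].upper() + output[1:].rstrip(".") + "."


# Each ablation variant enables a different subset of the two layers.
VARIANTS = {
    "reformulation_only": dict(reformulate=reformulate),
    "reshaping_only": dict(reshape=reshape),
    "full_riro": dict(reformulate=reformulate, reshape=reshape),
}

story = "user   can reset  password"
results = {
    name: run_pipeline(story, generate, **layers)
    for name, layers in VARIANTS.items()
}
```

Keeping the layers as optional arguments means the same evaluation loop can score every variant without duplicating pipeline code.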

Key Findings

Full RIRO pipeline increases BLEU from 0.55 (Phi-2 baseline) to 0.72

Numbers: BLEU: Phi-2 0.55 → RIRO 0.72

Practical Use: If you need higher surface-level match to references, add input reformulation and output reshaping before/after fine-tuning.

Evidence Ref: Table 1

RIRO raises semantic similarity (cosine) from 0.816 to 0.891 on evaluated data

Numbers: Cosine: 0.816 → 0.891

Practical Use: To better preserve meaning relative to the references, the combined pipeline is preferable to single-pass fine-tuning.

Evidence Ref: Table 1
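
Cosine similarity compares two embedding vectors by the angle between them: the dot product divided by the product of the vector norms. This summary does not say which embedding model produced the vectors, so the sketch below takes precomputed vectors as input, and the example vectors are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return dot(u, v) / (||u|| * ||v||) for two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Illustrative vectors only; real inputs would be sentence embeddings.
generated = [1.0, 2.0, 2.0]
reference = [2.0, 4.0, 4.0]   # same direction as `generated`
orthogonal = [2.0, -1.0, 0.0]  # perpendicular to `generated`

same_direction = cosine_similarity(generated, reference)
perpendicular = cosine_similarity(generated, orthogonal)
```

Because the score depends only on direction, a generated test case that is longer or shorter than the reference can still score highly if its embedding points the same way.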

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
BLEU | 0.72 (RIRO) | 0.55 (Phi-2) | +0.17 | user-story→test-case task (neodataset subset) | Table 1 reports BLEU 0.72 for RIRO vs 0.55 for Phi-2 | Table 1
ROUGE-1 (F1) | 0.402 (RIRO) | 0.265 (Phi-2) | +0.137 | user-story→test-case task | Table 1 ROUGE-1 values | Table 1
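
The Delta column is the absolute difference between the RIRO score and the Phi-2 baseline. A short sketch that recomputes it from the values above and also derives the relative gain, which the results do not report directly:

```python
# (baseline, RIRO) pairs copied from the results above.
SCORES = {
    "BLEU": (0.55, 0.72),
    "ROUGE-1 (F1)": (0.265, 0.402),
}


def deltas(baseline: float, value: float) -> tuple:
    """Return (absolute delta, relative gain in percent), both rounded."""
    absolute = round(value - baseline, 3)
    relative_pct = round(100 * (value - baseline) / baseline, 1)
    return absolute, relative_pct


for metric, (baseline, value) in SCORES.items():
    absolute, relative_pct = deltas(baseline, value)
    print(f"{metric}: {absolute:+.3f} absolute, {relative_pct:+.1f}% relative")
```

Read this way, the gains are roughly 31% for BLEU and 52% for ROUGE-1 relative to the baseline, though both come from a single dataset subset.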

What To Try In 7 Days

Add a prompt-based input normalizer that rewrites user stories into a fixed template.

Fine-tune a backbone with QLoRA on a small domain set to adapt behavior.

Add a lightweight output-reshaper pass to clean and format generated test cases.
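
The first and third steps above can be prototyped before any fine-tuning. The sketch below is one possible starting point under stated assumptions: the template is the common agile "As a ..., I want ..., so that ..." form rather than the paper's own, the prompt wording is invented, `call_llm` is a hypothetical callable for whatever model client you use, and a rule-based cleanup stands in for the output-reshaper pass, which in the paper may well be another model call.

```python
import re
from typing import Callable, List

# Assumed template: the common agile user-story form, not the paper's own.
STORY_TEMPLATE = "As a <role>, I want <goal>, so that <benefit>."


def build_reformulation_prompt(raw_story: str) -> str:
    """Build a prompt asking a model to rewrite a story into the template."""
    cleaned = " ".join(raw_story.split())  # collapse stray whitespace
    return (
        "Rewrite the user story below so it follows this template exactly:\n"
        f"{STORY_TEMPLATE}\n"
        "Keep the original intent. Do not add requirements.\n\n"
        f"User story: {cleaned}\n"
        "Rewritten story:"
    )


def normalize_story(raw_story: str, call_llm: Callable[[str], str]) -> str:
    """Send the prompt through a caller-supplied (hypothetical) LLM function."""
    return call_llm(build_reformulation_prompt(raw_story)).strip()


def reshape_test_cases(raw_output: str) -> List[str]:
    """Clean raw model output into a deduplicated, numbered list."""
    cases: List[str] = []
    for line in raw_output.splitlines():
        # Drop a leading bullet or existing numbering, then trim whitespace.
        text = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if text and text not in cases:
            cases.append(text)
    return [f"{i}. {text}" for i, text in enumerate(cases, start=1)]
```

The prompt explicitly asks the model to preserve intent because reformulation that drifts from the original story is one of the failure modes listed for this method.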

Optimization Features

Model Optimization
Model Optimization: LoRA
Training Optimization: LoRA (trains only small low-rank adapter matrices; base weights stay frozen)
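
To see why the low-rank approach is cheap, note that LoRA freezes a weight matrix of shape d × k and trains two factors of shapes d × r and r × k, so the trainable parameters per matrix drop from d·k to r·(d + k). A small sketch of that arithmetic follows; the sizes are illustrative and are not taken from the paper.

```python
def full_params(d: int, k: int) -> int:
    """Parameters updated when fine-tuning a d x k weight matrix directly."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the two LoRA factors: B (d x r) and A (r x k)."""
    return r * (d + k)


# Illustrative sizes only; the paper's rank and layer shapes are not given here.
d, k, r = 2560, 2560, 8

full = full_params(d, k)
lora = lora_trainable_params(d, k, r)
ratio = lora / full

print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.3%}")
```

QLoRA applies the same adapters on top of a quantized base model, which lowers memory use further.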

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

Evaluation limited to a subset of a user-story dataset; generalization untested.

No public code or reproducibility artifacts provided.

When Not To Use

In high-stakes domains without rigorous human validation

When you already have large labeled datasets for direct fine-tuning

Failure Modes

Reformulation changes user intent and causes incorrect test cases

Overfitting to small domain leads to brittle behavior on new inputs

Core Entities

Models

Phi-2, Falcon 7B, Falcon 1B, LoRA

Metrics

BLEU, ROUGE-1, ROUGE-2, ROUGE-L, Levenshtein Distance, Cosine Similarity
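
Two of these metrics are simple enough to sketch with the standard library: Levenshtein distance counts character edits, and ROUGE-1 F1 measures unigram overlap. The version below assumes lowercasing and whitespace tokenization, which may differ from the paper's setup; published ROUGE implementations often add stemming and other options, so treat the numbers as approximate.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def rouge1_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram overlap, using whitespace tokens (assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


edit_distance = levenshtein("kitten", "sitting")
overlap_f1 = rouge1_f1("the cat sat", "the cat sat on the mat")
```

Both metrics reward surface overlap, so they complement rather than replace the embedding-based cosine score.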

Datasets

subset of user story neodataset