RIRO: reshape inputs then refine outputs to boost LLMs on tiny domain datasets

December 15, 2024 · 6 min

Overview

Production Readiness

0.5

Novelty Score

0.6

Cost Impact Score

0.6

Citation Count

0

Authors

Ali Hamdi, Hozaifa Kassab, Mohamed Bahaa, Marwa Mohamed

Links

Abstract / PDF

Why It Matters For Business

RIRO improves LLM output quality when labeled data is scarce, reducing manual test-writing and correction time while keeping compute costs lower via QLoRA.

Summary TLDR

RIRO is a simple pipeline that adds two layers around a fine-tuned model: a prompt-based reformulation step that normalizes varied user stories before generation, and an optional output-reshaping step afterward. The model itself is fine-tuned with QLoRA on a small domain set. On a user-story → test-case task, the full RIRO pipeline improved BLEU from 0.55 to 0.72 and cosine similarity from 0.816 to 0.891 versus a Phi-2 baseline. The method trades extra model passes and tuning effort for more consistent outputs in data-scarce settings.

Problem Statement

Fine-tuning LLMs on small, domain-specific datasets often yields inconsistent, noisy outputs because inputs vary widely in phrasing. Existing fixes (augmentation or strict templates) either add noise or reduce flexibility. The problem: get reliable, structured outputs (test cases) from LLMs when labeled data is limited.

Main Contribution

RIRO pipeline: input reformulation + QLoRA fine-tuning + output reshaping.

Ablation study comparing Reformulation-only, Reshaping-only, and full RIRO pipeline.

Empirical gains on a user-story→test-case task using Phi-2 and other backbones.

Use of QLoRA to make domain fine-tuning feasible on small datasets.
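
The three parts listed above can be pictured as a chain of text-to-text stages. The following is a minimal sketch under stated assumptions, not the authors' implementation: `RiroPipeline`, `Stage`, and the stand-in lambdas are hypothetical names, and the two boolean flags only mimic the kind of ablation variants the paper compares.

```python
from dataclasses import dataclass
from typing import Callable

# A stage is any callable that maps text to text (hypothetical abstraction).
Stage = Callable[[str], str]


@dataclass
class RiroPipeline:
    reformulate: Stage  # input layer: normalize the raw user story
    generate: Stage     # fine-tuned backbone: story -> draft test case
    reshape: Stage      # output layer: clean and format the draft

    def run(self, user_story: str,
            use_reformulation: bool = True,
            use_reshaping: bool = True) -> str:
        # Each flag switches one layer off, which is how an ablation
        # variant could be emulated with the same object.
        text = self.reformulate(user_story) if use_reformulation else user_story
        draft = self.generate(text)
        return self.reshape(draft) if use_reshaping else draft


# Stand-in stages so the sketch runs without any model.
pipeline = RiroPipeline(
    reformulate=lambda s: " ".join(s.split()),
    generate=lambda s: f"TEST CASE for: {s}",
    reshape=lambda s: s.strip() + "\n",
)

print(pipeline.run("  As a user,   I want to log in  "))
```

In practice each lambda would be replaced by a call to a prompted or fine-tuned model; keeping the stages as plain callables makes it easy to swap them independently.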

Key Findings

Full RIRO pipeline increases BLEU from 0.55 (Phi-2 baseline) to 0.72

Numbers: BLEU 0.55 (Phi-2) → 0.72 (RIRO)

RIRO raises semantic similarity (cosine) from 0.816 to 0.891 on evaluated data

Numbers: Cosine 0.816 → 0.891

Character-edit distance to references improves: Levenshtein drops from 1157.62 to 1000.88

Numbers: Levenshtein 1157.62 → 1000.88

Ablation shows the full pipeline (Reformulation + Fine-tune + Reshape) outperforms variants that drop either reformulation or reshaping

Numbers: RIRO outperforms Reshaping and Refining across BLEU/ROUGE/Cosine

QLoRA is used to enable fine-tuning on small datasets with lower compute

Numbers: Method uses QLoRA (no compute numbers reported)

Results

BLEU

Value: 0.72 (RIRO)

Baseline: 0.55 (Phi-2)

ROUGE-1 (F1)

Value: 0.402 (RIRO)

Baseline: 0.265 (Phi-2)

ROUGE-2 (F1)

Value: 0.149 (RIRO)

Baseline: 0.128 (Phi-2)

ROUGE-L (F1)

Value: 0.257 (RIRO)

Baseline: 0.172 (Phi-2)

Levenshtein Distance

Value: 1000.88 (RIRO)

Baseline: 1157.62 (Phi-2)

Cosine Similarity

Value: 0.891 (RIRO)

Baseline: 0.816 (Phi-2)
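
To compare these metrics on one scale, it can help to convert each pair into a relative change against the baseline. The short script below does only that arithmetic on the numbers reported above; the `results` dictionary and `relative_change` helper are illustrative names. Note that Levenshtein distance is a lower-is-better metric, so a negative change there is an improvement.

```python
# (baseline, RIRO) pairs copied from the results above.
results = {
    "BLEU": (0.55, 0.72),
    "ROUGE-1 (F1)": (0.265, 0.402),
    "ROUGE-2 (F1)": (0.128, 0.149),
    "ROUGE-L (F1)": (0.172, 0.257),
    "Levenshtein Distance": (1157.62, 1000.88),  # lower is better
    "Cosine Similarity": (0.816, 0.891),
}


def relative_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


for metric, (baseline, riro) in results.items():
    print(f"{metric:22s} {relative_change(baseline, riro):+6.1f}%")
```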

Who Should Care

What To Try In 7 Days

Add a prompt-based input normalizer that rewrites user stories into a fixed template.

Fine-tune a backbone with QLoRA on a small domain set to adapt behavior.

Add a lightweight output-reshaper pass to clean and format generated test cases.
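
A first experiment with the input normalizer and the output reshaper could look like the sketch below. Everything in it is an assumption for illustration: the prompt wording and the "As a / I want / so that" template are hypothetical (this summary does not give the paper's actual prompt), and `reshape_test_case` is a rule-based stand-in for what the paper describes as a model pass.

```python
import re

# Hypothetical template; the paper's actual prompt is not given here.
REFORMULATION_PROMPT = (
    "Rewrite the user story below in exactly this form:\n"
    "As a <role>, I want <goal>, so that <benefit>.\n"
    "Do not add or remove requirements.\n\n"
    "User story: {story}\n"
    "Rewritten:"
)

# Leading bullets or numbering such as "1. ", "2) ", "- ", "* ".
_LIST_PREFIX = re.compile(r"^\s*(?:\d+[.)]\s+|[-*•]\s+)")


def build_reformulation_prompt(story: str) -> str:
    """Collapse whitespace in the story and place it in the template."""
    return REFORMULATION_PROMPT.format(story=" ".join(story.split()))


def reshape_test_case(raw: str) -> str:
    """Rule-based cleanup: drop blank and duplicate lines, renumber steps."""
    steps, seen = [], set()
    for line in raw.splitlines():
        text = _LIST_PREFIX.sub("", line).strip()
        if text and text.lower() not in seen:
            seen.add(text.lower())
            steps.append(text)
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


print(build_reformulation_prompt("As a user\n  I want to log in"))
print(reshape_test_case("1. Open page\n\n- open page\n2) Click login"))
```

The prompt string would be sent to whichever model handles reformulation. A rule-based reshaper like this one is cheap and deterministic, but it only fixes formatting and cannot repair wrong content.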

Optimization Features

Model Optimization

  • LoRA

Training Optimization

  • LoRA
  • Trains only small low-rank adapter matrices while the base weights stay frozen
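
To make the parameter saving concrete: LoRA leaves a weight matrix of shape d_out × d_in frozen and trains two small matrices of shapes d_out × r and r × d_in instead, so the trainable count per adapted matrix is r · (d_out + d_in). The sketch below works through that arithmetic. The width of 2560 and rank of 8 are illustrative assumptions, not values reported for RIRO.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumptions, not taken from the paper).
d_model = 2560
rank = 8

full = full_params(d_model, d_model)
lora = lora_trainable_params(d_model, d_model, rank)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({lora / full:.3%} of full)")
```

QLoRA applies the same adapters on top of a quantized base model, which lowers the memory needed to hold the frozen weights during training.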

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation limited to a subset of a user-story dataset; generalization untested.
  • No public code or reproducibility artifacts provided.
  • Extra model passes increase inference cost and latency.
  • Risk of overfitting small datasets despite QLoRA.

When Not To Use

  • In high-stakes domains without rigorous human validation
  • When you already have large labeled datasets for direct fine-tuning
  • When strict low-latency inference is required

Failure Modes

  • Reformulation changes user intent and causes incorrect test cases
  • Overfitting to small domain leads to brittle behavior on new inputs
  • Output reshaper may mask semantic errors while improving surface metrics
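
One inexpensive guard against the first failure mode is to check how many content words of the original story survive reformulation, and to route low-overlap rewrites to a human. This is a hypothetical heuristic, not something described in the paper; the stopword list, the function names, and the 0.6 threshold are all assumptions that would need tuning on real data.

```python
import re

# Tiny illustrative stopword list (assumption); extend it for real use.
STOPWORDS = {"a", "an", "the", "as", "i", "to", "my", "by",
             "so", "that", "want", "can", "and", "of"}


def content_tokens(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retained_fraction(original: str, reformulated: str) -> float:
    """Share of the original's content tokens still present after rewriting."""
    orig = content_tokens(original)
    if not orig:
        return 1.0
    return len(orig & content_tokens(reformulated)) / len(orig)


def needs_review(original: str, reformulated: str, threshold: float = 0.6) -> bool:
    """Flag rewrites that dropped too much of the original wording."""
    return retained_fraction(original, reformulated) < threshold


story = "As a user I want to reset my password by email"
faithful = "As a user, I want to reset my password by email, so that I can regain access."
drifted = "As a user, I want to change my username."

print(needs_review(story, faithful))
print(needs_review(story, drifted))
```

Lexical overlap misses paraphrases and negations, so a check like this can only prioritize cases for human review, not replace it.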

Core Entities

Models

  • Phi-2
  • Falcon 7B
  • Falcon 1B
  • LoRA (adapter method used via QLoRA, not a backbone model)

Metrics

  • BLEU
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • Levenshtein Distance
  • Cosine Similarity
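
For readers less familiar with the last two metrics, here is a minimal standard-library sketch of both. Levenshtein distance counts character insertions, deletions, and substitutions, which is why values near 1000 are plausible for long generated test cases. The cosine function below uses simple word-count vectors as an assumption; this summary does not say which text representation the paper used, and embedding-based vectors are a common alternative.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


print(levenshtein("kitten", "sitting"))
print(cosine_similarity("click the login button", "click the submit button"))
```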

Datasets

  • Subset of the user-story neodataset