RIRO: reshape inputs then refine outputs to boost LLMs on tiny domain datasets

Overview

Decision Snapshot: Needs Validation

The approach is practical and improves automatic metrics on a single user-story dataset, but it lacks broad validation, public code, and compute-cost numbers.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.72

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 60%

Authors

Ali Hamdi, Hozaifa Kassab, Mohamed Bahaa, Marwa Mohamed

Why It Matters For Business

RIRO improves LLM output quality when labeled data is scarce, reducing manual test-writing and correction time while keeping compute costs lower via QLoRA.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

RIRO is a simple pipeline that wraps fine-tuning in two refinement layers: (1) a prompt-based reformulation step that normalizes varied user stories, (2) QLoRA fine-tuning on a small domain set, and (3) an optional output-reshaping step. On a user-story → test-case task, the full RIRO pipeline improved BLEU from 0.55 to 0.72 and cosine similarity from 0.816 to 0.891 versus a Phi-2 baseline. The method trades extra model passes and tuning effort for more consistent outputs in data-scarce settings.

Problem Statement

Fine-tuning LLMs on small, domain-specific datasets often yields inconsistent, noisy outputs because inputs vary widely in phrasing. Existing fixes (augmentation or strict templates) either add noise or reduce flexibility. The problem: get reliable, structured outputs (test cases) from LLMs when labeled data is limited.

Main Contribution

RIRO pipeline: input reformulation + QLoRA fine-tuning + output reshaping.

Ablation study comparing Reformulation-only, Reshaping-only, and full RIRO pipeline.
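The three ablation variants can be thought of as one pipeline with the reformulation and reshaping stages switched on or off. The sketch below is only an illustration of that structure: `build_pipeline` and the `toy_*` functions are hypothetical stand-ins, and in a real setup `generate` would call the QLoRA fine-tuned model.

```python
from typing import Callable

TextFn = Callable[[str], str]


def identity(text: str) -> str:
    """No-op stage, used when a layer is switched off."""
    return text


def build_pipeline(generate: TextFn,
                   reformulate: TextFn = identity,
                   reshape: TextFn = identity) -> TextFn:
    """Compose reformulate -> generate -> reshape into one callable."""
    def run(user_story: str) -> str:
        normalized = reformulate(user_story)   # input layer
        draft = generate(normalized)           # fine-tuned model (stubbed here)
        return reshape(draft)                  # output layer
    return run


# Toy stand-ins (assumptions, not the paper's components).
def toy_reformulate(story: str) -> str:
    return " ".join(story.split()).rstrip(".") + "."


def toy_generate(story: str) -> str:
    return f"test case for: {story}"


def toy_reshape(draft: str) -> str:
    return draft[0].upper() + draft[1:]


variants = {
    "reformulation_only": build_pipeline(toy_generate, reformulate=toy_reformulate),
    "reshaping_only": build_pipeline(toy_generate, reshape=toy_reshape),
    "full_riro": build_pipeline(toy_generate, toy_reformulate, toy_reshape),
}

if __name__ == "__main__":
    for name, pipeline in variants.items():
        print(name, "->", pipeline("As a  user I want   login"))
```

Keeping each stage as a plain function makes it easy to run the same evaluation set through every variant and compare metrics side by side.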

Key Findings

Full RIRO pipeline increases BLEU from 0.55 (Phi-2 baseline) to 0.72

Numbers: BLEU 0.55 (Phi-2) → 0.72 (RIRO)

Practical Use: If you need a closer surface-level match to references, add input reformulation before fine-tuning and output reshaping after it.

Evidence Ref: Table 1

RIRO raises semantic similarity (cosine) from 0.816 to 0.891 on the evaluated data

Numbers: Cosine similarity 0.816 → 0.891

Practical Use: To better preserve meaning relative to references, prefer the combined pipeline over single-pass fine-tuning.

Evidence Ref: Table 1
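For readers unfamiliar with the cosine metric, here is a minimal sketch of how a cosine similarity between a generated test case and its reference can be computed. It assumes plain word-count vectors; this summary does not say which text representation the paper used, so treat the vectorization as an assumption and the function name `cosine_similarity` as illustrative.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over the words both texts share.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    generated = "verify login works"
    reference = "verify login fails"
    print(round(cosine_similarity(generated, reference), 3))
```

With embedding vectors instead of word counts the formula is the same; only the way the two vectors are produced changes.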

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
BLEU	0.72 (RIRO)	0.55 (Phi-2)	+0.17	user-story→test-case task (neodataset subset)	Table 1 reports BLEU 0.72 for RIRO vs 0.55 for Phi-2	Table 1
ROUGE-1 (F1)	0.402 (RIRO)	0.265 (Phi-2)	+0.137	user-story→test-case task	Table 1 ROUGE-1 values	Table 1
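The Delta column is a plain difference between the RIRO value and the Phi-2 baseline. The short sketch below recomputes it from the two rows above and also derives the relative gain, which the table does not list; the helper name `deltas` is illustrative.

```python
# (baseline, RIRO) pairs copied from the results table above.
results = {
    "BLEU": (0.55, 0.72),
    "ROUGE-1 (F1)": (0.265, 0.402),
}


def deltas(baseline: float, value: float) -> tuple[float, float]:
    """Return (absolute delta, relative gain in percent), rounded for display."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return round(absolute, 3), round(relative_pct, 1)


if __name__ == "__main__":
    for metric, (baseline, value) in results.items():
        absolute, relative_pct = deltas(baseline, value)
        print(f"{metric}: +{absolute} absolute, +{relative_pct}% relative")
```

In relative terms the BLEU gain is roughly 31% and the ROUGE-1 gain roughly 52% over the baseline.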

What To Try In 7 Days

Add a prompt-based input normalizer that rewrites user stories into a fixed template.

Fine-tune a backbone with QLoRA on a small domain set to adapt behavior.

Add a lightweight output-reshaper pass to clean and format generated test cases.
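The first and third steps above can be prototyped without any model code. The sketch below shows one possible prompt template for the input normalizer and a rule-based output reshaper. Both are assumptions for illustration: the template wording, the function names, and the regex cleanup are not taken from the paper, and the reshaper here is a rule-based stand-in rather than a model pass.

```python
import re

# Hypothetical template; adjust the target form to your own story format.
REFORMULATION_PROMPT = (
    "Rewrite the user story below in exactly this form:\n"
    "As a <role>, I want <goal>, so that <benefit>.\n"
    "Keep the original intent and do not add requirements.\n\n"
    "User story: {story}\n"
    "Rewritten:"
)


def build_reformulation_prompt(story: str) -> str:
    """Collapse whitespace in the story and fill it into the template."""
    return REFORMULATION_PROMPT.format(story=" ".join(story.split()))


def reshape_test_case(raw: str) -> str:
    """Strip bullets or old numbering, drop blank lines, renumber the steps."""
    steps = []
    for line in raw.splitlines():
        cleaned = re.sub(r"^\s*(?:[-*•]+|\d+[.)])\s*", "", line).strip()
        if cleaned:
            steps.append(cleaned)
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


if __name__ == "__main__":
    print(build_reformulation_prompt("users   should be able to reset passwords"))
    print(reshape_test_case("- Open login page\n\n2) Enter password\n  * Click submit "))
```

The prompt string would be sent to whichever model handles reformulation; the reshaper runs on the fine-tuned model's raw output before it reaches a reviewer.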

Optimization Features

Model Optimization

LoRA

Training Optimization

LoRA: updates only a small set of low-rank adapter parameters (rank reduction)
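To make the "rank reduction" idea concrete, the sketch below counts trainable parameters for a single weight matrix under LoRA, where the frozen weight W (d_out x d_in) is left untouched and only two small matrices A (rank x d_in) and B (d_out x rank) are trained. The layer size and rank are illustrative assumptions, not values reported for RIRO.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


if __name__ == "__main__":
    # Illustrative sizes only (assumption): a square 2560 x 2560 projection, rank 8.
    d_in = d_out = 2560
    rank = 8
    full = full_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"full: {full}, LoRA: {lora}, fraction: {lora / full:.5f}")
```

QLoRA applies the same adapter idea on top of a base model whose frozen weights are stored in 4-bit precision, which is where the memory savings come from.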

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluation limited to a subset of a user-story dataset; generalization untested.

No public code or reproducibility artifacts provided.

When Not To Use

In high-stakes domains without rigorous human validation

When you already have large labeled datasets for direct fine-tuning

Failure Modes

Reformulation changes user intent and causes incorrect test cases

Overfitting to small domain leads to brittle behavior on new inputs
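One cheap way to watch for the first failure mode is to check how many content words of the original story survive reformulation and route low-overlap cases to a human. This is a heuristic sketch, not something described in the paper: the stopword list, the function names, and the 0.6 threshold are all assumptions to tune on your own data.

```python
import re

# Small illustrative stopword list (assumption); extend for real use.
STOPWORDS = {
    "a", "an", "the", "as", "i", "to", "so", "that", "my", "can",
    "by", "want", "of", "and", "in", "on", "for",
}


def content_terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def retained_fraction(original: str, reformulated: str) -> float:
    """Share of the original story's content terms that survive reformulation."""
    original_terms = content_terms(original)
    if not original_terms:
        return 1.0
    kept = original_terms & content_terms(reformulated)
    return len(kept) / len(original_terms)


def needs_review(original: str, reformulated: str, threshold: float = 0.6) -> bool:
    """Flag a reformulation for human review when too many terms were lost."""
    return retained_fraction(original, reformulated) < threshold


if __name__ == "__main__":
    story = "User can reset password by email"
    print(needs_review(story, "As a user, I want to reset my password by email."))
    print(needs_review(story, "As a user, I want to delete my account."))
```

Term overlap will miss paraphrases and synonyms, so it works best as a coarse filter in front of human validation rather than a replacement for it.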

Core Entities

Models

Phi-2, Falcon 7B, Falcon 1B, LoRA

Metrics

BLEU, ROUGE-1, ROUGE-2, ROUGE-L, Levenshtein Distance, Cosine Similarity
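Of the metrics listed, Levenshtein distance is the easiest to reproduce by hand: it counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a standard dynamic-programming sketch; whether the paper computed it over characters or tokens is not stated in this summary, so the character-level choice is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using two rolling rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Lower is better for this metric, unlike BLEU, ROUGE, and cosine similarity, where higher values indicate a closer match to the reference.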

Datasets

subset of user story neodataset
