A practical recipe that turns a 3B open base model into competitive instruction- and preference-aligned chat models using QLoRA, synthetic-m

April 18, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper demonstrates a practical, low-cost pipeline and releases models and data. However, evaluations rely on closed judges (GPT‑4, Claude‑2.1), the datasets are relatively small, and DPO was run for only one epoch. Treat the results as promising prototypes, not production-ready guarantees.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 50%

Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake


Why It Matters For Business

You can obtain a commercially usable, preference-aligned 3B chat model with modest compute and open licensing by combining synthetic instruction data, QLoRA adapters, and DPO. This is useful where cost, transparency, and permissive reuse matter.

Summary TLDR

The authors build and release the OpenBezoar family: QLoRA-based adapters and merged checkpoints derived from OpenLLaMA 3B v2. They synthesize instruction data with an open Falcon-40B variant across three schemes (LaMini, Evol-Instruct, Orca), filter generations with GPT-4, fine-tune sequentially with QLoRA, then merge and apply Direct Preference Optimization (DPO) on a subset of the HH‑RLHF dataset. The final DPO checkpoint shows consistent small benchmark gains and strong human-preference alignment on MT‑Bench judged by Claude‑2.1. Code, checkpoints and generated datasets are publicly released.

Problem Statement

How to cheaply and openly produce a competitive 3B instruction-following and preference-aligned chat model using synthetic instruction data, resource‑efficient fine-tuning (QLoRA), and Direct Preference Optimization (DPO). The goal is practical: low compute, permissive licensing for downstream use, and measurable alignment to human preferences.

Main Contribution

A full recipe that turns OpenLLaMA 3B v2 into three released checkpoints (SFT, HH-RLHF-SFT, HH-RLHF-DPO) using QLoRA and DPO.

Synthetic instruction datasets generated from an open Falcon-40B variant via three schemes (LaMini, Evol-Instruct, Orca) and filtered with GPT‑4; datasets and checkpoints released.
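
The generate-then-filter step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `filter_pairs`, `judge_score`, the 1-to-5 scale, and the `min_score` threshold are all hypothetical names and values. In the paper the judge is GPT‑4; here any callable stands in for it, and `toy_judge` exists only so the sketch runs locally.

```python
from typing import Callable, Iterable

Pair = dict[str, str]  # {"instruction": ..., "response": ...}


def filter_pairs(
    pairs: Iterable[Pair],
    judge_score: Callable[[Pair], float],  # hypothetical: would wrap a call to a judge model
    min_score: float = 4.0,                # assumed threshold on an assumed 1-5 scale
) -> list[Pair]:
    """Keep only non-empty pairs that the judge rates at or above min_score."""
    kept = []
    for pair in pairs:
        # Drop empty or whitespace-only generations before spending a judge call.
        if not pair["instruction"].strip() or not pair["response"].strip():
            continue
        if judge_score(pair) >= min_score:
            kept.append(pair)
    return kept


def toy_judge(pair: Pair) -> float:
    """Stand-in judge: scores by word count, capped at 5 (not a real quality signal)."""
    return min(5.0, float(len(pair["response"].split())))


samples = [
    {"instruction": "Name a primary color.", "response": "Red is a primary color."},
    {"instruction": "Name a primary color.", "response": "Red"},
    {"instruction": "Explain gravity.", "response": "   "},
]
kept = filter_pairs(samples, toy_judge)
print(len(kept))  # 1
```

Keeping the judge behind a plain callable makes it easy to swap in an open model later, which matters if a closed judge is not acceptable in your toolchain.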

Key Findings

DPO checkpoint improved LM‑Eval average over base model

Numbers: avg 0.5926 vs base 0.5704 (Table 6)

Practical Use: A small 3B model can gain ~0.022 absolute average score on standard benchmarks by applying the paper's SFT → merge → DPO pipeline; try this pipeline for modest benchmark boosts.

Evidence Ref: Section 4.1, Table 6

SFT improved specific tasks substantially versus base

Numbers: TruthfulQA +14.18%, OpenBookQA +8.84%, MMLU +4.29%, avg +1.48%

Practical Use: Even limited synthetic SFT can meaningfully raise accuracy on targeted tasks; focus SFT mixes on tasks you care about to magnify gains.

Evidence Ref: Introduction and Table 6

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
LM-Eval average (base) | 0.5704 | - | - | LM-Eval-Harness | Table 6 average for OpenLLaMA 3B v2 | Table 6
LM-Eval average (OpenBezoar-HH-RLHF-DPO) | 0.5926 | base 0.5704 | +0.0222 absolute | LM-Eval-Harness | Table 6 average for OpenBezoar-HH-RLHF-DPO | Table 6
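
As a quick sanity check on the delta in the results above, the arithmetic can be reproduced in a few lines. The two averages come from Table 6 as quoted here; the relative gain is derived in this sketch and is not a figure reported in the summary.

```python
base_avg = 0.5704  # OpenLLaMA 3B v2, Table 6 average
dpo_avg = 0.5926   # OpenBezoar-HH-RLHF-DPO, Table 6 average

absolute_delta = dpo_avg - base_avg
relative_gain = absolute_delta / base_avg  # derived here, not reported in the table

print(f"absolute: {absolute_delta:+.4f}")  # absolute: +0.0222
print(f"relative: {relative_gain:.2%}")    # relative: 3.89%
```

The absolute gain is about 2.2 points on a 0-1 scale, which corresponds to a relative improvement of just under 4% over the base average.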

What To Try In 7 Days

Run QLoRA on an accessible 3B base (OpenLLaMA 3B v2) using one synthetic instruction mix.

Filter generated instruction-response pairs with a strong judge (e.g., GPT‑4) and inspect quality manually.

Merge adapters into the base, then run one epoch of DPO on a small preference subset to test alignment gains.
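
To make the DPO step concrete, the per-pair loss can be sketched in plain Python. This follows the standard DPO objective rather than the authors' training code, and it assumes each input is the summed token log-probability of a response under either the policy being trained or the frozen reference (here, the merged SFT checkpoint). The default `beta` is an assumed value; this summary does not state the one used in the paper.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """DPO loss for one (chosen, rejected) preference pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any update the policy equals the reference, so the loss is ln(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy shifts probability toward the chosen response, the loss falls.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)

print(round(start, 4))  # 0.6931
print(improved < start)  # True
```

In practice a library trainer would compute these log-probabilities in batches; the sketch only shows what a single epoch over preference pairs is optimizing.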

Optimization Features

Infra Optimization: Experimentation on two T4 GPUs and Kaggle free resources to demonstrate low-budget feasibility

Model Optimization: LoRA

Training Optimization: LoRA; single-epoch DPO on merged checkpoint to reduce compute
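
Merging a LoRA adapter into the base, as the pipeline does before DPO, amounts to adding a scaled low-rank product to each adapted weight matrix. The sketch below shows the standard LoRA merge, W + (alpha / r) · B · A, on tiny nested lists. The function names and toy values are illustrative assumptions; real merges operate on full tensors, usually in half or full precision rather than on the 4-bit quantized weights.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (m x k) matrix by a (k x n) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
lora_b = [[1.0], [2.0]]            # d_out x r, with r = 1
lora_a = [[0.5, 0.0]]              # r x d_in

merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)  # [[2.0, 0.0], [2.0, 1.0]]
```

After merging, the adapter adds no inference overhead, and the merged checkpoint can serve as both the starting policy and the frozen reference for DPO.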

Risks & Boundaries

Limitations

Synthetic data generated from an open Falcon-40B variant contains noisy/irregular outputs requiring manual fixes.

Datasets are relatively small compared to top instruction-tuned models; results may not scale linearly.

When Not To Use

Where high-stakes, robust factual accuracy is required (medical, legal, safety-critical).

When strict open-source-only toolchains are required (paper uses closed-source GPT‑4/Claude for filtering/evaluation).

Failure Modes

Overfitting or degeneration if DPO or SFT is run beyond safe hyperparameter ranges.

Regression on certain tasks (observed MMLU drop after DPO).
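
One way to catch the task-level regressions listed above is to compare per-task scores before and after each stage instead of relying on the average alone. The helper below is a hypothetical sketch; the task names and scores are made-up placeholders, not figures from the paper.

```python
def find_regressions(
    before: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {task: delta} for tasks whose score dropped by more than tolerance."""
    return {
        task: after[task] - before[task]
        for task in before
        if task in after and before[task] - after[task] > tolerance
    }


# Placeholder scores for illustration only.
sft_scores = {"task_a": 0.30, "task_b": 0.40}
dpo_scores = {"task_a": 0.28, "task_b": 0.45}

regressions = find_regressions(sft_scores, dpo_scores)
print(list(regressions))  # ['task_a']
```

A non-zero `tolerance` helps ignore run-to-run noise, so only drops larger than the expected variance are flagged.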

Core Entities

Models

OpenLLaMA 3B v2, SFT, OpenBezoar-HH-RLHF-DPO, h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2

Metrics

Accuracy, LM‑Eval average score, MT‑Bench average score, judge agreement (%)

Datasets

databricks/dolly-15k, FLAN Collection, HH-RLHF (Anthropic), LaMini-style synthetic, Evol-Instruct-style synthetic, Orca-style synthetic

Benchmarks

LM-Eval-Harness (selected tasks), MT-Bench