A practical recipe that turns a 3B open base model into competitive instruction- and preference-aligned chat models using QLoRA, synthetic-m

Overview

Decision Snapshot: Needs Validation

The paper demonstrates a practical, low-cost pipeline and releases models and data, but the evaluations rely on closed judges (GPT‑4, Claude‑2.1), the datasets are relatively small, and DPO was run for only one epoch. Treat the results as promising prototypes, not production-ready guarantees.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/7

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 30%

Novelty: 50%

Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake

Why It Matters For Business

You can obtain a commercially usable, preference-aligned 3B chat model with modest compute and open licensing by combining synthetic instruction data, QLoRA adapters, and DPO — useful where cost, transparency, and permissive reuse matter.

Who Should Care

ML Engineer, Product Manager, Founder, CTO

Summary TLDR

The authors build and release the OpenBezoar family: QLoRA-based adapters and merged checkpoints derived from OpenLLaMA 3B v2. They synthesize instruction data with an open Falcon-40B variant across three schemes (LaMini, Evol-Instruct, Orca), filter generations with GPT-4, fine-tune sequentially with QLoRA, then merge and apply Direct Preference Optimization (DPO) on a subset of the HH‑RLHF dataset. The final DPO checkpoint shows consistent small benchmark gains and strong human-preference alignment on MT‑Bench judged by Claude‑2.1. Code, checkpoints and generated datasets are publicly released.

Problem Statement

How to cheaply and openly produce a competitive 3B instruction-following and preference-aligned chat model using synthetic instruction data, resource‑efficient fine-tuning (QLoRA), and Direct Preference Optimization (DPO). The goal is practical: low compute, permissive licensing for downstream use, and measurable alignment to human preferences.

Main Contribution

A full recipe that turns OpenLLaMA 3B v2 into three released checkpoints (SFT, HH-RLHF-SFT, HH-RLHF-DPO) using QLoRA and DPO.

Synthetic instruction datasets generated from an open Falcon-40B variant via three schemes (LaMini, Evol-Instruct, Orca) and filtered with GPT‑4; datasets and checkpoints released.

Key Findings

DPO checkpoint improved LM‑Eval average over base model

Numbers: avg 0.5926 vs base 0.5704 (Table 6)

Practical Use: A small 3B model can gain ~0.022 absolute average score on standard benchmarks by applying the paper's SFT → merge → DPO pipeline; try this pipeline for modest benchmark boosts.

Evidence Ref: Section 4.1, Table 6

SFT improved specific tasks substantially versus base

Numbers: TruthfulQA +14.18%, OpenBookQA +8.84%, MMLU +4.29%, avg +1.48%

Practical Use: Even limited synthetic SFT can meaningfully raise accuracy on targeted tasks; focus SFT mixes on tasks you care about to magnify gains.

Evidence Ref: Introduction and Table 6

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
LM-Eval average (base)	0.5704	—	—	LM-Eval-Harness	Table 6 average for OpenLLaMA 3B v2	Table 6
LM-Eval average (OpenBezoar-HH-RLHF-DPO)	0.5926	base 0.5704	+0.0222 absolute	LM-Eval-Harness	Table 6 average for OpenBezoar-HH-RLHF-DPO	Table 6
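The delta in the table can be checked with a few lines of arithmetic. The sketch below uses only the two Table 6 averages quoted above; the helper name `score_delta` is hypothetical, and the relative figure is derived here rather than reported in the summary.

```python
from typing import Tuple

# Averages copied from the Results table above (Table 6).
BASE_AVG = 0.5704  # OpenLLaMA 3B v2
DPO_AVG = 0.5926   # OpenBezoar-HH-RLHF-DPO


def score_delta(baseline: float, value: float) -> Tuple[float, float]:
    """Return (absolute delta, relative delta in percent) of value over baseline."""
    absolute = value - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


absolute, relative_pct = score_delta(BASE_AVG, DPO_AVG)
print(f"absolute: {absolute:+.4f}")       # absolute: +0.0222
print(f"relative: {relative_pct:+.2f}%")  # relative: +3.89%
```

Reporting both forms makes it easier to compare this gain with the percentage improvements quoted for individual tasks.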

What To Try In 7 Days

Run QLoRA on an accessible 3B base (OpenLLaMA 3B v2) using one synthetic instruction mix.

Filter generated instruction-response pairs with a strong judge (e.g., GPT‑4) and inspect quality manually.

Merge adapters into the base, then run one epoch of DPO on a small preference subset to test alignment gains.
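The filtering step in the list above could be prototyped as follows. This is a minimal sketch, not the authors' code: `filter_pairs`, `stub_judge`, and the sample pairs are all hypothetical, and the stub stands in for a real call to a strong judge such as GPT‑4, which would be passed in as the `judge` callable.

```python
from typing import Callable, Dict, List

Pair = Dict[str, str]  # expected keys: "instruction" and "response"


def filter_pairs(pairs: List[Pair], judge: Callable[[Pair], bool]) -> List[Pair]:
    """Keep only the instruction-response pairs that the judge accepts."""
    kept = []
    for pair in pairs:
        # Cheap local check first, so empty generations never reach the judge.
        if not pair["instruction"].strip() or not pair["response"].strip():
            continue
        if judge(pair):
            kept.append(pair)
    return kept


def stub_judge(pair: Pair) -> bool:
    """Hypothetical stand-in for an LLM judge: accept responses of 5+ words."""
    return len(pair["response"].split()) >= 5


# Illustrative samples only; these are not taken from the released datasets.
samples = [
    {"instruction": "Explain what QLoRA is.",
     "response": "QLoRA fine-tunes low-rank adapters on a 4-bit quantized base model."},
    {"instruction": "Name a 3B base model.", "response": "OpenLLaMA."},
    {"instruction": "Summarize DPO.", "response": "   "},
]

kept = filter_pairs(samples, stub_judge)
print(len(kept), "of", len(samples), "pairs kept")
```

Keeping the judge behind a plain callable makes it easy to swap the stub for an API-backed judge later, and to log the rejected pairs for the manual inspection the step recommends.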

Optimization Features

Infra Optimization

Experimentation on 2xT4 and Kaggle free resources to demonstrate low-budget feasibility
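A rough back-of-the-envelope estimate suggests why a 3B model fits this budget. The sketch assumes a nominal 3e9 parameters and counts weight storage only; it ignores activations, adapter and optimizer state, and quantization metadata, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3e9  # nominal size of a "3B" model (assumption, not an exact count)

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit weights, as used by QLoRA

print(f"fp16 weights:  {fp16_gb:.1f} GB")  # fp16 weights:  6.0 GB
print(f"4-bit weights: {nf4_gb:.1f} GB")   # 4-bit weights: 1.5 GB
```

Either figure is below the 16 GB of a single T4, but the 4-bit footprint leaves far more headroom for activations and adapter training state.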

Model Optimization

LoRA

Training Optimization

LoRA

Single-epoch DPO on merged checkpoint to reduce compute
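For readers unfamiliar with the DPO objective, the per-example loss can be sketched in plain Python. This follows the standard DPO formulation, not the authors' training code; the function name, the `beta=0.1` default, and the log-probability values in the usage lines are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Policy already favours the chosen response more than the reference does.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2))  # True
```

In practice the log-probabilities are sums over response tokens from the merged SFT checkpoint (as the reference) and the model being trained, and the loss is averaged over a batch of preference pairs.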

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://bitbucket.org/paladinanalytics/qlora-finetuning
https://bitbucket.org/paladinanalytics/direct-preference-optimization
https://bitbucket.org/paladinanalytics/notebooks
https://bitbucket.org/paladinanalytics/fastchat

Data URLs

https://huggingface.co/datasets/chansurgeplus/ (paper states generated datasets published on HuggingFace)

Risks & Boundaries

Limitations

Synthetic data generated from an open Falcon-40B variant contains noisy/irregular outputs requiring manual fixes.

Datasets are relatively small compared to top instruction-tuned models; results may not scale linearly.

When Not To Use

Where high-stakes, robust factual accuracy is required (medical, legal, safety-critical).

When strict open-source-only toolchains are required (paper uses closed-source GPT‑4/Claude for filtering/evaluation).

Failure Modes

Overfitting or degeneration if DPO or SFT is run beyond safe hyperparameter ranges.

Regression on certain tasks (observed MMLU drop after DPO).

Core Entities

Models

OpenLLaMA 3B v2, SFT, OpenBezoar-HH-RLHF-DPO, h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2

Metrics

Accuracy, LM‑Eval average score, MT‑Bench average score, judge agreement (%)

Datasets

databricks/dolly-15k, FLAN Collection, HH-RLHF (Anthropic), LaMini-style synthetic, Evol-Instruct-style synthetic, Orca-style synthetic

Benchmarks

LM-Eval-Harness (selected tasks), MT-Bench
