A practical recipe that turns a 3B open base model into competitive instruction- and preference-aligned chat models using QLoRA, synthetic-m

April 18, 2024 | 8 min

Overview

Production Readiness

0.3

Novelty Score

0.5

Cost Impact Score

0.7

Citation Count

0

Authors

Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake

Why It Matters For Business

You can obtain a commercially usable, preference-aligned 3B chat model with modest compute and open licensing by combining synthetic instruction data, QLoRA adapters, and DPO — useful where cost, transparency, and permissive reuse matter.

Summary TLDR

The authors build and release the OpenBezoar family: QLoRA-based adapters and merged checkpoints derived from OpenLLaMA 3B v2. They synthesize instruction data with an open Falcon-40B variant across three schemes (LaMini, Evol-Instruct, Orca), filter the generations with GPT-4, fine-tune sequentially with QLoRA, then merge the adapters and apply Direct Preference Optimization (DPO) on a subset of the HH‑RLHF dataset. The final DPO checkpoint raises the LM‑Eval average from 0.5704 to 0.5926 and scores 4.12 on MT‑Bench as judged by Claude‑2.1. Code, checkpoints, and generated datasets are publicly released.
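The filtering step in this pipeline can be sketched as a generic keep-or-drop pass over generated pairs. This is a minimal illustration, not the authors' code: `filter_pairs`, the `Pair` layout, and `toy_judge` are hypothetical names, and `toy_judge` is a cheap stand-in heuristic rather than a GPT-4 call. In practice the `judge` callable would wrap whichever strong model does the filtering.

```python
from typing import Callable, Iterable

# Hypothetical record layout: {"instruction": ..., "response": ...}
Pair = dict[str, str]


def filter_pairs(pairs: Iterable[Pair], judge: Callable[[Pair], bool]) -> list[Pair]:
    """Drop empty and duplicate instructions, then keep pairs the judge accepts."""
    seen: set[str] = set()
    kept: list[Pair] = []
    for pair in pairs:
        instruction = pair["instruction"].strip()
        response = pair["response"].strip()
        if not instruction or not response:
            continue  # nothing to judge
        key = instruction.lower()
        if key in seen:
            continue  # case-insensitive duplicate instruction
        seen.add(key)
        if judge(pair):
            kept.append(pair)
    return kept


def toy_judge(pair: Pair) -> bool:
    """Stand-in heuristic (NOT GPT-4): reject very short or looping responses."""
    words = pair["response"].split()
    return len(words) >= 3 and len(set(words)) / len(words) > 0.5


samples = [
    {"instruction": "Name a prime.", "response": "Two is the smallest prime number."},
    {"instruction": "name a prime.", "response": "Three is also a prime number."},
    {"instruction": "Say hi", "response": "hi hi hi hi hi hi"},
    {"instruction": "", "response": "orphan response text here"},
]
kept = filter_pairs(samples, toy_judge)
```

Keeping the judge behind a plain callable makes it easy to swap the heuristic for a model call and to spot-check rejected pairs by hand.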

Problem Statement

The paper asks how to cheaply and openly produce a competitive 3B instruction-following, preference-aligned chat model using synthetic instruction data, resource‑efficient fine-tuning (QLoRA), and Direct Preference Optimization (DPO). The goals are practical: low compute, permissive licensing for downstream use, and measurable alignment with human preferences.

Main Contribution

A full recipe that turns OpenLLaMA 3B v2 into three released checkpoints (SFT, HH-RLHF-SFT, HH-RLHF-DPO) using QLoRA and DPO.

Synthetic instruction datasets generated from an open Falcon-40B variant via three schemes (LaMini, Evol-Instruct, Orca) and filtered with GPT‑4; datasets and checkpoints released.

Empirical evaluation on LM‑Eval‑Harness (10 tasks) and MT‑Bench (LLM-as-a-judge with Claude‑2.1) showing small but consistent gains and good judged alignment.

Practical notes on cost-effective fine-tuning (QLoRA on consumer GPUs), adapter merging, and applying DPO to a merged model.
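Adapter merging, mentioned in the last contribution, folds the low-rank LoRA update into the base weight: W_merged = W + (alpha / r) * B A. The sketch below shows that arithmetic on tiny nested lists. It illustrates the standard LoRA merge formula only; `merge_lora` and `matmul` are hypothetical helper names, and the matrices are made-up values, not anything from the released checkpoints. With QLoRA the base weights are stored quantized, so a real merge would typically operate on a dequantized or full-precision copy.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a).

    weight: d_out x d_in, lora_b: d_out x r, lora_a: r x d_in.
    """
    rank = len(lora_a)  # r is the number of rows of A
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy values (assumptions for illustration): rank 1, alpha 2
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in
B = [[0.5], [1.0]]    # d_out x r
merged = merge_lora(W, A, B, alpha=2)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

After merging, the adapter is no longer needed at inference time, which is why a later training stage such as DPO can start from a single merged checkpoint.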

Key Findings

DPO checkpoint improved LM‑Eval average over base model

Numbers: avg 0.5926 vs base 0.5704 (Table 6)

SFT improved specific tasks substantially versus base

Numbers: TruthfulQA +14.18%, OpenBookQA +8.84%, MMLU +4.29%, avg +1.48%

DPO improved overall alignment scores on MT‑Bench

Numbers: OpenBezoar‑HH‑RLHF‑DPO avg 4.12 (Table 7)

Judge validation: Claude‑2.1 shows high agreement with humans

Numbers: 88% agreement with human majority (no ties)

On MT‑Bench against similar open 3B models, the DPO model outscores RedPajama‑Chat and Phi‑2 but trails MiniChat‑2‑3B

Numbers: DPO avg 4.12 vs RedPajama‑Chat 1.45 and Phi‑2 3.72; MiniChat‑2‑3B scored 6.43 (Table 8)
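The judge-validation figure above is an agreement rate between the LLM judge and the human majority with ties excluded. A minimal sketch of how such a rate could be computed follows. It assumes one reading of "no ties": comparisons are skipped when the human votes are split evenly, or when either side's verdict is the literal label "tie". The function names and the toy votes are hypothetical, not data from the paper.

```python
from collections import Counter


def human_majority(votes):
    """Return the most common human label, or None if there is no clear winner."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # split vote
    return counts[0][0]


def agreement_rate(judge_labels, human_votes):
    """Fraction of non-tie comparisons where the judge matches the human majority."""
    matches = total = 0
    for judge, votes in zip(judge_labels, human_votes):
        majority = human_majority(votes)
        if majority is None or majority == "tie" or judge == "tie":
            continue  # excluded under the assumed "no ties" rule
        total += 1
        matches += judge == majority
    return matches / total if total else float("nan")


# Toy data (made up): four pairwise comparisons between model A and model B
judge = ["A", "B", "A", "tie"]
humans = [["A", "A", "B"], ["A", "A", "B"], ["A", "B"], ["A", "A"]]
rate = agreement_rate(judge, humans)
print(rate)  # 0.5
```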

Results

LM-Eval average (base)

Value: 0.5704

LM-Eval average (OpenBezoar-HH-RLHF-DPO)

Value: 0.5926

Baseline: base 0.5704

Accuracy (TruthfulQA, SFT vs base)

Value: +14.18%

Baseline: base

Accuracy (OpenBookQA, SFT vs base)

Value: +8.84%

Baseline: base

MT-Bench average (OpenBezoar-HH-RLHF-DPO)

Value: 4.12

Claude-2.1 agreement with human majority (no ties)

Value: 88%

MT-Bench comparison vs other 3B models

Value: DPO avg 4.12; RedPajama 1.45; MiniChat-2-3B 6.43; Phi-2 3.72

Baseline: other open 3B models
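The results mix raw averages with percentage gains, and this summary does not state whether the percentages are absolute points or relative changes. As a reading aid, the sketch below computes both for the two reported LM-Eval averages. The helper name `gains` is hypothetical, and only the two averages come from the text above.

```python
def gains(new, base):
    """Return (absolute difference, relative change in percent)."""
    absolute = new - base
    relative_pct = 100 * absolute / base
    return absolute, relative_pct


# Reported LM-Eval averages: DPO checkpoint vs base model
abs_gain, rel_gain = gains(0.5926, 0.5704)
print(f"{abs_gain:.4f} absolute, {rel_gain:.2f}% relative")
# 0.0222 absolute, 3.89% relative
```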

Who Should Care

What To Try In 7 Days

Run QLoRA on an accessible 3B base (OpenLLaMA 3B v2) using one synthetic instruction mix.

Filter generated instruction-response pairs with a strong judge (e.g., GPT‑4) and inspect quality manually.

Merge adapters into the base, then run one epoch of DPO on a small preference subset to test alignment gains.
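For the DPO step, the per-example objective can be written in a few lines. The sketch below implements the standard DPO loss, -log sigmoid(beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))), on summed sequence log-probabilities. It is an illustration of the formula rather than the authors' training code: the function name is hypothetical, the log-probability values are made up, and `beta=0.1` is an assumed default, not a value taken from the paper.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # How much more the policy favors "chosen" over "rejected" than the reference does
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: zero margin, loss is log(2)
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted toward the chosen response: loss drops below log(2)
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The reference model here is the frozen merged checkpoint, so the loss starts at log(2) and falls only as the policy widens its preference margin relative to that starting point.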

Optimization Features

Infra Optimization

  • Experiments on 2×T4 GPUs and free Kaggle resources demonstrate low-budget feasibility

Model Optimization

  • QLoRA (LoRA adapters on a 4-bit quantized base model)

Training Optimization

  • QLoRA (only the LoRA adapter weights are trained)
  • Single-epoch DPO on merged checkpoint to reduce compute
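
A rough back-of-the-envelope estimate shows why 4-bit quantization makes a 3B model comfortable on a 16 GB T4. The sketch counts weight storage only and assumes a nominal 3e9 parameters. The exact parameter count is not given here, and the estimate ignores quantization constants, activations, adapter weights, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 3e9  # nominal "3B" (assumption; not the exact parameter count)

for label, bits in [("fp16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
# fp16: 5.59 GiB
# 4-bit: 1.40 GiB
```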

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Synthetic data generated from an open Falcon-40B variant contains noisy/irregular outputs requiring manual fixes.
  • Datasets are relatively small compared to top instruction-tuned models; results may not scale linearly.
  • Filtering used GPT‑4 and evaluation used Claude‑2.1, so the workflow depends partly on closed-source services.
  • DPO training was limited to one epoch and merged adapters were used naively; more tuning could change outcomes.
  • Models can still loop, repeat, or respond out-of-context unless the fine-tuned system prompt is used exactly.

When Not To Use

  • Where high-stakes, robust factual accuracy is required (medical, legal, safety-critical).
  • When strict open-source-only toolchains are required (paper uses closed-source GPT‑4/Claude for filtering/evaluation).
  • For production chatbots without additional safety, monitoring, and human oversight.

Failure Modes

  • Overfitting or degeneration if DPO or SFT is run beyond safe hyperparameter ranges.
  • Regression on certain tasks (observed MMLU drop after DPO).
  • Dependence on the precise system prompt learned during SFT; wrong prompts can produce nonsense.
  • Residual hallucinations and toxicity despite alignment steps.

Core Entities

Models

  • OpenLLaMA 3B v2
  • OpenBezoar-SFT
  • OpenBezoar-HH-RLHF-DPO
  • h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2

Metrics

  • Accuracy
  • LM‑Eval average score
  • MT‑Bench average score
  • judge agreement (%)

Datasets

  • databricks/databricks-dolly-15k
  • FLAN Collection
  • HH-RLHF (Anthropic)
  • LaMini-style synthetic
  • Evol-Instruct-style synthetic
  • Orca-style synthetic

Benchmarks

  • LM-Eval-Harness (selected tasks)
  • MT-Bench