Overview
Production Readiness
0.3
Novelty Score
0.5
Cost Impact Score
0.7
Citation Count
0
Why It Matters For Business
You can obtain a commercially usable, preference-aligned 3B chat model with modest compute and open licensing by combining synthetic instruction data, QLoRA adapters, and DPO. This is useful where cost, transparency, and permissive reuse matter.
Summary TLDR
The authors build and release the OpenBezoar family: QLoRA-based adapters and merged checkpoints derived from OpenLLaMA 3B v2. They synthesize instruction data with an open Falcon-40B variant across three schemes (LaMini, Evol-Instruct, Orca), filter generations with GPT-4, fine-tune sequentially with QLoRA, then merge and apply Direct Preference Optimization (DPO) on a subset of the HH‑RLHF dataset. The final DPO checkpoint shows consistent small benchmark gains and strong human-preference alignment on MT‑Bench judged by Claude‑2.1. Code, checkpoints and generated datasets are publicly released.
Problem Statement
The paper asks how to cheaply and openly produce a competitive 3B instruction-following, preference-aligned chat model using synthetic instruction data, resource‑efficient fine-tuning (QLoRA), and Direct Preference Optimization (DPO). The goal is practical: low compute, permissive licensing for downstream use, and measurable alignment to human preferences.
Main Contribution
A full recipe that turns OpenLLaMA 3B v2 into three released checkpoints (SFT, HH-RLHF-SFT, HH-RLHF-DPO) using QLoRA and DPO.
Synthetic instruction datasets generated from an open Falcon-40B variant via three schemes (LaMini, Evol-Instruct, Orca) and filtered with GPT‑4; datasets and checkpoints released.
Empirical evaluation on LM‑Eval‑Harness (10 tasks) and MT‑Bench (LLM-as-a-judge with Claude‑2.1) showing small but consistent gains and good judged alignment.
Practical notes on cost-effective fine-tuning (QLoRA on consumer GPUs), adapter merging, and applying DPO to a merged model.
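The adapter-merging step mentioned above can be illustrated with the standard LoRA update, W' = W + (alpha / r) * B A. The following is a minimal pure-Python sketch on toy matrices, not the authors' code; the function names `matmul` and `merge_lora` and the toy values are hypothetical. In practice a library such as PEFT would typically perform the merge on full model weights.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA merge.

    weight: out x in base matrix
    lora_a: rank x in adapter matrix (A)
    lora_b: out x rank adapter matrix (B)
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example (illustrative values only): 2x2 base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # rank x in = 1 x 2
B = [[0.5], [0.0]]  # out x rank = 2 x 1
merged = merge_lora(W, A, B, alpha=2, rank=1)
```

After merging, the adapter is no longer needed at inference time, which is why a later stage such as DPO can start from a single merged checkpoint.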
Key Findings
DPO checkpoint improved LM‑Eval average over base model
SFT improved specific tasks substantially versus base
DPO improved overall alignment scores on MT‑Bench
Judge validation: Claude‑2.1 shows high agreement with humans
On MT‑Bench against similar open 3B models, DPO model ranks competitively
Results
LM-Eval average (base)
LM-Eval average (OpenBezoar-HH-RLHF-DPO)
Accuracy
Accuracy
MT-Bench average (OpenBezoar-HH-RLHF-DPO)
Claude-2.1 agreement with human majority (no ties)
MT-Bench comparison vs other 3B models
Who Should Care
What To Try In 7 Days
Run QLoRA on an accessible 3B base (OpenLLaMA 3B v2) using one synthetic instruction mix.
Filter generated instruction-response pairs with a strong judge (e.g., GPT‑4) and inspect quality manually.
Merge adapters into the base, then run one epoch of DPO on a small preference subset to test alignment gains.
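The filtering step in the list above can be sketched as a small function that takes any judge as a callable. This is an illustrative sketch only: `filter_pairs`, `toy_judge`, and the `instruction`/`response` field names are hypothetical, and the toy judge merely stands in for a call to a strong LLM judge such as GPT‑4.

```python
from typing import Callable, Dict, List

Pair = Dict[str, str]


def filter_pairs(pairs: List[Pair], judge: Callable[[Pair], bool]) -> List[Pair]:
    """Keep pairs that have non-empty fields and that the judge accepts."""
    kept = []
    for pair in pairs:
        instruction = pair.get("instruction", "").strip()
        response = pair.get("response", "").strip()
        if not instruction or not response:
            continue  # drop incomplete generations before spending judge calls
        if judge(pair):
            kept.append(pair)
    return kept


def toy_judge(pair: Pair) -> bool:
    """Hypothetical stand-in for an LLM judge: reject very short responses."""
    return len(pair["response"].split()) >= 3


samples = [
    {"instruction": "Define QLoRA.", "response": "LoRA adapters on a 4-bit base model."},
    {"instruction": "Define DPO.", "response": "Yes."},
    {"instruction": "", "response": "An orphaned response with no instruction."},
]
kept = filter_pairs(samples, toy_judge)
```

Keeping the judge behind a plain callable makes it easy to swap in a different judge or to log rejected pairs for the manual inspection the step recommends.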
Optimization Features
Infra Optimization
- Experiments on 2x T4 GPUs and free Kaggle resources to demonstrate low-budget feasibility
Model Optimization
- QLoRA
Training Optimization
- QLoRA
- Single-epoch DPO on merged checkpoint to reduce compute
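For readers unfamiliar with DPO, the per-example objective is the negative log-sigmoid of a scaled log-probability margin between the chosen and rejected responses, measured relative to a frozen reference model. Below is a minimal sketch of that loss in plain Python. The function name, argument names, and the `beta=0.1` default are assumptions for illustration and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model. beta is an assumed value.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

Because the reference model here would be the merged SFT checkpoint, the loss only rewards movement away from it in the direction of the preferred response.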
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Synthetic data generated from an open Falcon-40B variant contains noisy/irregular outputs requiring manual fixes.
- Datasets are relatively small compared to top instruction-tuned models; results may not scale linearly.
- Filtering used GPT‑4 and evaluation used Claude‑2.1, so the workflow depends partly on closed-source services.
- DPO training was limited to one epoch and merged adapters were used naively; more tuning could change outcomes.
- Models can still loop, repeat, or respond out-of-context unless the fine-tuned system prompt is used exactly.
When Not To Use
- Where high-stakes, robust factual accuracy is required (medical, legal, safety-critical).
- When strict open-source-only toolchains are required (paper uses closed-source GPT‑4/Claude for filtering/evaluation).
- For production chatbots without additional safety, monitoring, and human oversight.
Failure Modes
- Overfitting or degeneration if DPO or SFT is run beyond safe hyperparameter ranges.
- Regression on certain tasks (observed MMLU drop after DPO).
- Dependence on the precise system prompt learned during SFT; wrong prompts can produce nonsense.
- Residual hallucinations and toxicity despite alignment steps.
Core Entities
Models
- OpenLLaMA 3B v2
- OpenBezoar-SFT
- OpenBezoar-HH-RLHF-DPO
- h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
Metrics
- Accuracy
- LM‑Eval average score
- MT‑Bench average score
- judge agreement (%)
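As an illustration of the judge agreement metric, the sketch below computes the percentage of comparisons where the LLM judge matches the human majority after excluding ties. The exact protocol used in the paper is not given here, so treating a comparison as excluded when either side votes "tie" is an assumption, and `agreement_rate` and the vote labels are hypothetical.

```python
def agreement_rate(judge_votes, human_votes):
    """Percent of non-tie comparisons where the judge matches the human majority.

    Votes are labels such as "A", "B", or "tie". Comparisons where either
    side voted "tie" are skipped (an assumed reading of "no ties").
    """
    compared = [
        (j, h) for j, h in zip(judge_votes, human_votes)
        if j != "tie" and h != "tie"
    ]
    if not compared:
        return float("nan")  # nothing left to compare
    matches = sum(j == h for j, h in compared)
    return 100.0 * matches / len(compared)


# Toy votes (illustrative only): two comparable pairs, one of which matches.
judge = ["A", "B", "tie", "A"]
human = ["A", "A", "B", "tie"]
rate = agreement_rate(judge, human)
```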
Datasets
- databricks/databricks-dolly-15k
- FLAN Collection
- HH-RLHF (Anthropic)
- LaMini-style synthetic
- Evol-Instruct-style synthetic
- Orca-style synthetic
Benchmarks
- LM-Eval-Harness (selected tasks)
- MT-Bench

