Why Fine-Tune Instead of Prompting?
General-purpose language models are excellent at a broad range of tasks, but they are optimized to be generalists. When your application needs consistent output formatting, domain-specific vocabulary, proprietary terminology, or a behavioral style that prompt engineering cannot reliably enforce, fine-tuning is the answer: it changes the model's weights rather than its instructions.
The distinction from RAG is important: Retrieval-Augmented Generation supplements a frozen model with retrieved context at inference time — useful for keeping the model current and reducing hallucinations on factual queries. Fine-tuning bakes knowledge, style, and behavior into the weights themselves — better for consistent format adherence, specialized reasoning patterns, and lower inference latency since no retrieval hop is required.
| Signal | Choose |
|---|---|
| Need consistent output format (JSON, YAML, specific schema) | Fine-tune |
| Domain terminology the base model consistently gets wrong | Fine-tune |
| High-volume inference where latency and cost matter | Fine-tune (smaller model) |
| Private knowledge base updated weekly | RAG |
| Task solvable with better system prompt and examples | Prompt / Few-shot |
| Safety-critical tone or guardrails prompts cannot enforce | Fine-tune |
Choosing the Right Base Model
The base model determines your ceiling. A fine-tune can specialize a capable model but cannot teach reasoning capabilities that the base model lacks. For domain adaptation tasks, starting with a 7B–8B instruction-tuned checkpoint is almost always the right default — it fits in 24 GB VRAM with QLoRA, trains in hours rather than days, and serves at under 100ms latency on a single A10G.
Llama 3.1 / 3.2 (Meta)
Best general-purpose base for English-heavy tasks. Strong instruction following and function-calling. The 8B variant fits on 1× A10G with QLoRA. License allows commercial use with restrictions above 700M monthly active users.
Mistral 7B / Mixtral 8×7B
Sliding window attention makes Mistral efficient at long context. The Mixtral MoE architecture activates only 2 of 8 experts per token — high quality at lower inference cost. Apache 2.0 license — fully permissive for commercial use.
Gemma 2 (Google)
Excellent multilingual and code performance. The 9B model outperforms Llama 3.1 8B on many academic benchmarks. Gemma license permits commercial use; requires accepting Google's terms of service on Hugging Face.
Phi-3 / Phi-3.5 (Microsoft)
3.8B model trained on synthetic high-quality data. Exceptional reasoning per parameter — competitive with 7B models on many tasks. Fits on a single consumer GPU. MIT license. Ideal for edge deployment or tight compute budgets.
Note
Start from the instruction-tuned checkpoint (e.g. meta-llama/Meta-Llama-3.1-8B-Instruct), not the raw pretrained base. Instruction tuning has already taught the model to follow directions — your fine-tune only needs to adapt style and domain, not rebuild instruction-following from scratch.

Data Preparation: Quality Over Quantity
A clean dataset of 500–2,000 high-quality instruction pairs routinely outperforms a noisy dataset of 50,000 examples. The goal is not to teach the model new knowledge — it already has broad domain knowledge from pretraining — but to teach it how to respond in your domain's voice, format, and scope.
The industry standard for instruction fine-tuning is the chat template format — a structured conversation between a system prompt, user, and assistant turns. Hugging Face's apply_chat_template method handles tokenization correctly per model family, including special tokens like <|begin_of_text|>.
# Training data format — chat template (JSONL)
# Each line is one independent training example (pretty-printed here for readability).
{"messages": [
{"role": "system", "content": "You are a PostgreSQL expert for the DataOps team. Answer concisely and always include the minimum PostgreSQL version required."},
{"role": "user", "content": "How do I create a non-blocking index on a large table?"},
{"role": "assistant", "content": "Use CREATE INDEX CONCURRENTLY (PostgreSQL 8.2+). This takes a ShareUpdateExclusiveLock instead of AccessExclusiveLock, allowing reads and writes during index construction.
```sql
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
```
CONCURRENTLY cannot run inside a transaction block — execute it outside BEGIN/COMMIT."}
]}
{"messages": [
{"role": "system", "content": "You are a PostgreSQL expert for the DataOps team. Answer concisely and always include the minimum PostgreSQL version required."},
{"role": "user", "content": "What causes index bloat and how do I fix it?"},
{"role": "assistant", "content": "Index bloat occurs when UPDATE and DELETE operations leave dead tuples that autovacuum has not yet reclaimed.
Fix options:
1. REINDEX CONCURRENTLY (PG 12+) — rebuilds without a long lock:
```sql
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
```
2. pg_repack — rewrites the table and its indexes online with minimal locking.
Measure bloat first with the pgstattuple extension before deciding."}
]}

Before training, deduplicate and filter your dataset. Near-duplicate examples waste model capacity and can cause overfitting on repeated patterns. The script below uses MinHash LSH for fuzzy deduplication — it catches paraphrased duplicates that exact-match deduplication misses.
from datasets import load_dataset, Dataset
from datasketch import MinHash, MinHashLSH
import re
def tokenize_for_minhash(text: str) -> list[str]:
    """Overlapping 5-character shingles for fuzzy matching."""
    text = re.sub(r"\s+", " ", text.lower().strip())
    return [text[i : i + 5] for i in range(len(text) - 4)]
def build_minhash(text: str, num_perm: int = 128) -> MinHash:
m = MinHash(num_perm=num_perm)
for shingle in tokenize_for_minhash(text):
m.update(shingle.encode("utf-8"))
return m
def deduplicate(examples: list[dict], threshold: float = 0.85) -> list[dict]:
lsh = MinHashLSH(threshold=threshold, num_perm=128)
unique = []
for i, ex in enumerate(examples):
text = " ".join(m["content"] for m in ex["messages"])
m_hash = build_minhash(text)
key = str(i)
if not lsh.query(m_hash):
lsh.insert(key, m_hash)
unique.append(ex)
print(f"Kept {len(unique)}/{len(examples)} examples after dedup.")
return unique
raw = load_dataset("json", data_files="training_data.jsonl", split="train")
deduped = deduplicate(raw.to_list())
Dataset.from_list(deduped).to_json("training_data_deduped.jsonl")
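Alongside deduplication, it is worth rejecting malformed records before they reach the trainer. The helper below is an illustrative sketch of such a schema check, not part of any library:

```python
# Minimal schema check for chat-format JSONL training lines (illustrative sketch).
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = valid)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    errors = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg.get("content").strip():
            errors.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        errors.append("last message must be an assistant turn (the training target)")
    return errors
```

Run it over every line of the JSONL file and drop (or fix) any record that returns errors before starting a training run.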
LoRA and QLoRA: Fine-Tuning on a Single GPU
Full fine-tuning updates every parameter in the model — for a 7B model that is 28 GB of gradients and optimizer states in bf16, before counting activations. For most domain adaptation tasks, this is unnecessary. LoRA (Low-Rank Adaptation, Hu et al., 2021) freezes the base weights and trains small rank-decomposed adapter matrices instead — typically 0.5–2% of total parameters.
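The arithmetic behind that percentage is easy to check for a single projection layer — a sketch using Llama-class dimensions (a 4096×4096 attention projection at rank 16; the numbers are illustrative):

```python
# LoRA parameter arithmetic for one frozen linear layer W of shape (d_out, d_in).
# The adapter learns the update as B @ A, where B is (d_out, r) and A is (r, d_in)
# — only A and B are trained; W stays frozen.
d_out, d_in, r = 4096, 4096, 16   # q_proj-sized layer, rank 16

full_params = d_out * d_in         # updating W directly: 16,777,216
lora_params = r * (d_out + d_in)   # training A and B instead: 131,072

print(f"full: {full_params:,} | lora: {lora_params:,}")
print(f"trainable fraction for this layer: {lora_params / full_params:.2%}")
```

Summed over every targeted layer in an 8B model, this is how the trainable share lands under 1–2% of total parameters.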
QLoRA (Dettmers et al., 2023) extends this by loading the frozen base in 4-bit NormalFloat quantization (nf4), reducing VRAM requirements by ~75%. A Llama 3.1 8B QLoRA fine-tune fits comfortably in 16–20 GB VRAM — a single A10G or consumer RTX 4090.
Full fine-tune
All parameters updated. Best quality ceiling. Requires 4–8× A100 80GB for 7B models. Use when you need to fundamentally alter behavior across many task types.
LoRA (bf16 base)
Adapter matrices only. Under 1% of parameters trainable. Comparable quality to full fine-tune for instruction and style tasks. Fits in 24 GB VRAM for 7B models. Fast to train and swap.
QLoRA (nf4 base + bf16 adapters)
Base model in 4-bit, adapters in bf16. Highest VRAM efficiency. Slight quality loss vs LoRA on complex reasoning (<1 point on most evals). Default choice for single-GPU and budget setups.
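The weight-memory arithmetic behind these three options fits in a few lines — a rough back-of-envelope only (activations, optimizer state, and KV cache all add more on top):

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Illustrative only: real VRAM usage adds activations, optimizer state,
# and CUDA/runtime overhead on top of the raw weights.
PARAMS = 8e9

def weight_gb(params: float, bytes_per_param: float) -> float:
    """GB needed to hold the weights alone at a given precision."""
    return params * bytes_per_param / 1e9

bf16_base = weight_gb(PARAMS, 2.0)   # bf16: 2 bytes/param -> ~16 GB
nf4_base  = weight_gb(PARAMS, 0.5)   # nf4: ~0.5 bytes/param -> ~4 GB
print(f"bf16 base: {bf16_base:.0f} GB | nf4 base: {nf4_base:.0f} GB")
```

The ~12 GB gap between the bf16 and nf4 base is what lets a QLoRA run fit on a 24 GB card with room left for adapters, gradients, and activations.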
Training with Hugging Face PEFT and TRL
The PEFT library handles LoRA configuration and adapter injection into the model. TRL's SFTTrainer wraps the training loop with sequence packing, gradient checkpointing, and chat-template-aware formatting. Together they provide a production-grade fine-tuning pipeline in under 100 lines.
# train.py — QLoRA fine-tune with PEFT + TRL SFTTrainer
# pip install transformers peft trl accelerate bitsandbytes flash-attn
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from trl import SFTTrainer
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./fine-tuned-adapters"
DATASET_PATH = "training_data_deduped.jsonl"
# 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # nested quantization: saves ~0.4 GB extra
)
model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL,
quantization_config=bnb_config,
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# LoRA adapter config
lora_config = LoraConfig(
r=16, # rank: 8–64; higher = more capacity + VRAM
lora_alpha=32, # scaling: convention is alpha = 2 * r
target_modules=[ # which linear layers to adapt
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints the trainable/total parameter counts — roughly 42M trainable (~0.5%) at r=16 on these modules
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8, # effective batch size = 16
gradient_checkpointing=True, # trade compute for VRAM
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=100,
save_total_limit=3,
bf16=True,
optim="paged_adamw_8bit", # memory-efficient optimizer for QLoRA
report_to="wandb",
run_name="llama-3.1-8b-domain-v1",
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
max_seq_length=2048,
packing=True, # pack multiple short examples per sequence
)
trainer.train()
trainer.save_model(OUTPUT_DIR)

Note
Enable gradient_checkpointing=True to reduce activation memory by ~60% at the cost of ~20% slower training — almost always the right trade-off on a single GPU. Also: packing=True concatenates multiple short training examples into one 2048-token sequence, dramatically improving GPU utilization on small datasets where most examples are well under max length.

Merging Adapters for Deployment
LoRA adapters are stored separately from the base model. For production serving, merge them into a single checkpoint — this eliminates the adapter overhead at inference time and produces a self-contained model that any inference engine can load directly.
# merge_and_export.py — merge LoRA adapters into base weights
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
ADAPTER_PATH = "./fine-tuned-adapters"
MERGED_PATH = "./merged-model"
# Load on CPU to avoid VRAM OOM during the merge operation
model = AutoPeftModelForCausalLM.from_pretrained(
ADAPTER_PATH,
torch_dtype=torch.bfloat16,
device_map="cpu",
)
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_PATH, safe_serialization=True)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(MERGED_PATH)
param_count = sum(p.numel() for p in merged.parameters())
print(f"Merged model saved to {MERGED_PATH}")
print(f"Parameters: {param_count / 1e9:.1f}B | Size (bf16): ~{param_count * 2 / 1e9:.1f} GB")

Evaluation: Domain Benchmarks and LLM-as-Judge
Generic benchmarks (MMLU, HellaSwag) measure general capability but will not tell you whether your fine-tune performs better on your actual task. Build a domain-specific held-out evaluation set from the same distribution as your training data — 100–200 examples with ground-truth answers is sufficient for most classification and generation tasks.
For generation quality where ground truth is subjective, use LLM-as-judge: send the model's output alongside a reference answer and evaluation criteria to a frontier model that scores responses on a rubric. This correlates well with human evaluation at a fraction of the cost.
# eval.py — domain eval with LLM-as-judge using Anthropic Claude
# pip install anthropic datasets
import anthropic
import json
from datasets import load_dataset
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from environment
JUDGE_PROMPT = """You are evaluating an AI assistant's response to a technical question.
Question: {question}
Reference answer: {reference}
Model response: {response}
Score the response from 1 to 5:
1 = Completely wrong or missing critical information
3 = Partially correct — minor errors or key omissions
5 = Correct, complete, and well-formatted
Respond with JSON only: {{"score": <int>, "reason": "<one sentence>"}}"""
def judge(question: str, reference: str, response: str) -> dict:
msg = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=256,
messages=[{
"role": "user",
"content": JUDGE_PROMPT.format(
question=question,
reference=reference,
response=response,
),
}],
)
return json.loads(msg.content[0].text)
# eval_set.jsonl has fields: question, reference_answer, model_response
eval_set = load_dataset("json", data_files="eval_set.jsonl", split="train")
scores = []
for ex in eval_set:
result = judge(
question=ex["question"],
reference=ex["reference_answer"],
response=ex["model_response"],
)
scores.append(result["score"])
print(f" {result['score']}/5 — {result['reason']}")
avg = sum(scores) / len(scores)
print(f"\nAverage score: {avg:.2f}/5 across {len(scores)} examples")
# Gate promotion if avg < 4.0

For systematic benchmarking across many standardized tasks, use LM Evaluation Harness (EleutherAI). It supports custom task definitions in YAML, runs reproducible zero-shot or few-shot evaluations with perplexity and accuracy metrics, and plugs directly into CI pipelines.
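As a sketch of what a custom task definition looks like — the field names follow lm-evaluation-harness YAML conventions, but verify them against the docs for your installed version; the task name, dataset path, and metric choice here are illustrative assumptions:

```yaml
# postgres_qa.yaml — hypothetical custom task for lm-evaluation-harness
task: postgres_qa
dataset_path: json
dataset_kwargs:
  data_files: eval_set.jsonl
test_split: train
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{reference_answer}}"
generation_kwargs:
  max_gen_toks: 512
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```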
Deployment: Quantization and Production Serving
After merging adapters, quantize the model to reduce serving VRAM and increase throughput. GGUF format with llama.cpp is the standard for local and edge deployments. For production API serving with continuous batching and streaming, vLLM is the industry default.
# Quantize with llama.cpp (local / edge deployment)
# git clone https://github.com/ggerganov/llama.cpp && cmake -B build && cmake --build build
# Step 1: Convert HuggingFace model to GGUF (F16)
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile ./model-f16.gguf --outtype f16
# Step 2: Quantize (Q4_K_M — best quality/size tradeoff)
./build/bin/llama-quantize ./model-f16.gguf ./model-q4_k_m.gguf Q4_K_M
# Llama 3.1 8B sizes after quantization:
# F16: ~16.1 GB — reference quality
# Q8_0: ~8.5 GB — near-lossless
# Q4_K_M: ~4.7 GB — best practical tradeoff
# Q3_K_M: ~3.5 GB — noticeable quality loss on complex tasks
# Step 3: Serve locally with OpenAI-compatible API
./build/bin/llama-server --model ./model-q4_k_m.gguf --port 8080 --ctx-size 4096 --n-gpu-layers 33  # offload all layers to GPU

# vLLM — production API serving with continuous batching
# pip install vllm
# Launch (merged bf16 model, OpenAI-compatible endpoint)
vllm serve ./merged-model --port 8000 --dtype bfloat16 --max-model-len 4096 --gpu-memory-utilization 0.90 --enable-chunked-prefill --served-model-name domain-assistant
# Test with curl
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "domain-assistant",
"messages": [
{"role": "system", "content": "You are a PostgreSQL expert."},
{"role": "user", "content": "How do I check for index bloat?"}
],
"max_tokens": 512,
"temperature": 0.1
}'
# For batched serving with 4-bit (AWQ — better throughput than GGUF)
# First quantize: pip install autoawq
# python -c "from awq import AutoAWQForCausalLM; ..."
vllm serve ./merged-model-awq --quantization awq_marlin --dtype float16 --max-model-len 4096 --max-num-seqs 64

Note
Keep --gpu-memory-utilization at 0.90 in vLLM to leave headroom for KV cache growth under concurrent load. Use --max-num-seqs to cap concurrent sequences and prevent OOM during traffic spikes. vLLM's PagedAttention scheduler handles batching automatically — you do not need batching logic in your application layer.

Production Checklist
Version-control your training data
Every training run should be reproducible. Store your dataset in a versioned artifact — DVC, Hugging Face Hub private dataset, or S3 with object versioning. Log the exact dataset hash alongside each model checkpoint so you can trace any production regression to a specific data change.
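Computing that hash takes a few lines of stdlib Python — a minimal sketch (the helper name and the W&B logging line are illustrative):

```python
# Fingerprint the exact training file so each checkpoint can be traced to its data.
import hashlib

def file_sha256(path: str) -> str:
    """Stream the file in 1 MB chunks so large JSONL datasets never load into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Log alongside the checkpoint, e.g. in the W&B run config:
# wandb.config.update({"dataset_sha256": file_sha256("training_data_deduped.jsonl")})
```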
Monitor for catastrophic forgetting
Fine-tuning on a narrow domain can degrade performance on tasks the base model handled well. Run your full eval suite against both the base model and the fine-tune before promoting to production. If general capability drops more than 5 points, increase regularization: lower the learning rate, reduce epochs, or mix in general-purpose examples from the base distribution.
Use a conservative learning rate
For instruction fine-tuning on a pretrained base, 1e-4 to 2e-4 is the standard LoRA range. Going higher risks overwriting general capabilities. Use cosine LR decay with a 3% warmup and watch your eval loss curve — if it starts rising while train loss keeps falling, you are overfitting on the training set.
Gate promotions with automated evals in CI
Every training run should automatically trigger your domain eval set and block promotion if the average score falls below threshold. Wire this into your CI/CD pipeline: new data commit triggers training, training triggers eval, eval score gates staging deployment. Use lm-evaluation-harness for standardized reproducible metrics.
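The gate itself can be trivially small — a sketch in which the 4.0 threshold echoes the eval script earlier; everything else is illustrative:

```python
# Illustrative CI gate — blocks promotion when the eval average is below threshold.
def gate(avg_score: float, threshold: float = 4.0) -> int:
    """Return a process exit code: 0 = promote, 1 = block."""
    if avg_score < threshold:
        print(f"BLOCK: avg {avg_score:.2f} < threshold {threshold:.2f}")
        return 1
    print(f"PASS: avg {avg_score:.2f} >= threshold {threshold:.2f}")
    return 0

# In CI, call sys.exit(gate(avg)) after the eval run so a low score fails the job.
```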
Log all training metadata
Log LoRA config, base model ID, dataset hash, all hyperparameters, and eval metrics for every run in W&B or MLflow. You will want to compare runs when a production regression appears six months after the last training. Without metadata logging, diagnosing model regressions becomes guesswork.
Building a specialized LLM for your domain and need help with the full pipeline?
We design and implement end-to-end fine-tuning pipelines — from dataset curation and LoRA/QLoRA training to quantization, evaluation, and production serving with vLLM. Let’s talk.
Get in Touch