Fine-Tuning · LLM · Machine Learning · Open Source · Hugging Face · LoRA

Fine-Tuning Open Models for Domain-Specific Tasks

A practical guide to fine-tuning open-source LLMs for production: choosing the right base model, curating training data, LoRA and QLoRA adapter training with Hugging Face PEFT, domain-specific evaluation, GGUF quantization, and production serving with vLLM.

2026-04-21

Why Fine-Tune Instead of Prompting?

General-purpose language models are excellent at a broad range of tasks, but they are optimized to be generalists. When your application needs consistent output formatting, domain-specific vocabulary, proprietary terminology, or a behavioral style that prompt engineering cannot reliably enforce, fine-tuning changes the model's weights rather than its instructions.

The distinction from RAG is important: Retrieval-Augmented Generation supplements a frozen model with retrieved context at inference time — useful for keeping the model current and reducing hallucinations on factual queries. Fine-tuning bakes knowledge, style, and behavior into the weights themselves — better for consistent format adherence, specialized reasoning patterns, and lower inference latency since no retrieval hop is required.

Signal | Choose
Need consistent output format (JSON, YAML, specific schema) | Fine-tune
Domain terminology the base model consistently gets wrong | Fine-tune
High-volume inference where latency and cost matter | Fine-tune (smaller model)
Private knowledge base updated weekly | RAG
Task solvable with a better system prompt and examples | Prompt / Few-shot
Safety-critical tone or guardrails prompts cannot enforce | Fine-tune

Choosing the Right Base Model

The base model determines your ceiling. A fine-tune can specialize a capable model but cannot teach reasoning capabilities that the base model lacks. For domain adaptation tasks, starting with a 7B–8B instruction-tuned checkpoint is almost always the right default — it fits in 24 GB VRAM with QLoRA, trains in hours rather than days, and serves at under 100ms latency on a single A10G.

Llama 3.1 / 3.2 (Meta)

Best general-purpose base for English-heavy tasks. Strong instruction following and function-calling. The 8B variant fits on 1× A10G with QLoRA. License allows commercial use with restrictions above 700M monthly active users.

Mistral 7B / Mixtral 8×7B

Sliding window attention makes Mistral efficient at long context. The Mixtral MoE architecture activates only 2 of 8 experts per token — high quality at lower inference cost. Apache 2.0 license — fully permissive for commercial use.

Gemma 2 (Google)

Excellent multilingual and code performance. The 9B model outperforms Llama 3.1 8B on many academic benchmarks. Gemma license permits commercial use; requires accepting Google's terms of service on Hugging Face.

Phi-3 / Phi-3.5 (Microsoft)

3.8B model trained on synthetic high-quality data. Exceptional reasoning per parameter — competitive with 7B models on many tasks. Fits on a single consumer GPU. MIT license. Ideal for edge deployment or tight compute budgets.

Note

Start with an instruction-tuned checkpoint (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct), not the raw pretrained base. Instruction tuning has already taught the model to follow directions — your fine-tune only needs to adapt style and domain, not rebuild instruction-following from scratch.

Data Preparation: Quality Over Quantity

A clean dataset of 500–2,000 high-quality instruction pairs routinely outperforms a noisy dataset of 50,000 examples. The goal is not to teach the model new knowledge — it already has broad domain knowledge from pretraining — but to teach it how to respond in your domain's voice, format, and scope.

The industry standard for instruction fine-tuning is the chat template format — a structured conversation of system, user, and assistant turns. Hugging Face's apply_chat_template method handles tokenization correctly per model family, including special tokens like <|begin_of_text|>.

# Training data format — chat template (JSONL)
# Each line is one independent training example.

{"messages": [
  {"role": "system", "content": "You are a PostgreSQL expert for the DataOps team. Answer concisely and always include the minimum PostgreSQL version required."},
  {"role": "user", "content": "How do I create a non-blocking index on a large table?"},
  {"role": "assistant", "content": "Use CREATE INDEX CONCURRENTLY (PostgreSQL 8.2+). This takes a ShareUpdateExclusiveLock instead of AccessExclusiveLock, allowing reads and writes during index construction.

```sql
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
```

CONCURRENTLY cannot run inside a transaction block — execute it outside BEGIN/COMMIT."}
]}

{"messages": [
  {"role": "system", "content": "You are a PostgreSQL expert for the DataOps team. Answer concisely and always include the minimum PostgreSQL version required."},
  {"role": "user", "content": "What causes index bloat and how do I fix it?"},
  {"role": "assistant", "content": "Index bloat occurs when UPDATE and DELETE operations leave dead tuples that autovacuum has not yet reclaimed.

Fix options:

1. REINDEX CONCURRENTLY (PG 12+) — rebuilds without a long lock:
```sql
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
```

2. pg_repack — rewrites the table and its indexes online with minimal locking.

Measure bloat first with the pgstattuple extension before deciding."}
]}
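
The examples above are pretty-printed for readability; in the actual file, each example must occupy a single physical line, with literal newlines in assistant answers escaped by the JSON encoder rather than written raw. A minimal round-trip sketch (the filename is illustrative):

```python
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a PostgreSQL expert for the DataOps team."},
        {"role": "user", "content": "How do I create a non-blocking index on a large table?"},
        {"role": "assistant", "content": "Use CREATE INDEX CONCURRENTLY (PostgreSQL 8.2+).\n\nIt cannot run inside a transaction block."},
    ]
}

# json.dumps escapes the embedded newlines, so the example stays on one line
line = json.dumps(example, ensure_ascii=False)
assert "\n" not in line

with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(line + "\n")

# Round-trip check: the reloaded example matches the original exactly
assert json.loads(line) == example
```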

Before training, deduplicate and filter your dataset. Near-duplicate examples waste model capacity and can cause overfitting on repeated patterns. The script below uses MinHash LSH for fuzzy deduplication — it catches paraphrased duplicates that exact-match deduplication misses.

from datasets import load_dataset, Dataset
from datasketch import MinHash, MinHashLSH
import re


def tokenize_for_minhash(text: str) -> list[str]:
    """Overlapping 5-character shingles for fuzzy matching."""
    text = re.sub(r"\s+", " ", text.lower().strip())
    return [text[i : i + 5] for i in range(len(text) - 4)]


def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in tokenize_for_minhash(text):
        m.update(shingle.encode("utf-8"))
    return m


def deduplicate(examples: list[dict], threshold: float = 0.85) -> list[dict]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []

    for i, ex in enumerate(examples):
        text = " ".join(m["content"] for m in ex["messages"])
        m_hash = build_minhash(text)
        key = str(i)

        if not lsh.query(m_hash):
            lsh.insert(key, m_hash)
            unique.append(ex)

    print(f"Kept {len(unique)}/{len(examples)} examples after dedup.")
    return unique


raw = load_dataset("json", data_files="training_data.jsonl", split="train")
deduped = deduplicate(raw.to_list())
Dataset.from_list(deduped).to_json("training_data_deduped.jsonl")

Note

Avoid including examples where the assistant answer is wrong or inconsistent with your domain ground truth. A single contradictory example has outsized negative impact on small datasets. If generating synthetic training data with a frontier model (GPT-4o, Claude Sonnet), have domain experts validate the samples before including them in training.

LoRA and QLoRA: Fine-Tuning on a Single GPU

Full fine-tuning updates every parameter in the model — for a 7B model that is 28 GB of gradients and optimizer states in bf16, before counting activations. For most domain adaptation tasks, this is unnecessary. LoRA (Low-Rank Adaptation, Hu et al., 2021) freezes the base weights and trains small rank-decomposed adapter matrices instead — typically 0.5–2% of total parameters.

QLoRA (Dettmers et al., 2023) extends this by loading the frozen base in 4-bit NormalFloat quantization (nf4), reducing VRAM requirements by ~75%. A Llama 3.1 8B QLoRA fine-tune fits comfortably in 16–20 GB VRAM — a single A10G or consumer RTX 4090.
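
The adapter size is easy to estimate by hand: a LoRA on a d_out × d_in linear layer trains B (d_out × r) plus A (r × d_in), i.e. r·(d_out + d_in) parameters instead of d_out·d_in. A back-of-envelope sketch using Llama-3.1-8B-style projection shapes (dimensions taken from the public model config; PEFT's printed counts for your actual checkpoint are the source of truth):

```python
# Per-layer linear shapes (d_out, d_in) for a Llama-3.1-8B-style block.
# k_proj/v_proj are smaller due to grouped-query attention (8 KV heads x 128 dims).
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}

def lora_params(r: int, shapes=LAYER_SHAPES, num_layers: int = 32) -> int:
    """Trainable adapter params: r * (d_out + d_in) per adapted linear layer."""
    per_block = sum(r * (d_out + d_in) for d_out, d_in in shapes.values())
    return per_block * num_layers

full = sum(d_out * d_in for d_out, d_in in LAYER_SHAPES.values()) * 32
adapter = lora_params(r=16)
print(f"LoRA r=16: {adapter / 1e6:.1f}M trainable vs {full / 1e9:.2f}B in the adapted layers")
```

Doubling the rank doubles the adapter, but even r=64 stays well under 2% of the model — which is why adapter checkpoints are megabytes, not gigabytes.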

A. Full fine-tune — all parameters updated. Best quality ceiling. Requires 4–8× A100 80GB for 7B models. Use when you need to fundamentally alter behavior across many task types.

B. LoRA (bf16 base) — adapter matrices only; under 1% of parameters trainable. Comparable quality to a full fine-tune for instruction and style tasks. Fits in 24 GB VRAM for 7B models. Fast to train and swap.

C. QLoRA (nf4 base + bf16 adapters) — base model in 4-bit, adapters in bf16. Highest VRAM efficiency. Slight quality loss vs LoRA on complex reasoning (<1 point on most evals). Default choice for single-GPU and budget setups.

Training with Hugging Face PEFT and TRL

The PEFT library handles LoRA configuration and adapter injection into the model. TRL's SFTTrainer wraps the training loop with sequence packing, gradient checkpointing, and chat-template-aware formatting. Together they provide a production-grade fine-tuning pipeline in under 100 lines.

# train.py — QLoRA fine-tune with PEFT + TRL SFTTrainer
# pip install transformers peft trl accelerate bitsandbytes flash-attn

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
OUTPUT_DIR = "./fine-tuned-adapters"
DATASET_PATH = "training_data_deduped.jsonl"

# 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,   # nested quantization: saves ~0.4 GB extra
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA adapter config
lora_config = LoraConfig(
    r=16,                        # rank: 8–64; higher = more capacity + VRAM
    lora_alpha=32,               # scaling: convention is alpha = 2 * r
    target_modules=[             # which linear layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable/total parameter counts — roughly 0.5% of 8B for r=16 on these modules

dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size = 16
    gradient_checkpointing=True,      # trade compute for VRAM
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    bf16=True,
    optim="paged_adamw_8bit",         # memory-efficient optimizer for QLoRA
    report_to="wandb",
    run_name="llama-3.1-8b-domain-v1",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_seq_length=2048,
    packing=True,          # pack multiple short examples per sequence
)

trainer.train()
trainer.save_model(OUTPUT_DIR)

Note

Enable gradient_checkpointing=True to reduce activation memory by ~60% at the cost of ~20% slower training — almost always the right trade-off on a single GPU. Also: packing=True concatenates multiple short training examples into one 2048-token sequence, dramatically improving GPU utilization on small datasets where most examples are well under max length.
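
The settings above pin down the token budget per optimizer step; a quick arithmetic sanity check (the packed-sequence count is a hypothetical example for a ~1,500-example dataset):

```python
per_device_batch = 2
grad_accum = 8
max_seq_len = 2048

effective_batch = per_device_batch * grad_accum   # sequences per optimizer step
tokens_per_step = effective_batch * max_seq_len   # upper bound with packing=True

# Suppose packing concatenates the dataset into ~400 full 2048-token sequences:
packed_sequences = 400
steps_per_epoch = packed_sequences // effective_batch

print(f"{effective_batch} seqs/step, up to {tokens_per_step:,} tokens/step")
print(f"~{steps_per_epoch} optimizer steps per epoch")
```

With only ~25 steps per epoch, eval_steps=50 fires less than once per epoch — scale eval and save intervals to your actual step count, not to defaults.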

Merging Adapters for Deployment

LoRA adapters are stored separately from the base model. For production serving, merge them into a single checkpoint — this eliminates the adapter overhead at inference time and produces a self-contained model that any inference engine can load directly.

# merge_and_export.py — merge LoRA adapters into base weights

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

ADAPTER_PATH = "./fine-tuned-adapters"
MERGED_PATH = "./merged-model"

# Load on CPU to avoid VRAM OOM during the merge operation
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_PATH,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
merged = model.merge_and_unload()
merged.save_pretrained(MERGED_PATH, safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)
tokenizer.save_pretrained(MERGED_PATH)

param_count = sum(p.numel() for p in merged.parameters())
print(f"Merged model saved to {MERGED_PATH}")
print(f"Parameters: {param_count / 1e9:.1f}B  |  Size (bf16): ~{param_count * 2 / 1e9:.1f} GB")

Evaluation: Domain Benchmarks and LLM-as-Judge

Generic benchmarks (MMLU, HellaSwag) measure general capability but will not tell you whether your fine-tune performs better on your actual task. Build a domain-specific held-out evaluation set from the same distribution as your training data — a set of 100–200 examples with ground-truth answers is sufficient for most classification and generation tasks.

For generation quality where ground truth is subjective, use LLM-as-judge: send the model's output alongside a reference answer and evaluation criteria to a frontier model that scores responses on a rubric. This correlates well with human evaluation at a fraction of the cost.

# eval.py — domain eval with LLM-as-judge using Anthropic Claude
# pip install anthropic datasets

import anthropic
import json
from datasets import load_dataset

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

JUDGE_PROMPT = """You are evaluating an AI assistant's response to a technical question.

Question: {question}
Reference answer: {reference}
Model response: {response}

Score the response from 1 to 5:
1 = Completely wrong or missing critical information
3 = Partially correct — minor errors or key omissions
5 = Correct, complete, and well-formatted

Respond with JSON only: {{"score": <int>, "reason": "<one sentence>"}}"""


def judge(question: str, reference: str, response: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                reference=reference,
                response=response,
            ),
        }],
    )
    return json.loads(msg.content[0].text)


# eval_set.jsonl has fields: question, reference_answer, model_response
eval_set = load_dataset("json", data_files="eval_set.jsonl", split="train")
scores = []

for ex in eval_set:
    result = judge(
        question=ex["question"],
        reference=ex["reference_answer"],
        response=ex["model_response"],
    )
    scores.append(result["score"])
    print(f"  {result['score']}/5 — {result['reason']}")

avg = sum(scores) / len(scores)
print(f"\nAverage score: {avg:.2f}/5 across {len(scores)} examples")
# Gate promotion if avg < 4.0

For systematic benchmarking across many standardized tasks, use LM Evaluation Harness (EleutherAI). It supports custom task definitions in YAML and runs reproducible zero-shot or few-shot evaluations with perplexity and accuracy metrics — plugging directly into CI pipelines.
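
As a sketch of what a harness task file can look like (the task name, data file, and field names are hypothetical — verify the exact schema against the documentation for your installed harness version):

```yaml
# tasks/postgres_domain_qa.yaml — custom generation task for lm-evaluation-harness (illustrative)
task: postgres_domain_qa
dataset_path: json
dataset_kwargs:
  data_files: eval_set.jsonl
test_split: train
output_type: generate_until
doc_to_text: "{{question}}"
doc_to_target: "{{reference_answer}}"
generation_kwargs:
  max_gen_toks: 512
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```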

Deployment: Quantization and Production Serving

After merging adapters, quantize the model to reduce serving VRAM and increase throughput. GGUF format with llama.cpp is the standard for local and edge deployments. For production API serving with continuous batching and streaming, vLLM is the industry default.

# Quantize with llama.cpp (local / edge deployment)
# git clone https://github.com/ggerganov/llama.cpp && cmake -B build && cmake --build build

# Step 1: Convert HuggingFace model to GGUF (F16)
python llama.cpp/convert_hf_to_gguf.py ./merged-model \
  --outfile ./model-f16.gguf \
  --outtype f16

# Step 2: Quantize (Q4_K_M — best quality/size tradeoff)
./build/bin/llama-quantize ./model-f16.gguf ./model-q4_k_m.gguf Q4_K_M

# Llama 3.1 8B sizes after quantization:
#   F16:    ~16.1 GB  — reference quality
#   Q8_0:   ~8.5 GB   — near-lossless
#   Q4_K_M: ~4.7 GB   — best practical tradeoff
#   Q3_K_M: ~3.5 GB   — noticeable quality loss on complex tasks

# Step 3: Serve locally with OpenAI-compatible API
./build/bin/llama-server \
  --model ./model-q4_k_m.gguf \
  --port 8080 \
  --ctx-size 4096 \
  --n-gpu-layers 33     # offload all layers to GPU

# vLLM — production API serving with continuous batching
# pip install vllm

# Launch (merged bf16 model, OpenAI-compatible endpoint)
vllm serve ./merged-model \
  --port 8000 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --served-model-name domain-assistant

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "domain-assistant",
    "messages": [
      {"role": "system", "content": "You are a PostgreSQL expert."},
      {"role": "user", "content": "How do I check for index bloat?"}
    ],
    "max_tokens": 512,
    "temperature": 0.1
  }'

# For batched serving with 4-bit (AWQ — better throughput than GGUF)
# First quantize: pip install autoawq
# python -c "from awq import AutoAWQForCausalLM; ..."
vllm serve ./merged-model-awq \
  --quantization awq_marlin \
  --dtype float16 \
  --max-model-len 4096 \
  --max-num-seqs 64

Note

Set --gpu-memory-utilization 0.90 in vLLM to leave headroom for KV cache growth under concurrent load. Use --max-num-seqs to cap concurrent sequences and prevent OOM during traffic spikes. vLLM's PagedAttention scheduler handles batching automatically — you do not need batching logic in your application layer.

Production Checklist

Version-control your training data

Every training run should be reproducible. Store your dataset in a versioned artifact — DVC, Hugging Face Hub private dataset, or S3 with object versioning. Log the exact dataset hash alongside each model checkpoint so you can trace any production regression to a specific data change.
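
A dataset fingerprint can be as simple as a SHA-256 over the raw JSONL bytes. A minimal sketch (the sample file and the idea of writing the hash next to the checkpoint are illustrative):

```python
import hashlib
from pathlib import Path

def dataset_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the dataset file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a tiny sample file; in practice, hash the real training JSONL
sample = Path("sample.jsonl")
sample.write_text('{"messages": []}\n', encoding="utf-8")
fingerprint = dataset_hash(str(sample))
print(f"dataset sha256: {fingerprint}")
# Store the hash next to the checkpoint, e.g. in a DATA_HASH.txt file
```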

Monitor for catastrophic forgetting

Fine-tuning on a narrow domain can degrade performance on tasks the base model handled well. Run your full eval suite against both the base model and the fine-tune before promoting to production. If general capability drops more than 5 points, increase regularization: lower learning rate, reduce epochs, or add regularization examples from the base distribution.

Use a conservative learning rate

For instruction fine-tuning on a pretrained base, 1e-4 to 2e-4 is the standard LoRA range. Going higher risks overwriting general capabilities. Use cosine LR decay with a 3% warmup and watch your eval loss curve — if it starts rising while train loss keeps falling, you are overfitting on the training set.
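
To build intuition for what the scheduler does, here is a pure-Python sketch of cosine decay with linear warmup (the schedule Trainer implements internally; the step counts below are hypothetical):

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 2e-4,
          warmup_ratio: float = 0.03) -> float:
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 300
print(f"step 0:   {lr_at(0, total):.2e}")    # warmup starts at zero
print(f"step 9:   {lr_at(9, total):.2e}")    # end of 3% warmup: peak LR
print(f"step 150: {lr_at(150, total):.2e}")  # mid-cosine, roughly half of peak
print(f"step 300: {lr_at(300, total):.2e}")  # fully decayed
```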

Gate promotions with automated evals in CI

Every training run should automatically trigger your domain eval set and block promotion if the average score falls below threshold. Wire this into your CI/CD pipeline: new data commit triggers training, training triggers eval, eval score gates staging deployment. Use lm-evaluation-harness for standardized reproducible metrics.

Log all training metadata

Log LoRA config, base model ID, dataset hash, all hyperparameters, and eval metrics for every run in W&B or MLflow. You will want to compare runs when a production regression appears six months after the last training. Without metadata logging, diagnosing model regressions becomes guesswork.

Building a specialized LLM for your domain and need help with the full pipeline?

We design and implement end-to-end fine-tuning pipelines — from dataset curation and LoRA/QLoRA training to quantization, evaluation, and production serving with vLLM. Let’s talk.
