From Scratch vs Fine-tuning¶

What you'll learn

From scratch vs fine-tuning — the decision tree
Memory math for fine-tuning on a laptop (Ch 11 applied)
Continued pre-training vs SFT vs LoRA — where each one fits
Why the capstone takes the LoRA path

Prerequisites

Ch 22 Choosing an sLLM, Ch 11 Parameters & Memory.

Four fine-tuning paths compared

1. Concept — Four Paths¶

Path	Starting point	Data needed	Cost
From-scratch	random init	100B+ tokens	very high
Continued pre-training	off-the-shelf base	1B–10B domain tokens	high
Full fine-tuning (SFT)	base/instruct	10K–1M pairs	medium
LoRA / QLoRA	base/instruct	100–10K pairs	low

Going down the list: (a) less data, (b) faster and cheaper, (c) less domain adaptation power.

Where this book sits:

Parts 1–6: from-scratch (10M)
Part 7 + capstone: LoRA (on Qwen 2.5-0.5B)

2. Decision Tree¶

1. Is the domain general enough that an off-the-shelf model already covers it?
   Yes → Use the Ch 22 decision tree → done (no LoRA needed either)
   No  → continue

2. How much domain data do you have?
   100B+ tokens        → from-scratch (requires large GPU cluster)
   1B–10B tokens       → continued pre-training
   10K–1M pairs        → full SFT or LoRA
   100–10K pairs       → LoRA / QLoRA

3. Does it need to fit on a laptop?
   Yes → LoRA / QLoRA (Ch 24)
   No  → full SFT is possible (single A100+)

4. Does the license need to stay separate? (share only the adapter)
   Yes → LoRA (adapter saved independently)
   No  → full SFT is fine too

3. How Much Fine-tuning Fits on a Laptop¶

From Ch 11, the memory formula is 14N — but for LoRA, N refers to the adapter size, not the full model.

Full SFT memory¶

Base model	params (bf16)	grads	Adam	activations (B=4, T=1024)	total
0.5B	1.0 GB	1.0 GB	4.0 GB	1.5 GB	7.5 GB
1.5B	3.0 GB	3.0 GB	12 GB	3 GB	21 GB
3B	6.0 GB	6.0 GB	24 GB	5 GB	41 GB
7B	14 GB	14 GB	56 GB	10 GB	94 GB

T4 (16 GB): only 0.5B fits for full SFT. A100 (80 GB): up to 7B.

LoRA memory¶

LoRA freezes the base weights and trains only a small adapter. At r=16, the adapter is under 1% of the base.

Base model	base (bf16, frozen)	LoRA params/grads/Adam	activations	total
0.5B	1.0 GB	0.05 GB	1.5 GB	2.5 GB
1.5B	3.0 GB	0.15 GB	3 GB	6 GB
3B	6.0 GB	0.3 GB	5 GB	11 GB
7B	14 GB	0.7 GB	10 GB	25 GB

T4 (16 GB): 3B LoRA fits. Laptop (24 GB+): 7B LoRA fits.

QLoRA memory¶

Quantize the base to int4 → memory cut to 1/4.

Base model	base (int4)	LoRA + activations	total
7B	3.5 GB	8 GB	11.5 GB
13B	6.5 GB	12 GB	18.5 GB
70B	35 GB	30 GB	65 GB

T4 (16 GB): 7B QLoRA fits. A100 (80 GB): 70B QLoRA fits.

4. The Book's Capstone Decision¶

Capstone (Korean fairytale generator):

1. Does a general Korean model suffice?       No → fairytale domain needs specialization
2. How much data?                             5K–50K synthetic fairytale pairs
3. Does it need to fit on a laptop?           Yes → LoRA / QLoRA
4. License separation needed?                Yes (only the adapter goes to HF Hub)

Answer: LoRA on Qwen 2.5-0.5B-Instruct.

Model size choice:

0.5B is fast for laptop inference (10+ tokens/sec on M2 CPU)
1.5B gives better capability but slower inference
Start with 0.5B → scale up to 1.5B if results aren't good enough

5. Common Failure Points¶

1. "Full SFT is always better than LoRA" — With under 10K examples, LoRA performs almost the same and is more stable. SFT overfits on small datasets.

2. Getting drawn to from-scratch — If you've worked through Parts 1–6, you know training a 1B+ model takes days. It won't fit your schedule.

3. Not enough data for continued pre-training — Below 1B tokens, the effect is negligible. You need that much domain raw text to justify CPT.

4. Running continued pre-training on an instruct model — CPT belongs on a base model. Instruct models have already been SFT'd — CPT breaks their format.

5. Assuming bigger r is better — r=64+ inflates memory and causes overfitting. r=8–16 is the standard.

6. Skipping memory math — A100 results don't transfer directly to a laptop. Do the math first.

6. Ops Checklist¶

After making the decision:

Choose one of the four paths (via the decision tree)
Decide on base model (Ch 22)
Data volume + format (raw text vs pairs)
Memory math (Full vs LoRA vs QLoRA)
Estimate training time (Ch 3, Ch 15 patterns)
Decide evaluation method (Part 5)
(Optional) Compare full SFT vs LoRA in an experiment

7. Exercises¶

Measure your domain dataset (pairs or token count). Which of the four paths fits?
Calculate LoRA memory for 0.5B / 1.5B / 3B on your GPU.
Train SmolLM2-360M on 100 pairs with LoRA vs full SFT (if feasible). How different are the results?
(Think about it) How does having built the 10M from-scratch model shape your intuition when deciding to LoRA a 1.5B? What does "building it yourself" give you that reading a tutorial wouldn't?

References¶

Hu et al. (2021). LoRA. arXiv:2106.09685
Dettmers et al. (2023). QLoRA. arXiv:2305.14314
Gururangan et al. (2020). Don't Stop Pretraining. arXiv:2004.10964 (continued pre-training)
HuggingFace peft library docs