Domain Summarization and Generation (Decoder LoRA + Continued Pre-training)¶
What you'll learn
- Continued pre-training (CPT) — one more pass over domain raw text before instruction tuning
- Decoder LoRA SFT — domain instruction pairs on Qwen 2.5-0.5B-Instruct
- Evaluation — domain probes from Part 5 + LLM judge
- The bridge to the capstone — adapter + GGUF + HF Hub
Prerequisites
1. Concept — Two-Stage Domain Adaptation¶
The standard path for adapting an instruction model to a domain:
1. Continued pre-training (CPT) ← optional; for domain vocabulary and style
raw text, 1B+ tokens
↓
2. Domain SFT (LoRA) ← required; for domain task format
instruction pairs, 1K–100K
↓
3. Evaluate + save adapter
The book's capstone uses Stage 2 only (not enough raw text for CPT) — it relies on Qwen 2.5-0.5B-Instruct's existing Korean capability and applies LoRA on instruction pairs.
2. When CPT Is Necessary¶
| Situation | CPT needed | Reason |
|---|---|---|
| Base has no domain vocabulary (medical, legal) | ◎ | Vocabulary expansion |
| Base handles general Korean but weak on domain | △ | Only if you have 1B+ domain raw text |
| Base handles domain broadly, just format alignment needed | × | LoRA only |
| Book's capstone (Korean fairytales) | × | Qwen 2.5 Korean is sufficient |
To run CPT, you need a minimum of 100M domain tokens of raw text. CPT on smaller data has minimal effect and risks degrading the base model's general capability.
How to run CPT (briefly)¶
The key for CPT: include gate/up/down_proj (FFN) in target_modules. Domain vocabulary learning happens in the FFN layers.
3. Domain SFT (LoRA) — The Book's Capstone Path¶
Training time: T4, 1,000 pairs × 3 epochs ≈ 30 minutes.
4. Book's Capstone — Korean Fairytale Pairs¶
10K fairytales → 10K pairs. 30 minutes to train.
Varying instruction templates¶
TEMPLATES = [
"Write one Korean fairytale for children ages 3–5.",
"Write a short fairytale featuring {character}.",
"Write a 200-character fairytale about {keyword}.",
"Write a warm bedtime story for a child.",
]
More instruction diversity → the LoRA learns instruction format better.
5. Evaluation — Applying Part 5¶
Expected results (after training on 10K pairs):
| Metric | Base (Qwen 0.5B) | LoRA |
|---|---|---|
| Korean PPL (val) | 18.5 | 9.2 |
| Story probe pass@5 | 12/30 | 24/30 |
| 5-axis average (LLM judge) | 2.8 | 4.1 |
| Story tone naturalness | △ | ○ |
LoRA captures both the base model's capability and the domain's tone.
6. Merge + GGUF Conversion¶
The bridge to the capstone — path to Ch 20 GGUF.
# GGUF conversion from Ch 20
python llama.cpp/convert_hf_to_gguf.py merged_model \
--outfile dist/tiny-tale-ko.gguf --outtype f16
./llama.cpp/llama-quantize \
dist/tiny-tale-ko.gguf dist/tiny-tale-ko-q4km.gguf Q4_K_M
5 MB GGUF → instant laptop inference → capstone demo.
7. Common Failure Points¶
1. Skipping CPT when the base has weak domain vocabulary — For specialized domains like medical or legal, LoRA alone won't teach vocabulary the base never saw.
2. Not enough instruction template diversity — 10K pairs all using the same single prompt = the LoRA learns only that one prompt. 5–20 templates is the target.
3. Mixing up base vs instruct — CPT goes on base models. SFT goes on instruct models (or base + your own chat template).
4. Forgetting to do GGUF conversion after training — An adapter alone can't be used by llama.cpp. Do merge_and_unload once, then convert to GGUF.
5. Eval set distribution overlaps with training set — Using the same characters and keywords in both = self-evaluation. Use a separate random seed for the eval set.
6. Base model capability regression (catastrophic forgetting) — An overly aggressive LoRA degrades general Korean ability. Keep r moderate + epochs moderate.
8. Ops Checklist¶
Domain LoRA gate:
- Decide whether CPT is needed
- Choose base model (Ch 22)
- Instruction pair diversity (5+ templates)
- LoRA r / alpha / target settings (Ch 24)
- Compare PPL before and after training
- Measure domain probe pass@5
- (Optional) Blind LLM judge comparison
- Regression check on base capability (5 general Korean prompts)
- merge_and_unload + GGUF conversion
- HF Hub upload (capstone §4)
9. Exercises¶
- Run LoRA on Qwen 2.5-0.5B-Instruct with 1,000 pairs from your own domain. How does PPL change?
- Regression check — Compare base vs LoRA responses on 10 general Korean prompts. Which is more natural?
- Compare training results at r=8 / 16 / 32. Where's the sweet spot?
- CPT (100K raw fairytales) → SFT (pairs) vs SFT only. How much does CPT help?
- (Think about it) The book's 10M from-scratch model vs Qwen 0.5B + LoRA — both trained on the same fairytale domain. Which one is better? In what ways are they different?
References¶
- Hu et al. (2021). LoRA. arXiv:2106.09685
- Gururangan et al. (2020). Don't Stop Pretraining. arXiv:2004.10964 (CPT)
- HuggingFace
peftmerge_and_unloaddocs - Qwen 2.5 model card