Supervised Fine-Tuning (SFT)

Jan 2 · 7min

#LLM #Fine-tuning

1. What is Instruction Tuning

The process of adapting pre-trained language models to follow human instructions and engage in conversations.

  • This transformation from a text completion model to an instruction-following assistant is achieved through supervised fine-tuning on carefully curated datasets.
  • The journey from base to instruct model involves:
    1. Chat template
    2. Supervised fine-tuning

2. Chat Templates

Chat templates provide a consistent format for structuring interactions between language models, users, and external tools. Think of them as the “grammar” that teaches models how to understand conversations, distinguish between different speakers, and respond appropriately.

Base Models vs Instruct Models

  • Base Model: Trained on raw text to predict the next token.
  • Instruct Model: Fine-tuned to follow instructions and engage in conversations.

Key components

  1. Special tokens: `<|im_start|>` and `<|im_end|>`
  2. Roles: `system`, `user`, `assistant` (and `tool` for function calling)
  3. Content: The actual message text between the role declaration and `<|im_end|>`
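Putting the three components together, a single exchange with a system prompt looks like this (an illustrative rendering, not output copied from the tokenizer):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 15 × 24?<|im_end|>
<|im_start|>assistant
15 × 24 = 360<|im_end|>
```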

Dual-mode Reasoning support

  • Standard Mode (no_think):
```
<|im_start|>user
What is 15 × 24?<|im_end|>
<|im_start|>assistant
15 × 24 = 360<|im_end|>
```
  • Thinking Mode (think):
```
<|im_start|>user
What is 15 × 24?<|im_end|>
<|im_start|>assistant
<|thinking|>
I need to multiply 15 by 24. Let me break this down:
15 × 24 = 15 × (20 + 4) = (15 × 20) + (15 × 4) = 300 + 60 = 360
</|thinking|>

15 × 24 = 360<|im_end|>
```

Generation Prompts

  • For inference: Use `add_generation_prompt=True` when you want the model to generate a response.
  • For training: Use `add_generation_prompt=False` when preparing training data with complete conversations.
  • For evaluation: Use `add_generation_prompt=True` to test model responses.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "user", "content": "What is 15 × 24?"},
    {"role": "assistant", "content": "15 × 24 = 360"},
]

# Complete conversation for training data: no trailing assistant header.
formatted_without = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
print(formatted_without)
```
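For inference, the same call with `add_generation_prompt=True` appends an empty assistant header so the model knows it is its turn to respond. A minimal sketch continuing the snippet above:

```python
# Inference-time formatting: appends the assistant header (e.g. <|im_start|>assistant)
# so the model generates a reply instead of continuing the user turn.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 15 × 24?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```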

3. Supervised Fine-Tuning (SFT)

The key insight behind SFT is that we’re not teaching the model new knowledge from scratch. Instead, we’re reshaping how existing knowledge is applied.

Why SFT Works

  • Behavioral Adaptation: The model learns to recognize instruction patterns and respond appropriately. This involves updating the attention mechanisms to focus on instruction cues and adjusting the output distribution to favor the desired responses. Research has shown that instruction tuning primarily affects the model’s surface-level behavior rather than its underlying knowledge.
  • Task Specialization: Rather than learning entirely new concepts, the model learns to apply its existing knowledge in specific contexts. This is why SFT is much more efficient than pre-training: we’re refining existing capabilities rather than building them from scratch. Studies indicate that most factual knowledge comes from pre-training, while SFT teaches the model how to format and present this knowledge appropriately.
  • Safety Alignment: Through exposure to carefully curated examples, the model learns to be more helpful, harmless, and honest. This involves learning both what to say and what not to say in various situations. The effectiveness of this approach was demonstrated in works like InstructGPT.

The mathematical foundation involves minimizing the cross-entropy loss between the model’s predictions and the target responses in your training dataset.
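Concretely (notation mine): for a prompt $x$ and target response $y = (y_1, \dots, y_T)$, this is the negative log-likelihood of each response token given everything before it. Prompt tokens are typically masked out of the loss so the model is only graded on the response:

$$
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
$$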

The SFT Process

  1. Dataset Preparation and Selection
    • Minimum: 1,000 high-quality examples for basic fine-tuning.
    • Quality over quantity: 1,000 well-curated examples often outperform 10,000 mediocre ones.
  2. Environment Setup and Configuration
  3. Training Configuration (a configuration sketch follows this list)
    • Learning Rate (5e-5 to 1e-4): Controls how much the model weights change with each update
      • Start with 5e-5 for SmolLM3; this is conservative and stable.
      • Too high: The model becomes unstable; loss oscillates or explodes.
      • Too low: The model learns very slowly and may not converge in reasonable time.
    • Batch Size (4-16): Number of examples processed simultaneously
      • Larger batches: More stable gradients, but require more GPU memory.
      • Smaller batches: Less memory usage, but noisier gradients.
      • Use gradient accumulation to achieve larger effective batch sizes.
    • Max Sequence Length (2048-4096): Maximum tokens per training example
      • Longer sequences: Can handle more complex conversations.
      • Shorter sequences: Faster training, less memory usage.
      • Match your use case: Use the typical length of your target conversations.
    • Training Steps (1000-5000): Total number of parameter updates
      • Depends on dataset size: More data usually requires more steps.
      • Monitor validation loss: Stop when it stops improving.
      • Rule of thumb: Three to five epochs through your dataset.
    • Warmup Steps (10% of total): Gradual learning rate increase at start
      • Prevents early instability: Helps the model adapt gradually.
      • Typical range: 100-500 steps for most SFT tasks.
  4. Monitoring and Evaluation (a spot-check sketch follows this list)
    • Training Loss: Should decrease steadily but not too rapidly
      • Healthy pattern: Smooth, gradual decrease.
      • Warning signs: Sudden spikes, oscillations, or plateaus.
      • Typical range: Starts around 2-4, should decrease to 0.5-1.5.
    • Validation Loss: Most important metric for preventing overfitting
      • Should track training loss: A small gap indicates good generalization.
      • Growing gap: Sign of overfitting; the model may be memorizing training data.
      • Use for early stopping: Stop training when validation loss stops improving.
    • Sample Outputs: Regular qualitative checks are essential
      • Generate responses: Test the model on held-out prompts during training.
      • Check format consistency: Ensure the model follows desired response patterns.
      • Monitor for degradation: Watch for repetitive or nonsensical outputs.
    • Resource Usage: Track GPU memory and training speed
      • Memory spikes: May indicate batch size is too large.
      • Slow training: Could suggest inefficient data loading or processing.
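Pulling the configuration guidance above into code: a minimal sketch using trl’s SFTTrainer. The dataset choice and output path are illustrative, and some parameter names have shifted across trl/transformers versions, so check the docs for your installed release:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# A chat-formatted public dataset; swap in your own curated data.
dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
dataset = dataset.train_test_split(test_size=0.05)  # hold out data for validation loss

config = SFTConfig(
    output_dir="smollm3-sft",          # illustrative path
    learning_rate=5e-5,                # conservative starting point for SmolLM3
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size of 16
    max_steps=1000,
    warmup_steps=100,                  # ~10% of total steps
    eval_strategy="steps",             # "evaluation_strategy" in older versions
    eval_steps=100,
    logging_steps=10,
    # Sequence-length capping is also set here (max_seq_length or max_length,
    # depending on the trl version).
)

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM3-3B",  # trl loads the model from this id
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```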
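And for the qualitative spot checks in step 4, one quick way to sample from the fine-tuned checkpoint (this assumes the model and tokenizer were saved to the output_dir above; the prompt is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("smollm3-sft")
tokenizer = AutoTokenizer.from_pretrained("smollm3-sft")

# Held-out prompt for a format/quality check during or after training.
messages = [{"role": "user", "content": "Explain overfitting in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```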