
August 18, 2024

Fine-Tuning SmolLM on a Single GPU

Key Takeaways

  • SmolLM (135M – 360M – 1.7B): These models exhibit surprisingly competent behavior and can live on a laptop or entry-level gaming rig.
  • You can bend the 135M & 360M variants to your will in a few hours on commodity hardware like an RTX 4070. No cloud tithe required.
  • Supervised Fine-Tuning (SFT) is the lever to coax instruction-following behavior from a raw base model.
  • Small models are unforgiving. They demand carefully formatted prompts, high-signal synthetic data, and (counter-intuitively) gradient checkpointing to survive.
  • Evaluate with held-out tasks. Blind faith in a single benchmark like Winogrande is a recipe for delusion.

Why Bother With Tiny Models?

The large-scale, open-weight apex predators like Llama-2 70B and Mistral-7B are undeniably potent, provided you belong to the H100 priesthood or are willing to pay the cloud bill. In the trenches of real-world projects, my needs are often more prosaic. I need an assistant that:

  • Boots inside 8–10 GB of VRAM.
  • Answers domain-specific questions locally, without phoning home.
  • Can be retrained overnight on a desk-side machine I actually own.

Hugging Face’s SmolLM series meets that bar. The 135M and 360M checkpoints are lean yet capable, a consequence of a deep-and-narrow MobileLLM-style architecture. This isn’t about chasing leaderboards; it’s about digital sovereignty and tangible utility.

| Model       | Params | Layers × Dim | Pre-train tokens | Context | Epochs (effective) |
|-------------|--------|--------------|------------------|---------|--------------------|
| SmolLM-135M | 135M   | 30 × 576     | 600B             | 2048    | ≈ 2.4              |
| SmolLM-360M | 360M   | 32 × 960     | 600B             | 2048    | ≈ 2.4              |
| SmolLM-1.7B | 1.7B   | 24 × 2048    | 1T               | 2048    | ≈ 4                |

Training corpus: Cosmopedia v2 (28B), FineWeb-Edu (220B), Python-Edu (4B).
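
The deep-and-narrow claim is easy to verify without downloading any weights; the config file alone tells the story. A quick sketch for the 135M checkpoint:

from transformers import AutoConfig

# Fetch only the model config (a few KB), not the weights.
cfg = AutoConfig.from_pretrained("HuggingFaceTB/SmolLM-135M")

print(f"Layers:         {cfg.num_hidden_layers}")        # 30
print(f"Hidden size:    {cfg.hidden_size}")              # 576
print(f"Context window: {cfg.max_position_embeddings}")  # 2048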


Hardware & Tooling

To replicate this, you’ll need:

  • GPU: At least 12 GB of VRAM for the settings below. A desktop RTX 3060 (12 GB) is sufficient; an RTX 4070 gives more headroom.
  • The usual suspects: PyTorch ≥ 2.2, Transformers ≥ 4.40, TRL, datasets, accelerate. This is the standard arsenal for modern NLP.
  • Optional but recommended: FlashAttention-2 if your GPU has a Compute Capability ≥ 8.0 (Ampere or newer). It materially accelerates the attention mechanism.

Supervised Fine-Tuning (SFT)

This is where we impose discipline on the base model, teaching it to follow instructions. I’m using Dolly-15k, a high-quality instruction set from Databricks that provides a diverse range of tasks. The objective is to shift the model’s behavior from merely predicting the next token to completing a user’s request.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
import torch
import re
import platform

# --- Determine hardware capabilities first ---
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
attn_impl = "eager"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        # For GPUs with Compute Capability >= 8.0 (Ampere, Ada, Hopper)
        # Install flash-attn for a significant speedup.
        # !pip install -q flash-attn
        attn_impl = "flash_attention_2"
        print("Using FlashAttention-2.")
    else:
        print("GPU compute capability less than 8.0. Using eager attention.")
else:
    print("No CUDA GPU found. Using eager attention on CPU.")

print(f"Processor: {platform.processor()} -> BF16 available: {use_bf16}, Attention impl: {attn_impl}")


# --- Model and Tokenizer Setup ---
model_id = "HuggingFaceTB/SmolLM-135M"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token # Base models often lack a pad token
tok.padding_side = "left" # Decoder-only models need left padding for batched generation; with packing enabled it doesn't affect training

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation=attn_impl,
    device_map="auto", # Maps model to GPU/CPU automatically
    torch_dtype=torch.bfloat16 if use_bf16 else torch.float16
)
# A non-negotiable for memory savings. Trades compute for VRAM.
model.config.use_cache = False  # The KV cache is incompatible with gradient checkpointing
model.gradient_checkpointing_enable()

# --- Dataset Preparation ---
ds_raw = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_dolly(example):
    # Imposing a standard conversational format on the raw data.
    instruction = example["instruction"]
    context = example["context"]
    response = example["response"]
    
    if context and isinstance(context, str) and context.strip():
        # Clean up the noise - Wikipedia citation numbers aren't signal.
        clean_context = re.sub(r'\[\d+\]', '', context).strip()
        formatted = f"User: {instruction}\nContext: {clean_context}\nAssistant: {response}"
    else:
        formatted = f"User: {instruction}\nAssistant: {response}"
    
    return {"text": formatted}

# Process and split the dataset
ds_processed = ds_raw.map(format_dolly, remove_columns=list(ds_raw.features))
train_val_split = ds_processed.train_test_split(test_size=0.1, seed=42)
ds_train = train_val_split["train"]
ds_val = train_val_split["test"]
print(f"Training on {len(ds_train)} examples, validating on {len(ds_val)} examples.")

# --- Training Configuration ---
config = SFTConfig(
    output_dir="sft_smollm_135m",
    per_device_train_batch_size=4,   # Keep this low for ~12GB VRAM
    gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
    num_train_epochs=3,              # Dolly is small enough for multiple epochs
    learning_rate=2e-5,
    fp16=torch.cuda.is_available() and not use_bf16,  # FP16 only on CUDA GPUs that lack BF16 support
    bf16=use_bf16,                   # Use BF16 on Ampere+ for speed and stability
    logging_steps=50,
    save_steps=500,
    eval_steps=50,
    evaluation_strategy="steps",     # Required if using eval_steps
    max_seq_length=2048,             # Match the model's context window
    packing=True,                    # Pack multiple short examples into one sequence for efficiency
    dataset_text_field="text",       # Explicitly point to our formatted text column
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    args=config,
)

# To run the training, uncomment the line below.
# trainer.train()
print("SFT Trainer configured. The machine is ready to impose some discipline.")

Tip: For Dolly-15k, three full epochs is a reasonable starting point. With ~15k examples and an effective batch size of 16, one epoch is roughly 940 steps. I always recommend a dry run with a small number of steps (e.g., max_steps=100) to validate the entire pipeline before committing to a multi-hour training job.
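
Once a run completes, I do a quick qualitative pass on a held-out style prompt before trusting any metric. A minimal sketch, assuming you are still in the same session as the training script above (the prompt is illustrative, not drawn from Dolly):

# Sanity-check the fine-tuned model with the same User/Assistant format it was trained on.
prompt = "User: Give me three tips for writing a good bug report.\nAssistant:"

model.eval()
model.config.use_cache = True  # Re-enable the KV cache for fast generation

inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False,               # Greedy decoding keeps the check reproducible
        pad_token_id=tok.eos_token_id,
    )

# Print only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))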


Practical Guidelines

My time in the trenches with these smaller models has yielded a few hard-won heuristics:

  1. Curate Ruthlessly: Tiny models have no capacity for garbage. They are exquisitely sensitive to data quality. They benefit immensely from high-signal, narrowly-scoped data like Dolly, which is composed of human-curated examples across useful categories (creative writing, classification, information extraction). Noise will kill your performance.
  2. Batch Size Matters: Even for a 135M model, a sufficiently large global batch size (ideally 64+ sequences per optimization step; the config above settles for 16 as a VRAM-friendly floor) keeps training stable. Gradient accumulation is your primary tool here, simulating a larger batch without demanding more VRAM.
  3. Mind the Template: As of mid-2024, many base models on Hugging Face lack a defined chat_template. You will likely need to inject one manually before using higher-level abstractions like apply_chat_template; a concrete sketch follows this list. Failure to do so results in models that don’t understand conversational turns.
  4. Learning Rate Schedule: For SFT, a learning rate between 1e-5 and 5e-5 with a linear decay schedule is a robust starting point. It helps stabilize training as the model converges.
  5. Quantize for Deployment: Once you are satisfied with the fine-tuned model’s performance, quantizing it is the final step to weaponize it for inference. Converting to a format like INT4-GGUF with llama.cpp can shrink the 360M model to fit comfortably within 3 GB of RAM.
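
To make point 3 concrete: base checkpoints typically ship without a chat_template, so apply_chat_template has nothing to work with until you provide one. Below is a minimal sketch using a hypothetical Jinja template that mirrors the User/Assistant format from the SFT section; it is not an official SmolLM template:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-135M")

# Hypothetical template matching the "User: ... / Assistant: ..." format used above.
tok.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)

messages = [{"role": "user", "content": "Classify this review as positive or negative: 'Great battery life.'"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# User: Classify this review as positive or negative: 'Great battery life.'
# Assistant: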

Limitations & Next Steps

Let’s be clear about what these models are not. Don’t bother with Direct Preference Optimization (DPO) on models this small; they lack the parameter capacity to effectively learn from preference pairs without suffering catastrophic forgetting. Dolly provides a solid SFT foundation, but the limitations are real:

  • Context Window: The SmolLM series is locked at a 2048-token context. For long-document Q&A, you are still bound to a Retrieval Augmented Generation (RAG) architecture.
  • Multilingualism: The pre-training data is overwhelmingly English. Performance in other languages will require dedicated fine-tuning on translated or newly generated instruction sets.
  • Complex Reasoning: Performance on multi-step logical reasoning (like GSM8K math problems) will hit a hard ceiling. For these tasks, chasing scale or exploring radically different architectures like Mixture-of-Experts (MoE) ensembles is likely a more productive path.

Conclusion

The SmolLM series is an existence proof that “good enough” language models are now accessible to builders outside of large, well-capitalized labs. With a weekend of tinkering on a consumer GPU, it’s possible to craft a personalized assistant that respects both data privacy and budgetary reality. The Supervised Fine-Tuning workflow is a repeatable recipe for imprinting your intent onto a base model, a process that is not only effective but a potent reminder that value can be created without asking for anyone’s permission.

Posted in AI / ML, LLM Intermediate