Key Takeaways
- SmolLM (135M – 360M – 1.7B): These models exhibit surprisingly competent behavior and can live on a laptop or entry-level gaming rig.
- You can bend the 135M & 360M variants to your will in a few hours on commodity hardware like an RTX 4070. No cloud tithe required.
- Supervised Fine-Tuning (SFT) is the lever to coax instruction-following behavior from a raw base model.
- Small models are unforgiving. They demand carefully formatted prompts, high-signal training data, and (counter-intuitively for something this small) gradient checkpointing to survive on consumer VRAM.
- Evaluate with held-out tasks. Blind faith in a single benchmark like Winogrande is a recipe for delusion.
Why Bother With Tiny Models?
The large-scale, open-weight apex predators like Llama-2 70B and Mistral-7B are undeniably potent, provided you belong to the H100 priesthood or are willing to pay the cloud bill. In the trenches of real-world projects, my needs are often more prosaic. I need an assistant that:
- Boots inside 8–10 GB of VRAM.
- Answers domain-specific questions locally, without phoning home.
- Can be retrained overnight on a desk-side machine I actually own.
Hugging Face’s SmolLM series meets that bar. The 135M and 360M checkpoints are lean yet capable, a consequence of a deep-and-narrow MobileLLM-style architecture. This isn’t about chasing leaderboards; it’s about digital sovereignty and tangible utility.
Model | Params | Layers × Dim | Pre-train tokens | Context | Epochs (effective) |
---|---|---|---|---|---|
SmolLM-135M | 135M | 30 × 576 | 600B | 2048 | ≈ 2.4 |
SmolLM-360M | 360M | 32 × 960 | 600B | 2048 | ≈ 2.4 |
SmolLM-1.7B | 1.7B | 24 × 2048 | 1T | 2048 | ≈ 4 |
Training corpus: Cosmopedia v2 (28B), FineWeb-Edu (220B), Python-Edu (4B).
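These shapes are easy to verify yourself rather than take on faith. A minimal sketch, assuming only that the checkpoints expose the standard Llama-style config fields:

```python
# Inspect a SmolLM checkpoint's architecture straight from its config.
from transformers import AutoConfig

for repo in ["HuggingFaceTB/SmolLM-135M", "HuggingFaceTB/SmolLM-360M", "HuggingFaceTB/SmolLM-1.7B"]:
    cfg = AutoConfig.from_pretrained(repo)
    print(
        f"{repo}: layers={cfg.num_hidden_layers}, "
        f"hidden={cfg.hidden_size}, context={cfg.max_position_embeddings}"
    )
```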
Hardware & Tooling
To replicate this, you’ll need:
- GPU: At least 12 GB of VRAM. A desktop RTX 3060 (12 GB) is sufficient; an RTX 4070 would be great.
- The usual suspects: PyTorch ≥ 2.2, Transformers ≥ 4.40, TRL, `datasets`, and `accelerate`. This is the standard arsenal for modern NLP.
- Optional but recommended: FlashAttention-2 if your GPU has a Compute Capability ≥ 8.0 (Ampere or newer). It materially accelerates the attention mechanism.
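Before writing any training code, I run a quick environment check. A minimal sketch, assuming the packages above are already installed:

```python
# Report library versions and whether this GPU qualifies for FlashAttention-2.
import torch
import transformers
import trl
import datasets

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| trl:", trl.__version__, "| datasets:", datasets.__version__)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB), "
          f"compute capability {major}.{minor} -> FlashAttention-2 eligible: {major >= 8}")
```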
Supervised Fine-Tuning (SFT)
This is where we impose discipline on the base model, teaching it to follow instructions. I’m using Dolly-15k, a high-quality instruction set from Databricks that provides a diverse range of tasks. The objective is to shift the model’s behavior from merely predicting the next token to completing a user’s request.
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
import torch
import re
import platform

# --- Determine hardware capabilities first ---
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
attn_impl = "eager"
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        # For GPUs with Compute Capability >= 8.0 (Ampere, Ada, Hopper),
        # install flash-attn for a significant speedup.
        # !pip install -q flash-attn
        attn_impl = "flash_attention_2"
        print("Using FlashAttention-2.")
    else:
        print("GPU compute capability less than 8.0. Using eager attention.")
else:
    print("No CUDA GPU found. Using eager attention on CPU.")
print(f"Processor: {platform.processor()} -> BF16 available: {use_bf16}, Attention impl: {attn_impl}")

# --- Model and Tokenizer Setup ---
model_id = "HuggingFaceTB/SmolLM-135M"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # Base models often lack a pad token
tok.padding_side = "left"  # Critical for decoder-only architectures

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation=attn_impl,
    device_map="auto",  # Maps model to GPU/CPU automatically
    torch_dtype=torch.bfloat16 if use_bf16 else torch.float16,
)
# A non-negotiable for memory savings. Trades compute for VRAM.
model.gradient_checkpointing_enable()

# --- Dataset Preparation ---
ds_raw = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_dolly(example):
    # Imposing a standard conversational format on the raw data.
    instruction = example["instruction"]
    context = example["context"]
    response = example["response"]

    if context and isinstance(context, str) and context.strip():
        # Clean up the noise - Wikipedia citation numbers aren't signal.
        clean_context = re.sub(r'\[\d+\]', '', context).strip()
        formatted = f"User: {instruction}\nContext: {clean_context}\nAssistant: {response}"
    else:
        formatted = f"User: {instruction}\nAssistant: {response}"

    return {"text": formatted}

# Process and split the dataset
ds_processed = ds_raw.map(format_dolly, remove_columns=list(ds_raw.features))
train_val_split = ds_processed.train_test_split(test_size=0.1, seed=42)
ds_train = train_val_split["train"]
ds_val = train_val_split["test"]
print(f"Training on {len(ds_train)} examples, validating on {len(ds_val)} examples.")

# --- Training Configuration ---
config = SFTConfig(
    output_dir="sft_smollm_135m",
    per_device_train_batch_size=4,   # Keep this low for ~12GB VRAM
    gradient_accumulation_steps=4,   # Effective batch size = 4 * 4 = 16
    num_train_epochs=3,              # Dolly is small enough for multiple epochs
    learning_rate=2e-5,
    fp16=not use_bf16,               # Use FP16 if BF16 is not available
    bf16=use_bf16,                   # Use BF16 on Ampere+ for speed and stability
    logging_steps=50,
    save_steps=500,
    eval_steps=50,
    evaluation_strategy="steps",     # Required if using eval_steps
    max_seq_length=2048,             # Match the model's context window
    packing=True,                    # Pack multiple short examples into one sequence for efficiency
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    args=config,
    dataset_text_field="text",       # Explicitly point to our formatted text column
)

# To run the training, uncomment the line below.
# trainer.train()
print("SFT Trainer configured. The machine is ready to impose some discipline.")
```
Tip: For Dolly-15k, three full epochs are a reasonable starting point. With ~15k examples and an effective batch size of 16, one epoch is roughly 940 optimization steps; note that `packing=True` concatenates short examples into 2048-token sequences, so the actual step count per epoch will come in lower than that raw-example estimate. I always recommend a dry run with a small number of steps (e.g., `max_steps=100`) to validate the entire pipeline before committing to a multi-hour training job.
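Once the dry run (or the full job) finishes, I poke the model with a single prompt before trusting any benchmark numbers. A minimal smoke test; the prompt text is my own placeholder, deliberately mirroring the User/Assistant format baked in by `format_dolly` above:

```python
# Quick generation check against the same "User: ... Assistant:" format used for SFT.
prompt = "User: Give me three tips for brewing better coffee.\nAssistant:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

model.eval()
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tok.eos_token_id,
    )

# Strip the prompt tokens and print only the continuation.
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```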
Practical Guidelines
My time in the trenches with these smaller models has yielded a few hard-won heuristics:
- Curate Ruthlessly: Tiny models have no capacity for garbage. They are exquisitely sensitive to data quality. They benefit immensely from high-signal, narrowly-scoped data like Dolly, which is composed of human-curated examples across useful categories (creative writing, classification, information extraction). Noise will kill your performance.
- Batch Size Matters: Even for a 135M model, achieving a stable global batch size (e.g., 64+ sequences per optimization step) is crucial for stable training. Gradient accumulation is your primary tool here, simulating a larger batch size without demanding more VRAM.
- Mind the Template: As of mid-2024, many base models on Hugging Face lack a defined `chat_template`. You will likely need to inject one manually before using higher-level abstractions like `apply_chat_template` (see the sketch after this list). Failure to do so results in models that don’t understand conversational turns.
- Learning Rate Schedule: For SFT, a learning rate between 1e-5 and 5e-5 with a linear decay schedule is a robust starting point. It helps stabilize training as the model converges.
- Quantize for Deployment: Once you are satisfied with the fine-tuned model’s performance, quantizing it is the final step to weaponize it for inference. Converting to a format like INT4-GGUF with `llama.cpp` can shrink the 360M model to fit comfortably within 3 GB of RAM.
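On the template point: here is a minimal sketch of injecting a chat template by hand. The Jinja string is my own assumption, written to mirror the plain User/Assistant format used in `format_dolly`; it is not an official SmolLM template.

```python
# Give a base-model tokenizer a chat template so apply_chat_template works.
tok.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}User: {{ message['content'] }}\n"
    "{% elif message['role'] == 'assistant' %}Assistant: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}Assistant:{% endif %}"
)

messages = [{"role": "user", "content": "Summarize the plot of Dune in two sentences."}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# -> User: Summarize the plot of Dune in two sentences.
#    Assistant:
```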
Limitations & Next Steps
Let’s be clear about what these models are not. Don’t bother with Direct Preference Optimization (DPO) on models this small; they lack the parameter capacity to effectively learn from preference pairs without suffering catastrophic forgetting. Dolly provides a solid SFT foundation, but the limitations are real:
- Context Window: The SmolLM series is locked at a 2048-token context. For long-document Q&A, you are still bound to a Retrieval Augmented Generation (RAG) architecture (a toy sketch follows this list).
- Multilingualism: The pre-training data is overwhelmingly English. Performance in other languages will require dedicated fine-tuning on translated or newly generated instruction sets.
- Complex Reasoning: Performance on multi-step logical reasoning (like GSM8K math problems) will hit a hard ceiling. For these tasks, chasing scale or exploring radically different architectures like Mixture-of-Experts (MoE) ensembles is likely a more productive path.
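To make the RAG point concrete, here is a toy sketch of the pattern. The keyword-overlap retriever and both helper names are purely illustrative stand-ins for a real embedding-based retriever; the prompt simply reuses the Context slot from the SFT format above.

```python
# Fit long-document Q&A into a 2048-token window: retrieve one relevant chunk,
# then drop it into the "Context:" slot of the SFT prompt format.
def naive_retrieve(question: str, document: str, chunk_chars: int = 1500) -> str:
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    q_terms = set(question.lower().split())
    # Toy scoring: pick the chunk sharing the most words with the question.
    return max(chunks, key=lambda chunk: len(q_terms & set(chunk.lower().split())))

def rag_prompt(question: str, document: str) -> str:
    context = naive_retrieve(question, document)
    return f"User: {question}\nContext: {context}\nAssistant:"
```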
Conclusion
The SmolLM series is an existence proof that “good enough” language models are now accessible to builders outside of large, well-capitalized labs. With a weekend of tinkering on a consumer GPU, it’s possible to craft a personalized assistant that respects both data privacy and budgetary reality. The Supervised Fine-Tuning workflow is a repeatable recipe for imprinting your intent onto a base model, a process that is not only effective but a potent reminder that value can be created without asking for anyone’s permission.