Decoding Distillation Scaling Laws : Compute‑Optimal Teacher–Student Training

Introduction

Scaling laws are the brutal physics of our field. They tell us what is possible and, more importantly, what it will cost. We’ve moved from Kaplan’s early prescriptions for compute-optimal training to the Chinchilla doctrine of token-to-parameter ratios. But that entire conversation largely ignores the endgame: the crushing, continuous cost of inference.

Distillation Scaling Laws (Busbridge et al., 2025) forces the issue by asking a question of pure economic pragmatism:

Given a fixed amount of compute, how should I split it between training a teacher and distilling a student so that the student ends up as good as possible?

The paper’s answer is more than just another data point; it’s a new empirical law, validated by a staggering 5.4 × 10²³ FLOPs of computation. It provides recipes that finally demystify when distillation is a sound capital investment and when it’s just wishful thinking.

This is my breakdown of the core ideas. We’ll translate the math into plain English and distill it into a checklist for practitioners who, like me, measure progress in GPU-hours and dollars, not just academic citations.

Key Takeaways

Performance is governed by a broken power-law. The student’s loss improves as a power-law of its size and data, but the teacher’s influence is more complex. Its benefit switches regimes once the student’s own learning capacity becomes the dominant factor.
The “capacity gap” is demystified. That infamous U-shaped curve in performance isn’t about raw parameter counts. It’s a function of the ratio between the teacher’s and student’s learning capacities.
Distillation has a price. If you have to train the teacher from scratch, it’s only more compute-efficient than standard pre-training below a certain FLOP threshold. This threshold depends on the student’s size. The game changes if you can reuse the teacher.
No more guesswork. The authors provide closed-form rules for picking the right teacher size, data budget, and distillation tokens for a given student, turning a dark art into an engineering discipline.

From Supervised to Distillation Scaling Laws

The work begins with the familiar scaling law for supervised pre-training:

$L(N,D)=E+\biggl(\tfrac{A}{N^{\alpha}} + \tfrac{B}{D^{\beta}}\biggr)^{\gamma},$

where $N$ is the model’s non-embedding parameters and $D$ is the number of training tokens.

From here, they derive the distillation equivalent. The core insight is to treat the teacher’s final loss, $L_T$ , as the primary signal guiding the student. This yields the distillation scaling law:

$\boxed{\;L_S(N_S,D_S,L_T)=L_T+\frac1{L_T^{c_0}}\left(1+\left(\tfrac{L_T}{\tilde L_S d_1}\right)^{1/f_1}\right)^{-c_{1}f_1}\left(\tfrac{A}{N_S^{\alpha'}}+\tfrac{B}{D_S^{\beta'}}\right)^{\gamma'}\;}$

(Eq. 8 in the paper)

Here, $\tilde L_S$ is the cross-entropy the same student would have achieved with standard supervised training on $D_S$ tokens. The equation’s structure reveals a transition point at $L_T/\tilde L_S=d_{1}$ , which marks where the capacity gap kicks in.

Why it matters

This isn’t just elegant mathematics. This formula predicts downstream student accuracy to within ±1% across a vast range of models: teachers from 143M to 12.6B parameters and students from 143M to 12.6B. It provides a Rosetta Stone, allowing us to plan distillation budgets with confidence instead of burning capital on blind grid searches.

Experimental Design in a Nutshell

Axis	Range
Teacher sizes	198 M → 7.75 B
Student sizes	143 M → 12.6 B
Data	English-only C4
Sequences	4,096 tokens
Protocols	(i) Fixed- $M$ teachers, IsoFLOP students; (ii) IsoFLOP teachers, fixed- $M$ students; (iii) grid of fixed- $M$ pairs

The training regimen was rigorous, using Maximal Update Parameterisation (µP), RMSNorm, RoPE, and AdamW. All distillation runs were pure knowledge distillation ( $\lambda=1, \tau=1$ ), as initial ablations showed no clear benefit from hard labels or higher temperatures.

What the Scaling Law Reveals

1 When Distillation Wins

The law clarifies the cold economics of distillation. If you must also bear the cost of training the teacher, distillation only beats supervised pre-training under specific conditions.

For any given student size, the break-even compute budget-where distillation becomes more expensive than supervised training-grows roughly linearly with the student’s parameter count.

2 Optimal Compute Split

For small students (<3B), it’s optimal to spend most of the compute budget (<10²¹ FLOPs) on the teacher. As the budget grows, the balance shifts towards investing more in the student.
For large students (>10B), the opposite holds. The optimal strategy is to always spend the majority (≥60%) of FLOPs on the student, using a teacher that is only slightly larger and trained on 2×–4× its Chinchilla-optimal token count.

3 Teacher Size vs. Strength

This is a critical insight. The law demonstrates that it is the teacher’s final loss, not its parameter count, that dictates student quality. A well-trained 3B-parameter teacher, pushed past its Chinchilla point to 4T tokens, can be a more effective educator than a 7B model trained conventionally. We’ve seen this pattern before: a proxy metric (size) is often a poor substitute for the real one (capability).

Practitioner’s Checklist

Don’t train a teacher you already own. If you already operate a large, capable model, its training cost is sunk. Distillation becomes an immediately attractive proposition for creating smaller, specialized variants.
Price the trade-off. Use the provided heuristics to estimate the break-even budget. The Python snippet below gives a rough idea of whether distillation is likely to beat supervised training for your target student size and compute budget.
Optimize for strength, not size. A smaller, over-trained teacher can be superior. Use the scaling law’s principles to target a specific teacher loss ( $L_T$ ), not just a parameter count.
Don’t drown the student. If your student model is small (<500M), an excessively powerful teacher can actually harm performance. Watch for the U-shaped capacity gap curve in pilot runs.
Amortize your capital. The cost of training a single, powerful teacher can be spread across the distillation of many different on-device students, fundamentally changing the economic calculation in its favor.

import math

# Simplified break-even heuristic from the paper's findings.
# This approximates the FLOPs budget where supervised training
# becomes more efficient than distillation (including teacher training).
break_f = lambda N_s: 3e20 * (N_s / 1e9)

# Your target student size (e.g., 1 billion parameters)
N_s = 1.0e9

# Your total available compute budget in FLOPs
budget = 8e20

print("Is distillation likely to beat supervised training?", budget < break_f(N_s))

Relation to Earlier Work

Patient & Consistent KD (Beyer et al., 2022) argued for long training with a consistent teacher view. This new law refines that: patience helps, but only up to the break-even point for distillation tokens ( $D_S$ ). Beyond that, supervised training catches up and overtakes it.
Capacity Gap studies (Zhang et al., 2023) proposed a linear rule of thumb for teacher-to-student size. This work generalizes that to a more fundamental, loss-based rule, which explains why a tiny but deeply trained teacher can be optimal.
Transfer Scaling (Hernandez et al., 2021) showed diminishing returns when transferring knowledge across different domains. These distillation scaling laws can be seen as the self-transfer case, where the teacher provides a high-fidelity approximation of the ground truth distribution.

Limitations & Open Questions

Domain & Modality. The experiments were confined to English text from C4. Transferring these laws to vision, code, or multilingual data will likely require recalibrating the coefficients.
Pure KD Only. The study focuses on pure knowledge distillation. The dynamics of using hard labels, sequence-level distillation, or targets from an RLHF-tuned teacher remain open questions.
Inference-Aware Training. The FLOPs-based compute model ignores critical inference metrics like activation memory and latency, which are paramount for on-device deployment.

Where’s the Code?

At the time of publication, Apple had not open-sourced the training scripts, and Papers-with-Code lists no official repository. An unofficial reproduction effort exists at I-Halder/knowledge-distillation-of-large-language-models, though it appears to focus on standard scaling laws. Keep an eye on the Apple ML Research page for any official release.

Closing Thoughts

For years, distillation has felt more like alchemy than engineering, a dark art guided by heuristics and expensive trial-and-error. This work drags it into the light.

The value here is the conversion of a high-stakes judgment call into a solvable equation. It gives us a price for our decisions. We were always trading compute between teacher and student; now we know the correct exchange rate. And in a world where FLOPs are capital, knowing the price is everything.

Posted in AI / ML, LLM Intermediate by Rakshit Kalra