
June 1, 2024

Self‑Augmenting Llama 3: Stretching Models to 120B with Layer Surgery

Introduction

The economics of AI are governed by a brutal physics: training a frontier model from scratch is a formidable undertaking, consuming nation-state levels of capital and compute. This reality exerts a gravitational pull towards a more pragmatic question: can we make existing models bigger, stronger, better, without paying the full sticker price? The idea of growing a pre-trained model post-hoc is a near-irresistible circumvention of the standard rules.

So when, in early 2024, an unofficial 120 billion-parameter variant of Llama 3 materialized on Hugging Face, even though Meta had released only 8B and 70B versions, it wasn’t magic. It was a feat of architectural brute force, achieved with nothing more sophisticated than file-level surgery.

Here, I dissect the mechanics of this self-augmentation. We’ll explore how to monkey-patch a computation graph using tools like Mergekit, what happens when you duplicate layers, and why the surprising success of Llama 3 120B is more of a cautionary tale than a universal blueprint.

TL;DR

  • ✔️ Duplicating layers can increase model capacity, but this act of architectural violence almost always shatters the delicate layer-to-layer alignment forged during pre-training.
  • ✔️ Fine-tuning isn’t just recommended; it’s the mandatory, often painful, process of teaching the augmented model to work again.
  • 📊 The creation of models like the Llama 3 120B through layer surgery reveals a fascinating path for model scaling, but its success is an outlier, not the norm.


1 Why Bother with Self‑Augmentation?

The asymmetry is staggering. Training a new transformer from the ground up costs millions. Duplicating layers costs minutes on a CPU and only as much disk space as the extra weights themselves. The intuition is seductive: stacking identical layers should, in theory, approximate the original function f(x). Since each duplicated layer L_{i}' is a perfect clone of its parent L_{i}, the logic seems sound:

L_{i}'(h) = L_{i}(h) \qquad\Longrightarrow\qquad f_{dup}(x) \approx f_{orig}(x).

What Actually Happens: This elegant assumption collides with a messy reality. The distribution of activations fed into a duplicated layer is no longer the pristine, on-distribution input it saw during its original training. The assumption breaks, often catastrophically.

  • Layer duplication bloats the parameter count, but this adds no effective expressivity until the new layers are forced to adapt. They are initially just redundant echoes.
  • Fine-tuning becomes a salvage operation, essential for re-aligning the duplicated layers to the new, distorted flow of information and residual pathways.

Why This Matters:

  • Self-augmentation is a powerful, high-risk method for “growing” models post-hoc. It offers a shortcut, but one paved with sharp tradeoffs in stability and convergence.
  • Understanding these mechanics is the difference between clever optimization and wasted compute.

Mechanics of Layer Duplication and Model Surgery

  • Layer duplication is the simple act of copying the weights and biases of existing transformer blocks and grafting them back into the computation graph.
  • Tools like Mergekit aren’t performing arcane magic; they operate at the file level, manipulating PyTorch checkpoint dictionaries by remapping keys and tensor slices (a minimal sketch follows this list).
  • The siren call of this method is that no retraining is required to create the larger model. But this is a trap. Substantial fine-tuning is almost always necessary to recover, let alone surpass, the original model’s performance.
  • Residual connections are the ghost in this machine. Duplicated layers are not functionally independent; their outputs are added back into the main pathway. This means output drift and gradient mismatch can rapidly destabilize the entire system during fine-tuning.
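To make the “file-level” claim concrete, here is a minimal sketch of what remapping checkpoint keys can look like for a Llama-style state dict. The model.layers.{i} key pattern follows the Hugging Face Llama naming; the file path and the duplication plan are made up for illustration and are not the Llama 3 120B recipe:

import re
import torch

# Assumed: a Llama-style checkpoint whose transformer blocks live under
# keys of the form "model.layers.{i}.<rest>". Paths and plan are illustrative.
state_dict = torch.load("model.pt", map_location="cpu")   # hypothetical single-file checkpoint

# The new stack, expressed as a list of source layer indices (here layers 2 and 3 are cloned).
plan = [0, 1, 2, 3, 2, 3, 4, 5]

new_state = {}
layer_key = re.compile(r"^model\.layers\.(\d+)\.(.+)$")
for key, tensor in state_dict.items():
    match = layer_key.match(key)
    if match is None:                        # embeddings, final norm, lm_head: copy unchanged
        new_state[key] = tensor
        continue
    src_idx, rest = int(match.group(1)), match.group(2)
    for new_idx, src in enumerate(plan):     # write the tensor to every position that reuses it
        if src == src_idx:
            new_state[f"model.layers.{new_idx}.{rest}"] = tensor.clone()

torch.save(new_state, "model_grown.pt")
# Remember to bump num_hidden_layers in config.json to len(plan) as well.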

Limitations:

  • Without adaptation, duplicated layers are dead weight, contributing nothing but increased latency and memory usage.
  • Aggressive duplication or chaotic reordering is a recipe for disaster, risking vanishing/exploding gradients, sky-high validation loss, or a model that simply refuses to converge.
  • Not all architectures tolerate such surgery. Llama models appear robust, but other architectures can be far more brittle.

2 The Mergekit Passthrough Method

Mergekit’s passthrough mode is the scalpel for this operation. It copies weight tensors verbatim, re-indexing them according to a YAML recipe. The core logic is brutally simple. A toy 6-layer model becomes a 10-layer one like this:

[Diagram: the original 6-layer transformer re-stacked into a 10-layer model via passthrough duplication]

Each block retains its exact weights; the new topology is purely a function of order and count.
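To see how “order and count” alone define the new topology, here is a toy sketch of the re-indexing. The slice boundaries are assumptions chosen only so that six source blocks yield a ten-layer stack; they are not boundaries used by any real recipe:

# Toy illustration of passthrough re-indexing: overlapping [start, end) slices of layer indices.
original_layers = list(range(6))          # blocks 0..5 of the toy model
slices = [(0, 4), (2, 6), (4, 6)]         # hypothetical overlapping ranges

new_stack = [i for start, end in slices for i in original_layers[start:end]]
print(new_stack)                          # [0, 1, 2, 3, 2, 3, 4, 5, 4, 5]
print(len(new_stack))                     # 10 layers built from 6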

Practical Notes:

  • Mergekit recipes offer arbitrary control to duplicate, prune, or reorder layers. Be warned: non-contiguous or overlapping slices can create bizarre and unpredictable residual paths.
  • The memory and disk footprint of the new model scales linearly with the number of new layers. This is the only part of the process that is simple and predictable.
  • File-level surgery is fast, but loading the resulting behemoths may require adjusting system ulimits or using memory-mapped weights to avoid crashing your machine.
  • Always validate the final model structure (layer count, order) before burning compute on fine-tuning. A typo in the YAML can silently produce a broken model that will waste your time and money; a quick structural check is sketched below.
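A sanity check along those lines can be done with the standard transformers config API before any GPU time is spent; the output directory name here is hypothetical:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("./llama3-grown")   # hypothetical merged-model directory
expected_layers = 140                                # whatever the recipe is supposed to produce

assert cfg.num_hidden_layers == expected_layers, (
    f"merge produced {cfg.num_hidden_layers} layers, expected {expected_layers}"
)
print("layer count OK:", cfg.num_hidden_layers)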

The actual recipe to create a ~120B model from the 70B Llama 3 is more complex, using overlapping slices to build a deeper network:

slices:
  - sources: [{layer_range: [0, 20],  model: meta-llama/Meta-Llama-3-70B-Instruct}]
  - sources: [{layer_range: [10, 30], model: meta-llama/Meta-Llama-3-70B-Instruct}]
  - sources: [{layer_range: [20, 40], model: meta-llama/Meta-Llama-3-70B-Instruct}]
  - sources: [{layer_range: [30, 50], model: meta-llama/Meta-Llama-3-70B-Instruct}]
  - sources: [{layer_range: [40, 60], model: meta-llama/Meta-Llama-3-70B-Instruct}]
  - sources: [{layer_range: [50, 70], model: meta-llama/Meta-Llama-3-70B-Instruct}]
  - sources: [{layer_range: [60, 80], model: meta-llama/Meta-Llama-3-70B-Instruct}]
merge_method: passthrough
dtype: float16

This recipe takes the 80-layer Llama 3 70B and constructs a new model with 7 slices of 20 layers each, resulting in a 140-layer architecture. The parameter count grows from N=70 \textrm{B} to approximately N_{new} = (140/80) \times 70 \textrm{B} \approx 122.5 \textrm{B}. Creation time on a standard CPU: a mere ~20 minutes.
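As a back-of-the-envelope check of those numbers (treating parameters as roughly proportional to layer count, which ignores the embeddings and LM head):

# Rough layer and parameter arithmetic for the recipe above.
slices = [(0, 20), (10, 30), (20, 40), (30, 50), (40, 60), (50, 70), (60, 80)]

new_layers = sum(end - start for start, end in slices)   # 7 slices x 20 layers
orig_layers, orig_params_b = 80, 70.0
est_params_b = orig_params_b * new_layers / orig_layers

print(new_layers)              # 140
print(round(est_params_b, 1))  # 122.5 (billions of parameters)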


Deployment and Fine-Tuning Advice

  • Fine-tune. Always. Even the smallest modification demands it.
  • Use a smaller learning rate for duplicated parameters to avoid immediate divergence (more below).
  • Monitor validation loss and output quality religiously. Catastrophic drift can occur suddenly.
  • For very large models, consider progressive augmentation: duplicate a few layers, fine-tune until stable, then repeat. All-at-once surgery is a high-stakes gamble.
  • Evaluate the cost. More layers mean more VRAM for inference and higher latency. The model is both bigger and slower as a consequence.

3 Theory: When Duplication Helps, and When It Doesn’t

3.1 The Illusion of Identity

In a transformer, each block is wrapped in a residual pathway. When we clone a block, the residual stream is forced to pass through the same transformation twice. Let the original block implement a function F(h) and let \alpha be the residual scaling coefficient (usually 1). After duplication, the new state becomes:

h'\;=\;h\; +\; \alpha F(h) \; +\; \alpha F(h + \alpha F(h))

Because F was trained on the pre-duplication distribution, its second invocation sees an out-of-distribution input. This causes systematic drift. Early layers in a network, which handle more general features, tend to be more robust. Deeper layers, which operate on highly abstract representations, are far more brittle and will amplify this distributional shift.
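A toy snippet makes the shift tangible. The block below is a random stand-in for F rather than a trained transformer layer, with \alpha = 1; the only point is that the clone’s input is already displaced from the input the original block was fitted to:

import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
F = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))  # stand-in block

h = torch.randn(8, d)
once  = h + F(h)          # residual update of the original block
twice = once + F(once)    # the clone operates on the already-shifted stream

input_shift = float((once - h).norm() / h.norm())
print(f"relative input shift seen by the clone: {input_shift:.2f}")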

3.2 The Gradient Storm

During fine-tuning, the problem compounds. If the duplicated layers share (tie) their weights, the gradients from each invocation accumulate on the same tensor during the backward pass. The effective update for a shared parameter block therefore scales with the number of its copies, k:

\Delta W_{dup} \approx k \Delta W_{orig}

Without adjusting the learning rate, optimization will overshoot, likely catastrophically. The simplest remedy is to divide the learning rate by k for the shared parameters, or to decouple the clones entirely by giving them unique parameter IDs before training begins.
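In PyTorch, the 1/k remedy maps naturally onto optimizer parameter groups. The helper below is a minimal sketch: it assumes Hugging Face-style parameter names (model.layers.{i}. ...) and a hypothetical set of indices identifying the cloned blocks:

import torch

def make_param_groups(model, dup_layer_ids, base_lr, k):
    """Give duplicated blocks (identified by layer index) a learning rate scaled by 1/k."""
    dup_params, rest_params = [], []
    for name, p in model.named_parameters():        # e.g. "model.layers.23.mlp.up_proj.weight"
        parts = name.split(".")
        layer_id = int(parts[2]) if len(parts) > 2 and parts[2].isdigit() else -1
        (dup_params if layer_id in dup_layer_ids else rest_params).append(p)
    return [
        {"params": rest_params, "lr": base_lr},
        {"params": dup_params, "lr": base_lr / k},   # scaled-down LR for the clones
    ]

# Hypothetical usage:
# optimizer = torch.optim.AdamW(make_param_groups(model, {20, 21, 22, 23}, 2e-5, k=2))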

3.3 Parameters Aren’t Power

Raw parameter count is a vanity metric. What matters is effective capacity, C, which grows sub-linearly with the number of parameters, N. A common heuristic, echoing scaling laws like Chinchilla, is:

C \propto N^{\beta}, \; 0<\beta<1

Empirically, \beta is often found to be around 0.7 for language modeling tasks. With \beta = 0.7, doubling N yields only about 2^{0.7} \approx 1.6\times the effective capacity. Doubling the parameters does not double the intelligence.


4 Mitigation Strategies for Layer Surgery

If you must perform this kind of surgery, think of it as taming a beast. Here are some tools to keep it from turning on you.

Identity-Init Clones:

  • Don’t just copy layers; re-initialize the duplicated ones to approximate the identity function. This minimizes the initial output drift and gives the model a fighting chance at stability.
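For a Llama-style pre-norm block, one way to approximate this (module names follow the Hugging Face Llama implementation; this is a sketch, not the procedure behind any released merge) is to zero the two projections that write back into the residual stream, so the clone starts as an exact identity on that stream:

import torch

@torch.no_grad()
def identity_init_block(block):
    """Make a cloned Hugging Face Llama decoder layer contribute nothing at first."""
    block.self_attn.o_proj.weight.zero_()   # attention output written into the stream
    block.mlp.down_proj.weight.zero_()      # MLP output written into the stream

# Hypothetical usage on the freshly cloned blocks of the grown model:
# for idx in cloned_layer_indices:
#     identity_init_block(model.model.layers[idx])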

Scaled Learning Rate:

  • Reduce the learning rate for duplicated parameters by a factor of 1/k (where k is the number of copies). This is non-negotiable to prevent immediate gradient explosion.

Knowledge Distillation:

  • Use the logits from the original, smaller model as a “teacher.” This provides soft targets that guide the augmented model’s output, helping to re-align the cloned layers during fine-tuning.
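A minimal form of that loss, temperature-scaled KL against the frozen teacher blended with the usual cross-entropy (the temperature and mixing weight here are arbitrary defaults, not tuned values):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-target KL from the original model."""
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)                              # standard temperature-squared scaling
    return alpha * ce + (1.0 - alpha) * kl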

Gated Residuals:

  • A more sophisticated approach: add a learnable gate g after each cloned block. This allows the network itself to adaptively decide how much influence each duplicated block should have.
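A sketch of such a wrapper, where block stands for the transformation F from Section 3.1 (not a full Hugging Face decoder layer with its own internal residuals), and the gate starts at zero so the clone begins with no influence:

import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Scale a cloned block's contribution to the residual stream by a learnable gate."""
    def __init__(self, block):
        super().__init__()
        self.block = block
        self.gate = nn.Parameter(torch.zeros(1))   # g = 0 at init: the clone starts as identity

    def forward(self, h):
        return h + torch.tanh(self.gate) * self.block(h)   # h' = h + g * F(h)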

Progressive Growth:

  • Don’t go for the full 120B in one shot. Duplicate a small number of layers, fine-tune until the loss stabilizes, and then repeat. This staged approach allows for gradual adaptation and significantly reduces the risk of training collapse.

Other Hard-Won Lessons:

  • Always validate the forward pass and output tensor shapes after surgery. Silent shape mismatches are a source of subtle, maddening bugs.
  • Monitor gradient norms and loss spikes like a hawk during the first few fine-tuning steps. They are your early warning system for an impending collapse; a minimal monitoring sketch follows this list.
  • Keep a “frozen” copy of the original model on hand for ablation studies and distillation. You need a baseline to know if you’re making things better or just bigger.
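For the gradient-monitoring point, a few lines inside the fine-tuning loop are enough; model, loss, and step come from the surrounding training loop, and the blow-up threshold is an arbitrary choice:

import math
import torch

# Right after loss.backward() and before optimizer.step():
grad_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0))
if step % 10 == 0:
    print(f"step {step}: loss={loss.item():.3f}  grad_norm={grad_norm:.2f}")
if not math.isfinite(grad_norm) or grad_norm > 100.0:
    raise RuntimeError("gradient blow-up: the surgery is destabilizing training")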

Comparison to Other Scaling Strategies

  • Self-augmentation is the guerrilla tactic: fast, flexible, and cheap, but highly unstable.
  • Mixture-of-Experts (MoE) is the standing army: it increases capacity with intelligent routing, but requires significant architectural changes and more complex training infrastructure.
  • Distillation and pruning are defensive maneuvers: they shrink or adapt models, but cannot increase raw parameter count.
  • Layer surgery is unique. It’s a powerful prototyping tool for exploring the limits of a given architecture, but it’s no substitute for the rigor of large-scale pre-training.

5 Practical Advice

Scenario | Recommended Action
Small model (<15B) | Identity clones + distillation; limit growth to +25% layers.
Medium (15–70B) | Gated residuals are worth considering; cap growth at +30–50%.
Large (>70B) | Progressive growth is your best bet; monitor KV-cache usage.

6 Conclusion

In my own experiments, layer cloning works. It can, indeed, enlarge an LLM, provided you are prepared to aggressively stabilize gradients and painstakingly re-align the model’s internal pathways. The process is a microcosm of engineering itself: a simple, elegant idea that conceals a mountain of messy, practical complexity.
