
October 26, 2023

Wanda: A Simple Yet Powerful Approach to Pruning Large Language Models

Reference
Title: A Simple and Effective Pruning Approach for Large Language Models
Authors: Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter
Conference: ICLR 2024 (arXiv 2306.11695v3)

Executive Summary

Large Language Models (LLMs) have revolutionized AI but come with enormous computational demands. This article explores Wanda, a remarkably simple yet effective pruning technique that can remove half of an LLM's weights with minimal performance loss, and without any retraining. Unlike traditional magnitude-based approaches, Wanda scores each connection by both its weight magnitude and the typical scale of its input activations, allowing it to identify the connections that actually matter in these massive networks.


1. The Growing Challenge of LLM Size

Modern language models have expanded dramatically in size—from hundreds of millions to hundreds of billions of parameters:

  • GPT-4: Estimated 1.7 trillion parameters
  • LLaMA-2-70B: 70 billion parameters
  • PaLM: 540 billion parameters

This growth has brought unprecedented capabilities but also substantial challenges:

  • Computational requirements: Training and inference demand enormous computing resources
  • Memory consumption: Even running these models requires significant GPU memory
  • Energy usage: The environmental footprint of deploying these models at scale is concerning
  • Accessibility: Many organizations simply cannot afford the infrastructure needed

Neural network pruning—removing less important weights—has been studied for decades as a way to slim down models. However, applying traditional pruning methods to LLMs has proven particularly challenging.


2. Why Traditional Pruning Methods Fall Short for LLMs

Existing approaches to model pruning typically suffer from three critical limitations when applied to billion-parameter LLMs:

2.1 Prohibitive Retraining Costs

Many pruning methods operate in cycles of:
1. Train the model
2. Prune less important weights
3. Retrain to recover accuracy

This iterative process becomes completely infeasible when each training cycle might cost hundreds of thousands of dollars for the largest models.


2.2 Computational Complexity of Advanced Methods

More sophisticated pruning techniques like Optimal Brain Surgeon (OBS) use second-order derivatives or complex weight-reconstruction steps. While effective for smaller networks, these methods:

  • Require computing and storing enormous Hessian matrices
  • Have memory requirements that scale quadratically with model size
  • Become computationally intractable at the scale of billions of parameters

2.3 Naive Magnitude Pruning Doesn’t Work Well

The simplest pruning approach—removing weights with the smallest absolute values—works surprisingly well for many smaller networks. However, for LLMs, it falls apart:

At 50% sparsity with naive magnitude pruning:

  • Small CNN (CIFAR-10): ~1% accuracy drop
  • LLaMA-7B: perplexity climbs from ~7 to ~17-20 (catastrophic degradation)

This failure points to a fundamental issue: in LLMs, a weight’s absolute value alone doesn’t fully capture its importance.

3. The Wanda Pruning Method: Simple Yet Powerful

The key insight behind Wanda comes from a critical observation about LLMs: they exhibit extremely uneven activation patterns across hidden dimensions.

3.1 The Outlier Feature Phenomenon

Research has shown that in transformer-based LLMs, some hidden features can be 100× larger in magnitude than others. This “outlier feature” phenomenon means:

  • A small weight connected to a frequently-activated dimension can be crucial
  • A large weight connected to a rarely-activated dimension might be expendable

Magnitude pruning misses this nuance entirely, which explains its poor performance.

Figure: Activation magnitudes in LLMs
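
As a toy illustration of this effect (using synthetic tensors, not data from the paper), the snippet below computes the per-dimension L2 norm of an activation matrix and compares the largest dimension to the median:

```python
import torch

# Hypothetical activation matrix: (num_tokens, hidden_dim),
# e.g. collected from one transformer layer on a calibration batch.
torch.manual_seed(0)
X = torch.randn(2048, 4096)
X[:, 100] *= 100.0  # simulate a single "outlier" hidden dimension

# Per-dimension L2 norm over all calibration tokens: shape (hidden_dim,)
dim_norms = X.norm(p=2, dim=0)

ratio = dim_norms.max() / dim_norms.median()
print(f"max/median activation norm ratio: {ratio:.1f}x")
```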

3.2 The Wanda Importance Metric

Wanda introduces a remarkably simple metric that accounts for both weight magnitude and activation patterns:

$$\text{Importance}(W_{ij}) = |W_{ij}| \times \|X_j\|_2$$

Where:

  • $W_{ij}$ is the weight connecting the $j$-th input to the $i$-th output
  • $X_j$ represents all activations in the $j$-th dimension over a calibration batch
  • $\|X_j\|_2$ is the L2 norm of those activations (capturing how "active" that dimension is)

This elegantly combines two crucial factors:

  • How large the weight is ($|W_{ij}|$)
  • How important its input dimension typically is ($\|X_j\|_2$)
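
In code, the metric is a single element-wise multiplication with broadcasting. A minimal sketch (not the authors' reference implementation), assuming a weight matrix `W` of shape (out_features, in_features) and a precomputed vector `x_norm` of per-input-dimension activation norms:

```python
import torch

def wanda_scores(W: torch.Tensor, x_norm: torch.Tensor) -> torch.Tensor:
    """Importance(W_ij) = |W_ij| * ||X_j||_2.

    W:      (out_features, in_features) weight matrix
    x_norm: (in_features,) L2 norm of each input dimension's activations
    """
    return W.abs() * x_norm  # x_norm broadcasts across the output rows
```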


3.3 The Complete Wanda Algorithm

Implementing Wanda requires just four straightforward steps:

  1. Gather a small calibration dataset (e.g., 128 diverse text sequences)
  2. Run a forward pass through the model to compute activation norms for each feature
  3. Calculate the importance score for each weight using the formula above
  4. Prune per output row: Remove the lowest-scoring weights within each output neuron’s connections, rather than globally

This row-wise pruning ensures balanced sparsification across all output features, which empirically works better than pruning globally or layer-wise.
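
Putting these steps together, here is a minimal sketch of pruning one linear layer to a target unstructured sparsity, assuming the activation norms `x_norm` have already been collected during the calibration pass (an illustrative implementation, not the authors' released code):

```python
import torch

@torch.no_grad()
def wanda_prune_layer(weight: torch.Tensor, x_norm: torch.Tensor, sparsity: float = 0.5) -> None:
    """Zero out the lowest-scoring weights within each output row, in place.

    weight: (out_features, in_features) weight matrix of one linear layer
    x_norm: (in_features,) per-dimension activation norms from the calibration pass
    """
    scores = weight.abs() * x_norm               # Wanda metric: |W_ij| * ||X_j||_2
    n_prune = int(weight.shape[1] * sparsity)    # weights to drop per output row
    # Per-row indices of the n_prune lowest-scoring weights.
    prune_idx = torch.topk(scores, n_prune, dim=1, largest=False).indices
    weight.scatter_(1, prune_idx, 0.0)           # set them to zero in place
```

In practice this would be applied to each weight matrix in the network independently, one layer at a time.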


3.4 No Fine-tuning Required

Perhaps the most remarkable aspect of Wanda is what it doesn’t do: there’s no retraining, no iterative updates, and no complex weight adjustment. Pruned weights are simply set to zero while everything else remains untouched.

4. Experimental Results: How Well Does It Work?

The authors evaluated Wanda across multiple LLaMA variants (7B to 70B parameters) on standard benchmarks:

4.1 Zero-Shot Task Performance

At 50% sparsity (half the weights removed):

| Model | Magnitude Pruning | Wanda | Original Dense |
|---|---|---|---|
| LLaMA-7B | Severe degradation | 92% of original performance | 100% |
| LLaMA-13B | Severe degradation | 95% of original performance | 100% |

4.2 Language Modeling Perplexity

Perplexity measures how well a language model predicts text (lower is better):

| Model | Original | Magnitude Pruning (50%) | Wanda (50%) | SparseGPT (50%) |
|---|---|---|---|---|
| LLaMA-7B | ~7 | ~17-20 | ~7-8 | ~7-8 |
| LLaMA-13B | ~6 | ~15-18 | ~6-7 | ~6-7 |

Wanda nearly matches the more complex SparseGPT approach while being significantly faster to execute.
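
Perplexity is simply the exponential of the average per-token cross-entropy. A minimal sketch of how it can be measured, assuming a Hugging Face-style causal LM that returns the mean loss when labels are provided (the model and inputs are placeholders):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    # Causal LMs in the Hugging Face style compute the shifted
    # cross-entropy loss internally when labels are supplied.
    loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())
```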

4.3 Structured Sparsity Results

Beyond unstructured pruning, Wanda can be adapted to enforce hardware-friendly patterns like N:M sparsity, where N out of every M consecutive weights are kept (a code sketch of this variant follows the list below):

  • 2:4 sparsity pattern: Wanda maintains performance within 98% of unstructured pruning
  • 4:8 sparsity pattern: Similar results, making it compatible with NVIDIA’s sparse tensor cores

Figure: 4:8 structured sparsity pattern
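
A rough sketch of the structured variant (illustrative rather than the paper's exact code): within every group of `m` consecutive input weights, keep only the `n` with the highest Wanda scores, shown here defaulting to the 2:4 pattern:

```python
import torch

@torch.no_grad()
def wanda_prune_nm(weight: torch.Tensor, x_norm: torch.Tensor, n: int = 2, m: int = 4) -> None:
    """Keep only the n highest-scoring weights in every group of m consecutive
    input weights (in place). Assumes in_features is divisible by m."""
    out_f, in_f = weight.shape
    scores = (weight.abs() * x_norm).view(out_f, in_f // m, m)
    # Indices of the (m - n) lowest scores within each group of m.
    drop_idx = torch.argsort(scores, dim=-1)[..., : m - n]
    # .view() shares storage with `weight`, so zeroing here prunes the layer in place.
    weight.view(out_f, in_f // m, m).scatter_(-1, drop_idx, 0.0)
```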

4.4 Scaling with Model Size

An interesting finding is that larger models tolerate pruning better:

  • LLaMA-7B: Shows noticeable degradation at 60%+ sparsity
  • LLaMA-70B: Maintains strong performance even at 60-70% sparsity

This suggests that as models grow, they become increasingly overparameterized and amenable to pruning.

Figure: Pruning tolerance vs. model size

5. Practical Implications and Implementation

5.1 Computational Efficiency

Wanda’s simplicity translates to dramatic speed advantages:

  • 100-300× faster than methods requiring weight reconstruction
  • Pruning LLaMA-7B takes minutes rather than hours or days
  • Memory overhead is minimal (just storing activation norms)

5.2 Implementation Details

Implementing Wanda requires very little code; a hook-based sketch of the activation-collection step follows the list below:

  1. Run a forward pass with a small batch through the model
  2. Compute L2 norms of each input dimension’s activations
  3. For each weight matrix, calculate importance scores
  4. Set the bottom 50% (or target sparsity) of weights to zero, row by row

Figure: Row-wise pruning
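
A sketch of steps 1-2, collecting the per-input-dimension activation norms with PyTorch forward hooks; `model` and `calib_batches` are placeholders, and any model whose prunable layers are `nn.Linear` would work the same way:

```python
import torch
import torch.nn as nn

def collect_activation_norms(model: nn.Module, calib_batches) -> dict:
    """Return {layer_name: (in_features,) L2 norm of that layer's inputs}."""
    sq_sums, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Flatten batch/sequence dims: (tokens, in_features)
            x = inputs[0].detach().reshape(-1, inputs[0].shape[-1])
            sq_sums[name] = sq_sums.get(name, 0.0) + (x.float() ** 2).sum(dim=0)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for batch in calib_batches:   # e.g. 128 tokenized calibration sequences
            model(batch)

    for h in hooks:
        h.remove()

    return {name: s.sqrt() for name, s in sq_sums.items()}
```

The resulting norms are exactly the $\|X_j\|_2$ terms in Wanda's metric; each layer's weight matrix can then be pruned independently in a single pass, as in the row-wise sketch from Section 3.3.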

5.3 Practical Usage Scenarios

Wanda enables several practical use cases:

  • Deployment on resource-constrained devices: Pruned models require less memory
  • Faster inference: Properly implemented sparse operations can reduce computation time
  • Reduced cloud computing costs: Less memory usage means cheaper deployment
  • Foundation for further compression: Wanda can be combined with quantization for even smaller models

6. Limitations and Future Directions

While Wanda represents a significant advancement, it has limitations:

6.1 Current Limitations

  • High sparsity degradation: At very high sparsities (70-80%+), performance still degrades substantially
  • Calibration data requirement: Though minimal, you need representative text samples
  • Implementation challenges: Efficiently leveraging sparsity requires specialized sparse matrix libraries

6.2 Promising Research Directions

Looking ahead, several promising directions emerge:

  1. Light post-pruning fine-tuning: Even brief fine-tuning after Wanda pruning might recover more performance
  2. Pruning + quantization pipelines: Combining both techniques could yield extremely efficient models
  3. Hardware-aware pruning patterns: Further adapting Wanda to specific hardware accelerators
  4. Automated calibration set selection: Developing methods to identify the most informative calibration samples
  5. Activation pruning: Extending the approach to prune entire feature dimensions

7. Conclusion: A Practical Path to Slimmer LLMs

Wanda represents a rare innovation: a method that’s both simpler and more effective than its predecessors. By recognizing the uneven activation patterns in LLMs and incorporating this insight into a straightforward pruning metric, it achieves what complex methods do—at a fraction of the computational cost.

For practitioners and researchers, Wanda opens new possibilities for deploying LLMs in resource-constrained environments. A 50% reduction in model size with minimal performance impact could be the difference between a model that fits in memory and one that doesn’t, or between a responsive user experience and a laggy one.

As LLMs continue growing in size and importance, techniques like Wanda will be crucial in making these powerful models more accessible, efficient, and environmentally sustainable. The most elegant solutions are often the simplest, and Wanda proves this principle holds even in the complex world of billion-parameter language models.
