Wanda: A Simple Yet Powerful Approach to Pruning Large Language Models
Reference
Title: A Simple and Effective Pruning Approach for Large Language Models
Authors: Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter
Conference: ICLR 2024 (arXiv 2306.11695v3)
Executive Summary
Large Language Models (LLMs) have revolutionized AI but come with enormous computational demands. This article explores Wanda, a remarkably simple yet effective pruning technique that can reduce LLM size by 50% with minimal performance loss—all without any retraining. Unlike traditional approaches, Wanda considers both weight magnitude and activation patterns, allowing it to identify truly important connections in these massive networks.

1. The Growing Challenge of LLM Size
Modern language models have expanded dramatically in size—from hundreds of millions to hundreds of billions of parameters:
- GPT-4: Estimated 1.7 trillion parameters
- LLaMA-2-70B: 70 billion parameters
- PaLM: 540 billion parameters
This growth has brought unprecedented capabilities but also substantial challenges:
- Computational requirements: Training and inference demand enormous computing resources
- Memory consumption: Even running these models requires significant GPU memory
- Energy usage: The environmental footprint of deploying these models at scale is concerning
- Accessibility: Many organizations simply cannot afford the infrastructure needed
Neural network pruning—removing less important weights—has been studied for decades as a way to slim down models. However, applying traditional pruning methods to LLMs has proven particularly challenging.

2. Why Traditional Pruning Methods Fall Short for LLMs
Existing approaches to model pruning typically suffer from three critical limitations when applied to billion-parameter LLMs:
2.1 Prohibitive Retraining Costs
Many pruning methods operate in cycles of:
1. Train the model
2. Prune less important weights
3. Retrain to recover accuracy
This iterative process becomes completely infeasible when each training cycle might cost hundreds of thousands of dollars for the largest models.

2.2 Computational Complexity of Advanced Methods
More sophisticated pruning techniques like Optimal Brain Surgeon (OBS) use second-order derivatives or complex weight-reconstruction steps. While effective for smaller networks, these methods:
- Require computing and storing enormous Hessian matrices
- Have memory requirements that scale quadratically with model size
- Become computationally intractable at the scale of billions of parameters
2.3 Naive Magnitude Pruning Doesn’t Work Well
The simplest pruning approach—removing weights with the smallest absolute values—works surprisingly well for many smaller networks. However, for LLMs, it falls apart:
At 50% sparsity using naive magnitude pruning:
- Small CNN (CIFAR-10): ~1% accuracy drop
- LLaMA-7B: Perplexity increases from ~7 to ~17-20 (catastrophic degradation)
This failure points to a fundamental issue: in LLMs, a weight’s absolute value alone doesn’t fully capture its importance.
3. The Wanda Pruning Method: Simple Yet Powerful
The key insight behind Wanda comes from a critical observation about LLMs: they exhibit extremely uneven activation patterns across hidden dimensions.
3.1 The Outlier Feature Phenomenon
Research has shown that in transformer-based LLMs, some hidden features can be 100× larger in magnitude than others. This “outlier feature” phenomenon means:
- A small weight connected to a frequently-activated dimension can be crucial
- A large weight connected to a rarely-activated dimension might be expendable
Magnitude pruning misses this nuance entirely, which explains its poor performance.

3.2 The Wanda Importance Metric
Wanda introduces a remarkably simple metric that accounts for both weight magnitude and activation patterns:
Importance(W_ij) = |W_ij| × ‖X_j‖₂
Where:
- W_ij is the weight connecting the j-th input feature to the i-th output
- X_j represents all activations in the j-th input dimension over a calibration batch
- ‖X_j‖₂ is the L2 norm of those activations (capturing how “active” that dimension is)
This elegantly combines two crucial factors: how large the weight is (|W_ij|) and how important its input dimension typically is (‖X_j‖₂).
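To make the metric concrete, here is a minimal PyTorch sketch (my own illustration, not the paper’s reference code) that scores one linear layer, assuming `weight` has shape (out_features, in_features) and `calib_acts` holds that layer’s calibration inputs with shape (num_tokens, in_features):

```python
import torch

def wanda_scores(weight: torch.Tensor, calib_acts: torch.Tensor) -> torch.Tensor:
    """Wanda importance scores |W_ij| * ||X_j||_2 for a single linear layer.

    weight:     (out_features, in_features) weight matrix
    calib_acts: (num_tokens, in_features) inputs to this layer from calibration data
    """
    # L2 norm of each input dimension's activations across all calibration tokens
    act_norms = calib_acts.norm(p=2, dim=0)          # shape: (in_features,)
    # Broadcast the per-input norms across every output row of the weight matrix
    return weight.abs() * act_norms.unsqueeze(0)     # shape: (out_features, in_features)
```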

3.3 The Complete Wanda Algorithm
Implementing Wanda requires just four straightforward steps:
1. Gather a small calibration dataset (e.g., 128 diverse text sequences)
2. Run a forward pass through the model to compute activation norms for each feature
3. Calculate the importance score for each weight using the formula above
4. Prune per output row: Remove the lowest-scoring weights within each output neuron’s connections, rather than globally
This row-wise pruning ensures balanced sparsification across all output features, which empirically works better than pruning globally or layer-wise.
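Steps 3 and 4 fit in a few lines. The sketch below is a simplified reading under the same assumptions as the scoring helper above, not the official implementation; it prunes one layer row by row to a target sparsity:

```python
def prune_layer_rowwise(weight: torch.Tensor,
                        calib_acts: torch.Tensor,
                        sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-scoring weights within each output row, in place.

    Assumes gradients are not being tracked (e.g., called under torch.no_grad()).
    """
    scores = wanda_scores(weight, calib_acts)
    num_prune = int(weight.shape[1] * sparsity)               # weights removed per row
    # Indices of the num_prune smallest scores in every row
    prune_idx = torch.topk(scores, num_prune, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)                        # False marks a pruned position
    weight.mul_(mask)                                         # pruned weights become exactly zero
    return weight
```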

3.4 No Fine-tuning Required
Perhaps the most remarkable aspect of Wanda is what it doesn’t do: there’s no retraining, no iterative updates, and no complex weight adjustment. Pruned weights are simply set to zero while everything else remains untouched.
4. Experimental Results: How Well Does It Work?
The authors evaluated Wanda across multiple LLaMA variants (7B to 70B parameters) on standard benchmarks:
4.1 Zero-Shot Task Performance
At 50% sparsity (half the weights removed):
| Model | Magnitude Pruning | Wanda | Original Dense |
|---|---|---|---|
| LLaMA-7B | Severe degradation | 92% of original performance | 100% |
| LLaMA-13B | Severe degradation | 95% of original performance | 100% |
4.2 Language Modeling Perplexity
Perplexity measures how well a language model predicts text (lower is better):
| Model | Original | Magnitude Pruning (50%) | Wanda (50%) | SparseGPT (50%) |
|---|---|---|---|---|
| LLaMA-7B | ~7 | ~17-20 | ~7-8 | ~7-8 |
| LLaMA-13B | ~6 | ~15-18 | ~6-7 | ~6-7 |
Wanda nearly matches the more complex SparseGPT approach while being significantly faster to execute.
4.3 Structured Sparsity Results
Beyond unstructured pruning, Wanda can be adapted to enforce hardware-friendly patterns like N:M sparsity (where N out of every M consecutive weights are kept):
- 2:4 sparsity pattern: Wanda maintains performance within 98% of its unstructured result, and this pattern is directly supported by NVIDIA’s sparse tensor cores
- 4:8 sparsity pattern: Similar results, offering a slightly less restrictive structured option
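The same scores can drive an N:M pattern with a small change to the selection step. Below is a rough sketch for a 2:4-style constraint, assuming the input dimension is divisible by m; it reuses the hypothetical helpers above and is illustrative rather than the paper’s code:

```python
def prune_layer_n_m(weight: torch.Tensor,
                    calib_acts: torch.Tensor,
                    n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n highest-scoring weights in every group of m consecutive weights per row."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"
    # Score, then view each row as groups of m consecutive weights
    scores = wanda_scores(weight, calib_acts).view(out_features, in_features // m, m)
    keep_idx = torch.topk(scores, n, dim=-1).indices          # top-n within each group
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    weight.mul_(mask.view(out_features, in_features))
    return weight
```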

4.4 Scaling with Model Size
An interesting finding is that larger models tolerate pruning better:
- LLaMA-7B: Shows noticeable degradation at 60%+ sparsity
- LLaMA-70B: Maintains strong performance even at 60-70% sparsity
This suggests that as models grow, they become increasingly overparameterized and amenable to pruning.

5. Practical Implications and Implementation
5.1 Computational Efficiency
Wanda’s simplicity translates to dramatic speed advantages:
- 100-300× faster than methods requiring weight reconstruction
- Pruning LLaMA-7B takes minutes rather than hours or days
- Memory overhead is minimal (just storing activation norms)
5.2 Implementation Details
Implementing Wanda requires very little code:
- Run a forward pass with a small batch through the model
- Compute L2 norms of each input dimension’s activations
- For each weight matrix, calculate importance scores
- Set the bottom 50% (or target sparsity) of weights to zero, row by row
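The per-layer activations are typically captured with forward hooks during the calibration pass. The sketch below shows one possible wiring for every `nn.Linear` in a model, using a single calibration batch for brevity (the paper uses 128 sequences); the structure and names are my own, not the released code:

```python
import torch.nn as nn

@torch.no_grad()
def prune_model(model: nn.Module, calib_batch: torch.Tensor, sparsity: float = 0.5):
    """Prune every nn.Linear in `model` with Wanda scores from one calibration batch."""
    captured, hooks = {}, []

    def make_hook(module):
        def hook(mod, inputs, output):
            # Flatten (batch, seq_len, features) -> (num_tokens, features) and stash it
            x = inputs[0].detach()
            captured[module] = x.reshape(-1, x.shape[-1])
        return hook

    for module in model.modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(module)))

    model(calib_batch)                  # one forward pass over the calibration data
    for h in hooks:
        h.remove()

    for module, acts in captured.items():
        prune_layer_rowwise(module.weight.data, acts, sparsity)
```

In a real pipeline you would stream several calibration batches and accumulate the squared activation norms instead of storing raw activations, which keeps the memory overhead to a single vector per layer.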

5.3 Practical Usage Scenarios
Wanda enables several practical use cases:
- Deployment on resource-constrained devices: Pruned models require less memory
- Faster inference: Properly implemented sparse operations can reduce computation time
- Reduced cloud computing costs: Less memory usage means cheaper deployment
- Foundation for further compression: Wanda can be combined with quantization for even smaller models
6. Limitations and Future Directions
While Wanda represents a significant advancement, it has limitations:
6.1 Current Limitations
- High sparsity degradation: At very high sparsities (70-80%+), performance still degrades substantially
- Calibration data requirement: Though minimal, you need representative text samples
- Implementation challenges: Efficiently leveraging sparsity requires specialized sparse matrix libraries
6.2 Promising Research Directions
Looking ahead, several promising directions emerge:
- Light post-pruning fine-tuning: Even brief fine-tuning after Wanda pruning might recover more performance
- Pruning + quantization pipelines: Combining both techniques could yield extremely efficient models
- Hardware-aware pruning patterns: Further adapting Wanda to specific hardware accelerators
- Automated calibration set selection: Developing methods to identify the most informative calibration samples
- Activation pruning: Extending the approach to prune entire feature dimensions

7. Conclusion: A Practical Path to Slimmer LLMs
Wanda represents a rare innovation: a method that’s both simpler and more effective than its predecessors. By recognizing the uneven activation patterns in LLMs and incorporating this insight into a straightforward pruning metric, it achieves what complex methods do—at a fraction of the computational cost.
For practitioners and researchers, Wanda opens new possibilities for deploying LLMs in resource-constrained environments. A 50% reduction in model size with minimal performance impact could be the difference between a model that fits in memory and one that doesn’t, or between a responsive user experience and a laggy one.
As LLMs continue growing in size and importance, techniques like Wanda will be crucial in making these powerful models more accessible, efficient, and environmentally sustainable. The most elegant solutions are often the simplest, and Wanda proves this principle holds even in the complex world of billion-parameter language models.