Mastering GPU Memory Constraints for Large-Scale Model Training
1. Introduction
Deep neural networks—especially modern Transformers—can contain billions of parameters, making them extremely demanding on GPU memory. Even on high-end hardware like NVIDIA A100 (80GB) or H100 GPUs, memory quickly becomes a bottleneck when dealing with large models and the substantial batch sizes often required for stable and efficient training.
To put this in perspective:
– GPT-3 (175B parameters) needs roughly 700GB just to store its parameters in FP32; adding gradients and Adam optimizer states during training pushes this well past 2TB
– A model like BERT-Large (340M parameters) needs only about 1.4GB for parameters, but training with AdamW can push total memory usage well above 10GB once gradients, optimizer states, and activations are included
A variety of strategies can mitigate these memory constraints:
- Gradient Accumulation – Effectively increase batch size without loading massive batches at once
- Low Memory Optimization (LOMO) – Reduce or eliminate the overhead of adaptive optimizers like Adam during fine-tuning
- Mixed Precision Training – Use lower precision (FP16/BF16) to reduce the memory footprint
- Activation Checkpointing – Trade computation for memory by recomputing activations in the backward pass
This article explores the first two techniques in detail, explaining how they tackle different aspects of GPU memory usage and providing implementations for practitioners seeking to train or fine-tune large models with limited computational resources.
2. Sources of GPU Memory Usage
Understanding where memory is consumed during training is crucial for effective optimization. Let’s break down the four major sources of GPU memory usage:
2.1 Parameters
- Model Weights: For a Transformer, these include embeddings, multi-head attention layers, feed-forward networks, and more.
- Large-scale architectures can reach hundreds of millions (or billions) of parameters. For example:
- BERT-Large: 340M parameters (~1.3GB in FP32)
- GPT-2 XL: 1.5B parameters (~6GB in FP32)
- Llama 2 70B: 70B parameters (~280GB in FP32)
2.2 Gradients
- During backpropagation, a separate gradient tensor of the same size as the model weights must be stored (unless you fuse or reduce them on-the-fly).
- For each parameter θ there is a corresponding gradient ∂L/∂θ, effectively doubling the memory overhead for each trainable parameter.
- In a typical training run of BERT-Large, gradients alone would require another 1.3GB of GPU memory.
2.3 Optimizer States
Different optimizers have varying memory requirements:
- Adam and AdamW keep two extra tensors (first-moment and second-moment estimates) for each parameter. These states add roughly twice the parameter memory on top of parameters and gradients, so parameters + gradients + optimizer states come to about 4x the parameter memory.
- For a 1B parameter model, Adam requires approximately:
- 4GB for parameters (FP32)
- 4GB for gradients (FP32)
- 8GB for optimizer states (FP32)
- Total: ~16GB just for these components
- SGD with momentum requires storing an additional momentum tensor (1x parameter size). Plain SGD has no additional state.
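To make these numbers concrete, the following back-of-the-envelope helper (a sketch, not a profiler; it assumes FP32 values throughout and ignores activations, CUDA context, and fragmentation) reproduces the arithmetic above:

def estimate_training_memory_gb(num_params, optimizer="adamw", bytes_per_value=4):
    """Rough parameter/gradient/optimizer-state memory in GB (a lower bound)."""
    # Extra state tensors per parameter: Adam/AdamW keep two (momentum and
    # variance), SGD with momentum keeps one, plain SGD keeps none
    state_tensors = {"adamw": 2, "adam": 2, "sgd_momentum": 1, "sgd": 0}[optimizer]

    params_gb = num_params * bytes_per_value / 1e9
    grads_gb = params_gb                      # one gradient value per parameter
    states_gb = state_tensors * params_gb     # optimizer buffers

    return params_gb + grads_gb + states_gb

# Example: 1B parameters with AdamW in FP32 -> 4 + 4 + 8 = ~16GB
print(estimate_training_memory_gb(1_000_000_000, "adamw"))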
2.4 Activations
- Forward activations are required to compute gradients in the backward pass. Their memory requirement grows with:
- Batch size
- Sequence length (in NLP) or resolution (in vision tasks)
- Model depth
- For large Transformer models, activations can easily become the dominant memory consumer, especially with long sequences.
- Example: Training a BERT-Large model on a sequence length of 512 with batch size 32 can require 30-45GB just for activations.
- These grow linearly with batch size and often quadratically with sequence length in attention-based models (unless memory-efficient attention kernels are used).
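To illustrate just the scaling behavior, here is a deliberately minimal estimator (an assumption-laden sketch: it counts only per-layer hidden states and attention score matrices of a standard Transformer, so real frameworks will allocate several times more than it reports):

def rough_activation_scaling_gb(batch, seq_len, hidden, layers, heads,
                                bytes_per_value=4):
    """Illustrates how activation memory scales with batch size and sequence length.

    Only two dominant tensors per layer are counted: hidden states (linear in
    seq_len) and attention score matrices (quadratic in seq_len). Treat the
    result as a loose lower bound for standard attention implementations.
    """
    hidden_states = batch * seq_len * hidden               # per layer
    attention_scores = batch * heads * seq_len * seq_len   # per layer
    return layers * (hidden_states + attention_scores) * bytes_per_value / 1e9

# Doubling the batch doubles the estimate; doubling the sequence length
# roughly quadruples the attention term
print(rough_activation_scaling_gb(batch=32, seq_len=512, hidden=1024, layers=24, heads=16))
print(rough_activation_scaling_gb(batch=32, seq_len=1024, hidden=1024, layers=24, heads=16))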
2.5 Memory Usage Visualization
The following diagram visualizes the relative memory consumption of different components when training a large language model with Adam optimizer:

3. Gradient Accumulation: Virtual Large Batches
3.1 How It Works
In a standard training loop, you load a batch of examples, compute gradients, and update the model parameters immediately. Gradient Accumulation defers the parameter update for several mini-batches:
- Split the effective batch of size B into K smaller micro-batches.
- For each micro-batch, compute gradients and accumulate them in memory (summing them up).
- After processing all K micro-batches, apply one parameter update based on the accumulated gradient.
This approach allows you to simulate training with a large batch size while keeping peak memory usage low.
3.2 Benefits
- Simulates a Larger Batch: Stabilizes training in many scenarios where large batch sizes are helpful (e.g., when using the LAMB optimizer or for large-scale pretraining).
- Maintains Training Dynamics: Produces nearly identical results to training with the equivalent large batch, since the accumulated gradient (with each micro-batch loss scaled by the number of accumulation steps) matches the full-batch gradient; layers that use batch statistics, such as BatchNorm, are the main exception.
- No Major Peak Memory Increase: Since each micro-batch is smaller, your memory usage for that step is limited to micro-batch size, not the full effective batch.
- Multi-GPU Compatible: Works well with distributed training setups, especially when some nodes have less memory than others.
3.3 Example Implementation
Here’s a more detailed PyTorch implementation:
import torch

def train_with_gradient_accumulation(model, optimizer, dataloader, loss_fn,
                                     accum_steps=4, device="cuda"):
    model.train()

    # Running values
    running_loss = 0.0
    global_step = 0

    # Clear any stale gradients
    optimizer.zero_grad()

    for step, (inputs, targets) in enumerate(dataloader):
        # Move inputs to device
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Forward pass
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        # Normalize the loss so the accumulated gradient matches the
        # average over the effective batch
        loss = loss / accum_steps

        # Backward pass (gradients accumulate in param.grad)
        loss.backward()

        # Track the normalized loss
        running_loss += loss.item()

        # Update weights after accumulating gradients for accum_steps micro-batches
        if (step + 1) % accum_steps == 0:
            # Apply gradient clipping (optional but common practice)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            # Update parameters
            optimizer.step()

            # Clear gradients for the next accumulation window
            optimizer.zero_grad()

            # Log progress
            global_step += 1
            if global_step % 100 == 0:
                print(f"Step {global_step}: avg loss = {running_loss / 100:.4f}")
                running_loss = 0.0
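A minimal usage sketch (the toy model, loss, and synthetic data below are placeholders for illustration, not part of the original example):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy classifier and synthetic data standing in for a real model and dataset
model = nn.Linear(128, 10).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

# Effective batch size = 16 (micro-batch) x 4 (accumulation steps) = 64
train_with_gradient_accumulation(model, optimizer, dataloader, loss_fn,
                                 accum_steps=4, device=device)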
3.3.1 Gradient Accumulation Visualization
The following diagram illustrates how gradient accumulation works to simulate a larger batch size while maintaining lower memory usage:

3.4 Calculating Memory Savings
Let’s calculate the memory savings using gradient accumulation with a concrete example:
Assume we’re training a 1B parameter model with AdamW:
– Target effective batch size: 128
– Available GPU memory: 24GB
– Fixed cost for parameters, gradients, and optimizer states (FP32): ~16GB
– Activation footprint per example: 0.4GB
Without gradient accumulation:
– Activation memory for batch size 128: 128 × 0.4GB = 51.2GB ❌ (exceeds available memory even before the ~16GB of fixed state)
With gradient accumulation (micro-batch size 16):
– Activation memory per micro-batch: 16 × 0.4GB = 6.4GB ✓
– Accumulation steps: 8 (to reach the effective batch size of 128)
– Total usage: ~16GB fixed + 6.4GB activations ≈ 22.4GB, within the 24GB limit
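The same arithmetic can be wrapped in a small planning helper (a sketch; the per-example activation footprint and the fixed overhead are values you would measure or estimate for your own model):

import math

def plan_gradient_accumulation(target_batch, per_example_gb, fixed_gb, budget_gb):
    """Choose a micro-batch size and accumulation steps for a memory budget.

    fixed_gb covers parameters, gradients, and optimizer states;
    per_example_gb is the activation footprint of a single example.
    """
    micro_batch = int((budget_gb - fixed_gb) // per_example_gb)
    micro_batch = max(1, min(micro_batch, target_batch))
    # Round down to a power of two, a common choice for kernel efficiency
    micro_batch = 2 ** int(math.log2(micro_batch))
    accum_steps = math.ceil(target_batch / micro_batch)
    return micro_batch, accum_steps

# The example above: ~16GB fixed state, 0.4GB per example, 24GB GPU,
# effective batch 128 -> micro-batch 16 with 8 accumulation steps
print(plan_gradient_accumulation(128, 0.4, 16.0, 24.0))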
4. Low Memory Optimization (LOMO)
4.1 Motivation and Background
Large-scale fine-tuning is especially memory-intensive because:
- Many tasks require moderately large batch sizes for stable adaptation
- Common optimizers like Adam maintain additional buffers (momentum and variance) that consume significant memory
- Many tasks benefit from updating all of the pretrained model’s parameters, which limits how far parameter-efficient techniques like LoRA (Low-Rank Adaptation) can go on their own
LOMO (LOw-Memory Optimization) was introduced by Lv et al. in “Full Parameter Fine-tuning for Large Language Models with Limited Resources”. It proposes that for fine-tuning, where the model is already pretrained, momentum-based or adaptive methods may be less critical. The approach uses plain SGD (which keeps no momentum or variance state) and “fuses” gradient computation with the parameter update into a single step, drastically cutting down memory usage.
4.2 Core Principles
- Eliminate Optimizer States:
- Using plain SGD instead of Adam or AdamW avoids storing momentum and second-moment buffers
- This can reduce memory requirements by 2-3x compared to Adam-based approaches
- Works particularly well for fine-tuning scenarios where models are already close to optimal parameters
- Fused Gradient Computation and Updates:
- Gradients are computed and immediately used to update the parameters, minimizing the time those gradients remain in memory
- This approach releases memory as soon as a layer’s parameters are updated
- The backward pass and parameter updates happen in a single pass through the model
- Fine-Tuning Assumption:
- Since the model is starting from a pretrained checkpoint, being stuck in saddle points (a common concern with plain SGD) is less likely
- LOMO can converge well in a relatively small number of steps for many downstream tasks
- Learning rate schedules can help compensate for the lack of adaptive behavior
4.2.1 LOMO Process Visualization
The following diagram illustrates how LOMO differs from standard optimization approaches by eliminating optimizer states and immediately applying updates:

The key memory advantage comes from the immediate parameter updates and gradient discarding in LOMO, which prevents the simultaneous storage of all gradients and optimizer states.
4.3 Memory Comparison
Let’s examine the memory requirements for training a 7B parameter model like Llama 2 7B:
Optimizer | Parameters | Gradients | Optimizer States | Activations | Total |
---|---|---|---|---|---|
AdamW | 28GB | 28GB | 56GB | 40GB | 152GB |
SGD w/ Momentum | 28GB | 28GB | 28GB | 40GB | 124GB |
LOMO | 28GB | ~0.5GB | 0GB | 40GB | ~68.5GB |
Note: Values assume FP32 precision for all components. Mixed precision would reduce these requirements further.
The memory savings from LOMO come primarily from:
1. Not storing optimizer states (momentum/variance)
2. Releasing gradient memory immediately after each layer update
4.4 Implementation Example
Below is a simplified implementation of the LOMO approach:
import torch

def lomo_training_step(model, inputs, targets, loss_fn, learning_rate=1e-5):
    # Forward pass
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)

    # Note: this is a simplified illustration. It still materializes all
    # gradients before updating, whereas a faithful LOMO implementation
    # applies each parameter's update inside the backward pass (e.g. via
    # gradient hooks) so that gradients for every layer never coexist.
    loss.backward()

    # Plain SGD update with no optimizer state; free each gradient as soon
    # as it has been applied
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                param.add_(param.grad, alpha=-learning_rate)
                param.grad = None  # free gradient memory immediately

    return loss.item()
Important Note: The LOMO implementation above is highly simplified for educational purposes. A production implementation hooks into PyTorch’s autograd engine (for example with per-parameter backward hooks) so that each update is fused into the backward pass itself. The paper authors provide their implementation at https://github.com/OpenLMLab/LOMO.
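For readers who want to see the fusion idea in pure PyTorch, here is a hedged sketch (not the authors’ code) built on per-parameter hooks; it assumes PyTorch 2.1+ for `register_post_accumulate_grad_hook` and that parameters are not reused later in the backward graph (e.g. via weight tying):

import torch

def attach_fused_sgd_hooks(model, learning_rate=1e-5):
    """Apply a plain SGD update inside the backward pass, LOMO-style.

    Each hook fires right after its parameter's gradient has been
    accumulated, updates the parameter in place, and frees the gradient,
    so gradients for all layers never need to coexist in memory.
    """
    def make_hook(lr):
        def hook(param):
            with torch.no_grad():
                param.add_(param.grad, alpha=-lr)
            param.grad = None  # release this gradient immediately
        return hook

    for param in model.parameters():
        if param.requires_grad:
            param.register_post_accumulate_grad_hook(make_hook(learning_rate))

# After attaching the hooks once, a training step is just forward + backward;
# there is no optimizer object and no optimizer.step():
#   loss = loss_fn(model(inputs), targets)
#   loss.backward()  # parameter updates happen inside this call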
4.5 Limitations and Considerations
While LOMO provides significant memory benefits, it has some limitations:
- Training Stability: Plain SGD can be less stable than adaptive methods like Adam for some tasks
- Convergence Speed: May require more iterations to reach the same performance as Adam-based methods
- Implementation Complexity: Properly implementing fused operations requires careful handling of PyTorch’s autograd system
- Task Dependency: Works best for fine-tuning tasks; less suitable for training from scratch
5. Practical Considerations and Combined Strategies
5.1 Combining Techniques for Maximum Efficiency
For the most challenging scenarios, combining multiple memory optimization approaches yields the best results:
- Gradient Accumulation + LOMO
- Use gradient accumulation to handle a larger effective batch size
- Apply LOMO principles to minimize memory for each micro-batch
- Example: full-parameter fine-tuning of models in the 65-70B range has been demonstrated with this kind of combination on machines equipped with multiple consumer GPUs
- Mixed Precision + Gradient Accumulation
- Using FP16/BF16 immediately halves memory requirements
- Combined with gradient accumulation, can enable much larger batch sizes
- Implementation example using PyTorch’s automatic mixed precision:
import torch
from torch.cuda.amp import autocast, GradScaler

# Initialize the gradient scaler for mixed precision
scaler = GradScaler()

def train_with_mixed_precision_and_grad_accum(model, optimizer, dataloader, loss_fn,
                                              accum_steps=4, device="cuda"):
    model.train()
    optimizer.zero_grad()

    for step, (inputs, targets) in enumerate(dataloader):
        inputs = inputs.to(device)
        targets = targets.to(device)

        # Mixed precision forward pass
        with autocast():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets) / accum_steps

        # Scale the loss and backpropagate (scaled gradients accumulate)
        scaler.scale(loss).backward()

        if (step + 1) % accum_steps == 0:
            # Unscale gradients so clipping operates on their true values
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

            # Optimizer step (skipped automatically if gradients overflowed)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
- Activation Checkpointing + LOMO
- Activation checkpointing trades computation for memory by recomputing activations during backprop
- When combined with LOMO, enables training of extremely large models
- Particularly effective for Transformer-based architectures with many layers (see the sketch after this list)
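A minimal sketch of activation checkpointing on a stack of Transformer-style blocks, using `torch.utils.checkpoint` (the encoder class and its layer choices are illustrative placeholders; checkpointing every other block mirrors the vision case study below):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedEncoder(nn.Module):
    """Toy encoder that recomputes every other block's activations."""

    def __init__(self, num_layers=12, dim=512):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if i % 2 == 0:
                # This block's activations are discarded after the forward
                # pass and recomputed during backward: trading compute for memory
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

# Usage: works with any of the training loops above, e.g.
#   model = CheckpointedEncoder().to("cuda")
#   loss = model(torch.randn(8, 128, 512, device="cuda")).mean()
#   loss.backward()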
5.1.1 Combined Strategies Visualization
The following diagram illustrates how different memory optimization techniques can be combined for maximum efficiency:

This progressive approach allows you to apply increasingly aggressive memory optimization techniques only as needed, maintaining training efficiency when possible.
5.2 Real-World Case Studies
Case Study 1: Fine-tuning 7B Models on Consumer GPUs
– Hardware: Single NVIDIA RTX 3090 (24GB)
– Model: Llama 2 7B
– Task: Fine-tuning for instruction following
– Solution:
– Mixed Precision (BF16)
– Gradient Accumulation (micro-batch 1, accumulation steps 32)
– LOMO optimizer approach
– Achieved: Successful fine-tuning with 15GB peak memory usage
Case Study 2: Training Large Vision Transformers
– Hardware: NVIDIA A100 (40GB)
– Model: Vision Transformer (ViT) with 2B parameters
– Challenge: High resolution images with large feature maps
– Solution:
– Gradient accumulation (micro-batch 2, accumulation steps 16)
– Activation checkpointing on every other Transformer block
– Achieved: 3x larger model than possible with standard training
5.3 Trade-offs and Decision Matrix
Technique | Memory Reduction | Training Speed Impact | Implementation Complexity | Best For |
---|---|---|---|---|
Gradient Accumulation | Medium | Minimal | Low | Large batch requirements |
LOMO | High | Medium | Medium-High | Fine-tuning large models |
Mixed Precision | Medium | Positive (faster) | Low | All training scenarios |
Activation Checkpointing | High | Negative (slower) | Low | Very deep networks |
All Combined | Very High | Medium | High | Extreme cases (70B+ models) |
6. Emerging Techniques and Future Directions
While gradient accumulation and LOMO provide substantial memory benefits, the field of efficient large model training continues to evolve:
6.1 Parameter-Efficient Fine-Tuning (PEFT)
Approaches like LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) significantly reduce memory requirements by:
– Freezing most of the pretrained model parameters
– Adding small trainable adapters or low-rank matrices
– Combining these with techniques like LOMO and gradient accumulation for even greater efficiency
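To give a flavor of the low-rank-adapter idea, here is a minimal LoRA-style wrapper sketched from the general recipe (this is not the `peft` library’s implementation; the class name, rank, and scaling are illustrative assumptions):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a small trainable low-rank update."""

    def __init__(self, base_linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Only these two small factors are trained: rank * (in + out) values
        self.lora_a = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base_linear.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W x + scaling * B A x, with W frozen and A, B trainable
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Example (hypothetical): wrap one projection of a pretrained model and
# fine-tune only the LoRA factors, optionally with gradient accumulation
#   layer = LoRALinear(pretrained_model.some_linear)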
6.2 Model Sharding and Offloading
- ZeRO (Zero Redundancy Optimizer): Distributes optimizer states, gradients, and parameters across multiple GPUs
- CPU Offloading: Temporarily moves optimizer states or even model parameters to CPU RAM when not in use
- These approaches can be complementary to the techniques discussed in this article
6.3 Specialized Hardware Considerations
Different hardware platforms benefit from different optimization strategies:
– Multi-GPU Setups: Favor data parallelism combined with gradient accumulation
– CPU+GPU Hybrid: Can leverage CPU offloading with LOMO
– Apple Silicon (M1/M2): Benefits greatly from unified memory architecture with custom mixed precision approaches
7. Conclusion
GPU memory constraints are a fundamental challenge in training large-scale deep learning models, especially when dealing with Transformers that can have billions of parameters. By understanding the sources of memory consumption and applying targeted optimization techniques, practitioners can overcome these limitations.
Strategies like Gradient Accumulation effectively build large batches across multiple micro-batches, alleviating memory pressure at each step. For fine-tuning massive pretrained models, LOMO (Low Memory Optimization) can dramatically cut memory usage by eliminating optimizer states and fusing gradient computation with parameter updates.
In practice, achieving the best results often involves a combination of these techniques—plus complementary methods like mixed precision and activation checkpointing—to strike the right balance of memory, speed, and performance. As deep learning architectures continue to scale beyond hundreds of billions of parameters, mastering these optimization strategies will remain essential for efficient and cost-effective model training.
The democratization of large language model training and fine-tuning depends on such memory-efficient approaches, enabling researchers and developers with limited computational resources to work with models that would otherwise be inaccessible. As hardware capabilities continue to evolve, so too will these software-based optimization techniques, pushing the boundaries of what’s possible in deep learning.