
August 12, 2023

Large-Scale AI Infrastructure on Kubernetes: Scaling Training for Modern LLMs

Introduction: The Infrastructure Challenge of Massive AI Models

As AI model parameters have increased exponentially—from millions to billions and now trillions—the infrastructure requirements for training have grown dramatically. Modern large language models (LLMs) like GPT-4 and PaLM 2 require enormous computational resources, often demanding thousands of high-end GPUs working in concert for weeks or months.

This article examines the architectural patterns and engineering considerations behind the Kubernetes-based infrastructure that powers today’s largest AI training operations. We’ll explore why certain design choices are made, technical limitations that must be overcome, and emerging solutions that are reshaping how organizations approach large-scale AI training.

[Figure: Growth of LLM scale]

Kubernetes as the Foundation for AI Infrastructure

Why Kubernetes Has Become the Standard for Large-Scale Training

Kubernetes was originally designed as a container orchestration platform for microservices, but it has evolved into the de facto standard for managing large-scale machine learning workloads. Several factors drive this adoption:

  1. Declarative Resource Management
    Kubernetes allows organizations to express their intent through manifests that define exactly what computational resources they need, how those resources should be configured, and how they should be connected. A minimal example follows this list.
  2. Abstraction Over Hardware Heterogeneity
    AI training clusters often combine different generations of GPUs, various CPU architectures, and specialized accelerators. Kubernetes abstracts these hardware differences, presenting a unified resource pool to workloads.
  3. Automation and Self-Healing
    Training jobs that run for weeks need robust failure handling. Kubernetes’ self-healing capabilities help maintain job continuity despite inevitable hardware failures in large clusters.
  4. Ecosystem and Extensibility
    The rich ecosystem of tools built around Kubernetes enables specialized functionality for AI workloads—custom schedulers, monitoring stacks, and training-specific operators extend the platform’s capabilities.
  5. Standardization Across Environments
    The same Kubernetes APIs work in on-premises data centers, public clouds, and hybrid environments, enabling consistent workflows regardless of where training occurs.
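
As a minimal illustration of declarative resource management, the sketch below shows the kind of node-sized pod manifest a training team might write. The image, instance type, and resource counts are illustrative placeholders rather than recommendations:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-worker                  # illustrative name
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 8                    # claim every GPU on the node (exposed by the NVIDIA device plugin)
        cpu: "96"
        memory: 1024Gi
  nodeSelector:
    node.kubernetes.io/instance-type: p4d.24xlarge   # example: pin to a specific 8-GPU instance type
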
[Figure: Kubernetes components for AI workloads]

Historical context is important: when OpenAI trained GPT-3 in 2020, they reportedly utilized a Microsoft Azure-powered Kubernetes cluster with around 10,000 NVIDIA V100 GPUs. Today’s frontier models like GPT-4 or Anthropic’s Claude are trained on even larger infrastructures, with estimates suggesting tens of thousands of the latest GPU accelerators.

The Single-Pod-Per-Node Pattern: Maximizing Hardware Utilization

Why Dense Packing Doesn’t Work for AI Workloads

One of the first surprises for engineers coming from traditional Kubernetes deployments is the prevalence of the single-pod-per-node pattern for large-scale training. This approach seems to contradict the standard practice of maximizing node density through multiple pods per node, but it offers critical advantages for AI workloads:

[Figure: AI training on Kubernetes]

Uncompromised Hardware Access

  1. GPU Allocation and CUDA Visibility
    Deep learning frameworks like PyTorch and TensorFlow perform best with direct, exclusive access to GPUs. A single pod with visibility into all GPUs on a node can dynamically distribute work across them, rather than having the node's GPUs partitioned across several competing pods.
  2. GPU-to-GPU Communication
    Modern GPU servers contain specialized hardware interconnects like NVIDIA NVLink, which enables GPUs to communicate directly with each other at up to 600 GB/s—10-20x faster than PCIe. These interconnects work optimally when a single process orchestrates the communication, which is easier to achieve within a single pod.
  3. Memory Management and NUMA Considerations
    Large servers often have Non-Uniform Memory Access (NUMA) architectures, where memory access speed depends on which CPU socket issues the request. A single pod can be combined with NUMA-aware kubelet settings to optimize memory and device placement for training workloads (see the kubelet configuration sketch after this list).
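
The NUMA and exclusive-access points above are usually addressed with node-level kubelet settings rather than pod settings alone. The fragment below is a hedged sketch of a common combination; the reserved values are illustrative, and the alignment only applies to pods in the Guaranteed QoS class with integer CPU requests:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static                  # give Guaranteed pods exclusive CPU cores
topologyManagerPolicy: single-numa-node   # co-locate CPUs, GPUs, and NICs on one NUMA node when possible
systemReserved:
  cpu: "4"                                # keep a few cores for the OS and kubelet (illustrative values)
  memory: 16Gi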

Networking Simplifications

  1. Network Topology Awareness
    Distributed training requires frequent synchronization between nodes. By using one pod per node, the training job can map the pod topology directly to the physical network topology, enabling topology-aware communication patterns that significantly reduce synchronization overhead.
  2. Host Networking Performance
    The single-pod model often uses Kubernetes’ hostNetwork: true setting, bypassing virtual networking layers to achieve bare-metal network performance—critical for RDMA (Remote Direct Memory Access) and other low-latency communication protocols.

Operational Advantages

  1. Simplified Resource Accounting
    With one pod per node, resource accounting becomes straightforward: the pod gets all the node’s resources. This eliminates complex resource allocation negotiations and potential resource fragmentation.
  2. Reduced Scheduler Pressure
    The Kubernetes scheduler has less work to do when placing large, node-sized pods, as it doesn’t need to solve the bin-packing problem of fitting many small pods on nodes.

Semi-Stateful Workloads: Checkpointing as a Resilience Strategy

Training a large language model is neither completely stateless (like a web server) nor fully stateful (like a database). Instead, it represents a “semi-stateful” workload where computation progresses through stages, with periodic persistence of state to enable recovery.

The Role of Checkpoints in Resilient Training

  1. Checkpoint Frequency Tradeoffs
    Checkpointing too frequently wastes computation and storage I/O; too infrequently risks losing more work when failures occur. Modern training infrastructures adaptively adjust checkpoint frequency based on observed failure rates and job progress.
  2. Distributed Checkpointing Strategies
    For models with billions of parameters, writing checkpoints can become a bottleneck. Advanced distributed checkpointing techniques include:
    • Sharded checkpoints (each node saves a portion of the model)
    • Asynchronous checkpointing (background saving while training continues)
    • Incremental checkpoints (saving only parameters that changed significantly)
  3. Checkpoint Storage Hierarchies
    A typical setup employs a multi-tier strategy (a sketch of the promotion step between tiers follows below):
    • High-frequency, temporary checkpoints to local NVMe storage
    • Medium-frequency checkpoints to high-speed network storage (e.g., Lustre, GPFS)
    • Low-frequency, durable checkpoints to object storage (S3, GCS, Azure Blob)
[Figure: Checkpoint storage hierarchy]
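
One common way to implement the promotion step between tiers is a small sidecar container that periodically syncs the node-local checkpoint directory to object storage. The fragment below is a sketch under assumed names: the bucket, run prefix, interval, and volume name are placeholders, and credentials are expected to come from the cluster's usual mechanism (an IAM role or a mounted secret):

# Sidecar fragment: promotes locally written checkpoints to durable object storage.
- name: checkpoint-uploader
  image: amazon/aws-cli:latest              # any image with an S3-compatible CLI works here
  command: ["/bin/sh", "-c"]
  args:
    - |
      while true; do
        aws s3 sync /local-checkpoints s3://example-llm-checkpoints/run-001/ --exclude "*.tmp"
        sleep 600   # promote local checkpoints roughly every 10 minutes
      done
  volumeMounts:
  - name: local-checkpoints                 # shared with the training container (hostPath or emptyDir)
    mountPath: /local-checkpoints
    readOnly: true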

Kubernetes Job Controllers for Training Management

MPI-based job controllers like the Kubeflow MPI Operator have emerged to handle the specific needs of distributed training jobs. These controllers manage:

  • Coordinated startup of training pods across nodes
  • Signal propagation for clean shutdowns
  • Checkpoint triggering on pod preemption or node failures
  • Job restarts from the latest checkpoint

Storage Architectures for AI Workloads

Storage for AI training must handle two distinct workload patterns: high-throughput, sequential reads for training data, and periodic but intensive writes for checkpoints.

Data Access Patterns in Training

  1. Training Data Storage
    Training data access typically follows these patterns:
    • Initial random sampling for curriculum or mix creation
    • Sequential reading during training epochs
    • Potential re-reading for multiple passes

    This favors object storage systems with high read throughput and parallelism.

  2. Checkpoint Storage
    Checkpoint writes have these characteristics:
    • Bursty, high-volume writes
    • Infrequent reads (only during failure recovery)
    • Short retention periods for most checkpoints
    • Critical durability for “golden” checkpoints
[Figure: Storage architecture for training]

Storage System Comparison

| Storage Type | Strengths | Weaknesses | Best Use Case |
| --- | --- | --- | --- |
| Local NVMe SSDs | Lowest latency, highest bandwidth | Limited capacity, node-specific | Temporary checkpoints, caching |
| Network File Systems (Lustre, GPFS) | Shared access, high throughput | Complex setup, scaling challenges | Intermediate checkpoints, shared configuration |
| Object Storage (S3, GCS, Azure Blob, MinIO) | Virtually unlimited capacity, highly durable | Higher latency, limited transactional support | Training data, archival checkpoints |
| Kubernetes PVs with CSI | Kubernetes-native integration | Overhead of PV lifecycle management | Small-scale training, development |
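
To ground the "Kubernetes PVs with CSI" row, the claim below sketches how the shared-filesystem tier is typically exposed to pods. The storage class name is hypothetical and stands in for whatever NFS, Lustre, or GPFS CSI driver the cluster runs; the claim name matches the one reused in the full job example later in this article:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: distributed-checkpoints
spec:
  accessModes:
    - ReadWriteMany               # many worker pods write shards of the same checkpoint
  storageClassName: shared-fs     # hypothetical class backed by a parallel-filesystem CSI driver
  resources:
    requests:
      storage: 50Ti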

Practical Storage Configurations

Many production setups use a hybrid approach:

Training Pod
├── /data                 # Read-only training data (from object storage)
├── /local-checkpoints    # High-frequency checkpoints (NVMe hostPath)
├── /shared-checkpoints   # Medium-frequency checkpoints (NFS/Lustre PV)
└── /config               # Job configuration and monitoring data (ConfigMap)
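
A hedged sketch of the volume wiring behind that layout is shown below; the claim names, ConfigMap name, and hostPath are illustrative and would be adapted to the cluster's storage drivers:

containers:
- name: worker
  volumeMounts:
  - name: training-data
    mountPath: /data
    readOnly: true
  - name: local-checkpoints
    mountPath: /local-checkpoints
  - name: shared-checkpoints
    mountPath: /shared-checkpoints
  - name: job-config
    mountPath: /config
    readOnly: true
volumes:
- name: training-data
  persistentVolumeClaim:
    claimName: training-dataset              # read-only copy of the training corpus
- name: local-checkpoints
  hostPath:
    path: /mnt/nvme/checkpoints              # node-local NVMe for high-frequency checkpoints
    type: DirectoryOrCreate
- name: shared-checkpoints
  persistentVolumeClaim:
    claimName: distributed-checkpoints       # shared parallel filesystem (see the PVC sketch above)
- name: job-config
  configMap:
    name: llm-job-config                     # job configuration and monitoring settings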

Networking Considerations for Distributed Training

Network performance can make or break distributed training. As model sizes increase, so does the volume of gradient and parameter updates that must flow between nodes.

Network Bottlenecks in Training

  1. All-Reduce Communication Patterns
    The standard synchronous SGD (Stochastic Gradient Descent) training approach requires an “all-reduce” operation at each step, where gradients from all nodes must be combined. This creates intense, bursty network traffic that can easily saturate typical networks.
[Figure: All-reduce communication pattern]
  2. Network Topology Impact
    Physical network topology dramatically affects training performance:
    • Full bisection bandwidth topologies allow any node to communicate with any other node at full speed
    • Oversubscribed topologies create bottlenecks when many nodes need to communicate simultaneously

Kubernetes Networking Solutions for AI

  1. CNI Limitations for HPC Workloads
    Standard Container Network Interface (CNI) implementations like Flannel or Calico add overhead that can significantly impact performance:
    • Overlay networks add encapsulation overhead
    • Virtual interfaces add latency
    • Software-defined routing adds CPU overhead
  2. Performance-Optimized Approaches
    For maximum performance, AI clusters often use:
    • Host networking mode to bypass CNI overhead
    • SR-IOV for direct hardware access to network adapters (a secondary-network sketch follows this list)
    • RDMA-capable networks (InfiniBand, RoCE) for near-zero CPU overhead
    • GPUDirect RDMA to allow direct GPU-to-network transfers
[Figure: AI training pod networking]
  3. Practical Implementation
    A common Kubernetes networking setup for large-scale training:
spec:
  hostNetwork: true       # Use host networking for maximum performance
  dnsPolicy: ClusterFirstWithHostNet  # Still use Kubernetes DNS
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a  # Keep pods in the same zone
  affinity:
    podAntiAffinity:      # Spread workers across different physical racks
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: llm-training               # matches the other worker pods in this job
          topologyKey: topology.kubernetes.io/rack  # rack label applied by the cluster administrator
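
As an alternative to full host networking, some clusters keep the regular pod network for control traffic and attach a dedicated SR-IOV virtual function for RDMA traffic as a secondary interface. The sketch below assumes Multus and the SR-IOV device plugin are installed; the resource name, subnet, and attachment name are illustrative:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-sriov
  annotations:
    k8s.v1.cni.cncf.io/resourceName: example.com/sriov_rdma   # VF pool exposed by the SR-IOV device plugin
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": { "type": "host-local", "subnet": "192.168.100.0/24" }
    }
---
# Pod fragment: attach the secondary interface and request one virtual function.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-sriov
spec:
  containers:
  - name: worker
    resources:
      limits:
        example.com/sriov_rdma: "1"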

Modern Training Orchestration Platforms

The complexity of large-scale training has spawned specialized platforms that build on Kubernetes while adding ML-specific functionality.

Specialized Training Platforms

  1. MosaicML (now part of Databricks)
    MosaicML’s platform offers specialized capabilities for large-scale LLM training:
    • Optimized training libraries with advanced sharding techniques
    • Automated hyperparameter tuning integrated with distributed training
    • Hardware-aware scheduling for heterogeneous GPU clusters
  2. Determined AI (acquired by HPE)
    Focused on making distributed training more accessible:
    • Simplified API for distributed training
    • Automatic fault tolerance and checkpointing
    • Integrated experiment tracking
  3. SageMaker and Vertex AI
    Cloud providers offer managed training platforms:
    • Seamless scaling of training clusters
    • Pre-optimized containers for common frameworks
    • Integration with cloud-native monitoring and billing

Emerging Architectural Patterns

Recent developments in training infrastructure include:

  1. Ephemeral Training Clusters
    Instead of maintaining a permanent Kubernetes cluster, some organizations spin up purpose-built clusters for specific training jobs, then tear them down when complete.
  2. Hybrid On-Prem and Cloud Training
    For extremely large jobs, organizations might combine their on-premises GPU clusters with burst capacity from cloud providers during peak needs.
  3. Training-Specialized Kubernetes Distributions
    Custom Kubernetes distributions optimized specifically for AI workloads, with specialized scheduling, reduced overhead, and HPC-focused features.

Example: A Production-Ready Training Job Configuration

The following YAML illustrates a more complete training job configuration that incorporates many of the best practices discussed:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: llm-pretraining
spec:
  slotsPerWorker: 8  # 8 GPUs per node
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/llm-training:v1.2.3
            env:
            - name: MODEL_SIZE
              value: "70B"  # 70 billion parameter model
    Worker:
      replicas: 512  # 512 worker nodes (4,096 GPUs total)
      template:
        spec:
          priorityClassName: training-critical
          hostNetwork: true
          containers:
          - name: worker
            image: registry.example.com/llm-training:v1.2.3
            resources:
              limits:
                nvidia.com/gpu: 8
                cpu: "96"
                memory: 1024Gi
                ephemeral-storage: 2Ti
            volumeMounts:
            - name: training-data
              mountPath: /data
              readOnly: true
            - name: checkpoints
              mountPath: /checkpoints
            - name: local-cache
              mountPath: /local-cache
            env:
            - name: LOCAL_CHECKPOINT_PATH
              value: "/local-cache/checkpoints"
            - name: DISTRIBUTED_CHECKPOINT_PATH
              value: "/checkpoints"
            - name: NCCL_IB_DISABLE
              value: "0"  # Enable InfiniBand for NCCL
            - name: NCCL_DEBUG
              value: "INFO"
          volumes:
          - name: training-data
            persistentVolumeClaim:
              claimName: training-dataset
          - name: checkpoints
            persistentVolumeClaim:
              claimName: distributed-checkpoints
          - name: local-cache
            hostPath:
              path: /mnt/nvme/cache
              type: DirectoryOrCreate
          tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule

Monitoring and Observability for Training Jobs

A critical but often overlooked aspect of large-scale training is comprehensive monitoring. When running jobs that cost millions in compute resources, visibility into performance is essential.

Key Metrics for Training Jobs

A robust monitoring stack should track:

  1. Training Progress Metrics
    • Loss/accuracy curves over time
    • Training throughput (samples per second)
    • Gradient norms and parameter update magnitude
  2. Hardware Utilization Metrics
    • GPU utilization and memory usage
    • NCCL/communication bandwidth
    • Storage I/O throughput
  3. Efficiency Metrics
    • MFU (Model FLOPs Utilization) – the percentage of theoretical peak FLOPS actually achieved (approximated below)
    • Power consumption and performance per watt
    • Cost per training token
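
MFU is usually estimated rather than measured directly. A common first-order approximation for dense transformer models uses the rule of thumb of roughly 6 FLOPs per parameter per token for a combined forward and backward pass; exact accounting varies with architecture and activation recomputation:

MFU ≈ achieved FLOPs/s ÷ (number of GPUs × peak FLOPs/s per GPU)
achieved FLOPs/s ≈ 6 × N × tokens processed per second,   where N is the model's parameter count
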
[Figure: Monitoring architecture]

Visualization and Alerting

Tools like Weights & Biases, TensorBoard, Grafana, and Prometheus are commonly integrated with training clusters to provide:
– Real-time dashboards of training progress
– Alerts for training stalls or hardware issues (see the sketch below)
– Historical performance comparisons across runs
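
As one concrete example of the alerting piece, the rule below is a hedged sketch that pages when average GPU utilization on a node collapses, a common symptom of a stalled job. It assumes the NVIDIA dcgm-exporter is scraped by a Prometheus Operator installation; the metric labels, threshold, and duration are illustrative and depend on the exporter and relabeling configuration:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: training-stall-alerts
spec:
  groups:
  - name: llm-training
    rules:
    - alert: GPUUtilizationCollapsed
      expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 20   # dcgm-exporter GPU utilization gauge
      for: 15m
      labels:
        severity: page
      annotations:
        summary: "GPU utilization on {{ $labels.Hostname }} has been below 20% for 15 minutes; the training job may be stalled."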

Conclusion: Building for Scale and Flexibility

Successfully deploying large-scale AI training infrastructure on Kubernetes requires careful consideration of hardware capabilities, networking topology, storage architecture, and software design. The single-pod-per-node pattern, combined with robust checkpointing mechanisms, provides a pragmatic approach to running training workloads that can scale to thousands of nodes.

As models continue to grow in size and complexity, we can expect further evolution in infrastructure patterns:

  • Specialized Hardware: More diverse accelerators beyond traditional GPUs
  • Hybrid Parallelism: Combinations of data, pipeline, tensor, and sequence parallelism
  • Federated Training: Distributed training across organizational boundaries
  • Energy Efficiency: Optimizations driven by sustainability and cost concerns

Organizations building AI infrastructure should focus on creating flexible foundations that can adapt to these rapid changes while maintaining operational efficiency and reliability.

