
August 12, 2023

Large-Scale AI Infrastructure on Kubernetes: Scaling Training for Modern LLMs

Introduction: The Infrastructure Challenge of Massive AI Models

The relentless march of AI progress is increasingly a story of brute-force computation. As model parameters balloon from millions to billions and now into the trillions, the infrastructure required to breathe life into these behemoths has become a staggering challenge. Training modern large language models (LLMs) like GPT-4 or PaLM 2 isn’t a casual affair. It’s a protracted campaign demanding thousands of top-tier GPUs burning blindingly bright for weeks, sometimes months, consuming megawatts and millions.

This piece dissects the guts of the beast: the Kubernetes-based infrastructure that underpins many of these massive training endeavors. We’ll poke at why certain, sometimes counter-intuitive, design choices prevail, the brutal limitations engineers wrestle with, and the emerging patterns attempting to tame this computational hydra. It’s less about the magic of AI, and more about the grinding reality of the plumbing required to make the magic happen.

Growth of LLM Scale

Kubernetes as the Foundation for AI Infrastructure

Why Kubernetes Has Become the (Perhaps Accidental) Standard for Large-Scale Training

Kubernetes, born from the relatively tame world of microservice orchestration, wasn’t explicitly designed for the high-performance computing (HPC) demands of bleeding-edge AI training. Yet, it has become the de facto substrate. Why this unlikely ascension?

  1. Declarative Resource Wrangling
    In theory, Kubernetes lets you declare your intent—what compute you need, how it’s configured, how it talks. In practice, it provides a structured way to wrestle massive resource pools into submission, codifying the complex requirements of distributed training jobs.
  2. Papering Over Hardware Heterogeneity
    Real-world AI clusters are often messy mosaics: different GPU generations, varied CPUs, exotic accelerators. Kubernetes attempts to smooth over these differences, presenting a somewhat unified façade to the workloads, hiding the underlying hardware sprawl.
  3. Automation and Bracing for Failure
    Training runs lasting weeks will encounter hardware failures. It’s not if, but when. Kubernetes’ self-healing capabilities, while not a panacea, offer a baseline level of resilience, attempting to keep the ship sailing despite nodes inevitably going dark.
  4. The Ecosystem and Its Tentacles
    The sheer gravity of Kubernetes has pulled in a constellation of tools. Custom schedulers, monitoring stacks, specialized operators (like Kubeflow’s MPI Operator)—these extensions graft AI-specific intelligence onto the core platform, patching holes and adding necessary capabilities.
  5. A Lingua Franca Across Environments
    Whether you’re burning cash on-prem or in the cloud, the Kubernetes API offers a semblance of consistency. This standardized interface simplifies workflows, allowing teams to focus (slightly) less on the environment and more on the training itself.

Kubernetes Components for AI Workloads

History provides perspective. OpenAI’s GPT-3 bake in 2020 reportedly ran on a Microsoft Azure Kubernetes cluster packing around 10,000 NVIDIA V100 GPUs. Today’s frontier models demand even more computational firepower – think tens of thousands of the latest, hungriest accelerators. Kubernetes is the arena, chosen perhaps because it was the least bad option available at scale.

The Single-Pod-Per-Node Pattern: Maximizing Hardware Utilization (By Yielding to It)

Why Dense Packing Breaks for Serious AI Workloads

Here’s where things get interesting, especially for those steeped in traditional Kubernetes wisdom. The standard practice is bin-packing: cram as many pods onto a node as possible to maximize utilization. For large-scale AI training, this wisdom is inverted. The dominant pattern is one pod per node. This apparent wastefulness isn’t ideology but pragmatism, born from the unforgiving demands of the hardware.

AI Training K8s

Granting Unfettered Hardware Access

  1. GPU Sovereignty and CUDA’s Gaze
    Frameworks like PyTorch and TensorFlow crave direct, exclusive control over GPUs. Shoving multiple training processes onto the same node, mediated by Kubernetes device plugins, introduces overhead and complexity. A single pod, seeing all the node’s GPUs, can manage them directly, speaking CUDA without intermediaries.
  2. Respecting the Laws of NVLink
    Modern GPU servers boast specialized interconnects like NVLink, allowing GPUs to gossip at speeds far exceeding standard PCIe lanes (think 600 GB/s). This high-speed chatter works best when orchestrated by a single process within a single pod, simplifying the intricate dance of GPU-to-GPU communication.
  3. Navigating Memory Mazes (NUMA)
    Big servers often feature Non-Uniform Memory Access (NUMA) architectures – a fancy way of saying memory access speed depends on which CPU socket asks. A single pod can be pinned and configured with NUMA awareness, ensuring data stays close to the compute that needs it, avoiding costly cross-socket trips. (A minimal binding sketch follows this list.)
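
To make this concrete, here is a minimal sketch – not any particular framework’s API – of what a launcher inside such a pod might do: expose one GPU per worker process via CUDA_VISIBLE_DEVICES and pin that process to a contiguous slice of CPU cores as a crude stand-in for NUMA locality. The LOCAL_RANK variable and the even cores-per-GPU split are assumptions; a real setup would read the actual NUMA topology.

# Minimal sketch: bind one worker process per GPU inside a pod that owns the
# whole node. Assumes LOCAL_RANK is set by the launcher (torchrun, mpirun, ...)
# and that CPU cores split evenly across GPUs; real setups read the actual
# NUMA topology instead of assuming a contiguous slice.
import os

def bind_worker(local_rank: int, gpus_per_node: int = 8) -> None:
    # Expose only this worker's GPU; the framework then sees a single device 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

    # Crude NUMA-ish pinning: an equal, contiguous slice of cores per worker.
    total_cores = os.cpu_count() or 1
    cores_per_worker = max(1, total_cores // gpus_per_node)
    start = local_rank * cores_per_worker
    os.sched_setaffinity(0, range(start, start + cores_per_worker))  # Linux-only

if __name__ == "__main__":
    bind_worker(int(os.environ.get("LOCAL_RANK", "0")))
    print("pid", os.getpid(), "pinned to cores", sorted(os.sched_getaffinity(0)))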

Shedding Networking Overheads

  1. Topology Truthfulness
    Distributed training is network-bound. Synchronization between nodes is constant and intense. A one-pod-per-node setup makes the pod topology mirror the physical network topology. This allows training frameworks to use topology-aware communication algorithms, drastically cutting down synchronization time by avoiding network bottlenecks.
  2. Bare-Metal Speed (Host Networking)
    To squeeze out every last microsecond, these single pods often run with hostNetwork: true. This bypasses Kubernetes’ virtual networking layers (like CNI overlays), giving the pod direct access to the node’s network interface. Essential for RDMA (Remote Direct Memory Access) and other low-latency protocols where software overhead is poison.

Operational Sanity

  1. Brutally Simple Accounting
    One pod, one node. Resource allocation becomes trivial: the pod gets everything. No complex quota negotiations, no stranded resources, no bin-packing puzzles.
  2. Easing Scheduler Strain
    Placing one giant pod is simpler for the Kubernetes scheduler than juggling dozens of smaller ones with intricate constraints. It reduces the combinatorial complexity the scheduler faces. This pattern is an admission: sometimes the abstraction needs to get out of the way and let the workload talk directly to the metal.

Semi-Stateful Workloads: Checkpointing as the Bulwark Against Entropy

Training a massive LLM isn’t neatly stateless like a web server, nor fully stateful like a database clutching its precious data. It occupies a murky middle ground: a long, computationally intensive process punctuated by moments of persistence. This is the world of the “semi-stateful” workload, where progress is real but fragile.

The Existential Role of Checkpoints

Checkpoints are your fundamental defense against the twin demons of hardware mortality and the second law of thermodynamics. Without them, a single node failure weeks into a multi-month run could vaporize millions of dollars in compute.

  1. The Checkpoint Cadence Dilemma
    How often to save? Checkpoint too frequently, and you waste precious compute cycles and hammer storage I/O. Checkpoint too rarely, and a failure means losing more work. Sophisticated training systems dynamically adjust this cadence, balancing the cost of saving against the probability-weighted cost of failure, sometimes based on real-time cluster health monitoring.
  2. The Distributed Checkpointing Nightmare
    Saving a model with trillions of parameters isn’t trivial; it can become a significant bottleneck in its own right. Strategies to mitigate this include:
    • Sharded Checkpoints: Each node saves its slice of the model parameters and optimizer state.
    • Asynchronous Checkpointing: Saving happens in the background, allowing training to (theoretically) continue, though this adds complexity.
    • Incremental Checkpoints: Only saving parameters that have significantly changed since the last save (tricky to get right).
  3. A Hierarchy of Persistence Paranoia
    Most large-scale setups employ a tiered approach to checkpoint storage, reflecting different levels of urgency and durability (a code sketch of this tiering follows below):
    • Frequent, Fleeting: Temporary checkpoints saved rapidly to fast, local NVMe drives. Good for recovering from transient node failures.
    • Regular, Reliable: Checkpoints saved less frequently to high-performance network storage (like Lustre or GPFS). The go-to for recovery from more significant cluster events.
    • Archival, Eternal: “Golden” checkpoints saved infrequently (e.g., end-of-epoch, major milestones) to durable, cheaper object storage (S3, GCS, Azure Blob). For disaster recovery or resuming training much later.

Checkpoint Hierarchy
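
To make the tiering concrete, here is a minimal sketch of a per-rank checkpoint routine. The torch.save-based sharding, the /local-checkpoints and /checkpoints mount paths (which mirror the layout used later in this post), and the intervals are all illustrative assumptions; real systems layer retention policies, asynchronous writers, and validation on top. The cadence dilemma itself is often approximated with the Young/Daly rule: checkpoint interval ≈ √(2 × checkpoint cost × MTBF).

# Sketch of tiered, sharded checkpointing: frequent saves to local NVMe,
# periodic promotion to shared network storage. Paths and intervals are
# illustrative and mirror the mount layout described elsewhere in this post.
import os
import shutil
import torch
import torch.distributed as dist

LOCAL_DIR = os.environ.get("LOCAL_CHECKPOINT_DIR", "/local-checkpoints")
SHARED_DIR = os.environ.get("SHARED_CHECKPOINT_DIR", "/checkpoints")
LOCAL_EVERY = 100     # steps between fast local saves
SHARED_EVERY = 1000   # steps between durable shared saves (multiple of LOCAL_EVERY)

def save_sharded(step: int, model, optimizer) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    shard = {
        "step": step,
        "model": model.state_dict(),          # this rank's shard under FSDP/ZeRO-style sharding
        "optimizer": optimizer.state_dict(),
    }
    fname = f"step{step:08d}-rank{rank:05d}.pt"

    if step % LOCAL_EVERY == 0:
        os.makedirs(LOCAL_DIR, exist_ok=True)
        tmp = os.path.join(LOCAL_DIR, fname + ".tmp")
        torch.save(shard, tmp)
        os.rename(tmp, os.path.join(LOCAL_DIR, fname))   # publish only complete files

    if step % SHARED_EVERY == 0:
        os.makedirs(SHARED_DIR, exist_ok=True)
        shutil.copy2(os.path.join(LOCAL_DIR, fname),     # promote the local shard
                     os.path.join(SHARED_DIR, fname))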

Kubernetes Job Controllers: The Digital Sheepdogs

Standard Kubernetes Deployments or StatefulSets aren’t quite right for managing these complex, distributed, semi-stateful training runs. Specialized controllers, often based on the Message Passing Interface (MPI) model like the Kubeflow MPI Operator, have emerged as the necessary digital sheepdogs:

  • They orchestrate the coordinated launch of hundreds or thousands of pods.
  • They propagate signals for graceful shutdowns or checkpoint triggers.
  • They manage job restarts, pointing the new pods to the latest valid checkpoint after a failure.

Storage Architectures: Feeding the Beast and Saving Its Memories

AI training storage is a two-headed monster. It needs to sustain a voracious, high-throughput appetite for training data reads, while also handling the periodic, intense data dumps of checkpoint writes.

Data Access Dichotomy

  1. Training Data Consumption
    Access patterns usually involve:
    • Initial random shuffles or sampling to create training batches or curricula.
    • Primarily sequential, high-throughput reads during training epochs.
    • Potential re-reads if multiple passes over the data are needed.
    Object storage, with its massive parallelism and read throughput, often shines here, sometimes fronted by caching layers; a minimal streaming-reader sketch follows this list.
  2. Checkpoint Persistence
    Writing checkpoints looks different:
    • Bursty, write-heavy operations.
    • Reads are rare – only needed for recovery.
    • Most checkpoints have short lifespans (only the latest few matter).
    • Critical durability required for milestone or “golden” checkpoints.
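
For the read side, here is a minimal sketch of the common pattern: shuffle the shard list once per epoch, then stream each shard sequentially, caching it on local NVMe so later passes skip the slow tier. The /data and /local-cache paths are placeholders; production pipelines usually rely on purpose-built loaders (WebDataset-style tar shards, framework data services) rather than hand-rolled code like this.

# Sketch: shuffle dataset shards once per epoch, read each shard sequentially,
# and cache it on local NVMe so repeat passes avoid the network/object store.
# Paths are placeholders for the /data and local-cache mounts described above.
import os
import random
import shutil
from typing import Iterator, List

CACHE_DIR = "/local-cache"   # assumed hostPath-backed NVMe mount

def stream_shards(shard_paths: List[str], seed: int) -> Iterator[bytes]:
    order = list(shard_paths)
    random.Random(seed).shuffle(order)           # cheap shuffle at shard granularity
    os.makedirs(CACHE_DIR, exist_ok=True)
    for path in order:
        cached = os.path.join(CACHE_DIR, os.path.basename(path))
        if not os.path.exists(cached):
            shutil.copy2(path, cached)           # first pass: pull from slow storage
        with open(cached, "rb") as f:
            while chunk := f.read(8 << 20):      # 8 MiB sequential reads
                yield chunk

if __name__ == "__main__":
    shards = [f"/data/shard-{i:05d}.tar" for i in range(4)]  # placeholder paths
    if all(os.path.exists(p) for p in shards):
        for _ in stream_shards(shards, seed=0):
            pass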

Storage Architecture for Training

Storage System Tradeoffs

| Storage Type | Strengths | Weaknesses | Typical Role |
|---|---|---|---|
| Local NVMe SSDs | Blazing speed, lowest latency | Finite capacity, node-locked | Ephemeral checkpoints, data caching |
| Network File Systems (Lustre, GPFS, etc.) | Shared access, decent throughput | Setup complexity, scaling limits | Shared intermediate checkpoints, config |
| Object Storage (S3, GCS, Azure Blob, MinIO) | Near-infinite scale, durability, cost-effective | Higher latency, consistency caveats | Training datasets, archival checkpoints |
| Kubernetes PVs (via CSI) | K8s native, simplifies provisioning | Potential overhead, abstraction layer | Smaller setups, dev environments, sometimes NFS/Lustre backend |

Pragmatic Storage Layering

In production, a hybrid strategy is almost universal, mapping different storage types to specific needs within the pod:

Training Pod
├── /data                 # Read-only mount of the massive training dataset (often object storage via FUSE or caching)
├── /local-checkpoints    # High-frequency, temporary saves (direct hostPath to local NVMe)
├── /shared-checkpoints   # Medium-frequency, shared saves (PV pointing to NFS/Lustre/GPFS)
└── /config               # Job configuration, scripts, monitoring sidecars (ConfigMap/PV)

Networking: The High-Speed Circulatory System (or Bottleneck)

For distributed training, the network is the critical circulatory system. As models grow, the flood of gradients and parameters that must synchronize across nodes can easily overwhelm inadequate network designs. Performance here is non-negotiable.

Where Networks Break Down

  1. The All-Reduce Storm
    The workhorse pattern, synchronous data-parallel training (classically described as synchronous SGD), hinges on a step called “all-reduce.” At each training step, gradients calculated on every node must be aggregated globally. This triggers an intense, synchronized burst of communication that hammers the network fabric. (A back-of-the-envelope traffic estimate follows this list.)

All-Reduce Communication

  2. Topology is Destiny
    The physical layout of the network matters immensely:
    • Full bisection bandwidth networks (like fat-trees or Clos topologies) are the ideal, allowing any-to-any communication without congestion. Expensive but necessary for top performance.
    • Oversubscribed networks have bottlenecks: performance degrades sharply when many nodes try talking at once during all-reduce phases. Common in less specialized clusters.
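
To get a feel for why this phase hammers the fabric, here is a back-of-the-envelope sketch rather than a benchmark: with a ring all-reduce over p workers, each worker sends (and receives) roughly 2 × (p − 1)/p × S bytes per step, where S is the total gradient size. The model size, precision, node count, and NIC bandwidth below are illustrative assumptions.

# Back-of-the-envelope: per-node traffic for one ring all-reduce of the full
# gradient set. All inputs are illustrative assumptions, not measurements.
def ring_allreduce_gb_per_node(params: float, bytes_per_param: int, nodes: int) -> float:
    grad_bytes = params * bytes_per_param
    return 2 * (nodes - 1) / nodes * grad_bytes / 1e9   # GB sent (and received) per node

if __name__ == "__main__":
    traffic_gb = ring_allreduce_gb_per_node(params=70e9, bytes_per_param=2, nodes=512)
    nic_gbps = 400                                       # e.g. a single 400 Gb/s NIC per node
    seconds = traffic_gb * 8 / nic_gbps
    print(f"~{traffic_gb:.0f} GB per node per step, "
          f"~{seconds:.1f} s on a {nic_gbps} Gb/s link (before overlap, compression, or multi-NIC nodes)")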

Kubernetes Networking: Getting Out of the Way

  1. The CNI Friction Tax
    Standard Kubernetes Container Network Interfaces (CNIs) like Flannel or Calico, designed for general-purpose networking, introduce unacceptable overhead for HPC/AI:
    • Overlay networks (VXLAN etc.) add encapsulation/decapsulation costs.
    • Virtual interfaces (veth pairs) add latency.
    • Software routing/firewalling (iptables/eBPF) burns precious CPU cycles needed for computation.
  2. Stripping Down to the Metal
    To achieve the necessary low latency and high bandwidth, AI clusters aggressively bypass standard K8s networking:
    • hostNetwork: true is the default, eliminating CNI overhead entirely.
    • SR-IOV (Single Root I/O Virtualization) can grant pods direct, virtualized access to physical NIC hardware.
    • RDMA-capable fabrics (InfiniBand or RDMA over Converged Ethernet – RoCE) allow network adapters to transfer data directly between application memory (or even GPU memory) across nodes, bypassing the CPU and OS kernel stack almost entirely.
    • GPUDirect RDMA enables GPUs to push/pull data directly to/from RDMA-capable NICs, eliminating costly data copies through CPU memory.

AI Training Pod Networking

  3. Configuration Snapshot
    A typical K8s pod spec snippet for training reflects this pragmatism (a process-level counterpart is sketched after it):
spec:
  hostNetwork: true       # Direct host networking is mandatory
  dnsPolicy: ClusterFirstWithHostNet  # Keep K8s DNS for service discovery
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a  # Attempt to keep pods physically close
  affinity:
    podAntiAffinity:      # Prefer spreading pods of the same job across racks to avoid overloading rack switches
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                training-job: llm-pretraining   # example label; match your job's pod labels
            topologyKey: topology.kubernetes.io/rack
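
At the process level, the same pragmatism shows up as environment configuration before the collective backend comes up. A rough sketch follows: the NCCL variables are real knobs, but the values shown (and whether you need them at all) depend entirely on the fabric, and init_process_group assumes a torchrun/MPI-style launcher providing RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.

# Sketch: process-level counterpart to the pod spec above. Values are
# placeholders; correct NCCL settings depend entirely on the fabric.
import os
import torch.distributed as dist

def init_fabric() -> None:
    # Only set knobs you actually need; NCCL's autodetection is often right.
    os.environ.setdefault("NCCL_IB_DISABLE", "0")        # allow InfiniBand/RoCE if present
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder host interface name
    os.environ.setdefault("NCCL_DEBUG", "INFO")          # verbose logs while bringing up a cluster

    # The launcher is assumed to provide RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
    dist.init_process_group(backend="nccl")

if __name__ == "__main__":
    init_fabric()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} up over NCCL")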

Modern Training Orchestration Platforms: Abstractions Atop the Foundation

The raw complexity of orchestrating these massive, failure-prone, hardware-sensitive training runs has inevitably led to higher-level platforms. These systems layer specialized ML/AI intelligence onto the Kubernetes substrate.

The Contenders

  1. MosaicML (Swallowed by Databricks)
    Focused squarely on LLM training efficiency, Mosaic offered optimized libraries (like Composer), clever sharding algorithms, integrated hyperparameter optimization, and hardware-aware scheduling designed for the nuances of large model training.
  2. Determined AI (Now HPE)
    Aimed to democratize distributed training, providing simpler APIs, robust fault tolerance with automatic checkpointing/restarts, and integrated experiment tracking, abstracting away some of the MPI-level complexities.
  3. Cloud Behemoths (SageMaker, Vertex AI)
    AWS and Google Cloud offer managed platforms that promise seamless scaling, pre-built optimized containers, and deep integration with their respective cloud ecosystems (monitoring, storage, billing). The trade-off is potential lock-in and sometimes less flexibility than bespoke solutions.

Shifting Architectural Sands

The landscape continues to evolve:

  1. Ephemeral Compute Tribes: Instead of monolithic, permanent K8s clusters, some opt for creating dedicated clusters just for a single massive training run, tearing them down afterwards. This avoids resource contention and allows for highly customized environments.
  2. Bridging Worlds (Hybrid Cloud/On-Prem): For truly enormous jobs, organizations might use their on-prem clusters as a base and burst into the cloud for peak capacity, requiring sophisticated workload management and data synchronization.
  3. Kubernetes, Remastered for AI: We’re seeing the emergence of K8s distributions specifically hardened and optimized for AI/HPC, stripping out unnecessary components, tuning schedulers, and integrating HPC-centric features natively.

Example: A Glimpse into the Engine Room (Production-Ready Job)

This YAML artifact provides a more concrete sense of how these pieces fit together in a large-scale MPI-based training job definition. It’s a snapshot of pragmatism, encoding hardware needs, networking choices, and resilience strategies:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: llm-pretraining-job-001 # Give jobs meaningful names
spec:
  slotsPerWorker: 8 # Assuming nodes with 8 GPUs each
  runPolicy:
    cleanPodPolicy: Running # Keep pods around for debugging if job fails
  mpiReplicaSpecs:
    Launcher: # A single pod to coordinate the MPI launch
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/llm-training-framework:v1.2.3
            # The launcher typically needs far fewer resources; it only coordinates the MPI startup
            env:
            - name: MODEL_CHECKPOINT_NAME
              value: "llama-70b-exp-run-001"
            - name: TARGET_LOSS
              value: "1.85"
    Worker: # The main compute workforce
      replicas: 512 # Requesting 512 worker nodes (512 * 8 = 4096 GPUs total)
      template:
        spec:
          priorityClassName: training-critical # Ensure these pods aren't easily preempted
          hostNetwork: true # Mandatory for performance
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - name: worker # The container running the actual training code
            image: registry.example.com/llm-training-framework:v1.2.3
            resources:
              limits:
                nvidia.com/gpu: 8 # Request all 8 GPUs on the node
                cpu: "96" # Request significant CPU resources (adjust based on node)
                memory: 1024Gi # Request large memory (adjust based on node)
                ephemeral-storage: 2Ti # For local caching, logs etc.
            volumeMounts:
            - name: training-data # Mount point for the dataset
              mountPath: /data
              readOnly: true # Usually read-only
            - name: shared-checkpoints # Mount point for durable checkpoints
              mountPath: /checkpoints
            - name: local-checkpoints # Mount point for faster, local checkpoints
              mountPath: /local-checkpoints # Pointing to hostPath volume below
            env:
            # Environment variables to configure the training script
            - name: LOCAL_CHECKPOINT_DIR
              value: "/local-checkpoints"
            - name: SHARED_CHECKPOINT_DIR
              value: "/checkpoints"
            - name: NCCL_PROTO
              value: "Simple" # NCCL tuning parameters (highly environment specific)
            - name: NCCL_ALGO
              value: "Ring"
            - name: NCCL_IB_DISABLE # Ensure InfiniBand is enabled for NCCL if available
              value: "0"
            - name: NCCL_DEBUG # Verbose NCCL logging for debugging
              value: "INFO"
          volumes:
          # Define the volumes used in volumeMounts
          - name: training-data
            persistentVolumeClaim: # Assumes a PVC for the dataset (could be backed by object storage)
              claimName: large-training-dataset-pvc
          - name: shared-checkpoints
            persistentVolumeClaim: # Assumes a PVC for shared checkpoints (backed by NFS/Lustre/GPFS)
              claimName: shared-checkpoint-storage-pvc
          - name: local-checkpoints
            hostPath: # Use hostPath for direct access to local NVMe
              path: /mnt/fast_nvme/checkpoints # Path on the host node
              type: DirectoryOrCreate # Create if it doesn't exist
          tolerations: # Ensure pods can run on GPU nodes
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
          nodeSelector: # Optionally target specific node types
            cloud.google.com/gke-accelerator: nvidia-h100-80gb
          affinity: # Prefer spreading workers across failure domains (racks); a hard requirement would cap the job at one worker per rack
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                    - key: mpi-job-name
                      operator: In
                      values:
                      - llm-pretraining-job-001 # Match pods from the same job
                  topologyKey: topology.kubernetes.io/rack

Monitoring and Observability: Watching the Million-Dollar Burn

Flying blind on a multi-million dollar training run is fiscal malpractice. Comprehensive monitoring is an economic necessity. You need to know if you’re actually making progress or just efficiently converting electricity into heat.

Vital Signs for Training Jobs

A useful monitoring stack needs to track multiple dimensions:

  1. Actual Training Progress
    • Loss curves (is the model learning?)
    • Accuracy/perplexity/other relevant metrics
    • Training throughput (tokens/second, samples/second)
    • Gradient norms, parameter update scales (sanity checks)
  2. Hardware Vital Signs
    • GPU utilization (% SM active, memory usage, power draw, temperature)
    • Network bandwidth used (especially during all-reduce)
    • Storage I/O (reads from dataset, writes to checkpoints)
    • CPU utilization (often needed for data preprocessing)
  3. Efficiency & Cost
    • MFU (Model FLOPs Utilization): The holy grail. What percentage of the GPUs’ theoretical peak FLOPS are you actually achieving in your end-to-end training step? Often depressingly low. (A rough calculation is sketched after this list.)
    • Power consumption (total cluster draw).
    • Cost per training step / per billion tokens processed.
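
Here is a rough sketch of the MFU arithmetic, using the common ≈6 × parameters FLOPs-per-token approximation for dense transformer training (exact accounting varies by architecture and by whether activation recomputation is counted). Every number below is an illustrative assumption, not a measurement.

# Back-of-the-envelope Model FLOPs Utilization (MFU). Uses the common
# ~6 * params FLOPs-per-token estimate for dense transformer training;
# peak_tflops_per_gpu is the accelerator's dense datasheet number for your
# training precision. All inputs are illustrative.
def mfu(params: float, tokens_per_second: float, num_gpus: int,
        peak_tflops_per_gpu: float) -> float:
    achieved_flops = 6.0 * params * tokens_per_second   # rough forward + backward cost
    peak_flops = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops / peak_flops

if __name__ == "__main__":
    # e.g. a 70B-parameter model, ~2M tokens/s across 4096 GPUs rated ~989 dense BF16 TFLOPS each
    print(f"MFU ≈ {mfu(params=70e9, tokens_per_second=2.0e6, num_gpus=4096, peak_tflops_per_gpu=989):.1%}")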

Data Sources

Dashboards, Alerts, and Insight (Hopefully)

Tools like Weights & Biases (W&B), TensorBoard, Grafana, Prometheus, and integrated cloud monitoring services are commonly duct-taped together to provide:

  • Real-time dashboards tracking the pulse of the training run.
  • Automated alerts for critical conditions: training divergence, stalled progress, hardware failures, thermal throttling.
  • Historical data for comparing runs, debugging regressions, and justifying budget requests.

The challenge, as always, is moving from a sea of dashboards to actual, actionable insight.

Conclusion: Building Adaptable Foundations for an Accelerating Future

Taming the large-scale AI training beast on Kubernetes demands a specific, sometimes counter-intuitive, set of architectural choices. The single-pod-per-node pattern, aggressive networking optimizations, multi-tiered checkpointing, and specialized orchestration are pragmatic adaptations to the extreme demands of modern models. K8s provides the substrate, but significant customization and higher-level tooling are essential.

The ground continues to shift rapidly beneath our feet. We can anticipate:

  • More Exotic Hardware: The accelerator zoo will expand beyond GPUs (TPUs, IPUs, custom ASICs).
  • Ever More Complex Parallelism: Intricate dances combining data, pipeline, tensor, sequence, and expert parallelism will become standard.
  • Federated Frontiers: Training across organizational or geographic boundaries presents political and technical minefields.
  • The Efficiency Imperative: Brutal economics and sustainability concerns will drive relentless optimization for energy and cost.

Building robust AI infrastructure today isn’t about finding the perfect, static solution. It’s about constructing adaptable foundations – systems that acknowledge the current realities while retaining the flexibility to evolve with the inevitable, accelerating pace of change in the AI landscape.


Posted in AI / ML, LLM Advanced