Role
About the Role
This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems. We’re looking for engineers who have personally run GPU clusters in production and understand the failure modes of distributed training.
What You’ll Own
- GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
- Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads.
- Reliability & Performance Engineering: Define SLOs and error budgets that account for GPU-specific failure modes such as ECC errors, NVLink degradation, and NCCL timeouts.
- Networking & Fabric Health: Ensure the health of high-speed interconnects (InfiniBand, RoCE, NVLink) underpinning distributed training.
- Observability: Build deep visibility into GPU utilization, memory pressure, and hardware health beyond standard metrics.
- Automation & Tooling: Build production-grade automation for cluster provisioning, health checks, and firmware/driver lifecycle management.
- GPU Systems Expertise: Deep, hands-on experience operating NVIDIA A100/H100/B200 clusters and understanding hardware failure modes.
- High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics and congestion control at scale.
- Distributed Training & ML Frameworks: Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, or Megatron.
- Linux & Systems Internals: Expert-level knowledge of kernel tuning, driver management, and performance profiling.
- Kubernetes & Orchestration: Experience running Kubernetes with GPU workloads, device plugins, and topology-aware scheduling.
- Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash, and proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible).
- Experience with high-performance parallel file systems like VAST, Weka, or Lustre.
- Experience profiling and optimizing Model FLOPs Utilization (MFU).
- Involvement in physical cluster design including rack layout and power/cooling constraints.
Skills
Required Skills
GPU Infrastructure (A100/H100/B200)
InfiniBand
RoCE
NVLink
NCCL
CUDA
PyTorch
Kubernetes
Linux Internals
Python
Go
Terraform
Distributed Training