Andromeda Cluster

Senior Site Reliability Engineer, AI Infrastructure

Full-time, Global / Not Specified, Engineering & Tech, Senior

Salary not specified
Posted May 7, 2026

About the Role

This is not a generalist SRE role. You will design, operate, and debug large-scale GPU infrastructure used for distributed training and inference, working directly with customers pushing the limits of modern AI systems. We’re looking for engineers who have personally run GPU clusters in production and understand the failure modes of distributed training.

What You’ll Own
  • GPU Cluster Architecture: Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
  • Customer Technical Partnership: Serve as the primary technical point of contact for customers running large-scale training workloads.
  • Reliability & Performance Engineering: Define SLOs and error budgets that account for failure modes specific to GPU clusters, such as ECC errors, NVLink degradation, and NCCL timeouts.
  • Networking & Fabric Health: Ensure the health of high-speed interconnects (InfiniBand, RoCE, NVLink) underpinning distributed training.
  • Observability: Build deep visibility into GPU utilization, memory pressure, and hardware health beyond standard metrics.
  • Automation & Tooling: Build production-grade automation for cluster provisioning, health checks, and firmware/driver lifecycle management.
What We’re Looking For
  • GPU Systems Expertise: Deep, hands-on experience operating NVIDIA A100/H100/B200 clusters and understanding hardware failure modes.
  • High-Performance Networking: Production experience with InfiniBand, RoCE, or NVLink fabrics and congestion control at scale.
  • Distributed Training & ML Frameworks: Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, or Megatron.
  • Linux & Systems Internals: Expert-level knowledge of kernel tuning, driver management, and performance profiling.
  • Kubernetes & Orchestration: Experience running Kubernetes with GPU workloads, device plugins, and topology-aware scheduling.
  • Automation & Software Engineering: Strong engineering skills in Python, Go, or Bash, and proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible).
Strong Candidates May Have
  • Experience with high-performance parallel file systems like VAST, Weka, or Lustre.
  • Experience profiling and optimizing Model FLOPs Utilization (MFU).
  • Involvement in physical cluster design including rack layout and power/cooling constraints.

Required Skills

GPU Infrastructure (A100/H100/B200), InfiniBand, RoCE, NVLink, NCCL, CUDA, PyTorch, Kubernetes, Linux Internals, Python, Go, Terraform, Distributed Training