Site Reliability Engineer - Kubernetes Platform
Company: xAI
Location: Palo Alto
Posted on: February 21, 2026
Job Description:
About xAI
xAI's mission is to
create AI systems that can accurately understand the universe and
aid humanity in its pursuit of knowledge. Our team is small, highly
motivated, and focused on engineering excellence. This organization
is for individuals who appreciate challenging themselves and thrive
on curiosity. We operate with a flat organizational structure. All
employees are expected to be hands-on and to contribute directly to
the company's mission. Leadership is given to those who show
initiative and consistently deliver excellence. Work ethic and
strong prioritization skills are important. All employees are
expected to have strong communication skills. They should be able
to concisely and accurately share knowledge with their teammates.

About the Role
We are seeking a highly skilled Site Reliability
Engineer to join our mission-driven team, focusing on designing,
building, and optimizing Kubernetes clusters across multiple
regions. In this role, you will leverage your expertise in
Kubernetes orchestration and distributed systems to enhance the
reliability, performance, and cost-effectiveness of xAI's
infrastructure. You will collaborate closely with engineering teams
to deliver robust, scalable solutions that support large-scale AI
workloads. The ideal candidate is passionate about automation,
observability, and ensuring the integrity of critical systems in a
fast-paced, innovative environment.

Responsibilities
Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components.
This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

Required Qualifications
5 years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.
Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.
Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs/SLOs.

Preferred Qualifications
Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.
Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.
Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation.
Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.
Passion for problem-solving and a proactive drive to deliver impactful results.
A sense of adventure and humor to navigate challenges with a positive mindset.

Annual Salary Range
$180,000 - $440,000 USD

Benefits
Base salary is just one part of
our total rewards package at xAI, which also includes equity,
comprehensive medical, vision, and dental coverage, access to a
401(k) retirement plan, short & long-term disability insurance,
life insurance, and various other discounts and perks.

xAI is an
equal opportunity employer. For details on data processing, view
our Recruitment Privacy Notice.