CloudLinux

Senior Database Reliability Engineer

Internship, Worldwide, Engineering & Tech, Senior

Salary: Not Specified

Posted May 17, 2026

About the Role

CloudLinux / TuxCare is seeking a high-caliber Senior Database Reliability Engineer to join our specialized Infrastructure DBA cell. In this hands-on production ownership role, you will move far beyond traditional ticket processing to architect, automate, and scale the data platforms powering CloudLinux OS, Imunify, and TuxCare services used by enterprises and hosting providers worldwide. As a key member of our engineering team, you will be instrumental in ensuring the high availability and performance of our multi-database estate, with a primary focus on PostgreSQL while rapidly mastering and supporting ClickHouse, MongoDB, and Redis operations.

We operate in a sophisticated, remote-first environment where database management is treated as an engineering discipline. You will be responsible for reducing single-person dependencies and transforming manual DBA tasks into robust, automated workflows. This role is ideal for a senior engineer who thrives on solving complex architectural challenges, optimizing query performance, and building self-service capabilities that empower our wider engineering organization. At CloudLinux, you will find a culture that embraces AI-assisted engineering and values human verification, clear documentation, and operational excellence.

Key Responsibilities

PostgreSQL Production Ownership: Drive the reliability of our core PostgreSQL clusters through high-availability design using Patroni and PgBouncer, meticulous vacuum and bloat control, expert query tuning, and capacity planning.
Infrastructure as Code & Automation: Eliminate repetitive manual work by automating DBA workflows using Ansible, Terraform/OpenTofu, and GitLab CI/CD. You will develop reproducible runbooks for provisioning, grants, and health checks.
Disaster Recovery Excellence: Build and maintain rigorous disaster recovery strategies, including tested Point-in-Time Recovery (PITR), documented recovery paths, and measurable RTO/RPO targets to ensure data safety across the global estate.
Multi-Engine Support: Troubleshoot and optimize our diverse database landscape, including ClickHouse, MongoDB, and Redis. You will actively work to reduce operational silos and improve the monitoring of these distributed systems.
Self-Service Platform Engineering: Help architect DBaaS-style capabilities that allow product teams to request databases, credentials, and health checks through automated pipelines with minimal manual intervention.
Observability & Incident Response: Enhance our proactive monitoring using Grafana, metrics, and logs. You will lead incident response efforts, define SLOs, and participate in blameless post-mortems to improve long-term system resilience.
AI-Enhanced Engineering: Leverage advanced AI assistants like Claude and Codex to accelerate development and operational tasks, while maintaining strict human-in-the-loop verification of all generated SQL and scripts.

Qualifications

Deep PostgreSQL Expertise: At least 5 years of hands-on experience managing business-critical PostgreSQL environments, with a profound understanding of MVCC, WAL, locking mechanisms, and major version upgrades.
High Availability Mastery: Proven ability to design and manage quorum-based systems, with experience reasoning about split-brain risks, failover procedures, and complex recovery scenarios.
Linux & Systems Fundamentals: Strong command of Linux internals, including systemd, networking, storage filesystems, and identifying CPU or disk bottlenecks.
Automation Proficiency: Strong experience with Ansible and scripting. Familiarity with merge-request-based delivery and infrastructure orchestration via Terraform or OpenTofu is highly advantageous.
Adaptability & Learning Agility: While PostgreSQL is your primary focus, you must have the technical depth to quickly learn and take operational responsibility for ClickHouse and MongoDB environments.
Asynchronous Communication: Exceptional written English skills for documenting runbooks, contributing to Slack/Jira discussions, and collaborating effectively across a global, remote team.
Analytical Mindset: A data-driven approach to troubleshooting and a commitment to building evidence-based operational procedures.

Benefits

Remote-First Flexibility: Work from anywhere in the world with flexible hours that allow you to balance professional impact with personal life.
Generous Paid Time Off: Benefit from 24 days of annual vacation, 10 national holidays, and an unlimited sick leave policy to ensure you stay healthy and rested.
Professional Growth & Education: Access a dedicated budget for professional development and continuous learning opportunities in a cutting-edge technical environment.
Health & Wellness Support: Compensation for private medical insurance and reimbursement for co-working spaces or gym and sports memberships.
Innovation Rewards: Participate in a culture that rewards creativity, including potential rewards for innovative ideas that the company can patent.
Impactful Work: Contribute to infrastructure that secures and powers millions of servers globally, working alongside over 300 talented engineers.

Skills

Required Skills

PostgreSQL, ClickHouse, Ansible, Terraform, Database Reliability Engineering, Linux Systems, High Availability, MongoDB, Redis, Infrastructure as Code
