We are looking for a Principal Site Reliability Engineer to join our Services team. In this role, you will improve the reliability and scalability of our platform so that it meets our customers’ needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.
Key Responsibilities:
- Reliability Engineering: Design and build automated systems that ensure the reliability and scalability of our Kubernetes clusters and Hydrolix deployments across multiple cloud platforms, eliminating manual operational tasks.
- Automation and Efficiency: Identify, quantify, and systematically eliminate repetitive manual work through automation and improved tooling, freeing the team to focus on high-value work.
- Observability Infrastructure: Build and enhance comprehensive observability systems that provide deep visibility into system behavior, enable debugging and troubleshooting, and support data-driven reliability decisions.
- CI/CD and Deployment Automation: Design and build robust CI/CD pipelines and deployment automation that enable safe, frequent releases with minimal human intervention.
- Infrastructure Reliability: Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
- Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
- Root Cause Analysis: Conduct comprehensive root cause analyses for system failures, implementing long-term preventive measures.
Collaboration and Customer Engagement:
- Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
- Knowledge Sharing: Document systems, create runbooks, and share knowledge across the organization to build collective expertise in reliability engineering.
- Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.
- Reliability Systems: Build and maintain centralized reliability platforms, tools, and services that empower all engineering teams to operate their systems effectively.
- Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support and continuous improvement of our reliability posture.
- Customer-Facing Reliability: Work with customers to understand reliability requirements and ensure our platform meets their operational needs.
Qualifications and Skills:
SRE Expertise
- 10+ years of proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role, supporting large-scale, complex distributed systems in production.
- Demonstrated ability to operate at a principal level by setting reliability direction, defining standards, and influencing system design across multiple teams.
Architecture, Performance & Scalability
- Deep experience designing and evolving system architectures with reliability, scalability, and operability as first-class concerns.
- In-depth experience in application and infrastructure performance tuning and scaling to handle heavy workloads under varying traffic patterns and failure scenarios.
- Ability to identify systemic bottlenecks, capacity risks, and inefficiencies, and drive long-term architectural improvements.
Automation, Platform & Infrastructure Engineering
- Exceptional track record of eliminating toil through automation, including building internal platforms or frameworks that enable safe, scalable self-service.
- In-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Pulumi, and Ansible for provisioning and managing infrastructure consistently across environments.
Observability & Reliability Engineering
- Deep expertise in observability tools and practices, with the ability to design end-to-end monitoring strategies aligned with business outcomes.
- Strong understanding of core reliability concepts, including SLIs, SLOs, SLAs, error budgets, golden signals, and quality gates.
- Hands-on experience with distributed tracing, synthetic monitoring, end-user monitoring, performance testing, and chaos engineering.
- Proven experience driving blameless postmortems and ensuring learnings result in measurable reliability improvements.
Kubernetes & Distributed Systems
- Deep understanding of Kubernetes architecture, operations, failure modes, and ecosystem tooling.
- Experience designing and operating multi-cluster and/or multi-region Kubernetes platforms at scale.
Cloud & Multi-Cloud Expertise
- Demonstrated proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode), with experience building cloud-native systems.
- Familiarity with multi-cloud or hybrid architectures and the operational trade-offs involved.
Networking, Security & Traffic Management
- Experience with network load balancing, traffic management, and capacity planning at scale.
- Strong understanding of security technology stacks, Transport Layer Security (TLS), certificate management, and standard networking protocols and configurations.
Data & Storage Systems
- Experience working with SQL databases; familiarity with PostgreSQL is a plus.
- Ability to reason about performance, availability, and scaling characteristics of data-intensive systems.
Programming & Systems Engineering
- Strong programming ability in Go, Python, or Rust, with a proven ability to build and maintain production-quality tools, services, and automation.
- Comfortable reviewing, shaping, and influencing code across multiple teams and services.
Linux & Infrastructure Fundamentals
- Deep experience with Linux systems, including performance tuning, capacity planning, and low-level system troubleshooting.
Incident Management & Operational Excellence
- Extensive experience leading high-severity incidents, managing cross-team response, and driving post-incident reviews.
- Ability to translate incident learnings into systemic fixes, architectural changes, and improved operational standards.
We look forward to seeing how you can make an impact at Hydrolix.
…