Company: Hydrolix

Location: Mumbai

Job Description:

We are looking for a Principal Site Reliability Engineer to join our dynamic Services team. In this role, you will contribute to the reliability and scalability of our cutting-edge platform, ensuring exceptional solutions tailored to our customers’ unique needs. This is a highly technical, hands-on role that requires deep expertise in system reliability and automation.

Key Responsibilities:

Reliability Engineering: Design and build automated systems that ensure the reliability and scalability of our Kubernetes clusters and Hydrolix deployments across multiple cloud platforms, eliminating manual operational tasks.
Automation and Efficiency: Identify, quantify, and systematically eliminate toil (repetitive manual work) through automation and improved tooling, freeing the team to focus on high-value work.
Observability Infrastructure: Build and enhance comprehensive observability systems that provide deep visibility into system behavior, enable debugging and troubleshooting, and support data-driven reliability decisions.
CI/CD and Deployment Automation: Design and build robust CI/CD pipelines and deployment automation that enable safe, frequent releases with minimal human intervention.
Infrastructure Reliability: Deploy, maintain, and ensure a highly reliable fleet of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
Service Optimization: Design, implement, and maintain systems and processes to enhance the reliability, availability, and performance of our services.
Root Cause Analysis: Conduct comprehensive root cause analyses for system failures, implementing long-term preventive measures.

Collaboration and Customer Engagement:

Cross-Functional Teamwork: Work closely with software engineering, infrastructure, and product teams to integrate reliability practices into every stage of the development lifecycle.
Knowledge Sharing: Document systems, create runbooks, and share knowledge across the organization to build collective expertise in reliability engineering.
Reliability Advocacy: Champion SRE best practices and foster a culture of operational excellence across the organization.
Reliability Systems: Build and maintain centralized reliability platforms, tools, and services that empower all engineering teams to operate their systems effectively.
Global Team Collaboration: Collaborate with a distributed team of engineers worldwide to provide round-the-clock support and continuously improve our reliability posture.
Customer-Facing Reliability: Work with customers to understand reliability requirements and ensure our platform meets their operational needs.

Qualifications and Skills:

SRE Expertise

A minimum of 10 years of proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role, supporting large-scale, complex distributed systems in production.
Demonstrated ability to operate at a principal level by setting reliability direction, defining standards, and influencing system design across multiple teams.

Architecture, Performance & Scalability

Deep experience designing and evolving system architectures with reliability, scalability, and operability as first-class concerns.
In-depth experience in application and infrastructure performance tuning and scaling to handle heavy workloads under varying traffic patterns and failure scenarios.
Ability to identify systemic bottlenecks, capacity risks, and inefficiencies, and drive long-term architectural improvements.

Automation, Platform & Infrastructure Engineering

Exceptional track record of eliminating toil through automation, including building internal platforms or frameworks that enable safe, scalable self-service.
In-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Pulumi, and Ansible for provisioning and managing infrastructure consistently across environments.

Observability & Reliability Engineering

Deep expertise in observability tools and practices, with the ability to design end-to-end monitoring strategies aligned with business outcomes.
Strong understanding of core reliability concepts, including SLIs, SLOs, SLAs, error budgets, golden signals, and quality gates.
Hands-on experience with distributed tracing, synthetic monitoring, end-user monitoring, performance testing, and chaos engineering.
Proven experience driving blameless postmortems and ensuring learnings result in measurable reliability improvements.

Kubernetes & Distributed Systems

Deep understanding of Kubernetes architecture, operations, failure modes, and ecosystem tooling.
Experience designing and operating multi-cluster and/or multi-region Kubernetes platforms at scale.

Cloud & Multi-Cloud Expertise

Demonstrated proficiency in at least one major cloud platform (AWS, GCP, Azure, or Linode), with experience building cloud-native systems.
Familiarity with multi-cloud or hybrid architectures and the operational trade-offs involved.

Networking, Security & Traffic Management

Experience with network load balancing, traffic management, and capacity planning at scale.
Strong understanding of security technology stacks, Transport Layer Security (TLS), certificate management, and standard networking protocols and configurations.

Data & Storage Systems

Experience working with SQL databases; familiarity with PostgreSQL is a plus.
Ability to reason about performance, availability, and scaling characteristics of data-intensive systems.

Programming & Systems Engineering

Strong programming ability in Go, Python, or Rust, with a proven ability to build and maintain production-quality tools, services, and automation.
Comfortable reviewing, shaping, and influencing code across multiple teams and services.

Linux & Infrastructure Fundamentals

Deep experience with Linux systems, including performance tuning, capacity planning, and low-level system troubleshooting.

Incident Management & Operational Excellence

Extensive experience leading high-severity incidents, managing cross-team response, and driving post-incident reviews.
Ability to translate incident learnings into systemic fixes, architectural changes, and improved operational standards.

We look forward to seeing how you can make an impact at Hydrolix.

…

Posted: February 26th, 2026
