Role Title: Senior Systems Engineer II, SRE
Position Summary:
Marriott International is the world's largest hotel company, with more brands, more hotels, and more opportunities for associates to grow and succeed. Be where you can do your best work, begin your purpose, belong to an amazing global team, and become the best version of you.
As a Senior Site Reliability Engineer, you will design and implement highly reliable, scalable, and secure systems. You will lead incident response, improve operational processes, and mentor junior engineers. This role requires strong technical expertise in cloud infrastructure, automation, observability, and reliability engineering practices.
Key Responsibilities:
- Define and enforce Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical systems, ensuring alignment with business goals.
- Monitor and improve MTTR (Mean Time to Recovery) and MTTD (Mean Time to Detect) across services to enhance operational resilience.
- Design scalable AWS-based systems, including multi-region deployments and disaster recovery strategies.
- Develop reusable, versioned Terraform modules and maintain scalable Ansible configurations.
- Design and maintain reliable CI/CD pipelines using tools like Harness.
- Build and optimize observability systems using Dynatrace and other platforms for proactive monitoring and root cause analysis.
- Lead incident response efforts, conduct root cause analysis, and implement process improvements.
- Manage production databases (SQL/NoSQL), including replication, failover, and performance tuning.
- Implement security controls and ensure compliance with organizational standards.
- Conduct capacity planning and performance tuning for critical systems.
- Collaborate with application support teams and manage vendor relationships to ensure timely resolution of issues and adherence to SLAs.
- Mentor engineers, collaborate across teams, and influence reliability practices organization-wide.
Required Skills & Experience:
Mandatory Skills:
- Minimum of 6 years of relevant experience.
- Advanced Linux/Windows tuning and hardening.
- Strong proficiency in Bash and Python for production automation.
- Expertise in AWS services and scalable architecture design.
- Hands-on experience with Terraform and Ansible.
- Proficiency in pipeline design and release engineering.
- Experience with Dynatrace, Prometheus, Grafana, ELK, or similar platforms.
- Strong understanding of SLOs, SLIs, error budgets, and operational metrics like MTTR and MTTD.
Good to Have:
- Proven ability to lead and improve incident management processes.
- Ability to coordinate with third-party vendors for application support and issue resolution.
- Knowledge of security best practices and compliance frameworks.
- Strong leadership, mentoring, and cross-team collaboration abilities.
Preferred Qualifications:
- Experience with multi-account AWS architecture.
- Familiarity with automation and self-healing systems.
- Knowledge of performance modeling and long-term capacity forecasting.
- Certifications in AWS, Terraform, or SRE practices.
Education and Certifications:
- Undergraduate degree or equivalent experience/certification.
Work location: Hyderabad, India.
Work mode: Hybrid
…