Principal Architect, Systems

Company: TMUS Global Solutions
Location: Hyderabad
Job Description:

JOB SUMMARY: The SRE is responsible for rapid and accurate production triage, deep-rooted incident analysis, and driving proactive reliability improvements across cloud-native ecosystems. This role exists to ensure that T-Mobile's most critical systems remain resilient, scalable, and observable at all times. While hands-on engineering is a component of the role, the majority of time is spent providing technical leadership: conducting architecture reviews, mentoring engineers, maturing observability and reliability frameworks, performing performance diagnostics, codifying runbooks, driving incident reviews, refining operational standards, and influencing long-term reliability strategies. The SRE plays a key role in shaping reliability posture across business domains by collaborating on architectural direction, risk assessments, capacity planning, and cross-functional resiliency initiatives.

Key Responsibilities:

  • Lead real-time production triage for highly escalated incidents (app, platform, network, data) and drive mitigation or failover.
  • Design and evolve end-to-end observability (structured logs, metrics, traces, events, correlation IDs) to cut MTTD and eliminate blind spots.
  • Perform deep performance engineering (latency breakdown, GC/heap tuning, thread/async analysis, CPU/memory/I/O profiling) and eliminate tail latency.
  • Analyze incident and alert trends to remove systemic failure modes and reduce repeat occurrences and noisy alert sources.
  • Provide recommendations on optimizing Kubernetes workloads (resource requests/limits, HPA, pod disruption budgets, affinity/anti-affinity, ingress, service mesh traffic) for resilience and efficiency.
  • Build automation and self-healing (runbook codification, dependency health probes, pre-flight deployment guards, drift and config integrity checks).
  • Drive post-incident reviews, producing clear causal chains, durable remediation actions, and tracked ownership to closure.
  • Enhance release and change safety with automated rollback and SLO guardrails.
  • Drive capacity and scalability planning (forecast saturation, right-size clusters, assess quota limits, model concurrency vs throughput) to prevent resource exhaustion.
  • Maintain authoritative runbooks, architecture dependency maps, DR playbooks, and reliability scorecards for transparency and onboarding speed.
  • Partner with development, platform, security, and data teams to embed reliability patterns (idempotency, bulkheads, circuit breakers, backpressure) early in design.
  • Proactively surface emerging risks (error budget degradation, scaling inflection points, capacity shortfalls, aging certificates) before they become incidents.

Must Have:

  • Production triage, troubleshooting, and problem-solving skills, with clear incident communication (concise timeline narration, stakeholder updates, executive summaries, remediation advocacy).
  • Strong production Kubernetes expertise (controllers, scheduling behaviour, networking, ingress, service mesh, resource tuning, multi-cluster operations); CKAD or CKA certification preferred.
  • Proficiency in at least one of Java, Go, or Python for building diagnostic tooling, automation services, performance harnesses, and reliability utilities.
  • Solid database and SQL capability (query tuning, indexing, execution plan analysis) plus familiarity with at least one NoSQL or caching layer (Dynamo, Mongo).
  • Deep observability stack usage (Splunk, Prometheus, Grafana, OpenTelemetry, tracing systems, APM tools) and alert noise reduction techniques.
  • Performance profiling mastery (async-profiler, flame graphs, thread and heap dumps, network and syscall analysis).
  • Strong Linux/Unix internals knowledge (process scheduling, cgroups, kernel signals, network stack, filesystem and I/O, perf/strace/tcpdump/iostat/sar tooling).
  • Automation and infrastructure-as-code experience (Ansible, Helm, GitOps pipelines, CI/CD gating, self-heal workflows).
  • Strong log, metric, and trace correlation skills for root cause isolation across microservices, queues, caches, and external dependencies.
  • Messaging and event streaming familiarity (Kafka, SQS, RabbitMQ) including lag analysis, consumer scaling, ordering, and replay strategies.
  • Ownership mindset with collaborative influence, mentoring peers in production debugging, reliability principles, and continuous improvement discipline.

Additional Skills:

  • Practical SRE framework implementation (SLI taxonomy, SLO lifecycle, error budget policies, toil reduction, reliability scorecards).
  • Distributed systems resilience patterns (circuit breakers, retries with jitter, timeouts, bulkheading, idempotent semantics, backpressure, graceful degradation).

Nice to have:

  • Hands-on multi-region AWS and/or Azure experience (load balancing, autoscaling, Route53/DNS/Azure DNS, storage replication, DR and failover orchestration).
  • Demonstrated proactive risk identification (capacity hotspots, noisy dependencies, cascading failure precursors, config drift, expiring certs/secrets).

Posted: March 12th, 2026