Senior SRE Engineer

Office in Pune

Site Reliability Engineering

We are seeking a Senior SRE Engineer to join our team and drive the reliability, scalability, and performance of our systems. If you are passionate about ensuring production excellence within complex, large-scale environments, this role is for you.

Responsibilities

Apply SRE principles, including SLI, SLO, and error budget management, to enhance service reliability and availability
Define meaningful metrics and alerts using monitoring and observability tools like Dynatrace and Splunk
Manage and improve production environments using Kubernetes, Terraform, and database technologies (SQL/NoSQL)
Develop automation scripts in Python or shell (for example, Bash) to improve operational efficiency
Lead incident management processes, conduct root cause analyses, and implement actionable postmortem improvements
Maintain and optimize CI/CD pipelines using tools like Jenkins, Bamboo, or Concourse, aligning with DevOps standards
Collaborate with cross-functional teams to ensure system reliability and prompt resolution of complex issues
Enhance the scalability, performance, and reliability of distributed systems through innovative engineering solutions
Apply software engineering concepts to support large-scale production environments

Requirements

5-10 years of experience in Site Reliability Engineering or a related field
Strong understanding of SRE principles and practices, including SLI, SLO, and error budget management
Proficiency with monitoring tools like Dynatrace and observability platforms like Splunk
Expertise in Kubernetes, Terraform, and database technologies (SQL/NoSQL) in production environments
Proficiency in scripting languages such as Python or shell (for example, Bash) for automation
Strong knowledge of CI/CD tools like Jenkins, Bamboo, or Concourse, combined with DevOps best practices
Experience with incident management, automated root cause analysis, and leading postmortems
Familiarity with software engineering concepts, system design, and distributed systems at scale
Ability to define and implement system reliability improvements across diverse technical stacks

Nice to have

Degree in a technical field or equivalent practical experience
Familiarity with Java-based applications, Bitbucket, Maven, and Jenkins
Experience with performance tuning and optimization of cloud-native Kubernetes applications
Proficiency in large-scale Infrastructure as Code implementations using Terraform
Background in managing SLAs, SLOs, SLIs, and error budgets for production systems
Understanding of chaos engineering, resilience testing, and advanced reliability practices