Senior Site Reliability Engineer - Azure

Office in Hyderabad, Pune, Bangalore, Gurgaon, Chennai, Coimbatore

Site Reliability Engineering

We are seeking a highly skilled and motivated Senior Site Reliability Engineer (SRE) to join our team and lead the charge in building robust, scalable, and secure systems on the Azure platform. In this role, you will be responsible for ensuring the reliability, performance, and efficiency of our cloud-based infrastructure, as well as driving best practices in incident management, observability, and automation.

Responsibilities

Troubleshoot complex distributed systems and networking issues in a cloud-native environment
Ensure optimal performance and stability of Azure-based systems, utilizing tools like Azure Monitor, Log Analytics, and Application Insights
Architect, develop, and maintain Infrastructure as Code (IaC) solutions using ARM templates, Bicep, and Terraform
Implement and enforce observability solutions, develop metrics, and define and monitor SLOs/SLIs
Manage incident response processes and on-call rotations, and conduct post-incident analysis to prevent recurrence
Automate repetitive tasks, leveraging scripting languages such as Python, PowerShell, or Bash
Collaborate with engineering and operational teams to improve system reliability, scalability, and cost-efficiency
Drive continuous improvement in system design and operational processes across the organization
Advocate for SRE culture by promoting best practices in monitoring, deployment, and infrastructure optimization

Requirements

5+ years of experience in SRE, DevOps, or related roles, with a strong track record in cloud environments (Azure experience required)
Deep expertise in troubleshooting distributed systems, networking, and cloud-native architectures
Hands-on experience with Azure monitoring, logging, and automation tools (e.g., Azure Monitor, Log Analytics, Application Insights, ARM templates, Bicep, Terraform)
Proficiency in at least one scripting or programming language (Python, PowerShell, Bash, etc.)
Strong understanding of incident management, on-call operations, and post-incident analysis
Experience in implementing observability solutions and defining SLOs/SLIs
Excellent communication skills and the ability to collaborate cross-functionally in high-pressure situations

Nice to have

Azure certifications (e.g., Azure Solutions Architect Expert, DevOps Engineer Expert)
Experience working in environments with low SRE process maturity, including building practices from the ground up
Familiarity with CI/CD pipelines and infrastructure-as-code practices
Experience mentoring or leading SRE or DevOps teams