Lead Site Reliability Engineer

Office in Hyderabad, Bangalore, Pune, Gurgaon, Chennai

Site Reliability Engineering

We are inviting applications for the role of Lead Site Reliability Engineer. The selected candidate will be instrumental in optimizing our infrastructure and application performance through proactive system management, automation, and monitoring. This role suits individuals with deep knowledge of cloud architectures and a track record of keeping services running without interruption.

Responsibilities

Design, build, and maintain scalable, reliable, and efficient cloud infrastructure on platforms such as AWS and Azure
Automate repetitive tasks and system deployments in cloud environments using Python, Bash, or PowerShell
Implement and manage automation tools such as Jenkins, GitLab, and Ansible/Chef for seamless deployment, monitoring, and management of systems
Monitor overall system performance and troubleshoot proactively to maintain high availability and optimal operation
Use tools such as Grafana, New Relic, Splunk, or Dynatrace for monitoring, alerting, and logging, so that potential issues in cloud infrastructure are resolved before they escalate
Work with containerization and orchestration technologies, including Docker and Kubernetes, in cloud-native environments
Understand and apply the concepts of Service Level Indicators (SLIs), Service Level Objectives (SLOs), Service Level Agreements (SLAs), and error budgets in day-to-day operations
Provide on-call support and contribute to incident management and response as required

Requirements

8+ years of relevant working experience
At least 1 year of relevant leadership experience
Proficiency in managing cloud infrastructure, ideally on AWS or Azure
Competency in scripting and programming for cloud environments with Python, Bash, or PowerShell
Experience with automation and configuration management tools such as Jenkins, GitLab, and Ansible/Chef
Familiarity with observability and monitoring solutions such as Grafana, New Relic, Splunk, or Dynatrace
Expertise in deploying and managing containerized applications using Docker and Kubernetes
Knowledge of applying SLI, SLO, SLA, and error budget concepts in operational settings