Lead Site Reliability Engineer
Office in Hyderabad, Bangalore, Pune, Gurgaon, Chennai
Site Reliability Engineering & others
We invite applications for the role of Lead Site Reliability Engineer. The selected candidate will be instrumental in optimizing our infrastructure and application performance through proactive system management, automation, and monitoring. This role suits candidates with deep knowledge of cloud architectures and a knack for keeping services running without interruption.
Responsibilities
- Design, build, and maintain scalable, reliable, and efficient cloud infrastructure across platforms such as AWS and Azure
- Automate repetitive tasks and system deployments using Python, Bash, or PowerShell in cloud settings
- Implement and manage automation tools such as Jenkins, GitLab, and Ansible/Chef for seamless deployment, monitoring, and management of systems
- Monitor overall system performance, proactively troubleshooting to ensure high availability and optimal functioning
- Utilize tools like Grafana, New Relic, Splunk, or Dynatrace for effective monitoring, alerting, and logging to preemptively resolve potential issues in cloud infrastructure
- Manage containerization and orchestration technologies, including Docker and Kubernetes, within cloud-native environments
- Apply SLIs, SLOs, SLAs, and error budgets in day-to-day operations
- Provide on-call support and contribute to incident management and response as required
Requirements
- 8+ years of relevant working experience
- At least 1 year of relevant leadership experience
- Proficiency in managing cloud infrastructures, ideally on AWS or Azure
- Competency in scripting and programming with Python, Bash, or PowerShell for cloud environments
- Background in using automation and configuration management tools like Jenkins, GitLab, and Ansible/Chef
- Familiarity with observability and monitoring solutions such as Grafana, New Relic, Splunk, or Dynatrace
- Expertise in deploying and managing containerized applications using Docker and Kubernetes
- Knowledge of applying SLIs, SLOs, SLAs, and error budgets in operational settings