Lead Site Reliability Engineer (SRE)
Site Reliability Engineering, Datadog, Dynatrace, Splunk, Grafana, Jenkins, Kubernetes, Amazon Web Services, Python, Linux
Hyderabad, Bangalore, Pune, Gurgaon, Chennai
We are seeking a talented and motivated Lead Site Reliability Engineer (SRE) to join our organization.
The Lead SRE will play a crucial role in ensuring the reliability, scalability, and performance of our infrastructure and applications, and in planning capacity for them. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.
Responsibilities
- Design, build, and maintain scalable and reliable cloud infrastructure and services on platforms such as AWS, Azure, or Google Cloud
- Automate manual work using scripting/programming languages like Python, Bash, or PowerShell, particularly within cloud environments
- Utilize automation tools like Jenkins, GitLab, and Ansible/Chef to streamline deployment, monitoring, and management of systems and applications in the cloud
- Monitor system performance, proactively troubleshoot issues, and ensure high availability using observability tools like Prometheus, Grafana, or the ELK stack
- Participate in capacity planning and scalability assessments to support business growth, focusing on cloud resource optimization
- Implement containerization and orchestration technologies such as Docker and Kubernetes, particularly in cloud-native environments
- Ensure compliance with security best practices and standards to safeguard data and systems in the cloud
- Continuously evaluate and recommend new technologies and practices to improve system reliability, performance, and efficiency in the cloud
- Document processes, procedures, and configurations to maintain system integrity and facilitate knowledge sharing
- Provide on-call support and participate in incident management and response as needed
Requirements
- 8-13 years of experience in a similar role
- Prior leadership experience or team management skills
- Experience with cloud platforms like AWS, Azure, or Google Cloud
- Proficiency in scripting/programming languages such as Python, Bash, or PowerShell
- Experience with automation tools like Jenkins, GitLab, and Ansible/Chef
- Strong communication and collaboration skills
- Experience with observability tools such as Prometheus, Grafana, the ELK stack, or similar
- Hands-on experience with Docker, Kubernetes, or similar technologies
- Knowledge of security practices and standards in cloud environments
- Experience with SLI, SLO, SLA, and error budget concepts
- Strong problem-solving skills and ability to troubleshoot complex issues under pressure
- Familiarity with Agile methodologies and DevOps practices
- Excellent documentation skills
Nice to have
- Certifications in cloud technologies (AWS, Azure, Google Cloud)
- Contributions to open-source projects