Site Reliability Engineer
Site Reliability Engineering, Datadog, Dynatrace, Splunk, Grafana, Jenkins, Kubernetes, Amazon Web Services, Python, Linux
Hyderabad, Bangalore, Pune, Gurgaon, Chennai, Mumbai
We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our organization.
The SRE will play a crucial role in ensuring the reliability, scalability, capacity planning, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.
Responsibilities
- Design, build, and maintain scalable, reliable, and efficient cloud infrastructure and services on platforms like AWS, Azure, or Google Cloud
- Automate manual work using scripting/programming languages such as Python, Bash, or PowerShell, especially within cloud environments
- Utilize automation tools like Jenkins, GitLab, and Ansible/Chef to streamline deployment, monitoring, and management of systems and applications in the cloud
- Monitor system performance and proactively troubleshoot issues to ensure high availability and performance
- Employ observability tools like Prometheus, Grafana, ELK stack, Splunk, Dynatrace, or Datadog for monitoring, alerting, and logging
- Participate in capacity planning and scalability assessments to support business growth and requirements
- Manage containerization and orchestration technologies such as Docker and Kubernetes, particularly in cloud-native environments
- Implement security best practices and standards to safeguard data and systems in the cloud
- Continuously evaluate and recommend new technologies and practices to improve system reliability and efficiency
- Document processes, procedures, and configurations to maintain system integrity and facilitate knowledge sharing
Requirements
- 3 – 5 years of relevant experience
- Proficient in designing and maintaining cloud infrastructure on AWS, Azure, or Google Cloud
- Strong scripting and programming skills in languages like Python, Bash, or PowerShell
- Experience with automation tools such as Jenkins, GitLab, and Ansible/Chef
- Excellent communication and collaboration skills
- Experience with observability tools like Prometheus, Grafana, ELK stack, Splunk, Dynatrace, or Datadog
- Hands-on experience with Docker, Kubernetes, or similar containerization and orchestration technologies
- Knowledge of security best practices for cloud environments
- Familiarity with SLI, SLO, SLA, and Error Budget concepts
- Strong problem-solving skills and ability to troubleshoot complex issues under pressure
Nice to have
- Experience with Agile methodologies and DevOps practices
- Certifications in cloud technologies (AWS, Azure, Google Cloud)
- Advanced knowledge of network and security architecture
Benefits
- Insurance coverage
- Paid leaves – including maternity, bereavement, paternity, and special COVID-19 leaves.
- Financial assistance for medical crises
- Retiral Benefits – VPF and NPS
- Customized Mindfulness and Wellness programs
- EPAM Hobby Clubs
Community
- Flexible and hybrid work opportunities
- Soft loans to set up workspace at home
- Relocation and mobility programs
Professional development
- Access to soft skills training in general communication, presenting and public speaking, diversity, equity and inclusion (DEI), cultural intelligence, self-productivity, well-being and more
- Unlimited access to the LinkedIn Learning Library, including 22,000+ courses
- Access to internal learning platforms, EPAM University and a wide range of professional communities and competency centers
- Community networking and idea creation platforms
- Mentorship programs
- Self-driven career progression tool
- Upskilling, reskilling and certification courses