
Site Reliability Engineer

Skills: Site Reliability Engineering, Datadog, Dynatrace, Splunk, Grafana, Jenkins, Kubernetes, Amazon Web Services, Python, Linux
Locations: Hyderabad, Bangalore, Pune, Gurgaon, Chennai, Mumbai

We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our organization.

The SRE will play a crucial role in ensuring the reliability, scalability, and performance of our infrastructure and applications, and in planning their capacity. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.

Responsibilities
  • Design, build, and maintain scalable, reliable, and efficient cloud infrastructure and services on platforms like AWS, Azure, or Google Cloud
  • Automate manual work using scripting/programming languages such as Python, Bash, or PowerShell, especially within cloud environments
  • Utilize automation tools like Jenkins, GitLab, and Ansible/Chef to streamline deployment, monitoring, and management of systems and applications in the cloud
  • Monitor system performance and proactively troubleshoot issues to ensure high availability and performance
  • Employ observability tools like Prometheus, Grafana, ELK stack, Splunk, Dynatrace, or Datadog for monitoring, alerting, and logging
  • Participate in capacity planning and scalability assessments to support business growth and requirements
  • Manage containerization and orchestration technologies such as Docker and Kubernetes, particularly in cloud-native environments
  • Implement security best practices and standards to safeguard data and systems in the cloud
  • Continuously evaluate and recommend new technologies and practices to improve system reliability and efficiency
  • Document processes, procedures, and configurations to maintain system integrity and facilitate knowledge sharing
Requirements
  • 3–5 years of relevant experience
  • Proficient in designing and maintaining cloud infrastructure on AWS, Azure, or Google Cloud
  • Strong scripting and programming skills in languages like Python, Bash, or PowerShell
  • Experience with automation tools such as Jenkins, GitLab, and Ansible/Chef
  • Excellent communication and collaboration skills
  • Experience with observability tools like Prometheus, Grafana, ELK stack, Splunk, Dynatrace, or Datadog
  • Hands-on experience with Docker, Kubernetes, or similar containerization and orchestration technologies
  • Knowledge of security best practices for cloud environments
  • Familiarity with SLI, SLO, SLA, and Error Budget concepts
  • Strong problem-solving skills and ability to troubleshoot complex issues under pressure
Nice to have
  • Experience with Agile methodologies and DevOps practices
  • Certifications in cloud technologies (AWS, Azure, Google Cloud)
  • Advanced knowledge of network and security architecture

Benefits

  • Insurance coverage 
  • Paid leave, including maternity, bereavement, paternity, and special COVID-19 leave
  • Financial assistance for medical crisis 
  • Retiral Benefits – VPF and NPS 
  • Customized Mindfulness and Wellness programs 
  • EPAM Hobby Clubs
Community
  • Flexible and hybrid work opportunities
  • Soft loans to set up workspace at home 
  • Relocation and mobility programs

Professional development

  • Access to soft skills training in general communication, presenting and public speaking, diversity, equity and inclusion (DEI), cultural intelligence, self-productivity, well-being, and more
  • Unlimited access to the LinkedIn Learning Library, including 22,000+ courses 
  • Access to internal learning platforms, EPAM University and a wide range of professional communities and competency centers  
  • Community networking and idea creation platforms 
  • Mentorship programs 
  • Self-driven career progression tool
  • Upskilling, reskilling and certification courses