Senior Systems Engineer (DevOps & SRE)

Site Reliability Engineering, DevOps

Hyderabad, Bangalore, Pune, Gurgaon, Chennai

We are looking for a skilled and driven Site Reliability Engineer (SRE) to join our team.

The chosen candidate will play a key part in ensuring the reliability, scalability, and performance of our infrastructure and applications, including capacity planning. If you have a rich background in software engineering, system administration, containerization, and cloud technologies, you might be our ideal candidate.

Responsibilities

Crafting, implementing, and managing scalable, reliable, and secure cloud infrastructure using tools such as Terraform, Kubernetes, and Docker
Building and maintaining monitoring and alerting systems for application and infrastructure health and performance with tools such as Prometheus, Grafana, and the ELK stack
Leading response efforts for critical incidents, conducting root cause analysis, and implementing long-term fixes to prevent recurrence
Developing, maintaining, and optimizing continuous integration and continuous deployment (CI/CD) pipelines using tools like Jenkins, GitLab CI, or CircleCI
Automating routine tasks and enhancing efficiency through scripting and tools, employing languages such as Python, Bash, or Go
Implementing and managing security best practices for infrastructure and applications, including vulnerability assessments, penetration testing, and adherence to security standards
Collaborating closely with development, QA, and operations teams to ensure smooth integration and deployment of new features and updates
Conducting capacity planning and scaling infrastructure to meet present and future demands
Creating and maintaining thorough documentation for infrastructure, processes, and procedures

Requirements

A minimum of 5 years of experience in a DevOps/SRE role
Solid experience with cloud platforms such as AWS, GCP, or Azure
Proficiency in infrastructure as code (IaC) tools such as Terraform or CloudFormation
Significant experience with containerization and orchestration (Docker, Kubernetes)
In-depth knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI)
Proficiency in scripting languages (Python, Bash)
Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack)
Ability to participate in capacity planning and scalability assessments to support business growth and requirements
Familiarity with SLI, SLO, SLA, and error budget concepts and their implementation
Willingness to provide on-call support and participate in incident management and response activities as needed
Solid grasp of networking and security principles
Exceptional problem-solving skills and the ability to work under pressure
Strong communication and collaboration skills

Benefits

Insurance coverage
Paid leave, including maternity, paternity, bereavement, and special COVID-19 leave
Financial assistance for medical crises
Retiral Benefits – VPF and NPS
Customized Mindfulness and Wellness programs
EPAM Hobby Clubs

Community

Flexible and hybrid work opportunities
Soft loans to set up workspace at home
Relocation and mobility programs

Professional development

Access to soft skills training in general communication, presenting and public speaking, diversity, equity and inclusion (DEI), cultural intelligence, self-productivity, well-being, and more
Unlimited access to the LinkedIn Learning Library, including 22,000+ courses
Access to internal learning platforms, EPAM University, and a wide range of professional communities and competency centers
Community networking and idea creation platforms
Mentorship programs
Self-driven career progression tool
Upskilling, reskilling and certification courses