Senior Systems Engineer (DevOps & SRE)
Site Reliability Engineering, DevOps
Hyderabad, Bangalore, Pune, Gurgaon, Chennai
We are looking for a skilled and driven Site Reliability Engineer (SRE) to join our team.
The chosen candidate will play a key part in safeguarding the reliability, scalability, capacity planning, and performance of our infrastructure and applications. If you have a strong background in software engineering, system administration, containerization, and cloud technologies, you might be our ideal candidate.
Responsibilities
- Crafting, implementing, and managing scalable, reliable, and secure cloud infrastructure using tools such as Terraform, Kubernetes, and Docker
- Building and maintaining monitoring and alerting systems for application and infrastructure health and performance with tools such as Prometheus, Grafana, and ELK stack
- Leading response efforts for critical incidents, conducting root cause analysis, and implementing long-term fixes to prevent recurrence
- Developing, maintaining, and optimizing continuous integration and continuous deployment (CI/CD) pipelines using tools like Jenkins, GitLab CI, or CircleCI
- Automating routine tasks and enhancing efficiency through scripting and tools, employing languages such as Python, Bash, or Go
- Implementing and managing security best practices for infrastructure and applications, including vulnerability assessments, penetration testing, and adherence to security standards
- Cooperating closely with development, QA, and operations teams to ensure smooth integration and deployment of new features and updates
- Conducting capacity planning and scaling infrastructure to meet present and future demands
- Creating and maintaining thorough documentation for infrastructure, processes, and procedures
Requirements
- A minimum of 5 years of experience in a DevOps/SRE role
- Solid experience with cloud platforms such as AWS, GCP, or Azure
- Proficiency in infrastructure as code (IaC) tools such as Terraform or CloudFormation
- Significant experience with containerization and orchestration (Docker, Kubernetes)
- In-depth knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI)
- Proficiency in scripting languages (Python, Bash)
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack)
- Ability to participate in capacity planning and scalability assessments to support business growth and requirements
- Familiarity with SLI, SLO, SLA, and Error Budget concepts and their implementation
- Willingness to provide on-call support and participate in incident management and response activities as needed
- Solid grasp of networking and security principles
- Exceptional problem-solving skills and the ability to work under pressure
- Strong communication and collaboration skills
Benefits
- Insurance coverage
- Paid leave, including maternity, paternity, bereavement, and special COVID-19 leave
- Financial assistance in a medical crisis
- Retiral Benefits – VPF and NPS
- Customized Mindfulness and Wellness programs
- EPAM Hobby Clubs
Community
- Flexible and hybrid work opportunities
- Soft loans to set up workspace at home
- Relocation and mobility programs
Professional development
- Access to soft skills training in general communication, presenting and public speaking, diversity, equity and inclusion (DEI), cultural intelligence, self-productivity, well-being, and more
- Unlimited access to the LinkedIn Learning Library, including 22,000+ courses
- Access to internal learning platforms, EPAM University and a wide range of professional communities and competency centers
- Community networking and idea creation platforms
- Mentorship programs
- Self-driven career progression tool
- Upskilling, reskilling and certification courses