Senior Systems Engineer (DevOps & SRE)
Site Reliability Engineering, DevOps
Hyderabad, Bangalore, Pune, Gurgaon, Chennai
We are looking for a skilled and driven Site Reliability Engineer (SRE) to join our team.
The chosen candidate will play a key part in safeguarding the reliability, scalability, and performance of our infrastructure and applications, including capacity planning. If you have a strong background in software engineering, system administration, containerization, and cloud technologies, you might be our ideal candidate.
Responsibilities
- Crafting, implementing, and managing scalable, reliable, and secure cloud infrastructure using tools such as Terraform, Kubernetes, and Docker
- Building and maintaining monitoring and alerting systems for application and infrastructure health and performance with tools such as Prometheus, Grafana, and the ELK stack
- Leading response efforts for critical incidents, conducting root cause analysis, and implementing long-term fixes to prevent recurrence
- Developing, maintaining, and optimizing continuous integration and continuous deployment (CI/CD) pipelines using tools like Jenkins, GitLab CI, or CircleCI
- Automating routine tasks and enhancing efficiency through scripting and tools, employing languages such as Python, Bash, or Go
- Implementing and managing security best practices for infrastructure and applications, including vulnerability assessments, penetration testing, and adherence to security standards
- Cooperating closely with development, QA, and operations teams to ensure smooth integration and deployment of new features and updates
- Conducting capacity planning and scaling infrastructure to meet present and future demands
- Creating and maintaining thorough documentation for infrastructure, processes, and procedures
Requirements
- A minimum of 5 years of experience in a DevOps/SRE role
- Solid experience with cloud platforms such as AWS, GCP, or Azure
- Proficiency in infrastructure as code (IaC) tools such as Terraform or CloudFormation
- Significant experience with containerization and orchestration (Docker, Kubernetes)
- In-depth knowledge of CI/CD tools (Jenkins, GitLab CI, CircleCI)
- Proficiency in scripting languages (Python, Bash)
- Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack)
- Ability to participate in capacity planning and scalability assessments to support business growth and requirements
- Familiarity with SLI, SLO, SLA, and error budget concepts and their implementation
- Willingness to provide on-call support and participate in incident management and response activities as needed
- Solid grasp of networking and security principles
- Exceptional problem-solving skills and the ability to work under pressure
- Strong communication and collaboration skills
Benefits
For you
- Insurance Coverage
- Paid Leave – including maternity, paternity, bereavement, and special COVID-19 leave
- Financial assistance for medical crises
- Retiral Benefits – VPF and NPS
- Customized Mindfulness and Wellness programs
- EPAM Hobby Clubs
For your comfort at work
- Hybrid Work Model
- Soft loans to set up workspace at home
- Stable workload
- Relocation opportunities through the ‘EPAM without Borders’ program
For your growth
- Certification training for technical and soft skills
- Unlimited access to the LinkedIn Learning platform
- Access to internal learning programs set up by world-class trainers
- Community networking and idea creation platforms
- Mentorship programs
- Self-driven career progression tool