Site Reliability Engineer

  • Gurugram
  • Airtel Digital
The Site Reliability Engineer is one of the critical roles in the technology team. The person in this role is responsible for application performance, availability, reliability and system uptime. The candidate provides consultation and strategic recommendations by quickly assessing and remediating complex platform availability issues. The Site Reliability Engineer will dive head-first into creating or applying innovative solutions and techniques that advance the reliability of Digital products.

Experience Criteria:
  • 2-5 years of relevant experience

Key responsibilities:
  • Install and deploy new releases and environments for applications.
  • Build and maintain highly scalable, large-scale deployments globally.
  • Co-create and maintain architecture for 100% uptime, e.g. by creating alternate connectivity.
  • Practice sustainable incident response and management, with blameless post-mortems.
  • Monitor and maintain production environment stability.
  • Own entire platforms (production environments).
  • Deploy, automate, maintain and manage production systems to ensure their availability, performance, scalability and security.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Collaborate with Agile teams in defining technical requirements and best practices for containerized and cloud-native applications.
  • Represent production support and site reliability in stand-ups, planning sessions, code reviews and architecture reviews.
  • Help evolve our configuration management (CM) efforts and our move to containers.
  • Help the operations head select enthusiastic, technically knowledgeable team members, and guide the existing team members.

Skills Required:
  • Good working knowledge of applications, middleware, databases (PostgreSQL, MongoDB, MySQL etc.), infrastructure and operating systems.
  • Good understanding of Docker and Kubernetes.
  • Understanding of CI/CD and DevOps tools such as Jenkins and Ansible, and of shell scripting.
  • Monitoring and logging: experience with monitoring and logging tools (e.g. Nagios, AppDynamics, ELK, Prometheus).
  • Good experience with distributed systems such as RabbitMQ, Kafka and Redis.
  • Experience working with Linux, WebLogic, Tomcat, JBoss and other middleware technology.
  • Prior experience with high-traffic, highly scalable systems.
  • Knowledge of the fundamental aspects of release automation (packaging, dependencies, promotion, deployment, compliance).