Site Reliability Engineer

  • Pune
  • LTIMindtree
About the Job:
Position: SRE DevOps
Location: Chennai/Bangalore/Hyderabad/Pune/Mumbai
Experience: 5 to 8 years only
Primary skills: SRE, Dynatrace, Prometheus, Grafana, Kubernetes, AWS Native components, CloudWatch, (Puppet/Chef/Ansible), CDK

Responsibilities
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Improve the end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
• Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage the reliability of infrastructure and applications.
• Scale and optimize existing infrastructure and services sustainably through mechanisms, including automation, and evolve them by improving reliability and efficiency.
• Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
• Maintain infrastructure (infrastructure as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging and Production environments.
• Practice sustainable incident response and blameless postmortems.
• Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems.
• Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.
• Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems.

Tech skills
• Bachelor’s degree in design, computer science, or a related technical field
• Strong debugging, troubleshooting, and problem-solving skills
• Proficient in Node.js; familiarity with other scripting languages and tools is a plus: JavaScript, Python, Maven, Ansible, Bash, etc.
• Experience with monitoring and alerting systems like Dynatrace, Prometheus, Grafana
• Experience with log and metrics analytics platforms like Sumo Logic, Splunk
• Experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS Native components, CloudWatch, Dynatrace
• Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible
• Proven history of leveraging automation
• Experience using tools like PagerDuty for managing incidents
• Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing strategies
• Experience in Serverless Application Framework
• Experience with containerized workloads and management platforms such as Docker or Kubernetes
• Familiarity with distributed systems, including microservices, is a plus
• Experience with infrastructure automation tools such as CDK
• Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, Bamboo
• Effective communication, collaboration, and negotiation skills with the ability to interface with various business units and vendors
• Experience liaising with developers, operations engineers, and third-party resources
• Experience consuming APIs

Soft Skills
• Ability to work in a team and independently
• Excellent verbal and written communication skills
• Multitasking
• Time management