Site Reliability Engineer

  • Pune
  • LTIMindtree

About the Job:


Position: SRE DevOps

Location: Chennai/Bangalore/Hyderabad/Pune/Mumbai

Experience: 5 to 8 years


Primary Skills: SRE, Dynatrace, Prometheus, Grafana, Kubernetes, AWS native components, CloudWatch, Puppet/Chef/Ansible, CDK

Responsibilities

• Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.

• Improve the end-to-end availability and performance of mission-critical services, and build automation to prevent problem recurrence.

• Partner with business and technical product owners to set SLOs, SLIs, and error budgets to manage the reliability of infrastructure and applications.

• Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them to improve reliability and efficiency.

• Maintain infrastructure (as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments.

• Practice sustainable incident response and blameless postmortems.

• Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems.

• Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.

• Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems.

Tech skills

• Bachelor’s degree in design, computer science, or a related technical field

• Strong debugging, troubleshooting, and problem-solving skills

• Proficient in Node.js; familiarity with other scripting languages and tools (JavaScript, Python, Maven, Ansible, Bash, etc.) is a plus.

• Experience with monitoring and alerting systems such as Dynatrace, Prometheus, and Grafana.

• Experience with log and metrics analytics platforms such as Sumo Logic and Splunk.

• Experience setting SLOs, SLIs, and error budgets, and managing reliability for infrastructure and applications using Kubernetes, AWS native components, CloudWatch, and Dynatrace.

• Experience handling large numbers of diverse systems with configuration management systems such as Puppet, Chef, or Ansible.

• Proven history of leveraging automation

• Experience using tools like PagerDuty for managing incidents.

• Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing strategies

• Experience in Serverless Application Framework

• Experience in containerized workloads and management platforms such as Docker or Kubernetes

• Familiarity with distributed systems, including microservices, is a plus.

• Experience with infrastructure automation tools such as CDK.

• Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, and Bamboo.

• Effective communication, collaboration & negotiation skills with the ability to interface with various business units and vendors.

• Experience liaising with developers, operations engineers, and third-party resources.

• Experience consuming APIs.

Soft Skills

• Ability to work in a team and independently.

• Excellent verbal and written communication skills

• Multitasking

• Time management