Site Reliability Engineering Principal [T500-**]

  • Hyderabad
  • FedEx Acc

Requirements:

  • Cloud Platforms: Advanced proficiency in one or more cloud platforms such as AWS, Azure, or Google Cloud Platform (GCP), including expertise in core compute, storage, database, and networking services (for example, EC2, S3, RDS, and VPC networking on AWS).
  • Container Orchestration: Strong experience with container orchestration platforms such as Kubernetes, including deployment, scaling, and management of containerized applications.
  • Configuration Management and Automation: Proficiency in configuration management tools such as Ansible, Puppet, or Chef, with a strong emphasis on automation and infrastructure as code (IaC) practices.
  • Monitoring and Observability: Hands-on experience with monitoring and observability tools such as Splunk, Prometheus, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), or similar solutions for real-time system monitoring, logging, tracing, and alerting.
  • Continuous Integration/Continuous Deployment (CI/CD): Experience with CI/CD pipelines and tools such as Jenkins, GitLab CI/CD, CircleCI, or Travis CI, including automated testing, deployment, and rollback strategies.
  • Infrastructure as Code (IaC): Proficiency in IaC tools such as Terraform or CloudFormation for provisioning and managing infrastructure resources declaratively.
  • Scripting and Automation: Strong scripting skills in languages such as Python, Shell, or Go for automating repetitive tasks, managing configurations, and orchestrating deployments.
  • Databases and Datastores: Experience with relational databases (e.g., PostgreSQL, MySQL), NoSQL databases (e.g., MongoDB, Cassandra), and time series databases, including performance tuning, replication, and high-availability configurations.
  • Security Best Practices: Familiarity with security best practices for cloud environments, including identity and access management (IAM), encryption, network security, and compliance standards such as PCI-DSS and GDPR.
  • Version Control Systems: Proficiency in version control systems such as Git, including branching strategies, code reviews, and collaboration workflows.
  • Synthetic Monitoring: Experience with synthetic monitoring tools such as New Relic Synthetics, Datadog Synthetics, or Selenium for simulating user interactions and monitoring application performance from external locations.
  • Networking and Architecture: Strong understanding of networking, distributed systems, microservices architecture, and related architectural concepts.
  • Analytical Skills: Excellent problem-solving skills and the ability to troubleshoot complex issues in production environments.

Responsibilities:

  • Efficient Lifecycle Management: Improve application and cloud service lifecycles.
  • Reliable Software Improvement: Boost software dependability for organizational efficiency.
  • Expert Guidance in Reliability: Provide expert direction on reliability practices.
  • Robust Testing Development: Develop effective testing strategies and tools.
  • Adaptable SRE Solutions Implementation: Implement flexible solutions to enhance system stability.
  • Dashboard Development Leadership: Lead the creation of comprehensive SRE dashboards.
  • Optimized Performance Testing Deployment: Deploy specialized tests for peak system performance.
  • Swift Incident Resolution: Resolve production incidents promptly to minimize disruptions.
  • Continuous Service Enhancement: Enhance service reliability through proactive measures.
  • Proactive Anomaly Management: Identify and address anomalies before they impact operations.
  • Automated Dashboard Setup: Streamline dashboard provisioning for efficient operations.
  • Precise Code Debugging: Investigate and resolve issues at the code level efficiently.
  • Seamless Release Integration: Integrate SRE practices seamlessly into the release cycle.
  • Efficient Process Automation: Automate repetitive tasks to save time and resources.
  • Dynamic SRE Solutions Enhancement: Assess and enhance SRE solutions for optimal performance.
  • Collaborative SRE Implementation: Work with teams to implement and refine SRE practices.
  • Proactive System Enhancement: Improve system resilience through proactive initiatives.
  • Effective SRE Training Delivery: Deliver training sessions to spread SRE knowledge across teams.
  • Scalability Strategy Planning: Design strategies for scalable infrastructure growth.
  • Proactive Improvements: Spend at least 50% of your time on proactive improvements to system reliability and resilience.
  • Training: Conduct SRE training sessions.

Nice to have:

  • Previous FedEx experience
  • Master’s degree
  • Domain knowledge in logistics, finance, or supply chain