Job Description

Summary

We are seeking a Site Reliability Engineer (SRE) to join our team in India.

WHAT YOU'LL DO:

  • Keep your assigned site or service up and running, and restore it quickly when failures occur,
  • Actively troubleshoot issues that arise during testing and production, catching and solving problems before launch,
  • Automate work, including infrastructure needs, testing, failover solutions, failure mitigation, and much more,
  • Monitor and troubleshoot highly scalable, distributed server clusters that perform various functions, from web servers to machine learning processing,
  • Join a PagerDuty rotation to respond to availability incidents and support service engineers with customer incidents,
  • Participate in and help establish best practices in Site Reliability Engineering,
  • Manage code deployments, fixes, updates, and related processes,
  • Work with a close-knit team and brainstorm the best ways to tackle complex problems in infrastructure, security, and monitoring,
  • Provide technical guidance and educate team members and coworkers on monitoring and logging (have an interesting idea or solution? Present it!),
  • Automate software maintenance processes that previously required manual procedures.

WHAT WE'RE LOOKING FOR:

  • 3+ years of experience in software engineering, software development, or system operations in highly available, high-traffic environments,
  • Strong experience with Linux-based infrastructures, Linux/Unix administration, and Azure,
  • Experience with databases such as PostgreSQL,
  • Experience administering Linux servers as well as Docker-based infrastructure (Kubernetes, AKS, etc.) in a highly available environment,
  • Experience with scripting languages such as Python and Bash,
  • Experience with message broker/queue technologies such as RabbitMQ,
  • Experience with modern monitoring, logging, and observability tools for complex distributed systems, such as Application Insights, Grafana, New Relic, Splunk, Elastic Stack, Datadog, and Prometheus,
  • Practical experience with infrastructure-as-code tools (Terraform, Chef, Ansible, etc.),
  • Good understanding of cybersecurity fundamentals and best practices,
  • Experience with containerization and clustering (Dockerfiles, docker-compose, Helm, Kubernetes, etc.),
  • Stellar problem-solving and troubleshooting skills with the ability to spot issues before they become problems,
  • Fluency in English,
  • Excellent oral and written communication skills,
  • Process-oriented mindset with great documentation skills,
  • Solid team player!

Skills
  • Python
  • Software Engineering
  • SQL
© 2024 cryptojobs.com. All rights reserved.