Job Description
Summary
Responsibilities
- Be part of a DevOps team dedicated to building internal platforms.
- Work closely with internal teams to improve system reliability, scalability, and developer productivity.
- Engage with and improve the quality of the infrastructure supporting the platform.
- Build and manage systems, infrastructure and applications through automation.
- Provide operational support to internal teams working on the platform.
- Work on improvements that increase efficiency, reduce latency, and speed up system deployments.
- Practice sustainable incident response and blameless postmortems.
- Share an on-call rotation with your engineering team and serve as an escalation contact for service incidents.
Minimum Qualifications
- Bachelor's degree with 5+ years of work experience as a Site Reliability Engineer (SRE) or DevOps Engineer.
- Programming experience, preferably in Python or Go.
- Knowledge of Linux internals and Bash scripting.
- Strong skills in observability, debugging, and performance tuning, with a willingness to dig into, debug, and improve any layer of the stack.
- Strong experience managing infrastructure on cloud providers such as AWS.
- Experience with container orchestration systems such as Kubernetes.
- Strong experience with observability platforms such as Prometheus and Grafana.
- Experience with standard DevOps tools for infrastructure management (Terraform, OpenTofu, etc.) and CI/CD (ArgoCD, Jenkins, etc.).
Preferred Qualifications
- Expertise in automation tools such as Ansible and Terraform.
- Expertise in DevOps tooling such as Jenkins, ArgoCD, and GitHub Actions.
- Expertise in advanced observability (USE and RED signals, tracing, front-end observability, etc.) and monitoring stacks, preferably the Grafana stack.
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems.
- Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
- Extensive experience supporting production systems as an SRE.
- Experience setting up monitoring stacks for process-based and Docker-based environments.
Skills
- AWS
- Communication Skills
- Development
- Problem Solving
- Software Engineering