Job Description
Summary
You will lead an experienced Site Reliability Engineering team that keeps our services and tooling available, builds the infrastructure behind our production and testing environments, and streamlines our systems and processes so they are robust, efficient, and easy to deploy.
SDF has a robust career path for both individual contributors and managers.
In this role, you will:
- Establish a clear vision and mandate for the Site Reliability Engineering team
- Define the SRE team's quarterly OKRs to best align with the company's goals
- Define processes of collaboration between SREs and development teams throughout the software development lifecycle
- Define a career growth path for the SRE team, as well as coach and mentor individual contributors on the team
- Define and track metrics across engineering and help hold engineering teams accountable for their KPIs
- Coordinate priorities with other teams and areas of the organization
- Participate in sprint planning and execution, track progress and oversee day-to-day tactical decisions
- Design and build reliable systems and infrastructure that are easy for software engineers to use
- Monitor and troubleshoot systems in production
- Define and participate in 24/7 on-call rotations alongside the team
- Mediate technical discussions and review PRs
- Jump in as needed with code fixes, troubleshooting and hands-on contributions
- Collaborate across the Stellar ecosystem, engaging with key partners and advising on their integration to set them up for success
You have:
- 3+ years of experience working as a Site Reliability Engineer
- 3+ years of experience managing an SRE team
- Site Reliability Engineering experience:
  - Strong track record of collaborating with dev teams at all stages of product development (design, development/CI, beta testing, production)
  - Strong track record of collaborating on defining, measuring and driving improvements in KPIs
  - Strong track record of assisting teams during Root Cause Analysis and postmortems
- Infrastructure and Operations experience:
  - Designing and building out the infrastructure for large distributed systems
  - Maintaining highly available infrastructure
  - Troubleshooting and understanding complex technical problems
  - Using configuration management or IaC tooling such as Terraform, Ansible, Puppet
  - Building and maintaining infrastructure using Kubernetes
- Highly autonomous; able to find clarity in ambiguous circumstances
- Excellent communicator; comfortable working with remote team members
Bonus points if you have (optional):
- 3+ years of experience writing code in a major programming language
- Experience working on an open source project
- Experience managing a distributed team
- A habit of building things for fun in your spare time
We offer competitive pay with a base salary range for this position of $210,000 - $310,000 depending on job-related knowledge, skills, experience, and location.
Skills
- Communication Skills
- Development
- Problem Solving
- Software Engineering
- Team Collaboration