Job Description

Summary

You will lead an experienced Site Reliability Engineering team: keeping our services and tooling available, building the infrastructure behind our team's production and testing environments, and streamlining our systems and processes so they are robust, efficient, and easy to deploy.

SDF has a robust career path for both individual contributors and managers.

In this role, you will:

  1. Establish a clear vision and mandate for the Site Reliability Engineering team
  2. Define the SRE team's quarterly OKRs to best align with the company's goals
  3. Define processes of collaboration between SREs and development teams throughout the software development lifecycle
  4. Define a career growth path for the SRE team, as well as coach and mentor individual contributors on the team
  5. Define and track metrics across engineering and help hold engineering teams accountable for their KPIs
  6. Coordinate priorities with other teams and areas of the organization
  7. Participate in sprint planning and execution, track progress, and oversee day-to-day tactical decisions
  8. Design and build reliable systems, as well as infrastructure that is easy for software engineers to use
  9. Monitor and troubleshoot systems in production
  10. Define and participate in 24/7 on-call rotations alongside the team
  11. Mediate technical discussions and review PRs
  12. Jump in as needed with code fixes, troubleshooting and hands-on contributions
  13. Collaborate across the Stellar ecosystem, engaging with key partners and advising on their integration to set them up for success

You have:

  1. 3+ years of experience working as a Site Reliability Engineer
  2. 3+ years of experience managing an SRE team
  3. Site Reliability Engineering experience:
     • Strong track record of collaborating with dev teams at all stages of product development (design, development/CI, beta testing, production)
     • Strong track record of collaborating on defining, measuring, and driving improvements in KPIs
     • Strong track record of assisting teams during root cause analysis and postmortems
  4. Infrastructure and Operations experience:
     • Designing and building out the infrastructure for large distributed systems
     • Maintaining highly available infrastructure
     • Troubleshooting and understanding complex technical problems
     • Using configuration management or IaC tooling such as Terraform, Ansible, or Puppet
     • Building and maintaining infrastructure using Kubernetes
  5. Highly autonomous; able to find clarity in ambiguous circumstances
  6. Excellent communicator; comfortable working with remote team members

Bonus points if (optional):

  1. You have 3+ years of experience writing code in a major programming language
  2. You have worked on an open source project
  3. You have managed a distributed team
  4. You build things for fun in your spare time

We offer competitive pay, with a base salary range for this position of $210,000 to $310,000, depending on job-related knowledge, skills, experience, and location.

Skills
  • Communication Skills
  • Development
  • Problem Solving
  • Software Engineering
  • Team Collaboration