Job Description
Summary
Zero Hash is looking for an experienced and passionate Site Reliability Engineer to join our Platform team.
What you will do:
- Take an active role as co-owner of production services to ensure they are built, maintained, and operated in a reliable and scalable way.
- Be part of the successful delivery of new features and services, as well as the day-to-day operations of existing services.
- Collaborate with Software Engineering to identify and help drive operational improvements through metrics-driven data collection and analysis.
- Develop and maintain performance benchmarks for our applications to ensure a consistent customer experience.
- Help drive operational efficiencies in releasing code and monitoring performance.
- Provide traditional SRE/operational support, such as tooling and automation, monitoring, workflow management, and maintaining and improving CI/CD.
- Participate in our weekly on-call rotation to investigate and resolve potential system issues.
- Get your hands dirty managing and scaling our various infrastructure systems.
Desired Skills:
- You have extensive experience deploying, managing and troubleshooting infrastructure in AWS.
- You have managed the full lifecycle of deploying a container to a production environment using self-managed Kubernetes, ECS, or EKS.
- When the perfect tool wasn’t available, you wrote one yourself and taught others how to use it.
- You understand CI/CD and have built custom tooling to deploy code to production environments.
- You are able to solve problems in distributed Linux systems and are comfortable tracing requests across applications, systems and networks.
- You hold a Certified Kubernetes Security Specialist (CKS) certification.
- You can automate routine tasks and are proficient in at least two programming languages.
- You have fantastic communication skills in both spoken and written forms to explain complex ideas to various audiences.
- You thrive in an environment where collaboration and communication are paramount, but you are also able to solve problems on your own.
Projects you might work on:
- Creating and maintaining application performance benchmarks so we know when our applications are not performing well.
- Improving our CI/CD pipeline to reduce the time it takes from development merge to production deployment.
- Continuing to improve and scale our AWS and application infrastructure.
- Working closely with software development to help optimize the local development workflow.
- Identifying common issues and devising solutions to reduce their impact or remove them altogether.
- Helping implement a scalable solution for blue/green and canary deployments.
Skills
- AWS
- Communication Skills
- Development
- Software Engineering
- Team Collaboration

