Job Description

Summary

We are looking for a Site Reliability Engineer (SRE) to join the IT AI Infrastructure team to deploy, manage, and optimize AI-powered productivity tools and in-house AI solutions that enhance employee efficiency at scale. A successful candidate will have demonstrated success in similar roles within high-growth, security-conscious environments, bringing deep expertise in public cloud infrastructure (AWS/GCP), backend development (Python, Go, or Java), and automation tooling. The right person is passionate about building scalable and reliable AI infrastructure, driving automation, and collaborating across disciplines to integrate AI systems while maintaining strong security and compliance standards.

What You’ll Be Doing:

  1. Deployment and Management of AI Tools: Deploy, configure, and manage AI-powered employee productivity tools and AI solutions built in-house.
  2. Reliability and Performance: Ensure high availability, reliability, and optimal performance of AI platforms and services. Implement monitoring, alerting, and incident response procedures.
  3. Scalability and Infrastructure: Design and implement scalable infrastructure to support the growing demands of AI tools and their user base. Optimize resource utilization and manage capacity planning.
  4. Automation and Tooling: Develop and maintain automation scripts and tools to streamline deployment, monitoring, and maintenance tasks. Contribute to the experimental sandbox environments for testing new AI solutions.
  5. Collaboration and Support: Collaborate with cross-functional teams (Machine Learning, HR, Security, Data Science, Developer Experience) to support the development and integration of AI solutions. Provide technical support and troubleshooting for AI-related issues.
  6. Security and Compliance: Adhere to security and privacy policies while deploying and managing AI tools. Ensure compliance with regulatory requirements.
  7. Monitoring and Metrics: Implement comprehensive monitoring and metrics to track the performance and health of AI systems. Analyze data to identify areas for improvement and optimization.
  8. Incident Response: Participate in incident response and troubleshooting for AI-related outages or performance issues. Develop and maintain incident response plans.
  9. Backend Development: Contribute to backend development tasks to support the integration and functionality of AI tools.
  10. Public Cloud Management: Deploy and manage AI solutions on public cloud platforms (AWS/GCP), leveraging cloud-native services and best practices.
  11. Written and Verbal Communication: Communicate clearly in writing and in person, and present technical information to non-technical audiences, including senior leadership.

What We Look For In You:

  1. Proven experience as a Site Reliability Engineer (SRE) or in a similar role.
  2. Strong understanding of AI technologies and platforms.
  3. Experience with deploying and managing applications in a cloud environment (AWS/GCP).
  4. Solid backend development experience with programming languages such as Python, Java, or Go.
  5. Strong proficiency in managing and configuring public cloud services (AWS/GCP) for scalability and reliability.
  6. Experience with automation tools and scripting (e.g., Ansible, Terraform, Bash, Python).
  7. Excellent troubleshooting and problem-solving skills.
  8. Strong communication and collaboration skills.
  9. Strong security and compliance understanding.
  10. Experience working in a highly regulated environment.
  11. Experience in a fast-paced, high-growth company.

 

ID: P70538

Pay Transparency Notice: Depending on your work location, the target annual salary for this position falls within the range detailed below. Full-time offers from Coinbase also include target bonus + target equity + benefits (including medical, dental, vision and 401(k)).

Pay Range:

$186,065 to $218,900 USD

Skills
  • AWS
  • Development
  • Java
  • Problem Solving
  • Python
  • Software Engineering
  • Team Collaboration