Job Description
Summary
We’re seeking a seasoned Technical Operations Engineer to ensure the stability, reliability, and performance of our production systems. In this key role, you’ll leverage deep technical expertise, particularly in Web3/blockchain technologies, to manage, optimize, and enhance our platform infrastructure. You’ll drive operational excellence through proactive monitoring, meticulous incident management, innovative problem-solving, and collaborative cross-team initiatives.
What You’ll Do
- Blockchain Network Management: Lead the deployment, optimization, and operational management of new blockchain networks. Conduct thorough testing, benchmarking, and continuous improvement of chain reliability and performance.
- Complex Web3 Issue Resolution: Address high-impact Web3 incidents through rigorous troubleshooting, detailed log analysis, JSON-RPC response debugging, and direct coordination with blockchain foundations and ecosystem partners.
- Proactive System Monitoring: Develop and maintain comprehensive monitoring and alerting solutions using advanced dashboards (e.g., Grafana, Datadog), identifying trends, anomalies, and performance bottlenecks before they become critical.
- Incident & SLO Management: Define, implement, and enforce service-level objectives (SLOs) and service-level agreements (SLAs), ensuring measurable standards of system reliability and performance are consistently met.
- Automation & Optimization: Implement and maintain automation solutions (Ansible, Terraform, Kubernetes) to streamline deployments, reduce manual tasks, and optimize cloud infrastructure cost and efficiency.
- Technical Collaboration: Actively collaborate with Tier-1 support, infrastructure, and development teams, ensuring alignment on system improvements, rapid issue resolution, and operational knowledge sharing.
- On-Call Support: Participate in a rotating 24/7 on-call schedule to swiftly address critical system incidents, maintain continuous service delivery, and uphold customer trust.
What You’ll Bring
- Minimum of 5 years in Technical Operations, Site Reliability Engineering (SRE), or related roles. Proven Linux/Unix system administration and advanced troubleshooting capabilities.
- Deep experience managing complex Web3 infrastructures (RPC services, validator setups, node operations). Skilled in interpreting blockchain logs, JSON-RPC responses, and debugging intricate Web3 protocol issues.
- Solid hands-on experience with configuration management and infrastructure automation tools (Helm, Terraform, Ansible, Consul), along with containerization expertise (Docker, Kubernetes) for managing and scaling services in cloud environments.
- Competency in scripting/programming languages (Python, Go, JavaScript).
- Advanced proficiency in monitoring and analytics platforms (Grafana, Datadog), enabling proactive and data-driven operational decision-making.
- Demonstrated ability to identify performance patterns, forecast potential issues, and implement preventive solutions.
- Strong track record of defining, measuring, and maintaining SLAs/SLOs, plus experience with incident response tooling and processes (e.g., PagerDuty) that support quick resolution and systematic root-cause analysis.
- Willing to travel on a limited basis for conferences, offsites, and/or meetings, generally less than 10 days per year.
- Exceptional interpersonal and communication skills, with a proven ability to collaborate effectively across multiple teams and stakeholders.
- Self-motivated, solution-oriented, and consistently striving for operational improvements, quality enhancements, and reduced technical debt.
- Strong professional integrity, with a commitment to transparency, accountability, and ethical behavior. Able to manage complexity, stay adaptable under pressure, and keep learning within a rapidly changing technical landscape.
- Self-starter driven by curiosity and initiative, proactively identifying opportunities, addressing gaps, and implementing solutions autonomously.
- Thrives in dynamic environments and is committed to maintaining industry leadership through close collaboration with the most innovative and talented minds in Web3.
Skills
- Communication Skills
- Development
- Problem Solving
- Python
- Software Engineering
- Team Collaboration