Job Description
Summary
We’re seeking a seasoned Technical Operations Engineer to ensure the stability, reliability, and performance of our production systems. In this key role, you’ll leverage deep technical expertise, particularly in Web3/blockchain technologies, to manage, optimize, and enhance our platform infrastructure. You’ll drive operational excellence through proactive monitoring, meticulous incident management, innovative problem-solving, and collaborative cross-team initiatives.
What You’ll Do
- Blockchain Network Management: Lead the deployment, optimization, and operational management of new blockchain networks. Conduct thorough testing, benchmarking, and continuous improvement of chain reliability and performance.
- Complex Web3 Issue Resolution: Address high-impact Web3 incidents through rigorous troubleshooting, detailed log analysis, JSON-RPC response debugging, and direct coordination with blockchain foundations and ecosystem partners.
- Proactive System Monitoring: Develop and maintain comprehensive monitoring and alerting solutions using advanced dashboards (e.g., Grafana, Datadog), identifying trends, anomalies, and performance bottlenecks before they become critical.
- Incident & SLO Management: Define, implement, and enforce service-level objectives (SLOs) and service-level agreements (SLAs), ensuring measurable standards of system reliability and performance are consistently met.
- Automation & Optimization: Implement and maintain automation solutions (Ansible, Terraform, Kubernetes) to streamline deployments, reduce manual tasks, and optimize cloud infrastructure cost and efficiency.
- Technical Collaboration: Actively collaborate with Tier-1 support, infrastructure, and development teams, ensuring alignment on system improvements, rapid issue resolution, and operational knowledge sharing.
- On-Call Support: Participate in a rotating 24/7 on-call schedule to swiftly address critical system incidents, maintain continuous service delivery, and uphold customer trust.
What You’ll Bring
- Minimum of 5 years in Technical Operations, Site Reliability Engineering (SRE), or related roles. Proven Linux/Unix system administration and advanced troubleshooting capabilities.
- Deep experience managing complex Web3 infrastructures (RPC services, validator setups, node operations). Skilled in interpreting blockchain logs, JSON-RPC responses, and debugging intricate Web3 protocol issues.
- Solid hands-on experience with configuration management and infrastructure automation tools (Helm, Terraform, Ansible, Consul) and with containerization (Docker, Kubernetes), including managing and scaling services in cloud environments.
- Competency in scripting/programming languages (Python, Go, JavaScript).
- Advanced proficiency in monitoring and analytics platforms (Grafana, Datadog), enabling proactive and data-driven operational decision-making.
- Demonstrated ability to identify performance patterns, forecast potential issues, and implement preventive solutions.
- Strong track record of defining, measuring, and maintaining SLAs/SLOs, plus experience with incident response tooling and processes (PagerDuty) that ensure quick resolution and systematic root-cause analysis.
- Exceptional interpersonal and communication skills, with a proven ability to collaborate effectively across multiple teams and stakeholders.
- Self-motivated, solution-oriented, and consistently striving for operational improvements, quality enhancements, and reduced technical debt.
- Strong professional integrity, with a commitment to transparency, accountability, and ethical behavior. Able to manage complexity, stay adaptable under pressure, and keep learning within a rapidly changing technical landscape.
- Self-starter driven by curiosity and initiative, proactively identifying opportunities, addressing gaps, and implementing solutions autonomously.
- Thrives in dynamic environments and is committed to maintaining industry leadership through close collaboration with the most innovative and talented minds in Web3.
Level-Specific Expectations
P1 – Technical Operations Associate
- Execute documented playbooks (node deployment, DNS updates, incident triage) with close guidance.
- Monitor dashboards and PagerDuty; resolve known issues and escalate complex ones within the team.
- Shadow incident response and submit clear shift-handover notes.
P2 – Technical Operations Engineer
- Maintain two to three production chains or subsystems independently during your shift.
- Write or update small Ansible/Terraform modules and simple Bash/Python utilities.
- Act as first incident commander for SEV 2/3 events; publish concise post-incident notes.
- Tune alerts and dashboards to reduce false positives.
P3 – Technical Operations Engineer II
- Lead new chain launches from design review through canary, cut-over, and post-mortem.
- Command SEV 0/1 efforts and drive deep root-cause analysis.
- Define, track, and report SLOs; create capacity and cost models.
- Mentor P1/P2 engineers; perform peer reviews on IaC and observability changes.
- Join customer or partner calls for complex escalations.
P4 – Senior Technical Operations Engineer
- Architect region-wide failover, anycast, and multi-cloud safety controls.
- Build benchmarking harnesses that compare kernels, instance types, and storage back-ends.
- Lead fleet-scale initiatives (e.g., deployment stack updates, platform migrations) with minimal oversight.
- Establish reliability standards adopted by all Core TechOps engineers.
- Coach senior engineers and run design-review teams.
Skills
- Communication Skills
- Development
- Operations
- Python
- Software Engineering
- Team Collaboration

