Site Reliability Engineer at Polygon | Remote | Full-Time

Summary

As a Site Reliability Engineer (SRE) at Polygon Labs, you will play a key role in helping operate and support the production infrastructure that powers the Polygon network. Working alongside experienced SREs and protocol engineers, you will gain hands-on exposure to running large-scale, distributed blockchain systems while learning best practices for reliability, observability, and incident response.

This is an ideal role for someone early in their SRE or infrastructure career who is curious about how production systems work, motivated to learn through real-world operational challenges, and excited to grow within a collaborative and mentorship-driven environment. Your work will directly contribute to the reliability and performance of critical public infrastructure used by developers and users globally.

Your Responsibilities

You will support the day-to-day reliability and operations of Polygon Labs’ production systems, with responsibilities that include:

Monitoring production systems, alerts, dashboards, and logs across Polygon networks, including Polygon PoS and the Agglayer.
Assisting with incident detection, triage, escalation, and resolution under the guidance of senior engineers.
Supporting on-call and operational coverage through structured rotations, with training and mentorship.
Following, maintaining, and improving runbooks and standard operating procedures.
Assisting with routine operational tasks such as service restarts, upgrades, and configuration changes.
Helping maintain and improve monitoring, logging, and alerting systems, including dashboards for network health, RPC performance, and node metrics.
Learning to improve alert signal quality and reduce operational noise.
Supporting cloud-based and containerized infrastructure, including nodes, RPC endpoints, and supporting services.
Collaborating with protocol, product, and cross-functional teams to understand production issues and user impact.
Participating in post-incident reviews and contributing to root-cause analysis documentation.
Continuously building knowledge of blockchain fundamentals, distributed systems, and networking.

What You'll Need

A foundational understanding of Linux systems, processes, and basic networking concepts.
Familiarity with at least one scripting or programming language, such as Python, Bash, or Go.
An interest in site reliability, monitoring, and operating production infrastructure.
Clear written and verbal communication skills, with a willingness to ask questions and learn.
The ability to remain calm, methodical, and responsive during incidents or operational events.

Preferred Qualifications

Exposure to cloud platforms such as AWS or GCP.
Familiarity with containerization or orchestration technologies, including Docker or Kubernetes.
Basic understanding of blockchain or Web3 concepts, such as nodes, RPC services, or validators.
Experience with monitoring and observability tools such as Grafana, Prometheus, Datadog, or ELK-based stacks.

Skills

AWS
Communication Skills
Development
Operations
Python
Software Engineering
Team Collaboration
