Lead Site Reliability Engineer (Remote)

Lead Site Reliability Engineer | Sprinto | India

Sprinto is a leading platform that automates information security compliance, raising the bar on information security. We are a team of 250+ employees helping 2,000+ customers across 75+ countries. We are funded by top investment partners Accel, ELEVATION, and Blume Ventures and have raised USD 32 million in funding, including our latest Series B round.

The Role

As the Lead Site Reliability Engineer, you will oversee the observability and CI/CD pipelines as well as full infrastructure management to ensure high availability, scalability, and reliable product delivery.

Responsibilities

Observability Pipeline Management: Take ownership of the observability pipeline to ensure high availability and optimal performance of applications.
CI/CD Pipeline Development: Design, build, and maintain the Continuous Integration/Continuous Deployment (CI/CD) pipelines to facilitate smooth and reliable product deliveries.
Infrastructure Management: Own the complete infrastructure stack of the product, contributing to scalability and enhancements of the overall offering.
Collaboration with Application Engineers: Work closely with application engineers to develop and refine tooling necessary for efficient operations management.
Incident Response and On-Call Process Development: Establish and maintain on-call protocols and incident response processes to ensure timely resolution of issues and maintain service reliability.

Requirements

Expertise in Infrastructure as Code (IaC) Tools: Proficiency with tools such as Terraform and Ansible.
Experience with APM Tools: Skilled in using Application Performance Monitoring tools, setting up on-call practices, identifying bottlenecks across the stack, and collaborating with teams to address these issues effectively.
Application Capacity Planning and Incident Response: Proven experience in application capacity planning, owning incident response workflows, running processes such as Root Cause Analyses (RCAs), and maintaining runbooks.
Problem-Solving and Communication Skills: Strong problem-solving abilities and excellent communication skills, both spoken and written.
Familiarity with Our Tech Stack (Bonus): While experience with our current tech stack is optional, familiarity with it is a plus as it will enable you to start contributing sooner. Our tech stack includes Node.js, React, Apollo GraphQL, PostgreSQL, and AWS.

