Site Reliability Engineer (Fully Remote)

Site Reliability Engineer (Fully Remote) | BRAMKAS INC | United States

Site Reliability Engineer (SRE)

The Site Reliability Engineer (SRE) will be responsible for both uplifting and maintaining our evolving technology platforms, infrastructure and technology controls. The role includes oversight of production operations for our systems as well as development and engineering of solutions to maximize system reliability and automation. The SRE will work with a global team responsible for a mission-critical business function, and will partner with the Infrastructure, DevOps and Core practices teams (such as Security, Identity, ProdOps, Cloud Platform and Tools) to identify and implement automation opportunities that drive down toil, reduce technical debt and improve system reliability.

Key Skills: Python, Monitoring, New Relic, Prometheus, Observability, Scalability, Reliability Operation

Skills Required

The successful candidate will have the following attributes/qualifications:
- 5+ years of Development and Operations experience building and running applications in production with uptime over 99%, related experience and/or training, or an equivalent combination of education and experience.
- 3-5 years of experience as an SRE handling web-scale applications.
- Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, Bash, etc.
- Good understanding of Observability (monitoring, logging, tracing, metrics) and Chaos engineering concepts.
- Proficiency in using the Application Performance Monitoring (APM) tool New Relic for monitoring, logging and tracing.
- Expert-level hands-on knowledge of the public cloud platforms AWS and/or Google Cloud Platform. A professional-level certificate on one of the public clouds is highly desirable.
- Must have hands-on experience using configuration management systems such as Ansible or SaltStack and infrastructure automation tools like Terraform or CloudFormation.
- Should have used alerting systems such as PagerDuty.
- Should have implemented solutions around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for services. Measurement should have been within a system and across systems in distributed systems.
- Should have supported Production Incidents (PIs) on critical applications of a company.
- Troubleshoot, debug and diagnose operational issues and drive them to closure.
- Understanding of software delivery life cycles, particularly Agile/Lean and DevOps.
- Proven experience handling large-scale and growing infrastructure across data centers and heterogeneous cloud platforms.
- Experience as a service owner managing large, geographically diverse stakeholders.
- Ability to work with a creative, fast-growing engineering team and motivate them to deliver their best work.
- History of driving innovation.
- Bachelor’s/Master’s degree.

Skills: Nice to Have

Familiarity with:
- Containerization: Kubernetes, Docker, Rancher, etc.
- Kafka, Yarn, Elasticsearch, etc.
- Source code management and implementation of security best practices.
- Tech stack: Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce.
- Networking knowledge.
- Contribution to the open source community.

Key Responsibilities:
- Work with DevOps teams to build, release, monitor and run the services to improve service reliability.
- Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python.
- Write automation to reduce toil and eliminate repeatable manual tasks.
- Work with Ansible, Puppet, Chef, Terraform or another config management / orchestration suite, know where it is broken, work towards fixing it and explore new alternatives.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability.
- Handle cross-team performance issues, from identifying the cause and determining the areas of improvement to driving those actions to closure.
- Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology and engineering practices.
- Define, measure and improve Reliability Metrics (SLO/SLI), Observability (monitoring, logging and tracing solutions) and Ops processes (Incident and Problem Management), and streamline and automate release management.
- Build dashboards to provide visibility into the performance of the applications.
- Understand the current process and system setup, and propose the improvements needed in processes and technology so that the application exceeds the desired Service Level Objective.
- Be a strong believer in automation as a source of sustained continuous improvement: automate toil and runbooks, and improve the ability of applications to auto-heal, leading to improved reliability.