Senior Site Reliability Engineer - Remote - Military veterans preferred



Full-time employee

United States

Job Summary

This role can be based at 433 W Van Buren Street, Chicago, or performed as a remote work-from-home arrangement.
The Senior Site Reliability Engineer will be a key technologist who will help grow our systems into best-in-class for stability, observability, and enterprise scale in the healthcare services space. This role is responsible for the availability and reliability of critical platform services and applications, ensuring they meet the requirements of internal and external users. They collaborate with technology and business leaders to understand and define the service levels needed for our solutions. They ensure the appropriate metrics for successfully meeting SLAs are defined, measured, and met, and continually evolve to adapt to the changing needs of the business. They accomplish this by working with development and DevOps teams to drive observability features into our solutions and by implementing and using observability tools to ensure SLAs and supporting metrics are monitored and met.
This role is for an experienced engineer who has a mind for stability, an understanding of SRE best practices, and ideas about what could be improved. The person will work across engineering and DevOps teams to ensure our platforms are ready for scale and engineered for resiliency, and will provide feedback loops to developers and DevOps teams for continual improvement and evolution of non-functional requirements. Strong technology problem-solving skills are needed to diagnose and troubleshoot production issues across our technology stack, quickly getting to root cause to resolve issues and providing feedback to other engineering teams for improvement. Ensuring HITRUST controls are implemented, evolved, and followed will help ensure we meet the compliance requirements of our key client partners.

Areas of focus:

  • Availability – ensure max uptime, identify changes needed to weed out failures
  • Latency – measure against SLAs, identify bottlenecks, create NFRs, and recommend changes to address them
  • Efficiency – fast, frequent deployment with little to no impact on customer base (processes like canary and blue/green deployments)
  • Change Management – instill resilience and robustness in new updates and features. Clear identification and tracking of changes and ability to measure impact and revert when needed
  • Issue resolution – resolve production issues, perform root cause analysis, feedback loop to other teams for fixes and improvements
  • Capacity Planning – leverage data to analyze trends and plan for future state capabilities
  • Metrics and measurements – identify service levels, identify metrics to measure to ensure these are met
  • Observability – work with teams to ensure solutions support observability, and implement and manage observability tools
  • Compliance controls and adherence – HITRUST controls implemented and followed
  • Business continuity – backup, DR, resiliency

Job Responsibilities
  • Works with business and technology leaders to define appropriate service level objectives and service level indicators in partnership with product and engineering teams.
  • Analyzes production system operations using tools such as monitoring, capacity analysis, and outage root cause analysis to identify and drive changes that ensure continuous improvement in system stability and performance.
  • Performs real-time troubleshooting and repair of mission-critical application and platform components using critical knowledge of Azure PaaS components and application architecture.
  • Facilitates blameless post-mortems and provides feedback to product development and engineering teams for fixes and prioritization of improvements.
  • Measures and optimizes system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
  • Ensures timely and effective reporting, tracking, follow-up, and communication of problems to internal and external clients, technical resources, and executives.
  • Manages day-to-day issues, including health checks of applications and processes, working closely with end users, development staff, and infrastructure teams to prioritize, resolve, and/or mitigate outages.
  • Defines and implements business continuity strategies and ensures information controls are in place across all environments.
  • Identifies opportunities to automate repeated manual tasks, and develops tools and automation to improve the efficiency of the platform and infrastructure, minimize downtime, implement self-healing patterns, and achieve human-free operations.
  • Designs, develops, and drives troubleshooting and mitigation tools as part of driving a self-healing agenda.
  • Provides primary operational support and engineering for the cloud-based ecosystem.
  • Works across product, engineering, and support teams to understand the deployment lifecycle and the business and application SLIs/SLAs, and creates appropriate dashboards and thresholds for monitoring and alerting to support service level achievement.
  • Defines and drives adoption of best-in-class monitoring frameworks, tools, dashboards, and automation to proactively detect, alert on, and self-heal production anomalies.
  • Partners with development, testing, and DevOps teams to improve service reliability and scalability by influencing architecture, design, and testing processes and by establishing comprehensive release procedures.
  • Participates in system design consulting, platform management, and capacity planning.
  • Works with product and engineering teams to identify and prioritize nonfunctional requirements around resiliency, security, and availability which will help ensure the achievement of platform and solution SLAs.
  • Implements and advocates for applicable HITRUST controls across all tools, platforms, applications, and support processes.
  • Contributes to architectural strategy and roadmap to improve scalability, availability, performance, and security.
  • Implements continuous process improvement, including but not limited to policy, procedures, and production monitoring and alerting.

Basic Qualifications
  • BA/BS + at least 4 years OR High School/GED + at least 7 years of experience developing and monitoring mission-critical cloud-based systems.
  • Knowledge of SRE philosophy, technologies, platforms and tools, SLA management, incident resolution, and automation.
  • Experience working with cloud environments as an SRE, Infrastructure Software Engineer, or DevOps Engineer.
  • Experience with Azure PaaS services, including but not limited to AKS, ADO, App Gateway, Azure Functions, Azure networking, Cosmos, and API management tools.
  • At least 3 years of experience with Infrastructure as Code.
  • Experience utilizing a proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
  • Experience providing problem solving that allows for effective and timely resolution of system issues including but not limited to production outages.
  • Experience managing operations of large-scale internet-centric production environments built with Azure PaaS solutions.
  • At least 2 years of experience contributing to financial decisions in the workplace.
  • At least 2 years of direct leadership, indirect leadership and/or cross-functional team leadership.
  • Willing to travel up to/at least 10% of the time for business purposes (within state and out of state).

Preferred Qualifications
  • Bachelor’s degree in computer science or other highly technical, scientific discipline.
  • Experience with Terraform or ARM.
  • 5+ years of working with cloud environments as an SRE, Infrastructure Software Engineer, or DevOps Engineer.
  • 5+ years of experience developing and monitoring mission-critical cloud-based systems.

The following information is applicable to Colorado only, in accordance with the Colorado Pay Equity Act. In Colorado, an employee in this position can expect a salary/hourly rate between $110,340.00 and $196,200.00 plus bonus pursuant to the terms of any bonus plan, if applicable. The rate will depend on experience, seniority, geographic location, and other factors permitted by law. To review benefits, please click here. Walgreens will provide applicants in other states with information related to the positions, to the extent required by state or local law, by calling 1-866-967-5492.