Position Overview
We're looking for an experienced and adaptable Site Reliability Analyst to join a growing Technology Services team. This individual will play a key role in ensuring the operational integrity and long-term scalability of our platforms. The position combines traditional IT support responsibilities with modern reliability engineering methods to create a stable and resilient technology environment that aligns with business priorities.
Key Responsibilities
Partner with engineering and infrastructure teams to assess the performance, resilience, and availability of systems. Advise on design decisions that impact operational reliability.
Simulate potential failure scenarios when new features or architectural changes are deployed. Lead analysis sessions following service disruptions to drive improvements.
Design and coordinate controlled failure testing (chaos engineering) to validate system robustness. Help execute performance assessments to support product readiness.
Provide expert-level support during system outages and client-affecting incidents, and lead troubleshooting efforts.
Ensure that system performance targets and reliability metrics are clearly defined and kept up to date.
Create and update recovery documentation (runbooks) for critical systems, and guide SRE tool and process adoption across teams.
Monitor usage patterns and plan for future capacity needs to maintain system responsiveness and support growth.
Keep infrastructure configurations consistent and up to date across various environments.
Support ad hoc projects and contribute to broader technology initiatives as needed.
Requirements
A bachelor's degree in Computer Science, Information Systems, or a related field, or equivalent practical experience.
5+ years of professional experience in technology operations, systems analysis, or site reliability engineering.
Proven ability to diagnose complex technical issues and communicate solutions clearly.
Familiarity with monitoring platforms, incident management practices, and vendor oversight.
