If you’re looking for a career where you can make a real impression, join Global Service Center (GSC) HSBC and discover how valued you’ll be. HSBC is one of the largest banking and financial services organisations in the world, with operations in 64 countries and territories. We aim to be where the growth is, enabling businesses to thrive and economies to prosper, and, ultimately, helping people to fulfil their hopes and realise their ambitions.
We are currently seeking an experienced individual to join our team in the role of Site Reliability Engineer Senior.
· This is a thought leadership position aimed at establishing and evangelizing the SRE practice within the team and across the bank.
· Senior SREs will own and be accountable for delivery of key initiatives and projects.
· A Senior SRE is a mentor and team leader for junior SREs.
The Role
As a Site Reliability Engineer Senior, your responsibilities will be to:
- Ensure the availability and maintainability of our large-scale API and Microservices platform located across three points of presence in HK, UK, and the US.
- Continuously improve the reliability, capacity, and performance of our platforms by applying SRE principles and practices to drive scale, enhance observability, reduce toil, measure risk more accurately, and enable business-driven change more safely.
- Elevate our expertise and maturity in safely managing our core technology stack underpinned by AWS, Kubernetes, Kong API gateway, MuleSoft API, Istio Service Mesh, and a host of supporting services in a hybrid hosting environment (i.e., private/public cloud & on-prem).
- Develop best-in-class observability tools and techniques, enabling monitoring and alerting capabilities that facilitate not only incident detection and response, but also capacity management, improved release safety, and greater resource efficiency.
- Investigate, triage, and resolve production incidents, and use data to articulate their impact, with relentless attention to the technical signals and underlying root causes that enable remediation and future avoidance/mitigation.
- Contribute to the design and engineering of auto-healing and self-healing capabilities for known failure modes across our platforms.
- Contribute code to our platform repositories enabling not only our reliability agenda (e.g., monitoring-as-code), but also higher release speed and safety, simpler tenant onboarding, and improved controls.