Head of Site Reliability Engineering (SRE)

Executive recruitment company Monroe Consulting Group Philippines is recruiting on behalf of a leading AI and big data company providing digital transformation, fraud prevention, and process automation services in Asia. Our respected client is seeking an Information Technology (IT) professional for the job of Principal Site Reliability Engineering (SRE) Manager. The job is currently a remote position based in the Philippines.
Job Summary:
The Principal SRE Manager is expected to lead 5 to 7 SRE engineers and partner with the company's Development team to create groundbreaking technology. This is an incredible opportunity to make a meaningful impact on the future of digital banking.
Job Responsibilities:

Participate in SRE software engineering, writing code to continually reduce human intervention in operational tasks and to automate processes.
Lead in-depth technical and data analysis to gauge service trends and drive improvements.
Contribute to the prioritization of reliability features and to the design, development, and delivery of effective tooling, alerts, and automated responses to identify and address reliability risks.
Contribute to proactive technical communication of reliability, stability, and efficiency results (based on Service Level Objectives), service health (via dashboards), and key reliability risks and issues to senior business and technology stakeholders, in order to prioritize activity (based on trend analysis) and direct investment and action.
Design and development of Infrastructure as Code (IaC) with Terraform/Pulumi
Ensuring all infrastructure code is thoroughly tested in a CI pipeline
Ensuring all infrastructure components are highly visible (monitoring, logging, alerting)
Managing cloud environments in accordance with company security guidelines
Troubleshooting incidents, identifying root causes, fixing and documenting problems, and implementing preventive measures
Ensuring application performance, uptime, and scale, maintaining high standards of code quality and thoughtful design
Document best practices regarding application deployment and infrastructure maintenance
Work in tandem with our engineering team to identify and implement optimal solutions for their day-to-day tasks
Provide guidance and mentorship to development teams to build cloud competencies
Provide 24/7 operational support for the digital platform
Optimizing Kubernetes to maintain system uptime, debugging production issues, and running runbooks to mitigate potential production issues
Help guide our overall strategy through design, prototyping, and market research
Communicate with senior management and manage vendors

Job Qualifications:

Degree in Computer Science, Engineering, or equivalent experience.
7+ years' experience in software development and/or SRE functions with at least 4 years in a senior/lead capacity
You are either a Software Engineer with a real interest in systems, networking, monitoring, and automation; or an experienced sysadmin or systems engineer with professional skills in Linux, preferably on distributed systems at scale, and a demonstrable interest and experience in using software engineering to solve operational problems.
Comfortable writing software to automate API-driven tasks at scale. Cloud Tooling engineers primarily use Node.js and/or Java; Kotlin and Go are also key languages in our environment.
Experience automating the build and deployment of software products and understanding the related challenges in distributed systems.
Very good knowledge of and experience with architecting and provisioning cloud-based infrastructure on Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure.
Excellent communication skills (both verbal and written). The ability to communicate confidently and clearly on conference calls, in meetings, via email, etc. at all levels of the organization is essential.
Ability to communicate incident status quickly and clearly via email in business-friendly language
Experience with and advanced understanding of observability, CI/CD, and release management.
Well-rounded knowledge of OS platforms (Linux/UNIX), networking, web systems, and DevOps
Experience working with large-scale distributed systems with an understanding of microservices architecture concepts
Strong organizational skills and the ability to effectively manage multiple tasks simultaneously
Capable of working in a complex, fast-paced environment and able to remain calm during stressful situations