Executive recruitment company Monroe Consulting Group Philippines is recruiting on behalf of a leading AI and big data company providing digital transformation, fraud prevention and process automation services in Asia. Our respected client is seeking an Information Technology (IT) professional for the job of Principal Site Reliability Engineering (SRE) Manager. The role is currently remote, based in the Philippines.
The Principal SRE Manager is expected to lead 5 to 7 SRE engineers and partner with the company's Development team to create groundbreaking technology. This is an incredible opportunity to make a meaningful impact on the future of digital banking.
- Participate in SRE software engineering, writing code for the continuing reduction of human intervention in operational tasks and automation of processes.
- Lead in-depth technical and data analysis to gauge service trends and drive improvements.
- Contribute to the prioritization of reliability features and to the design, development, and delivery of effective tooling, alerts, and automated responses to identify and address reliability risks.
- Contribute to proactive technical communication of reliability, stability, and efficiency results (based on Service Level Objectives), service health (via dashboards), and key reliability risks and issues to senior business and technology stakeholders, in order to prioritize activity (based on trend analysis) and direct investment and action.
- Design and develop infrastructure as code (IaC) with Terraform/Pulumi
- Ensure all infrastructure code is thoroughly tested in a CI pipeline
- Ensure all infrastructure components are highly visible (monitoring, logging, alerting)
- Manage cloud environments in accordance with company security guidelines
- Troubleshoot incidents, identify root causes, fix and document problems, and implement preventive measures
- Ensure application performance, uptime, and scale while maintaining high standards of code quality and thoughtful design
- Document best practices regarding application deployment and infrastructure maintenance
- Work in tandem with our engineering team to identify and implement optimal solutions for their day-to-day tasks
- Provide guidance and mentorship to development teams to build cloud competencies
- Provide 24/7 operational support for the digital platform
- Optimize Kubernetes to maintain system uptime, debug production issues, and run runbooks to mitigate potential production issues
- Help guide our overall strategy through design, prototyping, and market research
- Communicate with senior management and manage vendors
- Degree in Computer Science, Engineering, or equivalent experience.
- 7+ years' experience in software development and/or SRE functions with at least 4 years in a senior/lead capacity
- You are either a Software Engineer with a real interest in systems, networking, monitoring, and automation; or an experienced sysadmin or systems engineer with professional skills in Linux, preferably on distributed systems at scale, and a demonstrable interest and experience in using software engineering to solve operational problems.
- Comfortable writing software to automate API-driven tasks at scale. Cloud Tooling engineers primarily use Node.js and/or Java; Kotlin and Go are also key languages in our environment.
- Experience automating the build and deployment of software products and understanding the related challenges in distributed systems.
- Very good knowledge of and experience with architecting and provisioning cloud-based infrastructure on Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure.
- Excellent communication (both verbal and written). The ability to communicate confidently and clearly on conference calls, in meetings, via email, etc. at all levels of the organization is essential.
- Ability to communicate incident status quickly and clearly via email in business-friendly language
- Experience and advanced understanding of Observability, CI/CD, and release management.
- Well-rounded, broad knowledge of OS platforms (Linux/UNIX), networking, web systems, and DevOps
- Experience working with large-scale distributed systems with an understanding of microservices architecture concepts
- Strong organizational skills and the ability to effectively manage multiple tasks simultaneously
- Capable of working in a complex, fast-paced environment and able to remain calm during stressful situations