Executive search firm Monroe Consulting Group Philippines is recruiting on behalf of a large integrated resort and entertainment complex located in a major urban area in the Philippines. Our respected client is seeking a Senior Engineer - Site Reliability who will ensure the stability, efficiency, and availability of applications, services, and infrastructure by using monitoring and observability tools, and by creating and maintaining automation scripts to optimize performance across the technology stack. This position is based in Paranaque, Philippines.

Job Summary:

The Senior Engineer - Site Reliability is responsible for maintaining the health and performance of applications, services, and infrastructure through the use of monitoring and observability tools, and by developing automation scripts to support consistent and optimal system operations across all layers of the technology stack.

Main responsibilities:

Configure and maintain the enterprise monitoring tool to provide real-time visibility into the state of health across the technology stack
Design and create dashboards that provide multi-level views based on functional requirements, such as executive and tactical views
Create and maintain key thresholds across all monitoring elements to ensure proactive, early detection of impending incidents or problems
Analyze events and correlate them across all observability and monitoring tools to capture trends and behavior patterns that support proactive courses of action
Design, develop, and use automation tools and scripts to handle repetitive actions and, where possible, create corrective courses of action to prevent or shorten prolonged outages
Work closely with the operations team during incident and problem management to respond quickly to issues identified by the monitoring tools
Regularly review and optimize infrastructure performance using logs, metrics, and traces as part of continuous improvement, adjusting thresholds and monitoring requirements as the environment constantly changes
Develop and maintain a robust alerting strategy, including integration with on-call tools to ensure timely escalation and resolution of critical issues.
Implement and manage end-to-end event lifecycle processes to ensure accurate incident detection and efficient response.

Qualifications:

Bachelor's degree in Computer Science, Information Technology, or a related field; or equivalent work experience.
2-5+ years of extensive experience as a systems and network administrator
Hands-on experience managing monitoring tools such as, but not limited to, SolarWinds and Nagios
Demonstrated understanding of what observability is and what it does
Proficient with major cloud platforms such as AWS, GCP, Azure and Alibaba Cloud
Good grasp of observability platforms such as Splunk and Dynatrace
Experience with containerization platforms such as Docker and Kubernetes
Extensive experience with virtualization technologies such as VMware
Strong knowledge of networking using collapsed architecture or similar enterprise networking technology
Knowledgeable in scripting languages such as Python, Bash, or PowerShell.
AWS Certified Solutions Architect, Azure Solutions Architect, or equivalent certification.
Certified Kubernetes Administrator (CKA)
Solid understanding of disaster recovery and business continuity practices.

All applications will be treated in the strictest of confidence. If you are a suitable match for this position, please send your application