Job Responsibilities:

  • Build and manage highly available, scalable infrastructure across AWS and Alibaba Cloud. Ensure the platform is optimized for performance, reliability, and cost efficiency while supporting both regional and international traffic.
  • Define and maintain service reliability standards through SLIs, SLOs, and SLAs. Take ownership of incident response processes, lead troubleshooting efforts, and conduct root cause analysis to continuously improve system stability. Develop automated recovery mechanisms to reduce downtime.
  • Implement and maintain infrastructure using Infrastructure-as-Code practices, enabling consistent, repeatable deployments. Promote version control and standardization across all environments.
  • Design and enhance continuous integration and deployment workflows to enable fast, stable, and low-risk software releases, while minimizing manual intervention.
  • Continuously improve system performance and ensure the platform can handle sudden spikes in traffic, particularly in high-demand scenarios such as real-time news events.
  • Establish end-to-end observability through monitoring, logging, and tracing solutions.
  • Strengthen platform security by applying best practices across cloud environments, including access control, network security, and threat protection.
  • Act as a senior technical advisor within the team, guiding engineers on best practices and architectural decisions. Contribute to long-term platform strategy and promote a culture focused on reliability, automation, and continuous improvement.

Job Requirements:

  • At least 5 years of relevant experience in DevOps, cloud engineering, or site reliability engineering roles, with hands-on exposure to production environments.
  • Strong hands-on experience with AWS services (such as EC2, EKS, Lambda, RDS) and Alibaba Cloud offerings (including ECS, ACK, OSS, SLB). Familiarity with CDN solutions and multi-cloud environments is highly advantageous.
  • Proven experience using tools such as Terraform, Ansible, or Pulumi to automate infrastructure provisioning and management.
  • Practical knowledge of CI/CD platforms like Jenkins, GitLab CI, or GitHub Actions, with a focus on building efficient deployment pipelines.
  • Experience working with Kubernetes and container-based systems, along with monitoring and observability tools for maintaining system health.
  • Background in high-traffic industries such as media, digital content, or large-scale online platforms, including experience handling traffic surges and CDN operations.

Lead DevOps Engineer (Platform & Reliability)
