The Oracle Cloud Infrastructure (OCI) team offers the opportunity to build and operate massive-scale, integrated cloud services in a broadly distributed, multi-tenant cloud environment. OCI builds cloud products for customers who are tackling some of the world's largest technical and business challenges.

Oracle Kubernetes Engine (OKE) is OCI's managed Kubernetes service. OKE enables customers to create, run, scale, secure, and operate Kubernetes clusters on OCI, integrating Kubernetes with OCI compute, networking, storage, identity, observability, security, and automation. The OKE team owns a highly available 24x7 cloud service and is expanding the platform to support larger clusters, higher scale, improved operability, deeper OCI integrations, and increasingly demanding cloud native, AI, and GPU workloads.

We are looking for a senior (IC5) software engineer with deep Kubernetes expertise, hands-on cloud infrastructure experience, and a strong distributed systems background. This is a high-impact technical leadership role for an engineer who can define architecture, drive cross-team execution, solve ambiguous production and platform problems, and deliver durable systems that improve both customer experience and operational excellence.

You will work on core OKE platform capabilities including cluster lifecycle management, orchestration, scalability, reliability, performance, automation, observability, security, and integration with OCI infrastructure services. The ideal candidate has hands-on experience designing, building, operating, or deeply debugging production cloud services, infrastructure platforms, or Kubernetes-based systems at meaningful scale.

This role requires advanced Kubernetes experience, including Kubernetes control plane behavior, controllers and operators, scheduling, autoscaling, networking, storage, service discovery, container runtimes, node lifecycle, Kubernetes APIs, and etcd. Experience with Kubernetes networking and storage technologies such as CNI, Cilium, Calico, Flannel, other container networking implementations, CSI drivers, and cloud provider integrations is highly relevant.

OKE is also expanding to support demanding AI and accelerated computing use cases. Experience with AI/ML infrastructure, multi-node GPU clusters, accelerated compute, model training or inference platforms, GPU scheduling, device plugins, Karpenter, cluster autoscaling, CUDA, NCCL, RoCE, InfiniBand, RDMA, SmartNIC/DPU offload, or high-performance AI/HPC networking is a significant plus.

This role also requires an engineer who is ready to use modern agentic engineering practices responsibly. We expect senior engineers to apply AI-assisted and agentic workflows to accelerate design exploration, implementation, testing, debugging, documentation, operational analysis, and developer productivity while maintaining strong ownership, security judgment, code quality, and production accountability.

As a member of the software engineering division, you will take an active role in defining and evolving standard practices and procedures. You will define specifications for significant new projects and specify, design, develop, troubleshoot, and debug software for OCI's managed Kubernetes service.

Responsibilities include:

  • Provide technical leadership for major OKE platform initiatives from architecture through implementation, launch, and production operation.
  • Design and build distributed systems that create, update, scale, repair, and operate Kubernetes clusters across OCI regions.
  • Improve OKE reliability, scalability, performance, upgrade safety, lifecycle management, observability, automation, and operational tooling.
  • Work deeply with Kubernetes technologies, including control plane components, controllers/operators, scheduling, autoscaling, Kubernetes APIs, container runtimes, node behavior, and etcd.
  • Design, debug, and improve Kubernetes networking and storage integrations, including CNI-based networking, Cilium, Calico, Flannel, other container networking implementations, CSI drivers, and OCI infrastructure integrations.
  • Build automation for cluster validation, health checks, readiness testing, failure detection, remote recovery, and reduction of post-deployment operational issues.
  • Lead technical design reviews, code reviews, incident reviews, and production readiness reviews for complex service changes.
  • Debug difficult production issues across service boundaries, including Kubernetes, Linux, networking, compute, storage, identity, telemetry, and OCI infrastructure dependencies.
  • Apply performance engineering practices including profiling, tracing, latency analysis, throughput optimization, and production diagnostics across distributed systems.
  • Build automation that reduces manual operations, improves fleet health, accelerates diagnosis, and raises the quality bar for OKE engineering.
  • Partner with OCI service teams to deliver end-to-end platform capabilities regardless of organizational boundaries.
  • Apply AI-assisted and agentic engineering workflows to improve engineering velocity, test coverage, debugging, operational analysis, and documentation while ensuring correctness, security, and maintainability.
  • Mentor engineers, influence technical direction, and help establish patterns that scale across the OKE organization.
  • Participate in operating a 24x7 cloud service and use customer feedback, production data, and operational experience to prioritize improvements.

Required qualifications:

  • 10+ years of software engineering experience, or equivalent experience building and operating production software systems.
  • Hands-on cloud infrastructure experience is required, ideally designing, building, operating, or debugging production services or platforms on OCI, AWS, Azure, GCP, or a large-scale private cloud.
  • Strong hands-on Kubernetes expertise is required, including Kubernetes architecture, APIs, control plane behavior, controllers/operators, scheduling, autoscaling, networking, storage, nodes, cluster lifecycle management, or production cluster operations.
  • Advanced Kubernetes knowledge, including CNI, CSI, etcd, service discovery, container runtimes, node lifecycle, and Kubernetes failure modes.
  • Experience with Kubernetes networking technologies such as Cilium, Calico, Flannel, or other CNI implementations.
  • Experience with Kubernetes storage integrations, including CSI drivers or cloud storage integrations.
  • Strong distributed systems fundamentals, including availability, failure handling, performance, scalability, and operational tradeoffs.
  • Experience building highly available infrastructure services, platform services, or cloud native systems used in production.
  • Strong development experience in both Go/Golang and Java is required.
  • Strong Linux, networking, debugging, and production operations skills.
  • Demonstrated ability to lead ambiguous technical projects, influence across teams, and deliver through other engineers without relying on formal authority.
  • Strong communication skills, ownership, judgment, and ability to make pragmatic tradeoffs in production systems.

Preferred qualifications:

  • Experience with AI/ML infrastructure, GPU workloads, multi-node GPU clusters, accelerated compute, model training or inference platforms, GPU scheduling, device plugins, Karpenter, cluster autoscaling, CUDA, NCCL, high-performance networking, or distributed training systems.
  • Experience with eBPF-based networking, Kubernetes network policy, service mesh, ingress, load balancing, overlays/underlays, BGP, VXLAN, SmartNIC/DPU offload, RoCE, InfiniBand, RDMA, or multi-cluster networking.
  • Experience with infrastructure as code and cloud provisioning tools such as Terraform, Packer, cloud-init, IAM, VCN/VPC networking, VPN, FastConnect/Direct Connect, or equivalent cloud primitives.
  • Experience building developer productivity, operational automation, or responsible AI-assisted and agentic engineering workflows.
  • Experience with observability systems, incident response, safe deployment practices, canary analysis, rollback strategies, service health automation, and large fleet operations.
  • Open-source or upstream contribution experience in Kubernetes, cloud native infrastructure, observability, networking, or related systems.

Disclaimer:
Certain U.S. based or U.S. customer or client-facing roles may be required to comply with applicable requirements, such as immunization/occupational health mandates, and/or drug testing requirements.
Range and benefit information provided in this posting are specific to the stated locations only.
US: Hiring Range in USD from: $96,800 to $306,400 per annum. May be eligible for bonus, equity, and compensation deferral.
Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business.
Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.
Oracle US offers a comprehensive benefits package which includes the following:
1. Medical, dental, and vision insurance, including expert medical opinion
2. Short term disability and long term disability
3. Life insurance and AD&D
4. Supplemental life insurance (Employee/Spouse/Child)
5. Health care and dependent care Flexible Spending Accounts
6. Pre-tax commuter and parking benefits
7. 401(k) Savings and Investment Plan with company match
8. Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
9. 11 paid holidays
10. Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
11. Paid parental leave
12. Adoption assistance
13. Employee Stock Purchase Plan
14. Financial planning and group legal
15. Voluntary benefits including auto, homeowner and pet insurance
The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.

Career Level - IC5

Only Oracle brings together the data, infrastructure, applications, and expertise to power everything from industry innovations to life-saving care. And with AI embedded across our products and services, we help customers turn that promise into a better future for all. Discover your potential at a company leading the way in AI and cloud solutions that impact billions of lives.

True innovation starts when everyone is empowered to contribute. That’s why we’re committed to growing a workforce that promotes opportunities for all with competitive benefits that support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.

We’re committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing [email protected] or by calling 1-888-404-2494 in the United States.

Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
