Senior Java Site Reliability Engineer (SRE)

Staffxpert LLC (McLean, VA, USA) 8 hours ago

Job Title: Senior Java Site Reliability Engineer (SRE)
Location: McLean, VA (Hybrid)

Job Summary

STAFFXPERT LLC is seeking a Senior Java Site Reliability Engineer (SRE) on behalf of our client in McLean, VA (Hybrid). This role is focused on supporting and enhancing highly available, mission-critical enterprise platforms within a large-scale financial services environment. The ideal candidate will bring deep expertise in production support, reliability engineering, cloud platforms, automation, observability, and incident management, with strong experience in enterprise Java-based systems.

Key Responsibilities

Support and maintain highly available production systems across cloud and distributed environments
Lead incident management, problem management, root cause analysis (RCA), and platform stability initiatives
Monitor and ensure uptime, performance, and reliability of Java applications and microservices
Identify, troubleshoot, and resolve application and system performance bottlenecks
Design and implement resiliency patterns including circuit breakers, retries, failover, and high-availability architectures
Improve observability through monitoring, logging, alerting, and automation of incident response
Collaborate with development, infrastructure, platform, and cloud engineering teams to enhance deployment reliability
Support cloud transformation, infrastructure modernization, and automation initiatives
Coordinate disaster recovery testing, resiliency validation, capacity planning, and production readiness reviews
Drive operational excellence and continuous service improvement initiatives
Provide technical leadership and mentor distributed engineering teams

Required Qualifications

16 to 20 years of experience in Site Reliability Engineering, Production Engineering, Platform Engineering, or Application Support roles
Strong experience supporting large-scale enterprise production environments
Proven expertise in incident management, problem management, and operational support
Experience working in Banking, Financial Services, FinTech, or other highly regulated environments
Hands-on experience with mission-critical applications requiring high availability, scalability, and performance
Strong troubleshooting, analytical, and problem-solving skills

Technical Skills

Java
Linux / Unix Administration
Kubernetes, Docker
Cloud Platforms: AWS / Azure / GCP
CI/CD Tools: Jenkins, GitHub Actions, GitLab CI/CD, ArgoCD
Infrastructure as Code: Terraform, Ansible
Monitoring & Observability: Splunk, Datadog, Grafana, Prometheus, Moogsoft
ITSM Tools: ServiceNow, JIRA, Confluence
Scripting: Python, Bash/Shell
SQL and database troubleshooting
Application Performance Monitoring (APM) tools
Production release management
Disaster recovery and high availability architectures

Education

Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field

Preferred Qualifications

Strong cloud-native and microservices architecture experience
Ability to lead critical production incidents and drive long-term reliability improvements
Excellent communication and stakeholder management skills
Experience mentoring and leading global engineering teams
