Senior Java Site Reliability Engineer (SRE)

Job Title: Senior Java Site Reliability Engineer (SRE)
Location: McLean, VA (Hybrid)

Job Summary

STAFFXPERT LLC is seeking a Senior Java Site Reliability Engineer (SRE) on behalf of our client in McLean, VA (Hybrid). This role is focused on supporting and enhancing highly available, mission-critical enterprise platforms within a large-scale financial services environment. The ideal candidate will bring deep expertise in production support, reliability engineering, cloud platforms, automation, observability, and incident management, with strong experience in enterprise Java-based systems.

Key Responsibilities
  • Support and maintain highly available production systems across cloud and distributed environments
  • Lead incident management, problem management, root cause analysis (RCA), and platform stability initiatives
  • Monitor and ensure uptime, performance, and reliability of Java applications and microservices
  • Identify, troubleshoot, and resolve application and system performance bottlenecks
  • Design and implement resiliency patterns including circuit breakers, retries, failover, and high-availability architectures
  • Improve observability through monitoring, logging, alerting, and automation of incident response
  • Collaborate with development, infrastructure, platform, and cloud engineering teams to enhance deployment reliability
  • Support cloud transformation, infrastructure modernization, and automation initiatives
  • Coordinate disaster recovery testing, resiliency validation, capacity planning, and production readiness reviews
  • Drive operational excellence and continuous service improvement initiatives
  • Provide technical leadership and mentor distributed engineering teams

Required Qualifications
  • 16-20 years of experience in Site Reliability Engineering, Production Engineering, Platform Engineering, or Application Support roles
  • Strong experience supporting large-scale enterprise production environments
  • Proven expertise in incident management, problem management, and operational support
  • Experience working in Banking, Financial Services, FinTech, or other highly regulated environments
  • Hands-on experience with mission-critical applications requiring high availability, scalability, and performance
  • Strong troubleshooting, analytical, and problem-solving skills

Technical Skills
  • Java
  • Linux / Unix Administration
  • Kubernetes, Docker
  • Cloud Platforms: AWS / Azure / GCP
  • CI/CD Tools: Jenkins, GitHub Actions, GitLab CI/CD, ArgoCD
  • Infrastructure as Code: Terraform, Ansible
  • Monitoring & Observability: Splunk, Datadog, Grafana, Prometheus, Moogsoft
  • ITSM Tools: ServiceNow, JIRA, Confluence
  • Scripting: Python, Bash/Shell
  • SQL and database troubleshooting
  • Application Performance Monitoring (APM) tools
  • Production release management
  • Disaster recovery and high availability architectures

Education
  • Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field

Preferred Qualifications
  • Strong cloud-native and microservices architecture experience
  • Ability to lead critical production incidents and drive long-term reliability improvements
  • Excellent communication and stakeholder management skills
  • Experience mentoring and leading global engineering teams