Job Title: Senior Java Site Reliability Engineer (SRE)
Location: McLean, VA (Hybrid)
Job Summary
STAFFXPERT LLC is seeking a Senior Java Site Reliability Engineer (SRE) on behalf of our client in McLean, VA (Hybrid). The role focuses on supporting and enhancing highly available, mission-critical enterprise platforms in a large-scale financial services environment. The ideal candidate brings deep expertise in production support, reliability engineering, cloud platforms, automation, observability, and incident management, along with strong experience in enterprise Java-based systems.
Key Responsibilities
- Support and maintain highly available production systems across cloud and distributed environments
- Lead incident management, problem management, root cause analysis (RCA), and platform stability initiatives
- Monitor and ensure uptime, performance, and reliability of Java applications and microservices
- Identify, troubleshoot, and resolve application and system performance bottlenecks
- Design and implement resiliency patterns including circuit breakers, retries, failover, and high-availability architectures
- Improve observability through monitoring, logging, alerting, and automation of incident response
- Collaborate with development, infrastructure, platform, and cloud engineering teams to enhance deployment reliability
- Support cloud transformation, infrastructure modernization, and automation initiatives
- Coordinate disaster recovery testing, resiliency validation, capacity planning, and production readiness reviews
- Drive operational excellence and continuous service improvement initiatives
- Provide technical leadership and mentor distributed engineering teams
Required Qualifications
- 16-20 years of experience in Site Reliability Engineering, Production Engineering, Platform Engineering, or Application Support roles
- Strong experience supporting large-scale enterprise production environments
- Proven expertise in incident management, problem management, and operational support
- Experience working in Banking, Financial Services, FinTech, or other highly regulated environments
- Hands-on experience with mission-critical applications requiring high availability, scalability, and performance
- Strong troubleshooting, analytical, and problem-solving skills
Technical Skills
- Java
- Linux / Unix Administration
- Kubernetes, Docker
- Cloud Platforms: AWS / Azure / GCP
- CI/CD Tools: Jenkins, GitHub Actions, GitLab CI/CD, ArgoCD
- Infrastructure as Code: Terraform, Ansible
- Monitoring & Observability: Splunk, Datadog, Grafana, Prometheus, Moogsoft
- ITSM Tools: ServiceNow, JIRA, Confluence
- Scripting: Python, Bash/Shell
- SQL and database troubleshooting
- Application Performance Monitoring (APM) tools
- Production release management
- Disaster recovery and high availability architectures
Education
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related field
Preferred Qualifications
- Strong cloud-native and microservices architecture experience
- Ability to lead critical production incidents and drive long-term reliability improvements
- Excellent communication and stakeholder management skills
- Experience mentoring and leading global engineering teams