Work Schedule: Other
Environmental Conditions: Office
Job Description
Summarized Purpose:
We are seeking a Lead Data Engineer to own the complete lifecycle of enterprise data pipelines, from development through production. The role covers roadmap planning, scalable ETL architecture, AWS data services, secure PHI/PII handling, healthcare data standards, AI-assisted mapping automation, data quality, transformation, catalog standards, and RAG-enabled data solutions.
Education/Experience:
- Bachelor's degree or equivalent in Computer Science, Information Technology, Data Engineering, or related field
- 7+ years of experience in data engineering, ETL development, cloud data platforms, healthcare or regulated data environments, and production data pipeline delivery
Major Job Responsibilities:
- Design, develop, deploy, and operate scalable ETL and data pipelines using PySpark, Python, advanced SQL, and AWS data services
- Own the data pipeline lifecycle from requirements and mapping through development, testing, deployment, monitoring, production support, release management, and future roadmap planning
- Build ingestion and transformation pipelines for flat files, relational databases, APIs, data warehouses, healthcare data sources, and enterprise data platforms
- Implement mapping automation, preferably using AI, along with LLM-assisted data cleaning, transformation, data quality checks, and RAG use cases
- Implement secure handling of PHI/PII data including encryption, access controls, auditability, retention, masking, de-identification, governance, and operational readiness
Knowledge, Skills, and Abilities:
- Advanced expertise in PySpark, Python, SQL, ETL best practices, data modeling, and large-scale data processing
- Strong hands-on experience with AWS services including S3, Glue, Lambda, Step Functions, ECS, DynamoDB, Redshift, RDS/PostgreSQL, and related data services
- Experience with PostgreSQL, SQL Server, Redshift, flat files, complex source-to-target mappings, HL7, claims data, EMR extracts, and clinical trial data
- Knowledge of data cataloging, metadata management, transformation standards, orchestration, monitoring, data quality, CI/CD, automated testing, and production support practices
- Ability to lead technical design, mentor engineers, guide delivery decisions, troubleshoot complex issues, and communicate with cross-functional teams
Must Have Skills:
- Advanced PySpark, Python, and SQL skills, with expertise in ETL design and data pipeline engineering
- AWS data services experience including S3, Glue, Lambda, Step Functions, ECS, DynamoDB, and Redshift, plus PostgreSQL and SQL Server integration
- Secure PHI/PII handling, flat-file ingestion, source-to-target mapping, transformation, data catalog, governance, and healthcare data standards experience
- CI/CD, GitHub workflows, automated testing, release management for data pipelines and database changes, and dev-to-prod pipeline ownership
Good to Have Skills:
- AI-assisted mapping automation and use of LLMs for data cleaning, data quality checks, transformation logic, documentation, and patient de-identification support
- Experience with RAG patterns, embeddings, vector databases, semantic search, or AI-enabled data discovery solutions
- Familiarity with infrastructure-as-code tools such as Terraform or CloudFormation, plus streaming, Databricks, Snowflake, observability, and DevOps practices
Working Hours:
- India: 05:30 PM to 02:30 AM IST
- Philippines: 08:00 PM to 05:00 AM PHT