Key Responsibilities:
Design, develop, and maintain ETL pipelines using Python, PySpark, and SQL on distributed data platforms.
Write clean, efficient, and scalable PySpark code for big data transformation and processing.
Develop reusable scripts and tools for data ingestion, cleansing, validation, and aggregation.
Work with structured and semi-structured data (JSON, Parquet, Avro, etc.).
Optimize SQL queries for performance and cost-efficiency in data lakes or warehouses.
Collaborate with data architects, analysts, and BI developers to deliver end-to-end data solutions.
Participate in code reviews, pair programming, and unit/integration testing.
Support and troubleshoot issues in development, test, and production environments.
Document technical processes, data flows, and pipeline designs.
Required Skills and Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related technical field.
3–6+ years of hands-on experience in Python, PySpark, and SQL.
Proficiency in Apache Spark (RDD/DataFrame APIs), Spark performance tuning, and distributed computing concepts.
Strong experience with relational databases such as PostgreSQL, SQL Server, or Oracle, including writing complex SQL queries with CTEs, joins, and window functions.
Familiarity with cloud platforms such as AWS, Azure, or GCP, and related data services (e.g., EMR, Databricks, BigQuery, Synapse).
Experience with data lake and data warehouse concepts.