Description
We are looking for a Data Architect/Developer with deep expertise in PySpark/Databricks and AWS to design, build, and maintain scalable data platforms that enable advanced analytics, reporting, and data-driven decision-making. This role involves hands-on development and architecture responsibilities, with a strong focus on cloud data solutions, big data processing, and performance optimization.
Key Responsibilities:
Design and implement scalable data lake and lakehouse solutions using Databricks and AWS services (e.g., S3, Glue, Athena, EMR).
Develop robust data processing pipelines using PySpark for both batch and real-time ingestion and transformation.
Architect efficient dimensional data models (star and snowflake schemas) to support reporting, analytics, and machine learning workloads.
Collaborate with data scientists, analysts, and product teams to understand data requirements and translate them into architecture and pipeline solutions.
Build and maintain reusable ETL/ELT components, optimizing for performance, cost, and reliability.
Enforce data governance, quality, and security standards, leveraging tools like AWS Lake Formation, IAM, and Glue Data Catalog.
Monitor and troubleshoot performance of Spark workloads and cluster resources on Databricks.
Implement version control and CI/CD for data workflows using tools like Git, Jenkins, or AWS CodePipeline.
Document architecture diagrams, data flows, and technical specifications.
Required Qualifications:
Bachelor's or Master's degree in Computer Science, Data Engineering, or a related technical field.
5+ years of data engineering experience, with 3+ years using PySpark and Databricks on AWS.
Strong hands-on experience with AWS data services including S3, Glue, Redshift, Lambda, Athena, EMR, and CloudFormation/Terraform.
Proficiency in Python and PySpark for scalable data transformation and processing.
Experience with Delta Lake and lakehouse architecture on Databricks.
Advanced SQL skills and familiarity with distributed query engines (e.g., Athena, Redshift Spectrum).
Knowledge of data security practices, encryption, IAM policies, and compliance (e.g., GDPR, HIPAA).
Proven experience tuning and optimizing Spark workloads for large-scale datasets.
Preferred Qualifications:
Familiarity with data orchestration tools such as Airflow, dbt, or Step Functions.
Experience integrating data pipelines with machine learning workflows using tools like MLflow or SageMaker.
Exposure to real-time/streaming technologies such as Kafka, Kinesis, or Spark Structured Streaming.
AWS and/or Databricks certifications (e.g., AWS Certified Data Analytics, Databricks Data Engineer Associate/Professional).
Soft Skills:
Strong analytical thinking and problem-solving capabilities.
Excellent verbal and written communication skills for technical and non-technical audiences.
Ability to work independently, prioritize effectively, and deliver high-quality results in a fast-paced environment.
Passion for continuous learning and staying current with emerging data technologies.