Data Architect/Developer (PySpark & AWS)

Onset Technologies
Contract
Remote
United States
Big Data

Description

We are looking for a Data Architect/Developer with deep expertise in PySpark/Databricks and AWS to design, build, and maintain scalable data platforms that enable advanced analytics, reporting, and data-driven decision-making. This role involves hands-on development and architecture responsibilities, with a strong focus on cloud data solutions, big data processing, and performance optimization.

Key Responsibilities:

  • Design and implement scalable data lake and lakehouse solutions using Databricks and AWS services (e.g., S3, Glue, Athena, EMR).
  • Develop robust data processing pipelines using PySpark for both batch and real-time ingestion and transformation (a representative batch pipeline sketch follows this list).
  • Architect efficient data models (dimensional/star/snowflake) to support reporting, analytics, and machine learning workloads.
  • Collaborate with data scientists, analysts, and product teams to understand data requirements and translate them into architecture and pipeline solutions.
  • Build and maintain reusable ETL/ELT components, optimizing for performance, cost, and reliability.
  • Enforce data governance, quality, and security standards, leveraging tools like AWS Lake Formation, IAM, and Glue Data Catalog.
  • Monitor and troubleshoot performance of Spark workloads and cluster resources on Databricks.
  • Implement version control and CI/CD for data workflows using tools like Git, Jenkins, or AWS CodePipeline.
  • Document architecture diagrams, data flows, and technical specifications.
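As a loose illustration of the batch-pipeline responsibility above, the following is a minimal sketch (not the team's actual code): it reads raw JSON from S3, applies basic cleansing, and writes a partitioned Delta table. The bucket paths, column names, and app name are hypothetical, and a Delta-enabled Spark session (e.g., on Databricks) is assumed.

    # Minimal batch ingestion sketch; paths and columns are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders_batch_ingest").getOrCreate()

    # Read raw landing-zone data from S3 (illustrative path).
    raw = spark.read.json("s3://example-landing-bucket/orders/")

    # Basic cleansing and typing before the data reaches the curated layer.
    cleaned = (
        raw.dropDuplicates(["order_id"])
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("order_date", F.to_date("order_ts"))
           .filter(F.col("order_id").isNotNull())
    )

    # Write a partitioned Delta table, the storage layer of a typical lakehouse.
    (
        cleaned.write.format("delta")
               .mode("overwrite")
               .partitionBy("order_date")
               .save("s3://example-curated-bucket/orders_delta/")
    )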

Required Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related technical field.
  • 5+ years of data engineering experience, with 3+ years using PySpark and Databricks on AWS.
  • Strong hands-on experience with AWS data services including S3, Glue, Redshift, Lambda, Athena, and EMR, as well as infrastructure-as-code tooling (CloudFormation or Terraform).
  • Proficiency in Python and PySpark for scalable data transformation and processing.
  • Experience with Delta Lake and Lakehouse architecture on Databricks.
  • Advanced SQL skills and familiarity with distributed query engines (e.g., Athena, Redshift Spectrum).
  • Knowledge of data security practices, encryption, IAM policies, and compliance (e.g., GDPR, HIPAA).
  • Hands-on Spark performance tuning and optimization for large-scale datasets (a brief tuning sketch follows this list).
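To make the tuning expectation above concrete, here is a minimal sketch of two common PySpark techniques: broadcasting a small dimension table to avoid shuffling the large side of a join, and repartitioning by the write key to control output file sizes. Table paths and column names are hypothetical.

    # Join-tuning sketch; all paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join_tuning_example").getOrCreate()

    facts = spark.read.format("delta").load("s3://example-curated-bucket/orders_delta/")
    dims = spark.read.format("delta").load("s3://example-curated-bucket/dim_customers/")

    # Broadcasting the small dimension table avoids shuffling the large fact table.
    enriched = facts.join(broadcast(dims), on="customer_id", how="left")

    # Repartitioning by the partition key before writing keeps output files
    # evenly sized and reduces small-file overhead on S3.
    (
        enriched.repartition("order_date")
                .write.format("delta")
                .mode("overwrite")
                .partitionBy("order_date")
                .save("s3://example-curated-bucket/orders_enriched/")
    )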

Preferred Qualifications:

  • Familiarity with orchestration and transformation tools such as Airflow, Step Functions, or dbt.
  • Experience integrating data pipelines with machine learning workflows using tools like MLflow or SageMaker.
  • Exposure to real-time/streaming technologies such as Kafka, Kinesis, or Spark Structured Streaming (see the streaming sketch after this list).
  • AWS and/or Databricks certifications (e.g., AWS Certified Data Analytics, Databricks Data Engineer Associate/Professional).
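For the streaming exposure mentioned above, a minimal Structured Streaming sketch is shown below: it reads JSON events from Kafka and appends them to a Delta table with checkpointing. The broker address, topic, schema, and paths are hypothetical, and the spark-sql-kafka connector plus Delta Lake are assumed to be available (both ship with Databricks runtimes).

    # Streaming ingestion sketch; broker, topic, schema, and paths are placeholders.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("events_stream").getOrCreate()

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read the raw Kafka stream; the value column arrives as bytes.
    stream = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "events")
             .load()
    )

    # Parse the JSON payload into typed columns.
    parsed = (
        stream.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
              .select("e.*")
    )

    # Append to a Delta table with checkpointing for exactly-once recovery.
    query = (
        parsed.writeStream.format("delta")
              .option("checkpointLocation", "s3://example-curated-bucket/_checkpoints/events/")
              .outputMode("append")
              .start("s3://example-curated-bucket/events_delta/")
    )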

Soft Skills:

  • Strong analytical thinking and problem-solving capabilities.
  • Excellent verbal and written communication skills for technical and non-technical audiences.
  • Ability to work independently, prioritize effectively, and deliver high-quality results in a fast-paced environment.
  • Passion for continuous learning and staying current with emerging data technologies.