Job Title: Data Engineer
Company: PropStream
Location: Ajmer, Rajasthan
Created: 2026-03-31
Job Type: Full Time
Job Description:
Role Overview

We are looking for a hands-on, senior Databricks Architect to design, build, and govern our Lakehouse data platform from the ground up. You will own the end-to-end architecture of our data infrastructure, from raw ingestion through the Medallion layers to serving, and establish the engineering standards that will guide the entire data organization. This is a highly strategic and technical role focused on driving adoption of Databricks, Unity Catalog, and modern Lakehouse patterns across all data products and pipelines.

Key Responsibilities

Lakehouse Architecture & Design
- Design and implement a production-grade Medallion Architecture (Bronze / Silver / Gold) across all data pipelines.
- Establish best practices for Delta Lake table design, partitioning strategies, Z-ordering, and optimization across large-scale datasets.
- Define data modeling standards and schema evolution policies across the Lakehouse.
- Architect end-to-end data flows from ingestion (streaming and batch) through transformation and serving layers.

Unity Catalog & Data Governance
- Lead the setup, configuration, and rollout of Unity Catalog as the centralized governance layer for all data assets.
- Design the metastore hierarchy, catalog/schema/table organization, and tagging standards.
- Implement fine-grained access control (row-level, column-level), data masking policies, and audit logging.
- Establish data lineage tracking and ensure end-to-end visibility across all pipelines.
- Define and enforce data classification and sensitivity frameworks for PII and regulated data assets.

Pipeline Development & Orchestration
- Build and maintain production-grade data pipelines using PySpark, Delta Live Tables (DLT), and Databricks Workflows / Jobs.
- Design modular, reusable pipeline patterns, including incremental ingestion, CDC (Change Data Capture), and full-refresh strategies (a minimal sketch of this pattern appears after the Security, Compliance & Networking items below).
- Implement robust pipeline observability: logging, alerting, lineage tracking, and SLA monitoring.
- Leverage Databricks Repos for CI/CD integration, managing code promotion across dev / staging / production environments.

Performance & Compute Optimization
- Optimize Spark execution plans; identify and resolve performance bottlenecks across large-scale distributed workloads.
- Right-size cluster configurations: serverless warehouses, auto-scaling job clusters, and Photon-enabled SQL warehouses.
- Leverage serverless SQL warehouses for BI and ad hoc analytics workloads, minimizing cost and cold-start latency.
- Manage cost governance for compute, storage, and DBU consumption across workspaces.

Developer Experience & Standards
- Set up and maintain Databricks Repos with standardized project structures and Git integration.
- Define Python coding standards, notebook best practices, and modular library patterns for the data engineering team.
- Build reusable Python utility libraries for common patterns: schema validation, data quality checks, Delta operations, and logging.
- Establish unit testing and integration testing frameworks for Spark pipelines.

Security, Compliance & Networking
- Configure workspace-level and account-level security: Private Link, IP access lists, and secrets management via Databricks Secrets or AWS Secrets Manager.
- Design and enforce network isolation for sensitive data workloads.
- Ensure compliance with data residency and access control requirements for customer data.
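To make the incremental ingestion and CDC responsibility above concrete, the sketch below shows one way a Bronze-to-Silver upsert could look in PySpark on Delta Lake. It is a minimal sketch under assumed names: the catalog, tables, and columns (main.bronze.orders_raw, main.silver.orders, order_id, customer_id) are hypothetical placeholders, not references to an existing system.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the raw Bronze table (hypothetical name, for illustration only).
    bronze = spark.read.table("main.bronze.orders_raw")

    # Light cleansing/conformance for the Silver layer: drop null keys,
    # stamp the ingestion time, and deduplicate on the business key.
    silver_updates = (
        bronze
        .filter(F.col("order_id").isNotNull())
        .withColumn("ingested_at", F.current_timestamp())
        .dropDuplicates(["order_id"])
    )

    # Incremental, CDC-style upsert into the Silver Delta table.
    target = DeltaTable.forName(spark, "main.silver.orders")
    (
        target.alias("t")
        .merge(silver_updates.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    # Periodic compaction and data skipping on a common filter column.
    spark.sql("OPTIMIZE main.silver.orders ZORDER BY (customer_id)")

The MERGE keeps the Silver table idempotent if a batch is replayed, and the OPTIMIZE/ZORDER pass reflects the compaction and data-skipping practices called out under Lakehouse Architecture & Design.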
Collaboration & Enablement
- Partner with data engineers, data scientists, and analytics engineers to ensure the platform meets diverse workload needs.
- Mentor the engineering team on Databricks, Spark optimization, and Lakehouse best practices.
- Produce architectural documentation, runbooks, and internal knowledge bases.
- Evaluate and recommend new Databricks features and third-party integrations relevant to the organization's data roadmap.

Required Qualifications

Core Databricks & Lakehouse
- 5+ years of hands-on experience with Databricks, with at least 2 years in an architect or senior lead role.
- Deep expertise in Unity Catalog: metastore setup, the three-level namespace, ACL design, and data governance workflows.
- Strong mastery of the Medallion Architecture and Delta Lake: ACID transactions, time travel, compaction, and OPTIMIZE/VACUUM strategies.
- Proven experience designing and deploying production pipelines with Databricks Jobs and Workflows, including multi-task job DAGs, retry logic, and notifications.
- Hands-on experience with Databricks Repos and CI/CD integration for notebook and Python library deployments.
- Experience configuring and operating serverless SQL warehouses and serverless compute for jobs.

Apache Spark
- Expert-level PySpark development: DataFrames, Spark SQL, window functions, broadcast joins, and UDFs.
- Strong understanding of Spark internals: DAG execution, shuffle optimization, memory management, and speculative execution.
- Experience with Structured Streaming and micro-batch processing patterns.
- Proven ability to diagnose and resolve Spark performance issues using the Spark UI and event logs.

Python & Software Engineering
- Advanced Python skills with a strong software engineering background: packaging, testing (pytest), virtual environments, and dependency management (a brief illustrative sketch appears at the end of this posting).
- Experience building modular Python libraries for data engineering use cases.
- Familiarity with common data engineering libraries: pandas, pydantic, and great_expectations or similar data quality frameworks.

Cloud & Infrastructure
- Experience deploying Databricks on AWS, including workspace provisioning, IAM integration, and VPC configuration.
- Familiarity with cloud-native storage (S3/ADLS), external locations in Unity Catalog, and storage credential management.
- Exposure to infrastructure-as-code tooling (Terraform, Databricks Asset Bundles, or similar).

Preferred Qualifications
- Databricks Certified Data Engineer Professional or Databricks Certified Associate Developer for Apache Spark certification.
- Experience with Delta Live Tables (DLT) for declarative pipeline authoring.
- Familiarity with dbt (data build tool) integrated with Databricks SQL.
- Experience with Databricks Feature Store or MLflow for ML platform use cases.
- Exposure to Databricks Marketplace and Partner Connect integrations.
- Experience with Elasticsearch, Apache Kafka, or other streaming/search technologies complementary to the Lakehouse.
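As a small illustration of the PySpark and pytest expectations above, the sketch below shows one possible shape for a window-function transformation and its unit test. The dedupe_latest helper and the sample columns are hypothetical names invented for this example, not part of an existing codebase.

    import pytest
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window


    def dedupe_latest(df, key_col, ts_col):
        """Keep only the most recent row per key, a common Silver-layer step."""
        w = Window.partitionBy(key_col).orderBy(F.col(ts_col).desc())
        return (
            df.withColumn("_rn", F.row_number().over(w))
              .filter(F.col("_rn") == 1)
              .drop("_rn")
        )


    @pytest.fixture(scope="session")
    def spark():
        # Small local session so the test runs without a cluster.
        return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()


    def test_dedupe_latest_keeps_newest_row(spark):
        df = spark.createDataFrame(
            [("a", 1, "old"), ("a", 2, "new"), ("b", 1, "only")],
            ["id", "ts", "payload"],
        )
        result = {r["id"]: r["payload"] for r in dedupe_latest(df, "id", "ts").collect()}
        assert result == {"a": "new", "b": "only"}

Keeping transformations in plain Python functions like this, separate from notebooks, is what makes the unit-testing and modular-library standards described above practical.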