Question 1

How do Kovil AI Databricks engineers prevent runaway cluster cost overruns?

Accepted Answer

Our engineers address cost overruns across four dimensions. First, cluster segmentation: interactive (all-purpose) clusters for notebook exploration are strictly separated from job clusters for production pipelines, and interactive clusters are never used for production ETL. Second, auto-termination policies: every interactive cluster gets a hard auto-termination limit (typically 30-60 minutes idle), enforced via cluster policies defined in the workspace and assigned to users and groups. Third, instance sizing: we right-size worker node counts using Databricks cluster utilisation metrics (the compute metrics UI, or Ganglia on older Databricks Runtime versions) and Spark UI stage timelines before locking configurations. Fourth, spot instance policies: production job clusters run on spot/preemptible workers with an on-demand driver node, reducing the cloud VM portion of the bill by 40-70% on fault-tolerant batch workloads (spot pricing lowers instance cost; DBU rates are unchanged). All cluster policies are codified in Terraform (databricks_cluster_policy resources) and version-controlled, not set ad hoc in the UI.
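As a rough illustration of the auto-termination and spot rules described above, the sketch below builds two policy definitions and checks a proposed cluster configuration against them locally. The attribute paths follow the Databricks cluster policy definition format for AWS; the specific limits, the policy names, and the `violations` helper are assumptions made for this example, not part of any Databricks API.

```python
import json

# Illustrative limits only; attribute paths follow the cluster policy
# definition format (AWS), the values are assumptions for this sketch.
INTERACTIVE_POLICY = {
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
}

JOB_POLICY = {
    # Spot workers, falling back to on-demand if spot capacity is unavailable.
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    # The first node (the driver) is always on-demand.
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 1},
}


def violations(policy, cluster_config):
    """Hypothetical local check: list the ways a flat config breaks a policy."""
    problems = []
    for path, rule in policy.items():
        value = cluster_config.get(path)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{path} must be {rule['value']!r}, got {value!r}")
        elif rule["type"] == "range":
            in_range = value is not None and rule["minValue"] <= value <= rule["maxValue"]
            if not in_range:
                problems.append(
                    f"{path} must be between {rule['minValue']} and "
                    f"{rule['maxValue']}, got {value!r}"
                )
    return problems


def policy_json(policy):
    """Serialise a policy definition as the JSON string a policy resource expects."""
    return json.dumps(policy, indent=2, sort_keys=True)
```

The JSON produced by `policy_json` is the kind of string that would be passed as the definition of a `databricks_cluster_policy` resource; the local check is only a convenience for reviewing configurations in a pull request.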

Question 2

How do your Databricks engineers integrate into our existing CI/CD pipelines?

Accepted Answer

Kovil AI engineers work natively inside your existing toolchain. For Databricks-specific deployment, we use Databricks Asset Bundles (DABs), the modern replacement for dbx, to define jobs, pipelines, and permissions as versioned YAML manifests. These are committed to your Git repository and deployed via GitHub Actions or Azure DevOps pipelines, eliminating manual workspace UI changes. Terraform (with the official Databricks provider) manages workspace configuration: Unity Catalog grants, cluster policies, instance profiles, and secret scopes. For notebook-based workflows, Databricks Repos (now called Git folders) gives engineers a Git-backed development loop with pull request reviews before any merge to production. Every pipeline promotion (dev → staging → prod) happens through the CI runner, with environment-specific bundle targets controlling which workspace receives the deployment.
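The promotion flow above can be sketched as a small helper that a CI runner might call to pick the bundle target for the current branch. The branch names and the branch-to-target mapping are hypothetical; `databricks bundle validate` and `databricks bundle deploy` with `-t` are the Databricks CLI commands for Asset Bundles.

```python
# Hypothetical mapping from Git branch to bundle target; real projects
# define their own target names in the bundle's databricks.yml.
BRANCH_TARGETS = {
    "develop": "dev",
    "release": "staging",
    "main": "prod",
}


def bundle_commands(branch):
    """Return the CLI invocations a CI job would run for this branch."""
    target = BRANCH_TARGETS.get(branch)
    if target is None:
        # Feature branches are not deployed by the runner in this sketch.
        raise ValueError(f"no bundle target configured for branch {branch!r}")
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]
```

In a pipeline step, each returned list could be handed to `subprocess.run(cmd, check=True)` so that a failed validation stops the deployment.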

Question 3

What is the onboarding process and how quickly does a Kovil AI Databricks engineer reach full velocity?

Accepted Answer

Week 1 is access and architecture onboarding: the engineer gets workspace access, reviews your existing job configurations and cluster policies, audits Delta table schemas and Unity Catalog grants, and attends your sprint kickoff. By the end of Day 5 they have submitted their first pull request — typically a small pipeline fix or a cluster policy tightening. Week 2 is active delivery: they are in your daily standup, working tickets from your backlog (Jira, Linear, or GitHub Issues), and making substantive pipeline commits. Full velocity — meaning they can independently architect, implement, and deploy a new Bronze-to-Gold Delta Live Tables pipeline — is typically reached by sprint 3. The match itself takes under 48 hours: you share your stack requirements (cloud provider, workspace tier, Unity Catalog vs legacy metastore, streaming vs batch workloads) and we return vetted profiles the same business day.

Question 4

What is the correct architecture for Unity Catalog governance in a multi-team Databricks workspace?

Accepted Answer

The standard production pattern our engineers implement follows Unity Catalog's three-level namespace: a catalog per environment (prod, staging, dev) with schemas (databases) per domain (finance, product, logistics) and tables per entity. Access is granted at the schema level using Unity Catalog GRANT statements: data engineers get CREATE TABLE and MODIFY on their domain schema, analysts get SELECT on Gold-layer schemas (each principal also needs USE CATALOG and USE SCHEMA to reach those objects), and no principal gets SELECT on Bronze-layer raw tables outside the pipeline service principal. Row-level security is implemented via Unity Catalog row filters (SQL UDFs registered as filters on a table), and column-level masking via column masks or dynamic views (using functions such as is_account_group_member()) for PII fields. All grants are managed via Terraform (databricks_grants resources), so access changes go through a pull request approval process rather than ad hoc UI grants. Data lineage is captured automatically by Unity Catalog across all Delta tables accessed via the SQL warehouse or a Databricks cluster with Unity Catalog metastore attached.
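To make the grant model concrete, here is a minimal sketch that renders the schema-level GRANT statements for one domain. The group names and the `<domain>_gold` schema naming are assumptions for illustration; the statement shape follows Unity Catalog SQL syntax.

```python
def grant(privileges, securable_type, name, principal):
    """Render one Unity Catalog GRANT statement."""
    return f"GRANT {', '.join(privileges)} ON {securable_type} {name} TO `{principal}`;"


def domain_grants(catalog, domain, engineer_group, analyst_group):
    """Grants for one domain: engineers write to the domain schema,
    analysts read only the Gold schema (hypothetical `<domain>_gold` name)."""
    return [
        grant(["USE CATALOG"], "CATALOG", catalog, engineer_group),
        grant(["USE SCHEMA", "CREATE TABLE", "MODIFY"], "SCHEMA",
              f"{catalog}.{domain}", engineer_group),
        grant(["USE CATALOG"], "CATALOG", catalog, analyst_group),
        grant(["USE SCHEMA", "SELECT"], "SCHEMA",
              f"{catalog}.{domain}_gold", analyst_group),
    ]


if __name__ == "__main__":
    for statement in domain_grants("prod", "finance", "finance-engineers", "analysts"):
        print(statement)
```

In practice the same matrix would live in Terraform rather than generated SQL; the sketch only shows which privileges each group ends up with.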

Question 5

When should we use Delta Live Tables (DLT) versus standard Databricks Workflows?

Accepted Answer

Delta Live Tables is the right choice when: (1) you need declarative pipeline definitions with built-in data quality expectations (CONSTRAINT ... EXPECT clauses that warn on, drop, or fail the update on bad records), (2) you want automatic dependency graph resolution between Bronze, Silver, and Gold tables without manually ordering tasks, and (3) you are running Structured Streaming sources (Kafka, Auto Loader, Kinesis) that need continuous or triggered refresh with exactly-once semantics via DLT's internal checkpointing. Standard Databricks Workflows are the right choice when: your pipeline includes non-Spark steps (Python scripts, dbt runs, ML training jobs, or external API calls), you need fine-grained control over cluster configuration per task, or you are running one-time or infrequent batch jobs where the DLT cluster startup overhead is not justified. Most mature Lakehouses use both: DLT for the core ingestion and transformation tiers, Workflows for orchestrating dbt transformations on top of Gold tables and triggering downstream ML jobs.
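The behaviour of data quality expectations can be illustrated without a pipeline. The sketch below is a plain-Python simulation of the three violation policies (warn, drop, fail); it is not the DLT API, and the function name and return shape are assumptions for this example.

```python
def apply_expectation(rows, predicate, on_violation="warn"):
    """Simulate an expectation: return (kept_rows, violation_count).

    warn -> keep every row, only count violations
    drop -> keep only rows that satisfy the predicate
    fail -> raise as soon as any row violates the predicate
    """
    if on_violation not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown policy {on_violation!r}")
    passed = [row for row in rows if predicate(row)]
    violation_count = len(rows) - len(passed)
    if on_violation == "fail" and violation_count:
        raise ValueError(f"{violation_count} row(s) violated the expectation")
    if on_violation == "drop":
        return passed, violation_count
    return list(rows), violation_count


if __name__ == "__main__":
    orders = [{"id": 1, "amount": 20.0}, {"id": 2, "amount": -5.0}]
    kept, bad = apply_expectation(orders, lambda r: r["amount"] >= 0, "drop")
    print(kept, bad)
```

In a real pipeline the same rule would be declared on the table definition and the violation counts would appear in the pipeline's event log.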

Question 6

How do Kovil AI engineers approach a migration from Snowflake or Redshift to Databricks?

Accepted Answer

We run migrations in four phases. Phase 1 (2 weeks): audit — inventory all Snowflake/Redshift objects (tables, views, stored procedures, tasks/scheduled queries), classify them by complexity (SQL-compatible vs requiring rewrite), and map to a target Databricks object type (Delta table, DLT pipeline, Databricks Workflow). Phase 2 (2-4 weeks): infrastructure — provision Unity Catalog structure, configure cloud storage (S3/ADLS Gen2/GCS) with appropriate IAM roles, set up instance profiles, and deploy cluster policies. Phase 3 (4-8 weeks): pipeline migration — translate Snowflake Streams/Tasks or Redshift Spectrum queries into PySpark or Databricks SQL, implement Auto Loader for incremental ingestion from the source S3/blob, and run parallel validation (row counts, aggregation checks, statistical distribution comparisons between old and new). Phase 4 (1-2 weeks): cutover — redirect upstream producers to the new landing zone, deprecate the legacy connection strings, and monitor the first 5 production pipeline runs before declaring migration complete. We maintain zero pipeline downtime by running old and new systems in parallel through Phase 3.
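The parallel validation step in Phase 3 can be sketched as a comparison of the same extract from both systems. This example covers only the row-count and aggregation checks; the function name, the in-memory row format, and the tolerance are assumptions, and a real run would push these checks down into SQL on each platform.

```python
import math


def validate_parallel_run(legacy_rows, migrated_rows, numeric_column, rel_tol=1e-9):
    """Compare row counts and the sum of one numeric column across systems."""
    legacy_sum = math.fsum(row[numeric_column] for row in legacy_rows)
    migrated_sum = math.fsum(row[numeric_column] for row in migrated_rows)
    return {
        "row_count_matches": len(legacy_rows) == len(migrated_rows),
        # A relative tolerance absorbs floating-point differences between engines.
        "sum_matches": math.isclose(legacy_sum, migrated_sum, rel_tol=rel_tol),
    }


if __name__ == "__main__":
    legacy = [{"order_id": 1, "amount": 10.5}, {"order_id": 2, "amount": 4.5}]
    migrated = [{"order_id": 1, "amount": 10.5}, {"order_id": 2, "amount": 4.5}]
    print(validate_parallel_run(legacy, migrated, "amount"))
```

A pipeline would only be eligible for cutover once every check returns True for several consecutive runs.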

Question 7

How does Photon engine accelerate Databricks workloads, and when does it not help?

Accepted Answer

Photon is a vectorized query engine written in C++ that replaces Spark's JVM-based execution engine for supported SQL and DataFrame operations. It accelerates workloads that are CPU-bound on large scans, joins, and aggregations, typically 2-8x faster on Databricks SQL warehouses and Delta Live Tables pipelines on Photon-enabled compute. Photon is most beneficial for: large table scans on Delta tables (with Z-ordering, data skipping leaves Photon fewer files to read), complex aggregations and window functions in SQL analytics queries, and GROUP BY / JOIN operations on Silver-to-Gold transformation pipelines. Photon does not help for: UDF-heavy workloads (Python and Scala UDFs are not executed by Photon and fall back to the standard Spark engine), workloads that are I/O-bound rather than compute-bound (where the bottleneck is object store read latency, not CPU), and ML training jobs using MLlib or custom Spark ML pipelines. Our engineers run Spark UI stage-level profiling to determine whether a bottleneck is compute-bound (Photon will help) or I/O/shuffle-bound (requiring partition tuning, Z-ordering, or liquid clustering changes instead).

Question 8

What does a production-grade Medallion Architecture look like on Databricks, and what are the most common structural mistakes?

Accepted Answer

A production Medallion Architecture on Databricks has three layers. Bronze (raw): Delta tables landing raw data exactly as received from sources, with no transformations, schema-on-read, and append-only writes. Data is ingested via Auto Loader (cloudFiles format) with schema inference and evolution enabled. Bronze tables retain data permanently (or per retention policy) for full reprocessing. Silver (cleaned and conformed): DLT streaming tables applying deduplication (dropDuplicates with watermarking or MERGE INTO with row_hash keys), type casting, PII masking, and data quality EXPECT constraints. Silver tables are the authoritative, source-of-truth layer for domain entities. Gold (aggregated for consumption): DLT materialised views or tables aggregating Silver data into business metrics, pre-joined denormalised fact tables for BI tools, or feature tables for ML. Gold is served via Databricks SQL warehouses with serverless auto-scaling. The most common structural mistakes are: (1) skipping Bronze and landing pre-transformed data directly into Silver, removing the ability to replay historical loads; (2) keeping tables in the workspace-local default Hive metastore instead of Unity Catalog, which prevents cross-workspace access; (3) writing Gold transformations as heavy PySpark jobs instead of using Databricks SQL for analyst-maintainability; and (4) over-partitioning Bronze tables by date on small datasets, causing the "small files problem" that Liquid Clustering is designed to avoid.
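The row_hash deduplication mentioned for the Silver layer can be sketched in plain Python. This is an in-memory illustration of the idea (hash the business key columns, keep the first row per hash), not a Spark or DLT implementation; the column names and helper names are assumptions.

```python
import hashlib
import json


def row_hash(row, key_columns):
    """Stable SHA-256 over the chosen key columns of one row."""
    payload = json.dumps([row[column] for column in key_columns], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(rows, key_columns):
    """Keep the first row seen for each hash, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        digest = row_hash(row, key_columns)
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


if __name__ == "__main__":
    events = [
        {"customer_id": 7, "event_ts": "2024-01-01T00:00:00", "source_file": "a.json"},
        {"customer_id": 7, "event_ts": "2024-01-01T00:00:00", "source_file": "b.json"},
        {"customer_id": 8, "event_ts": "2024-01-01T00:05:00", "source_file": "b.json"},
    ]
    print(len(deduplicate(events, ["customer_id", "event_ts"])))
```

On Databricks the same hash would typically be stored as a column and used as the match key in MERGE INTO, so that a replayed file does not insert the same entity twice.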

Dimension	Generic Outsourcing	Kovil AI Databricks Engineers
PySpark & SQL proficiency	Basic DataFrame API, limited Catalyst optimizer awareness	Full query plan analysis, stage-level profiling, AQE tuning, broadcast join control
Delta Lake ACID properties	Treats Delta tables as Parquet. No MERGE INTO, no time travel, no VACUUM strategy	MERGE INTO with change data capture, OPTIMIZE with Z-ordering, VACUUM with retention policies, schema evolution config
Unity Catalog governance	Legacy Hive metastore. No column masking. Row filters not implemented	Row filters + dynamic view masking for PII. Attribute-based access via catalog grants. Full lineage tracking
CI/CD pipeline integration	Notebook exports or dbx (deprecated). No Terraform. Manual UI deploys	Databricks Asset Bundles (DABs) in Git. Terraform for workspace config. GitHub Actions or Azure DevOps runners
Cluster cost governance	Ad hoc cluster sizing. No auto-termination enforcement. No spot instance policy	Cluster policies via Terraform. Spot/preemptible workers on batch jobs. DBU spend dashboards per team
DLT and streaming	Batch-only. Structured Streaming used without checkpointing. No EXPECT constraints	DLT declarative pipelines with EXPECT constraints. Auto Loader for cloud file ingestion. Exactly-once guarantees
Sprint embedding	Async delivery. Separate Slack workspace. Updates weekly or on request	Daily standup in your Slack. Tickets in your Jira/Linear. PRs in your Git repo by end of Day 5
Migration capability	Can translate SQL. Cannot handle Snowflake Streams, Redshift Spectrum, or parallel validation	Full audit, object classification, parallel pipeline validation, and zero-downtime cutover protocol
