Databricks - A Lakehouse unicorn

Databricks advertises with this slogan:

All your data, analytics and AI on one lakehouse platform

Its a SaaS service that provied a collaborative soltuion to the entire machine learning lifecycle. It is build with data pipelines in mind and on top of the lakehouse architecture.

The pricing model of databricks is based on databricks processing units which are billed ontop of the cloud costs you have with your cloud provider. When first setting up databricks, you need to connect it to your cloud account so the workers get provisoned there. The costs are not very transparent at first though, one needs to keep in mind that you are both billed by your cloud provider and by databricks.

This business model has turned the company into one of the fastest growing AI companies according to Gardner.

The collaborative approach

Many data scientists and AI researchers like to work in Jupyter Notebooks. The iterative approach needed for building advanced machine learning solutions is very time consuming, so the interactivity the notebooks allow speed up the process immensly.

The presentation of code and result is also liked by many.

Databricks has used this same approach for the ETL processes in Spark, as can be seen with this GUI screenshot, where we load our police dataset and transform it and then write it into the delta-lake:

On the left hand side one can browse through his Git Repo, pick which file to work on or change the configuration of the Spark Cluster.

Advantages:

  • intuitive UI, which can be used for collaborating in a team

  • good Integration of Source Code Providers like GitHub / BitBucket

  • cluster can be shared by many Editors

  • very good integration of third parties like delta-lake, aws & azure, hadoop etc.

Disadvantages:

  • pricing model

  • dependency between Databricks & Apache Spark

  • spark can be very slow when running commands -> configuration of cluster?

What needs to be explored further?

  • machine Learning pipelines inside the databricks runtime

  • scalability of the platform

  • IAM Integration into corporate environments