Druid is an OLAP engine designed to provide fast real time analytics. Druid adopts a clustered architecture with servers that host various role specific processes. These processes address real time and batch ingestion, indexing, querying of historical and real time data. Apache Druid can be deployed as a virtual machine or a Kubernetes based cluster. Druid does not support a decoupled compute & storage architecture. Deep storage in the form of object storage is used to replicate data to.
Databricks was built by the founders of Spark as an analytics platform to support machine learning use cases. It leverages the Spark framework to process data residing in a data lake and is supported on AWS, GCP and Azure. Databricks coined the marketing term “Lakehouse '' architecture to illustrate the unification of data lake and data warehouse use cases. Customers still manage Spark clusters that process data residing in a Delta lake. Conversion of data to Delta Lake format is required to leverage the functionality of Delta Lake. Databricks Sql is a relatively new addition to simplify access to data stored in a data lake.
Druid provides the ability to handle fast ingest and high concurrency. Custom sizing and cluster tuning are required to balance the compute, memory, storage needs of each process within Druid and to provide high concurrency. Druid clusters can be grown by adding nodes with automatic rebalancing of storage segments assigned to nodes. Self hosted Druid on Kubernetes is an option that users leverage to simplify scaling. Additionally, Cloud based managed Druid offerings are being rolled out. However, these managed offerings are limited in scale and scaling is not granular.
Databricks allow for autoscaling of clusters based on utilization. Additionally, increasing concurrency associated with a sql endpoint can be accomplished through the addition of clusters. Query concurrency per cluster is maxed at 10. However, scaling with additional clusters for concurrency is possible. Databricks provides a choice of instance types.
Druid provides high performance through columnar storage format, parallel processing, bitmap indexes and roll-ups. Druid, however, recommends a denormalized data model for performance needs. Join operations in Druid are a relatively new feature with various limitations, especially if there is a need to join large datasets.
Databricks is designed to leverage the Spark framework for processing large volumes of data. It leverages compressed Parquet files in a Delta Lake. To reduce the amount of data processed, it uses data pruning on partitions and Parquet file metadata. Databricks does not provide any indexes.
Druid is designed as an OLAP engine to provide fast access to aggregations that are run against large volumes of data. Druid is typically used for customer facing analytics and streaming data processing. Druid is used as an add-on with other data warehousing products that are efficient at scaling, joining, and filtering large volumes of data. It is not a suitable option for data warehouse replacement.
Databricks is a mature Spark based platform proven for processing streaming data. It is widely used for Machine Learning use cases by data scientists through the use of integrated notebooks. From a low latency query perspective, while it offers features like Delta Cache, it does not provide specialized indexes that can deliver low latency queries.