While the cloud data warehouse is great because it's fully managed, it still struggles with ML/AI/data science/event streaming workloads and gets wildly expensive at scale. Layer on the adjacent tools (ELT, observability, rETL, dbt, BI, etc.), all of which push compute into the warehouse, and the costs skyrocket. Conversely, the "data lakehouse" expands the aperture of possible use cases beyond analytics, but it's still very DIY (sorting & clustering, space reclamation, file sizing, CDC, log event ingest, etc.). Databricks seems well positioned here, but it's still not great for SQL workloads. Regardless of how this space unfolds, it seems like a safe bet that Databricks will build an empire.
It seems like the best option would be to store data anywhere in an open format and toggle the processing engine according to whatever makes the most sense for the use case. If there were a non-DIY way to do this, that would be 🔥. What does everyone else think?
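For anyone wondering what I mean by "toggle the engine," here's a toy sketch: the data is written once in an open format (plain Parquet here; a local path stands in for object storage), and two different engines read the same files. DuckDB and PyArrow are just stand-ins for whatever fits the job (Spark, Trino, etc.), and the `events` schema is made up for illustration.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Write the data once, in an open format any engine can read.
events = pa.table({
    "user_id": [1, 2, 1, 3],
    "event":   ["click", "click", "purchase", "click"],
    "amount":  [0.0, 0.0, 42.5, 0.0],
})
pq.write_table(events, "events.parquet")

# Engine 1: DuckDB for ad-hoc SQL directly over the files.
con = duckdb.connect()
print(con.sql(
    "SELECT event, count(*) AS n FROM 'events.parquet' GROUP BY event"
).fetchall())

# Engine 2: PyArrow's dataset API for programmatic / ML-style access
# to the exact same storage, no copy or load step.
dataset = ds.dataset("events.parquet", format="parquet")
purchases = dataset.to_table(filter=ds.field("event") == "purchase")
print(purchases.to_pandas())
```

The point isn't these two engines specifically; it's that the storage layer doesn't care who's reading it, so the engine becomes a per-workload choice instead of a platform commitment.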