While the cloud data warehouse is great because it's fully managed, it still struggles with ML/AI/data science/event streaming workloads and gets wildly expensive at scale. Layer on the adjacent tools (ELT, observability, rETL, dbt, BI, etc.), all of which push compute into the warehouse, and the costs skyrocket. Conversely, the "data lakehouse" expands the aperture of possible use cases beyond analytics, but it's still very DIY (sorting & clustering, space reclamation, file sizing, CDC, log event ingest, etc.). Databricks seems well positioned here, but it's still not great for SQL workloads. Regardless of how this space unfolds, it seems like a safe bet that Databricks will build an empire.
It seems like the best option would be to store data anywhere in an open format and toggle the processing engine according to whatever makes the most sense for the use case. If there were a non-DIY way to do this, that would be 🔥. What does everyone else think?
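For anyone wondering what I mean by "toggle the engine," here's a toy sketch: the data is written once in an open format (plain Parquet here; a local path stands in for object storage), and two different engines read the same files. DuckDB and PyArrow are just stand-ins for whatever fits the job (Spark, Trino, etc.), and the `events` schema is made up for illustration.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Write the data once, in an open format any engine can read.
events = pa.table({
    "user_id": [1, 2, 1, 3],
    "event":   ["click", "click", "purchase", "click"],
    "amount":  [0.0, 0.0, 42.5, 0.0],
})
pq.write_table(events, "events.parquet")

# Engine 1: DuckDB for ad-hoc SQL directly over the files.
con = duckdb.connect()
print(con.sql(
    "SELECT event, count(*) AS n FROM 'events.parquet' GROUP BY event"
).fetchall())

# Engine 2: PyArrow's dataset API for programmatic / ML-style access
# to the exact same storage, no copy or load step.
dataset = ds.dataset("events.parquet", format="parquet")
purchases = dataset.to_table(filter=ds.field("event") == "purchase")
print(purchases.to_pandas())
```

The point isn't these two engines specifically; it's that the storage layer doesn't care who's reading it, so the engine becomes a per-workload choice instead of a platform commitment.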