# advice-data-warehouses
s
Hi there, which is the best Cloud Data Warehouse these days? Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse, Firebolt or any others?
❄️ 7
firebolt 2
👍 2
t
I prefer ClickHouse 😄
👍 1
a
Couple of gripes I have with Redshift:
• It's not serverless, so you have to choose the number of nodes upfront.
• Storage and compute are coupled until you upgrade to RA3 nodes, so in my experience new users run out of storage long before compute becomes an issue.
• Dealing with semistructured data via SUPER and PartiQL is not very pleasant. It gets the job done, but it's not quite as exciting as the newer entries on your list @Simon Späti
👌 1
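As a rough illustration of the SUPER / PartiQL point above, here's a minimal sketch of navigating and unnesting a JSON payload in Redshift (the table and field names are hypothetical):

```sql
-- Hypothetical events table; payload holds raw JSON in a SUPER column
CREATE TABLE events (
    event_id BIGINT,
    payload  SUPER
);

-- PartiQL navigation: dot into the document and unnest the items array
SELECT e.event_id,
       e.payload.user_id AS user_id,
       item.sku          AS sku
FROM events AS e,
     e.payload.items AS item        -- array unnesting happens in the FROM clause
WHERE e.payload.event_type = 'purchase';
```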
z
Apologies in advance if this opinion comes off strong, but I’m passionate about this. I feel there’s a common misconception about what the ‘best’ is. The best for what? General-purpose, denormalized compute, relational compute? Best value? In our case the data sources were primarily APIs, with infrequent access and large storage requirements. We chose BigQuery because most data coming in is raw JSON and we don’t have any SQL instances at all. We had existing apps running in GKE, so native logging sinks from GKE to BigQuery were a strong sell as well. No solution is ‘best’ in all categories; you’ll have to compromise somewhere.
💯 8
👍 1
s
My team at ResQ uses Redshift, but as others have mentioned its coupling of storage and compute makes multi-tenancy and compute isolation difficult to solve. We tried transitioning to Snowflake but ran into some performance issues. The Snowflake serverless model is great but makes pricing difficult to estimate. Also, if you’re using any streaming solutions, you either have to keep a micro warehouse always on or do the work of setting up a Snowpipe to ingest from S3. And if you’re a dbt shop with deep DAGs, we found name resolution of nested views to dramatically affect performance.
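For reference, a minimal sketch of the Snowpipe setup mentioned above, auto-ingesting JSON from S3 (the stage, storage integration and table names are made up for illustration):

```sql
-- External stage pointing at the S3 landing prefix
CREATE STAGE raw_events_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- Land raw JSON documents into a VARIANT column
CREATE TABLE raw_events (payload VARIANT);

-- Snowpipe: continuously COPY new files from the stage as they arrive
CREATE PIPE raw_events_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_events
  FROM @raw_events_stage
  FILE_FORMAT = (TYPE = 'JSON');
```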
s
Interesting @Tony Cini, do you see ClickHouse, Pinot and Druid as Cloud Data Warehouses? To me these are two different things; more on the details in this question. @Zach Brak Yeah, agree, ‘best’ is always a bad word. For me, best relates to DWH capabilities such as joins, SQL window functions and fast response times, so more on the analytics/BI side, where the customer also sits.
🤙🏼 1
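For context, the kind of DWH capability being referred to, e.g. a window function typical of BI queries (the schema here is purely illustrative):

```sql
-- Running revenue per customer: a typical analytics/BI window function
SELECT
  customer_id,
  order_date,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS running_revenue
FROM orders;
```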
a
I'm a huge fan of Snowflake!
j
Redshift and Synapse couple storage and compute; Synapse takes it a step further and couples parallelism to the compute tier. I feel these need some modernization to compete in the space. ClickHouse and the like I wouldn't consider cloud data warehouses; they're OLAP solutions like Power BI datasets and Tableau datasets. BigQuery and Snowflake seem to be the most flexible, and more importantly, all of the newer open source projects seem to favor these the most, which to me is a huge indicator. I'd say Databricks is missing from the list. Delta with Databricks and dbt, along with a serverless SQL engine on top of Delta like Synapse SQL serverless or Athena, really lets you decouple compute, storage and access. In terms of flexibility on the T side of ELT, I personally feel Databricks wins. It also comes with native CDC.
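A minimal sketch of the "serverless SQL on top of Delta" pattern mentioned above, here with Synapse serverless SQL's OPENROWSET (the storage account and path are hypothetical):

```sql
-- Query a Delta table sitting in the lake directly from Synapse serverless SQL
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mylakeaccount.dfs.core.windows.net/lake/silver/orders/',
    FORMAT = 'DELTA'
) AS orders;
```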
j
I'm with @Zach Brak, best is about what's best for YOUR workload. For us, that's been BigQuery. But it's really important to deeply understand your use case today and where you expect to grow as well. I do have a really hard time recommending Redshift or Synapse because of some of the items already mentioned and more—my experiences have just been really poor every time from both a cost and performance standpoint on many different types of workloads. But for us, we often inherit what our clients use 🙃 Firebolt is interesting, but young. Excited to see their roadmap grow. BigQuery is established, but now moves more slowly (like usually happens with maturity in a category). Snowflake strikes me as somewhere in between, but cost and performance models of all of these can vary hugely depending on your use case.
z
@Jordan Fox huge +1 on the open source viability standpoint. Extensibility is king imo - a dev-friendly and robust API (with configurability as part of said API) is a definite must for a modern data solution. @Justin Beasley interested in your assumption that BigQuery feature releases are slowing down with maturity - I actually feel it’s speeding up. The upcoming feature set pulled in from Dataform is going to blow a lot of the competition out of the water if GCP can get a fully managed, dbt-like feature set built into their offering. Agree on cost and performance; to reiterate, BQ handles wide and deep data tables well, favoring UNNEST() over JOINs, while Snowflake leans toward JOINs across many tables. In this way Snowflake gets favored by many trying to lift and shift ODBC / MSSQL etc.
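To illustrate the UNNEST() point: in BigQuery a nested, repeated field can be flattened in place rather than joined out to a separate table (project, dataset and column names below are made up):

```sql
-- Flatten a repeated line_items field instead of joining a line-items table
SELECT
  o.order_id,
  item.sku,
  item.quantity * item.unit_price AS line_total
FROM `my_project.sales.orders` AS o,
     UNNEST(o.line_items) AS item
WHERE o.order_date >= '2022-01-01';
```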
j
@Zach Brak I think they're spending less overall time innovating on wide platform features and are going deep on certain niches (e.g. geospatial data). To be clear, I think they're doing well for their category and better than most platforms at this scale—but they do incur levels of technical debt over time, and can't "move fast and break things" when massive enterprises rely on the platform. Still, features in the pipeline like native JSON types and such are great and do have broader appeal. Most are largely borrowed from great ideas on other platforms—which we all want, but they speak to the desires of a broad community and aren't necessarily unique to the platform (that is, aren't net new innovations). To me the biggest selling point for BigQuery is the true separation of compute (and the nature of the storage side, which is otherworldly), simple billing, and the things it inherits from GCP like their insane networking. But those are fixed-point wins for them, and while they give Google a certain scale of moat, they are also somewhat binding for them. BigQuery when it came out was pure witchcraft; now it's the expected incumbent that everyone should compare and compete against. I'm still hugely bullish on BigQuery, and maybe it's okay that their work right now is largely feature-borrowing to help remove some of the platform objections. But I hope that they're also letting engineers really dream big about the next big thing, because I think in the next few years we're going to need another transformational moment in the space like BQ's original release was.
🔥 2
💯 2
z
That’s a great summary - these are the nuggets of knowledge I come here for!
j
I really think as Databricks evolves we'll see more of that flexibility. Delta, Photon, and the native Spark ecosystem being able to switch very easily between code and SQL makes a huge difference in the developer's workflow. Native CDC through Delta logs has been amazing for designing solutions around master data. You can very easily manage the separation of compute and storage, and keeping a smaller cluster on to serve the SQL endpoint with a separate cluster for the batch transforms is a lower-cost solution, with serverless options like Athena and Synapse on top of Delta being even lower cost for end users. The latest feature I like the most is how their COPY INTO statement natively keeps a high watermark on files it's already seen.
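A minimal sketch of the two Databricks features mentioned, assuming a Delta target table that already exists and has Change Data Feed enabled (table and path names are illustrative):

```sql
-- COPY INTO tracks which files it has already loaded, so re-runs are idempotent
COPY INTO bronze.raw_orders
FROM 's3://my-bucket/landing/orders/'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true');

-- Read row-level changes (the "native CDC") from the Delta change data feed,
-- starting at table version 2
SELECT * FROM table_changes('bronze.raw_orders', 2);
```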