# good-reads
s
Learn how to load data into a Databricks Lakehouse and run simple analytics with our tutorial, Load Data into Delta Lake on Databricks Lakehouse.
airbyte rocket 3
a
Very comprehensive article! When should you use "full refresh", and when should you avoid it?
octavia thanks 1
l
Would be nice to have incremental deduped sync natively on this connector 🙂
plus1 1
s
@Ashwin thanks for the comment, and good question. As always, it depends. Normally, in the early stages, you full-refresh dimensions (as these can mutate, e.g. a changing address) and sync facts incrementally. But of course, if your dimensions get big or the sync takes too long (or gets too expensive), you'd need to go incremental on the dimensions as well. Then you need to figure out CDC or SCD2, or use our deduped sync (once it is supported for this destination). Hope that helps! 🙂
@Léon Stefani, I agree, that would be an excellent addition. Otherwise, you must implement CDC or SCD2 yourself, which is a lot of work.
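For illustration, a minimal Databricks SQL sketch of the kind of incremental dedupe being discussed; the table and column names (dim_customer, stg_customer_batch, customer_id, updated_at) are hypothetical and not part of the Airbyte connector, and QUALIFY assumes a recent Databricks Runtime:

```sql
-- Upsert the latest record per key from a staging batch into a target Delta table.
-- All object names here are placeholders.
MERGE INTO dim_customer AS t
USING (
  SELECT *
  FROM stg_customer_batch
  -- keep only the most recent record per key within the incoming batch
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```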
l
@Simon Späti, yep, you're right! Actually, for this use case I managed to build a custom Databricks connector that handles incremental dedupe, but it is not compatible with the current connector ^^ It's more work to keep it up to date, though.
s
@Léon Stefani Wow, that is awesome! Just curious, how did you do it: did you write a custom SCD2 such as this, or did you use debezium.io to get the changes from the OLTP sources, or both 😅? On the Databricks connector, there is a lot of ongoing work: a Databricks connector v2, integration with the announced Unity Catalog, and Partner Connect, which will hopefully benefit the integration as well.
l
Actually, I just made a version of the Databricks connector compatible with dbt transformations using the merge strategy, and then used the generated dbt models in the Airbyte normalization image to build the SCD2 (same as for the Snowflake or Postgres connectors). I also use CDC with Debezium to get data from sources (Postgres), but since replication-slot consumption issues can be pretty deadly there, I use the S3 connector and then Spark Structured Streaming with custom generated dbt models for normalization and deduplication. The custom connector works well for all API sources, though, and it also allows unnesting API response streams, which is great.
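A rough sketch of what such a dbt incremental model with the merge strategy could look like against the Databricks destination; the model, staging reference, and business columns are made up for illustration, and only _airbyte_emitted_at follows Airbyte's convention:

```sql
-- models/orders_deduped.sql (hypothetical dbt model, dbt-spark/dbt-databricks syntax)
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'order_id',
    file_format = 'delta'
  )
}}

select
    order_id,
    customer_id,
    amount,
    _airbyte_emitted_at
from {{ ref('stg_orders') }}  -- placeholder staging model

{% if is_incremental() %}
  -- only merge records newer than what the target table already holds
  where _airbyte_emitted_at > (select max(_airbyte_emitted_at) from {{ this }})
{% endif %}
```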
s
That is very valuable. Thanks for sharing Léon. FYI: @Liren Tu @Chris Duong [Airbyte]
l
@Simon Späti since you reminded me of this, I'm currently cleaning it up and opening a PR with it, in order to get some feedback and be able to make it available.
octavia loves 2
airbyte rocket 1
c
Since we implemented the incremental models in dbt during normalization, it's totally possible to switch the strategy to the merge strategy and do a dedupe without an intermediate table in the middle (without keeping history or SCD tables around), as @Léon Stefani is describing, I guess.
👍🏻 1
l
@Léon Stefani, I think Databricks Lakehouse supports incremental dedup natively. If that’s the case, using dbt may not be the desired approach.
l
@Liren Tu, if there is a way to natively handle incremental dedup on Databricks, I'd be more than happy to discuss how to implement it. But I still think being able to run dbt transformations on the Databricks destination is good to have, because it also allows using basic normalization, which is really handy for HTTP APIs (usually very nested), and it makes the destination work the same way as the other warehouse destinations.
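As a sketch of the basic-normalization point for nested API responses, Databricks' JSON path syntax can flatten Airbyte's raw _airbyte_data column; the schema name and nested JSON paths below are hypothetical:

```sql
-- _airbyte_ab_id, _airbyte_emitted_at and _airbyte_data follow Airbyte's raw-table
-- convention; the schema name and the nested fields are placeholders.
SELECT
  _airbyte_ab_id,
  _airbyte_emitted_at,
  _airbyte_data:id::bigint                     AS id,
  _airbyte_data:customer.email::string         AS customer_email,
  _airbyte_data:shipping.address.city::string  AS shipping_city
FROM some_schema._airbyte_raw_orders;
```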
n
Hey @Léon Stefani, were you able to get this working? This would be extremely helpful to our team.
l
@Nazih Kalo, I was able to get a PoC working, but it is not up to date with the current Databricks connector and still has issues of its own (it needs to run on an all-purpose cluster, and there are weird normalization issues on some streams). I have an open [PR](https://github.com/airbytehq/airbyte/pull/14445) for it that may help later on, once the Databricks connector is refactored, as I understood.
d
@Simon Späti, does the connector also work with Unity Catalog? I had a look at the source code and noticed that the CREATE TABLE command is based on the old two-level namespace, but as far as I know Unity Catalog adds a catalog on top of the schema and table.
s
Hi @Dennis Hinnenkamp, to my knowledge Unity Catalog is already supported. @Liren Tu, please correct me if you have any other info to share.
d
@Simon Späti, thanks for your reply. I just did a quick test, and the sync fails when creating the table:
CREATE TABLE public._airbyte_tmp_yhm_orders USING parquet LOCATION 's3://<bucket-xyz>/data/6fed6526-b350-40a6-8719-1907a58cfe65/orders'.
I would have expected it to be possible to specify the catalog in the configuration, but currently I don't know which catalog the table would be created in.
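For context, a sketch of the namespace difference being described; the catalog name `main` is only a placeholder, and the storage path would also need to be registered as an external location in Unity Catalog:

```sql
-- Two-level namespace (legacy hive_metastore): schema.table
CREATE TABLE public._airbyte_tmp_yhm_orders
USING parquet
LOCATION 's3://<bucket-xyz>/data/6fed6526-b350-40a6-8719-1907a58cfe65/orders';

-- Three-level namespace with Unity Catalog: catalog.schema.table
CREATE TABLE main.public._airbyte_tmp_yhm_orders
USING parquet
LOCATION 's3://<bucket-xyz>/data/6fed6526-b350-40a6-8719-1907a58cfe65/orders';
```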
s
Thanks so much for the update. Could you please follow up in the channel #C021JANJ6TY or on our forum at https://discuss.airbyte.io/? I'm not up to date on the latest here.
👍 1