# integrate-iceberg-datahub
m
How to use datalake/container/bucket names and folders with the Iceberg source? I think I'm at the point where I could use your help to figure out what to do with datalake and folder names... I can certainly speak for my own setup and use case, but I don't think everyone is using Iceberg the same way. I attended yesterday's townhall (got distracted a few times by work though) and I'll need to watch the video again, but I think what was presented (platform instance, container, etc.) will help.

I am using Azure Data Lake Storage Gen2 with hierarchical namespaces. Here is how Azure is organized:

- Datalake_A
  - Container_X
    - Folder_1
      - Iceberg_Table_1
      - Iceberg_Table_2
  - Container_Y
    - Folder_3
  - Container_Z
    - Folder_4
- Datalake_B
  - Container_X
    - Folder_1
      - Iceberg_Table_1

So you can have multiple datalakes (or storage accounts), and each datalake can have one or more containers. Each container can be organized with folders. You can think of a container as a root-level folder: it is technically different, but for simplicity it can be abstracted that way. Just like databases, multiple datalakes can have table name collisions, as in my example above. In this case, there would be two Iceberg tables with the same "name":
`Folder_1.Iceberg_Table_1`
But they would have two different Azure URLs (`abfss://{container_name}@{account_name}.dfs.core.windows.net/{folder}`):

- `abfss://Container_X@Datalake_A.dfs.core.windows.net/Folder_1/Iceberg_Table_1`
- `abfss://Container_X@Datalake_B.dfs.core.windows.net/Folder_1/Iceberg_Table_1`
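To make the collision concrete, here is a small plain-Python sketch (table names taken from the hypothetical hierarchy above): the dotted table name is identical in both lakes, and only the storage account in the abfss URL disambiguates them.

```python
# Hypothetical tables from the hierarchy above: (account, container, folder, table)
tables = [
    ("Datalake_A", "Container_X", "Folder_1", "Iceberg_Table_1"),
    ("Datalake_B", "Container_X", "Folder_1", "Iceberg_Table_1"),
]

def abfss_url(account: str, container: str, folder: str, table: str) -> str:
    # ADLS Gen2 pattern: abfss://{container}@{account}.dfs.core.windows.net/{path}
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder}/{table}"

for account, container, folder, table in tables:
    dotted_name = f"{folder}.{table}"                   # collides across datalakes
    url = abfss_url(account, container, folder, table)  # unique, thanks to the account
    print(f"{dotted_name} -> {url}")
```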
My question is: how should the Iceberg source deal with this? How does it compare to AWS S3? And how would it look for someone using a local filesystem?
@mammoth-bear-12532 or anyone else, any take on this?
l
Hi @modern-monitor-81461! I’ll make sure someone from the team takes a look at this ASAP
b
Yeah this makes sense to me
We should model folders as containers, and the top-level differentiator (which is unique either globally or within your DataHub instance) can be the platform id
This means we'll need to add logic to emit Container entities for the intermediate nodes
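For anyone following along, here is a minimal sketch of what emitting those intermediate Container entities could look like with DataHub's Python emitter. The aspect and emitter class names match the Python SDK, but treat the exact calls, the GMS endpoint, and the guid scheme as assumptions, not the final design:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ContainerClass, ContainerPropertiesClass

# Assumed local GMS endpoint; adjust for your deployment.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical hierarchy from the Azure example above:
# datalake -> container -> folder, each modeled as a DataHub Container.
hierarchy = ["Datalake_A", "Container_X", "Folder_1"]

parent_urn = None
path = []
for name in hierarchy:
    path.append(name)
    # Derive a readable guid from the full path for illustration only;
    # a real source would hash a stable key (platform, instance, path, ...).
    urn = f"urn:li:container:{'.'.join(path).lower()}"

    # Emit the container's display name.
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=urn,
            aspect=ContainerPropertiesClass(name=name),
        )
    )
    # Link this node to its parent container, building the hierarchy.
    if parent_urn is not None:
        emitter.emit(
            MetadataChangeProposalWrapper(
                entityUrn=urn,
                aspect=ContainerClass(container=parent_urn),
            )
        )
    parent_urn = urn
```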
Would you like to see the DataLake itself as a top-level container? Would that be useful? If so, it could even be emitted as a separate container node at the very top level.
The DataLake itself is not globally unique, but rather unique within your Azure account, I'm assuming. Correct?
m
The datalake name (or storage account name) needs to be unique across all Azure tenants. The reason (or at least one of the reasons) is that a storage account (e.g. `storagesample`) is accessed via a set of URLs where the domain is a Microsoft domain (e.g. `https://storagesample.blob.core.windows.net`, `https://storagesample.file.core.windows.net`, etc.).
I don't know if seeing the datalake should be configurable or not. In my organization we use multiple datalakes, and even though the odds are slim that the same folder hierarchy and table name would exist on two different lakes, now that I've said it, it will likely happen! So I'm thinking that if someone knows they will only ever use one datalake, maybe it makes sense to let them configure their Iceberg ingestion source not to use the datalake name as a top-level container?

I'm not too sure how containers will work, but I'm a little worried about when it comes time to create the lineage... Let me explain what I mean. We have Iceberg tables surfaced by Trino: Iceberg tables -> Trino engine -> Superset. Superset and Trino know what a database is and what a schema is; I'm just not sure how a datalake name and a datalake container will fit in. I guess I'll have to apply a transformer in the Superset recipe to add the missing information to enable the lineage... I'm just thinking out loud here in case this scenario has been seen and solved by someone.
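To illustrate the lineage worry, a quick sketch using DataHub's `make_dataset_urn` helper. The naming conventions below are hypothetical, just to show how two sources could key the same physical table differently, which is exactly what would break automatic lineage stitching:

```python
from datahub.emitter.mce_builder import make_dataset_urn

# Hypothetical naming conventions, for illustration only. If the Iceberg
# source keys the dataset by datalake + container + path, while Trino only
# knows catalog.schema.table, the two URNs will not line up:
iceberg_urn = make_dataset_urn(
    platform="iceberg",
    name="Datalake_A.Container_X.Folder_1.Iceberg_Table_1",
    env="PROD",
)
trino_urn = make_dataset_urn(
    platform="trino",
    name="iceberg_catalog.folder_1.iceberg_table_1",  # catalog.schema.table
    env="PROD",
)

# Lineage between these would have to be added explicitly (or via a
# transformer in the recipe) because the names differ across sources.
print(iceberg_urn)
print(trino_urn)
```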
l
Hey @modern-monitor-81461! Checking in here - did you get the direction/input you need to move forward?
m
Thanks @little-megabyte-1074. I'm not sure. I guess Friday's release will help clear a few things up once I get my hands on the new container feature. For me, the code I have right now does the job. It's not perfect and it can be improved, but it works. I don't know how far I need to drive this before I can simply hand over the keys to you, if you want to add it to DataHub. We are planning to put it to the test in our environment with real users in a month or so. I guess the main direction/input I'm looking for is what is missing to make it fit nicely with the other sources DataHub offers. Supporting containers is probably one of them...
l
Awesome, ok - we’ll definitely be including Containers in the next release; John just merged it into Master if you want to start playing around with it beforehand https://github.com/linkedin/datahub/pull/4037 https://github.com/linkedin/datahub/pull/4019
My guess is that we’ll all learn about how Containers work within these nested file structures together, so IMO this is a perfect opportunity for us to test Containers in a new/different way!
I know that @chilly-holiday-80781 has taken a first pass on your draft PR; my suggestion would be to take some time with exploring how Containers behave within your environment, and then we can come back together & see what might be left before shipping as beta & gathering feedback from the community