#integrate-iceberg-datahub

little-megabyte-1074

03/09/2022, 7:25 PM
Hey @modern-monitor-81461! Hope you had a great ski trip 😎 Let me know if there’s anything we can do on our side to help move Iceberg support along! Want to make sure we don’t lose momentum.

modern-monitor-81461

03/09/2022, 7:59 PM
Thanks @little-megabyte-1074, it was really nice ⛷️. Fear not, I always have this contribution on my mind and I want to give it back to datahub. I'm actively working on making datahub part of my org ecosystem and once it happens, it will provide a lot of testing for the Iceberg source.

Here is what I have done so far:
1. I can crawl a filesystem (only Azure Datalake for now, but I'd like to abstract that to support other storage like S3, plain old disks, etc.) and discover and ingest Iceberg tables. It supports filtering on names and locations.
2. I can read an Iceberg manifest and extract the table schema and properties (table name, fields and types, table comment, field comments, table ownership).
3. I support platform instances (different Iceberg catalogs).
4. I can profile the dataset using the table metrics computed by the Iceberg table format (min/max, null count/%). The other stats could be done using an engine like Spark.

What's left to be done (maybe optional):
1. Revisit how fields are created. I'm using the AVRO util (`schema_util.avro_schema_to_mce_fields()`) and I'm not sure if the result is right. I would have liked to create the model objects directly as I originally thought, but it's not mandatory.
2. Support containers for the intermediate folders of a table location.
3. Migrate my code to the new Iceberg Python API once completed (it's in progress and I'm reviewing some parts of it, but it's definitely not done yet).

But the biggest thing is to know if what I have is useful for others. I know it is for my use-case, but I don't have a lot of exposure to how others are using Iceberg.
Do you see a way forward, or would you like to see something different? We can schedule a call if it makes things easier.
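For anyone curious how the profiling point above (min/max and null count/% from Iceberg's own metrics) could work, here is a minimal sketch. It is not the code from the upcoming PR. It assumes the per-file column metrics that Iceberg records for each data file (value counts, null value counts, lower and upper bounds, all keyed by field id) have already been read and that the bounds are already decoded into comparable Python values. `FileMetrics`, `ColumnProfile` and `profile_columns` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class FileMetrics:
    """Hypothetical stand-in for one data file's column metrics, keyed by field id."""
    value_counts: Dict[int, int] = field(default_factory=dict)       # includes nulls
    null_value_counts: Dict[int, int] = field(default_factory=dict)
    lower_bounds: Dict[int, Any] = field(default_factory=dict)       # assumed already decoded
    upper_bounds: Dict[int, Any] = field(default_factory=dict)       # assumed already decoded


@dataclass
class ColumnProfile:
    """Table-level statistics for one column, aggregated over all data files."""
    value_count: int = 0
    null_count: int = 0
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None

    @property
    def null_proportion(self) -> Optional[float]:
        # Undefined when no file reported a value count for this column.
        if self.value_count == 0:
            return None
        return self.null_count / self.value_count


def profile_columns(files: Iterable[FileMetrics]) -> Dict[int, ColumnProfile]:
    """Combine per-file metrics into per-column profiles without scanning any data."""
    profiles: Dict[int, ColumnProfile] = {}
    for metrics in files:
        # Counts add up across files.
        for field_id, count in metrics.value_counts.items():
            profile = profiles.setdefault(field_id, ColumnProfile())
            profile.value_count += count
            profile.null_count += metrics.null_value_counts.get(field_id, 0)
        # The table-wide minimum is the smallest per-file lower bound.
        for field_id, low in metrics.lower_bounds.items():
            profile = profiles.setdefault(field_id, ColumnProfile())
            if profile.min_value is None or low < profile.min_value:
                profile.min_value = low
        # The table-wide maximum is the largest per-file upper bound.
        for field_id, high in metrics.upper_bounds.items():
            profile = profiles.setdefault(field_id, ColumnProfile())
            if profile.max_value is None or high > profile.max_value:
                profile.max_value = high
    return profiles
```

Because everything comes from metadata, this only covers what the table format already tracks; stats such as distinct counts or histograms would still need an engine like Spark.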

little-megabyte-1074

03/09/2022, 9:40 PM
Thanks for the overview, Eric!! This is awesome progress. Let’s go ahead and open the PR for review; we’ll have @helpful-optician-78938 look into how you’re creating fields via AVRO. Container support & Python API cutover can be a fast follow!
👍 1

modern-monitor-81461

03/10/2022, 2:18 AM
Give me a few days to polish a few things and I will open the PR. It should be there sometime next week.
👍🏻 1
👍 1