#integrate-iceberg-datahub

modern-monitor-81461

01/04/2022, 7:31 PM

I guess I can start with what I've done so far. I'm leading a data governance initiative at my organization, and I first started with Amundsen. It was going relatively well until I heard about DataHub. Amundsen didn't support Iceberg either, so I created an extractor for it. My organization uses a Microsoft Azure data lake and a HadoopCatalog (for technical reasons, we could not use Hive... I can give details if interested). Unfortunately, Iceberg's Python API did not support 1- HadoopCatalog (only limited support) or 2- Azure Storage accounts, and it is also being rewritten right now (ETA is unknown, but I would guess Q3-Q4 2022). So I forked the Iceberg repo and added the Azure support (they asked me if I could contribute this work back, but in Java... it's on my todo list, but not at the top). So in Amundsen, I was able to read Azure storage account locations, scan directories for tables, and extract their schema and some other information (row counts, storage space used, etc.). I then heard about you guys and got very excited about your product. It seemed much more production-grade and the community was VERY active on Slack, especially your core team. I found this not to be the case in Amundsen. So I "ported" my Amundsen code to a DataHub IcebergSource, and I am currently able to reproduce what I had in Amundsen, minus the nested fields. That's what I'm working on right now.
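
For readers unfamiliar with the "scan directories for tables and extract their schema" step, here is a minimal sketch of the idea, not the actual extractor. It assumes a HadoopCatalog-style layout on a local filesystem, where each table directory has a `metadata/version-hint.text` file pointing at an uncompressed `v<N>.metadata.json`. Azure access is left out entirely. The function names (`find_hadoop_tables`, `current_schema`, `flatten_fields`) are made up for illustration, and only nested structs are expanded (lists and maps are reported by type name only).

```python
import json
from pathlib import Path


def find_hadoop_tables(warehouse: Path):
    """Yield (table_dir, metadata) for each table found under a warehouse root.

    Assumption: a table is any directory holding metadata/version-hint.text
    plus the matching, uncompressed v<N>.metadata.json file.
    """
    for hint in sorted(warehouse.rglob("version-hint.text")):
        if hint.parent.name != "metadata":
            continue
        version = hint.read_text().strip()
        metadata_file = hint.parent / f"v{version}.metadata.json"
        if metadata_file.is_file():
            yield hint.parent.parent, json.loads(metadata_file.read_text())


def current_schema(metadata: dict) -> dict:
    """Return the current schema from a table metadata document."""
    # Format v2 stores a list of schemas plus "current-schema-id";
    # format v1 stores a single "schema" object.
    if "schemas" in metadata:
        wanted = metadata.get("current-schema-id", 0)
        for schema in metadata["schemas"]:
            if schema.get("schema-id") == wanted:
                return schema
    return metadata["schema"]


def flatten_fields(struct: dict, prefix: str = ""):
    """Yield (dotted_path, type_name) pairs, recursing into nested structs."""
    for field in struct["fields"]:
        path = prefix + field["name"]
        field_type = field["type"]
        if isinstance(field_type, dict):
            # Complex types are JSON objects; primitives are plain strings.
            yield path, field_type["type"]
            if field_type["type"] == "struct":
                yield from flatten_fields(field_type, path + ".")
        else:
            yield path, field_type
```

Calling `flatten_fields(current_schema(metadata))` on each result of `find_hadoop_tables(Path("/some/warehouse"))` (a hypothetical path) would give dotted field paths such as `user.name` for nested structs, which is roughly the shape a catalog ingestion source would need.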
