# integrate-iceberg-datahub
I guess I can start with what I've done so far. I'm leading a data governance initiative at my organization, and I first started with Amundsen. It was going relatively well until I heard about DataHub. Amundsen didn't support Iceberg either, so I created an extractor for it. My organization is using a Microsoft Azure data lake and a HadoopCatalog (for technical reasons, we could not use Hive... I can give details if interested).

Unfortunately, Iceberg's Python API did not support: 1- HadoopCatalog (limited support only), 2- Azure Storage accounts. It is also being rewritten right now (ETA is unknown, but I would guess Q3-Q4 2022). So I forked the Iceberg repo and added the Azure support (they asked me if I could contribute this work back, but in Java... it's on my todo list, but not at the top). So in Amundsen, I was able to read Azure storage account locations, scan directories for tables, and extract their schema and some other information (row counts, storage space used, etc.).

I then heard about you guys and got very excited about your product. It seemed much more production-grade, and the community was VERY active on Slack, especially your core team. I did not find this to be the case with Amundsen. So I "ported" my Amundsen code to a DataHub IcebergSource, and I am currently able to reproduce what I had in Amundsen, minus the nested fields. That's what I'm working on right now.
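
To make the "scan directories for tables and extract their schema" part concrete, here is a rough sketch of the idea. This is not the actual extractor code. It walks a local directory instead of an Azure storage account. It assumes the usual HadoopCatalog layout, where each table directory has a `metadata/version-hint.text` file pointing at an uncompressed `v<N>.metadata.json`. All function names (`find_table_dirs`, `load_current_metadata`, `current_schema`, `flatten_fields`) are made up for illustration. The last function shows one possible way to turn nested fields into dotted paths.

```python
import json
from pathlib import Path


def find_table_dirs(warehouse: Path):
    """Yield directories that look like HadoopCatalog tables.

    Assumption: a table directory contains metadata/version-hint.text.
    """
    for hint in sorted(warehouse.rglob("version-hint.text")):
        if hint.parent.name == "metadata":
            yield hint.parent.parent


def load_current_metadata(table_dir: Path) -> dict:
    """Read the metadata JSON that version-hint.text points at.

    Assumption: metadata files are uncompressed and named v<N>.metadata.json.
    """
    meta_dir = table_dir / "metadata"
    version = (meta_dir / "version-hint.text").read_text().strip()
    return json.loads((meta_dir / f"v{version}.metadata.json").read_text())


def current_schema(metadata: dict) -> dict:
    """Return the current schema, handling both metadata layouts.

    Newer metadata keeps a "schemas" list plus "current-schema-id";
    older metadata has a single "schema" entry.
    """
    if "schemas" in metadata:
        current_id = metadata.get("current-schema-id", 0)
        for schema in metadata["schemas"]:
            if schema.get("schema-id", 0) == current_id:
                return schema
    return metadata["schema"]


def _type_name(type_node) -> str:
    # Primitive types are plain strings; nested types are dicts with a "type" key.
    return type_node if isinstance(type_node, str) else type_node["type"]


def flatten_fields(type_node, prefix=""):
    """Yield (dotted_path, type_name) pairs, descending into nested types."""
    if isinstance(type_node, str):
        return  # primitive: nothing nested to descend into
    kind = type_node["type"]
    if kind == "struct":
        children = [(f["name"], f["type"]) for f in type_node["fields"]]
    elif kind == "list":
        children = [("element", type_node["element"])]
    elif kind == "map":
        children = [("key", type_node["key"]), ("value", type_node["value"])]
    else:
        children = []
    for name, child in children:
        path = f"{prefix}{name}"
        yield path, _type_name(child)
        yield from flatten_fields(child, path + ".")
```

Usage would look something like `for table in find_table_dirs(Path("/path/to/warehouse")): print(table, list(flatten_fields(current_schema(load_current_metadata(table)))))`. The `element`, `key` and `value` path segments for lists and maps are just a naming choice for this sketch. A real source would pick whatever convention its catalog expects.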