# integrate-iceberg-datahub
m
@mammoth-bear-12532 & all, I need to give you an update on the Iceberg source initiative. When I started coding an Iceberg source (started in Amundsen, then transitioned to datahub), I immediately forked the Iceberg git repo because I needed to add an Azure Datalake connector (only S3 was supported). It was the first thing I did, and while doing it I realized that this library (Iceberg Python) was rather incomplete and buggy! There were features that simply did not work, and I started doubting that it had ever been used... I contributed a few PRs back to fix a number of issues (and even finished incomplete classes filled with TODOs), and while doing this I got to know a few Iceberg devs, who filled me in on a refactoring initiative. The "Python legacy" (that's what they call the first, incomplete implementation) was unfinished work with Netflix (they have their own internal fork) and it simply needed to be rewritten. That refactoring started just before I began my work on Amundsen, so there wasn't much to build on; I had to rely on Python legacy to do my work. I managed to fix everything on my code path and now I have something that works. I contributed everything back to Iceberg and they are using my code (and making it better) in the new implementation.

When I was asked to "finalize" my datahub PR, I started writing test cases and updated the Python dependencies. That's when I realized that Iceberg Python is not even published on PyPI... They don't have a build for it. I brought it up on their Slack and they said they do not want to publish it while the new API is in the works. I asked when release 0.1 would be available (0.1 contains pretty much everything I need for the Iceberg source), and if everything goes as planned, it would be this summer.

I see 2 options:
1. I build Iceberg Python legacy and we save a copy into the datahub git repo. We use it as an internal/private dependency (that's my setup right now). We use my fork as the repo and we hope not a lot of users will request changes! Then I rewrite the source using the new API when 0.1 is released.
2. We put this integration on ice until 0.1 is released. My org will be using it meanwhile and I will maintain it, but it will not be available to datahub users. Your roadmap will be delayed...

The rewrite using 0.1 shouldn't be too hard since the logic will remain the same and the APIs are somewhat similar. I would still like to have my PR reviewed by @helpful-optician-78938 since it will improve my current implementation, and odds are I will be able to reuse a good chunk. But I know time is precious, so I totally understand if you would prefer to save Ravindra's time for the final implementation.
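For what it's worth, option 1 boils down to vendoring a wheel built from the fork and pinning it as a private dependency. A minimal sketch (all names and paths here are illustrative, not the actual datahub layout), using a PEP 508 direct reference so nothing needs to exist on PyPI:

```python
# Hypothetical sketch of option 1: a wheel is built from the legacy Iceberg
# Python fork, checked into the repo (e.g. under ./vendor), and pinned as a
# private dependency via a PEP 508 direct reference. Package names, versions,
# and the file path are placeholders for illustration only.
install_requires = [
    "acryl-datahub",  # illustrative stand-in for the real dependency list
    # wheel built from the fork and committed alongside the source
    "iceberg-legacy @ file:///path/to/vendor/iceberg_legacy-0.0.1-py3-none-any.whl",
]

# setup.py would pass this list straight to setuptools.setup(install_requires=...)
print(install_requires[1].split(" @ ")[0])  # iceberg-legacy
```

The upside is that `pip install` resolves the dependency entirely from the repo; the downside (as noted above) is that any upstream change request has to go through the fork.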
m
Thanks for the note Eric. /cc @dazzling-judge-80093 also
@modern-monitor-81461: can you remind me what the dependency would be between the `iceberg source` and this code?
m
The Iceberg Python legacy (code here) provides readers and writers for Iceberg tables. When you point that library at a location (local filesystem, S3 bucket, Azure Datalake...), it will read the table metadata and its manifests. It returns a `Table` abstraction that you can then use to access various table properties like its schema, snapshots, etc... The `iceberg source` uses that library to do its thing. In order to point the library to the proper location, I use the Azure Datalake library to crawl a datalake and find the Iceberg tables. I DO NOT leverage a catalog (some would say I use the HadoopCatalog, but there is no such thing in the Iceberg Python implementation right now... that's coming in the new version). The crawling part of the source could maybe be abstracted and shared with the datalake source; that's something I'd like to discuss with you guys down the road, but I'm getting off topic. I'm not sure I'm answering your question though...?
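To make the flow concrete, here is a tiny self-contained stand-in (NOT the real iceberg library API, just a model of the pattern described above): point at a table location, parse the metadata file, and get back a `Table` abstraction exposing schema and snapshots.

```python
# Illustrative stand-in modeling the legacy Iceberg Python reader flow:
# location -> parse table metadata -> Table abstraction (schema, snapshots).
# All class/function names here are hypothetical, not the library's real API.
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Table:
    location: str
    schema: dict
    snapshots: List[dict] = field(default_factory=list)

def load_table(metadata_json: str, location: str) -> Table:
    """Parse an Iceberg-style table metadata blob into a Table abstraction."""
    meta = json.loads(metadata_json)
    return Table(location=location,
                 schema=meta.get("schema", {}),
                 snapshots=meta.get("snapshots", []))

# Minimal metadata blob shaped loosely like an Iceberg metadata file
metadata = json.dumps({
    "schema": {"fields": [{"name": "id", "type": "long"}]},
    "snapshots": [{"snapshot-id": 1}],
})
table = load_table(metadata, "abfss://lake/warehouse/db/events")
print(table.schema["fields"][0]["name"])  # id
print(len(table.snapshots))              # 1
```

The datahub source then only has to walk `table.schema` and `table.snapshots` to emit metadata; the crawling step described above is what supplies the `location` argument.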
m
got it... so you're looking for a home for the python legacy code + possibly a pip package that the datahub iceberg source can depend on
m
Correct.
m
and the non-legacy version (https://github.com/apache/iceberg/tree/master/python) is not ready for primetime?
m
Right. They have a few milestones planned and the first one (version 0.1) is all about "reading" Iceberg tables, which is pretty much all I need. That 0.1 milestone is hopefully going to be ready this summer. So either we find a home for the legacy version, or we wait until this summer.
m
and in terms of capabilities, is there a diff between the legacy version and the currently in-dev version in terms of what metadata they can read?
the legacy version is ahead currently?
@modern-monitor-81461: I've opened up <https://github.com/acryldata/py-iceberg> for you to house the code for the legacy implementation.
b
So to clarify- we are putting this integration on hold?
m
@big-carpet-38439 Nope, we're going forward with it. I'm waiting on two things to be resolved right now:
1. A build to publish the Iceberg legacy implementation to PyPI (Shirshanka is on it)
2. A fix/help for a problem related to Avro (@helpful-optician-78938 was supposed to give me a hand, but I think he got sidetracked. I'm waiting for the iceberg dependency to be published on PyPI before requesting his help again. Running the unit tests in my PR will expose the Avro issue)
b
so we're going forward with the legacy API, and that may become deprecated this summer?
m
Correct. I'm in contact with some Iceberg python devs to follow their progress.
By having our own Iceberg build, the Iceberg source I'm contributing could stay alive until we wish to refactor it, since the Iceberg table format hasn't changed.