# integrate-iceberg-datahub
numerous-byte-87938:
Hi guys, thank you so much for the prior efforts on making the Iceberg source for DataHub! My team is also trying to integrate our Iceberg tables from S3 and wants to see if there’s anything we could help with here. From reading through the threads and docs, my current understanding is that:
1. The current Iceberg source is based on the legacy Python code, which is deprecated and limits extension to other data lakes (e.g. S3) and catalogs (e.g. HiveCatalog).
2. @modern-monitor-81461 has a new PR to switch to the new SDK, pyiceberg, and remove the limitation, but it’s currently blocked by the pyiceberg 0.4.0 release.

We’re very excited to see the new source come alive, and have three questions in mind:
1. How extensible will the new source be? Let’s say we added a few table properties in our Iceberg fork and want to pull them in through the Iceberg source; would that be easy to extend?
2. How backward-compatible will the new source be? While we are trying to upgrade, our current Iceberg version is 0.12.0; would it be compatible with the new source?
3. While it’s super hard to predict an OSS release plan given the review and publish cycles, is it crazy to expect that the new source would land somewhere between Q2 and Q3 2023?
modern-monitor-81461:
Hi @numerous-byte-87938, your understanding of the situation is spot on. pyiceberg 0.4.0 still has 8 open issues: https://github.com/apache/iceberg/milestone/27
1. Any table properties will be pulled by the Iceberg source. See the following code (this hasn't been pushed yet; it's coming from my pending PR):
# Dataset properties aspect.
custom_properties = table.metadata.properties.copy()
custom_properties["location"] = table.metadata.location
custom_properties["format-version"] = str(table.metadata.format_version)
if table.current_snapshot():
    custom_properties["snapshot-id"] = str(table.current_snapshot().snapshot_id)
    custom_properties["manifest-list"] = table.current_snapshot().manifest_list
dataset_properties = DatasetPropertiesClass(
    tags=[],
    description=table.metadata.properties.get("comment", None),
    customProperties=custom_properties,
)
dataset_snapshot.aspects.append(dataset_properties)
2. I did not try with 0.12.0, but I can tell you that it works with 0.14. The spec hasn't changed in a while, so odds are it will work.
3. In my organization, we are still using the HadoopCatalog (I know, it's bad!), but we have a plan to migrate to the REST catalog. Other work duties have prevented me from getting to this task. I was hoping to migrate to the REST catalog before pyiceberg 0.4.0 came out, so I could test the new source with it. If 0.4.0 comes out before I'm ready, I don't mind creating a PR so you can test it. I could also stand up a REST catalog using Tabular's docker compose setup and test against that, but I'm just short on time. All that to say: if need be, I can submit a PR for others to have a look at.
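
To illustrate point 1 above, here is a minimal, self-contained sketch of how copying the table properties lets a custom key pass through untouched. It does not use pyiceberg or DataHub: `FakeTable`, `FakeMetadata`, `FakeSnapshot`, `build_custom_properties`, the `acme.retention-days` key, and the S3 paths are all hypothetical stand-ins that only mirror the attribute names used in the snippet above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeSnapshot:
    """Hypothetical stand-in for an Iceberg snapshot object."""
    snapshot_id: int
    manifest_list: str


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for the table metadata object."""
    location: str
    format_version: int
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeTable:
    """Hypothetical stand-in for a loaded table (not the pyiceberg class)."""
    metadata: FakeMetadata
    snapshot: Optional[FakeSnapshot] = None

    def current_snapshot(self) -> Optional[FakeSnapshot]:
        return self.snapshot


def build_custom_properties(table: FakeTable) -> Dict[str, str]:
    # Copy every table property first, so keys added in a fork are kept as-is.
    custom_properties = dict(table.metadata.properties)
    custom_properties["location"] = table.metadata.location
    custom_properties["format-version"] = str(table.metadata.format_version)
    # Snapshot details are only added when the table has a current snapshot.
    snapshot = table.current_snapshot()
    if snapshot is not None:
        custom_properties["snapshot-id"] = str(snapshot.snapshot_id)
        custom_properties["manifest-list"] = snapshot.manifest_list
    return custom_properties


table = FakeTable(
    metadata=FakeMetadata(
        location="s3://example-bucket/warehouse/db/events",
        format_version=1,
        # "acme.retention-days" plays the role of a fork-specific property.
        properties={"comment": "Click events", "acme.retention-days": "30"},
    ),
    snapshot=FakeSnapshot(
        snapshot_id=42,
        manifest_list="s3://example-bucket/warehouse/db/events/metadata/snap-42.avro",
    ),
)
props = build_custom_properties(table)
print(props["acme.retention-days"])
```

Under these assumptions, no source change is needed for a new key, provided the fork stores it as a regular table property. Anything kept outside the properties map would still need extra code.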