integration-iceberg-datahub
  • mammoth-bear-12532 (01/04/2022, 7:25 PM)
    I worked with Iceberg as part of Gobblin streaming ingestion, so I have an okay understanding of its capabilities.
  • modern-monitor-81461 (01/05/2022, 7:07 PM)
    I'm looking at DataHub and there is something I cannot explain. When you mouse over a type in the UI, I think you are supposed to see the native_data_type from Avro. At least that's what I see (I have attached a screenshot with a Time type and a tooltip of Timestampz, which is the Iceberg native_data_type). But there is a field that is mapped to Time and the tooltip shows Date. I was expecting to see the type as Date and not Time. Here is the Iceberg metadata for that field:
    }, {
      "id" : 227,
      "name" : "date",
      "required" : false,
      "type" : "date"
    }, {
    As you can see, its type is date and it will be mapped to DateType in Python. In my IcebergSource, I create the following Avro schema:
    elif isinstance(type, IcebergTypes.DateType):
        dateType: IcebergTypes.DateType = type
        return {
            "type": "int",
            "logicalType": "date",
            "native_data_type": repr(dateType),
            "_nullable": True,
        }
    where repr(dateType) is:
    def __repr__(self):
        return "date"
    Is it because a logical Avro type of date is mapped to a Time type in the UI, or is there something broken on my side? I don't know if all of this makes sense without demoing it! Sorry if it's confusing.
  • modern-monitor-81461 (01/05/2022, 7:43 PM)
    I also have a question regarding platforms. I always assumed that I would end up creating an iceberg platform. I looked at data_platforms.json as well as your demo instance and saw a hive and an AWS S3 platform. I'm confused by the S3 one... Does it exist for organizations that simply store files? What about orgs like mine that store Iceberg tables in an Azure Storage Account? In my mind, S3 is equivalent to an Azure Storage Account, so which one should I use? Iceberg seems like the logical choice, but I'm curious to know more about platforms.
  • modern-monitor-81461 (01/18/2022, 10:33 PM)
    Just a quick update about where I'm at with my work. I have been working on profiling for the past few days and I managed to get this. You can ignore the Distinct Count and Distinct % since it is not possible to compute those using only the manifest metrics (there is a set of metrics for each data file, so a distinct value in file A and a distinct value in file B do not mean that we have 2 distinct values in the table... it could be the same value, in which case the distinct count for the table would be 1). Min, Max, Null Count and Null % are reliable though. Is it a problem if the Iceberg source profiling does not provide a full picture? My code would greatly benefit from a review since I don't think I leveraged all the tooling from DataHub ingestion. What would you recommend: that I polish it as much as I can and then ask for a review, or that I ask sooner in case I need to do a big refactoring? I don't want to waste your time too much, but I don't want to waste mine either! 😉
  • modern-monitor-81461 (01/28/2022, 2:19 AM)
    @chilly-holiday-80781 and all, I have opened a draft PR for Iceberg: https://github.com/linkedin/datahub/pull/3999. It's far from complete, but I have quite a few things to discuss and I think it would be easier if you guys can look at the code before we start any of these discussions. I am very open to criticism, so I'll take any feedback, good or bad 😅
  • modern-monitor-81461 (01/29/2022, 3:15 PM)
    How should the Iceberg source use datalake/container/bucket names and folders? I think I'm at the point where I could use your help to figure out what to do with datalake and folder names... I can certainly speak for my setup and use case, but I don't think everyone is using Iceberg the same way. I attended yesterday's townhall (got distracted a few times by work though) and I'll need to watch the video again, but I think what was presented (platform instance, container, etc.) will help. I am using Azure Data Lake Storage Gen2 with hierarchical namespaces. Here is how Azure is organized:
    • Datalake_A
      ◦ Container_X
        ▪︎ Folder_1
          • Iceberg_Table_1
        ▪︎ Iceberg_Table_2
      ◦ Container_Y
        ▪︎ Folder_3
      ◦ Container_Z
        ▪︎ Folder_4
    • Datalake_B
      ◦ Container_X
        ▪︎ Folder_1
          • Iceberg_Table_1
    So you can have multiple datalakes (or storage accounts) and each datalake can have one or multiple containers. Each container can be organized with folders. You can see a container as a root-level folder; it is technically different, but for simplicity it can be abstracted. Just like databases, multiple datalakes can have table name collisions, as my example shows. In this case, there would be 2 Iceberg tables with the same "name":
    Folder_1.Iceberg_Table_1
    But they would have two different Azure URLs (abfss://{container_name}@{account_name}.dfs.core.windows.net/{folder}):
    • abfss://Container_X@Datalake_A.dfs.core.windows.net/Folder_1/Iceberg_Table_1
    • abfss://Container_X@Datalake_B.dfs.core.windows.net/Folder_1/Iceberg_Table_1
    My question is how should the Iceberg source deal with this? How does it compare to AWS S3? How would it look for someone using a local filesystem?
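    One possible answer, sketched with DataHub's platform-instance support (treating the datalake/storage account as the platform instance is my assumption here, not a settled convention):
    from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

    # The two colliding tables get distinct URNs once the datalake name is
    # carried as the platform instance:
    urn_a = make_dataset_urn_with_platform_instance(
        platform="iceberg",
        name="Container_X.Folder_1.Iceberg_Table_1",
        platform_instance="Datalake_A",
        env="PROD",
    )
    urn_b = make_dataset_urn_with_platform_instance(
        platform="iceberg",
        name="Container_X.Folder_1.Iceberg_Table_1",
        platform_instance="Datalake_B",
        env="PROD",
    )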
  • modern-monitor-81461 (03/01/2022, 4:45 AM)
    Hi @little-megabyte-1074, I'm currently on vacation for a ski trip; I will get back to you next week ⛷️ 😉
  • little-megabyte-1074 (03/09/2022, 7:25 PM)
    Hey @modern-monitor-81461! Hope you had a great ski trip 😎 let me know if there’s anything we can do on our side to help move through Iceberg support! Want to make sure we don’t lose momentum :teamwork:
  • helpful-optician-78938 (03/15/2022, 11:40 PM)
    Hi @modern-monitor-81461, thanks for reporting it. This is indeed a bug. Just built a fix and tested it. Will clean it up, add test coverage and raise the PR to OSS soon with the fix. In the meantime, if you want to unblock yourself, you can change this code to
    type=self._converter._get_column_type(
        actual_schema.type,
        (
            getattr(actual_schema, "logical_type", None)
            or actual_schema.props.get("logicalType")
        ),
    ),
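    The fallback matters because a schema object can carry its logical type in two places; a tiny illustration with stand-in classes (not the actual avro library types):
    class ParsedSchema:
        # Some schema objects expose the logical type as an attribute...
        logical_type = "date"
        props = {}

    class PropsOnlySchema:
        # ...others only record it in the props dict.
        props = {"logicalType": "date"}

    def logical_of(schema):
        return getattr(schema, "logical_type", None) or schema.props.get("logicalType")

    assert logical_of(ParsedSchema()) == "date"
    assert logical_of(PropsOnlySchema()) == "date"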
  • modern-monitor-81461 (03/15/2022, 11:46 PM)
    Thanks for the bug confirmation @helpful-optician-78938. It will probably fix DecimalType, as well as the TimeType, Timestamp and TimestampZ types... They all use Avro logical types. I have another thing that bugs me regarding field descriptions and Avro; it will be easier to show once I update my PR. I will do that tomorrow morning and let you know.
  • modern-monitor-81461 (03/16/2022, 5:02 PM)
    @mammoth-bear-12532 & all, I need to give you an update on the Iceberg source initiative. When I started coding an Iceberg source (started in Amundsen, then transitioned to :datahub:), I immediately forked the Iceberg git repo as I needed to add an Azure datalake connector (only S3 was supported). It's the first thing I did, and while doing it I realized that this library (Iceberg Python) was rather incomplete and buggy! There were features that simply did not work and I started doubting that it had ever been used... I contributed a few PRs back to fix a number of issues (and even finished incomplete classes filled with TODOs), and while doing this I got to know a few Iceberg devs and they filled me in on a refactoring initiative. The Python legacy (that's how they call the 1st incomplete implementation) was unfinished work with Netflix (they have their own internal fork) and it simply needed to be re-written. That refactoring started just before I started my work on Amundsen, so there wasn't much to be used. I had to rely on Python legacy to do my work. I managed to fix everything on my code path and now I have something that works. I contributed everything back to Iceberg and they are using my code (and making it better) in the new implementation. When I was asked to "finalize" my :datahub: PR, I started to write test cases and updated the Python dependencies. That's when I realized that Iceberg Python is not even published on PyPI... They don't have a build for it. I brought it up on their Slack and they said they do not want to publish it since the new API is in the works. I asked when release 0.1 would be available (0.1 contains pretty much what I need for the Iceberg source) and if everything goes as planned, it would be this summer. I see 2 options:
    1. I build Iceberg Python legacy and we save a copy into :datahub: git. We use it as an internal/private dependency (that's my setup right now, sketched below). We use my fork as a repo and we hope not a lot of users will request changes! Then I re-write the source using the new API when 0.1 is released.
    2. We put this integration on ice until 0.1 is released. My org will be using it meanwhile and I will maintain it, but it will not be available to :datahub: users. Your roadmap will be delayed... The re-write using 0.1 shouldn't be too hard since the logic will remain the same and the APIs are somewhat similar.
    I would still like to have my PR reviewed by @helpful-optician-78938 since it will improve my current implementation and odds are that I will be able to re-use a good chunk. But I know time is precious, so I totally understand if you would prefer saving Ravindra's time for the final implementation.
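    For option 1, a sketch of how the fork could be pinned as a direct dependency (the repository owner, tag, and subdirectory below are placeholders, since the legacy library has no published artifact):
    # In the ingestion framework's setup.py: a PEP 508 direct reference to a
    # git fork. <fork-owner>, <tag> and the subdirectory are placeholders.
    iceberg_dep = (
        "iceberg @ git+https://github.com/<fork-owner>/iceberg.git@<tag>"
        "#subdirectory=python"
    )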
  • red-lizard-30438 (04/26/2022, 5:38 AM)
    Hi Team, I am looking for an Iceberg solution in DataHub to ingest metadata from Iceberg. I came across this channel, so I wanted to know: do we have a working solution in DataHub? How can we integrate it? Please share the integration document.
  • big-carpet-38439 (05/02/2022, 3:51 PM)
    @modern-monitor-81461 Can you raise a PR with what you have so far? @red-lizard-30438 and folks are interested in trying to extend it for S3 🙂