• modern-monitor-81461
    8 months ago
    Just a quick update about where I'm at with my work. I have been working on profiling for the past few days and I managed to get this. You can ignore the Distinct Count and Distinct % since it is not possible to compute those using only the manifest metrics (there is a set of metrics for each data file, so a distinct value in file A and a distinct value in file B do not mean that we have 2 distinct values in the table... it could be the same value, in which case the distinct count for the table would be 1). Min, Max, Null Count and Null % are reliable though. Is it a problem if the Iceberg source profiling does not provide a full picture? My code would greatly benefit from a review since I don't think I leveraged all the tooling from DataHub ingestion. What would you recommend? Should I polish it as much as I can before asking for a review, or ask sooner in case I need to do a big refactoring? I don't want to waste your time too much, but I don't want to waste mine either! 😉
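    To make the aggregation issue concrete, here is a minimal sketch (illustrative only, not the actual source code) of how per-file manifest stats combine for a single column; file_metrics is a hypothetical stand-in for the stats read from the manifests:

    # Per-file column stats as they might be read from Iceberg manifests.
    file_metrics = [
        {"min": 3, "max": 40, "null_count": 2, "row_count": 100},
        {"min": 7, "max": 55, "null_count": 0, "row_count": 80},
    ]

    # Safe to aggregate across files: simple reductions over per-file values.
    table_min = min(m["min"] for m in file_metrics)           # 3
    table_max = max(m["max"] for m in file_metrics)           # 55
    table_nulls = sum(m["null_count"] for m in file_metrics)  # 2
    table_rows = sum(m["row_count"] for m in file_metrics)    # 180
    null_pct = 100.0 * table_nulls / table_rows               # ~1.1%

    # Not safe: per-file distinct counts cannot be combined without the values
    # themselves; two files with 10 distinct values each could give anywhere
    # from 10 to 20 distinct values at the table level.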
    3 replies
• modern-monitor-81461
    7 months ago
    @chilly-holiday-80781 and all, I have opened a draft PR about Iceberg: https://github.com/linkedin/datahub/pull/3999. It's far from complete, but I have quite a few things to discuss and I think it would be easier if you guys can look at the code before we start any of these discussions. I am very open to criticism, so I'll take any feedback, good or bad 😅
    1 reply
• modern-monitor-81461
    7 months ago
    How should datalake/container/bucket names and folders be used with the Iceberg source? I think I'm at the point where I could use your help to figure out what to do with datalake and folder names... I can certainly speak for my setup and use-case, but I don't think everyone is using Iceberg the same way. I attended yesterday's townhall (though I got distracted a few times by work) and I'll need to watch the video again, but I think what was presented (platform instance, container, etc...) will help. I am using Azure Data Lake Gen2 with hierarchical namespaces. Here is how Azure is organized:
    • Datalake_A
        ◦ Container_X
            ▪︎ Folder_1
                • Iceberg_Table_1
                • Iceberg_Table_2
        ◦ Container_Y
            ▪︎ Folder_3
        ◦ Container_Z
            ▪︎ Folder_4
    • Datalake_B
        ◦ Container_X
            ▪︎ Folder_1
                • Iceberg_Table_1
    So you can have multiple datalakes (or storage accounts), and each datalake can have one or multiple containers. Each container can be organized with folders. You can see a container as a root-level folder: it is technically different, but for simplicity, it can be abstracted. Just like with databases, multiple datalakes can have table name collisions, as shown in my example. In this case, there would be 2 Iceberg tables with the same "name", Folder_1.Iceberg_Table_1, but they would have two different Azure URLs (abfss://{container_name}@{account_name}.dfs.core.windows.net/{folder}):
    • abfss://Container_X@Datalake_A.dfs.core.windows.net/Folder_1/Iceberg_Table_1
    • abfss://Container_X@Datalake_B.dfs.core.windows.net/Folder_1/Iceberg_Table_1
    My question is how should the Iceberg source deal with this? How does it compare to AWS S3? How would it look for someone using a local filesystem?
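    One possible mapping (just a sketch of an idea, not what the source does today) would be to treat the storage account as the platform instance and build the dataset name from the container and folder path, which keeps the two colliding tables distinct:

    from urllib.parse import urlparse

    def abfss_to_names(url: str) -> tuple[str, str]:
        # abfss://{container}@{account}.dfs.core.windows.net/{folder}/{table}
        parsed = urlparse(url)
        container, host = parsed.netloc.split("@", 1)
        account = host.split(".", 1)[0]  # storage account -> platform instance
        name = ".".join([container, *parsed.path.strip("/").split("/")])
        return account, name

    abfss_to_names("abfss://Container_X@Datalake_A.dfs.core.windows.net/Folder_1/Iceberg_Table_1")
    # -> ("Datalake_A", "Container_X.Folder_1.Iceberg_Table_1")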
    14 replies
• modern-monitor-81461
    6 months ago
    Hi @little-megabyte-1074, I'm currently on vacation for a ski trip; I will get back to you next week ⛷️ 😉
    1 reply
• little-megabyte-1074
    6 months ago
    Hey @modern-monitor-81461! Hope you had a great ski trip 😎 Let me know if there's anything we can do on our side to help move Iceberg support along! Want to make sure we don't lose momentum :teamwork:
    4 replies
• helpful-optician-78938
    6 months ago
    Hi @modern-monitor-81461, thanks for reporting it. This is indeed a bug. Just built a fix and tested it. Will clean it up, add test coverage, and raise a PR with the fix to OSS soon. In the meantime, if you want to unblock yourself, you can change this code to:
    type=self._converter._get_column_type(
        actual_schema.type,
        (
            # prefer the attribute; fall back to the schema props
            getattr(actual_schema, "logical_type", None)
            or actual_schema.props.get("logicalType")
        ),
    ),
    2 replies
• modern-monitor-81461
    6 months ago
    Thanks for the bug confirmation @helpful-optician-78938. It will probably fix DecimalType, as well as the TimeType, Timestamp and TimestampZ types... They all use Avro logical types. I have another thing that bugs me regarding field descriptions and Avro. It will be easier to show once I update my PR. I will do that tomorrow morning and let you know.
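    For reference, these are the kinds of Avro logical type declarations involved (illustrative field schemas, not taken from the PR):

    # Illustrative Avro field schemas: all of these Iceberg types are carried
    # as Avro logical types, which is why a single fix in the logical-type
    # lookup covers them.
    decimal_field = {
        "name": "price",
        "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2},
    }
    timestamp_field = {
        "name": "created_at",
        "type": {"type": "long", "logicalType": "timestamp-micros"},
    }
    time_field = {
        "name": "start_time",
        "type": {"type": "long", "logicalType": "time-micros"},
    }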
    12 replies
• modern-monitor-81461
    6 months ago
    @mammoth-bear-12532 & all, I need to give you an update on the Iceberg source initiative. When I started coding an Iceberg source (started in Amundsen, then transitioned to :datahub:), I immediately forked the Iceberg git repo as I needed to add an Azure datalake connector (only S3 was supported). It's the first thing I did, and while doing it, I realized that this library (Iceberg Python) was rather incomplete and buggy! There were features that simply did not work, and I started doubting that it had ever been used... I contributed a few PRs back to fix a number of issues (and even finished incomplete classes filled with TODOs), and while doing this, I got to know a few Iceberg devs and they filled me in on a refactoring initiative. The Python legacy (that's how they call the 1st incomplete implementation) was unfinished work done with Netflix (they have their own internal fork) and it simply needed to be re-written. That refactoring started just before I started my work on Amundsen, so there wasn't much to be used; I had to rely on Python legacy to do my work. I managed to fix everything on my code path and now I have something that works. I contributed everything back to Iceberg and they are using my code (and making it better) in the new implementation.
    When I was asked to "finalize" my :datahub: PR, I started to write test cases and updated the Python dependencies. That's when I realized that Iceberg Python is not even published on PyPI... They don't have a build for it. I brought it up on their Slack, and they said they do not want to publish it since the new API is in the works. I asked when release 0.1 would be available (0.1 contains pretty much what I need for the Iceberg source) and, if everything goes as planned, it would be this summer. I see 2 options:
    1. I build Iceberg Python legacy and we save a copy into :datahub: git. We use it as an internal/private dependency (that's my setup right now). We use my fork as a repo and we hope not a lot of users will request changes! Then I re-write the source using the new API when 0.1 is released.
    2. We put this integration on ice until 0.1 is released. My org will be using it meanwhile and I will maintain it, but it will not be available to :datahub: users. Your roadmap will be delayed... The re-write using 0.1 shouldn't be too hard since the logic will remain the same and the APIs are somewhat similar.
    I would still like to have my PR reviewed by @helpful-optician-78938 since it will improve my current implementation, and odds are that I will be able to re-use a good chunk. But I know time is precious, so I totally understand if you would prefer saving Ravindra's time for the final implementation.
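    As a concrete illustration of option 1, the private dependency could be declared as a PEP 508 direct reference to the fork rather than a PyPI release (a sketch only; <fork> and <ref> are placeholders, not real values):

    # setup.py (sketch): pull the legacy Iceberg Python library from a git
    # fork, since it has no published release on PyPI.
    install_requires = [
        "iceberg @ git+https://github.com/<fork>/iceberg.git@<ref>#subdirectory=python",
    ]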
    16 replies
• red-lizard-30438
    5 months ago
    Hi team, I am looking for an Iceberg solution in DataHub to ingest metadata from Iceberg. I came across this channel, so I wanted to know: do we have a working solution in DataHub? How can we integrate it? Please share the integration documentation.
    6 replies
• big-carpet-38439
    4 months ago
    @modern-monitor-81461 Can you raise a PR with what you have so far? @red-lizard-30438 and folks are interested in trying to extend it for S3 🙂
    1 reply