# integrate-iceberg-datahub
  • m

    modern-monitor-81461

    01/28/2022, 2:19 AM
    @chilly-holiday-80781 and all, I have opened a draft PR about Iceberg: https://github.com/linkedin/datahub/pull/3999. It's far from complete, but I have quite a few things to discuss and I think it would be easier if you guys could look at the code before we start any of those discussions. I am very open to criticism, so I'll take any feedback, good or bad 😅
    🙌 2
  • m

    modern-monitor-81461

    01/28/2022, 2:25 AM
    The biggest issue I've had lately has been filling the gaps in the Iceberg Python library. I contributed 2 or 3 PRs this week just to make the code work. The Iceberg devs are refactoring this library since they know it is incomplete and they want something better. But in the meantime, we'll have to live with it... That's one of the items I'd like to discuss, but we can do that later.
  • m

    modern-monitor-81461

    01/29/2022, 3:15 PM
    How to use datalake/container/bucket names and folders with the Iceberg source? I think I'm at the point where I could use your help to figure out what to do with datalake and folder names... I can certainly speak for my setup and use case, but I don't think everyone is using Iceberg the same way. I attended yesterday's townhall (got distracted a few times by work though) and I'll need to watch the video again, but I think what was presented (platform instance, container, etc...) will help. I am using Azure Datalake Gen2 with hierarchical namespaces. Here is how Azure is organized:
    • Datalake_A
      ◦ Container_X
        ▪︎ Folder_1
          • Iceberg_Table_1
          • Iceberg_Table_2
      ◦ Container_Y
        ▪︎ Folder_3
      ◦ Container_Z
        ▪︎ Folder_4
    • Datalake_B
      ◦ Container_X
        ▪︎ Folder_1
          • Iceberg_Table_1
    So you can have multiple datalakes (or storage accounts) and each datalake can have one or multiple containers. Each container can be organized with folders. You can see a container as a root-level folder. It is technically different, but for simplicity, it can be abstracted. Just like databases, multiple datalakes could have table name collisions, as in my example above. In this case, there would be 2 Iceberg tables with the same "name":
    Folder_1.Iceberg_Table_1
    But they would have two different Azure URLs (
    abfss://{container_name}@{account_name}.dfs.core.windows.net/{folder}
    ):
    • abfss://Container_X@Datalake_A.dfs.core.windows.net/Folder_1/Iceberg_Table_1
    • abfss://Container_X@Datalake_B.dfs.core.windows.net/Folder_1/Iceberg_Table_1
    My question is how should the Iceberg source deal with this? How does it compare to AWS S3? How would it look for someone using a local filesystem?
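    To make the collision above concrete, here is a rough Python sketch, purely illustrative and not how the Iceberg source currently behaves, of one way an abfss:// table location could be split into a platform instance (storage account + container) and a dot-delimited dataset name so the two tables no longer collide. The helper name and the naming scheme are assumptions for discussion.
    Copy code
    from urllib.parse import urlparse

    def iceberg_location_to_name(location: str) -> tuple:
        """Illustrative helper (an assumption, not part of the source): split an
        abfss:// table location into (platform_instance, dataset_name)."""
        parsed = urlparse(location)
        # netloc looks like {container}@{account}.dfs.core.windows.net
        container, account_host = parsed.netloc.split("@", 1)
        account = account_host.split(".", 1)[0]
        # Disambiguate with "account.container" as the platform instance...
        platform_instance = f"{account}.{container}"
        # ...and keep the folder path as the dot-delimited dataset name.
        dataset_name = ".".join(part for part in parsed.path.split("/") if part)
        return platform_instance, dataset_name

    # The two colliding tables above now differ by platform instance:
    # ("Datalake_A.Container_X", "Folder_1.Iceberg_Table_1")
    # ("Datalake_B.Container_X", "Folder_1.Iceberg_Table_1")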
  • l

    little-megabyte-1074

    02/28/2022, 8:00 PM
    Hey @modern-monitor-81461! Hope you had an amazing weekend 🙂 Quick check-in to see how things are going here. I know you have a draft PR open; is there anything we can do to help you move forward this week?
  • m

    modern-monitor-81461

    03/01/2022, 4:45 AM
    Hi @little-megabyte-1074 I'm currently on vacation for a ski trip, I will get back to you next week ⛷️ 😉
    🎿 1
  • l

    little-megabyte-1074

    03/09/2022, 7:25 PM
    Hey @modern-monitor-81461! Hope you had a great ski trip 😎 Let me know if there's anything we can do on our side to help move Iceberg support forward! Want to make sure we don't lose momentum!
  • m

    modern-monitor-81461

    03/15/2022, 1:20 PM
    @helpful-optician-78938 There is something wrong with the Avro mapping when it comes to logical types. I might be doing something wrong, so I'd like to know if you can take a look. I created a test case that mimics what my source is doing.
    -- The Setup --
    In Iceberg, there is a DecimalType that I am trying to map to a NumberTypeClass. I think this mapping makes sense and it's what I can see in schema_util.py. This map is relying on the Avro logical_type property. The logical map is being used here.
    -- The Problem --
    What I see when my test runs is actual_schema is of type avro.schema.BytesDecimalSchema (class definition here). It is not setting a logicalType property with set_prop(), so when schema_util tries to use it, the returned value is None and the decimal key mapping is never used. If I change the code in schema_util to use the logical_type Python property, everything works. I don't know if I explained it well with all my code references 😅, but here is a simple test case to reproduce the problem:
    Copy code
    import json

    from datahub.ingestion.extractor import schema_util  # import paths assumed from the datahub package layout
    from datahub.metadata.schema_classes import NumberTypeClass, SchemaFieldClass as SchemaField

    def test_avro():
        # Avro schema my Iceberg source generates for an Iceberg DecimalType.
        avro_schema = {'type': 'record', 'name': '__struct_', 'fields': [{'name': 'name', 'type': {'type': 'bytes', 'logicalType': 'decimal', 'precision': 3, 'scale': 2, 'native_data_type': 'decimal(3, 2)', '_nullable': True}}]}
        newfields = schema_util.avro_schema_to_mce_fields(
            json.dumps(avro_schema), default_nullable=True
        )
        assert len(newfields) == 1
        schema_field: SchemaField = newfields[0]
        # Expected a NumberTypeClass, but a BytesTypeClass comes back (the bug).
        assert isinstance(schema_field.type.type, NumberTypeClass)
    In this code, avro_schema is what my Iceberg source is generating for an Iceberg DecimalType. I expect to get a NumberTypeClass as a Datahub type, but I get a BytesTypeClass.
  • h

    helpful-optician-78938

    03/15/2022, 9:34 PM
    Hi @modern-monitor-81461, I'll take a look and get back to you soon.
  • h

    helpful-optician-78938

    03/15/2022, 11:40 PM
    Hi @modern-monitor-81461, thanks for reporting it. This is indeed a bug. Just built a fix and tested it. Will clean it up, add test coverage and raise the PR to OSS soon with the fix. In the meantime, if you want to unblock yourself, you can change this code to
    Copy code
    type=self._converter._get_column_type(
        actual_schema.type,
        (
            getattr(actual_schema, "logical_type", None)
            or actual_schema.props.get("logicalType")
        ),
    ),
  • m

    modern-monitor-81461

    03/15/2022, 11:46 PM
    Thanks for the bug confirmation @helpful-optician-78938. It will probably fix DecimalType, as well as TimeType, Timestamp and TimestampZ types... They all use avro logical types. I have another thing that bugs me regarding field descriptions and Avro. This will be easier to show once I update my PR. I will do that tomorrow morning and let you know.
    👍 1
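    For context, a minimal illustration (my own, mirroring the decimal test case above; the exact logical types the source emits may differ) of the Avro logical types those Iceberg types map to, all of which hit the same code path in schema_util:
    Copy code
    # Illustrative Avro field types using logical types (assumed mappings):
    decimal_field = {"type": "bytes", "logicalType": "decimal", "precision": 3, "scale": 2}  # Iceberg DecimalType
    time_field = {"type": "long", "logicalType": "time-micros"}            # Iceberg TimeType
    timestamp_field = {"type": "long", "logicalType": "timestamp-micros"}  # Iceberg Timestamp / TimestampZ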
  • m

    modern-monitor-81461

    03/16/2022, 5:02 PM
    @mammoth-bear-12532 & all, I need to give you an update on the Iceberg source initiative.
    When I started coding an Iceberg source (started in Amundsen, then transitioned to DataHub), I immediately forked the Iceberg git repo as I needed to add an Azure Datalake connector (only S3 was supported). It's the first thing I did, and while doing it, I realized that this library (Iceberg Python) was rather incomplete and buggy! There were features that simply did not work and I started doubting that it had ever been used... I contributed a few PRs back to fix a number of issues (and even finished incomplete classes filled with TODOs), and while doing this, I got to know a few Iceberg devs and they filled me in on a refactoring initiative. The Python legacy (that's how they call the 1st incomplete implementation) was unfinished work with Netflix (they have their own internal fork) and it simply needed to be re-written. That refactoring started just before I started my work on Amundsen, so there wasn't much to be used. I had to rely on Python legacy to do my work. I managed to fix everything on my code path and now I have something that works. I contributed everything back to Iceberg and they are using my code (and making it better) in the new implementation.
    When I was asked to "finalize" my DataHub PR, I started to write test cases and updated the Python dependencies. That's when I realized that Iceberg Python is not even published on PyPI... They don't have a build for it. I brought it up on their Slack and they said they do not want to publish it since the new API is in the works. I asked when release 0.1 would be available (0.1 contains pretty much what I need for the Iceberg source) and if everything goes as planned, it would be this summer.
    I see 2 options:
    1. I build Iceberg Python legacy and we save a copy into the DataHub git. We use it as an internal/private dependency (that's my setup right now). We use my fork as a repo and we hope not a lot of users will request changes! Then I re-write the source using the new API when 0.1 is released.
    2. We put this integration on ice until 0.1 is released. My org will be using it meanwhile and I will maintain it, but it will not be available to DataHub users. Your roadmap will be delayed...
    The re-write using 0.1 shouldn't be too hard since the logic will remain the same and the APIs are somewhat similar. I would still like to have my PR reviewed by @helpful-optician-78938 since it will improve my current implementation and odds are that I will be able to re-use a good chunk. But I know time is precious, so I totally understand if you would prefer saving Ravindra's time for the final implementation.
  • r

    red-lizard-30438

    04/26/2022, 5:38 AM
    Hi Team, I am looking for an Iceberg solution in DataHub to ingest metadata from Iceberg. I came across this channel, so I wanted to know: do we have a working solution in DataHub? How can we integrate it, and could you please share the integration documentation?
  • b

    big-carpet-38439

    05/02/2022, 3:51 PM
    @modern-monitor-81461 Can you raise a PR with what you have so far? @red-lizard-30438 and folks are interested in trying to extend for S3 🙂
  • m

    modern-monitor-81461

    12/19/2022, 11:43 AM
    Hi all, I am looking at upgrading the Datahub Iceberg source to the new Iceberg Python SDK (so we can get rid of https://github.com/acryldata/py-iceberg). I have just added support for Azure Datalake in the new pyiceberg, so migrating to this new package brings us on par with what we have right now. But it sets the Iceberg source in position to fully support Iceberg and all the catalogs once those are added to pyiceberg (currently being worked on by the Iceberg team). I have refactored the Datahub Iceberg ingest source code, but I'm running into package version issues related to pydantic:
    Copy code
    ERROR: Cannot install acryl-datahub[dev]==0.0.0.dev0 and pyiceberg==0.2.0 because these package versions have conflicting dependencies.
    
    The conflict is caused by:
        acryl-datahub[dev] 0.0.0.dev0 depends on pydantic>=1.5.1
        acryl-datahub[dev] 0.0.0.dev0 depends on pydantic<1.10 and >=1.9.0; extra == "dev"
        acryl-datahub[dev] 0.0.0.dev0 depends on pydantic>=1.5.1; extra == "dev"
        pyiceberg 0.2.0 depends on pydantic==1.10.2
    pyiceberg requires pydantic 1.10.2, but DataHub seems to have a type issue with 1.10+ according to this comment. What is this about? Is it something we can fix? @gray-shoe-75895
    ✅ 1
  • w

    wide-optician-47025

    03/21/2023, 4:04 PM
    hello
  • w

    wide-optician-47025

    03/21/2023, 4:05 PM
    We are implementing an S3 lakehouse on Athena/Spark with Iceberg tables; I would like to be able to ingest the Iceberg tables.
  • w

    wide-optician-47025

    03/21/2023, 4:05 PM
    Currently, when ingesting Athena tables, the metadata is not ingested since we have Iceberg tables.
  • n

    numerous-byte-87938

    04/13/2023, 9:34 PM
    Hi guys, thank you so much for the prior efforts on making the Iceberg source for DataHub! My team is also trying to integrate our Iceberg tables from S3 and want to see if there's anything we could possibly help with here. By reading through the threads and docs, my current understanding is that:
    1. The current Iceberg source is based on the legacy Python code, which is deprecated and limits extension to other data lakes (e.g. S3) and catalogs (e.g. HiveCatalog).
    2. @modern-monitor-81461 has a new PR to switch to the new SDK pyiceberg and remove the limitation, but it's currently blocked by the pyiceberg 0.4.0 release.
    We're very excited to see the new source come alive, with three questions in mind:
    1. How extensible will the new source be? Let's say we added a few table properties in our Iceberg fork and want to pull them in through the Iceberg source, would that be easy to extend?
    2. How backward-compatible will the new source be? While we are trying to upgrade, our current Iceberg version is 0.12.0, would it be compatible with the new source?
    3. While it's super hard to predict an OSS release plan given the review and publish cycles, is it crazy to expect the new source to land somewhere between Q2 and Q3, 2023?
  • d

    dazzling-london-20492

    06/10/2023, 2:00 AM
    Hi team, I wanted to ingest Iceberg built on S3, but the documentation only mentions Azure. Are we going to support S3 in the future?
  • d

    dazzling-london-20492

    06/10/2023, 2:00 AM
    For metadata we use Hive.
  • m

    modern-monitor-81461

    07/04/2023, 7:04 PM
    Hi all, just a quick message to let you know that I'm working on a PR that will modify the current Iceberg source to use the new 0.4.0 pyiceberg library. With this, the Iceberg source will now be able to ingest tables from any Iceberg catalog currently supported by pyiceberg (REST, Hive, Glue and DynamoDB... I also have a JDBCCatalog draft implementation that I am working on with Fokko Driesprong and this should be available in pyiceberg end of summer). It will also introduce support for S3 🎉 and make https://github.com/acryldata/py-iceberg obsolete. Here is the PR: https://github.com/datahub-project/datahub/pull/8357
    🧊 1
    🤗 3
  • l

    lively-appointment-50242

    09/29/2023, 8:20 AM
    Hi everybody. I am quite new to DataHub. I have a task on my plate to ingest Iceberg metadata that is on Azure Data Lake Storage (ADLS): https://datahubproject.io/docs/generated/ingestion/sources/iceberg There is a config example for S3 but not for ADLS:
    Copy code
    source:
      type: "iceberg"
      config:
        env: PROD
        catalog:
          name: my_iceberg_catalog
          type: rest
          # Catalog configuration follows pyiceberg's documentation (https://py.iceberg.apache.org/configuration)
          config:
            uri: http://localhost:8181
            s3.access-key-id: admin
            s3.secret-access-key: password
            s3.region: us-east-1
            warehouse: s3a://warehouse/wh/
            s3.endpoint: http://localhost:9000
        platform_instance: my_iceberg_catalog
        table_pattern:
          allow:
            - marketing.*
        profiling:
          enabled: true
    Can somebody provide an Iceberg config example for ADLS? Thanks in advance
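    Not an official example, but as a starting point: the catalog config in the recipe is passed through to pyiceberg, so ADLS access goes through pyiceberg's adlfs FileIO properties. Below is a minimal Python sketch; the property keys should be verified against https://py.iceberg.apache.org/configuration for your pyiceberg version, and the URI, account name and credentials are placeholders. The same key/value pairs would go under catalog -> config in a DataHub recipe.
    Copy code
    # Rough sketch, assuming pyiceberg's adlfs.* FileIO properties; verify the exact
    # keys against the pyiceberg configuration docs for your version.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "my_iceberg_catalog",
        **{
            "uri": "http://localhost:8181",  # REST catalog endpoint (placeholder)
            "warehouse": "abfss://container@account.dfs.core.windows.net/warehouse",
            "adlfs.account-name": "account",            # ADLS storage account
            "adlfs.account-key": "<storage-account-key>",
            # or service principal credentials instead of an account key:
            # "adlfs.tenant-id": "...", "adlfs.client-id": "...", "adlfs.client-secret": "...",
        },
    )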
  • b

    bulky-shoe-65107

    10/16/2023, 12:39 AM
    has renamed the channel from "integration-iceberg-datahub" to "integrate-iceberg-datahub"
  • v

    victorious-car-1170

    11/22/2023, 12:37 PM
    I'm also facing the same issue as @lively-appointment-50242. I have data in S3 and am using Postgres or MySQL as the backend JDBC catalog, so what should we use for the catalog type in the Iceberg source, and for the URI?
    source:
      type: iceberg
      config:
        env: PROD
        catalog:
          name: my_catalog
          type: 🙄-------------------A BIG QUESTION MARK HERE--------------🙄
          config:
            uri: 😐---------------A BIG QUESTION TOO ON THIS................................😐
            s3.access-key-id: XXXXXXXXXXXX
            s3.secret-access-key: XXXXXXXXXXXXX
            s3.region: us-east-1
            warehouse: 's3a://XXXXXXXXXXXXXXXXX'
            s3.endpoint: 'XXXXXXXXXXXXXX'
    sink:
      type: datahub-rest
      config:
        server: 'http://10.128.3.95:8080'
  • v

    victorious-car-1170

    11/27/2023, 8:54 AM
    Has anyone here faced this kind of issue?
  • a

    acoustic-hospital-48865

    11/29/2023, 10:46 AM
    Hey all, I would like to know if I can push table metadata from Iceberg tables to DataHub. I'm specifically interested in:
    • how many underlying data objects a specific Iceberg table stores
    • how many partitions (and which ones) a specific table has
    • how many manifest files a specific table has
    I'm fine with hacking around - I can already get this data via Trino/Athena/Spark, I just want to push it to the data catalog. But I'm wondering how this can be done on the catalog side - are there some extra metadata fields that can be set per table?
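    One possible approach (a minimal sketch using DataHub's Python emitter, assuming a REST endpoint at localhost:8080 and a table already ingested under the iceberg platform; the URN, counts and property names are illustrative assumptions) is to attach these counts as custom properties on the dataset:
    Copy code
    # Minimal sketch: push Iceberg table statistics as custom properties on a dataset.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    urn = make_dataset_urn(platform="iceberg", name="folder_1.iceberg_table_1", env="PROD")
    props = DatasetPropertiesClass(
        customProperties={
            "data_files": "123",      # e.g. counted via Trino/Athena/Spark metadata tables
            "partitions": "42",
            "manifest_files": "7",
        }
    )
    emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=props))
    Note that emitting datasetProperties this way replaces the whole aspect, so an existing description on the dataset would need to be read back and included, or a patch-style update used instead.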
  • a

    able-pilot-25899

    11/30/2023, 8:11 AM
    Hi team, we have a use case where a user is ingesting an Iceberg table into DataHub. Currently the user manages table column comments via Spark/Trino queries (these get updated in the Iceberg snapshot file).
    1. With ingestion, do we import those comments into DataHub field comments?
    2. Once the user starts improving that comment in the DataHub UI, will the next iteration of ingestion overwrite the comment updated by the user? How are we handling this scenario?
  • v

    victorious-car-1170

    12/01/2023, 9:42 AM
    Hi @lively-appointment-50242, thanks for suggesting this. I have tried it, but I got a different error:
    failures': {'get-catalog': ["Failed to get catalog iceberg: Apache Hive support not installed: pip install 'pyiceberg[hive]'"]},
  • f

    full-alligator-99452

    12/18/2023, 1:57 PM
    Hi everyone! I am trying to establish a connection to Iceberg using the SQL type of catalog. I tried lots of combinations of attributes in the YAML file for Iceberg, but nothing works. Error: “fail to get catalog …: ‘SQL’”. From what I've found searching through Slack, the latest DataHub version uses the v0.4.0 pyiceberg library, is that true? Does that mean the “sql” type of Iceberg catalog is not supported? @lively-appointment-50242 Hi! I can see lots of messages from you related to the DataHub to Iceberg connection. Have you had any success with it?
  • f

    full-alligator-99452

    12/21/2023, 12:30 PM
    @modern-monitor-81461 Hi! It seems like lots of people here are facing the same issue with the integration between DataHub and Apache Iceberg using a SQL or Hive metastore catalog. As I understand it, at the end of the summer the pyiceberg library used by DataHub was updated to 0.4.0, and I guess the connection should then be possible at least with the Hive metastore catalog. Would you be able to let us know if you have any suggestions on how we can resolve this error: “Apache Hive support not installed”? I was able to “pip install pyiceberg[hive]==0.4.0” on the server itself and into the “datahub-actions” docker container, if that makes sense, but the error remains the same.