# all-things-deployment
a
Hi everyone. I want to ingest metadata from Airflow into DataHub. In my Airflow .py file, I set inlets and outlets and configured the connection to DataHub (as guided in https://datahubproject.io/docs/lineage/airflow). The inlets and outlets refer to HDFS and look like outlets={"datasets": [Dataset("hdfs", "/general/project1/folder1/file1.parquet")]}. The problem is that file1's schema doesn't show up in the DataHub UI, and I see the whole path rather than the file alone in the UI. Can anyone tell me what the cause is?
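For context, here is roughly what my DAG looks like (a minimal sketch: the DAG id, task id, and bash command are placeholders). My actual file uses the older {"datasets": [...]} dict form from the Airflow 1.10 docs; the sketch below uses the plain-list form from the current docs:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Import path depends on the plugin version; older installs expose
# `datahub_provider.entities` instead of `datahub_airflow_plugin.entities`.
from datahub_airflow_plugin.entities import Dataset

with DAG(
    dag_id="hdfs_lineage_example",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The DataHub lineage plugin picks up inlets/outlets declared on the task.
    write_parquet = BashOperator(
        task_id="write_parquet",  # placeholder task id
        bash_command="echo 'job that writes file1.parquet'",  # placeholder command
        outlets=[Dataset("hdfs", "/general/project1/folder1/file1.parquet")],
    )
```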
h
Hi @adamant-sugar-28445, what you are seeing is the expected behavior. Configuring inlets and outlets only gives you dataset-level lineage, as described by the shallow Dataset object itself. If you would like to see the schema of the dataset represented by the parquet file, (1) you need one of the tasks in your Airflow DAG to parse the file and emit the schema separately, and (2) the outlet should point to the actual table instead of the parquet file.
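To make (1) a bit more concrete, here is a rough sketch of what such a task could do: read the schema from the parquet footer with pyarrow and push a SchemaMetadata aspect through the DataHub REST emitter. Treat the details as assumptions you'd adapt: the GMS URL, the dataset name the URN points at, and the naive everything-is-a-string type mapping.
```python
import pyarrow.parquet as pq

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

# Assumptions: the parquet file is readable from wherever this task runs
# (e.g. a mounted path or a local copy pulled out of HDFS), and DataHub
# GMS is reachable at this URL.
PARQUET_PATH = "/general/project1/folder1/file1.parquet"
GMS_URL = "http://localhost:8080"

# Read only the schema from the parquet footer.
arrow_schema = pq.read_schema(PARQUET_PATH)

# Map every column to a generic string type for simplicity; a real task
# would translate pyarrow types to the matching DataHub type classes.
fields = [
    SchemaFieldClass(
        fieldPath=field.name,
        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
        nativeDataType=str(field.type),
    )
    for field in arrow_schema
]

# Point the URN at the logical table/dataset rather than the single file.
dataset_urn = make_dataset_urn(platform="hdfs", name="general.project1.folder1", env="PROD")

schema_aspect = SchemaMetadataClass(
    schemaName="file1",
    platform="urn:li:dataPlatform:hdfs",
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=str(arrow_schema)),
    fields=fields,
)

DatahubRestEmitter(gms_server=GMS_URL).emit(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema_aspect)
)
```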
a
Thanks @helpful-optician-78938. Can you give me (a novice) a working guide on how to do this? There's scant how-to documentation about the particulars.
h
Hi @adamant-sugar-28445, I am not sure if we support parsing parquet schemas yet. Let me loop in our expert. @mammoth-bear-12532, could you look into Godel's use-case above?
m
Hi @adamant-sugar-28445: usually folks catalog their structured data using Hive (for HDFS-based lakes) and Glue (for S3-based lakes), then connect DataHub to Hive/Glue to ingest the structure. Would that not work for you?
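If you go that route, the ingestion side is mostly just pointing DataHub's Hive source at your metastore. Here is a minimal sketch using the programmatic pipeline; the host, database, and GMS server URL are placeholders to replace with your own:
```python
from datahub.ingestion.run.pipeline import Pipeline

# Minimal sketch: pull table/schema metadata from Hive and push it to DataHub.
# Requires the hive plugin, e.g. pip install 'acryl-datahub[hive]'.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive-server:10000",  # placeholder HiveServer2 address
                "database": "project1",            # placeholder database
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS URL
        },
    }
)

pipeline.run()
pipeline.raise_from_status()
```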
a
Thanks a lot for the suggestion, @mammoth-bear-12532. I haven't tried that, but I will take the first steps toward implementing it. It's quite a challenge for me, though, to learn HDFS and do all the integration across servers.