# all-things-deployment
a
Hi everyone. I want to ingest metadata from Airflow into DataHub. In my Airflow .py file, I set inlets and outlets and configured the connection to DataHub (as guided in https://datahubproject.io/docs/lineage/airflow). The inlets and outlets refer to HDFS and look like outlets={"datasets": [Dataset("hdfs", "/general/project1/folder1/file1.parquet")]}. The problem is that file1's schema doesn't show up in the DataHub UI, and I see the whole path rather than the file alone in the UI. Can anyone tell me what the cause is?
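For context, here is roughly what my DAG looks like (a minimal sketch: the DAG id, task id, and bash command are placeholders). My actual file uses the older {"datasets": [...]} dict form from the Airflow 1.10 docs; the sketch below uses the plain-list form from the current docs:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Import path depends on the plugin version; older installs expose
# `datahub_provider.entities` instead of `datahub_airflow_plugin.entities`.
from datahub_airflow_plugin.entities import Dataset

with DAG(
    dag_id="hdfs_lineage_example",  # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The DataHub lineage plugin picks up inlets/outlets declared on the task.
    write_parquet = BashOperator(
        task_id="write_parquet",  # placeholder task id
        bash_command="echo 'job that writes file1.parquet'",  # placeholder command
        outlets=[Dataset("hdfs", "/general/project1/folder1/file1.parquet")],
    )
```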
h
Hi @adamant-sugar-28445, what you are seeing is the expected behavior. Configuring inlets and outlets only gives you dataset-level lineage, as described by the shallow Dataset object itself. If you would like to see the schema of the dataset represented by the parquet file, (1) you need one of the tasks in your Airflow DAG to parse the file and emit the schema separately, and (2) the outlet should point to the actual table instead of the parquet file.
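To make (1) a bit more concrete, here is a rough sketch of what such a task could do: read the schema from the parquet footer with pyarrow and push a SchemaMetadata aspect through the DataHub REST emitter. Treat the details as assumptions you'd adapt: the GMS URL, the dataset name the URN points at, and the naive everything-is-a-string type mapping.
```python
import pyarrow.parquet as pq

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

# Assumptions: the parquet file is readable from wherever this task runs
# (e.g. a mounted path or a local copy pulled out of HDFS), and DataHub
# GMS is reachable at this URL.
PARQUET_PATH = "/general/project1/folder1/file1.parquet"
GMS_URL = "http://localhost:8080"

# Read only the schema from the parquet footer.
arrow_schema = pq.read_schema(PARQUET_PATH)

# Map every column to a generic string type for simplicity; a real task
# would translate pyarrow types to the matching DataHub type classes.
fields = [
    SchemaFieldClass(
        fieldPath=field.name,
        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
        nativeDataType=str(field.type),
    )
    for field in arrow_schema
]

# Point the URN at the logical table/dataset rather than the single file.
dataset_urn = make_dataset_urn(platform="hdfs", name="general.project1.folder1", env="PROD")

schema_aspect = SchemaMetadataClass(
    schemaName="file1",
    platform="urn:li:dataPlatform:hdfs",
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=str(arrow_schema)),
    fields=fields,
)

DatahubRestEmitter(gms_server=GMS_URL).emit(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema_aspect)
)
```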
a
Thanks @helpful-optician-78938. Can you give me (a novice) a working guide on how to do this? There's scant how-to documentation about the particulars.
h
Hi @adamant-sugar-28445, I am not sure if we support parsing parquet schemas yet. Let me loop in our expert. @mammoth-bear-12532, could you look into Godel's use-case above?
m
Hi @adamant-sugar-28445: usually folks catalog their structured data using Hive (for HDFS-based lakes) and Glue (for S3-based lakes), then connect DataHub to Hive/Glue to ingest the structure. Would that not work for you?
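If you go that route, the ingestion side is mostly just pointing DataHub's Hive source at your metastore. Here is a minimal sketch using the programmatic pipeline; the host, database, and GMS server URL are placeholders to replace with your own:
```python
from datahub.ingestion.run.pipeline import Pipeline

# Minimal sketch: pull table/schema metadata from Hive and push it to DataHub.
# Requires the hive plugin, e.g. pip install 'acryl-datahub[hive]'.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive-server:10000",  # placeholder HiveServer2 address
                "database": "project1",            # placeholder database
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS URL
        },
    }
)

pipeline.run()
pipeline.raise_from_status()
```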
a
Thanks a lot for the suggestion, @mammoth-bear-12532. I haven't tried that, but I will take the first steps toward implementing it. It's quite a challenge for me, though, to learn HDFS and do all the integration across servers.