# getting-started
c
Hi, actually I don't know anything about DataHub yet. Which documents should I start with? I ran the Quickstart and added the sample data.
m
Hi @calm-motorcycle-16283, welcome!
What kinds of metadata are you looking to integrate into DataHub?
c
Well, actually, the data I'm trying to integrate is made up of Parquet files. Is that kind of data supported?
m
Is that stored as Hive tables?
Or just raw Parquet files?
c
For now, it is just raw Parquet files. But I can move to Hive tables.
m
Check out the metadata-ingestion scripts here: https://github.com/linkedin/datahub/tree/master/metadata-ingestion
There is a Hive source.
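If you do move to Hive tables, here is a minimal sketch of kicking off that source programmatically; it assumes the acryl-datahub package's Pipeline API, and the host/database/server values are placeholders for your setup:
```python
# Minimal sketch: run the Hive ingestion source programmatically.
# Assumptions: acryl-datahub's Pipeline API; connection values are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {"host_port": "localhost:10000", "database": "default"},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # surface any errors the run reported
```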
c
I have various data sources stored as Parquet files; I join and clean them daily and save the results as new Parquet files. Can this also be expressed as lineage? I want to manage these through DataHub, but I am not sure how to do it.
m
What orchestration system are you using? Something like airflow?
c
Sorry for a very basic question. 😅
Yes, I use Airflow.
m
Right now, we support coarse-grained lineage, where you have to emit an event connecting your inputs to your output tables.
If you can emit that, DataHub will store it and reflect it in the UI.
There isn't an Airflow recipe for this yet, but it should be possible to create one.
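To make that concrete, here is a minimal sketch of emitting one such event; it assumes the acryl-datahub Python package (make_lineage_mce, DatahubRestEmitter), and the platform/dataset names are made up for illustration:
```python
# Minimal sketch: emit a coarse-grained lineage event connecting inputs to an output.
# Assumptions: acryl-datahub package; platform/dataset names below are made up.
from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
from datahub.emitter.rest_emitter import DatahubRestEmitter

def emit_lineage() -> None:
    # Upstreams: the raw parquet inputs of the daily job.
    upstream_urns = [
        make_dataset_urn(platform="hdfs", name="raw/orders", env="PROD"),
        make_dataset_urn(platform="hdfs", name="raw/customers", env="PROD"),
    ]
    # Downstream: the joined/cleaned parquet output.
    downstream_urn = make_dataset_urn(
        platform="hdfs", name="clean/orders_enriched", env="PROD"
    )

    # One event connecting the inputs to the output...
    mce = make_lineage_mce(upstream_urns, downstream_urn)
    # ...pushed to the DataHub backend over REST.
    DatahubRestEmitter(gms_server="http://localhost:8080").emit(mce)

# In Airflow, this could run as a PythonOperator callable right after the daily join/clean task.
```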
c
If you don't mind, can you explain in more detail about this part?
```
If you can emit that, Datahub will store it and reflect it in the ui.
```
m
So there are three pieces to this:
1. The event that you emit
2. DataHub stores the lineage in its backend
3. You can see the lineage in the UI
Which of the three do you want to understand more about?
c
1. The Event that you emit
m
specifically: L214
so the event that you emit would need to include the upstream-lineage(s) of the output dataset that you are creating
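For reference, the lineage portion of such an event is an UpstreamLineage aspect inside a DatasetSnapshot; it has roughly this shape (shown here as a Python dict, with placeholder URNs rather than the values from the sample file):
```python
# Rough shape of a lineage MCE, mirroring the sample JSON file.
# All URNs/names here are placeholders, not values from the sample.
lineage_event = {
    "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
            "urn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,clean/orders_enriched,PROD)",
            "aspects": [
                {
                    "com.linkedin.pegasus2avro.dataset.UpstreamLineage": {
                        "upstreams": [
                            {
                                "auditStamp": {"time": 0, "actor": "urn:li:corpuser:etl"},
                                "dataset": "urn:li:dataset:(urn:li:dataPlatform:hdfs,raw/orders,PROD)",
                                "type": "TRANSFORMED",
                            }
                        ]
                    }
                }
            ],
        }
    }
}
```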
c
Oh good. Thanks. 👍 Quick question: should someone make this JSON file manually? Or is it auto-generated from data sources or other stored data?
m
This JSON file is just an example of some sample metadata that you can quickly ingest into DataHub.
If you want to do it programmatically, you would probably create the event in memory and then emit it to DataHub (either over Kafka or over HTTP/REST).
If you want to figure out how to generate the event in Python, you should chat with @gray-shoe-75895, as he has been contributing changes in that area.
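For the Kafka path, here is a hedged sketch using the acryl-datahub Kafka emitter; the broker and schema-registry addresses below are the quickstart defaults, so treat them as assumptions:
```python
# Sketch: emit the same in-memory event over Kafka instead of REST.
# Assumptions: acryl-datahub's DatahubKafkaEmitter; addresses are quickstart defaults.
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "localhost:9092",
                "schema_registry_url": "http://localhost:8081",
            }
        }
    )
)
emitter.emit(mce)  # mce built in memory, e.g. via make_lineage_mce above
emitter.flush()    # Kafka emission is asynchronous; flush before exiting
```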
c
Okay, thanks. I think I'm understanding it little by little now.
m
๐Ÿ‘ we're here to help!