c

    calm-motorcycle-16283

    1 year ago
    Hi, actually I don't know anything about datahub yet. What documents should I start with? I ran Quickstart and added sample data.
    m

    mammoth-bear-12532

    1 year ago
    Hi @calm-motorcycle-16283, welcome!
    What kinds of metadata are you looking to integrate into Datahub?
    c

    calm-motorcycle-16283

    1 year ago
    Well, the data I am trying to integrate is stored as Parquet files. Is that kind of data supported?
    m

    mammoth-bear-12532

    1 year ago
    Is that stored as Hive tables?
    Or just raw Parquet files?
    c

    calm-motorcycle-16283

    1 year ago
    For now, it is just raw Parquet files. But I can move to Hive tables.
    m

    mammoth-bear-12532

    1 year ago
    Check out the metadata-ingestion scripts here. https://github.com/linkedin/datahub/tree/master/metadata-ingestion
    There is a hive source
    c

    calm-motorcycle-16283

    1 year ago
    I have various data sources stored as Parquet files. I join and clean them daily and save the results as new Parquet files. Can this also be expressed as lineage? I want to manage these through DataHub, but I am not sure how to do it.
    m

    mammoth-bear-12532

    1 year ago
    What orchestration system are you using? Something like airflow?
    c

    calm-motorcycle-16283

    1 year ago
    Sorry for a very basic question. 😅
    Yes I use airflow.
    m

    mammoth-bear-12532

    1 year ago
    Right now, we support coarse-grained lineage, where you have to emit an event connecting your input tables to your output tables.
    If you can emit that, DataHub will store it and reflect it in the UI.
    There isn't an Airflow recipe for this yet, but it should be possible to create one.
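For readers following along, here is a minimal sketch of what "an event connecting your inputs to your output tables" could look like when built in memory. It is only an illustration: the helper names `dataset_urn` and `build_lineage_event`, the table names, the actor, and the exact field layout are assumptions made for this example, not DataHub's confirmed schema, so check the project's own sample metadata before relying on it.

```python
import json
import time


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string (hypothetical helper for this sketch)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_lineage_event(inputs, output, actor="urn:li:corpuser:airflow"):
    """Return a dict linking one output dataset to its upstream inputs.

    The field names below are illustrative assumptions, not a verified schema.
    """
    now_ms = int(time.time() * 1000)  # audit timestamp in milliseconds
    return {
        "dataset": output,
        "upstreamLineage": {
            "upstreams": [
                {
                    "dataset": upstream,
                    "type": "TRANSFORMED",
                    "auditStamp": {"time": now_ms, "actor": actor},
                }
                for upstream in inputs
            ]
        },
    }


# Hypothetical table names standing in for the daily join-and-clean job.
event = build_lineage_event(
    inputs=[dataset_urn("hive", "raw.orders"), dataset_urn("hive", "raw.customers")],
    output=dataset_urn("hive", "clean.daily_orders"),
)
print(json.dumps(event, indent=2))
```

The lineage is coarse-grained in the sense discussed above: it records which datasets feed the output, not which columns.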
    c

    calm-motorcycle-16283

    1 year ago
    If you don't mind, can you explain this part in more detail?
    "If you can emit that, Datahub will store it and reflect it in the ui."
    m

    mammoth-bear-12532

    1 year ago
    So there are three pieces to this: 1. The event that you emit. 2. DataHub stores the lineage in its backend. 3. You can see the lineage in the UI.
    Which of the three do you want to understand more about?
    c

    calm-motorcycle-16283

    1 year ago
    1. The Event that you emit
    specifically: L214
    so the event that you emit would need to include the upstream-lineage(s) of the output dataset that you are creating
    c

    calm-motorcycle-16283

    1 year ago
    Oh good. Thanks. 👍 Quick question: does someone have to write this JSON file manually? Or is it auto-generated from data sources or other stored data?
    m

    mammoth-bear-12532

    1 year ago
    This json file is just an example of some sample metadata that you can quickly ingest into datahub
    if you want to do it programmatically, you would probably create the event in memory and then emit it to datahub (either over Kafka or over http/REST)
    if you want to figure out how to generate the event in Python, you should chat with @gray-shoe-75895 as he has been contributing changes in that area
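As a rough illustration of "create the event in memory and then emit it over HTTP", the sketch below builds a POST request from a Python dict using only the standard library. The endpoint URL, the payload fields, and the `build_request` helper are placeholders assumed for this example; they are not DataHub's documented REST API, and the send call is left commented out.

```python
import json
import urllib.request

# Placeholder endpoint: an assumption for this sketch, not a documented DataHub URL.
INGEST_URL = "http://localhost:8080/ingest-placeholder"


def build_request(event: dict, url: str = INGEST_URL) -> urllib.request.Request:
    """Serialize an in-memory event and wrap it in an HTTP POST request."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative payload; the field names are assumptions, not a verified schema.
event = {
    "dataset": "urn:li:dataset:(urn:li:dataPlatform:hive,clean.daily_orders,PROD)",
    "upstreams": ["urn:li:dataset:(urn:li:dataPlatform:hive,raw.orders,PROD)"],
}

request = build_request(event)
# urllib.request.urlopen(request)  # uncomment to actually send the request
```

In an Airflow setup, a function like this could be called from a task that runs after the daily job has written its output.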
    c

    calm-motorcycle-16283

    1 year ago
    Okay, thanks. I think I am starting to understand, little by little.
    m

    mammoth-bear-12532

    1 year ago
    ๐Ÿ‘ we're here to help!