# troubleshoot
n
Hi Team, I am working on dataset-to-dataset lineage. It supports one dataset as input and one as output, but I have 2 inputs and 1 output. How do I handle this? Please help me out here.
@hundreds-photographer-13496 @dazzling-judge-80093 Can you guys help me?
h
n
But this is only for BigQuery, right?
h
It's supported here as well. `inputDatasets` and `inputDatajobs` are both lists, so they support multiple URNs (and therefore multiple upstreams).
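For reference, here is a minimal sketch of emitting dataset-to-dataset lineage with two upstreams via the `UpstreamLineage` aspect and the Python REST emitter. The platform name, table names, and server URL are placeholders, and exact imports may vary slightly by acryl-datahub version:
```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Two upstream (input) datasets and one downstream (output) dataset.
upstream_urns = [
    builder.make_dataset_urn("hive", "db.upstream_table_1"),  # placeholder
    builder.make_dataset_urn("hive", "db.upstream_table_2"),  # placeholder
]
downstream_urn = builder.make_dataset_urn("hive", "db.downstream_table")

# `upstreams` is a list, so any number of input datasets is supported.
upstream_lineage = UpstreamLineageClass(
    upstreams=[
        UpstreamClass(dataset=urn, type=DatasetLineageTypeClass.TRANSFORMED)
        for urn in upstream_urns
    ]
)

lineage_mcp = MetadataChangeProposalWrapper(
    entityUrn=downstream_urn,
    aspect=upstream_lineage,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder URL
emitter.emit_mcp(lineage_mcp)
```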
n
Ok, thanks, I will try it.
@hundreds-photographer-13496 Wanted to ask one thing: I have DataHub deployed on 2 environments, so I have 2 sinks. Can I add lineage to both sinks at the same time using a single piece of code?
h
Hey, you'll have to emit events to each sink once, i.e. repeat these lines for each sink, each time with a different GMS server URL.
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata! (datajob_input_output_mcp is the MCP built earlier)
emitter.emit_mcp(datajob_input_output_mcp)
```
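For your two-environment case, that amounts to a simple loop. A minimal sketch, assuming the server URLs are placeholders and `datajob_input_output_mcp` is the MCP from the earlier snippet:
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# One GMS endpoint per environment (placeholder URLs).
gms_servers = ["http://datahub-env1:8080", "http://datahub-env2:8080"]

# Emit the same lineage event to each deployment.
for gms_url in gms_servers:
    emitter = DatahubRestEmitter(gms_url)
    emitter.emit_mcp(datajob_input_output_mcp)
```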
n
So basically I will have to write to 2 sinks? @hundreds-photographer-13496
h
yes
n
@hundreds-photographer-13496 If I enable the data profiling option, will I be able to view all of it in some dashboard?
And what is the difference between the normal data profiling that we use in a recipe and the one using Great Expectations?
h
The normal data profiling enabled from the recipe typically uses the Great Expectations library itself under the hood for all SQL sources. I am not sure what you mean by the latter.
n
I thought the Great Expectations library and the one used in the recipe were different.
@hundreds-photographer-13496 I have 2 questions: 1. After enabling the profiling option, the pipelines are taking too much time to execute; is there any way to reduce this time and still do profiling? 2. How do I do profiling for NoSQL data sources like MongoDB and Kafka?
h
1. Which source are you using? You can set `turn_off_expensive_profiling_metrics: True` to disable some expensive profiles (quantiles, etc.). If you are using the BigQuery or Snowflake source, you can also disable profiling for large tables using the `profile_table_row_limit` and `profile_table_size_limit` configurations. These sources also support a smart profiling mode that skips tables that haven't been updated since the last profiling run, using `store_last_profiling_timestamps` with `stateful_ingestion` enabled. Refer to the "Config Details" section of the BigQuery source docs, https://datahubproject.io/docs/generated/ingestion/sources/bigquery/, to learn more about these configurations. 2. This is not possible at the moment. It would be great if you would like to contribute this support.
n
[screenshot: pipeline run logs]
@hundreds-photographer-13496 I am using the Hive source for profiling. The pipeline took 14.35 hrs to execute, which is insane. The screenshot above has a snippet of the logs.
h
Got it. Are there any additional logs? Ideally there are also some logs in the format "Finished profiling <table>; took <sec> seconds". These can help identify whether any particular tables take a large amount of time. I would recommend using the `turn_off_expensive_profiling_metrics` config option to reduce profiling time for now. If you have any suggestions w.r.t. improving profiling times for Hive, do let us know.
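If you have the raw log file, a quick sketch like the following can rank tables by profiling time. The log path is a placeholder, and it assumes the per-table log lines match the format quoted above:
```python
import re

# Matches lines like: "Finished profiling db.table; took 123.4 seconds"
pattern = re.compile(r"Finished profiling (\S+); took (\d+(?:\.\d+)?) seconds")

timings = []
with open("ingestion.log") as f:  # placeholder path
    for line in f:
        match = pattern.search(line)
        if match:
            timings.append((float(match.group(2)), match.group(1)))

# Print the slowest tables first.
for seconds, table in sorted(timings, reverse=True):
    print(f"{seconds:>10.1f}s  {table}")
```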