# troubleshoot
n
Hi Team, I am working on dataset-to-dataset lineage. It supports one dataset as input and one as output, but I have 2 inputs and 1 output. How do I handle this? Please help me out here.
@hundreds-photographer-13496 @dazzling-judge-80093 Can you guys help me?
h
n
But this is only for BigQuery, right?
h
It's supported here as well. `inputDatasets` and `inputDatajobs` are both lists, so they support multiple URNs (and therefore multiple upstreams).
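For reference, here is a minimal sketch of emitting dataset-to-dataset lineage with two upstreams via the `UpstreamLineage` aspect and the Python REST emitter. The platform name, table names, and server URL are placeholders, and exact imports may vary slightly by acryl-datahub version:
```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Two upstream (input) datasets and one downstream (output) dataset.
upstream_urns = [
    builder.make_dataset_urn("hive", "db.upstream_table_1"),  # placeholder
    builder.make_dataset_urn("hive", "db.upstream_table_2"),  # placeholder
]
downstream_urn = builder.make_dataset_urn("hive", "db.downstream_table")

# `upstreams` is a list, so any number of input datasets is supported.
upstream_lineage = UpstreamLineageClass(
    upstreams=[
        UpstreamClass(dataset=urn, type=DatasetLineageTypeClass.TRANSFORMED)
        for urn in upstream_urns
    ]
)

lineage_mcp = MetadataChangeProposalWrapper(
    entityUrn=downstream_urn,
    aspect=upstream_lineage,
)

emitter = DatahubRestEmitter("http://localhost:8080")  # placeholder URL
emitter.emit_mcp(lineage_mcp)
```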
n
Ok, thanks, I will try it.
@hundreds-photographer-13496 Wanted to ask one thing: I have DataHub deployed on 2 environments, so I have 2 sinks. Can I add lineage to both sinks at the same time using a single piece of code?
h
Hey, you'll have to emit events to each sink once, i.e. repeat these lines for each sink, each time with a different GMS server URL.
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Create an emitter to the GMS REST API.
emitter = DatahubRestEmitter("http://localhost:8080")

# Emit metadata! (datajob_input_output_mcp is the MCP built earlier)
emitter.emit_mcp(datajob_input_output_mcp)
```
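For your two-environment case, that amounts to a simple loop. A minimal sketch, assuming the server URLs are placeholders and `datajob_input_output_mcp` is the MCP from the earlier snippet:
```python
from datahub.emitter.rest_emitter import DatahubRestEmitter

# One GMS endpoint per environment (placeholder URLs).
gms_servers = ["http://datahub-env1:8080", "http://datahub-env2:8080"]

# Emit the same lineage event to each deployment.
for gms_url in gms_servers:
    emitter = DatahubRestEmitter(gms_url)
    emitter.emit_mcp(datajob_input_output_mcp)
```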
n
So basically I will have to write to 2 sinks? @hundreds-photographer-13496
h
yes
n
@hundreds-photographer-13496 If I enable the data profiling option, will I be able to view all of it in some dashboard?
And what is the difference between the normal data profiling that we use in a recipe and the one using Great Expectations?
h
The normal data profiling enabled from the recipe typically uses the Great Expectations library itself under the hood for all SQL sources. I am not sure what you mean by the latter.
n
I thought the Great Expectations library and the one used in the recipe were different.
@hundreds-photographer-13496 I have 2 questions: 1. After enabling the profiling option, the pipelines are taking too much time to execute; is there any way to reduce this time and still do profiling? 2. How do I do profiling for NoSQL data sources like MongoDB and Kafka?
h
1. Which source are you using? You can set `turn_off_expensive_profiling_metrics: True` to disable some expensive profiles (quantiles, etc.). If you are using the BigQuery or Snowflake source, you can also disable profiling for large tables using the `profile_table_row_limit` and `profile_table_size_limit` configurations. These sources also support a smart profiling mode that skips tables that haven't been updated since the last profiling run, using `store_last_profiling_timestamps` with `stateful_ingestion` enabled. Refer to the "Config Details" section of the BigQuery source docs, https://datahubproject.io/docs/generated/ingestion/sources/bigquery/, to learn more about these configurations. 2. This is not possible at the moment. It would be great if you would like to contribute this support.
n
[screenshot: pipeline run logs]
@hundreds-photographer-13496 I am using the Hive source for profiling. The pipeline took 14.35 hrs to execute, which is insane. The screenshot above has a snippet of the logs.
h
Got it. Are there any additional logs? Ideally there are also some logs in the format "Finished profiling <table>; took <sec> seconds". These can help identify whether any particular tables take a large amount of time. I would recommend using the `turn_off_expensive_profiling_metrics` config option to reduce profiling time for now. If you have any suggestions w.r.t. improving profiling times for Hive, do let us know.
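If you have the raw log file, a quick sketch like the following can rank tables by profiling time. The log path is a placeholder, and it assumes the per-table log lines match the format quoted above:
```python
import re

# Matches lines like: "Finished profiling db.table; took 123.4 seconds"
pattern = re.compile(r"Finished profiling (\S+); took (\d+(?:\.\d+)?) seconds")

timings = []
with open("ingestion.log") as f:  # placeholder path
    for line in f:
        match = pattern.search(line)
        if match:
            timings.append((float(match.group(2)), match.group(1)))

# Print the slowest tables first.
for seconds, table in sorted(timings, reverse=True):
    print(f"{seconds:>10.1f}s  {table}")
```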