Hello We integrated datahub in our spark job scheduled with DataHub #troubleshoot

salmon-manchester-60485

04/11/2022, 2:28 PM

Hello, we integrated DataHub into our Spark job (scheduled with Airflow), which reads data from our S3 bucket and writes it to a SQL database. At the end of the Spark job, the job blocks after receiving the MetadataWriteResponse. The Spark job correctly loaded the data into the SQL table and the metadata is OK on DataHub, but the job does not end and it fails after a timeout.

[2022-04-05, 16:33:21 ] {spark_submit.py:488} INFO - 22/04/05 14:33:21 INFO McpEmitter: MetadataWriteResponse(success=true, responseContent={"value":"urn:li:dataJob:(urn:li:dataFlow:(spark,ANACOUNTERPARTY,local[*]),QueryExecId_6)"}, underlyingResponse=HTTP/1.1 200 OK [Content-Length: 91, Content-Type: application/json, Date: Tue, 05 Apr 2022 14:33:21 GMT, Server: nginx/1.21.6, X-Restli-Protocol-Version: 2.0.0] [Content-Length: 91,Chunked: false])
[2022-04-05, 16:38:05 ] {timeout.py:36} ERROR - Process timed out, PID: 662
[2022-04-05, 16:38:05 ] {spark_submit.py:623} INFO - Sending kill signal to spark-submit

Any idea how to solve this issue? We opened a ticket here: https://github.com/datahub-project/datahub/issues/4583

early-lamp-41924

04/11/2022, 4:39 PM

Hey! Do you know if your Spark cluster has access to the DataHub endpoints? Is there a way to test this? For example, a simple curl request with the token, to make sure there are no network issues here.
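A connectivity check along the lines suggested above could be sketched in Python and run from inside the Spark container. This is only a sketch under assumptions: the `/config` path on the DataHub GMS server, the example URL, and the function names are illustrative and not taken from this thread, so adjust them to your deployment.

```python
import urllib.error
import urllib.request
from typing import Tuple


def build_config_request(gms_url: str, token: str) -> urllib.request.Request:
    """Build a GET request carrying the access token as a Bearer header.

    Assumption: the GMS server answers on `<gms_url>/config`.
    """
    url = gms_url.rstrip("/") + "/config"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def check_datahub_reachable(gms_url: str, token: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Return (reachable, detail) for the given endpoint."""
    request = build_config_request(gms_url, token)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, so the network path works; the status code
        # may still point at an authentication or routing problem.
        return True, f"HTTP {err.code}"
    except OSError as err:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"{type(err).__name__}: {err}"


# Hypothetical usage (URL and token are placeholders):
# reachable, detail = check_datahub_reachable("http://datahub-gms:8080", "<token>")
# print(reachable, detail)
```

A result of `(True, "HTTP 401")` would suggest the network is fine but the token is rejected, while `(False, ...)` points at a network-level problem.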

salmon-manchester-60485

04/12/2022, 11:36 AM

We will have a look, but the pipeline (with input/output) appears in the DataHub UI.

careful-pilot-86309

04/18/2022, 9:02 AM

@salmon-manchester-60485 Which platform are you using to run Spark jobs? Have you called spark.stop() at the end of your code?

salmon-manchester-60485

04/19/2022, 8:23 AM

We are using a Docker image. No, we don't call spark.stop() at the end (it was not needed before we started using DataHub).

careful-pilot-86309

04/19/2022, 9:17 AM

Can you try putting it at the end?
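One way to make sure `spark.stop()` always runs at the end is to wrap the job body in `try`/`finally`. The sketch below assumes a PySpark job; the thread does not say which language the job is written in, and `run_job_and_stop` and `load_s3_to_sql` are hypothetical names used only for illustration.

```python
from typing import Any, Callable


def run_job_and_stop(spark: Any, job: Callable[[Any], None]) -> None:
    """Run `job(spark)` and always call `spark.stop()` afterwards.

    `spark` is expected to be a SparkSession (or any object with a
    `stop()` method); `job` is the function holding the actual work.
    """
    try:
        job(spark)
    finally:
        # Stop the session even if the job raised, so the driver
        # process can shut down instead of waiting for a timeout.
        spark.stop()


# Hypothetical usage with PySpark:
# from pyspark.sql import SparkSession
#
# def load_s3_to_sql(spark):
#     ...  # read from S3, write to the SQL database
#
# spark = SparkSession.builder.appName("ANACOUNTERPARTY").getOrCreate()
# run_job_and_stop(spark, load_s3_to_sql)
```

Any exception raised by the job still propagates, so the Airflow task fails as before when the job itself fails.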

salmon-manchester-60485

04/19/2022, 1:35 PM

The Airflow task is green (so it is considered succeeded), but I checked the Airflow log and found this:

Exception in thread "datahub-emit-pool" java.lang.NullPointerException

at datahub.spark.model.LineageUtils.parseSparkConfig(LineageUtils.java:81)

careful-pilot-86309

04/19/2022, 1:44 PM

OK. Can you get a detailed exception, e.g. the full stack trace if one is available? Full debug-level logs would be helpful.
