# advice-metadata-modeling
Hi team, for DataHub Spark lineage versions 0.8.23 and 0.8.24 we are receiving a NullPointerException from the DatahubSparkListener class. We are working with Spark 2.4.0, Scala 2.11.12, and Python 2.7.5. Can you please help?
Can you please post the full stack trace in this thread?
Cc @careful-pilot-86309
Please refer to the image for the stack trace. Even a PySpark word count is throwing the same exception for me
@careful-pilot-86309 @loud-island-88694 refer above
@loud-musician-49912 Can you please share the complete setup (whether you are using spark-submit with Python scripts or a Jupyter notebook, sample code, etc.)? Also, are you using RDDs for the word count? Please note that RDD operations are not yet supported; they will come in a future release.
@careful-pilot-86309 I am using spark-submit with PySpark. Spark SQL is also not working for me. I will share the code soon
```shell
spark-submit --num-executors 5 --executor-cores 5 --executor-memory 4g --driver-memory 2g --jars /tmp/datahub-spark-lineage-0.8.24.jar emp.py
```
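For reference, the same listener settings can also be passed on the spark-submit command line instead of being hard-coded in the script, which makes it easier to toggle lineage per job. A hedged sketch (reusing the jar path and `<hostname>` placeholder from this thread; adjust for your cluster):

```shell
spark-submit \
  --num-executors 5 --executor-cores 5 \
  --executor-memory 4g --driver-memory 2g \
  --jars /tmp/datahub-spark-lineage-0.8.24.jar \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://<hostname>:8080" \
  emp.py
```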
```python
import time
import sys
import subprocess

# Getting start time of the job
start_time = time.time()

# Importing SparkSession to run queries, with the DataHub lineage listener configured
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('abc')
         .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.24")
         .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
         .config("spark.datahub.rest.server", "http://<hostname>:8080")
         .enableHiveSupport()
         .getOrCreate())

query = spark.sql("select * from default.emp_test")
query.write.mode("overwrite").csv("hdfs://nameservice1/tmp/outputemp/")
```
@careful-pilot-86309 the above is the code we are calling
emp.py is invoked from emp.sh
Thanks a lot. I will check and get back in some time
I tried exactly the same setup on my side and could not reproduce the issue
Is it possible to have a call and check the issue on your setup?
I will get back to you on this tomorrow. Thanks
@careful-pilot-86309 we can set up a meeting today at 7 pm IST if that is fine with you, or else suggest any other timing comfortable for you. Can you send the meeting invite to aikansh.manchanda@airtel.com?
7pm is good for me. Will send out invite
I received the same. Will join
@loud-island-88694 we got on a call and were able to resolve the issue with spark-lineage. But we have an issue with the UI. Lineages are being sent to the DataHub server successfully and we can fetch them using curl, but they are not visible in the UI. DataJobs can be viewed but not pipelines. Can someone from the UI team take a look?
Thanks @careful-pilot-86309 for getting it resolved. @loud-island-88694 please assign someone to debug this from the UI end if possible
@loud-island-88694 Pipelines doesn't show Spark; however, under Platforms, in Spark, we see jobs executed successfully
Shouldn't Pipelines show Spark?
@loud-musician-49912 can you try the below to fix the UI issue? https://datahubproject.io/docs/how/restore-indices/
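In case the linked page is not handy: around this version the restore goes through the datahub-upgrade job. A hedged sketch for a docker-compose deployment (the script path is from the datahub repo; your deployment may differ, so follow the linked doc for the authoritative steps):

```shell
# Run the datahub-upgrade job with the RestoreIndices task.
# It re-reads aspects from the SQL store and rebuilds the search/graph
# indices that the UI queries, which is why entities can exist in the
# database (visible via curl) but not render in the UI.
./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices
```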
@careful-pilot-86309 will try and let you know
We have tried restoring indices but it fails. We have also tried removing the custom model by deleting its directory from the plugins directory, but we are still getting this error. How can we delete this aspect from the database?
Even after restoring the directory, the CLI is unable to delete the entity "airtel_dq:0.0.1".
```shell
$ datahub delete --registry-id "airtel_dq:0.0.1" --hard
This will permanently delete data from DataHub. Do you want to continue? [y/N]: y
No entities found. Payload used: {"registryId": "airtel_dq:0.0.1", "dryRun": false}
Took 32.076 seconds to hard delete 0 rows for 0 entities
```
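Since the registry-id filter matched zero entities, one thing worth trying is deleting by URN instead; `datahub delete --urn` is part of the same CLI. A hedged sketch (the URN below is a hypothetical example; substitute the actual entity carrying the custom aspect):

```shell
# Hard-delete a single entity by its URN rather than by registry id
# (the dataset URN here is a made-up illustration, not from this thread)
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:hive,default.emp_test,PROD)" --hard
```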