stale-architect-93411
05/17/2023, 12:58 PM
NullPointerException. On #integration-databricks-datahub I found a specific jar for Databricks: https://datahubspace.slack.com/archives/C033H1QJ28Y/p1646937756282179 but it's more than a year old.
With this jar my workflow succeeds and the data shows up in DataHub, but the lineage is empty: there are no references to input/output data (more info in thread).
Has anyone managed to get lineage with Spark on Databricks? 😕

stale-architect-93411
05/17/2023, 12:58 PM
some_input = spark.read.format("delta").load("s3://some-input-bucket/some-input-data/")
def build_stats(some_input):
    return some_input.groupBy("some_col").count()

final_df = build_stats(some_input)
final_df.write.mode("overwrite").parquet("s3://some-output-bucket/some-output-location")
With a very basic Spark conf:
spark.datahub.databricks.cluster databricks-poc-datahub
spark.datahub.rest.server http:/...:8080
spark.extraListeners datahub.spark.DatahubSparkListener
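For comparison, a fuller cluster-level conf sketch for the DataHub Spark listener. The artifact coordinates and version below are assumptions (the jar name and available properties have changed over releases), so they should be checked against the current DataHub Spark lineage docs rather than copied as-is:

```
spark.jars.packages              io.acryl:datahub-spark-lineage:<version>
spark.extraListeners             datahub.spark.DatahubSparkListener
spark.datahub.rest.server        http://<gms-host>:8080
spark.datahub.databricks.cluster databricks-poc-datahub
```

On Databricks the jar can also be attached as a cluster library instead of via `spark.jars.packages`; the listener and `spark.datahub.*` properties then go into the cluster's Spark config.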
stale-architect-93411
05/17/2023, 12:58 PM

dazzling-judge-80093
05/18/2023, 8:49 PM

stale-architect-93411
05/22/2023, 7:36 AM
23/05/17 12:46:11 INFO McpEmitter: MetadataWriteResponse(success=true, responseContent={"value":"urn:li:dataJob:(urn:li:dataFlow:(spark,databricks-poc-datahub_app-20230517124405-0000,spark_10.18.65.237_7077),QueryExecId_1)"}, underlyingResponse=HTTP/1.1
200 OK [Date: Wed, 17 May 2023 12:46:11 GMT, Content-Type: application/json, X-RestLi-Protocol-Version: 2.0.0, Content-Length: 137, Server: Jetty(9.4.46.v20220331)] [Content-Length: 137,Chunked: false])
23/05/17 12:46:11 INFO SQLAppStatusListener: Recording cache-related metrics in usage logs:
ioCacheWorkersDiskUsage={"10.18.65.94":{"diskUsage":184688,"lifetime":117384}}, ioCacheNumScanTasks={"numLocalScanTasks":6,"numNonLocalScanTasks":0}, ioCacheDiskUsageLimit=63350767616
23/05/17 12:46:11 INFO SQLAppStatusListener: Recording DV cache-related metrics in usage logs: {"10.18.65.94":{"memCacheUsage":0,"diskCacheUsage":0,"lifetimeMs":117385}}
23/05/17 12:46:11 INFO DriverConf: Configured feature flag data source LaunchDarkly
23/05/17 12:46:11 INFO DriverConf: Configured feature flag data source LaunchDarkly
23/05/17 12:46:11 WARN DriverConf: REGION environment variable is not defined. getConfForCurrentRegion will always return default value
23/05/17 12:46:11 WARN QueryProfileListener: Exception while logging query profile
java.lang.NullPointerException
at com.databricks.logging.proto.Annotations.__computeSerializedValue(Annotations.scala:17)
at com.databricks.logging.proto.Annotations.serializedSize(Annotations.scala:26)
at com.databricks.logging.proto.SparkPlan.__computeSerializedValue(SparkPlan.scala:101)
at com.databricks.logging.proto.SparkPlan.serializedSize(SparkPlan.scala:180)
at com.databricks.logging.proto.QueryProfileLog.$anonfun$__computeSerializedValue$3(QueryProfileLog.scala:137)
at com.databricks.logging.proto.QueryProfileLog.$anonfun$__computeSerializedValue$3$adapted(QueryProfileLog.scala:135)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at com.databricks.logging.proto.QueryProfileLog.__computeSerializedValue(QueryProfileLog.scala:135)
at com.databricks.logging.proto.QueryProfileLog.serializedSize(QueryProfileLog.scala:148)
at grpc_shaded.scalapb.GeneratedMessage.toByteArray(GeneratedMessageCompanion.scala:159)
at grpc_shaded.scalapb.GeneratedMessage.toByteArray$(GeneratedMessageCompanion.scala:158)
at com.databricks.logging.proto.QueryProfileLog.toByteArray(QueryProfileLog.scala:47)
at com.databricks.common.logentry.LogEntryFactory$.$anonfun$createLogEntry$1(LogEntryFactory.scala:22)
at scala.Option.flatMap(Option.scala:271)
at com.databricks.common.logentry.LogEntryFactory$.createLogEntry(LogEntryFactory.scala:21)
at com.databricks.logging.structured.ProtoLogger.recordProto(ProtoLogger.scala:35)
at com.databricks.logging.qpl.util.QueryProfileLogging.recordQueryProfile(QueryProfileLogging.scala:21)
at com.databricks.logging.qpl.util.QueryProfileLogging.recordQueryProfile$(QueryProfileLogging.scala:17)
at org.apache.spark.sql.QueryProfileListener.recordQueryProfile(QueryProfileListener.scala:32)
at org.apache.spark.sql.QueryProfileListener.onQueryProfileParamsReady(QueryProfileListener.scala:54)
at org.apache.spark.sql.QueryProfileListener.onOtherEvent(QueryProfileListener.scala:40)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:102)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:42)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:118)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:102)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:114)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:114)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:109)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:105)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1655)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:105)
I wonder if the NPE is the reason why I do not have any information 🤔

stale-architect-93411
05/26/2023, 7:48 AM
The NullPointerException is not there anymore, but I still don't have any lineage 😕
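One way to check whether any lineage edges actually reached the backend, independent of the UI, is to query GMS's `/relationships` endpoint directly for the input dataset. A minimal sketch, assuming a local GMS and a hypothetical s3 dataset URN for the input in the snippet above (adapt both to your setup):

```python
from urllib.parse import urlencode

# Assumed GMS address -- should match spark.datahub.rest.server.
GMS = "http://localhost:8080"

def relationships_url(gms, urn, direction="DOWNSTREAM", relationship_type="DownstreamOf"):
    """Build a GMS /relationships query URL for the given entity URN."""
    params = urlencode({"urn": urn, "types": relationship_type, "direction": direction})
    return f"{gms}/relationships?{params}"

# Hypothetical URN for the delta/s3 input read in the job above.
url = relationships_url(
    GMS,
    "urn:li:dataset:(urn:li:dataPlatform:s3,some-input-bucket/some-input-data,PROD)",
)
# Fetch with e.g. requests.get(url, headers={"Authorization": "Bearer <token>"}).json()
# and inspect the "relationships" list; an empty list means no lineage edge was emitted.
```

If this returns nothing while the McpEmitter logs show successful dataJob writes (as in the log above), the listener is reaching GMS but is not resolving the job's inputs/outputs, which narrows the problem to the listener's plan extraction rather than connectivity.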