# ingestion
g
In our org we use Spark to read from Kafka and write to Kafka/Hive/files. Can DataHub extract this lineage information from Spark Streaming jobs using DatahubSparkListener?
o
Hi! The initial release of the Spark listener is limited in terms of supported sources, but we do plan to expand it to cover more. The full capabilities of the Spark listener are documented here: https://datahubproject.io/docs/metadata-integration/java/spark-lineage/
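For readers landing here: the listener is attached through Spark configuration. A minimal sketch of the relevant settings, following the linked docs (the artifact version and GMS URL below are illustrative placeholders, not values from this thread):

```
# spark-defaults.conf style; version and server URL are examples only
spark.jars.packages        io.acryl:datahub-spark-lineage:0.8.25
spark.extraListeners       datahub.spark.DatahubSparkListener
spark.datahub.rest.server  http://localhost:8080
```

The same keys can be passed with `--conf` on `spark-submit` or via `SparkSession.builder.config(...)` in a notebook.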
g
Hi @orange-night-91387, I was not able to extract Spark lineage while running the Spark listener from a notebook, and I could not locate any logs to check the status of the Spark agent. We are using Spark 2.4 and Scala 2.11. What can I do to troubleshoot further?
o
Do you have logging set up for your Jupyter notebook? Something like: https://towardsdatascience.com/building-and-exporting-python-logs-in-jupyter-notebooks-87b6d7a86c4
g
Hmm... I went to look at the application master logs instead. The error was "General SSLEngine Problem". I think the problem is that we enabled HTTPS for the GMS service using our self-signed cert. Any idea how I should proceed from here?
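That error typically means the JVM's TLS verification rejected the self-signed certificate because it is not in the trust store. The Python sketch below illustrates the same principle with the `ssl` module: a default context requires certificate verification, and a self-signed cert only passes if you explicitly load it as a trusted CA (the cert path is hypothetical):

```python
import ssl

# A default client context verifies the server certificate chain
# and the hostname; a self-signed cert fails both unless trusted.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
print(ctx.check_hostname)                    # True

# Hypothetical path: export the GMS cert and load it as a trusted CA.
# ctx.load_verify_locations(cafile="/path/to/gms-cert.pem")
```

On the JVM side the analogous fix, once the listener supports HTTPS, would be importing the cert into the Java truststore (e.g. with `keytool -importcert`), but see the reply below about current listener support.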
l
Hi @glamorous-microphone-33484! Regarding Spark Streaming specifically: we don't support it yet; would you mind opening a feature request? https://feature-requests.datahubproject.io/
@careful-pilot-86309 & @elegant-doctor-86344 - can you take a look at the open questions around Spark lineage?
c
@glamorous-microphone-33484 The current version of DatahubSparkListener (0.8.25) doesn't support an HTTPS-enabled GMS server. Support is a work in progress and will go out in the next release.