# ingestion
b
Hi there! My use case is to capture data lineage from Spark jobs that run via the KubernetesPodOperator in Airflow. Is this integration with Airflow supported in DataHub? I am a newbie to DataHub. Any help is appreciated! Thanks
l
@gray-shoe-75895 probably has more thoughts here.
Here is what comes to mind for me: extend something like the Spline Spark Agent to emit lineage events using the DataHub emitter library.
✅ 1
Please let us know if you have questions about how to model the lineage events
Having said that, we are also looking into providing native lineage support for Spark. Happy to take contributions here!
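To make "model the lineage events" concrete, here is a minimal, dependency-free sketch of the shape such an event takes: a downstream dataset URN plus a list of upstream URNs. The URN layout and the upstreamLineage aspect name follow DataHub's conventions, but the helper functions themselves are illustrative stand-ins, not part of any DataHub library.

```python
import json

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # DataHub dataset URNs follow this layout:
    # urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def make_lineage_event(upstreams, downstream):
    # An upstreamLineage aspect: the downstream dataset lists the
    # datasets it was derived from.
    return {
        "entityUrn": downstream,
        "aspectName": "upstreamLineage",
        "aspect": {
            "upstreams": [
                {"dataset": u, "type": "TRANSFORMED"} for u in upstreams
            ]
        },
    }

# e.g. a Spark job reading a Postgres table and writing to S3
event = make_lineage_event(
    upstreams=[make_dataset_urn("postgres", "mydb.public.orders")],
    downstream=make_dataset_urn("s3", "my-bucket/curated/orders"),
)
print(json.dumps(event, indent=2))
```

An emitter (REST or Kafka) would then push this aspect to DataHub against the downstream dataset's URN.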
g
Yes, we've heard of others who have successfully emitted lineage information from Spark, so it's definitely possible - the Spline agent seems like the way to go.
b
Thanks for the response! I am trying to load data from PostgreSQL to S3 after applying some transformations in my Spark job (which runs in k8s via the Airflow KubernetesPodOperator). The dataset lineage, along with the job information and the transformation details, needs to be pushed to DataHub. Can you please suggest how to model the lineage events?
l
@gray-shoe-75895 ^ can you please provide pointers to existing usages of Lineage event and how to use the emitter API?
@brave-appointment-76997 have you already installed DataHub? What are the main use-cases you are trying to tackle?
a
I have used the Spline agent to extract lineage information from Spark into DataHub. For @brave-appointment-76997, I think there are two problems. One is Airflow's operator lineage information; this has been discussed and proposed as a new entity called Jobs. The second is Spark lineage, which is more about the relationships among datasets and their columns.
👀 1
g
Hey @brave-appointment-76997 - sorry I missed this message. I have a PR up right now that should simplify this process - emitting lineage from Python will look like this (once the PR is merged): https://github.com/linkedin/datahub/blob/5b10691e97b21b776718a416b66219766ab9a7bd/metadata-ingestion/examples/library/lineage_emitter.py
b
@loud-island-88694 Yes, I have installed DataHub. We are exploring the features of DataHub, and the current use case in plan is getting the lineage of datasets that undergo certain transformations and are stored back to S3 via a Spark job. The data sources currently dealt with are Kafka, Postgres, etc., and the destination is S3. The Spark jobs are executed in k8s using the KubernetesPodOperator.
@gray-shoe-75895 Thank you for the lineage emitter example. Could you please let me know which sources currently have native lineage support?
g
Currently, the majority of the sources only support datasets/descriptions/schemas/etc. We have Airflow operators available that make it easier to declare and emit lineage. For Spark we don't have anything prebuilt, but it shouldn't be too hard to emit using the Spline agent.
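For the job side of the lineage (the Airflow task that actually ran the Spark transformation), DataHub models a run as a dataJob inside a dataFlow. The operator-based helpers mentioned above are not shown here; this is just a dependency-free sketch of the declare-inlets/outlets-then-emit pattern, with the URN layouts and aspect name assumed from DataHub's conventions rather than taken from any specific operator.

```python
def make_data_flow_urn(orchestrator: str, flow_id: str, env: str = "prod") -> str:
    # A dataFlow identifies the pipeline, e.g. an Airflow DAG.
    return f"urn:li:dataFlow:({orchestrator},{flow_id},{env})"

def make_data_job_urn(flow_urn: str, job_id: str) -> str:
    # A dataJob identifies one task within the flow, e.g. the
    # KubernetesPodOperator task that launches the Spark job.
    return f"urn:li:dataJob:({flow_urn},{job_id})"

def job_lineage(flow_urn: str, job_id: str, inlets, outlets) -> dict:
    # Declare what the task reads (inlets) and writes (outlets);
    # an emitter would push this as the task's input/output aspect.
    return {
        "entityUrn": make_data_job_urn(flow_urn, job_id),
        "aspectName": "dataJobInputOutput",
        "aspect": {"inputDatasets": list(inlets), "outputDatasets": list(outlets)},
    }

flow = make_data_flow_urn("airflow", "postgres_to_s3")
event = job_lineage(
    flow,
    "spark_transform",
    inlets=["urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.orders,PROD)"],
    outlets=["urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/curated/orders,PROD)"],
)
```

This ties the dataset-to-dataset lineage back to the Airflow task that produced it, which is the "Jobs" entity idea mentioned earlier in the thread.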
b
g
Yep
And support for native Airflow lineage is coming very soon - I'm currently working on it: https://airflow.apache.org/docs/apache-airflow/stable/lineage.html
b
Okay, thanks for the response. Is there a Kafka emitter example? (I really could not find one on GitHub.)
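Even without an official example, the pattern is small: serialize the metadata change event and produce it to DataHub's MCE topic. A minimal sketch, with some assumptions: the topic name follows DataHub's default (MetadataChangeEvent_v4), JSON stands in for the Avro serialization the real emitter uses, and the producer is injected so any client with a confluent_kafka-style produce(topic, value) call works.

```python
import json

MCE_TOPIC = "MetadataChangeEvent_v4"  # DataHub's default metadata change event topic

def serialize_mce(mce: dict) -> bytes:
    # DataHub's real Kafka ingestion expects Avro with a schema registry;
    # JSON is used here only to keep the sketch dependency-free.
    return json.dumps(mce).encode("utf-8")

def emit_to_kafka(produce, mce: dict) -> None:
    # `produce` is any callable taking (topic, value), matching the shape
    # of confluent_kafka.Producer.produce; injected so this is testable
    # without a broker.
    produce(MCE_TOPIC, serialize_mce(mce))

# usage with a stand-in producer that just records what would be sent:
sent = []
emit_to_kafka(lambda topic, value: sent.append((topic, value)), {"proposedSnapshot": {}})
```

With a real broker, the stand-in lambda would be replaced by the produce method of an actual Kafka producer client.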
b
Thank you for this example