# getting-started
m
Hello all! I've been playing with DataHub today to see if it could be a good fit for my company. I got up and running with the Docker setup and ingested our tables from Athena and Glue - very nice experience overall! Now a newbie question: I can see the Athena tables in the UI, but how do I set up Upstreams and Downstreams for each of the tables in order to produce the lineage graph? I browsed the docs and could not find anything, but I might be blind šŸ™ˆ In fact, how do I set up Upstreams and Downstreams even across heterogeneous entities? For example, in the tutorial data I can see a Kafka message linked to HDFS datasets linked to Airflow jobs. Thank you!
šŸ‘ 2
q
Just commenting that I have the same question! To add to this, I noticed there is a lineage-from-Airflow feature, which is awesome. One challenge we are having is that not all of our jobs are orchestrated with Airflow (we’re on Databricks and use its AutoLoader functionality for parts of our data pipelines). What would be the recommendation on how to provide end-to-end lineage data to DataHub?
m
Hi @millions-notebook-72121 and @quick-animal-47381: you would just have to emit ā€œlineage eventsā€ from the right place in your custom pipelines
šŸ‘ 1
e.g. in the Airflow lineage example, the DAG tasks are annotated with inlets and outlets, which Airflow faithfully sends down to the lineage backend implementation… which in this case just emits lineage events to DataHub over REST / Kafka
šŸ‘ 1
m
Oh ok, got it, thank you @mammoth-bear-12532! One question: would it be possible to see the source code behind the example data and lineage that is generated in the Quickstart Guide with the command
datahub docker ingest-sample-data
? It would be a good reference!
šŸ‘ 1
b
@millions-notebook-72121 Those come from a sample file of pre-generated JSON metadata - not directly from a source system
q
Hi @big-carpet-38439, we’re currently performing a POC with DataHub, and due to company policies we can only set it up in our AWS dev account, which has no connectivity to our production systems. We were thinking of mocking the data by preparing the JSON metadata with our own specific objects and use cases, and would love to hear your thoughts on whether this is feasible, plus any general suggestions, thanks! And thanks @mammoth-bear-12532 for the response earlier!
b
@quick-animal-47381 This is totally possible. A great file to look at is bootstrap_mce.json, which represents the sample data we load on quickstart deployments
Let me know if you have questions!
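If you do end up hand-crafting metadata for the POC, here is a rough sketch of building a single lineage MCE programmatically and dumping it to JSON rather than writing the file by hand - it assumes the datahub Python package's mce_builder helpers, the platform/dataset names are placeholders, and you should sanity-check the output shape against a file produced by the file sink:
```python
import json

import datahub.emitter.mce_builder as builder

# Placeholder URNs for your own objects - swap in real platform/table names.
upstream = builder.make_dataset_urn("athena", "my_db.raw_events", "DEV")
downstream = builder.make_dataset_urn("databricks", "silver.cleaned_events", "DEV")

# make_lineage_mce wraps an UpstreamLineage aspect in a MetadataChangeEvent,
# the same kind of record that bootstrap_mce.json is a list of.
lineage_mce = builder.make_lineage_mce([upstream], downstream)

# Write it out as a JSON array that a recipe using the "file" source can ingest.
with open("mock_lineage_mce.json", "w") as f:
    json.dump([lineage_mce.to_obj()], f, indent=2)
```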
m
Yes, thanks, I did see the file @big-carpet-38439 - but I was wondering how it was generated, as I'd imagine it wasn't written by hand, given it's a 3000-line file?
šŸ‘ 1
q
To add to the above, for our POC we are thinking of installing DataHub locally, pulling from production sources into file sinks, transferring those files over to the AWS test account, and running another recipe there that uses those files as the source and the DataHub instance in the test account as the sink.
1. Does the above sound reasonable?
2. Some parts of our tech stack do not have native integrations yet (Databricks), and we may have to mock that data by hand for now - any guidance on this, or alternatives, would be great!
3. We currently do not use Airflow centrally for our pipelines, so any examples of how to use the REST API to manually insert mocked cross-stack lineage data, or alternatives, would be very helpful!
Thanks again.
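Following up on my own point 3 - this is the kind of snippet we were imagining for pushing mocked cross-stack lineage over REST, based on the earlier pointer about emitting lineage events. A rough sketch only: it assumes the datahub Python package's DatahubRestEmitter and mce_builder helpers, and the platform and dataset names are made up for illustration:
```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Made-up URNs standing in for a Databricks AutoLoader step's input and output.
upstream = builder.make_dataset_urn("s3", "landing-bucket/raw_events", "DEV")
downstream = builder.make_dataset_urn("databricks", "bronze.raw_events", "DEV")

# Build a MetadataChangeEvent saying `downstream` is derived from `upstream`.
lineage_mce = builder.make_lineage_mce([upstream], downstream)

# Point the emitter at the GMS endpoint of the DataHub deployment and send it.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit_mce(lineage_mce)
```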