# ingestion
h
Hi friends! Sorry, I'm a real newbie to data catalogs and especially DataHub, but I really need your help: we have a project that downloads and collects files from FTP/SFTP servers, moves them to GCS after some transformations, and finally sends them to HDFS. The whole process is also recorded in a Postgres database. I managed to ingest metadata from Postgres into our DataHub instance on Kubernetes, but I don't think it's the right architecture. Here is the end result I need: to see in Lineage how data passed from the FTP/SFTP servers to GCS and later to HDFS. The problem is that I still don't understand how exactly lineage is created: whether it happens after ingestion, during it, or automatically. I have seen lineage code examples, but I still can't quite understand how/where to implement them in our project.
o
Generally, lineage for entities in DataHub is created by hooking into processing pipelines like Airflow: https://datahubproject.io/docs/lineage/airflow/ or Spark: https://datahubproject.io/docs/metadata-integration/java/spark-lineage If you're running a custom pipeline outside of one of our lineage integrations, you would need to emit MetadataChangeProposals containing the lineage details from the relevant points in your pipeline in order to see it in DataHub. If you use a tool other than the ones we support, consider putting up a feature request here: https://feature-requests.datahubproject.io/
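For example, a minimal Python sketch of such an emission might look like this (assuming the acryl-datahub client; the platform names, dataset paths, and GMS URL below are placeholders, not your project's actual values):

```python
# Minimal sketch: emit an UpstreamLineage aspect as a MetadataChangeProposal
# via DataHub's Python REST emitter. All names/URLs here are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# URNs for the upstream (SFTP-sourced file) and downstream (GCS object) datasets.
sftp_urn = make_dataset_urn(platform="file", name="sftp/incoming/orders.csv", env="PROD")
gcs_urn = make_dataset_urn(platform="gcs", name="my-bucket/raw/orders", env="PROD")

# The UpstreamLineage aspect attached to the downstream dataset declares
# which datasets it was derived from.
lineage_aspect = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=sftp_urn, type=DatasetLineageTypeClass.TRANSFORMED)]
)

mcp = MetadataChangeProposalWrapper(entityUrn=gcs_urn, aspect=lineage_aspect)

# Point the emitter at your DataHub GMS endpoint and send the proposal.
emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")
emitter.emit(mcp)
```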
h
Cool, thanks for the response. If I get you right, it means we absolutely need either Airflow or Spark for lineage, right? Because I was confused by the lineage examples that didn't use those tools.
l
You can also use the Python and Java emitters to emit lineage events from your custom services (if you are not using Airflow or Spark).
The examples show how you can programmatically do that
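As a hedged sketch of what wiring up the whole FTP/SFTP -> GCS -> HDFS chain could look like with the make_lineage_mce convenience helper (the platform IDs, dataset names, and GMS URL are assumptions for illustration):

```python
# Sketch: one lineage edge per hop of the pipeline, emitted with the
# make_lineage_mce helper. Names and URLs are placeholders.
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")  # placeholder URL

sftp_urn = builder.make_dataset_urn("file", "sftp/incoming/orders.csv")
gcs_urn = builder.make_dataset_urn("gcs", "my-bucket/raw/orders")
hdfs_urn = builder.make_dataset_urn("hdfs", "/data/warehouse/orders")

# GCS dataset depends on the SFTP file; HDFS dataset depends on the GCS dataset.
emitter.emit_mce(builder.make_lineage_mce([sftp_urn], gcs_urn))
emitter.emit_mce(builder.make_lineage_mce([gcs_urn], hdfs_urn))
```

You would call something like this from your own pipeline code at the point where each transfer/transformation completes, which is what the documented examples demonstrate.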
h
Thanks for the help