# ingestion
Hi, I have some questions:
1. Does the user have to manually specify upstream and downstream datasets and then ingest them into DataHub, or does DataHub have a mechanism to identify which is upstream and which is downstream? If I select all records from table A in order to insert them into table B, can DataHub realize that table A is the upstream of table B? Or do I have to write something like a script that parses the query, identifies all the upstream/downstream relationships, and then ingests them into DataHub?
2. How can I use Spark or Airflow to push metadata into DataHub? Can you give me an example, please?
3. What is the mechanism of operation of the Metadata Audit Event (MAE)?
4. How can I set the Status of an entity to removed=true in order to remove an undesirable entity?
1. Right now it's primarily manual, but you could definitely use something like Python sqlparse or sqllineage to collect that information automatically (see the sqllineage sketch after this list).
2. Yep, both are possible. You can collect lineage information from Airflow (https://datahubproject.io/docs/metadata-ingestion/#lineage-with-airflow), or even push other metadata on an ad-hoc basis within a DAG using emitters (https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library) - see the emitter sketch below. @acceptable-architect-70237 has had some success using Spline to get lineage information from Spark to DataHub: https://firststr.com/2021/04/26/spark-compute-lineage-to-datahub/
3. We've got decent docs about MCEs and MAEs here: https://datahubproject.io/docs/what/mxe#metadata-audit-event-mae - let me know if you've got more specific questions.
4. Right now the best way to handle deletes is to emit the Status aspect with removed=True (see the last sketch below). DataHub currently only supports "soft deletes", but we're also looking into mechanisms for hard deletes as well - happy to chat more about this if it's something you're interested in.
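
For question 1, here's a minimal sketch of pulling upstream and downstream tables out of a query with the sqllineage package; the query string and table names are just illustrative assumptions, this isn't something DataHub does for you out of the box:

```python
# Minimal sketch: extract source (upstream) and target (downstream) tables
# from a SQL statement with sqllineage (pip install sqllineage).
from sqllineage.runner import LineageRunner

query = "INSERT INTO db.table_b SELECT * FROM db.table_a"
result = LineageRunner(query)

print(result.source_tables())  # upstream tables, e.g. db.table_a
print(result.target_tables())  # downstream tables, e.g. db.table_b
```

You could run something like this over your query logs and then emit the resulting lineage with the emitter shown next.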
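For question 2, a hedged sketch of pushing lineage from a standalone script or an Airflow task using the acryl-datahub REST emitter; the GMS address (http://localhost:8080), the platform name, and the table names are assumptions you'd swap for your own:

```python
# Minimal sketch: emit a lineage edge (table_a -> table_b) to DataHub
# via the REST emitter (pip install acryl-datahub).
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Build a lineage MCE saying db.table_a is upstream of db.table_b.
lineage_mce = builder.make_lineage_mce(
    [builder.make_dataset_urn("hive", "db.table_a")],  # upstream URNs
    builder.make_dataset_urn("hive", "db.table_b"),    # downstream URN
)

# Point the emitter at your DataHub GMS endpoint and push the event.
emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mce(lineage_mce)
```

The same call can sit inside an Airflow PythonOperator if you want to emit ad-hoc metadata from a DAG, in addition to the built-in Airflow lineage backend linked above.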
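For question 4, a sketch of the soft delete: emit the Status aspect with removed=True as a MetadataChangeProposal. The dataset URN and GMS address are placeholders, and the exact wrapper signature may differ slightly across library versions:

```python
# Minimal sketch: soft-delete an entity by setting Status.removed = True.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

# Emitting Status(removed=True) hides the entity in the UI without
# deleting its stored metadata (a "soft delete").
status_mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=make_dataset_urn("hive", "db.unwanted_table"),
    aspectName="status",
    aspect=StatusClass(removed=True),
)

emitter = DatahubRestEmitter("http://localhost:8080")
emitter.emit_mcp(status_mcp)
```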