Hi folks,
As we're integrating some of our metadata sources, we had some questions that we thought we'd run by you. The use cases that teams within our company (Stripe) wanted to tackle first are centered around dataset landing times / timeliness, so we decided to begin by publishing events from some Airflow completion hooks we already have in place.

Essentially our plan is to infer the relevant datasets (S3, Trino, Iceberg) from our custom Airflow task extension code and write out metadata change proposal events describing the Airflow task / DAG (DataJob, DataFlow) as well as the input and output dataset entities. The part we're trying to future-proof is making sure that when we add additional connectors later (e.g. the Iceberg connector, to pull schema- and stats-related metadata for an Iceberg table / dataset), the metadata they emit matches up with the dataset entities we constructed and emitted from our Airflow hooks.
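To make that concrete, here's a rough sketch of the kind of thing our hook would emit via the DataHub Python emitter (acryl-datahub). The GMS URL, DAG/task IDs, platform names, and dataset names are all placeholders for illustration, not our actual values:

```python
# Minimal sketch of an Airflow completion hook emitting MCPs to DataHub.
# Endpoint, DAG/task names, and dataset names below are placeholders.
from datahub.emitter.mce_builder import (
    make_data_flow_urn,
    make_data_job_urn,
    make_dataset_urn,
)
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DataFlowInfoClass, DataJobInputOutputClass

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")  # placeholder endpoint

# URNs for the Airflow DAG (DataFlow) and task (DataJob).
flow_urn = make_data_flow_urn(orchestrator="airflow", flow_id="my_dag", cluster="prod")
job_urn = make_data_job_urn(
    orchestrator="airflow", flow_id="my_dag", job_id="my_task", cluster="prod"
)

# Dataset URNs inferred from our task extension code. These need to line up with
# what the Iceberg / Trino / S3 connectors will later produce for the same datasets.
input_urn = make_dataset_urn(platform="s3", name="my-bucket/path/to/input", env="PROD")
output_urn = make_dataset_urn(platform="iceberg", name="warehouse.db.output_table", env="PROD")

mcps = [
    # Describe the DAG itself.
    MetadataChangeProposalWrapper(
        entityUrn=flow_urn,
        aspect=DataFlowInfoClass(name="my_dag"),
    ),
    # Attach input / output dataset lineage to the task.
    MetadataChangeProposalWrapper(
        entityUrn=job_urn,
        aspect=DataJobInputOutputClass(
            inputDatasets=[input_urn],
            outputDatasets=[output_urn],
        ),
    ),
]
for mcp in mcps:
    emitter.emit(mcp)
```

The crux of our question is the dataset URN construction: the platform / name / env we pick in the hooks has to match whatever the connectors produce later, otherwise we'd end up with duplicate dataset entities instead of the connectors enriching the ones we created.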
Some details in 🧵