high-hospital-85984 (04/14/2022, 3:02 PM)
mammoth-bear-12532
big-carpet-38439 (04/18/2022, 4:04 PM)
high-hospital-85984 (05/03/2022, 6:03 PM)
mammoth-bear-12532
billions-lawyer-59647 (05/04/2022, 1:42 AM)
billions-lawyer-59647 (05/04/2022, 1:45 AM)

high-hospital-85984 (05/04/2022, 8:02 AM)
LineageBackend
that emits metadata events to Datahub. Airflow's lineage functionality takes the "output" object of an upstream task and makes it the input of any downstream task; the `inlet`/`outlet` objects are not parsed from the task's arguments but have to be defined manually (or passed through task relationships). This works, but it requires manual interaction and is a bit simplistic when it comes to more dynamic inputs (say, a complicated SQL query passed to a generic Snowflake operator). For this reason, at Wolt we've built functionality around some key helper functions (`upload_to_s3`, `run_snowflake_query`) that parse the functions' inputs, deduce the S3 buckets or Snowflake tables actually involved in the process, and append them to the task's `inlets`/`outlets` lists. Making our users do this manually is out of the question 😅 We've also taught the lineage backend to distinguish between Snowflake tables, S3 buckets, and, say, external APIs, even though they are all Datasets in Datahub's eyes.
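The helper-function approach could be sketched roughly like this (the names and the regex-based table extraction are simplified stand-ins, not the actual Wolt implementation):

```python
import re

def tables_in_query(sql: str) -> set[str]:
    # Toy table extraction: grab identifiers following FROM/JOIN.
    # A real implementation would use a proper SQL parser.
    return set(re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE))

def run_snowflake_query(sql: str, task) -> None:
    # Hypothetical helper: deduce the Snowflake tables the query touches
    # and append them to the surrounding task's inlets before running it.
    task.inlets.extend(sorted(tables_in_query(sql) - set(task.inlets)))
    ...  # actually execute the query against Snowflake here
```

A real version would also use Airflow's dataset/lineage entity types rather than plain strings.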
So we'd like to replicate some of this functionality in Flyte. The bare minimum would be the ability to define a task's inputs/outputs (and their types) manually, including external sources (Snowflake, S3, etc.). The next step would be the ability to build custom plugins/transformers that can parse a task's arguments and create the metadata entities. If we could dynamically add metadata to the task from inside the task, at runtime, that would be even better.
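As a sketch of that "bare minimum", manually declared task metadata might look like the following (all names here are hypothetical, not an existing Flyte or Datahub API):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class DatasetRef:
    # Hypothetical metadata entity for an external source.
    platform: str  # "snowflake", "s3", "api", ...
    name: str      # e.g. a table FQN or a bucket/prefix URI

@dataclass
class TaskMetadata:
    # What a user would declare manually on a task, at minimum:
    inputs: list[DatasetRef] = field(default_factory=list)
    outputs: list[DatasetRef] = field(default_factory=list)

meta = TaskMetadata(
    inputs=[DatasetRef("snowflake", "analytics.orders")],
    outputs=[DatasetRef("s3", "s3://some-bucket/exports/orders")],
)
```

The `platform` field is what would let a lineage backend distinguish Snowflake tables from S3 buckets and external APIs, as described above.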
high-hospital-85984 (05/04/2022, 8:04 AM)
billions-lawyer-59647 (05/05/2022, 6:05 PM)
billions-lawyer-59647 (05/05/2022, 6:05 PM)

billions-lawyer-59647 (05/05/2022, 6:09 PM)
`inputs`
and `outputs`. That being said, you may still want to parse the query to figure out table joins, etc.
3. So Flyte currently has an event egress. This is still early work. It should send the following as events:
   a. when a workflow starts, stops, or fails
   b. when a task starts, stops, or fails
   c. the actual task itself, for example the Snowflake query, since the version is tied to the event
   d. the inputs and outputs of every task & workflow execution (these are always included)
   e. Should this not be enough to construct arbitrary lineage?
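A minimal sketch of point e, assuming a simplified event shape rather than Flyte's actual egress schema: lineage edges can be derived by matching one execution's outputs to another's inputs.

```python
from dataclasses import dataclass

@dataclass
class ExecutionEvent:
    # Simplified stand-in for an egressed task-execution event (points a-d):
    # a phase transition plus the execution's inputs and outputs.
    task_id: str
    phase: str              # "RUNNING" | "SUCCEEDED" | "FAILED"
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]

def lineage_edges(events):
    # Record which task produced each dataset, then emit
    # (producer_task, dataset, consumer_task) edges.
    produced_by = {}
    for e in events:
        if e.phase == "SUCCEEDED":
            for out in e.outputs:
                produced_by[out] = e.task_id
    return [
        (produced_by[inp], inp, e.task_id)
        for e in events
        for inp in e.inputs
        if inp in produced_by and produced_by[inp] != e.task_id
    ]
```

The edges could then be mapped onto Datahub dataset/lineage entities by whatever consumes the egress stream.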
high-hospital-85984 (05/06/2022, 5:48 AM)
billions-lawyer-59647 (05/09/2022, 1:54 PM)
billions-lawyer-59647 (05/09/2022, 1:54 PM)
billions-lawyer-59647 (05/10/2022, 12:04 AM)
modern-pilot-97597 (05/10/2022, 12:47 AM)
high-hospital-85984 (05/10/2022, 6:32 AM)
mammoth-bear-12532
billions-apartment-18513 (05/11/2022, 10:20 PM)
mammoth-bear-12532
high-hospital-85984 (05/12/2022, 5:15 AM)
billions-apartment-18513 (05/12/2022, 6:24 AM)
high-hospital-85984 (05/12/2022, 7:05 AM)
billions-apartment-18513 (05/12/2022, 5:46 PM)
billions-apartment-18513 (05/12/2022, 5:59 PM)
billions-apartment-18513 (05/12/2022, 8:34 PM)
billions-apartment-18513 (05/17/2022, 6:39 PM)
dazzling-judge-80093 (05/17/2022, 6:46 PM)
high-hospital-85984 (05/17/2022, 7:30 PM)
billions-apartment-18513 (05/17/2022, 9:17 PM)
high-hospital-85984 (05/18/2022, 3:34 AM)
dazzling-judge-80093 (05/18/2022, 5:10 AM)
billions-apartment-18513 (05/18/2022, 3:15 PM)
dazzling-judge-80093 (05/25/2022, 11:08 AM)
billions-apartment-18513 (05/25/2022, 10:52 PM)
billions-apartment-18513 (05/25/2022, 10:58 PM)