# ingestion
s
Apologies for what might be a philosophical question. Let's say we use Airflow to perform ML operations. A number of those operations are a good fit for DataHub - for example, any data engineering that creates tabular data as inputs to model training, or any tabular data output as part of model inference. DataHub gives me a great way of visualizing the pipeline and data. There are aspects of the pipeline that DataHub does not let me visualize, however:
1. Let's say I download zipped data from an FTP site using Airflow. We do not seem to have an emitter for FTP sites with raw files, or for raw files sitting in S3.
2. Once my data is engineered, I might use it to train an ML model. It would be good to be able to visualize the output ML model in my dependencies as well.
Thoughts?
l
Hello Grant! Let me start with 2) We definitely will add integrations to the ML ecosystem (feature stores, training runs, models, inference, etc.). We're adding support for Feast and are planning to add support for PyTorch, TensorFlow, etc. What ML stack do you use? We will also add outgoing edges from Airflow into these tools when they are triggered from Airflow.
For 1) For files on S3, we want to model them as datasets. We're adding support for Glue-based modeling of datasets and will add support for our own S3 crawler in the future. For files sourced from external sources like FTP, it will still be useful to model them as datasets. We can add support for Airflow operators to capture the lineage edge of data movement. @gray-shoe-75895 do you think that makes sense?
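To make the idea concrete: a raw file landed in S3 can already be described as a DataHub-style dataset URN, with a lineage edge pointing at the table engineered from it. A minimal sketch in plain Python - the URN layout is the documented DataHub convention, but `make_dataset_urn` here is a local helper and the bucket/table names are made up, not real SDK calls or real paths:

```python
# Sketch: model a raw S3 file and its downstream table as DataHub-style
# dataset URNs, plus the lineage edge between them. The URN format
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>) is the
# documented DataHub convention; everything else is illustrative.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN string (local helper, not the SDK)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

# The raw zip pulled from FTP and dropped into S3 (hypothetical path)...
raw_file = make_dataset_urn("s3", "my-bucket/landing/ftp_dump.zip")

# ...and the engineered table produced from it (hypothetical name).
table = make_dataset_urn("snowflake", "analytics.features.training_set")

# The lineage edge an Airflow operator could capture: downstream -> upstreams.
lineage = {table: [raw_file]}
```

An Airflow operator doing the FTP-to-S3 copy would emit exactly this kind of edge as metadata alongside the data movement itself.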
g
Yep pretty much right. We have a doc on ingesting from S3 via glue in progress https://github.com/linkedin/datahub/pull/2672, and Feast has already been merged https://github.com/linkedin/datahub/pull/2605. “dataset” is a pretty broad type right now, and we’d like to capture lineage in its most general form as well.
s
@loud-island-88694 @gray-shoe-75895 Hi, from an ML stack perspective, AWS SageMaker is of primary interest. Glue support is cool.
l
Good to know. SageMaker was also on our list.
s
@loud-island-88694 @gray-shoe-75895 I understand the dataset limitation - there are occasional edge cases where data is manipulated before it can really be called a dataset, for example a binary file that needs to be parsed into structured data, or an encrypted file that needs to be decrypted first.
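The edge case above can be illustrated with a stdlib-only sketch: a zipped blob (standing in for the raw FTP download) is only a "staging" artifact until it is unpacked into parseable tabular bytes. The file names and contents here are invented for illustration:

```python
import io
import zipfile

def unpack_staging_file(raw: bytes, member: str) -> bytes:
    """Turn a raw zipped blob (the 'staging' artifact) into parseable bytes."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.read(member)

# Build a fake raw download in memory to stand in for the FTP file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id,value\n1,42\n")
raw_download = buf.getvalue()

# Before this call, raw_download is just an opaque binary file;
# only afterwards does it become something a schema can describe.
csv_bytes = unpack_staging_file(raw_download, "data.csv")
```

Whether the pre-unpack blob is modeled as a separate "staging dataset", an aspect, or a plain "file" is exactly the modeling question raised below.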
l
Interesting. @big-carpet-38439 may be worth modeling these "staging datasets" separately or via an aspect.
b
Or just as "files"
b
It would be interesting to have data provenance metadata built up at the first ingestion. Ideally the metadata would comply with PROV-O (the PROV Ontology, https://www.w3.org/TR/prov-o/), and from there you could trace back the lineage.
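A minimal sketch of what PROV-O-style provenance at first ingestion could look like, as plain (subject, predicate, object) triples. `prov:wasDerivedFrom` and `prov:wasGeneratedBy` are real PROV-O terms and the namespace IRI is the W3C one; the entity and activity identifiers are invented for illustration:

```python
# PROV-O namespace (real); entity/activity names below are hypothetical.
PROV = "http://www.w3.org/ns/prov#"

raw = "ftp://vendor.example/export.zip"              # source entity
staged = "s3://my-bucket/landing/export.zip"         # landed entity
ingest = "airflow:dag/ingest_vendor/run/2021-05-01"  # ingestion activity

# Provenance recorded at first ingestion, as simple triples.
triples = [
    (staged, PROV + "wasDerivedFrom", raw),
    (staged, PROV + "wasGeneratedBy", ingest),
]

def upstreams(entity, graph):
    """Trace back the lineage: everything `entity` was derived from."""
    return [o for s, p, o in graph if s == entity and p == PROV + "wasDerivedFrom"]
```

Walking `wasDerivedFrom` edges transitively from any dataset would give exactly the trace-back described above.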