# ingestion
s
Apologies for what might be a philosophical question. Let's say we use Airflow to perform ML operations. A number of those operations are a good fit for DataHub - for example, any data engineering that creates tabular data as inputs to model training, or any tabular data output as part of model inference. DataHub gives me a great way of visualizing the pipeline and data. There are aspects of the pipeline that DataHub does not let me visualize, however:
1. Let's say I download zipped data from an FTP site using Airflow. We do not seem to have an emitter for FTP sites with raw files, or for raw files sitting in S3.
2. Once my data is engineered, I might use it to train an ML model. It would be good to be able to visualize the output ML model in my dependencies as well.
Thoughts?
l
Hello Grant! Let me start with 2) We definitely will add integrations to the ML ecosystem (feature stores, training runs, models, inference, etc.). We're adding support for Feast and are planning to add support for PyTorch, TensorFlow, etc. What ML stack do you use? We will also add outgoing edges from Airflow into these tools when they are triggered from Airflow.
For 1) For files on S3, we want to model them as datasets. We're adding support for Glue-based modeling of datasets and will add support for our own S3 crawler in the future. For files sourced from external sources like FTP, it will still be useful to model them as datasets. We can add support for Airflow operators to capture the lineage edge of data movement. @gray-shoe-75895 do you think that makes sense?
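To make the idea concrete: a raw file landed in S3 can already be described as a DataHub-style dataset URN, with a lineage edge pointing at the table engineered from it. A minimal sketch in plain Python - the URN layout is the documented DataHub convention, but `make_dataset_urn` here is a local helper and the bucket/table names are made up, not real SDK calls or real paths:

```python
# Sketch: model a raw S3 file and its downstream table as DataHub-style
# dataset URNs, plus the lineage edge between them. The URN format
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>) is the
# documented DataHub convention; everything else is illustrative.

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN string (local helper, not the SDK)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

# The raw zip pulled from FTP and dropped into S3 (hypothetical path)...
raw_file = make_dataset_urn("s3", "my-bucket/landing/ftp_dump.zip")

# ...and the engineered table produced from it (hypothetical name).
table = make_dataset_urn("snowflake", "analytics.features.training_set")

# The lineage edge an Airflow operator could capture: downstream -> upstreams.
lineage = {table: [raw_file]}
```

An Airflow operator doing the FTP-to-S3 copy would emit exactly this kind of edge as metadata alongside the data movement itself.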
g
Yep pretty much right. We have a doc on ingesting from S3 via glue in progress https://github.com/linkedin/datahub/pull/2672, and Feast has already been merged https://github.com/linkedin/datahub/pull/2605. “dataset” is a pretty broad type right now, and we’d like to capture lineage in its most general form as well.
s
@loud-island-88694 @gray-shoe-75895 Hi, from an ML stack perspective, AWS SageMaker is of primary interest. Glue support is cool.
l
Good to know. SageMaker was also on our list.
s
@loud-island-88694 @gray-shoe-75895 I understand the dataset limitation - there are occasional edge cases where data is manipulated before it can really be called a dataset, for example a binary file that needs to be parsed into structured data, or an encrypted file that needs to be decrypted first.
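The edge case above can be illustrated with a stdlib-only sketch: a zipped blob (standing in for the raw FTP download) is only a "staging" artifact until it is unpacked into parseable tabular bytes. The file names and contents here are invented for illustration:

```python
import io
import zipfile

def unpack_staging_file(raw: bytes, member: str) -> bytes:
    """Turn a raw zipped blob (the 'staging' artifact) into parseable bytes."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.read(member)

# Build a fake raw download in memory to stand in for the FTP file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.csv", "id,value\n1,42\n")
raw_download = buf.getvalue()

# Before this call, raw_download is just an opaque binary file;
# only afterwards does it become something a schema can describe.
csv_bytes = unpack_staging_file(raw_download, "data.csv")
```

Whether the pre-unpack blob is modeled as a separate "staging dataset", an aspect, or a plain "file" is exactly the modeling question raised below.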
l
Interesting. @big-carpet-38439 may be worth modeling these "staging datasets" separately or via an aspect.
b
Or just as "files"
b
It would be interesting to have data provenance metadata built up at the first ingestion. Ideally the metadata would comply with PROV-O (the PROV Ontology, https://www.w3.org/TR/prov-o/), and from there you could trace back the lineage.
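A minimal sketch of what PROV-O-style provenance at first ingestion could look like, as plain (subject, predicate, object) triples. `prov:wasDerivedFrom` and `prov:wasGeneratedBy` are real PROV-O terms and the namespace IRI is the W3C one; the entity and activity identifiers are invented for illustration:

```python
# PROV-O namespace (real); entity/activity names below are hypothetical.
PROV = "http://www.w3.org/ns/prov#"

raw = "ftp://vendor.example/export.zip"              # source entity
staged = "s3://my-bucket/landing/export.zip"         # landed entity
ingest = "airflow:dag/ingest_vendor/run/2021-05-01"  # ingestion activity

# Provenance recorded at first ingestion, as simple triples.
triples = [
    (staged, PROV + "wasDerivedFrom", raw),
    (staged, PROV + "wasGeneratedBy", ingest),
]

def upstreams(entity, graph):
    """Trace back the lineage: everything `entity` was derived from."""
    return [o for s, p, o in graph if s == entity and p == PROV + "wasDerivedFrom"]
```

Walking `wasDerivedFrom` edges transitively from any dataset would give exactly the trace-back described above.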